* [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
@ 2024-09-24  1:17 Kanchana P Sridhar
  2024-09-24  1:17 ` [PATCH v7 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
                   ` (9 more replies)
  0 siblings, 10 replies; 79+ messages in thread
From: Kanchana P Sridhar @ 2024-09-24  1:17 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Hi All,

This patch-series enables zswap_store() to accept and store mTHP
folios. The most significant contribution in this series is from the
earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
migrated to mm-unstable as of 9-23-2024 in patches 5 and 6 of this series.

[1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
     https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u

Additionally, there is an attempt to modularize some of the functionality
in zswap_store(), to make it more amenable to supporting any-order
mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
delete all offsets corresponding to a higher order folio stored in zswap.
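
As a simplified sketch, the resulting zswap_store() flow is roughly as
follows (limit checks, objcg/pool setup, reference counting and stats
are elided here; the complete implementation is in patches 4-6 below):

    bool zswap_store(struct folio *folio)
    {
            long nr_pages = folio_nr_pages(folio);
            pgoff_t offset = swp_offset(folio->swap);
            struct xarray *tree = swap_zswap_tree(folio->swap);
            long index;

            /* objcg and pool lookup elided; see patch 6. */

            for (index = 0; index < nr_pages; ++index) {
                    /* Compresses the page, stores its entry in the xarray. */
                    if (!zswap_store_page(folio, index, objcg, pool))
                            goto unwind;
            }
            return true;

    unwind:
            /* All-or-nothing: drop every offset stored for this folio. */
            zswap_delete_stored_offsets(tree, offset, nr_pages);
            return false;
    }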

For accounting purposes, the patch-series adds per-order mTHP sysfs
"zswpout" counters that get incremented upon successful zswap_store of
an mTHP folio:

/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
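
For example, the 64K mTHP counter can be read with "cat", or
programmatically; a minimal C sketch (assumes the 64K mTHP order is
available on the system):

    #include <stdio.h>

    int main(void)
    {
            char buf[64];
            FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/"
                            "hugepages-64kB/stats/zswpout", "r");

            if (!f)
                    return 1;
            /* The file holds a single decimal count of 64K mTHP zswpouts. */
            if (fgets(buf, sizeof(buf), f))
                    printf("64kB mTHP zswpout: %s", buf);
            fclose(f);
            return 0;
    }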

A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
will enable/disable zswap storing of (m)THP. When disabled, zswap will
fall back to rejecting the mTHP folio, which is then processed by the
backing swap device.
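
Concretely, the gating check at the top of zswap_store() (from patch 6
below) is:

    /* Storing large folios isn't enabled */
    if (!zswap_mthp_enabled && folio_test_large(folio))
            return false;

A false return from zswap_store() lets swap_writepage() proceed to write
the folio to the backing swap device.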

This patch-series is a prerequisite for ZSWAP compress batching of mTHP
swap-out and decompress batching of swap-ins based on swapin_readahead(),
using Intel IAA hardware acceleration. We would like to submit that work
in subsequent patch-series, along with performance improvement data.

Thanks to Ying Huang for pre-posting review feedback and suggestions!

Thanks also to Nhat, Yosry, Barry, Chengming, Usama and Ying for their
helpful feedback, data reviews and suggestions!

Co-development signoff request:
===============================
I would like to request Ryan Roberts' co-developer signoff on patches
5 and 6 in this series. Thanks Ryan!

Changes since v6:
=================
1) Rebased to mm-unstable as of 9-23-2024,
   commit acfabf7e197f7a5bedf4749dac1f39551417b049.
2) Refactored into smaller commits, as suggested by Yosry and
   Chengming. Thanks both!
3) Reworded the commit log for patches 5 and 6 as per Yosry's
   suggestion. Thanks Yosry!
4) Gathered data on a Sapphire Rapids server that has an 823GiB SSD swap disk
   partition. Also, all experiments are run with usemem --sleep 10, so that
   the memory allocated by the 70 processes remains in memory
   longer. Posted elapsed and sys times. Thanks to Yosry, Nhat and Ying for
   their help with refining the performance characterization methodology.
5) Updated Documentation/admin-guide/mm/transhuge.rst as suggested by
   Nhat. Thanks Nhat!

Changes since v5:
=================
1) Rebased to mm-unstable as of 8/29/2024,
   commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
   enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
   suggestion to add a knob by which users can enable/disable this
   change. Nhat, I hope this is along the lines of what you were
   thinking.
3) Added vm-scalability usemem data with 4K folios with
   CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
   there is no regression with this change.
4) Added data with usemem with 64K and 2M THP for an alternate view of
   before/after, as suggested by Yosry, so we can understand the impact
   of when mTHPs are split into 4K folios in shrink_folio_list()
   (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
   in zswap. Thanks Yosry for this suggestion.

Changes since v4:
=================
1) Published before/after data with zstd, as suggested by Nhat (Thanks
   Nhat for the data reviews!).
2) Rebased to mm-unstable from 8/27/2024,
   commit b659edec079c90012cf8d05624e312d1062b8b87.
3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
   CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
   robot; as per Nhat's and Michal's suggestion to not require a separate
   patch to fix the build errors (thanks both!).
4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
   suggested by Yosry (Thanks Yosry!).
5) Squashed the commits that define the new mTHP zswpout stat counters
   and invoke count_mthp_stat() after successful zswap_store()s, into a
   single commit. Thanks Yosry for this suggestion!

Changes since v3:
=================
1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
   Thanks to Barry for suggesting aligning with Ryan Roberts' latest
   changes to count_mthp_stat() so that it's always defined, even when THP
   is disabled. Barry, I have also made one other change in page_io.c
   where count_mthp_stat() is called by count_swpout_vm_event(). I would
   appreciate it if you can review this. Thanks!
   Hopefully this should resolve the kernel robot build errors.

Changes since v2:
=================
1) Gathered usemem data using SSD as the backing swap device for zswap,
   as suggested by Ying Huang. Ying, I would appreciate it if you can
   review the latest data. Thanks!
2) Generated the base commit info in the patches to attempt to address
   the kernel test robot build errors.
3) No code changes to the individual patches themselves.

Changes since RFC v1:
=====================

1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
   Thanks Barry!
2) Addressed some of the code review comments that Nhat Pham provided in
   Ryan's initial RFC [1]:
   - Added a comment about the cgroup zswap limit checks occurring once per
     folio at the beginning of zswap_store().
     Nhat, Ryan, please do let me know if the comments convey the summary
     from the RFC discussion. Thanks!
   - Posted data on running the cgroup suite's zswap kselftest.
3) Rebased to v6.11-rc3.
4) Gathered performance data with usemem and the rebased patch-series.


Regression Testing:
===================
I ran vm-scalability usemem with 70 processes and only 4K folios (no
mTHP), on mm-unstable and on this patch-series. The main goal was to
make sure that there is no functional or performance regression with
respect to the earlier zswap behavior for 4K folios when
CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set and zswap_store() of 4K
pages goes through the newly added code path [zswap_store(),
zswap_store_page()].

The data indicates there is no regression.

 ------------------------------------------------------------------------------
                     mm-unstable 8-28-2024                        zswap-mTHP v6
                                              CONFIG_ZSWAP_STORE_THP_DEFAULT_ON
                                                                     is not set
 ------------------------------------------------------------------------------
 ZSWAP compressor        zstd     deflate-                     zstd    deflate-
                                       iaa                                  iaa
 ------------------------------------------------------------------------------
 Throughput (KB/s)    110,775      113,010               111,550        121,937
 sys time (sec)      1,141.72       954.87              1,131.95         828.47
 memcg_high           140,500      153,737               139,772        134,129
 memcg_swap_high            0            0                     0              0
 memcg_swap_fail            0            0                     0              0
 pswpin                     0            0                     0              0
 pswpout                    0            0                     0              0
 zswpin                   675          690                   682            684
 zswpout            9,552,298   10,603,271             9,566,392      9,267,213
 thp_swpout                 0            0                     0              0
 thp_swpout_                0            0                     0              0
  fallback                                                                     
 pgmajfault             3,453        3,468                 3,841          3,487
 ZSWPOUT-64kB-mTHP        n/a          n/a                     0              0
 SWPOUT-64kB-mTHP           0            0                     0              0
 ------------------------------------------------------------------------------
                                                 

Performance Testing:
====================
Testing of this patch-series was done with mm-unstable as of 9-23-2024,
commit acfabf7e197f7a5bedf4749dac1f39551417b049. Data was gathered
without/with this patch-series, on an Intel Sapphire Rapids server,
dual-socket, 56 cores per socket, 4 IAA devices per socket, 503 GiB RAM,
and an 823G SSD disk partition used as swap. Core frequency was fixed at
2500MHz.

The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 40G. There is no swap limit set for the cgroup. Following a
similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
series [2], 70 usemem processes were run, each allocating and writing 1G of
memory, and sleeping for 10 sec before exiting:

    usemem --init-time -w -O -s 10 -n 70 1g

The vm/sysfs mTHP stats included with the performance data provide details
on the swapout activity to ZSWAP/swap.

Other kernel configuration parameters:

    ZSWAP Compressors : zstd, deflate-iaa
    ZSWAP Allocator   : zsmalloc
    SWAP page-cluster : 2

In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
IAA "compression verification" is enabled. Hence, each IAA compression
is decompressed internally by the "iaa_crypto" driver, the CRCs
returned by the hardware are compared, and errors are reported in case
of mismatches. Thus, "deflate-iaa" helps ensure better data integrity
as compared to the software compressors.

Throughput is derived by averaging the individual 70 processes' throughputs
reported by usemem. Elapsed/sys times are measured with perf. All data
points per compressor/kernel/mTHP configuration are averaged across 3 runs.

Case 1: Comparing zswap 4K vs. zswap mTHP
=========================================

In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in
64K/2M (m)THP to be split into 4K folios that get processed by zswap.

The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that results
in 64K/2M (m)THP to not be split, and processed by zswap.

 64KB mTHP (cgroup memory.high set to 40G):
 ==========================================

 -------------------------------------------------------------------------------
                    mm-unstable 9-23-2024              zswap-mTHP     Change wrt
                        CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
                                 Baseline
 -------------------------------------------------------------------------------
 ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
                                      iaa                     iaa            iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)   143,323      125,485     153,550     129,609    7%       3%
 elapsed time (sec)    24.97        25.42       23.90       25.19    4%       1%
 sys time (sec)       822.72       750.96      757.70      731.13    8%       3%
 memcg_high          132,743      169,825     148,075     192,744
 memcg_swap_fail     639,067      841,553       2,204       2,215
 pswpin                    0            0           0           0
 pswpout                   0            0           0           0
 zswpin                  795          873         760         902
 zswpout          10,011,266   13,195,137  10,010,017  13,193,554
 thp_swpout                0            0           0           0
 thp_swpout_               0            0           0           0
  fallback
 64kB-mthp_          639,065      841,553       2,204       2,215
  swpout_fallback
 pgmajfault            2,861        2,924       3,054       3,259
 ZSWPOUT-64kB            n/a          n/a     623,451     822,268
 SWPOUT-64kB               0            0           0           0
 -------------------------------------------------------------------------------


 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
 =======================================================

 -------------------------------------------------------------------------------
                    mm-unstable 9-23-2024              zswap-mTHP     Change wrt
                        CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
                                 Baseline
 -------------------------------------------------------------------------------
 ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
                                      iaa                     iaa            iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)   145,616      139,640     169,404     141,168   16%       1%
 elapsed time (sec)    25.05        23.85       23.02       23.37    8%       2%
 sys time (sec)       790.53       676.34      613.26      677.83   22%    -0.2%
 memcg_high           16,702       25,197      17,374      23,890
 memcg_swap_fail      21,485       27,814         114         144
 pswpin                    0            0           0           0
 pswpout                   0            0           0           0
 zswpin                  793          852         778         922
 zswpout          10,011,709   13,186,882  10,010,893  13,195,600
 thp_swpout                0            0           0           0
 thp_swpout_          21,485       27,814         114         144
  fallback
 2048kB-mthp_            n/a          n/a           0           0
  swpout_fallback
 pgmajfault            2,701        2,822       4,151       5,066
 ZSWPOUT-2048kB          n/a          n/a      19,442      25,615
 SWPOUT-2048kB             0            0           0           0
 -------------------------------------------------------------------------------

We mostly see improvements in throughput, elapsed and sys time for zstd and
deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y).


Case 2: Comparing SSD swap mTHP vs. zswap mTHP
==============================================

In this scenario, CONFIG_THP_SWAP is enabled in both the "before" and
"after" experiments. The "before" represents zswap rejecting mTHP, and the
mTHP being stored by the 823G SSD swap. The "after" represents data with
this patch-series, which results in 64K/2M (m)THP being processed and
stored by zswap.

 64KB mTHP (cgroup memory.high set to 40G):
 ==========================================

 -------------------------------------------------------------------------------
                    mm-unstable 9-23-2024              zswap-mTHP     Change wrt
                        CONFIG_THP_SWAP=Y       CONFIG_THP_SWAP=Y       Baseline
                                 Baseline
 -------------------------------------------------------------------------------
 ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
                                      iaa                     iaa            iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)    20,265       20,696     153,550     129,609   658%    526%
 elapsed time (sec)    72.44        70.86       23.90       25.19    67%     64%
 sys time (sec)        77.95        77.99      757.70      731.13  -872%   -837%
 memcg_high          115,811      113,277     148,075     192,744
 memcg_swap_fail       2,386        2,425       2,204       2,215
 pswpin                   16           16           0           0
 pswpout           7,774,235    7,616,069           0           0
 zswpin                  728          749         760         902
 zswpout              38,424       39,022  10,010,017  13,193,554
 thp_swpout                0            0           0           0
 thp_swpout_               0            0           0           0
  fallback	                                                 
 64kB-mthp_            2,386        2,425       2,204       2,215
  swpout_fallback                                                
 pgmajfault            2,757        2,860       3,054       3,259
 ZSWPOUT-64kB            n/a          n/a     623,451     822,268
 SWPOUT-64kB         485,890      476,004           0           0
 -------------------------------------------------------------------------------


 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
 =======================================================

 -------------------------------------------------------------------------------
                    mm-unstable 9-23-2024              zswap-mTHP     Change wrt
                        CONFIG_THP_SWAP=Y       CONFIG_THP_SWAP=Y       Baseline
                                 Baseline
 -------------------------------------------------------------------------------
 ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
                                      iaa                     iaa            iaa
 -------------------------------------------------------------------------------
 Throughput (KB/s)    24,347       35,971     169,404     141,168    596%   292%
 elapsed time (sec)    63.52        64.59       23.02       23.37     64%    64%
 sys time (sec)        27.91        27.01      613.26      677.83  -2098% -2410%
 memcg_high           13,576       13,467      17,374      23,890
 memcg_swap_fail         162          124         114         144
 pswpin                    0            0           0           0
 pswpout           7,003,307    7,168,853           0           0
 zswpin                  741          722         778         922
 zswpout              84,429       65,315  10,010,893  13,195,600
 thp_swpout           13,678       14,002           0           0
 thp_swpout_             162          124         114         144
  fallback	                                                 
 2048kB-mthp_            n/a          n/a           0           0
  swpout_fallback                                                
 pgmajfault            3,345        2,903       4,151       5,066
 ZSWPOUT-2048kB          n/a          n/a      19,442      25,615
 SWPOUT-2048kB        13,678       14,002           0           0
 -------------------------------------------------------------------------------

We see significant improvements in throughput and elapsed time for zstd and
deflate-iaa, when comparing before (mTHP-SSD) vs. after (mTHP-ZSWAP). The
sys time increases with mTHP-ZSWAP as expected, since compression consumes
CPU time whereas SSD swap-out happens through asynchronous disk writes, as
pointed out by Ying and Yosry.

In the "Before" scenario, when zswap does not store mTHP, only allocations
count towards the cgroup memory limit. However, in the "After" scenario,
with the introduction of zswap_store() mTHP, both, allocations as well as
the zswap compressed pool usage from all 70 processes are counted towards
the memory limit. As a result, we see higher swapout activity in the
"After" data. Hence, more time is spent doing reclaim as the zswap cgroup
charge leads to more frequent memory.high breaches.

Summary:
========
The v7 data presented above comparing zswap-mTHP with a conventional 823G
SSD swap demonstrates good performance improvements with zswap-mTHP. Hence,
it seems reasonable for zswap_store() to support (m)THP, so that further
performance improvements can be implemented.

Some of the ideas that have shown promise in our experiments are:

1) IAA compress/decompress batching.
2) Distributing compress jobs across all IAA devices on the socket.

In the experimental setup used in this patchset, we have enabled IAA
compress verification to ensure additional hardware data integrity CRC
checks not currently done by the software compressors. The tests run
for this patchset also use only 1 IAA device per core, which avails of
the 2 compress engines on the device. In our experiments with IAA
batching, we distribute compress jobs from all cores to the 8 compress
engines available per socket. We further compress the pages in each
mTHP in parallel in the accelerator. As a result, we improve compress
latency and reclaim throughput.
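
To sketch the batching idea (purely illustrative pseudo-code, not part
of this series; submit_iaa_compress() and wait_iaa_compress() are
hypothetical helper names, not existing kernel APIs):

    static bool zswap_batch_compress(struct folio *folio,
                                     struct zswap_entry *entries[],
                                     long nr_pages)
    {
            long i;

            /* Asynchronously submit every page of the mTHP for compression. */
            for (i = 0; i < nr_pages; ++i)
                    submit_iaa_compress(folio_page(folio, i), entries[i]);

            /*
             * The compress engines on the socket work on the submitted
             * pages in parallel; wait for all completions.
             */
            for (i = 0; i < nr_pages; ++i)
                    if (!wait_iaa_compress(entries[i]))
                            return false;

            return true;
    }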

The following compares the same usemem workload characteristics between:

1) zstd (v7 experiments)
2) deflate-iaa "Fixed mode" (v7 experiments)
3) deflate-iaa with batching
4) deflate-iaa-canned "Canned mode" [3] with batching

vm.page-cluster is set to "2" for all runs.

64K mTHP ZSWAP:
===============

 -------------------------------------------------------------------------------
 ZSWAP            zstd   IAA Fixed   IAA Fixed  IAA Canned     IAA    IAA    IAA
 compressor       (v7)        (v7)  + Batching  + Batching   Batch Canned Canned
                                                               vs.    vs.  Batch
 64K mTHP                                                    Seqtl  Fixed    vs.
                                                                            ZSTD
 ------------------------------------------------------------------------------- 
 Throughput    153,550     129,609     156,215     166,975   21%     7%       9%
     (KB/s)
 elapsed time    23.90       25.19       22.46       21.38   11%     5%      11%
        (sec)
 sys time       757.70      731.13      715.62      648.83    2%     9%      14%
    (sec)
 memcg_high    148,075     192,744     197,548     181,734
 memcg_swap_     2,204       2,215       2,293       2,263
  fail
 pswpin              0           0           0           0 
 pswpout             0           0           0           0 
 zswpin            760         902         774         833
 zswpout    10,010,017  13,193,554  13,193,176  12,125,616
 thp_swpout          0           0           0           0 
 thp_swpout_         0           0           0           0 
  fallback
 64kB-mthp_      2,204       2,215       2,293       2,263
  swpout_
  fallback
 pgmajfault      3,054       3,259       3,545       3,516
 ZSWPOUT-64kB  623,451     822,268     822,176     755,480
 SWPOUT-64kB         0           0           0           0 
 swap_ra           146         161         152         159
 swap_ra_hit        64         121          68          88
 -------------------------------------------------------------------------------
				   

2M THP ZSWAP:
=============

 -------------------------------------------------------------------------------
 ZSWAP            zstd   IAA Fixed   IAA Fixed  IAA Canned     IAA    IAA    IAA
 compressor       (v7)        (v7)  + Batching  + Batching   Batch Canned Canned
                                                               vs.    vs.  Batch
 2M THP                                                      Seqtl  Fixed    vs.
                                                                            ZSTD
 ------------------------------------------------------------------------------- 
 Throughput    169,404     141,168     175,089     193,407     24%    10%    14%
     (KB/s)
 elapsed time    23.02       23.37       21.13       19.97     10%     5%    13%
        (sec)
 sys time       613.26      677.83      630.51      533.80      7%    15%    13%
    (sec)
 memcg_high     17,374      23,890      24,349      22,374
 memcg_swap_       114         144         102          88
  fail
 pswpin              0           0           0           0
 pswpout             0           0           0           0
 zswpin            778         922       6,492       6,642
 zswpout    10,010,893  13,195,600  13,199,907  12,132,265
 thp_swpout          0           0           0           0
 thp_swpout_       114         144         102          88
  fallback
 pgmajfault      4,151       5,066       5,032       4,999
 ZSWPOUT-2MB    19,442      25,615      25,666      23,594
 SWPOUT-2MB          0           0           0           0
 swap_ra             3           9       4,383       4,494
 swap_ra_hit         2           6       4,298       4,412
 -------------------------------------------------------------------------------


With ZSWAP IAA compress/decompress batching, we are able to demonstrate
significant performance improvements and memory savings in scalability
experiments under memory pressure, as compared to software compressors. We
hope to submit this work in subsequent patch series.

Thanks,
Kanchana

[1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
[2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
[3] https://patchwork.kernel.org/project/linux-crypto/cover/cover.1710969449.git.andre.glover@linux.intel.com/


Kanchana P Sridhar (8):
  mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
  mm: zswap: Modify zswap_compress() to accept a page instead of a
    folio.
  mm: zswap: Refactor code to store an entry in zswap xarray.
  mm: zswap: Refactor code to delete stored offsets in case of errors.
  mm: zswap: Compress and store a specific page in a folio.
  mm: zswap: Support mTHP swapout in zswap_store().
  mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout
    stats.
  mm: Document the newly added mTHP zswpout stats, clarify swpout
    semantics.

 Documentation/admin-guide/mm/transhuge.rst |   8 +-
 include/linux/huge_mm.h                    |   1 +
 include/linux/memcontrol.h                 |   4 +
 mm/Kconfig                                 |   8 +
 mm/huge_memory.c                           |   3 +
 mm/page_io.c                               |   1 +
 mm/zswap.c                                 | 248 ++++++++++++++++-----
 7 files changed, 210 insertions(+), 63 deletions(-)


base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
-- 
2.27.0




* [PATCH v7 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
  2024-09-24  1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
@ 2024-09-24  1:17 ` Kanchana P Sridhar
  2024-09-24 16:45   ` Nhat Pham
  2024-09-24  1:17 ` [PATCH v7 2/8] mm: zswap: Modify zswap_compress() to accept a page instead of a folio Kanchana P Sridhar
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 79+ messages in thread
From: Kanchana P Sridhar @ 2024-09-24  1:17 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This resolves an issue with obj_cgroup_get() not being defined if
CONFIG_MEMCG is not defined.

Before this patch, we would see build errors if obj_cgroup_get() is
called from code that is agnostic of CONFIG_MEMCG.

The zswap_store() changes for mTHP in subsequent commits will require
the use of obj_cgroup_get() in zswap code that falls into this category.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/memcontrol.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 34d2da05f2f1..15c2716f9aa3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1282,6 +1282,10 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
 	return NULL;
 }
 
+static inline void obj_cgroup_get(struct obj_cgroup *objcg)
+{
+}
+
 static inline void obj_cgroup_put(struct obj_cgroup *objcg)
 {
 }
-- 
2.27.0




* [PATCH v7 2/8] mm: zswap: Modify zswap_compress() to accept a page instead of a folio.
  2024-09-24  1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
  2024-09-24  1:17 ` [PATCH v7 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
@ 2024-09-24  1:17 ` Kanchana P Sridhar
  2024-09-24 16:50   ` Nhat Pham
  2024-09-24  1:17 ` [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in zswap xarray Kanchana P Sridhar
                   ` (7 subsequent siblings)
  9 siblings, 1 reply; 79+ messages in thread
From: Kanchana P Sridhar @ 2024-09-24  1:17 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

For zswap_store() to be able to store an mTHP by compressing
it one page at a time, zswap_compress() needs to accept a page
as input. This will allow us to iterate through each page in
the mTHP in zswap_store(), compress it and store it in the zpool.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 449914ea9919..59b7733a62d3 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -876,7 +876,7 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 	return 0;
 }
 
-static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
+static bool zswap_compress(struct page *page, struct zswap_entry *entry)
 {
 	struct crypto_acomp_ctx *acomp_ctx;
 	struct scatterlist input, output;
@@ -894,7 +894,7 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
 
 	dst = acomp_ctx->buffer;
 	sg_init_table(&input, 1);
-	sg_set_folio(&input, folio, PAGE_SIZE, 0);
+	sg_set_page(&input, page, PAGE_SIZE, 0);
 
 	/*
 	 * We need PAGE_SIZE * 2 here since there maybe over-compression case,
@@ -1458,7 +1458,7 @@ bool zswap_store(struct folio *folio)
 		mem_cgroup_put(memcg);
 	}
 
-	if (!zswap_compress(folio, entry))
+	if (!zswap_compress(&folio->page, entry))
 		goto put_pool;
 
 	entry->swpentry = swp;
-- 
2.27.0




* [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in zswap xarray.
  2024-09-24  1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
  2024-09-24  1:17 ` [PATCH v7 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
  2024-09-24  1:17 ` [PATCH v7 2/8] mm: zswap: Modify zswap_compress() to accept a page instead of a folio Kanchana P Sridhar
@ 2024-09-24  1:17 ` Kanchana P Sridhar
  2024-09-24 17:16   ` Nhat Pham
  2024-09-24 19:14   ` Yosry Ahmed
  2024-09-24  1:17 ` [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors Kanchana P Sridhar
                   ` (6 subsequent siblings)
  9 siblings, 2 replies; 79+ messages in thread
From: Kanchana P Sridhar @ 2024-09-24  1:17 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Added a new procedure zswap_store_entry() that refactors the code
currently in zswap_store() to store an entry in the zswap xarray.
This will allow us to call this procedure to store the swap offset
of each page of an mTHP in the xarray, as part of zswap_store()
supporting mTHP.

Also, made a minor edit in the comments for 'struct zswap_entry' to delete
the description of the 'value' member that was deleted in commit
20a5532ffa53d6ecf41ded920a7b0ff9c65a7dcf ("mm: remove code to handle
same filled pages").

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 51 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 34 insertions(+), 17 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 59b7733a62d3..fd35a81b6e36 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -190,7 +190,6 @@ static struct shrinker *zswap_shrinker;
  *              section for context.
  * pool - the zswap_pool the entry's data is in
  * handle - zpool allocation handle that stores the compressed page data
- * value - value of the same-value filled pages which have same content
  * objcg - the obj_cgroup that the compressed memory is charged to
  * lru - handle to the pool's lru used to evict pages.
  */
@@ -1404,12 +1403,44 @@ static void shrink_worker(struct work_struct *w)
 /*********************************
 * main API
 **********************************/
+
+/*
+ * Returns true if the entry was successfully
+ * stored in the xarray, and false otherwise.
+ */
+static bool zswap_store_entry(struct xarray *tree,
+			      struct zswap_entry *entry)
+{
+	struct zswap_entry *old;
+	pgoff_t offset = swp_offset(entry->swpentry);
+
+	old = xa_store(tree, offset, entry, GFP_KERNEL);
+
+	if (xa_is_err(old)) {
+		int err = xa_err(old);
+
+		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
+		zswap_reject_alloc_fail++;
+		return false;
+	}
+
+	/*
+	 * We may have had an existing entry that became stale when
+	 * the folio was redirtied and now the new version is being
+	 * swapped out. Get rid of the old.
+	 */
+	if (old)
+		zswap_entry_free(old);
+
+	return true;
+}
+
 bool zswap_store(struct folio *folio)
 {
 	swp_entry_t swp = folio->swap;
 	pgoff_t offset = swp_offset(swp);
 	struct xarray *tree = swap_zswap_tree(swp);
-	struct zswap_entry *entry, *old;
+	struct zswap_entry *entry;
 	struct obj_cgroup *objcg = NULL;
 	struct mem_cgroup *memcg = NULL;
 
@@ -1465,22 +1496,8 @@ bool zswap_store(struct folio *folio)
 	entry->objcg = objcg;
 	entry->referenced = true;
 
-	old = xa_store(tree, offset, entry, GFP_KERNEL);
-	if (xa_is_err(old)) {
-		int err = xa_err(old);
-
-		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
-		zswap_reject_alloc_fail++;
+	if (!zswap_store_entry(tree, entry))
 		goto store_failed;
-	}
-
-	/*
-	 * We may have had an existing entry that became stale when
-	 * the folio was redirtied and now the new version is being
-	 * swapped out. Get rid of the old.
-	 */
-	if (old)
-		zswap_entry_free(old);
 
 	if (objcg) {
 		obj_cgroup_charge_zswap(objcg, entry->length);
-- 
2.27.0




* [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors.
  2024-09-24  1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
                   ` (2 preceding siblings ...)
  2024-09-24  1:17 ` [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in zswap xarray Kanchana P Sridhar
@ 2024-09-24  1:17 ` Kanchana P Sridhar
  2024-09-24 17:25   ` Nhat Pham
  2024-09-24 19:20   ` Yosry Ahmed
  2024-09-24  1:17 ` [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio Kanchana P Sridhar
                   ` (5 subsequent siblings)
  9 siblings, 2 replies; 79+ messages in thread
From: Kanchana P Sridhar @ 2024-09-24  1:17 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Added a new procedure zswap_delete_stored_offsets() that can be
called to delete a folio's stored offsets from the zswap xarray
in case zswap_store() fails or zswap is disabled.

Refactored the code in zswap_store() that handles these cases,
to call zswap_delete_stored_offsets().

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 33 ++++++++++++++++++++++++++++++---
 1 file changed, 30 insertions(+), 3 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index fd35a81b6e36..9bea948d653e 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1435,8 +1435,37 @@ static bool zswap_store_entry(struct xarray *tree,
 	return true;
 }
 
+/*
+ * If the zswap store fails or zswap is disabled, we must invalidate the
+ * possibly stale entries which were previously stored at the offsets
+ * corresponding to each page of the folio. Otherwise, writeback could
+ * overwrite the new data in the swapfile.
+ *
+ * This is called after the store of an offset in a large folio has failed.
+ * All zswap entries in the folio must be deleted. This helps make sure
+ * that a swapped-out mTHP is either entirely stored in zswap, or entirely
+ * not stored in zswap.
+ *
+ * This is also called if zswap_store() is invoked, but zswap is not enabled.
+ * All offsets for the folio are deleted from zswap in this case.
+ */
+static void zswap_delete_stored_offsets(struct xarray *tree,
+					pgoff_t offset,
+					long nr_pages)
+{
+	struct zswap_entry *entry;
+	long i;
+
+	for (i = 0; i < nr_pages; ++i) {
+		entry = xa_erase(tree, offset + i);
+		if (entry)
+			zswap_entry_free(entry);
+	}
+}
+
 bool zswap_store(struct folio *folio)
 {
+	long nr_pages = folio_nr_pages(folio);
 	swp_entry_t swp = folio->swap;
 	pgoff_t offset = swp_offset(swp);
 	struct xarray *tree = swap_zswap_tree(swp);
@@ -1541,9 +1570,7 @@ bool zswap_store(struct folio *folio)
 	 * possibly stale entry which was previously stored at this offset.
 	 * Otherwise, writeback could overwrite the new data in the swapfile.
 	 */
-	entry = xa_erase(tree, offset);
-	if (entry)
-		zswap_entry_free(entry);
+	zswap_delete_stored_offsets(tree, offset, nr_pages);
 	return false;
 }
 
-- 
2.27.0




* [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio.
  2024-09-24  1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
                   ` (3 preceding siblings ...)
  2024-09-24  1:17 ` [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors Kanchana P Sridhar
@ 2024-09-24  1:17 ` Kanchana P Sridhar
  2024-09-24 19:28   ` Yosry Ahmed
  2024-09-24  1:17 ` [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store() Kanchana P Sridhar
                   ` (4 subsequent siblings)
  9 siblings, 1 reply; 79+ messages in thread
From: Kanchana P Sridhar @ 2024-09-24  1:17 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

For zswap_store() to handle mTHP folios, we need to iterate through each
page in the mTHP, compress it and store it in the zswap pool. This patch
introduces an auxiliary function zswap_store_page() that provides this
functionality.

The function signature reflects the design intent, namely, for it
to be invoked by zswap_store() per-page in an mTHP. Hence, the folio's
objcg and the zswap_pool to use are input parameters, for the sake of
efficiency and consistency.

The functionality in zswap_store_page() is reused and adapted from
Ryan Roberts' RFC patch [1]:

  "[RFC,v1] mm: zswap: Store large folios without splitting"

  [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u

Co-developed-by: Ryan Roberts
Signed-off-by:
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 88 insertions(+)

diff --git a/mm/zswap.c b/mm/zswap.c
index 9bea948d653e..8f2e0ab34c84 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1463,6 +1463,94 @@ static void zswap_delete_stored_offsets(struct xarray *tree,
 	}
 }
 
+/*
+ * Stores the page at the specified "index" in a folio.
+ *
+ * @folio: The folio to store in zswap.
+ * @index: Index of the page in the folio that this function will store.
+ * @objcg: The folio's objcg.
+ * @pool:  The zswap_pool to store the compressed data for the page.
+ */
+static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
+					    struct obj_cgroup *objcg,
+					    struct zswap_pool *pool)
+{
+	swp_entry_t swp = folio->swap;
+	int type = swp_type(swp);
+	pgoff_t offset = swp_offset(swp) + index;
+	struct page *page = folio_page(folio, index);
+	struct xarray *tree = swap_zswap_tree(swp);
+	struct zswap_entry *entry;
+
+	if (objcg)
+		obj_cgroup_get(objcg);
+
+	if (zswap_check_limits())
+		goto reject;
+
+	/* allocate entry */
+	entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
+	if (!entry) {
+		zswap_reject_kmemcache_fail++;
+		goto reject;
+	}
+
+	/* if entry is successfully added, it keeps the reference */
+	if (!zswap_pool_get(pool))
+		goto freepage;
+
+	entry->pool = pool;
+
+	if (!zswap_compress(page, entry))
+		goto put_pool;
+
+	entry->swpentry = swp_entry(type, offset);
+	entry->objcg = objcg;
+	entry->referenced = true;
+
+	if (!zswap_store_entry(tree, entry))
+		goto store_failed;
+
+	if (objcg) {
+		obj_cgroup_charge_zswap(objcg, entry->length);
+		count_objcg_event(objcg, ZSWPOUT);
+	}
+
+	/*
+	 * We finish initializing the entry while it's already in xarray.
+	 * This is safe because:
+	 *
+	 * 1. Concurrent stores and invalidations are excluded by folio lock.
+	 *
+	 * 2. Writeback is excluded by the entry not being on the LRU yet.
+	 *    The publishing order matters to prevent writeback from seeing
+	 *    an incoherent entry.
+	 */
+	if (entry->length) {
+		INIT_LIST_HEAD(&entry->lru);
+		zswap_lru_add(&zswap_list_lru, entry);
+	}
+
+	/* update stats */
+	atomic_inc(&zswap_stored_pages);
+	count_vm_event(ZSWPOUT);
+
+	return true;
+
+store_failed:
+	zpool_free(entry->pool->zpool, entry->handle);
+put_pool:
+	zswap_pool_put(pool);
+freepage:
+	zswap_entry_cache_free(entry);
+reject:
+	obj_cgroup_put(objcg);
+	if (zswap_pool_reached_full)
+		queue_work(shrink_wq, &zswap_shrink_work);
+
+	return false;
+}
+
 bool zswap_store(struct folio *folio)
 {
 	long nr_pages = folio_nr_pages(folio);
-- 
2.27.0




* [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24  1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
                   ` (4 preceding siblings ...)
  2024-09-24  1:17 ` [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio Kanchana P Sridhar
@ 2024-09-24  1:17 ` Kanchana P Sridhar
  2024-09-24 17:33   ` Nhat Pham
                     ` (2 more replies)
  2024-09-24  1:17 ` [PATCH v7 7/8] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout stats Kanchana P Sridhar
                   ` (3 subsequent siblings)
  9 siblings, 3 replies; 79+ messages in thread
From: Kanchana P Sridhar @ 2024-09-24  1:17 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

zswap_store() will now store mTHP and PMD-size THP folios by compressing
them page by page.

This patch provides a sequential implementation of storing an mTHP in
zswap_store() by iterating through each page in the folio to compress
and store it in the zswap zpool.

Towards this goal, zswap_compress() is modified to take a page instead
of a folio as input.

Each page's swap offset is stored as a separate zswap entry.

If an error is encountered during the store of any page in the mTHP,
all previous pages/entries stored will be invalidated. Thus, an mTHP
is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.

This forms the basis for building batching of pages during zswap store
of large folios by compressing batches of up to say, 8 pages in an
mTHP in parallel in hardware, with the Intel In-Memory Analytics
Accelerator (Intel IAA).

A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
will enable/disable zswap storing of (m)THP. The corresponding tunable
zswap module parameter is "mthp_enabled".

This change reuses and adapts the functionality in Ryan Roberts' RFC
patch [1]:

  "[RFC,v1] mm: zswap: Store large folios without splitting"

  [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u

Also, addressed some of the RFC comments from the discussion in [1].

Co-developed-by: Ryan Roberts
Signed-off-by:
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/Kconfig |   8 ++++
 mm/zswap.c | 122 +++++++++++++++++++++++++----------------------------
 2 files changed, 66 insertions(+), 64 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 09aebca1cae3..c659fb732ec4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
 	  reducing the chance that cold pages will reside in the zswap pool
 	  and consume memory indefinitely.
 
+config ZSWAP_STORE_THP_DEFAULT_ON
+	bool "Store mTHP and THP folios in zswap"
+	depends on ZSWAP
+	default n
+	help
+	  If selected, zswap will process mTHP and THP folios by
+	  compressing and storing each 4K page in the large folio.
+
 choice
 	prompt "Default compressor"
 	depends on ZSWAP
diff --git a/mm/zswap.c b/mm/zswap.c
index 8f2e0ab34c84..16ab770546d6 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
 		CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
 module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
 
+/*
+ * Enable/disable zswap processing of mTHP folios.
+ * For now, only zswap_store will process mTHP folios.
+ */
+static bool zswap_mthp_enabled = IS_ENABLED(
+		CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
+module_param_named(mthp_enabled, zswap_mthp_enabled, bool, 0644);
+
 bool zswap_is_enabled(void)
 {
 	return zswap_enabled;
@@ -1471,9 +1479,9 @@ static void zswap_delete_stored_offsets(struct xarray *tree,
  * @objcg: The folio's objcg.
  * @pool:  The zswap_pool to store the compressed data for the page.
  */
-static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
-					    struct obj_cgroup *objcg,
-					    struct zswap_pool *pool)
+static bool zswap_store_page(struct folio *folio, long index,
+			     struct obj_cgroup *objcg,
+			     struct zswap_pool *pool)
 {
 	swp_entry_t swp = folio->swap;
 	int type = swp_type(swp);
@@ -1551,51 +1559,63 @@ static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
 	return false;
 }
 
+/*
+ * Modified to store mTHP folios. Each page in the mTHP will be compressed
+ * and stored sequentially.
+ */
 bool zswap_store(struct folio *folio)
 {
 	long nr_pages = folio_nr_pages(folio);
 	swp_entry_t swp = folio->swap;
 	pgoff_t offset = swp_offset(swp);
 	struct xarray *tree = swap_zswap_tree(swp);
-	struct zswap_entry *entry;
 	struct obj_cgroup *objcg = NULL;
 	struct mem_cgroup *memcg = NULL;
+	struct zswap_pool *pool;
+	bool ret = false;
+	long index;
 
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
 	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
 
-	/* Large folios aren't supported */
-	if (folio_test_large(folio))
+	/* Storing large folios isn't enabled */
+	if (!zswap_mthp_enabled && folio_test_large(folio))
 		return false;
 
 	if (!zswap_enabled)
-		goto check_old;
+		goto reject;
 
-	/* Check cgroup limits */
+	/*
+	 * Check cgroup limits:
+	 *
+	 * The cgroup zswap limit check is done once at the beginning of an
+	 * mTHP store, and not within zswap_store_page() for each page
+	 * in the mTHP. We do however check the zswap pool limits at the
+	 * start of zswap_store_page(). What this means is, the cgroup
+	 * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
+	 * However, the per-store-page zswap pool limits check should
+	 * hopefully trigger the cgroup aware and zswap LRU aware global
+	 * reclaim implemented in the shrinker. If this assumption holds,
+	 * the cgroup exceeding the zswap limits could potentially be
+	 * resolved before the next zswap_store, and if it is not, the next
+	 * zswap_store would fail the cgroup zswap limit check at the start.
+	 */
 	objcg = get_obj_cgroup_from_folio(folio);
 	if (objcg && !obj_cgroup_may_zswap(objcg)) {
 		memcg = get_mem_cgroup_from_objcg(objcg);
 		if (shrink_memcg(memcg)) {
 			mem_cgroup_put(memcg);
-			goto reject;
+			goto put_objcg;
 		}
 		mem_cgroup_put(memcg);
 	}
 
 	if (zswap_check_limits())
-		goto reject;
-
-	/* allocate entry */
-	entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
-	if (!entry) {
-		zswap_reject_kmemcache_fail++;
-		goto reject;
-	}
+		goto put_objcg;
 
-	/* if entry is successfully added, it keeps the reference */
-	entry->pool = zswap_pool_current_get();
-	if (!entry->pool)
-		goto freepage;
+	pool = zswap_pool_current_get();
+	if (!pool)
+		goto put_objcg;
 
 	if (objcg) {
 		memcg = get_mem_cgroup_from_objcg(objcg);
@@ -1606,60 +1626,34 @@ bool zswap_store(struct folio *folio)
 		mem_cgroup_put(memcg);
 	}
 
-	if (!zswap_compress(&folio->page, entry))
-		goto put_pool;
-
-	entry->swpentry = swp;
-	entry->objcg = objcg;
-	entry->referenced = true;
-
-	if (!zswap_store_entry(tree, entry))
-		goto store_failed;
-
-	if (objcg) {
-		obj_cgroup_charge_zswap(objcg, entry->length);
-		count_objcg_event(objcg, ZSWPOUT);
-	}
-
 	/*
-	 * We finish initializing the entry while it's already in xarray.
-	 * This is safe because:
-	 *
-	 * 1. Concurrent stores and invalidations are excluded by folio lock.
-	 *
-	 * 2. Writeback is excluded by the entry not being on the LRU yet.
-	 *    The publishing order matters to prevent writeback from seeing
-	 *    an incoherent entry.
+	 * Store each page of the folio as a separate entry. If we fail to store
+	 * a page, unwind by removing all the previous pages we stored.
 	 */
-	if (entry->length) {
-		INIT_LIST_HEAD(&entry->lru);
-		zswap_lru_add(&zswap_list_lru, entry);
+	for (index = 0; index < nr_pages; ++index) {
+		if (!zswap_store_page(folio, index, objcg, pool))
+			goto put_pool;
 	}
 
-	/* update stats */
-	atomic_inc(&zswap_stored_pages);
-	count_vm_event(ZSWPOUT);
-
-	return true;
+	ret = true;
 
-store_failed:
-	zpool_free(entry->pool->zpool, entry->handle);
 put_pool:
-	zswap_pool_put(entry->pool);
-freepage:
-	zswap_entry_cache_free(entry);
-reject:
+	zswap_pool_put(pool);
+put_objcg:
 	obj_cgroup_put(objcg);
 	if (zswap_pool_reached_full)
 		queue_work(shrink_wq, &zswap_shrink_work);
-check_old:
+reject:
 	/*
-	 * If the zswap store fails or zswap is disabled, we must invalidate the
-	 * possibly stale entry which was previously stored at this offset.
-	 * Otherwise, writeback could overwrite the new data in the swapfile.
+	 * If the zswap store fails or zswap is disabled, we must invalidate
+	 * the possibly stale entries which were previously stored at the
+	 * offsets corresponding to each page of the folio. Otherwise,
+	 * writeback could overwrite the new data in the swapfile.
 	 */
-	zswap_delete_stored_offsets(tree, offset, nr_pages);
-	return false;
+	if (!ret)
+		zswap_delete_stored_offsets(tree, offset, nr_pages);
+
+	return ret;
 }
 
 bool zswap_load(struct folio *folio)
-- 
2.27.0




* [PATCH v7 7/8] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout stats.
  2024-09-24  1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
                   ` (5 preceding siblings ...)
  2024-09-24  1:17 ` [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store() Kanchana P Sridhar
@ 2024-09-24  1:17 ` Kanchana P Sridhar
  2024-09-24  1:17 ` [PATCH v7 8/8] mm: Document the newly added mTHP zswpout stats, clarify swpout semantics Kanchana P Sridhar
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 79+ messages in thread
From: Kanchana P Sridhar @ 2024-09-24  1:17 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Add a new MTHP_STAT_ZSWPOUT entry to the sysfs mTHP stats so that
per-order mTHP folio ZSWAP stores can be accounted.

If zswap_store() successfully swaps out an mTHP, it will be counted under
the per-order sysfs "zswpout" stats:

/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout

Other block dev/fs mTHP swap-out events will be counted under
the existing sysfs "swpout" stats:

/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/huge_mm.h | 1 +
 mm/huge_memory.c        | 3 +++
 mm/page_io.c            | 1 +
 3 files changed, 5 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0b0539f4ee1a..ab95b94e9627 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -118,6 +118,7 @@ enum mthp_stat_item {
 	MTHP_STAT_ANON_FAULT_ALLOC,
 	MTHP_STAT_ANON_FAULT_FALLBACK,
 	MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
+	MTHP_STAT_ZSWPOUT,
 	MTHP_STAT_SWPOUT,
 	MTHP_STAT_SWPOUT_FALLBACK,
 	MTHP_STAT_SHMEM_ALLOC,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4e34b7f89daf..7d8ce7891ba8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -612,6 +612,7 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
 DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
 DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
 DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT);
 DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK);
 #ifdef CONFIG_SHMEM
@@ -630,6 +631,7 @@ static struct attribute *anon_stats_attrs[] = {
 	&anon_fault_fallback_attr.attr,
 	&anon_fault_fallback_charge_attr.attr,
 #ifndef CONFIG_SHMEM
+	&zswpout_attr.attr,
 	&swpout_attr.attr,
 	&swpout_fallback_attr.attr,
 #endif
@@ -660,6 +662,7 @@ static struct attribute_group file_stats_attr_grp = {
 
 static struct attribute *any_stats_attrs[] = {
 #ifdef CONFIG_SHMEM
+	&zswpout_attr.attr,
 	&swpout_attr.attr,
 	&swpout_fallback_attr.attr,
 #endif
diff --git a/mm/page_io.c b/mm/page_io.c
index bc1183299a7d..4aa34862676f 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -269,6 +269,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
 		swap_zeromap_folio_clear(folio);
 	}
 	if (zswap_store(folio)) {
+		count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
 		folio_unlock(folio);
 		return 0;
 	}
-- 
2.27.0




* [PATCH v7 8/8] mm: Document the newly added mTHP zswpout stats, clarify swpout semantics.
  2024-09-24  1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
                   ` (6 preceding siblings ...)
  2024-09-24  1:17 ` [PATCH v7 7/8] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout stats Kanchana P Sridhar
@ 2024-09-24  1:17 ` Kanchana P Sridhar
  2024-09-24 17:36   ` Nhat Pham
  2024-09-24 19:34 ` [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Yosry Ahmed
  2024-09-25  6:35 ` Huang, Ying
  9 siblings, 1 reply; 79+ messages in thread
From: Kanchana P Sridhar @ 2024-09-24  1:17 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Added documentation for the newly added sysfs mTHP "zswpout" stats.

Clarified that only non-ZSWAP mTHP swapouts will be accounted in the mTHP
"swpout" stats.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index cfdd16a52e39..a65f905e9ca7 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -530,10 +530,14 @@ anon_fault_fallback_charge
 	instead falls back to using huge pages with lower orders or
 	small pages even though the allocation was successful.
 
-swpout
-	is incremented every time a huge page is swapped out in one
+zswpout
+	is incremented every time a huge page is swapped out to ZSWAP in one
 	piece without splitting.
 
+swpout
+	is incremented every time a huge page is swapped out to a non-ZSWAP
+	swap entity in one piece without splitting.
+
 swpout_fallback
 	is incremented if a huge page has to be split before swapout.
 	Usually because failed to allocate some continuous swap space
-- 
2.27.0



^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
  2024-09-24  1:17 ` [PATCH v7 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
@ 2024-09-24 16:45   ` Nhat Pham
  0 siblings, 0 replies; 79+ messages in thread
From: Nhat Pham @ 2024-09-24 16:45 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
	akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal

On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> This resolves an issue with obj_cgroup_get() not being defined if
> CONFIG_MEMCG is not defined.
>
> Before this patch, we would see build errors if obj_cgroup_get() is
> called from code that is agnostic of CONFIG_MEMCG.
>
> The zswap_store() changes for mTHP in subsequent commits will require
> the use of obj_cgroup_get() in zswap code that falls into this category.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---

LGTM.

Reviewed-by: Nhat Pham <nphamcs@gmail.com>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 2/8] mm: zswap: Modify zswap_compress() to accept a page instead of a folio.
  2024-09-24  1:17 ` [PATCH v7 2/8] mm: zswap: Modify zswap_compress() to accept a page instead of a folio Kanchana P Sridhar
@ 2024-09-24 16:50   ` Nhat Pham
  0 siblings, 0 replies; 79+ messages in thread
From: Nhat Pham @ 2024-09-24 16:50 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
	akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal

On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> For zswap_store() to be able to store an mTHP by compressing
> it one page at a time, zswap_compress() needs to accept a page
> as input. This will allow us to iterate through each page in
> the mTHP in zswap_store(), compress it and store it in the zpool.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>

Reviewed-by: Nhat Pham <nphamcs@gmail.com>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in zswap xarray.
  2024-09-24  1:17 ` [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in zswap xarray Kanchana P Sridhar
@ 2024-09-24 17:16   ` Nhat Pham
  2024-09-24 20:40     ` Sridhar, Kanchana P
  2024-09-24 19:14   ` Yosry Ahmed
  1 sibling, 1 reply; 79+ messages in thread
From: Nhat Pham @ 2024-09-24 17:16 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
	akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal

On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Added a new procedure zswap_store_entry() that refactors the code
> currently in zswap_store() to store an entry in the zswap xarray.
> This will allow us to call this procedure to store the swap offset of
> each page in an mTHP in the xarray, as part of zswap_store()
> supporting mTHP.
>
> Also, made a minor edit in the comments for 'struct zswap_entry' to delete
> the description of the 'value' member that was deleted in commit
> 20a5532ffa53d6ecf41ded920a7b0ff9c65a7dcf ("mm: remove code to handle
> same filled pages").

nit: This probably should be a separate patch...

>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>

Otherwise, LGTM :)

Reviewed-by: Nhat Pham <nphamcs@gmail.com>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors.
  2024-09-24  1:17 ` [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors Kanchana P Sridhar
@ 2024-09-24 17:25   ` Nhat Pham
  2024-09-24 20:41     ` Sridhar, Kanchana P
  2024-09-24 19:20   ` Yosry Ahmed
  1 sibling, 1 reply; 79+ messages in thread
From: Nhat Pham @ 2024-09-24 17:25 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
	akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal

On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Added a new procedure zswap_delete_stored_offsets() that can be
> called to delete stored offsets in a folio in case zswap_store()
> fails or zswap is disabled.
>
> Refactored the code in zswap_store() that handles these cases,
> to call zswap_delete_stored_offsets().
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/zswap.c | 33 ++++++++++++++++++++++++++++++---
>  1 file changed, 30 insertions(+), 3 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index fd35a81b6e36..9bea948d653e 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1435,8 +1435,37 @@ static bool zswap_store_entry(struct xarray *tree,
>         return true;
>  }
>
> +/*
> + * If the zswap store fails or zswap is disabled, we must invalidate the
> + * possibly stale entries which were previously stored at the offsets
> + * corresponding to each page of the folio. Otherwise, writeback could
> + * overwrite the new data in the swapfile.
> + *
> + * This is called after the store of an offset in a large folio has failed.

"store of a subpage" rather than "stored of an offset"?


> + * All zswap entries in the folio must be deleted. This helps make sure
> + * that a swapped-out mTHP is either entirely stored in zswap, or entirely
> + * not stored in zswap.
> + *
> + * This is also called if zswap_store() is invoked, but zswap is not enabled.
> + * All offsets for the folio are deleted from zswap in this case.
> + */
> +static void zswap_delete_stored_offsets(struct xarray *tree,
> +                                       pgoff_t offset,
> +                                       long nr_pages)
> +{
> +       struct zswap_entry *entry;
> +       long i;
> +
> +       for (i = 0; i < nr_pages; ++i) {
> +               entry = xa_erase(tree, offset + i);
> +               if (entry)
> +                       zswap_entry_free(entry);
> +       }
> +}
> +


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24  1:17 ` [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store() Kanchana P Sridhar
@ 2024-09-24 17:33   ` Nhat Pham
  2024-09-24 20:51     ` Sridhar, Kanchana P
  2024-09-24 19:38   ` Yosry Ahmed
  2024-09-25 14:27   ` Johannes Weiner
  2 siblings, 1 reply; 79+ messages in thread
From: Nhat Pham @ 2024-09-24 17:33 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
	akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal

On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> zswap_store() will now store mTHP and PMD-size THP folios by compressing
> them page by page.
>
> This patch provides a sequential implementation of storing an mTHP in
> zswap_store() by iterating through each page in the folio to compress
> and store it in the zswap zpool.
>
> Towards this goal, zswap_compress() is modified to take a page instead
> of a folio as input.
>
> Each page's swap offset is stored as a separate zswap entry.
>
> If an error is encountered during the store of any page in the mTHP,
> all previous pages/entries stored will be invalidated. Thus, an mTHP
> is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
>
> This forms the basis for building batching of pages during zswap store
> of large folios by compressing batches of up to, say, 8 pages in an
> mTHP in parallel in hardware, with the Intel In-Memory Analytics
> Accelerator (Intel IAA).
>
> A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> will enable/disable zswap storing of (m)THP. The corresponding tunable
> zswap module parameter is "mthp_enabled".
>
> This change reuses and adapts the functionality in Ryan Roberts' RFC
> patch [1]:
>
>   "[RFC,v1] mm: zswap: Store large folios without splitting"
>
>   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Also, addressed some of the RFC comments from the discussion in [1].
>
> Co-developed-by: Ryan Roberts
> Signed-off-by:
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/Kconfig |   8 ++++
>  mm/zswap.c | 122 +++++++++++++++++++++++++----------------------------
>  2 files changed, 66 insertions(+), 64 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 09aebca1cae3..c659fb732ec4 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
>           reducing the chance that cold pages will reside in the zswap pool
>           and consume memory indefinitely.
>
> +config ZSWAP_STORE_THP_DEFAULT_ON
> +       bool "Store mTHP and THP folios in zswap"
> +       depends on ZSWAP
> +       default n
> +       help
> +         If selected, zswap will process mTHP and THP folios by
> +         compressing and storing each 4K page in the large folio.
> +
>  choice
>         prompt "Default compressor"
>         depends on ZSWAP
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 8f2e0ab34c84..16ab770546d6 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
>                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
>  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
>
> +/*
> + * Enable/disable zswap processing of mTHP folios.
> + * For now, only zswap_store will process mTHP folios.
> + */
> +static bool zswap_mthp_enabled = IS_ENABLED(
> +               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
> +module_param_named(mthp_enabled, zswap_mthp_enabled, bool, 0644);
> +

Hmm, so this is a runtime knob. Also, should this be zswap_thp_enabled? :)

>  bool zswap_is_enabled(void)
>  {
>         return zswap_enabled;
> @@ -1471,9 +1479,9 @@ static void zswap_delete_stored_offsets(struct xarray *tree,
>   * @objcg: The folio's objcg.
>   * @pool:  The zswap_pool to store the compressed data for the page.
>   */
> -static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
> -                                           struct obj_cgroup *objcg,
> -                                           struct zswap_pool *pool)
> +static bool zswap_store_page(struct folio *folio, long index,
> +                            struct obj_cgroup *objcg,
> +                            struct zswap_pool *pool)
>  {
>         swp_entry_t swp = folio->swap;
>         int type = swp_type(swp);
> @@ -1551,51 +1559,63 @@ static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
>         return false;
>  }
>
> +/*
> + * Modified to store mTHP folios. Each page in the mTHP will be compressed
> + * and stored sequentially.
> + */
>  bool zswap_store(struct folio *folio)
>  {
>         long nr_pages = folio_nr_pages(folio);
>         swp_entry_t swp = folio->swap;
>         pgoff_t offset = swp_offset(swp);
>         struct xarray *tree = swap_zswap_tree(swp);
> -       struct zswap_entry *entry;
>         struct obj_cgroup *objcg = NULL;
>         struct mem_cgroup *memcg = NULL;
> +       struct zswap_pool *pool;
> +       bool ret = false;
> +       long index;
>
>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
>
> -       /* Large folios aren't supported */
> -       if (folio_test_large(folio))
> +       /* Storing large folios isn't enabled */
> +       if (!zswap_mthp_enabled && folio_test_large(folio))
>                 return false;

Hmm, can this go wrong somehow? Can we have a case where we enable
zswap_mthp_enabled, have a large folio written to zswap, disable
zswap_mthp_enabled, and attempt to store that folio to zswap again?

Now, we have a stale copy in zswap that is not invalidated...?

Or am I missing something here :)
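
For illustration, the sequence I have in mind (hypothetical):

	/*
	 * 1. mthp_enabled == true: zswap_store() stores all subpages
	 *    of a large folio; entries exist in the xarray.
	 * 2. The folio stays in the swapcache and is redirtied, so the
	 *    zswap entries at its offsets become stale.
	 * 3. mthp_enabled is set to false at runtime.
	 * 4. The next zswap_store() of that folio returns false before
	 *    any invalidation runs, so swap_writepage() writes the new
	 *    data to disk while zswap still holds the stale entries.
	 */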


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 8/8] mm: Document the newly added mTHP zswpout stats, clarify swpout semantics.
  2024-09-24  1:17 ` [PATCH v7 8/8] mm: Document the newly added mTHP zswpout stats, clarify swpout semantics Kanchana P Sridhar
@ 2024-09-24 17:36   ` Nhat Pham
  2024-09-24 20:52     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 79+ messages in thread
From: Nhat Pham @ 2024-09-24 17:36 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
	akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal

On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Added documentation for the newly added sysfs mTHP "zswpout" stats.
>
> Clarified that only non-ZSWAP mTHP swapouts will be accounted in the mTHP
> "swpout" stats.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  Documentation/admin-guide/mm/transhuge.rst | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
> index cfdd16a52e39..a65f905e9ca7 100644
> --- a/Documentation/admin-guide/mm/transhuge.rst
> +++ b/Documentation/admin-guide/mm/transhuge.rst
> @@ -530,10 +530,14 @@ anon_fault_fallback_charge
>         instead falls back to using huge pages with lower orders or
>         small pages even though the allocation was successful.
>
> -swpout
> -       is incremented every time a huge page is swapped out in one
> +zswpout
> +       is incremented every time a huge page is swapped out to ZSWAP in one
>         piece without splitting.

nit: a bit weird to capitalize ZSWAP no? :)

>
> +swpout
> +       is incremented every time a huge page is swapped out to a non-ZSWAP
> +       swap entity in one piece without splitting.
> +

nit: "non-zswap swap entity" is a bit awkward. Maybe swapped out to a
non-zswap swap device?


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in zswap xarray.
  2024-09-24  1:17 ` [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in zswap xarray Kanchana P Sridhar
  2024-09-24 17:16   ` Nhat Pham
@ 2024-09-24 19:14   ` Yosry Ahmed
  2024-09-24 22:22     ` Sridhar, Kanchana P
  1 sibling, 1 reply; 79+ messages in thread
From: Yosry Ahmed @ 2024-09-24 19:14 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
	akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal

On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Added a new procedure zswap_store_entry() that refactors the code
> currently in zswap_store() to store an entry in the zswap xarray.
> This will allow us to call this procedure to store the swap offset of
> each page in an mTHP in the xarray, as part of zswap_store()
> supporting mTHP.
>
> Also, made a minor edit in the comments for 'struct zswap_entry' to delete
> the description of the 'value' member that was deleted in commit
> 20a5532ffa53d6ecf41ded920a7b0ff9c65a7dcf ("mm: remove code to handle
> same filled pages").
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/zswap.c | 51 ++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 34 insertions(+), 17 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 59b7733a62d3..fd35a81b6e36 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -190,7 +190,6 @@ static struct shrinker *zswap_shrinker;
>   *              section for context.
>   * pool - the zswap_pool the entry's data is in
>   * handle - zpool allocation handle that stores the compressed page data
> - * value - value of the same-value filled pages which have same content
>   * objcg - the obj_cgroup that the compressed memory is charged to
>   * lru - handle to the pool's lru used to evict pages.
>   */
> @@ -1404,12 +1403,44 @@ static void shrink_worker(struct work_struct *w)
>  /*********************************
>  * main API
>  **********************************/
> +
> +/*
> + * Returns true if the entry was successfully
> + * stored in the xarray, and false otherwise.
> + */
> +static bool zswap_store_entry(struct xarray *tree,
> +                             struct zswap_entry *entry)


I think zswap_tree_store() is a more descriptive name.

>
> +{
> +       struct zswap_entry *old;
> +       pgoff_t offset = swp_offset(entry->swpentry);


Reverse xmas tree where possible please (longest to shortest declarations).
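
i.e., something like (illustrative):

	pgoff_t offset = swp_offset(entry->swpentry);
	struct zswap_entry *old;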

>
> +
> +       old = xa_store(tree, offset, entry, GFP_KERNEL);
> +

No need for the blank line here.

> +       if (xa_is_err(old)) {
> +               int err = xa_err(old);
> +
> +               WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> +               zswap_reject_alloc_fail++;
> +               return false;
> +       }
> +
> +       /*
> +        * We may have had an existing entry that became stale when
> +        * the folio was redirtied and now the new version is being
> +        * swapped out. Get rid of the old.
> +        */
> +       if (old)
> +               zswap_entry_free(old);
> +
> +       return true;
> +}
> +
>  bool zswap_store(struct folio *folio)
>  {
>         swp_entry_t swp = folio->swap;
>         pgoff_t offset = swp_offset(swp);
>         struct xarray *tree = swap_zswap_tree(swp);
> -       struct zswap_entry *entry, *old;
> +       struct zswap_entry *entry;
>         struct obj_cgroup *objcg = NULL;
>         struct mem_cgroup *memcg = NULL;
>
> @@ -1465,22 +1496,8 @@ bool zswap_store(struct folio *folio)
>         entry->objcg = objcg;
>         entry->referenced = true;
>
> -       old = xa_store(tree, offset, entry, GFP_KERNEL);
> -       if (xa_is_err(old)) {
> -               int err = xa_err(old);
> -
> -               WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
> -               zswap_reject_alloc_fail++;
> +       if (!zswap_store_entry(tree, entry))
>                 goto store_failed;
> -       }
> -
> -       /*
> -        * We may have had an existing entry that became stale when
> -        * the folio was redirtied and now the new version is being
> -        * swapped out. Get rid of the old.
> -        */
> -       if (old)
> -               zswap_entry_free(old);
>
>         if (objcg) {
>                 obj_cgroup_charge_zswap(objcg, entry->length);
> --
> 2.27.0
>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors.
  2024-09-24  1:17 ` [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors Kanchana P Sridhar
  2024-09-24 17:25   ` Nhat Pham
@ 2024-09-24 19:20   ` Yosry Ahmed
  2024-09-24 22:32     ` Sridhar, Kanchana P
  1 sibling, 1 reply; 79+ messages in thread
From: Yosry Ahmed @ 2024-09-24 19:20 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
	akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal

On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Added a new procedure zswap_delete_stored_offsets() that can be
> called to delete stored offsets in a folio in case zswap_store()
> fails or zswap is disabled.

I don't see the value in this helper. It will get called in one place
AFAICT, and it is a bit inconsistent that we have to explicitly loop
in zswap_store() to store pages, but the loop to delete pages upon
failure is hidden in the helper.

I am not against adding a trivial zswap_tree_delete() helper (or
similar) that calls xa_erase() and  zswap_entry_free() to match
zswap_tree_store() if you prefer that.
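
A minimal sketch of what I mean, assuming the zswap_tree_store() naming
suggested earlier:

	static void zswap_tree_delete(struct xarray *tree, pgoff_t offset)
	{
		struct zswap_entry *entry = xa_erase(tree, offset);

		if (entry)
			zswap_entry_free(entry);
	}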

>
> Refactored the code in zswap_store() that handles these cases,
> to call zswap_delete_stored_offsets().
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/zswap.c | 33 ++++++++++++++++++++++++++++++---
>  1 file changed, 30 insertions(+), 3 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index fd35a81b6e36..9bea948d653e 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1435,8 +1435,37 @@ static bool zswap_store_entry(struct xarray *tree,
>         return true;
>  }
>
> +/*
> + * If the zswap store fails or zswap is disabled, we must invalidate the
> + * possibly stale entries which were previously stored at the offsets
> + * corresponding to each page of the folio. Otherwise, writeback could
> + * overwrite the new data in the swapfile.
> + *
> + * This is called after the store of an offset in a large folio has failed.
> + * All zswap entries in the folio must be deleted. This helps make sure
> + * that a swapped-out mTHP is either entirely stored in zswap, or entirely
> + * not stored in zswap.
> + *
> + * This is also called if zswap_store() is invoked, but zswap is not enabled.
> + * All offsets for the folio are deleted from zswap in this case.
> + */
> +static void zswap_delete_stored_offsets(struct xarray *tree,
> +                                       pgoff_t offset,
> +                                       long nr_pages)
> +{
> +       struct zswap_entry *entry;
> +       long i;
> +
> +       for (i = 0; i < nr_pages; ++i) {
> +               entry = xa_erase(tree, offset + i);
> +               if (entry)
> +                       zswap_entry_free(entry);
> +       }
> +}
> +
>  bool zswap_store(struct folio *folio)
>  {
> +       long nr_pages = folio_nr_pages(folio);
>         swp_entry_t swp = folio->swap;
>         pgoff_t offset = swp_offset(swp);
>         struct xarray *tree = swap_zswap_tree(swp);
> @@ -1541,9 +1570,7 @@ bool zswap_store(struct folio *folio)
>          * possibly stale entry which was previously stored at this offset.
>          * Otherwise, writeback could overwrite the new data in the swapfile.
>          */
> -       entry = xa_erase(tree, offset);
> -       if (entry)
> -               zswap_entry_free(entry);
> +       zswap_delete_stored_offsets(tree, offset, nr_pages);
>         return false;
>  }
>
> --
> 2.27.0
>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio.
  2024-09-24  1:17 ` [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio Kanchana P Sridhar
@ 2024-09-24 19:28   ` Yosry Ahmed
  2024-09-24 22:45     ` Sridhar, Kanchana P
  0 siblings, 1 reply; 79+ messages in thread
From: Yosry Ahmed @ 2024-09-24 19:28 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
	akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal

On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> For zswap_store() to handle mTHP folios, we need to iterate through each
> page in the mTHP, compress it and store it in the zswap pool. This patch
> introduces an auxiliary function zswap_store_page() that provides this
> functionality.
>
> The function signature reflects the design intent, namely, for it
> to be invoked by zswap_store() per-page in an mTHP. Hence, the folio's
> objcg and the zswap_pool to use are input parameters for sake of
> efficiency and consistency.
>
> The functionality in zswap_store_page() is reused and adapted from
> Ryan Roberts' RFC patch [1]:
>
>   "[RFC,v1] mm: zswap: Store large folios without splitting"
>
>   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Co-developed-by: Ryan Roberts
> Signed-off-by:
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/zswap.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 88 insertions(+)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 9bea948d653e..8f2e0ab34c84 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1463,6 +1463,94 @@ static void zswap_delete_stored_offsets(struct xarray *tree,
>         }
>  }
>
> +/*
> + * Stores the page at specified "index" in a folio.
> + *
> + * @folio: The folio to store in zswap.
> + * @index: Index into the page in the folio that this function will store.
> + * @objcg: The folio's objcg.
> + * @pool:  The zswap_pool to store the compressed data for the page.
> + */
> +static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
> +                                           struct obj_cgroup *objcg,
> +                                           struct zswap_pool *pool)

Why are we adding an unused function that duplicates code in
zswap_store(), then using it in the following patch? This makes it
difficult to see that the function does the same thing. This patch
should be refactoring the per-page code out of zswap_store() into
zswap_store_page(), and directly calling zswap_store_page() from
zswap_store().
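
e.g., the per-page loop in zswap_store() could then be as simple as
(sketch; label name hypothetical):

	for (index = 0; index < nr_pages; ++index) {
		if (!zswap_store_page(folio, index, objcg, pool))
			goto store_failed;
	}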

> +{
> +       swp_entry_t swp = folio->swap;
> +       int type = swp_type(swp);
> +       pgoff_t offset = swp_offset(swp) + index;
> +       struct page *page = folio_page(folio, index);
> +       struct xarray *tree = swap_zswap_tree(swp);
> +       struct zswap_entry *entry;
> +
> +       if (objcg)
> +               obj_cgroup_get(objcg);
> +
> +       if (zswap_check_limits())
> +               goto reject;
> +
> +       /* allocate entry */
> +       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> +       if (!entry) {
> +               zswap_reject_kmemcache_fail++;
> +               goto reject;
> +       }
> +
> +       /* if entry is successfully added, it keeps the reference */
> +       if (!zswap_pool_get(pool))
> +               goto freepage;

I think we can batch this for all pages in zswap_store(), maybe first
add zswap_pool_get_many().

I am also wondering if it would be better to batch the limit checking
and allocating the entries, to front load any failures before we start
compression. Not sure if that's overall better though.

To batch allocate entries we will have to also allocate an array to
hold them. To batch the limit checking we will have to either allow
going further over limit for mTHPs, or check if there is enough
clearance to allow for compressing all the pages. Using the
uncompressed size will lead to false negatives though, so maybe we can
start tracking the average compression ratio for better limit
checking.

Nhat, Johannes, any thoughts here? I need someone to tell me if I am
overthinking this :)
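
For the pool references, a minimal sketch (assuming the pool's lifetime
is managed by a percpu_ref named "ref", so percpu_ref_tryget_many() can
back the batched get):

	static bool zswap_pool_get_many(struct zswap_pool *pool,
					unsigned long nr)
	{
		return percpu_ref_tryget_many(&pool->ref, nr);
	}

zswap_store() could then take nr_pages references once, instead of one
zswap_pool_get() per subpage.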

> +
> +       entry->pool = pool;
> +
> +       if (!zswap_compress(page, entry))
> +               goto put_pool;
> +
> +       entry->swpentry = swp_entry(type, offset);
> +       entry->objcg = objcg;
> +       entry->referenced = true;
> +
> +       if (!zswap_store_entry(tree, entry))
> +               goto store_failed;
> +
> +       if (objcg) {
> +               obj_cgroup_charge_zswap(objcg, entry->length);
> +               count_objcg_event(objcg, ZSWPOUT);
> +       }
> +
> +       /*
> +        * We finish initializing the entry while it's already in xarray.
> +        * This is safe because:
> +        *
> +        * 1. Concurrent stores and invalidations are excluded by folio lock.
> +        *
> +        * 2. Writeback is excluded by the entry not being on the LRU yet.
> +        *    The publishing order matters to prevent writeback from seeing
> +        *    an incoherent entry.
> +        */
> +       if (entry->length) {
> +               INIT_LIST_HEAD(&entry->lru);
> +               zswap_lru_add(&zswap_list_lru, entry);
> +       }
> +
> +       /* update stats */
> +       atomic_inc(&zswap_stored_pages);
> +       count_vm_event(ZSWPOUT);

We should probably also batch updating the stats. It actually seems
like now we don't handle rolling them back upon failure.
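
e.g., something like this, done once after all subpages have been
stored (sketch; count_vm_events() takes a delta):

	atomic_add(nr_pages, &zswap_stored_pages);
	count_vm_events(ZSWPOUT, nr_pages);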


> +
> +       return true;
> +
> +store_failed:
> +       zpool_free(entry->pool->zpool, entry->handle);
> +put_pool:
> +       zswap_pool_put(pool);
> +freepage:
> +       zswap_entry_cache_free(entry);
> +reject:
> +       obj_cgroup_put(objcg);
> +       if (zswap_pool_reached_full)
> +               queue_work(shrink_wq, &zswap_shrink_work);
> +
> +       return false;
> +}
> +
>  bool zswap_store(struct folio *folio)
>  {
>         long nr_pages = folio_nr_pages(folio);
> --
> 2.27.0
>


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
  2024-09-24  1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
                   ` (7 preceding siblings ...)
  2024-09-24  1:17 ` [PATCH v7 8/8] mm: Document the newly added mTHP zswpout stats, clarify swpout semantics Kanchana P Sridhar
@ 2024-09-24 19:34 ` Yosry Ahmed
  2024-09-24 22:50   ` Sridhar, Kanchana P
  2024-09-25  6:35 ` Huang, Ying
  9 siblings, 1 reply; 79+ messages in thread
From: Yosry Ahmed @ 2024-09-24 19:34 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
	akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal

On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi All,
>
> This patch-series enables zswap_store() to accept and store mTHP
> folios. The most significant contribution in this series is from the
> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> migrated to mm-unstable as of 9-23-2024 in patches 5,6 of this series.
>
> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
>      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Additionally, there is an attempt to modularize some of the functionality
> in zswap_store(), to make it more amenable to supporting any-order
> mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
> in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> delete all offsets corresponding to a higher order folio stored in zswap.

These are implementation details that are not very useful here, you
can just mention that the first few patches do refactoring prep work.

>
> For accounting purposes, the patch-series adds per-order mTHP sysfs
> "zswpout" counters that get incremented upon successful zswap_store of
> an mTHP folio:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
>
> A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> will enable/disable zswap storing of (m)THP. When disabled, zswap will
> fallback to rejecting the mTHP folio, to be processed by the backing
> swap device.

Why is this needed? Do we just not have enough confidence in the
feature yet, or are there some cases that regress from enabling mTHP
for zswapout?

Does generic mTHP swapout/swapin also use config options?

>
> This patch-series is a pre-requisite for ZSWAP compress batching of mTHP
> swap-out and decompress batching of swap-ins based on swapin_readahead(),
> using Intel IAA hardware acceleration, which we would like to submit in
> subsequent patch-series, with performance improvement data.
>
> Thanks to Ying Huang for pre-posting review feedback and suggestions!
>
> Thanks also to Nhat, Yosry, Barry, Chengming, Usama and Ying for their
> helpful feedback, data reviews and suggestions!
>
> Co-development signoff request:
> ===============================
> I would like to request Ryan Roberts' co-developer signoff on patches
> 5 and 6 in this series. Thanks Ryan!
>
> Changes since v6:
> =================

Please put the changelog at the very end, I almost missed the
performance evaluation.

> 1) Rebased to mm-unstable as of 9-23-2024,
>    commit acfabf7e197f7a5bedf4749dac1f39551417b049.
> 2) Refactored into smaller commits, as suggested by Yosry and
>    Chengming. Thanks both!
> 3) Reworded the commit log for patches 5 and 6 as per Yosry's
>    suggestion. Thanks Yosry!
> 4) Gathered data on a Sapphire Rapids server that has 823GiB SSD swap disk
>    partition. Also, all experiments are run with usemem --sleep 10, so that
>    the memory allocated by the 70 processes remains in memory
>    longer. Posted elapsed and sys times. Thanks to Yosry, Nhat and Ying for
>    their help with refining the performance characterization methodology.
> 5) Updated Documentation/admin-guide/mm/transhuge.rst as suggested by
>    Nhat. Thanks Nhat!
>
> Changes since v5:
> =================
> 1) Rebased to mm-unstable as of 8/29/2024,
>    commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
> 2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
>    enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
>    suggestion to add a knob by which users can enable/disable this
>    change. Nhat, I hope this is along the lines of what you were
>    thinking.
> 3) Added vm-scalability usemem data with 4K folios with
>    CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
>    there is no regression with this change.
> 4) Added data with usemem with 64K and 2M THP for an alternate view of
>    before/after, as suggested by Yosry, so we can understand the impact
>    of when mTHPs are split into 4K folios in shrink_folio_list()
>    (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
>    in zswap. Thanks Yosry for this suggestion.
>
> Changes since v4:
> =================
> 1) Published before/after data with zstd, as suggested by Nhat (Thanks
>    Nhat for the data reviews!).
> 2) Rebased to mm-unstable from 8/27/2024,
>    commit b659edec079c90012cf8d05624e312d1062b8b87.
> 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
>    CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
>    robot; as per Nhat's and Michal's suggestion to not require a separate
>    patch to fix the build errors (thanks both!).
> 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
>    suggested by Yosry (Thanks Yosry!).
> 5) Squashed the commits that define new mthp zswpout stat counters, and
>    invoke count_mthp_stat() after successful zswap_store()s; into a single
>    commit. Thanks Yosry for this suggestion!
>
> Changes since v3:
> =================
> 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
>    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
>    changes to count_mthp_stat() so that it's always defined, even when THP
>    is disabled. Barry, I have also made one other change in page_io.c
>    where count_mthp_stat() is called by count_swpout_vm_event(). I would
>    appreciate it if you can review this. Thanks!
>    Hopefully this should resolve the kernel robot build errors.
>
> Changes since v2:
> =================
> 1) Gathered usemem data using SSD as the backing swap device for zswap,
>    as suggested by Ying Huang. Ying, I would appreciate it if you can
>    review the latest data. Thanks!
> 2) Generated the base commit info in the patches to attempt to address
>    the kernel test robot build errors.
> 3) No code changes to the individual patches themselves.
>
> Changes since RFC v1:
> =====================
>
> 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
>    Thanks Barry!
> 2) Addressed some of the code review comments that Nhat Pham provided in
>    Ryan's initial RFC [1]:
>    - Added a comment about the cgroup zswap limit checks occurring once per
>      folio at the beginning of zswap_store().
>      Nhat, Ryan, please do let me know if the comments convey the summary
>      from the RFC discussion. Thanks!
>    - Posted data on running the cgroup suite's zswap kselftest.
> 3) Rebased to v6.11-rc3.
> 4) Gathered performance data with usemem and the rebased patch-series.
>
>
> Regression Testing:
> ===================
> I ran vm-scalability usemem 70 processes without mTHP, i.e., only 4K
> folios with mm-unstable and with this patch-series. The main goal was
> to make sure that there is no functional or performance regression
> wrt the earlier zswap behavior for 4K folios when
> CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set and zswap_store() of 4K
> pages goes through the newly added code path [zswap_store(),
> zswap_store_page()].
>
> The data indicates there is no regression.
>
>  ------------------------------------------------------------------------------
>                      mm-unstable 8-28-2024                        zswap-mTHP v6
>                                               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON
>                                                                      is not set
>  ------------------------------------------------------------------------------
>  ZSWAP compressor        zstd     deflate-                     zstd    deflate-
>                                        iaa                                  iaa
>  ------------------------------------------------------------------------------
>  Throughput (KB/s)    110,775      113,010               111,550        121,937
>  sys time (sec)      1,141.72       954.87              1,131.95         828.47
>  memcg_high           140,500      153,737               139,772        134,129
>  memcg_swap_high            0            0                     0              0
>  memcg_swap_fail            0            0                     0              0
>  pswpin                     0            0                     0              0
>  pswpout                    0            0                     0              0
>  zswpin                   675          690                   682            684
>  zswpout            9,552,298   10,603,271             9,566,392      9,267,213
>  thp_swpout                 0            0                     0              0
>  thp_swpout_                0            0                     0              0
>   fallback
>  pgmajfault             3,453        3,468                 3,841          3,487
>  ZSWPOUT-64kB-mTHP        n/a          n/a                     0              0
>  SWPOUT-64kB-mTHP           0            0                     0              0
>  ------------------------------------------------------------------------------

It's probably better to put the zstd columns next to each other, and
the deflate-iaa columns next to each other, for easier visual
comparisons.

>
>
> Performance Testing:
> ====================
> Testing of this patch-series was done with mm-unstable as of 9-23-2024,
> commit acfabf7e197f7a5bedf4749dac1f39551417b049. Data was gathered
> without/with this patch-series, on an Intel Sapphire Rapids server,
> dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB RAM and
> 823G SSD disk partition swap. Core frequency was fixed at 2500MHz.
>
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 40G. There is no swap limit set for the cgroup. Following a
> similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
> series [2], 70 usemem processes were run, each allocating and writing 1G of
> memory, and sleeping for 10 sec before exiting:
>
>     usemem --init-time -w -O -s 10 -n 70 1g
>
> The vm/sysfs mTHP stats included with the performance data provide details
> on the swapout activity to ZSWAP/swap.
>
> Other kernel configuration parameters:
>
>     ZSWAP Compressors : zstd, deflate-iaa
>     ZSWAP Allocator   : zsmalloc
>     SWAP page-cluster : 2
>
> In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> IAA "compression verification" is enabled. Hence each IAA compression
> will be decompressed internally by the "iaa_crypto" driver, the crc-s
> returned by the hardware will be compared and errors reported in case of
> mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> compared to the software compressors.
>
> Throughput is derived by averaging the individual 70 processes' throughputs
> reported by usemem. Elapsed/sys times are measured with perf. All data
> points per compressor/kernel/mTHP configuration are averaged across 3 runs.
>
> Case 1: Comparing zswap 4K vs. zswap mTHP
> =========================================
>
> In this scenario, the "before" is CONFIG_THP_SWAP set to off, which results
> in 64K/2M (m)THP being split into 4K folios that get processed by zswap.
>
> The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that results
> in 64K/2M (m)THP to not be split, and processed by zswap.
>
>  64KB mTHP (cgroup memory.high set to 40G):
>  ==========================================
>
>  -------------------------------------------------------------------------------
>                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
>                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
>                                  Baseline
>  -------------------------------------------------------------------------------
>  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>                                       iaa                     iaa            iaa
>  -------------------------------------------------------------------------------
>  Throughput (KB/s)   143,323      125,485     153,550     129,609    7%       3%
>  elapsed time (sec)    24.97        25.42       23.90       25.19    4%       1%
>  sys time (sec)       822.72       750.96      757.70      731.13    8%       3%
>  memcg_high          132,743      169,825     148,075     192,744
>  memcg_swap_fail     639,067      841,553       2,204       2,215
>  pswpin                    0            0           0           0
>  pswpout                   0            0           0           0
>  zswpin                  795          873         760         902
>  zswpout          10,011,266   13,195,137  10,010,017  13,193,554
>  thp_swpout                0            0           0           0
>  thp_swpout_               0            0           0           0
>   fallback
>  64kB-mthp_          639,065      841,553       2,204       2,215
>   swpout_fallback
>  pgmajfault            2,861        2,924       3,054       3,259
>  ZSWPOUT-64kB            n/a          n/a     623,451     822,268
>  SWPOUT-64kB               0            0           0           0
>  -------------------------------------------------------------------------------
>
>
>  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
>  =======================================================
>
>  -------------------------------------------------------------------------------
>                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
>                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
>                                  Baseline
>  -------------------------------------------------------------------------------
>  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>                                       iaa                     iaa            iaa
>  -------------------------------------------------------------------------------
>  Throughput (KB/s)   145,616      139,640     169,404     141,168   16%       1%
>  elapsed time (sec)    25.05        23.85       23.02       23.37    8%       2%
>  sys time (sec)       790.53       676.34      613.26      677.83   22%    -0.2%
>  memcg_high           16,702       25,197      17,374      23,890
>  memcg_swap_fail      21,485       27,814         114         144
>  pswpin                    0            0           0           0
>  pswpout                   0            0           0           0
>  zswpin                  793          852         778         922
>  zswpout          10,011,709   13,186,882  10,010,893  13,195,600
>  thp_swpout                0            0           0           0
>  thp_swpout_          21,485       27,814         114         144
>   fallback
>  2048kB-mthp_            n/a          n/a           0           0
>   swpout_fallback
>  pgmajfault            2,701        2,822       4,151       5,066
>  ZSWPOUT-2048kB          n/a          n/a      19,442      25,615
>  SWPOUT-2048kB             0            0           0           0
>  -------------------------------------------------------------------------------
>
> We mostly see improvements in throughput, elapsed and sys time for zstd and
> deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y).
>
>
> Case 2: Comparing SSD swap mTHP vs. zswap mTHP
> ==============================================
>
> In this scenario, CONFIG_THP_SWAP is enabled in "before" and "after"
> experiments. The "before" represents zswap rejecting mTHP, and the mTHP
> being stored by the 823G SSD swap. The "after" represents data with this
> patch-series, that results in 64K/2M (m)THP being processed and stored by
> zswap.
>
>  64KB mTHP (cgroup memory.high set to 40G):
>  ==========================================
>
>  -------------------------------------------------------------------------------
>                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
>                         CONFIG_THP_SWAP=Y       CONFIG_THP_SWAP=Y       Baseline
>                                  Baseline
>  -------------------------------------------------------------------------------
>  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>                                       iaa                     iaa            iaa
>  -------------------------------------------------------------------------------
>  Throughput (KB/s)    20,265       20,696     153,550     129,609   658%    526%
>  elapsed time (sec)    72.44        70.86       23.90       25.19    67%     64%
>  sys time (sec)        77.95        77.99      757.70      731.13  -872%   -837%
>  memcg_high          115,811      113,277     148,075     192,744
>  memcg_swap_fail       2,386        2,425       2,204       2,215
>  pswpin                   16           16           0           0
>  pswpout           7,774,235    7,616,069           0           0
>  zswpin                  728          749         760         902
>  zswpout              38,424       39,022  10,010,017  13,193,554
>  thp_swpout                0            0           0           0
>  thp_swpout_               0            0           0           0
>   fallback
>  64kB-mthp_            2,386        2,425       2,204       2,215
>   swpout_fallback
>  pgmajfault            2,757        2,860       3,054       3,259
>  ZSWPOUT-64kB            n/a          n/a     623,451     822,268
>  SWPOUT-64kB         485,890      476,004           0           0
>  -------------------------------------------------------------------------------
>
>
>  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
>  =======================================================
>
>  -------------------------------------------------------------------------------
>                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
>                         CONFIG_THP_SWAP=Y       CONFIG_THP_SWAP=Y       Baseline
>                                  Baseline
>  -------------------------------------------------------------------------------
>  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>                                       iaa                     iaa            iaa
>  -------------------------------------------------------------------------------
>  Throughput (KB/s)    24,347       35,971     169,404     141,168    596%   292%
>  elapsed time (sec)    63.52        64.59       23.02       23.37     64%    64%
>  sys time (sec)        27.91        27.01      613.26      677.83  -2098% -2410%
>  memcg_high           13,576       13,467      17,374      23,890
>  memcg_swap_fail         162          124         114         144
>  pswpin                    0            0           0           0
>  pswpout           7,003,307    7,168,853           0           0
>  zswpin                  741          722         778         922
>  zswpout              84,429       65,315  10,010,893  13,195,600
>  thp_swpout           13,678       14,002           0           0
>  thp_swpout_             162          124         114         144
>   fallback
>  2048kB-mthp_            n/a          n/a           0           0
>   swpout_fallback
>  pgmajfault            3,345        2,903       4,151       5,066
>  ZSWPOUT-2048kB          n/a          n/a      19,442      25,615
>  SWPOUT-2048kB        13,678       14,002           0           0
>  -------------------------------------------------------------------------------
>
> We see significant improvements in throughput and elapsed time for zstd and
> deflate-iaa, when comparing before (mTHP-SSD) vs. after (mTHP-ZSWAP). The
> sys time increases with mTHP-ZSWAP as expected, since compression runs
> synchronously on the CPU while SSD writes complete asynchronously, as
> pointed out by Ying and Yosry.
>
> In the "Before" scenario, when zswap does not store mTHP, only allocations
> count towards the cgroup memory limit. However, in the "After" scenario,
> with the introduction of zswap_store() mTHP, both, allocations as well as
> the zswap compressed pool usage from all 70 processes are counted towards
> the memory limit. As a result, we see higher swapout activity in the
> "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> charge leads to more frequent memory.high breaches.
>
> Summary:
> ========
> The v7 data presented above comparing zswap-mTHP with a conventional 823G
> SSD swap demonstrates good performance improvements with zswap-mTHP. Hence,
> it seems reasonable for zswap_store to support (m)THP, so that further
> performance improvements can be implemented.
>
> Some of the ideas that have shown promise in our experiments are:
>
> 1) IAA compress/decompress batching.
> 2) Distributing compress jobs across all IAA devices on the socket.
>
> In the experimental setup used in this patchset, we have enabled
> IAA compress verification to ensure additional hardware data integrity CRC
> checks not currently done by the software compressors. The tests run for
> this patchset also use only 1 IAA device per core, which provides 2
> compress engines on the device. In our experiments with IAA batching, we
> distribute compress jobs from all cores to the 8 compress engines available
> per socket. We further compress the pages in each mTHP in parallel in the
> accelerator. As a result, we improve compress latency and reclaim
> throughput.
>
> The following compares the same usemem workload characteristics between:
>
> 1) zstd (v7 experiments)
> 2) deflate-iaa "Fixed mode" (v7 experiments)
> 3) deflate-iaa with batching
> 4) deflate-iaa-canned "Canned mode" [3] with batching
>
> vm.page-cluster is set to "2" for all runs.
>
> 64K mTHP ZSWAP:
> ===============
>
>  -------------------------------------------------------------------------------
>  ZSWAP            zstd   IAA Fixed   IAA Fixed  IAA Canned     IAA    IAA    IAA
>  compressor       (v7)        (v7)  + Batching  + Batching   Batch Canned Canned
>                                                                vs.    vs.  Batch
>  64K mTHP                                                    Seqtl  Fixed    vs.
>                                                                             ZSTD
>  -------------------------------------------------------------------------------
>  Throughput    153,550     129,609     156,215     166,975   21%     7%       9%
>      (KB/s)
>  elapsed time    23.90       25.19       22.46       21.38   11%     5%      11%
>         (sec)
>  sys time       757.70      731.13      715.62      648.83    2%     9%      14%
>     (sec)
>  memcg_high    148,075     192,744     197,548     181,734
>  memcg_swap_     2,204       2,215       2,293       2,263
>   fail
>  pswpin              0           0           0           0
>  pswpout             0           0           0           0
>  zswpin            760         902         774         833
>  zswpout    10,010,017  13,193,554  13,193,176  12,125,616
>  thp_swpout          0           0           0           0
>  thp_swpout_         0           0           0           0
>   fallback
>  64kB-mthp_      2,204       2,215       2,293       2,263
>   swpout_
>   fallback
>  pgmajfault      3,054       3,259       3,545       3,516
>  ZSWPOUT-64kB  623,451     822,268     822,176     755,480
>  SWPOUT-64kB         0           0           0           0
>  swap_ra           146         161         152         159
>  swap_ra_hit        64         121          68          88
>  -------------------------------------------------------------------------------
>
>
> 2M THP ZSWAP:
> =============
>
>  -------------------------------------------------------------------------------
>  ZSWAP            zstd   IAA Fixed   IAA Fixed  IAA Canned     IAA    IAA    IAA
>  compressor       (v7)        (v7)  + Batching  + Batching   Batch Canned Canned
>                                                                vs.    vs.  Batch
>  2M THP                                                      Seqtl  Fixed    vs.
>                                                                             ZSTD
>  -------------------------------------------------------------------------------
>  Throughput    169,404     141,168     175,089     193,407     24%    10%    14%
>      (KB/s)
>  elapsed time    23.02       23.37       21.13       19.97     10%     5%    13%
>         (sec)
>  sys time       613.26      677.83      630.51      533.80      7%    15%    13%
>     (sec)
>  memcg_high     17,374      23,890      24,349      22,374
>  memcg_swap_       114         144         102          88
>   fail
>  pswpin              0           0           0           0
>  pswpout             0           0           0           0
>  zswpin            778         922       6,492       6,642
>  zswpout    10,010,893  13,195,600  13,199,907  12,132,265
>  thp_swpout          0           0           0           0
>  thp_swpout_       114         144         102          88
>   fallback
>  pgmajfault      4,151       5,066       5,032       4,999
>  ZSWPOUT-2MB    19,442      25,615      25,666      23,594
>  SWPOUT-2MB          0           0           0           0
>  swap_ra             3           9       4,383       4,494
>  swap_ra_hit         2           6       4,298       4,412
>  -------------------------------------------------------------------------------
>
>
> With ZSWAP IAA compress/decompress batching, we are able to demonstrate
> significant performance improvements and memory savings in scalability
> experiments under memory pressure, as compared to software compressors. We
> hope to submit this work in subsequent patch series.

Honestly I would remove the detailed results of the followup series
for batching, it should be enough to mention a single figure for
further expected improvement from ongoing work that depends on this.

>
> Thanks,
> Kanchana
>
> [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> [2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
> [3] https://patchwork.kernel.org/project/linux-crypto/cover/cover.1710969449.git.andre.glover@linux.intel.com/
>
>
> Kanchana P Sridhar (8):
>   mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
>   mm: zswap: Modify zswap_compress() to accept a page instead of a
>     folio.
>   mm: zswap: Refactor code to store an entry in zswap xarray.
>   mm: zswap: Refactor code to delete stored offsets in case of errors.
>   mm: zswap: Compress and store a specific page in a folio.
>   mm: zswap: Support mTHP swapout in zswap_store().
>   mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout
>     stats.
>   mm: Document the newly added mTHP zswpout stats, clarify swpout
>     semantics.
>
>  Documentation/admin-guide/mm/transhuge.rst |   8 +-
>  include/linux/huge_mm.h                    |   1 +
>  include/linux/memcontrol.h                 |   4 +
>  mm/Kconfig                                 |   8 +
>  mm/huge_memory.c                           |   3 +
>  mm/page_io.c                               |   1 +
>  mm/zswap.c                                 | 248 ++++++++++++++++-----
>  7 files changed, 210 insertions(+), 63 deletions(-)
>
>
> base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
> --
> 2.27.0
>



* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24  1:17 ` [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store() Kanchana P Sridhar
  2024-09-24 17:33   ` Nhat Pham
@ 2024-09-24 19:38   ` Yosry Ahmed
  2024-09-24 20:51     ` Nhat Pham
                       ` (2 more replies)
  2024-09-25 14:27   ` Johannes Weiner
  2 siblings, 3 replies; 79+ messages in thread
From: Yosry Ahmed @ 2024-09-24 19:38 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
	akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal

On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> zswap_store() will now store mTHP and PMD-size THP folios by compressing
> them page by page.
>
> This patch provides a sequential implementation of storing an mTHP in
> zswap_store() by iterating through each page in the folio to compress
> and store it in the zswap zpool.
>
> Towards this goal, zswap_compress() is modified to take a page instead
> of a folio as input.
>
> Each page's swap offset is stored as a separate zswap entry.
>
> If an error is encountered during the store of any page in the mTHP,
> all previous pages/entries stored will be invalidated. Thus, an mTHP
> is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
>
> This forms the basis for building batching of pages during zswap store
> of large folios by compressing batches of up to, say, 8 pages in an
> mTHP in parallel in hardware, with the Intel In-Memory Analytics
> Accelerator (Intel IAA).
>
> A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> will enable/disable zswap storing of (m)THP. The corresponding tunable
> zswap module parameter is "mthp_enabled".
>
> This change reuses and adapts the functionality in Ryan Roberts' RFC
> patch [1]:
>
>   "[RFC,v1] mm: zswap: Store large folios without splitting"
>
>   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Also, addressed some of the RFC comments from the discussion in [1].
>
> Co-developed-by: Ryan Roberts
> Signed-off-by:
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/Kconfig |   8 ++++
>  mm/zswap.c | 122 +++++++++++++++++++++++++----------------------------
>  2 files changed, 66 insertions(+), 64 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 09aebca1cae3..c659fb732ec4 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
>           reducing the chance that cold pages will reside in the zswap pool
>           and consume memory indefinitely.
>
> +config ZSWAP_STORE_THP_DEFAULT_ON
> +       bool "Store mTHP and THP folios in zswap"
> +       depends on ZSWAP
> +       default n
> +       help
> +         If selected, zswap will process mTHP and THP folios by
> +         compressing and storing each 4K page in the large folio.
> +
>  choice
>         prompt "Default compressor"
>         depends on ZSWAP
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 8f2e0ab34c84..16ab770546d6 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
>                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
>  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
>
> +/*
> + * Enable/disable zswap processing of mTHP folios.
> + * For now, only zswap_store will process mTHP folios.
> + */
> +static bool zswap_mthp_enabled = IS_ENABLED(
> +               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
> +module_param_named(mthp_enabled, zswap_mthp_enabled, bool, 0644);
> +
>  bool zswap_is_enabled(void)
>  {
>         return zswap_enabled;
> @@ -1471,9 +1479,9 @@ static void zswap_delete_stored_offsets(struct xarray *tree,
>   * @objcg: The folio's objcg.
>   * @pool:  The zswap_pool to store the compressed data for the page.
>   */
> -static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
> -                                           struct obj_cgroup *objcg,
> -                                           struct zswap_pool *pool)
> +static bool zswap_store_page(struct folio *folio, long index,
> +                            struct obj_cgroup *objcg,
> +                            struct zswap_pool *pool)

As I mentioned earlier, the patch that introduced zswap_store_page()
should have directly used it in zswap_store(). This would make this
patch much clearer.

>  {
>         swp_entry_t swp = folio->swap;
>         int type = swp_type(swp);
> @@ -1551,51 +1559,63 @@ static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
>         return false;
>  }
>
> +/*
> + * Modified to store mTHP folios. Each page in the mTHP will be compressed
> + * and stored sequentially.
> + */
>  bool zswap_store(struct folio *folio)
>  {
>         long nr_pages = folio_nr_pages(folio);
>         swp_entry_t swp = folio->swap;
>         pgoff_t offset = swp_offset(swp);
>         struct xarray *tree = swap_zswap_tree(swp);
> -       struct zswap_entry *entry;
>         struct obj_cgroup *objcg = NULL;
>         struct mem_cgroup *memcg = NULL;
> +       struct zswap_pool *pool;
> +       bool ret = false;
> +       long index;
>
>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
>
> -       /* Large folios aren't supported */
> -       if (folio_test_large(folio))
> +       /* Storing large folios isn't enabled */

The comment is now stating the obvious, please remove it.

> +       if (!zswap_mthp_enabled && folio_test_large(folio))
>                 return false;
>
>         if (!zswap_enabled)
> -               goto check_old;
> +               goto reject;
>
> -       /* Check cgroup limits */
> +       /*
> +        * Check cgroup limits:
> +        *
> +        * The cgroup zswap limit check is done once at the beginning of an
> +        * mTHP store, and not within zswap_store_page() for each page
> +        * in the mTHP. We do however check the zswap pool limits at the
> +        * start of zswap_store_page(). What this means is, the cgroup
> +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> +        * However, the per-store-page zswap pool limits check should
> +        * hopefully trigger the cgroup aware and zswap LRU aware global
> +        * reclaim implemented in the shrinker. If this assumption holds,
> +        * the cgroup exceeding the zswap limits could potentially be
> +        * resolved before the next zswap_store, and if it is not, the next
> +        * zswap_store would fail the cgroup zswap limit check at the start.
> +        */

I do not really like this. Allowing going one page above the limit is
one thing, but one THP above the limit seems too much. I also don't
like relying on the repeated limit checking in zswap_store_page(); if
anything, I think that should be batched too.

Is it too unreasonable to maintain the average compression ratio and
use that to estimate limit checking for both memcg and global limits?
Johannes, Nhat, any thoughts on this?
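
For illustration, a minimal sketch of that idea (the counters and the
size-aware zswap_check_limits_for() helper are hypothetical, not
existing zswap code):

	/* Running totals from completed stores, for the average ratio. */
	static atomic64_t zswap_uncompressed_bytes;
	static atomic64_t zswap_compressed_bytes;

	static bool zswap_may_fit(long nr_pages)
	{
		u64 uncomp = atomic64_read(&zswap_uncompressed_bytes);
		u64 comp = atomic64_read(&zswap_compressed_bytes);
		u64 estimate = (u64)nr_pages * PAGE_SIZE;

		/* Scale the incoming pages by the observed average ratio. */
		if (uncomp)
			estimate = div64_u64(estimate * comp, uncomp);

		/* Hypothetical variant of zswap_check_limits() taking a size. */
		return zswap_check_limits_for(estimate);
	}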

>         objcg = get_obj_cgroup_from_folio(folio);
>         if (objcg && !obj_cgroup_may_zswap(objcg)) {
>                 memcg = get_mem_cgroup_from_objcg(objcg);
>                 if (shrink_memcg(memcg)) {
>                         mem_cgroup_put(memcg);
> -                       goto reject;
> +                       goto put_objcg;
>                 }
>                 mem_cgroup_put(memcg);
>         }
>
>         if (zswap_check_limits())
> -               goto reject;
> -
> -       /* allocate entry */
> -       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> -       if (!entry) {
> -               zswap_reject_kmemcache_fail++;
> -               goto reject;
> -       }
> +               goto put_objcg;
>
> -       /* if entry is successfully added, it keeps the reference */
> -       entry->pool = zswap_pool_current_get();
> -       if (!entry->pool)
> -               goto freepage;
> +       pool = zswap_pool_current_get();
> +       if (!pool)
> +               goto put_objcg;
>
>         if (objcg) {
>                 memcg = get_mem_cgroup_from_objcg(objcg);
> @@ -1606,60 +1626,34 @@ bool zswap_store(struct folio *folio)
>                 mem_cgroup_put(memcg);
>         }
>
> -       if (!zswap_compress(&folio->page, entry))
> -               goto put_pool;
> -
> -       entry->swpentry = swp;
> -       entry->objcg = objcg;
> -       entry->referenced = true;
> -
> -       if (!zswap_store_entry(tree, entry))
> -               goto store_failed;
> -
> -       if (objcg) {
> -               obj_cgroup_charge_zswap(objcg, entry->length);
> -               count_objcg_event(objcg, ZSWPOUT);
> -       }
> -
>         /*
> -        * We finish initializing the entry while it's already in xarray.
> -        * This is safe because:
> -        *
> -        * 1. Concurrent stores and invalidations are excluded by folio lock.
> -        *
> -        * 2. Writeback is excluded by the entry not being on the LRU yet.
> -        *    The publishing order matters to prevent writeback from seeing
> -        *    an incoherent entry.
> +        * Store each page of the folio as a separate entry. If we fail to store
> +        * a page, unwind by removing all the previous pages we stored.
>          */
> -       if (entry->length) {
> -               INIT_LIST_HEAD(&entry->lru);
> -               zswap_lru_add(&zswap_list_lru, entry);
> +       for (index = 0; index < nr_pages; ++index) {
> +               if (!zswap_store_page(folio, index, objcg, pool))
> +                       goto put_pool;
>         }
>
> -       /* update stats */
> -       atomic_inc(&zswap_stored_pages);
> -       count_vm_event(ZSWPOUT);
> -
> -       return true;
> +       ret = true;
>
> -store_failed:
> -       zpool_free(entry->pool->zpool, entry->handle);
>  put_pool:
> -       zswap_pool_put(entry->pool);
> -freepage:
> -       zswap_entry_cache_free(entry);
> -reject:
> +       zswap_pool_put(pool);
> +put_objcg:
>         obj_cgroup_put(objcg);
>         if (zswap_pool_reached_full)
>                 queue_work(shrink_wq, &zswap_shrink_work);
> -check_old:
> +reject:
>         /*
> -        * If the zswap store fails or zswap is disabled, we must invalidate the
> -        * possibly stale entry which was previously stored at this offset.
> -        * Otherwise, writeback could overwrite the new data in the swapfile.
> +        * If the zswap store fails or zswap is disabled, we must invalidate
> +        * the possibly stale entries which were previously stored at the
> +        * offsets corresponding to each page of the folio. Otherwise,
> +        * writeback could overwrite the new data in the swapfile.
>          */
> -       zswap_delete_stored_offsets(tree, offset, nr_pages);
> -       return false;
> +       if (!ret)
> +               zswap_delete_stored_offsets(tree, offset, nr_pages);
> +
> +       return ret;
>  }
>
>  bool zswap_load(struct folio *folio)
> --
> 2.27.0
>



* RE: [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in zswap xarray.
  2024-09-24 17:16   ` Nhat Pham
@ 2024-09-24 20:40     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-24 20:40 UTC (permalink / raw)
  To: Nhat Pham
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P


> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, September 24, 2024 10:17 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in
> zswap xarray.
> 
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Added a new procedure zswap_store_entry() that refactors the code
> > currently in zswap_store() to store an entry in the zswap xarray.
> > This will allow us to call this procedure to store the swap offset
> > of each page of an mTHP in the xarray, as part of zswap_store()
> > supporting mTHP.
> >
> > Also, made a minor edit in the comments for 'struct zswap_entry' to delete
> > the description of the 'value' member that was deleted in commit
> > 20a5532ffa53d6ecf41ded920a7b0ff9c65a7dcf ("mm: remove code to
> handle
> > same filled pages").
> 
> nit: This probably should be a separate patch...

Sure, will delete this change in v8.

> 
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> 
> Otherwise, LGTM :)
> 
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>

Thanks Nhat!



* RE: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors.
  2024-09-24 17:25   ` Nhat Pham
@ 2024-09-24 20:41     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-24 20:41 UTC (permalink / raw)
  To: Nhat Pham
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, September 24, 2024 10:25 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored
> offsets in case of errors.
> 
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Added a new procedure zswap_delete_stored_offsets() that can be
> > called to delete stored offsets in a folio in case zswap_store()
> > fails or zswap is disabled.
> >
> > Refactored the code in zswap_store() that handles these cases,
> > to call zswap_delete_stored_offsets().
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/zswap.c | 33 ++++++++++++++++++++++++++++++---
> >  1 file changed, 30 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index fd35a81b6e36..9bea948d653e 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1435,8 +1435,37 @@ static bool zswap_store_entry(struct xarray
> *tree,
> >         return true;
> >  }
> >
> > +/*
> > + * If the zswap store fails or zswap is disabled, we must invalidate the
> > + * possibly stale entries which were previously stored at the offsets
> > + * corresponding to each page of the folio. Otherwise, writeback could
> > + * overwrite the new data in the swapfile.
> > + *
> > + * This is called after the store of an offset in a large folio has failed.
> 
> "store of a subpage" rather than "stored of an offset"?

Sure, I will make this change in v8.

> 
> 
> > + * All zswap entries in the folio must be deleted. This helps make sure
> > + * that a swapped-out mTHP is either entirely stored in zswap, or entirely
> > + * not stored in zswap.
> > + *
> > + * This is also called if zswap_store() is invoked, but zswap is not enabled.
> > + * All offsets for the folio are deleted from zswap in this case.
> > + */
> > +static void zswap_delete_stored_offsets(struct xarray *tree,
> > +                                       pgoff_t offset,
> > +                                       long nr_pages)
> > +{
> > +       struct zswap_entry *entry;
> > +       long i;
> > +
> > +       for (i = 0; i < nr_pages; ++i) {
> > +               entry = xa_erase(tree, offset + i);
> > +               if (entry)
> > +                       zswap_entry_free(entry);
> > +       }
> > +}
> > +


* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24 19:38   ` Yosry Ahmed
@ 2024-09-24 20:51     ` Nhat Pham
  2024-09-24 21:38       ` Yosry Ahmed
  2024-09-24 23:21       ` Sridhar, Kanchana P
  2024-09-24 23:02     ` Sridhar, Kanchana P
  2024-09-25 13:40     ` Johannes Weiner
  2 siblings, 2 replies; 79+ messages in thread
From: Nhat Pham @ 2024-09-24 20:51 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
	vinodh.gopal

On Tue, Sep 24, 2024 at 12:39 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> > +        * The cgroup zswap limit check is done once at the beginning of an
> > +        * mTHP store, and not within zswap_store_page() for each page
> > +        * in the mTHP. We do however check the zswap pool limits at the
> > +        * start of zswap_store_page(). What this means is, the cgroup
> > +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > +        * However, the per-store-page zswap pool limits check should
> > +        * hopefully trigger the cgroup aware and zswap LRU aware global
> > +        * reclaim implemented in the shrinker. If this assumption holds,
> > +        * the cgroup exceeding the zswap limits could potentially be
> > +        * resolved before the next zswap_store, and if it is not, the next
> > +        * zswap_store would fail the cgroup zswap limit check at the start.
> > +        */
>
> I do not really like this. Allowing going one page above the limit is
> one thing, but one THP above the limit seems too much. I also don't

Hmm what if you have multiple concurrent zswap stores, from different
tasks but the same cgroup? If none of them has charged, they would all
get greenlit, and charge towards the cgroup...

So technically the zswap limit checking is already best-effort only.
But now, instead of one page per violation, it's 512 pages per
violation :)

Yeah this can be bad. I think this is only safe if you only use
zswap.max as a binary knob (0 or max)...

> like relying on the repeated limit checking in zswap_store_page(); if
> anything, I think that should be batched too.
>
> Is it too unreasonable to maintain the average compression ratio and
> use that to estimate limit checking for both memcg and global limits?
> Johannes, Nhat, any thoughts on this?

I remember asking about this, but past Nhat might have relented :)

https://lore.kernel.org/linux-mm/CAKEwX=PfAMZ2qJtwKwJsVx3TZWxV5z2ZaU1Epk1UD=DBdMsjFA@mail.gmail.com/

We can do limit checking and charging after compression is done, but
that's a lot of code change (might not even be possible)... It will,
however, allow us to do charging + checking in one go (rather than
doing it 8, 16, or 512 times)

Another thing we can do is to register a zswap writeback after the
zswap store attempt, to clean up excess capacity. Not sure what will
happen if zswap writeback is disabled for the cgroup though :)

If it's too hard, the average estimate could be a decent compromise,
until we figure something smarter.
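
As a rough sketch of that "check and charge in one go" shape
(illustrative only; entries[] is a hypothetical per-subpage array and
the unwind path is elided):

	size_t total = 0;
	long i;

	/* Compress every subpage of the folio first... */
	for (i = 0; i < nr_pages; i++) {
		if (!zswap_compress(folio_page(folio, i), entries[i]))
			goto unwind;
		total += entries[i]->length;
	}

	/* ...then do one limit check and one charge for the whole folio. */
	if (objcg) {
		if (!obj_cgroup_may_zswap(objcg))
			goto unwind;
		obj_cgroup_charge_zswap(objcg, total);
	}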



* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24 17:33   ` Nhat Pham
@ 2024-09-24 20:51     ` Sridhar, Kanchana P
  2024-09-24 21:08       ` Nhat Pham
  0 siblings, 1 reply; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-24 20:51 UTC (permalink / raw)
  To: Nhat Pham
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, September 24, 2024 10:34 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > zswap_store() will now store mTHP and PMD-size THP folios by compressing
> > them page by page.
> >
> > This patch provides a sequential implementation of storing an mTHP in
> > zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > Towards this goal, zswap_compress() is modified to take a page instead
> > of a folio as input.
> >
> > Each page's swap offset is stored as a separate zswap entry.
> >
> > If an error is encountered during the store of any page in the mTHP,
> > all previous pages/entries stored will be invalidated. Thus, an mTHP
> > is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
> >
> > This forms the basis for building batching of pages during zswap store
> > of large folios by compressing batches of up to, say, 8 pages in an
> > mTHP in parallel in hardware, with the Intel In-Memory Analytics
> > Accelerator (Intel IAA).
> >
> > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by
> default)
> > will enable/disable zswap storing of (m)THP. The corresponding tunable
> > zswap module parameter is "mthp_enabled".
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> >   "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-
> ryan.roberts@arm.com/T/#u
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/Kconfig |   8 ++++
> >  mm/zswap.c | 122 +++++++++++++++++++++++++----------------------------
> >  2 files changed, 66 insertions(+), 64 deletions(-)
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 09aebca1cae3..c659fb732ec4 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
> >           reducing the chance that cold pages will reside in the zswap pool
> >           and consume memory indefinitely.
> >
> > +config ZSWAP_STORE_THP_DEFAULT_ON
> > +       bool "Store mTHP and THP folios in zswap"
> > +       depends on ZSWAP
> > +       default n
> > +       help
> > +         If selected, zswap will process mTHP and THP folios by
> > +         compressing and storing each 4K page in the large folio.
> > +
> >  choice
> >         prompt "Default compressor"
> >         depends on ZSWAP
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 8f2e0ab34c84..16ab770546d6 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled =
> IS_ENABLED(
> >                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
> >  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool,
> 0644);
> >
> > +/*
> > + * Enable/disable zswap processing of mTHP folios.
> > + * For now, only zswap_store will process mTHP folios.
> > + */
> > +static bool zswap_mthp_enabled = IS_ENABLED(
> > +               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
> > +module_param_named(mthp_enabled, zswap_mthp_enabled, bool,
> 0644);
> > +
> 
> Hmm, so this is a runtime knob. Also, should this be zswap_thp_enabled? :)

Agreed, zswap_thp_enabled is a better name. I will make the change in v8.
More comments below as to the runtime knob.

> 
> >  bool zswap_is_enabled(void)
> >  {
> >         return zswap_enabled;
> > @@ -1471,9 +1479,9 @@ static void zswap_delete_stored_offsets(struct
> xarray *tree,
> >   * @objcg: The folio's objcg.
> >   * @pool:  The zswap_pool to store the compressed data for the page.
> >   */
> > -static bool __maybe_unused zswap_store_page(struct folio *folio, long
> index,
> > -                                           struct obj_cgroup *objcg,
> > -                                           struct zswap_pool *pool)
> > +static bool zswap_store_page(struct folio *folio, long index,
> > +                            struct obj_cgroup *objcg,
> > +                            struct zswap_pool *pool)
> >  {
> >         swp_entry_t swp = folio->swap;
> >         int type = swp_type(swp);
> > @@ -1551,51 +1559,63 @@ static bool __maybe_unused
> zswap_store_page(struct folio *folio, long index,
> >         return false;
> >  }
> >
> > +/*
> > + * Modified to store mTHP folios. Each page in the mTHP will be
> compressed
> > + * and stored sequentially.
> > + */
> >  bool zswap_store(struct folio *folio)
> >  {
> >         long nr_pages = folio_nr_pages(folio);
> >         swp_entry_t swp = folio->swap;
> >         pgoff_t offset = swp_offset(swp);
> >         struct xarray *tree = swap_zswap_tree(swp);
> > -       struct zswap_entry *entry;
> >         struct obj_cgroup *objcg = NULL;
> >         struct mem_cgroup *memcg = NULL;
> > +       struct zswap_pool *pool;
> > +       bool ret = false;
> > +       long index;
> >
> >         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> >
> > -       /* Large folios aren't supported */
> > -       if (folio_test_large(folio))
> > +       /* Storing large folios isn't enabled */
> > +       if (!zswap_mthp_enabled && folio_test_large(folio))
> >                 return false;
> 
> Hmm can this go wrong somehow? Can we have a case where we enable
> zswap_mthp_enabled, have a large folio written to zswap, disable
> zswap_mthp_enabled, and attempt to store that folio to zswap again.
> 
> Now, we have a stale copy in zswap that is not invalidated...?
> 
> Or am I missing something here :)

This is an excellent point. Thanks Nhat for catching this! I can see two
options for solving this:

Option 1: If zswap_mthp_enabled is "false", delete all stored offsets
for the mTHP in zswap before exiting. This could race with writeback
(one or more subpages could be written back before zswap_store()
acquires the tree lock); however, I don't think it will cause data
inconsistencies. Any offsets for subpages not yet written back will be
deleted from zswap, zswap_store() will return false, and the backing
swap device's subsequent swapout will overwrite the data zswap wrote
back. Could anything go wrong with this?

Option 2: Only provide a build config option,
CONFIG_ZSWAP_STORE_THP_DEFAULT_ON, that cannot be dynamically changed.

Would appreciate suggestions on these, and other potential solutions.
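
For reference, Option 1 would look roughly like this at the top of
zswap_store(), reusing this series' zswap_delete_stored_offsets()
(sketch only, untested):

	if (!zswap_mthp_enabled && folio_test_large(folio)) {
		/* Invalidate possibly stale entries for all subpage offsets. */
		zswap_delete_stored_offsets(tree, offset, nr_pages);
		return false;
	}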

Thanks,
Kanchana


* RE: [PATCH v7 8/8] mm: Document the newly added mTHP zswpout stats, clarify swpout semantics.
  2024-09-24 17:36   ` Nhat Pham
@ 2024-09-24 20:52     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-24 20:52 UTC (permalink / raw)
  To: Nhat Pham
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, September 24, 2024 10:37 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 8/8] mm: Document the newly added mTHP zswpout
> stats, clarify swpout semantics.
> 
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Added documentation for the newly added sysfs mTHP "zswpout" stats.
> >
> > Clarified that only non-ZSWAP mTHP swapouts will be accounted in the
> mTHP
> > "swpout" stats.
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  Documentation/admin-guide/mm/transhuge.rst | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/mm/transhuge.rst
> b/Documentation/admin-guide/mm/transhuge.rst
> > index cfdd16a52e39..a65f905e9ca7 100644
> > --- a/Documentation/admin-guide/mm/transhuge.rst
> > +++ b/Documentation/admin-guide/mm/transhuge.rst
> > @@ -530,10 +530,14 @@ anon_fault_fallback_charge
> >         instead falls back to using huge pages with lower orders or
> >         small pages even though the allocation was successful.
> >
> > -swpout
> > -       is incremented every time a huge page is swapped out in one
> > +zswpout
> > +       is incremented every time a huge page is swapped out to ZSWAP in
> one
> >         piece without splitting.
> 
> nit: a bit weird to capitalize ZSWAP no? :)

No problem :). Will fix in v8.

> 
> >
> > +swpout
> > +       is incremented every time a huge page is swapped out to a non-ZSWAP
> > +       swap entity in one piece without splitting.
> > +
> 
> nit: "non-zswap swap entity" is a bit awkward. Maybe swapped out to a
> non-zswap swap device?

Sure, will make this change in v8. Thanks Nhat!



* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24 20:51     ` Sridhar, Kanchana P
@ 2024-09-24 21:08       ` Nhat Pham
  2024-09-24 21:34         ` Yosry Ahmed
  0 siblings, 1 reply; 79+ messages in thread
From: Nhat Pham @ 2024-09-24 21:08 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh

On Tue, Sep 24, 2024 at 1:51 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
>
> This is an excellent point. Thanks Nhat for catching this! I can see two
> options for solving this:
>
> Option 1: If zswap_mthp_enabled is "false", delete all stored offsets
> for the mTHP in zswap before exiting. This could race with writeback
> (one or more subpages could be written back before zswap_store()
> acquires the tree lock); however, I don't think it will cause data
> inconsistencies. Any offsets for subpages not yet written back will be
> deleted from zswap, zswap_store() will return false, and the backing
> swap device's subsequent swapout will overwrite the data zswap wrote
> back. Could anything go wrong with this?

I think this should be safe, albeit a bit awkward.

At this point (zswap_store()), we should have the folio added to the
swap cache, and locked. All the associated swap entries will point to
this same (large) folio.

Any concurrent zswap writeback attempt, even on a tail page, should
get that folio when it calls __read_swap_cache_async(), with
page_allocated == false, and should short-circuit.
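
Roughly, the short-circuit being described looks like this in
zswap_writeback_entry() (abridged sketch from memory; argument list
simplified):

	folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
					NO_INTERLEAVE_INDEX,
					&folio_was_allocated, true);
	if (!folio)
		return -ENOMEM;
	if (!folio_was_allocated) {
		/* Already in the swap cache, e.g. the locked (large)
		 * folio being stored: bail out instead of racing. */
		folio_put(folio);
		return -EEXIST;
	}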

So I don't think we will race with zswap_writeback().

Yosry, Chengming, Johannes, any thoughts?

>
> Option 2: Only provide a build config option,
> CONFIG_ZSWAP_STORE_THP_DEFAULT_ON, that cannot be dynamically changed.

This can be a last resort thing, if the above doesn't work. Not the
end of the world, but not ideal :)

>
> Would appreciate suggestions on these, and other potential solutions.
>
> Thanks,
> Kanchana



* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24 21:08       ` Nhat Pham
@ 2024-09-24 21:34         ` Yosry Ahmed
  2024-09-24 22:16           ` Nhat Pham
  2024-09-24 22:17           ` Sridhar, Kanchana P
  0 siblings, 2 replies; 79+ messages in thread
From: Yosry Ahmed @ 2024-09-24 21:34 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Sridhar, Kanchana P, linux-kernel, linux-mm, hannes,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang,
	Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal,
	Vinodh

On Tue, Sep 24, 2024 at 2:08 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Sep 24, 2024 at 1:51 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> >
> > This is an excellent point. Thanks Nhat for catching this! I can see two
> > options for solving this:
> >
> > Option 1: If zswap_mthp_enabled is "false", delete all stored offsets
> > for the mTHP in zswap before exiting. This could race with writeback
> > (one or more subpages could be written back before zswap_store()
> > acquires the tree lock); however, I don't think it will cause data
> > inconsistencies. Any offsets for subpages not yet written back will be
> > deleted from zswap, zswap_store() will return false, and the backing
> > swap device's subsequent swapout will overwrite the data zswap wrote
> > back. Could anything go wrong with this?
>
> I think this should be safe, albeit a bit awkward.
>
> > At this point (zswap_store()), we should have the folio added to the
> > swap cache, and locked. All the associated swap entries will point to
> this same (large) folio.
>
> > Any concurrent zswap writeback attempt, even on a tail page, should
> > get that folio when it calls __read_swap_cache_async(), with
> > page_allocated == false, and should short-circuit.
>
> So I don't think we will race with zswap_writeback().
>
> Yosry, Chengming, Johannes, any thoughts?

Why can't we just handle it the same way as we handle zswap
disablement? If it is disabled, we invalidate any old entries for the
offsets and return false so the folio is swapped out to disk.

Taking a step back, why do we need the runtime knob and config option?
Are there cases where we think zswapout of mTHPs will perform badly,
or is it just due to lack of confidence in the feature?

>
> >
> > Option 2: Only provide a build config option,
> > CONFIG_ZSWAP_STORE_THP_DEFAULT_ON, that cannot be dynamically changed.
>
> This can be a last resort thing, if the above doesn't work. Not the
> end of the world, but not ideal :)
>
> >
> > Would appreciate suggestions on these, and other potential solutions.
> >
> > Thanks,
> > Kanchana



* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24 20:51     ` Nhat Pham
@ 2024-09-24 21:38       ` Yosry Ahmed
  2024-09-24 23:11         ` Nhat Pham
  2024-09-24 23:21       ` Sridhar, Kanchana P
  1 sibling, 1 reply; 79+ messages in thread
From: Yosry Ahmed @ 2024-09-24 21:38 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
	vinodh.gopal

On Tue, Sep 24, 2024 at 1:51 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Sep 24, 2024 at 12:39 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> > > +        * The cgroup zswap limit check is done once at the beginning of an
> > > +        * mTHP store, and not within zswap_store_page() for each page
> > > +        * in the mTHP. We do however check the zswap pool limits at the
> > > +        * start of zswap_store_page(). What this means is, the cgroup
> > > +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > > +        * However, the per-store-page zswap pool limits check should
> > > +        * hopefully trigger the cgroup aware and zswap LRU aware global
> > > +        * reclaim implemented in the shrinker. If this assumption holds,
> > > +        * the cgroup exceeding the zswap limits could potentially be
> > > +        * resolved before the next zswap_store, and if it is not, the next
> > > +        * zswap_store would fail the cgroup zswap limit check at the start.
> > > +        */
> >
> > I do not really like this. Allowing going one page above the limit is
> > one thing, but one THP above the limit seems too much. I also don't
>
> Hmm what if you have multiple concurrent zswap stores, from different
> tasks but the same cgroup? If none of them has charged, they would all
> get greenlit, and charge towards the cgroup...
>
> So technically the zswap limit checking is already best-effort only.
> But now, instead of one page per violation, it's 512 pages per
> violation :)

Yeah good point about concurrent operations, we can go 512 pages above
the limit per concurrent swapout. That can be a lot of memory.
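
(To put a number on it: four concurrent 2M THP swapouts from the same
cgroup could charge up to 4 * 512 = 2048 extra pages past the limit,
i.e. up to ~8 MiB in the worst case.)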

>
> Yeah this can be bad. I think this is only safe if you only use
> zswap.max as a binary knob (0 or max)...
>
> > like relying on the repeated limit checking in zswap_store_page(); if
> > anything, I think that should be batched too.
> >
> > Is it too unreasonable to maintain the average compression ratio and
> > use that to estimate limit checking for both memcg and global limits?
> > Johannes, Nhat, any thoughts on this?
>
> I remember asking about this, but past Nhat might have relented :)
>
> https://lore.kernel.org/linux-mm/CAKEwX=PfAMZ2qJtwKwJsVx3TZWxV5z2ZaU1Epk1UD=DBdMsjFA@mail.gmail.com/
>
> We can do limit checking and charging after compression is done, but
> that's a lot of code change (might not even be possible)... It will,
> however, allow us to do charging + checking in one go (rather than
> doing it 8, 16, or 512 times)
>
> Another thing we can do is to register a zswap writeback after the
> zswap store attempts to clean up excess capacity. Not sure what will
> happen if zswap writeback is disabled for the cgroup though :)
>
> If it's too hard, the average estimate could be a decent compromise,
> until we figure something smarter.

We can also do what we discussed before about double charging. The
pages that are being reclaimed are already charged, so technically we
don't need to charge them again. We can uncharge the difference
between compressed and uncompressed sizes after compression and call
it a day. This fixes the limit checking and the double charging in one
go.
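
A very rough sketch of that idea (hypothetical; it glosses over the
lifetime concern below):

	if (objcg) {
		/*
		 * The page is already charged PAGE_SIZE as regular memory;
		 * keep entry->length worth of charge and release the rest,
		 * instead of a separate obj_cgroup_charge_zswap().
		 */
		obj_cgroup_uncharge(objcg, PAGE_SIZE - entry->length);
	}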

I am a little bit nervous though about zswap uncharging the pages from
under reclaim, there are likely further accesses of the page memcg
after zswap. Maybe we can plumb the info back to reclaim or set a flag
on the page to avoid uncharging it when it's freed.



* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24 21:34         ` Yosry Ahmed
@ 2024-09-24 22:16           ` Nhat Pham
  2024-09-24 22:18             ` Sridhar, Kanchana P
  2024-09-24 22:28             ` Yosry Ahmed
  2024-09-24 22:17           ` Sridhar, Kanchana P
  1 sibling, 2 replies; 79+ messages in thread
From: Nhat Pham @ 2024-09-24 22:16 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Sridhar, Kanchana P, linux-kernel, linux-mm, hannes,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang,
	Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal,
	Vinodh

On Tue, Sep 24, 2024 at 2:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
>
> Why can't we just handle it the same way as we handle zswap
> disablement? If it is disabled, we invalidate any old entries for the
> offsets and return false to swapout to disk.

I think that was the suggestion.

>
> Taking a step back, why do we need the runtime knob and config option?
> Are there cases where we think zswapout of mTHPs will perform badly,
> or is it just due to lack of confidence in the feature?

Fair point. I think the reason I suggested this knob was that we
observed so many regressions in earlier benchmarks, especially in the
software compressor column.

But now that we've reworked the benchmark and use zstd as the software
compressor, I think we can get rid of this knob/config option and
simplify things.



* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24 21:34         ` Yosry Ahmed
  2024-09-24 22:16           ` Nhat Pham
@ 2024-09-24 22:17           ` Sridhar, Kanchana P
  1 sibling, 0 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-24 22:17 UTC (permalink / raw)
  To: Yosry Ahmed, Nhat Pham
  Cc: linux-kernel, linux-mm, hannes, chengming.zhou, usamaarif642,
	shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou,
	Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P


> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, September 24, 2024 2:34 PM
> To: Nhat Pham <nphamcs@gmail.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Tue, Sep 24, 2024 at 2:08 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Tue, Sep 24, 2024 at 1:51 PM Sridhar, Kanchana P
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > >
> > > This is an excellent point. Thanks Nhat for catching this! I can see two
> > > options for solving this:
> > >
> > > Option 1: If zswap_mthp_enabled is "false", delete all stored offsets
> > > for the mTHP in zswap before exiting. This could race with writeback
> > > (one or more subpages could be written back before zswap_store()
> > > acquires the tree lock); however, I don't think it will cause data
> > > inconsistencies. Any offsets for subpages not yet written back will
> > > be deleted from zswap, zswap_store() will return false, and the
> > > backing swap device's subsequent swapout will overwrite the data
> > > zswap wrote back. Could anything go wrong with this?
> >
> > I think this should be safe, albeit a bit awkward.
> >
> > At this point (zswap_store()), we should have the folio added to the
> > swap cache, and locked. All the associated swap entries will point to
> > this same (large) folio.
> >
> > Any concurrent zswap writeback attempt, even on a tail page, should
> > get that folio when it calls __read_swap_cache_async(), with
> > page_allocated == false, and should short-circuit.
> >
> > So I don't think we will race with zswap_writeback().
> >
> > Yosry, Chengming, Johannes, any thoughts?
> 
> Why can't we just handle it the same way as we handle zswap
> disablement? If it is disabled, we invalidate any old entries for the
> offsets and return false to swapout to disk.
> 
> Taking a step back, why do we need the runtime knob and config option?
> Are there cases where we think zswapout of mTHPs will perform badly,
> or is it just due to lack of confidence in the feature?

Thanks Nhat and Yosry for the suggestions/comments.

If I recall correctly, the topic of adding a config option/knob came up
based on earlier data I had collected with a zram backing device setup,
which showed a performance degradation with zstd, but not with deflate-iaa.

Since the v7 data collected with an 823G SSD swap disk partition indicates
that we get good throughput and latency improvements with zswap-mTHP
with zstd and deflate-iaa, I am not sure if the knob is still required (if this
is representative of most of the setups that use mTHP).

I am confident about the zswap-mTHP feature itself, and don’t think the
knob is needed from that perspective. I think the question is really about
having the ability to disable zswap-mTHP in some existing setup where
having mTHP enabled performs worse with this patchset than without.

I am Ok with having the knob and handling it using Option 1, or not
having a knob at all.

Thanks,
Kanchana 

> 
> >
> > >
> > > Option 2: Only provide a build config option,
> > > CONFIG_ZSWAP_STORE_THP_DEFAULT_ON, that cannot be dynamically
> changed.
> >
> > This can be a last resort thing, if the above doesn't work. Not the
> > end of the world, but not ideal :)
> >
> > >
> > > Would appreciate suggestions on these, and other potential solutions.
> > >
> > > Thanks,
> > > Kanchana


* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24 22:16           ` Nhat Pham
@ 2024-09-24 22:18             ` Sridhar, Kanchana P
  2024-09-24 22:28             ` Yosry Ahmed
  1 sibling, 0 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-24 22:18 UTC (permalink / raw)
  To: Nhat Pham, Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, chengming.zhou, usamaarif642,
	shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou,
	Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, September 24, 2024 3:16 PM
> To: Yosry Ahmed <yosryahmed@google.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Tue, Sep 24, 2024 at 2:34 PM Yosry Ahmed <yosryahmed@google.com>
> wrote:
> >
> >
> > Why can't we just handle it the same way as we handle zswap
> > disablement? If it is disabled, we invalidate any old entries for the
> > offsets and return false so the folio is swapped out to disk.
> 
> I think that was the suggestion.
> 
> >
> > Taking a step back, why do we need the runtime knob and config option?
> > Are there cases where we think zswapout of mTHPs will perform badly,
> > or is it just due to lack of confidence in the feature?
> 
> Fair point. I think the reason I suggested this knob was that we
> observed so many regressions in earlier benchmarks, especially in the
> software compressor column.
> 
> But now that we've reworked the benchmark and use zstd as the software
> compressor, I think we can get rid of this knob/config option and
> simplify things.

I agree, thanks Nhat! Will fix this in v8.

Thanks,
Kanchana


* RE: [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in zswap xarray.
  2024-09-24 19:14   ` Yosry Ahmed
@ 2024-09-24 22:22     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-24 22:22 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, September 24, 2024 12:15 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in
> zswap xarray.
> 
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Added a new procedure zswap_store_entry() that refactors the code
> > currently in zswap_store() to store an entry in the zswap xarray.
> > This will allow us to call this procedure to store the swap offset
> > of each page of an mTHP in the xarray, as part of zswap_store()
> > supporting mTHP.
> >
> > Also, made a minor edit in the comments for 'struct zswap_entry' to delete
> > the description of the 'value' member that was deleted in commit
> > 20a5532ffa53d6ecf41ded920a7b0ff9c65a7dcf ("mm: remove code to
> handle
> > same filled pages").
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/zswap.c | 51 ++++++++++++++++++++++++++++++++++-----------------
> >  1 file changed, 34 insertions(+), 17 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 59b7733a62d3..fd35a81b6e36 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -190,7 +190,6 @@ static struct shrinker *zswap_shrinker;
> >   *              section for context.
> >   * pool - the zswap_pool the entry's data is in
> >   * handle - zpool allocation handle that stores the compressed page data
> > - * value - value of the same-value filled pages which have same content
> >   * objcg - the obj_cgroup that the compressed memory is charged to
> >   * lru - handle to the pool's lru used to evict pages.
> >   */
> > @@ -1404,12 +1403,44 @@ static void shrink_worker(struct work_struct
> *w)
> >  /*********************************
> >  * main API
> >  **********************************/
> > +
> > +/*
> > + * Returns true if the entry was successfully
> > + * stored in the xarray, and false otherwise.
> > + */
> > +static bool zswap_store_entry(struct xarray *tree,
> > +                             struct zswap_entry *entry)
> 
> 
> I think zswap_tree_store() is a more descriptive name.

Thanks Yosry for the code review comments!
Sure, will change this to zswap_tree_store() in v8. 

> 
> >
> > +{
> > +       struct zswap_entry *old;
> > +       pgoff_t offset = swp_offset(entry->swpentry);
> 
> 
> Reverse xmas tree where possible please (longest to shortest declarations).
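> 
> For example, with the two declarations above reordered:
> 
> 	pgoff_t offset = swp_offset(entry->swpentry);
> 	struct zswap_entry *old;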
> 
> >
> > +
> > +       old = xa_store(tree, offset, entry, GFP_KERNEL);
> > +
> 
> No need for the blank line here.

Ok, will fix in v8.

Thanks,
Kanchana

> 
> > +       if (xa_is_err(old)) {
> > +               int err = xa_err(old);
> > +
> > +               WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n",
> err);
> > +               zswap_reject_alloc_fail++;
> > +               return false;
> > +       }
> > +
> > +       /*
> > +        * We may have had an existing entry that became stale when
> > +        * the folio was redirtied and now the new version is being
> > +        * swapped out. Get rid of the old.
> > +        */
> > +       if (old)
> > +               zswap_entry_free(old);
> > +
> > +       return true;
> > +}
> > +
> >  bool zswap_store(struct folio *folio)
> >  {
> >         swp_entry_t swp = folio->swap;
> >         pgoff_t offset = swp_offset(swp);
> >         struct xarray *tree = swap_zswap_tree(swp);
> > -       struct zswap_entry *entry, *old;
> > +       struct zswap_entry *entry;
> >         struct obj_cgroup *objcg = NULL;
> >         struct mem_cgroup *memcg = NULL;
> >
> > @@ -1465,22 +1496,8 @@ bool zswap_store(struct folio *folio)
> >         entry->objcg = objcg;
> >         entry->referenced = true;
> >
> > -       old = xa_store(tree, offset, entry, GFP_KERNEL);
> > -       if (xa_is_err(old)) {
> > -               int err = xa_err(old);
> > -
> > -               WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n",
> err);
> > -               zswap_reject_alloc_fail++;
> > +       if (!zswap_store_entry(tree, entry))
> >                 goto store_failed;
> > -       }
> > -
> > -       /*
> > -        * We may have had an existing entry that became stale when
> > -        * the folio was redirtied and now the new version is being
> > -        * swapped out. Get rid of the old.
> > -        */
> > -       if (old)
> > -               zswap_entry_free(old);
> >
> >         if (objcg) {
> >                 obj_cgroup_charge_zswap(objcg, entry->length);
> > --
> > 2.27.0
> >


* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24 22:16           ` Nhat Pham
  2024-09-24 22:18             ` Sridhar, Kanchana P
@ 2024-09-24 22:28             ` Yosry Ahmed
  1 sibling, 0 replies; 79+ messages in thread
From: Yosry Ahmed @ 2024-09-24 22:28 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Sridhar, Kanchana P, linux-kernel, linux-mm, hannes,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang,
	Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal,
	Vinodh

On Tue, Sep 24, 2024 at 3:16 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Sep 24, 2024 at 2:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> >
> > Why can't we just handle it the same way as we handle zswap
> > disablement? If it is disabled, we invalidate any old entries for the
> > offsets and return false so the folio is swapped out to disk.
>
> I think that was the suggestion.


Hmm I may be reading this wrong, but my understanding was that the
suggestion is to synchronously remove all entries of large folios from
zswap when zswap_mthp is disabled. What I am suggesting is to do the
same thing we do in zswap_store() when zswap is disabled.

Anyway, if we are removing the knob this is not relevant anymore.



* RE: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors.
  2024-09-24 19:20   ` Yosry Ahmed
@ 2024-09-24 22:32     ` Sridhar, Kanchana P
  2024-09-25  0:43       ` Yosry Ahmed
  0 siblings, 1 reply; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-24 22:32 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, September 24, 2024 12:20 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored
> offsets in case of errors.
> 
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Added a new procedure zswap_delete_stored_offsets() that can be
> > called to delete stored offsets in a folio in case zswap_store()
> > fails or zswap is disabled.
> 
> I don't see the value in this helper. It will get called in one place
> AFAICT, and it is a bit inconsistent that we have to explicitly loop
> in zswap_store() to store pages, but the loop to delete pages upon
> failure is hidden in the helper.
> 
> I am not against adding a trivial zswap_tree_delete() helper (or
> similar) that calls xa_erase() and  zswap_entry_free() to match
> zswap_tree_store() if you prefer that.

This is a good point. I had refactored this routine in the context of
my batching code, where the same loop over the mTHP's subpages gets
called in multiple error-condition cases.

I am thinking it might make sense for, say, zswap_tree_delete() to take
a "folio" and "tree", and encapsulate deleting all stored offsets for
that folio. Since we have already done the computation to find the
"tree", passing it as an input parameter mainly saves latency; but if
it is cleaner to have "zswap_tree_delete(struct folio *folio)", that
should be OK too. Please let me know your suggestion on this.
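
For illustration, a folio-based helper along those lines might look
like the following sketch (with the tree passed in to avoid recomputing
it; the loop body mirrors the existing per-offset cleanup):

	static void zswap_tree_delete(struct xarray *tree, struct folio *folio)
	{
		pgoff_t offset = swp_offset(folio->swap);
		long nr_pages = folio_nr_pages(folio);
		struct zswap_entry *entry;
		long i;

		for (i = 0; i < nr_pages; ++i) {
			/* Erase and free any entry stored at this offset. */
			entry = xa_erase(tree, offset + i);
			if (entry)
				zswap_entry_free(entry);
		}
	}

zswap_store() would then call zswap_tree_delete(tree, folio) once on
its failure path.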

Thanks,
Kanchana

> 
> >
> > Refactored the code in zswap_store() that handles these cases,
> > to call zswap_delete_stored_offsets().
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/zswap.c | 33 ++++++++++++++++++++++++++++++---
> >  1 file changed, 30 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index fd35a81b6e36..9bea948d653e 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1435,8 +1435,37 @@ static bool zswap_store_entry(struct xarray
> *tree,
> >         return true;
> >  }
> >
> > +/*
> > + * If the zswap store fails or zswap is disabled, we must invalidate the
> > + * possibly stale entries which were previously stored at the offsets
> > + * corresponding to each page of the folio. Otherwise, writeback could
> > + * overwrite the new data in the swapfile.
> > + *
> > + * This is called after the store of an offset in a large folio has failed.
> > + * All zswap entries in the folio must be deleted. This helps make sure
> > + * that a swapped-out mTHP is either entirely stored in zswap, or entirely
> > + * not stored in zswap.
> > + *
> > + * This is also called if zswap_store() is invoked, but zswap is not enabled.
> > + * All offsets for the folio are deleted from zswap in this case.
> > + */
> > +static void zswap_delete_stored_offsets(struct xarray *tree,
> > +                                       pgoff_t offset,
> > +                                       long nr_pages)
> > +{
> > +       struct zswap_entry *entry;
> > +       long i;
> > +
> > +       for (i = 0; i < nr_pages; ++i) {
> > +               entry = xa_erase(tree, offset + i);
> > +               if (entry)
> > +                       zswap_entry_free(entry);
> > +       }
> > +}
> > +
> >  bool zswap_store(struct folio *folio)
> >  {
> > +       long nr_pages = folio_nr_pages(folio);
> >         swp_entry_t swp = folio->swap;
> >         pgoff_t offset = swp_offset(swp);
> >         struct xarray *tree = swap_zswap_tree(swp);
> > @@ -1541,9 +1570,7 @@ bool zswap_store(struct folio *folio)
> >          * possibly stale entry which was previously stored at this offset.
> >          * Otherwise, writeback could overwrite the new data in the swapfile.
> >          */
> > -       entry = xa_erase(tree, offset);
> > -       if (entry)
> > -               zswap_entry_free(entry);
> > +       zswap_delete_stored_offsets(tree, offset, nr_pages);
> >         return false;
> >  }
> >
> > --
> > 2.27.0
> >

^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio.
  2024-09-24 19:28   ` Yosry Ahmed
@ 2024-09-24 22:45     ` Sridhar, Kanchana P
  2024-09-25  0:47       ` Yosry Ahmed
  0 siblings, 1 reply; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-24 22:45 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, September 24, 2024 12:29 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 5/8] mm: zswap: Compress and store a specific page
> in a folio.
> 
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > For zswap_store() to handle mTHP folios, we need to iterate through each
> > page in the mTHP, compress it and store it in the zswap pool. This patch
> > introduces an auxiliary function zswap_store_page() that provides this
> > functionality.
> >
> > The function signature reflects the design intent, namely, for it
> > to be invoked by zswap_store() per-page in an mTHP. Hence, the folio's
> > objcg and the zswap_pool to use are input parameters for sake of
> > efficiency and consistency.
> >
> > The functionality in zswap_store_page() is reused and adapted from
> > Ryan Roberts' RFC patch [1]:
> >
> >   "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-
> ryan.roberts@arm.com/T/#u
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/zswap.c | 88
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 88 insertions(+)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 9bea948d653e..8f2e0ab34c84 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1463,6 +1463,94 @@ static void zswap_delete_stored_offsets(struct
> xarray *tree,
> >         }
> >  }
> >
> > +/*
> > + * Stores the page at specified "index" in a folio.
> > + *
> > + * @folio: The folio to store in zswap.
> > + * @index: Index into the page in the folio that this function will store.
> > + * @objcg: The folio's objcg.
> > + * @pool:  The zswap_pool to store the compressed data for the page.
> > + */
> > +static bool __maybe_unused zswap_store_page(struct folio *folio, long
> index,
> > +                                           struct obj_cgroup *objcg,
> > +                                           struct zswap_pool *pool)
> 
> Why are we adding an unused function that duplicates code in
> zswap_store(), then using it in the following patch? This makes it
> difficult to see that the function does the same thing. This patch
> should be refactoring the per-page code out of zswap_store() into
> zswap_store_page(), and directly calling zswap_store_page() from
> zswap_store().

Sure, thanks Yosry for this suggestion. Will fix in v8.

> 
> > +{
> > +       swp_entry_t swp = folio->swap;
> > +       int type = swp_type(swp);
> > +       pgoff_t offset = swp_offset(swp) + index;
> > +       struct page *page = folio_page(folio, index);
> > +       struct xarray *tree = swap_zswap_tree(swp);
> > +       struct zswap_entry *entry;
> > +
> > +       if (objcg)
> > +               obj_cgroup_get(objcg);
> > +
> > +       if (zswap_check_limits())
> > +               goto reject;
> > +
> > +       /* allocate entry */
> > +       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> > +       if (!entry) {
> > +               zswap_reject_kmemcache_fail++;
> > +               goto reject;
> > +       }
> > +
> > +       /* if entry is successfully added, it keeps the reference */
> > +       if (!zswap_pool_get(pool))
> > +               goto freepage;
> 
> I think we can batch this for all pages in zswap_store(), maybe first
> add zswap_pool_get_many().
> 
> I am also wondering if it would be better to batch the limit checking
> and allocating the entries, to front load any failures before we start
> compression. Not sure if that's overall better though.
> 
> To batch allocate entries we will have to also allocate an array to
> hold them. To batch the limit checking we will have to either allow
> going further over limit for mTHPs, or check if there is enough
> clearance to allow for compressing all the pages. Using the
> uncompressed size will lead to false negatives though, so maybe we can
> start tracking the average compression ratio for better limit
> checking.
> 
> Nhat, Johannes, any thoughts here? I need someone to tell me if I am
> overthinking this :)

These are all good points. I suppose I was thinking along the same
lines as what Nhat mentioned in an earlier comment. I was trying the
incremental zswap_pool_get() and limit checks, with shrinker
invocations in case of (all) error conditions, to allow different
concurrent stores to make progress without favoring only one process's
mTHP store. I was thinking this would have minimal impact on the
process(es) that see the zswap limit being exceeded, and that it would
be better than preemptively checking for the entire mTHP and failing.
The preemptive approach could also lead to a situation where no one
makes progress: multiple processes run the batch checks and fail, when
realistically one or many of them could have triggered the shrinker
before erroring out, and at least one could have made progress.

Would appreciate your perspectives on how this should be handled,
and will implement a solution in v8 accordingly.
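
For reference, the zswap_pool_get_many() that Yosry suggests could be a
thin wrapper, assuming the pool refcount remains a percpu_ref as in
current mm/zswap.c (a sketch, not settled code):

	static bool zswap_pool_get_many(struct zswap_pool *pool,
					unsigned long nr)
	{
		/* Take all nr references on the pool in one shot. */
		return percpu_ref_tryget_many(&pool->ref, nr);
	}

Batch-allocating the entries could similarly build on
kmem_cache_alloc_bulk(), at the cost of an array to hold the entry
pointers.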

Thanks,
Kanchana

> 
> > +
> > +       entry->pool = pool;
> > +
> > +       if (!zswap_compress(page, entry))
> > +               goto put_pool;
> > +
> > +       entry->swpentry = swp_entry(type, offset);
> > +       entry->objcg = objcg;
> > +       entry->referenced = true;
> > +
> > +       if (!zswap_store_entry(tree, entry))
> > +               goto store_failed;
> > +
> > +       if (objcg) {
> > +               obj_cgroup_charge_zswap(objcg, entry->length);
> > +               count_objcg_event(objcg, ZSWPOUT);
> > +       }
> > +
> > +       /*
> > +        * We finish initializing the entry while it's already in xarray.
> > +        * This is safe because:
> > +        *
> > +        * 1. Concurrent stores and invalidations are excluded by folio lock.
> > +        *
> > +        * 2. Writeback is excluded by the entry not being on the LRU yet.
> > +        *    The publishing order matters to prevent writeback from seeing
> > +        *    an incoherent entry.
> > +        */
> > +       if (entry->length) {
> > +               INIT_LIST_HEAD(&entry->lru);
> > +               zswap_lru_add(&zswap_list_lru, entry);
> > +       }
> > +
> > +       /* update stats */
> > +       atomic_inc(&zswap_stored_pages);
> > +       count_vm_event(ZSWPOUT);
> 
> We should probably also batch updating the stats. It actually seems
> like now we don't handle rolling them back upon failure.

Good point! I assume you are referring only to the "ZSWPOUT" vm event
stats updates and not to "zswap_stored_pages" (since the latter is used
in limit checking)?

I will fix this in v8.
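
For instance, the per-page updates could be batched once per folio,
after all pages have been stored successfully; a sketch using the
existing counters:

	/* Update stats once for the whole folio. */
	atomic_add(nr_pages, &zswap_stored_pages);
	count_vm_events(ZSWPOUT, nr_pages);

On failure there would then be nothing to roll back, since the counters
are only touched after the last page stores successfully.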

Thanks,
Kanchana

> 
> 
> > +
> > +       return true;
> > +
> > +store_failed:
> > +       zpool_free(entry->pool->zpool, entry->handle);
> > +put_pool:
> > +       zswap_pool_put(pool);
> > +freepage:
> > +       zswap_entry_cache_free(entry);
> > +reject:
> > +       obj_cgroup_put(objcg);
> > +       if (zswap_pool_reached_full)
> > +               queue_work(shrink_wq, &zswap_shrink_work);
> > +
> > +       return false;
> > +}
> > +
> >  bool zswap_store(struct folio *folio)
> >  {
> >         long nr_pages = folio_nr_pages(folio);
> > --
> > 2.27.0
> >

^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
  2024-09-24 19:34 ` [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Yosry Ahmed
@ 2024-09-24 22:50   ` Sridhar, Kanchana P
  0 siblings, 0 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-24 22:50 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, September 24, 2024 12:35 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
> 
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi All,
> >
> > This patch-series enables zswap_store() to accept and store mTHP
> > folios. The most significant contribution in this series is from the
> > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > migrated to mm-unstable as of 9-23-2024 in patches 5,6 of this series.
> >
> > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> >      https://lore.kernel.org/linux-mm/20231019110543.3284654-1-
> ryan.roberts@arm.com/T/#u
> >
> > Additionally, there is an attempt to modularize some of the functionality
> > in zswap_store(), to make it more amenable to supporting any-order
> > mTHPs. For instance, the function zswap_store_entry() stores a
> zswap_entry
> > in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
> > delete all offsets corresponding to a higher order folio stored in zswap.
> 
> These are implementation details that are not very useful here, you
> can just mention that the first few patches do refactoring prep work.

Thanks Yosry for the comments! Sure, I will reword this as you've
suggested in v8.

> 
> >
> > For accounting purposes, the patch-series adds per-order mTHP sysfs
> > "zswpout" counters that get incremented upon successful zswap_store of
> > an mTHP folio:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> >
> > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by
> default)
> > will enable/disable zswap storing of (m)THP. When disabled, zswap will
> > fallback to rejecting the mTHP folio, to be processed by the backing
> > swap device.
> 
> Why is this needed? Do we just not have enough confidence in the
> feature yet, or are there some cases that regress from enabling mTHP
> for zswapout?
> 
> Does generic mTHP swapout/swapin also use config options?

As discussed in the other comments' follow-up, I will delete the config
option and runtime knob.

> 
> >
> > This patch-series is a pre-requisite for ZSWAP compress batching of mTHP
> > swap-out and decompress batching of swap-ins based on
> swapin_readahead(),
> > using Intel IAA hardware acceleration, which we would like to submit in
> > subsequent patch-series, with performance improvement data.
> >
> > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> >
> > Thanks also to Nhat, Yosry, Barry, Chengming, Usama and Ying for their
> > helpful feedback, data reviews and suggestions!
> >
> > Co-development signoff request:
> > ===============================
> > I would like to request Ryan Roberts' co-developer signoff on patches
> > 5 and 6 in this series. Thanks Ryan!
> >
> > Changes since v6:
> > =================
> 
> Please put the changelog at the very end, I almost missed the
> performance evaluation.

Sure, will fix this.

> 
> > 1) Rebased to mm-unstable as of 9-23-2024,
> >    commit acfabf7e197f7a5bedf4749dac1f39551417b049.
> > 2) Refactored into smaller commits, as suggested by Yosry and
> >    Chengming. Thanks both!
> > 3) Reworded the commit log for patches 5 and 6 as per Yosry's
> >    suggestion. Thanks Yosry!
> > 4) Gathered data on a Sapphire Rapids server that has 823GiB SSD swap disk
> >    partition. Also, all experiments are run with usemem --sleep 10, so that
> >    the memory allocated by the 70 processes remains in memory
> >    longer. Posted elapsed and sys times. Thanks to Yosry, Nhat and Ying for
> >    their help with refining the performance characterization methodology.
> > 5) Updated Documentation/admin-guide/mm/transhuge.rst as suggested
> by
> >    Nhat. Thanks Nhat!
> >
> > Changes since v5:
> > =================
> > 1) Rebased to mm-unstable as of 8/29/2024,
> >    commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
> > 2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
> >    enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
> >    suggestion to add a knob by which users can enable/disable this
> >    change. Nhat, I hope this is along the lines of what you were
> >    thinking.
> > 3) Added vm-scalability usemem data with 4K folios with
> >    CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make
> sure
> >    there is no regression with this change.
> > 4) Added data with usemem with 64K and 2M THP for an alternate view of
> >    before/after, as suggested by Yosry, so we can understand the impact
> >    of when mTHPs are split into 4K folios in shrink_folio_list()
> >    (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
> >    in zswap. Thanks Yosry for this suggestion.
> >
> > Changes since v4:
> > =================
> > 1) Published before/after data with zstd, as suggested by Nhat (Thanks
> >    Nhat for the data reviews!).
> > 2) Rebased to mm-unstable from 8/27/2024,
> >    commit b659edec079c90012cf8d05624e312d1062b8b87.
> > 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
> >    CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
> >    robot; as per Nhat's and Michal's suggestion to not require a separate
> >    patch to fix the build errors (thanks both!).
> > 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
> >    suggested by Yosry (Thanks Yosry!).
> > 5) Squashed the commits that define new mthp zswpout stat counters, and
> >    invoke count_mthp_stat() after successful zswap_store()s; into a single
> >    commit. Thanks Yosry for this suggestion!
> >
> > Changes since v3:
> > =================
> > 1) Rebased to mm-unstable commit
> 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> >    Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> >    changes to count_mthp_stat() so that it's always defined, even when THP
> >    is disabled. Barry, I have also made one other change in page_io.c
> >    where count_mthp_stat() is called by count_swpout_vm_event(). I would
> >    appreciate it if you can review this. Thanks!
> >    Hopefully this should resolve the kernel robot build errors.
> >
> > Changes since v2:
> > =================
> > 1) Gathered usemem data using SSD as the backing swap device for zswap,
> >    as suggested by Ying Huang. Ying, I would appreciate it if you can
> >    review the latest data. Thanks!
> > 2) Generated the base commit info in the patches to attempt to address
> >    the kernel test robot build errors.
> > 3) No code changes to the individual patches themselves.
> >
> > Changes since RFC v1:
> > =====================
> >
> > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> >    Thanks Barry!
> > 2) Addressed some of the code review comments that Nhat Pham provided
> in
> >    Ryan's initial RFC [1]:
> >    - Added a comment about the cgroup zswap limit checks occurring once
> per
> >      folio at the beginning of zswap_store().
> >      Nhat, Ryan, please do let me know if the comments convey the summary
> >      from the RFC discussion. Thanks!
> >    - Posted data on running the cgroup suite's zswap kselftest.
> > 3) Rebased to v6.11-rc3.
> > 4) Gathered performance data with usemem and the rebased patch-series.
> >
> >
> > Regression Testing:
> > ===================
> > I ran vm-scalability usemem 70 processes without mTHP, i.e., only 4K
> > folios with mm-unstable and with this patch-series. The main goal was
> > to make sure that there is no functional or performance regression
> > wrt the earlier zswap behavior for 4K folios,
> > CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set, and zswap_store() of
> 4K
> > pages goes through the newly added code path [zswap_store(),
> > zswap_store_page()].
> >
> > The data indicates there is no regression.
> >
> >  ------------------------------------------------------------------------------
> >                      mm-unstable 8-28-2024                        zswap-mTHP v6
> >                                               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON
> >                                                                      is not set
> >  ------------------------------------------------------------------------------
> >  ZSWAP compressor        zstd     deflate-                     zstd    deflate-
> >                                        iaa                                  iaa
> >  ------------------------------------------------------------------------------
> >  Throughput (KB/s)    110,775      113,010               111,550        121,937
> >  sys time (sec)      1,141.72       954.87              1,131.95         828.47
> >  memcg_high           140,500      153,737               139,772        134,129
> >  memcg_swap_high            0            0                     0              0
> >  memcg_swap_fail            0            0                     0              0
> >  pswpin                     0            0                     0              0
> >  pswpout                    0            0                     0              0
> >  zswpin                   675          690                   682            684
> >  zswpout            9,552,298   10,603,271             9,566,392      9,267,213
> >  thp_swpout                 0            0                     0              0
> >  thp_swpout_                0            0                     0              0
> >   fallback
> >  pgmajfault             3,453        3,468                 3,841          3,487
> >  ZSWPOUT-64kB-mTHP        n/a          n/a                     0              0
> >  SWPOUT-64kB-mTHP           0            0                     0              0
> >  ------------------------------------------------------------------------------
> 
> It's probably better to put the zstd columns next to each other, and
> the deflate-iaa columns next to each other, for easier visual
> comparisons.

Sure. Will change this accordingly, in v8.

> 
> >
> >
> > Performance Testing:
> > ====================
> > Testing of this patch-series was done with mm-unstable as of 9-23-2024,
> > commit acfabf7e197f7a5bedf4749dac1f39551417b049. Data was gathered
> > without/with this patch-series, on an Intel Sapphire Rapids server,
> > dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB RAM and
> > 823G SSD disk partition swap. Core frequency was fixed at 2500MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed at 40G. There is no swap limit set for the cgroup. Following a
> > similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
> > series [2], 70 usemem processes were run, each allocating and writing 1G of
> > memory, and sleeping for 10 sec before exiting:
> >
> >     usemem --init-time -w -O -s 10 -n 70 1g
> >
> > The vm/sysfs mTHP stats included with the performance data provide
> details
> > on the swapout activity to ZSWAP/swap.
> >
> > Other kernel configuration parameters:
> >
> >     ZSWAP Compressors : zstd, deflate-iaa
> >     ZSWAP Allocator   : zsmalloc
> >     SWAP page-cluster : 2
> >
> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > IAA "compression verification" is enabled. Hence each IAA compression
> > will be decompressed internally by the "iaa_crypto" driver, the crc-s
> > returned by the hardware will be compared and errors reported in case of
> > mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > compared to the software compressors.
> >
> > Throughput is derived by averaging the individual 70 processes' throughputs
> > reported by usemem. elapsed/sys times are measured with perf. All data
> > points per compressor/kernel/mTHP configuration are averaged across 3
> runs.
> >
> > Case 1: Comparing zswap 4K vs. zswap mTHP
> > =========================================
> >
> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in
> > 64K/2M (m)THP to be split into 4K folios that get processed by zswap.
> >
> > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that
> results
> > in 64K/2M (m)THP to not be split, and processed by zswap.
> >
> >  64KB mTHP (cgroup memory.high set to 40G):
> >  ==========================================
> >
> >  -------------------------------------------------------------------------------
> >                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
> >                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
> >                                  Baseline
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >                                       iaa                     iaa            iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)   143,323      125,485     153,550     129,609    7%       3%
> >  elapsed time (sec)    24.97        25.42       23.90       25.19    4%       1%
> >  sys time (sec)       822.72       750.96      757.70      731.13    8%       3%
> >  memcg_high          132,743      169,825     148,075     192,744
> >  memcg_swap_fail     639,067      841,553       2,204       2,215
> >  pswpin                    0            0           0           0
> >  pswpout                   0            0           0           0
> >  zswpin                  795          873         760         902
> >  zswpout          10,011,266   13,195,137  10,010,017  13,193,554
> >  thp_swpout                0            0           0           0
> >  thp_swpout_               0            0           0           0
> >   fallback
> >  64kB-mthp_          639,065      841,553       2,204       2,215
> >   swpout_fallback
> >  pgmajfault            2,861        2,924       3,054       3,259
> >  ZSWPOUT-64kB            n/a          n/a     623,451     822,268
> >  SWPOUT-64kB               0            0           0           0
> >  -------------------------------------------------------------------------------
> >
> >
> >  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> >  =======================================================
> >
> >  -------------------------------------------------------------------------------
> >                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
> >                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
> >                                  Baseline
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >                                       iaa                     iaa            iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)   145,616      139,640     169,404     141,168   16%       1%
> >  elapsed time (sec)    25.05        23.85       23.02       23.37    8%       2%
> >  sys time (sec)       790.53       676.34      613.26      677.83   22%    -0.2%
> >  memcg_high           16,702       25,197      17,374      23,890
> >  memcg_swap_fail      21,485       27,814         114         144
> >  pswpin                    0            0           0           0
> >  pswpout                   0            0           0           0
> >  zswpin                  793          852         778         922
> >  zswpout          10,011,709   13,186,882  10,010,893  13,195,600
> >  thp_swpout                0            0           0           0
> >  thp_swpout_          21,485       27,814         114         144
> >   fallback
> >  2048kB-mthp_            n/a          n/a           0           0
> >   swpout_fallback
> >  pgmajfault            2,701        2,822       4,151       5,066
> >  ZSWPOUT-2048kB          n/a          n/a      19,442      25,615
> >  SWPOUT-2048kB             0            0           0           0
> >  -------------------------------------------------------------------------------
> >
> > We mostly see improvements in throughput, elapsed and sys time for zstd
> and
> > deflate-iaa, when comparing before (THP_SWAP=N) vs. after
> (THP_SWAP=Y).
> >
> >
> > Case 2: Comparing SSD swap mTHP vs. zswap mTHP
> > ==============================================
> >
> > In this scenario, CONFIG_THP_SWAP is enabled in "before" and "after"
> > experiments. The "before" represents zswap rejecting mTHP, and the mTHP
> > being stored by the 823G SSD swap. The "after" represents data with this
> > patch-series, that results in 64K/2M (m)THP being processed and stored by
> > zswap.
> >
> >  64KB mTHP (cgroup memory.high set to 40G):
> >  ==========================================
> >
> >  -------------------------------------------------------------------------------
> >                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
> >                         CONFIG_THP_SWAP=Y       CONFIG_THP_SWAP=Y       Baseline
> >                                  Baseline
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >                                       iaa                     iaa            iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)    20,265       20,696     153,550     129,609   658%    526%
> >  elapsed time (sec)    72.44        70.86       23.90       25.19    67%     64%
> >  sys time (sec)        77.95        77.99      757.70      731.13  -872%   -837%
> >  memcg_high          115,811      113,277     148,075     192,744
> >  memcg_swap_fail       2,386        2,425       2,204       2,215
> >  pswpin                   16           16           0           0
> >  pswpout           7,774,235    7,616,069           0           0
> >  zswpin                  728          749         760         902
> >  zswpout              38,424       39,022  10,010,017  13,193,554
> >  thp_swpout                0            0           0           0
> >  thp_swpout_               0            0           0           0
> >   fallback
> >  64kB-mthp_            2,386        2,425       2,204       2,215
> >   swpout_fallback
> >  pgmajfault            2,757        2,860       3,054       3,259
> >  ZSWPOUT-64kB            n/a          n/a     623,451     822,268
> >  SWPOUT-64kB         485,890      476,004           0           0
> >  -------------------------------------------------------------------------------
> >
> >
> >  2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> >  =======================================================
> >
> >  -------------------------------------------------------------------------------
> >                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
> >                         CONFIG_THP_SWAP=Y       CONFIG_THP_SWAP=Y       Baseline
> >                                  Baseline
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >                                       iaa                     iaa            iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)    24,347       35,971     169,404     141,168    596%   292%
> >  elapsed time (sec)    63.52        64.59       23.02       23.37     64%    64%
> >  sys time (sec)        27.91        27.01      613.26      677.83  -2098% -2410%
> >  memcg_high           13,576       13,467      17,374      23,890
> >  memcg_swap_fail         162          124         114         144
> >  pswpin                    0            0           0           0
> >  pswpout           7,003,307    7,168,853           0           0
> >  zswpin                  741          722         778         922
> >  zswpout              84,429       65,315  10,010,893  13,195,600
> >  thp_swpout           13,678       14,002           0           0
> >  thp_swpout_             162          124         114         144
> >   fallback
> >  2048kB-mthp_            n/a          n/a           0           0
> >   swpout_fallback
> >  pgmajfault            3,345        2,903       4,151       5,066
> >  ZSWPOUT-2048kB          n/a          n/a      19,442      25,615
> >  SWPOUT-2048kB        13,678       14,002           0           0
> >  -------------------------------------------------------------------------------
> >
> > We see significant improvements in throughput and elapsed time for zstd
> and
> > deflate-iaa, when comparing before (mTHP-SSD) vs. after (mTHP-ZSWAP).
> The
> > sys time increases with mTHP-ZSWAP as expected, due to the CPU
> compression
> > time vs. asynchronous disk write times, as pointed out by Ying and Yosry.
> >
> > In the "Before" scenario, when zswap does not store mTHP, only allocations
> > count towards the cgroup memory limit. However, in the "After" scenario,
> > with the introduction of zswap_store() mTHP, both, allocations as well as
> > the zswap compressed pool usage from all 70 processes are counted
> towards
> > the memory limit. As a result, we see higher swapout activity in the
> > "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> > charge leads to more frequent memory.high breaches.
> >
> > Summary:
> > ========
> > The v7 data presented above comparing zswap-mTHP with a conventional
> 823G
> > SSD swap demonstrates good performance improvements with zswap-
> mTHP. Hence,
> > it seems reasonable for zswap_store to support (m)THP, so that further
> > performance improvements can be implemented.
> >
> > Some of the ideas that have shown promise in our experiments are:
> >
> > 1) IAA compress/decompress batching.
> > 2) Distributing compress jobs across all IAA devices on the socket.
> >
> > In the experimental setup used in this patchset, we have enabled
> > IAA compress verification to ensure additional hardware data integrity CRC
> > checks not currently done by the software compressors. The tests run for
> > this patchset are also using only 1 IAA device per core, that avails of 2
> > compress engines on the device. In our experiments with IAA batching, we
> > distribute compress jobs from all cores to the 8 compress engines available
> > per socket. We further compress the pages in each mTHP in parallel in the
> > accelerator. As a result, we improve compress latency and reclaim
> > throughput.
> >
> > The following compares the same usemem workload characteristics
> between:
> >
> > 1) zstd (v7 experiments)
> > 2) deflate-iaa "Fixed mode" (v7 experiments)
> > 3) deflate-iaa with batching
> > 4) deflate-iaa-canned "Canned mode" [3] with batching
> >
> > vm.page-cluster is set to "2" for all runs.
> >
> > 64K mTHP ZSWAP:
> > ===============
> >
> >  -------------------------------------------------------------------------------
> >  ZSWAP            zstd   IAA Fixed   IAA Fixed  IAA Canned     IAA    IAA    IAA
> >  compressor       (v7)        (v7)  + Batching  + Batching   Batch Canned Canned
> >                                                                vs.    vs.  Batch
> >  64K mTHP                                                    Seqtl  Fixed    vs.
> >                                                                             ZSTD
> >  -------------------------------------------------------------------------------
> >  Throughput    153,550     129,609     156,215     166,975   21%     7%       9%
> >      (KB/s)
> >  elapsed time    23.90       25.19       22.46       21.38   11%     5%      11%
> >         (sec)
> >  sys time       757.70      731.13      715.62      648.83    2%     9%      14%
> >     (sec)
> >  memcg_high    148,075     192,744     197,548     181,734
> >  memcg_swap_     2,204       2,215       2,293       2,263
> >   fail
> >  pswpin              0           0           0           0
> >  pswpout             0           0           0           0
> >  zswpin            760         902         774         833
> >  zswpout    10,010,017  13,193,554  13,193,176  12,125,616
> >  thp_swpout          0           0           0           0
> >  thp_swpout_         0           0           0           0
> >   fallback
> >  64kB-mthp_      2,204       2,215       2,293       2,263
> >   swpout_
> >   fallback
> >  pgmajfault      3,054       3,259       3,545       3,516
> >  ZSWPOUT-64kB  623,451     822,268     822,176     755,480
> >  SWPOUT-64kB         0           0           0           0
> >  swap_ra           146         161         152         159
> >  swap_ra_hit        64         121          68          88
> >  -------------------------------------------------------------------------------
> >
> >
> > 2M THP ZSWAP:
> > =============
> >
> >  -------------------------------------------------------------------------------
> >  ZSWAP            zstd   IAA Fixed   IAA Fixed  IAA Canned     IAA    IAA    IAA
> >  compressor       (v7)        (v7)  + Batching  + Batching   Batch Canned Canned
> >                                                                vs.    vs.  Batch
> >  2M THP                                                      Seqtl  Fixed    vs.
> >                                                                             ZSTD
> >  -------------------------------------------------------------------------------
> >  Throughput    169,404     141,168     175,089     193,407     24%    10%    14%
> >      (KB/s)
> >  elapsed time    23.02       23.37       21.13       19.97     10%     5%    13%
> >         (sec)
> >  sys time       613.26      677.83      630.51      533.80      7%    15%    13%
> >     (sec)
> >  memcg_high     17,374      23,890      24,349      22,374
> >  memcg_swap_       114         144         102          88
> >   fail
> >  pswpin              0           0           0           0
> >  pswpout             0           0           0           0
> >  zswpin            778         922       6,492       6,642
> >  zswpout    10,010,893  13,195,600  13,199,907  12,132,265
> >  thp_swpout          0           0           0           0
> >  thp_swpout_       114         144         102          88
> >   fallback
> >  pgmajfault      4,151       5,066       5,032       4,999
> >  ZSWPOUT-2MB    19,442      25,615      25,666      23,594
> >  SWPOUT-2MB          0           0           0           0
> >  swap_ra             3           9       4,383       4,494
> >  swap_ra_hit         2           6       4,298       4,412
> >  -------------------------------------------------------------------------------
> >
> >
> > With ZSWAP IAA compress/decompress batching, we are able to
> demonstrate
> > significant performance improvements and memory savings in scalability
> > experiments under memory pressure, as compared to software
> compressors. We
> > hope to submit this work in subsequent patch series.
> 
> Honestly I would remove the detailed results of the followup series
> for batching, it should be enough to mention a single figure for
> further expected improvement from ongoing work that depends on this.

Definitely, will summarize the results of batching in the cover letter for v8.

Thanks,
Kanchana

> 
> >
> > Thanks,
> > Kanchana
> >
> > [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-
> ryan.roberts@arm.com/T/#u
> > [2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-
> ryan.roberts@arm.com/
> > [3] https://patchwork.kernel.org/project/linux-
> crypto/cover/cover.1710969449.git.andre.glover@linux.intel.com/
> >
> >
> > Kanchana P Sridhar (8):
> >   mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
> >   mm: zswap: Modify zswap_compress() to accept a page instead of a
> >     folio.
> >   mm: zswap: Refactor code to store an entry in zswap xarray.
> >   mm: zswap: Refactor code to delete stored offsets in case of errors.
> >   mm: zswap: Compress and store a specific page in a folio.
> >   mm: zswap: Support mTHP swapout in zswap_store().
> >   mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout
> >     stats.
> >   mm: Document the newly added mTHP zswpout stats, clarify swpout
> >     semantics.
> >
> >  Documentation/admin-guide/mm/transhuge.rst |   8 +-
> >  include/linux/huge_mm.h                    |   1 +
> >  include/linux/memcontrol.h                 |   4 +
> >  mm/Kconfig                                 |   8 +
> >  mm/huge_memory.c                           |   3 +
> >  mm/page_io.c                               |   1 +
> >  mm/zswap.c                                 | 248 ++++++++++++++++-----
> >  7 files changed, 210 insertions(+), 63 deletions(-)
> >
> >
> > base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
> > --
> > 2.27.0
> >

^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24 19:38   ` Yosry Ahmed
  2024-09-24 20:51     ` Nhat Pham
@ 2024-09-24 23:02     ` Sridhar, Kanchana P
  2024-09-25 13:40     ` Johannes Weiner
  2 siblings, 0 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-24 23:02 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, September 24, 2024 12:39 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > zswap_store() will now store mTHP and PMD-size THP folios by compressing
> > them page by page.
> >
> > This patch provides a sequential implementation of storing an mTHP in
> > zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > Towards this goal, zswap_compress() is modified to take a page instead
> > of a folio as input.
> >
> > Each page's swap offset is stored as a separate zswap entry.
> >
> > If an error is encountered during the store of any page in the mTHP,
> > all previous pages/entries stored will be invalidated. Thus, an mTHP
> > is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
> >
> > This forms the basis for building batching of pages during zswap store
> > of large folios by compressing batches of up to say, 8 pages in an
> > mTHP in parallel in hardware, with the Intel In-Memory Analytics
> > Accelerator (Intel IAA).
> >
> > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by
> default)
> > will enable/disable zswap storing of (m)THP. The corresponding tunable
> > zswap module parameter is "mthp_enabled".
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> >   "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-
> ryan.roberts@arm.com/T/#u
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/Kconfig |   8 ++++
> >  mm/zswap.c | 122 +++++++++++++++++++++++++----------------------------
> >  2 files changed, 66 insertions(+), 64 deletions(-)
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 09aebca1cae3..c659fb732ec4 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
> >           reducing the chance that cold pages will reside in the zswap pool
> >           and consume memory indefinitely.
> >
> > +config ZSWAP_STORE_THP_DEFAULT_ON
> > +       bool "Store mTHP and THP folios in zswap"
> > +       depends on ZSWAP
> > +       default n
> > +       help
> > +         If selected, zswap will process mTHP and THP folios by
> > +         compressing and storing each 4K page in the large folio.
> > +
> >  choice
> >         prompt "Default compressor"
> >         depends on ZSWAP
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 8f2e0ab34c84..16ab770546d6 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled =
> IS_ENABLED(
> >                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
> >  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool,
> 0644);
> >
> > +/*
> > + * Enable/disable zswap processing of mTHP folios.
> > + * For now, only zswap_store will process mTHP folios.
> > + */
> > +static bool zswap_mthp_enabled = IS_ENABLED(
> > +               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
> > +module_param_named(mthp_enabled, zswap_mthp_enabled, bool,
> 0644);
> > +
> >  bool zswap_is_enabled(void)
> >  {
> >         return zswap_enabled;
> > @@ -1471,9 +1479,9 @@ static void zswap_delete_stored_offsets(struct
> xarray *tree,
> >   * @objcg: The folio's objcg.
> >   * @pool:  The zswap_pool to store the compressed data for the page.
> >   */
> > -static bool __maybe_unused zswap_store_page(struct folio *folio, long
> index,
> > -                                           struct obj_cgroup *objcg,
> > -                                           struct zswap_pool *pool)
> > +static bool zswap_store_page(struct folio *folio, long index,
> > +                            struct obj_cgroup *objcg,
> > +                            struct zswap_pool *pool)
> 
> As I mentioned earlier, the patch that introduced zswap_store_page()
> should have directly used it in zswap_store(). This would make this
> patch much clearer.

Sure. I will fix this in v8.

> 
> >  {
> >         swp_entry_t swp = folio->swap;
> >         int type = swp_type(swp);
> > @@ -1551,51 +1559,63 @@ static bool __maybe_unused
> zswap_store_page(struct folio *folio, long index,
> >         return false;
> >  }
> >
> > +/*
> > + * Modified to store mTHP folios. Each page in the mTHP will be
> compressed
> > + * and stored sequentially.
> > + */
> >  bool zswap_store(struct folio *folio)
> >  {
> >         long nr_pages = folio_nr_pages(folio);
> >         swp_entry_t swp = folio->swap;
> >         pgoff_t offset = swp_offset(swp);
> >         struct xarray *tree = swap_zswap_tree(swp);
> > -       struct zswap_entry *entry;
> >         struct obj_cgroup *objcg = NULL;
> >         struct mem_cgroup *memcg = NULL;
> > +       struct zswap_pool *pool;
> > +       bool ret = false;
> > +       long index;
> >
> >         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> >
> > -       /* Large folios aren't supported */
> > -       if (folio_test_large(folio))
> > +       /* Storing large folios isn't enabled */
> 
> The comment is now stating the obvious, please remove it.

OK. I suppose this check will also no longer be needed, given that the
config knob is being removed.

> 
> > +       if (!zswap_mthp_enabled && folio_test_large(folio))
> >                 return false;
> >
> >         if (!zswap_enabled)
> > -               goto check_old;
> > +               goto reject;
> >
> > -       /* Check cgroup limits */
> > +       /*
> > +        * Check cgroup limits:
> > +        *
> > +        * The cgroup zswap limit check is done once at the beginning of an
> > +        * mTHP store, and not within zswap_store_page() for each page
> > +        * in the mTHP. We do however check the zswap pool limits at the
> > +        * start of zswap_store_page(). What this means is, the cgroup
> > +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > +        * However, the per-store-page zswap pool limits check should
> > +        * hopefully trigger the cgroup aware and zswap LRU aware global
> > +        * reclaim implemented in the shrinker. If this assumption holds,
> > +        * the cgroup exceeding the zswap limits could potentially be
> > +        * resolved before the next zswap_store, and if it is not, the next
> > +        * zswap_store would fail the cgroup zswap limit check at the start.
> > +        */
> 
> I do not really like this. Allowing going one page above the limit is
> one thing, but one THP above the limit seems too much. I also don't
> like relying on the repeated limit checking in zswap_store_page(), if
> anything I think that should be batched too.
> 
> Is it too unreasonable to maintain the average compression ratio and
> use that to estimate limit checking for both memcg and global limits?
> Johannes, Nhat, any thoughts on this?

I see that Nhat has responded. Hopefully we can discuss this
in the follow-up to Nhat's comments.
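
As a strawman for the ratio-based estimate, something along these lines
could be tried; zswap_avg_compressed_bytes is a hypothetical counter,
and the limit helpers are assumed to follow zswap's internal
zswap_total_pages()/zswap_max_pages():

	static bool zswap_estimate_fits(long nr_pages)
	{
		/*
		 * zswap_avg_compressed_bytes: hypothetical moving average
		 * of compressed bytes per stored page, updated after each
		 * successful compression.
		 */
		unsigned long avg = READ_ONCE(zswap_avg_compressed_bytes);
		unsigned long est_pages = (avg * nr_pages) >> PAGE_SHIFT;

		return zswap_total_pages() + est_pages <= zswap_max_pages();
	}

The same average could feed the cgroup-side check in
obj_cgroup_may_zswap().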

Thanks,
Kanchana

> 
> >         objcg = get_obj_cgroup_from_folio(folio);
> >         if (objcg && !obj_cgroup_may_zswap(objcg)) {
> >                 memcg = get_mem_cgroup_from_objcg(objcg);
> >                 if (shrink_memcg(memcg)) {
> >                         mem_cgroup_put(memcg);
> > -                       goto reject;
> > +                       goto put_objcg;
> >                 }
> >                 mem_cgroup_put(memcg);
> >         }
> >
> >         if (zswap_check_limits())
> > -               goto reject;
> > -
> > -       /* allocate entry */
> > -       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> > -       if (!entry) {
> > -               zswap_reject_kmemcache_fail++;
> > -               goto reject;
> > -       }
> > +               goto put_objcg;
> >
> > -       /* if entry is successfully added, it keeps the reference */
> > -       entry->pool = zswap_pool_current_get();
> > -       if (!entry->pool)
> > -               goto freepage;
> > +       pool = zswap_pool_current_get();
> > +       if (!pool)
> > +               goto put_objcg;
> >
> >         if (objcg) {
> >                 memcg = get_mem_cgroup_from_objcg(objcg);
> > @@ -1606,60 +1626,34 @@ bool zswap_store(struct folio *folio)
> >                 mem_cgroup_put(memcg);
> >         }
> >
> > -       if (!zswap_compress(&folio->page, entry))
> > -               goto put_pool;
> > -
> > -       entry->swpentry = swp;
> > -       entry->objcg = objcg;
> > -       entry->referenced = true;
> > -
> > -       if (!zswap_store_entry(tree, entry))
> > -               goto store_failed;
> > -
> > -       if (objcg) {
> > -               obj_cgroup_charge_zswap(objcg, entry->length);
> > -               count_objcg_event(objcg, ZSWPOUT);
> > -       }
> > -
> >         /*
> > -        * We finish initializing the entry while it's already in xarray.
> > -        * This is safe because:
> > -        *
> > -        * 1. Concurrent stores and invalidations are excluded by folio lock.
> > -        *
> > -        * 2. Writeback is excluded by the entry not being on the LRU yet.
> > -        *    The publishing order matters to prevent writeback from seeing
> > -        *    an incoherent entry.
> > +        * Store each page of the folio as a separate entry. If we fail to store
> > +        * a page, unwind by removing all the previous pages we stored.
> >          */
> > -       if (entry->length) {
> > -               INIT_LIST_HEAD(&entry->lru);
> > -               zswap_lru_add(&zswap_list_lru, entry);
> > +       for (index = 0; index < nr_pages; ++index) {
> > +               if (!zswap_store_page(folio, index, objcg, pool))
> > +                       goto put_pool;
> >         }
> >
> > -       /* update stats */
> > -       atomic_inc(&zswap_stored_pages);
> > -       count_vm_event(ZSWPOUT);
> > -
> > -       return true;
> > +       ret = true;
> >
> > -store_failed:
> > -       zpool_free(entry->pool->zpool, entry->handle);
> >  put_pool:
> > -       zswap_pool_put(entry->pool);
> > -freepage:
> > -       zswap_entry_cache_free(entry);
> > -reject:
> > +       zswap_pool_put(pool);
> > +put_objcg:
> >         obj_cgroup_put(objcg);
> >         if (zswap_pool_reached_full)
> >                 queue_work(shrink_wq, &zswap_shrink_work);
> > -check_old:
> > +reject:
> >         /*
> > -        * If the zswap store fails or zswap is disabled, we must invalidate the
> > -        * possibly stale entry which was previously stored at this offset.
> > -        * Otherwise, writeback could overwrite the new data in the swapfile.
> > +        * If the zswap store fails or zswap is disabled, we must invalidate
> > +        * the possibly stale entries which were previously stored at the
> > +        * offsets corresponding to each page of the folio. Otherwise,
> > +        * writeback could overwrite the new data in the swapfile.
> >          */
> > -       zswap_delete_stored_offsets(tree, offset, nr_pages);
> > -       return false;
> > +       if (!ret)
> > +               zswap_delete_stored_offsets(tree, offset, nr_pages);
> > +
> > +       return ret;
> >  }
> >
> >  bool zswap_load(struct folio *folio)
> > --
> > 2.27.0
> >

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24 21:38       ` Yosry Ahmed
@ 2024-09-24 23:11         ` Nhat Pham
  2024-09-25  0:05           ` Sridhar, Kanchana P
  2024-09-25  0:52           ` Yosry Ahmed
  0 siblings, 2 replies; 79+ messages in thread
From: Nhat Pham @ 2024-09-24 23:11 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
	vinodh.gopal, joshua.hahnjy

On Tue, Sep 24, 2024 at 2:38 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
>
> We can also do what we discussed before about double charging. The
> pages that are being reclaimed are already charged, so technically we
> don't need to charge them again. We can uncharge the difference
> between compressed and uncompressed sizes after compression and call
> it a day. This fixes the limit checking and the double charging in one
> go.
> I am a little bit nervous though about zswap uncharing the pages from
> under reclaim, there are likely further accesses of the page memcg
> after zswap. Maybe we can plumb the info back to reclaim or set a flag
> on the page to avoid uncharging it when it's freed.

Hmm this is just for memory usage charging, no? The problem here is
the zswap usage (zswap.current), and its relation to the limit.

One thing we can do is check the zswap usage against the limit for
every subpage, but that's likely expensive...?

With the new atomic counters Joshua is working on, we can
check-and-charge at the same time, after we have compressed the whole
large folio, like this:

for (memcg = original_memcg; !mem_cgroup_is_root(memcg);
     memcg = parent_mem_cgroup(memcg)) {
     old_usage = atomic_read(&memcg->zswap);

     do {
        new_usage = old_usage + size;
        if (new_usage > limit) {
           /* undo charging of descendants, then return false */
        }
     } while (!atomic_try_cmpxchg(&memcg->zswap, &old_usage, new_usage));
}

But I don't know what we can do in the current design. I gave it some
more thought, and even if we only check after we know the size, we can
still potentially overshoot the limit :(


^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24 20:51     ` Nhat Pham
  2024-09-24 21:38       ` Yosry Ahmed
@ 2024-09-24 23:21       ` Sridhar, Kanchana P
  1 sibling, 0 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-24 23:21 UTC (permalink / raw)
  To: Nhat Pham, Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, chengming.zhou, usamaarif642,
	shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou,
	Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, September 24, 2024 1:51 PM
> To: Yosry Ahmed <yosryahmed@google.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Tue, Sep 24, 2024 at 12:39 PM Yosry Ahmed <yosryahmed@google.com>
> wrote:
> >
> > On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> > > +        * The cgroup zswap limit check is done once at the beginning of an
> > > +        * mTHP store, and not within zswap_store_page() for each page
> > > +        * in the mTHP. We do however check the zswap pool limits at the
> > > +        * start of zswap_store_page(). What this means is, the cgroup
> > > +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > > +        * However, the per-store-page zswap pool limits check should
> > > +        * hopefully trigger the cgroup aware and zswap LRU aware global
> > > +        * reclaim implemented in the shrinker. If this assumption holds,
> > > +        * the cgroup exceeding the zswap limits could potentially be
> > > +        * resolved before the next zswap_store, and if it is not, the next
> > > +        * zswap_store would fail the cgroup zswap limit check at the start.
> > > +        */
> >
> > I do not really like this. Allowing going one page above the limit is
> > one thing, but one THP above the limit seems too much. I also don't
> 
> Hmm what if you have multiple concurrent zswap stores, from different
> tasks but the same cgroup? If none of them has charged, they would all
> get greenlit, and charge towards the cgroup...
> 
> So technically the zswap limit checking is already best-effort only.
> But now, instead of one page per violation, it's 512 pages per
> violation :)
> 
> Yeah this can be bad. I think this is only safe if you only use
> zswap.max as a binary knob (0 or max)...
> 
> > like relying on the repeated limit checking in zswap_store_page(), if
> > anything I think that should be batched too.
> >
> > Is it too unreasonable to maintain the average compression ratio and
> > use that to estimate limit checking for both memcg and global limits?
> > Johannes, Nhat, any thoughts on this?
> 
> I remember asking about this, but past Nhat might have relented :)
> 
> https://lore.kernel.org/linux-mm/CAKEwX=PfAMZ2qJtwKwJsVx3TZWxV5z2ZaU1Epk1UD=DBdMsjFA@mail.gmail.com/
> 
> We can do limit checking and charging after compression is done, but
> that's a lot of code change (might not even be possible)... It will,
> however, allow us to do charging + checking in one go (rather than
> doing it 8, 16, or 512 times)
> 
> Another thing we can do is to register a zswap writeback after the
> zswap store attempts to clean up excess capacity. Not sure what will
> happen if zswap writeback is disabled for the cgroup though :)
> 
> If it's too hard, the average estimate could be a decent compromise,
> until we figure something smarter.

Thanks Yosry and Nhat for these insights. This is how I was viewing
this scenario: I thought of doing the zswap_pool_get() and limit checks
incrementally (per subpage store), followed by shrinker invocations in
case of error conditions, to allow different concurrent stores to make
progress without favoring only one process's mTHP store based on there
being enough zpool space available (e.g., based on a compression ratio
estimate).

Besides simplicity and no added overhead in the regular cases, I was
thinking this approach would have minimal impact on the process(es)
that see the zswap limit being exceeded, and that it would be better
than preemptively checking for the entire mTHP and failing. Preemptive
checks could also lead to a situation where no one makes progress,
because multiple processes run the batch checks and fail, when
realistically one or more of them could have triggered the shrinker
before erroring out, and at least a few could have made progress.

Another potential solution could be based on experimental data on mTHP
swapout failures for a given setup: say, reducing the zswap zpool
max-limit and/or the acceptance threshold?

Would appreciate your suggestions on how to proceed as far as
the limit checks are concerned.

Thanks,
Kanchana

^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24 23:11         ` Nhat Pham
@ 2024-09-25  0:05           ` Sridhar, Kanchana P
  2024-09-25  0:52           ` Yosry Ahmed
  1 sibling, 0 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-25  0:05 UTC (permalink / raw)
  To: Nhat Pham, Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, chengming.zhou, usamaarif642,
	shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou,
	Nanhai, Feghali, Wajdi K, Gopal, Vinodh, joshua.hahnjy, Sridhar,
	Kanchana P

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, September 24, 2024 4:11 PM
> To: Yosry Ahmed <yosryahmed@google.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>;
> joshua.hahnjy@gmail.com
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Tue, Sep 24, 2024 at 2:38 PM Yosry Ahmed <yosryahmed@google.com>
> wrote:
> >
> >
> > We can also do what we discussed before about double charging. The
> > pages that are being reclaimed are already charged, so technically we
> > don't need to charge them again. We can uncharge the difference
> > between compressed and uncompressed sizes after compression and call
> > it a day. This fixes the limit checking and the double charging in one
> > go.
> > I am a little bit nervous though about zswap uncharging the pages from
> > under reclaim, there are likely further accesses of the page memcg
> > after zswap. Maybe we can plumb the info back to reclaim or set a flag
> > on the page to avoid uncharging it when it's freed.
> 
> Hmm this is just for memory usage charging, no? The problem here is
> the zswap usage (zswap.current), and its relation to the limit.
> 
> One thing we can do is check the zswap usage against the limit for
> every subpage, but that's likely expensive...?

This is the approach currently implemented in v7.
The data gathered doesn't indicate a performance issue with this
specific workload in the two scenarios validated, namely
zswap-4K vs. zswap-mTHP and SSD-mTHP vs. zswap-mTHP (we only
see performance gains, with an explainable sys time increase).

Of course, the existing implementation could be a baseline for
validating the performance of other approaches, e.g., checking the
zswap usage once per mTHP. However, these other approaches would also
need to be evaluated for broader multi-instance implications, as far
as all processes being able to make progress is concerned.

> 
> With the new atomic counters Joshua is working on, we can
> check-and-charge at the same time, after we have compressed the whole
> large folio, like this:
> 
> for (memcg = original_memcg; !mem_cgroup_is_root(memcg);
>      memcg = parent_mem_cgroup(memcg)) {
>      old_usage = atomic_read(&memcg->zswap);
>
>      do {
>         new_usage = old_usage + size;
>         if (new_usage > limit) {
>            /* undo charging of descendants, then return false */
>         }
>      } while (!atomic_try_cmpxchg(&memcg->zswap, &old_usage, new_usage));
> }
> 
> But I don't know what we can do in the current design. I gave it some
> more thought, and even if we only check after we know the size, we can
> still potentially overshoot the limit :(

I agree. Moreover, these checks based on estimated ratio or compressed size
could also add overhead in the normal case where we are not near the usage
limits.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors.
  2024-09-24 22:32     ` Sridhar, Kanchana P
@ 2024-09-25  0:43       ` Yosry Ahmed
  2024-09-25  1:18         ` Sridhar, Kanchana P
  2024-09-25 14:11         ` Johannes Weiner
  0 siblings, 2 replies; 79+ messages in thread
From: Yosry Ahmed @ 2024-09-25  0:43 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh

On Tue, Sep 24, 2024 at 3:33 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@google.com>
> > Sent: Tuesday, September 24, 2024 12:20 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored
> > offsets in case of errors.
> >
> > On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > > Added a new procedure zswap_delete_stored_offsets() that can be
> > > called to delete stored offsets in a folio in case zswap_store()
> > > fails or zswap is disabled.
> >
> > I don't see the value in this helper. It will get called in one place
> > AFAICT, and it is a bit inconsistent that we have to explicitly loop
> > in zswap_store() to store pages, but the loop to delete pages upon
> > failure is hidden in the helper.
> >
> > I am not against adding a trivial zswap_tree_delete() helper (or
> > similar) that calls xa_erase() and  zswap_entry_free() to match
> > zswap_tree_store() if you prefer that.
>
> This is a good point. I had refactored this routine in the context
> of my code that does batching, where the same loop over the mTHP's
> subpages would get called in multiple error-condition cases.
>
> I am thinking it might make sense for, say, zswap_tree_delete()
> to take a "folio" and "tree" and encapsulate deleting all stored offsets
> for that folio. Since we have already done the computation for finding
> the "tree", having it as an input parameter is mainly for latency, but if
> it is cleaner to have "zswap_tree_delete(struct folio *folio)", that should
> be OK too. Please let me know your suggestion on this.
>

What I meant is "zswap_tree_delete(struct xarray *tree, pgoff_t
offset)", and to loop and call this in zswap_store(). This would be
consistent with how we loop and call zswap_store_page().

But we can keep the helper as-is actually and just rename it to
zswap_tree_delete() and move the loop inside. No strong preference.
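
Either way, the helper would be something like this (sketch, reusing
the existing xa_erase()/zswap_entry_free() pair):

    static void zswap_tree_delete(struct xarray *tree, pgoff_t offset,
                                  long nr_pages)
    {
            struct zswap_entry *entry;
            long i;

            for (i = 0; i < nr_pages; i++) {
                    entry = xa_erase(tree, offset + i);
                    if (entry)
                            zswap_entry_free(entry);
            }
    }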


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio.
  2024-09-24 22:45     ` Sridhar, Kanchana P
@ 2024-09-25  0:47       ` Yosry Ahmed
  2024-09-25  1:49         ` Sridhar, Kanchana P
  0 siblings, 1 reply; 79+ messages in thread
From: Yosry Ahmed @ 2024-09-25  0:47 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh

[..]
> >
> > > +{
> > > +       swp_entry_t swp = folio->swap;
> > > +       int type = swp_type(swp);
> > > +       pgoff_t offset = swp_offset(swp) + index;
> > > +       struct page *page = folio_page(folio, index);
> > > +       struct xarray *tree = swap_zswap_tree(swp);
> > > +       struct zswap_entry *entry;
> > > +
> > > +       if (objcg)
> > > +               obj_cgroup_get(objcg);
> > > +
> > > +       if (zswap_check_limits())
> > > +               goto reject;
> > > +
> > > +       /* allocate entry */
> > > +       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> > > +       if (!entry) {
> > > +               zswap_reject_kmemcache_fail++;
> > > +               goto reject;
> > > +       }
> > > +
> > > +       /* if entry is successfully added, it keeps the reference */
> > > +       if (!zswap_pool_get(pool))
> > > +               goto freepage;
> >
> > I think we can batch this for all pages in zswap_store(), maybe first
> > add zswap_pool_get_many().
> >
> > I am also wondering if it would be better to batch the limit checking
> > and allocating the entries, to front load any failures before we start
> > compression. Not sure if that's overall better though.
> >
> > To batch allocate entries we will have to also allocate an array to
> > hold them. To batch the limit checking we will have to either allow
> > going further over limit for mTHPs, or check if there is enough
> > clearance to allow for compressing all the pages. Using the
> > uncompressed size will lead to false negatives though, so maybe we can
> > start tracking the average compression ratio for better limit
> > checking.
> >
> > Nhat, Johannes, any thoughts here? I need someone to tell me if I am
> > overthinking this :)
>
> These are all good points. I suppose I was thinking along the same lines
> of what Nhat mentioned in an earlier comment. I was trying the
> incremental zswap_pool_get() and limit checks and shrinker invocations
> in case of (all) error conditions to allow different concurrent stores to make
> progress, without favoring only one process's mTHP store. I was thinking
> this would have minimal impact on the process(es) that see the zswap
> limit being exceeded, and that this would be better than preemptively
> checking for the entire mTHP and failing (this could also complicate things
> where no one makes progress because multiple processes run the batch
> checks and fail, when realistically one/many could have triggered
> the shrinker before erroring out, and at least one could have made
> progress).

On the other hand, if we allow concurrent mTHP swapouts to do limit
checks incrementally, they may all fail at the last page, while if
they all do limit checks beforehand, one of them can proceed.

I think we need to agree on a higher-level strategy for limit
checking, both global and per-memcg. The per-memcg limit should be
stricter though, so we may end up having different policies.

>
> Would appreciate your perspectives on how this should be handled,
> and will implement a solution in v8 accordingly.
>
> Thanks,
> Kanchana
>
> >
> > > +
> > > +       entry->pool = pool;
> > > +
> > > +       if (!zswap_compress(page, entry))
> > > +               goto put_pool;
> > > +
> > > +       entry->swpentry = swp_entry(type, offset);
> > > +       entry->objcg = objcg;
> > > +       entry->referenced = true;
> > > +
> > > +       if (!zswap_store_entry(tree, entry))
> > > +               goto store_failed;
> > > +
> > > +       if (objcg) {
> > > +               obj_cgroup_charge_zswap(objcg, entry->length);
> > > +               count_objcg_event(objcg, ZSWPOUT);
> > > +       }
> > > +
> > > +       /*
> > > +        * We finish initializing the entry while it's already in xarray.
> > > +        * This is safe because:
> > > +        *
> > > +        * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > +        *
> > > +        * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > +        *    The publishing order matters to prevent writeback from seeing
> > > +        *    an incoherent entry.
> > > +        */
> > > +       if (entry->length) {
> > > +               INIT_LIST_HEAD(&entry->lru);
> > > +               zswap_lru_add(&zswap_list_lru, entry);
> > > +       }
> > > +
> > > +       /* update stats */
> > > +       atomic_inc(&zswap_stored_pages);
> > > +       count_vm_event(ZSWPOUT);
> >
> > We should probably also batch updating the stats. It actually seems
> > like now we don't handle rolling them back upon failure.
>
> Good point! I assume you are referring only to the "ZSWPOUT" vm event stats
> updates and not the "zswap_stored_pages" (since the latter is used in limit checking)?

I actually meant both. Do we rollback changes to zswap_stored_pages
when some stores succeed and some of them fail?

I think it's more correct and efficient to update the atomic once
after all the pages are successfully compressed and stored.
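
Something like this at the very end of zswap_store(), once all
nr_pages subpages have been stored (sketch; count_objcg_events() is a
hypothetical batched variant of count_objcg_event()):

    atomic_add(nr_pages, &zswap_stored_pages);
    count_vm_events(ZSWPOUT, nr_pages);
    if (objcg)
            count_objcg_events(objcg, ZSWPOUT, nr_pages);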


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24 23:11         ` Nhat Pham
  2024-09-25  0:05           ` Sridhar, Kanchana P
@ 2024-09-25  0:52           ` Yosry Ahmed
  1 sibling, 0 replies; 79+ messages in thread
From: Yosry Ahmed @ 2024-09-25  0:52 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
	vinodh.gopal, joshua.hahnjy

On Tue, Sep 24, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Sep 24, 2024 at 2:38 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> >
> > We can also do what we discussed before about double charging. The
> > pages that are being reclaimed are already charged, so technically we
> > don't need to charge them again. We can uncharge the difference
> > between compressed and uncompressed sizes after compression and call
> > it a day. This fixes the limit checking and the double charging in one
> > go.
> > I am a little bit nervous though about zswap uncharging the pages from
> > under reclaim, there are likely further accesses of the page memcg
> > after zswap. Maybe we can plumb the info back to reclaim or set a flag
> > on the page to avoid uncharging it when it's freed.
>
> Hmm this is just for memory usage charging, no? The problem here is
> the zswap usage (zswap.current), and its relation to the limit.
>
> One thing we can do is check the zswap usage against the limit for
> every subpage, but that's likely expensive...?

Ah yes, I totally missed this.

>
> With the new atomic counters Joshua is working on, we can
> check-and-charge at the same time, after we have compressed the whole
> large folio, like this:
>
> for (memcg = original_memcg; !mem_cgroup_is_root(memcg);
>      memcg = parent_mem_cgroup(memcg)) {
>      old_usage = atomic_read(&memcg->zswap);
>
>      do {
>         new_usage = old_usage + size;
>         if (new_usage > limit) {
>            /* undo charging of descendants, then return false */
>         }
>      } while (!atomic_try_cmpxchg(&memcg->zswap, &old_usage, new_usage));
> }
>
> But I don't know what we can do in the current design. I gave it some
> more thought, and even if we only check after we know the size, we can
> still potentially overshoot the limit :(

Yeah it's difficult because if we check the limit before compressing,
we have to estimate the compressed size or check using the
uncompressed size. If we wait until after compression we will either
overshoot the limit or free the compressed page and fallback to swap.

Maybe a good compromise is to do the check before compression with an
estimate based on historical compression ratio, and then do the actual
charging after the compression and allow overshooting, hopefully it's
not too much if our estimate is good. We can also improve this later
by adding a backoff mechanism where we make more conservative
estimates the more we overshoot the limit.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors.
  2024-09-25  0:43       ` Yosry Ahmed
@ 2024-09-25  1:18         ` Sridhar, Kanchana P
  2024-09-25 14:11         ` Johannes Weiner
  1 sibling, 0 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-25  1:18 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, September 24, 2024 5:43 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored
> offsets in case of errors.
> 
> On Tue, Sep 24, 2024 at 3:33 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosryahmed@google.com>
> > > Sent: Tuesday, September 24, 2024 12:20 PM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > hannes@cmpxchg.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev;
> > > usamaarif642@gmail.com; shakeel.butt@linux.dev;
> ryan.roberts@arm.com;
> > > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored
> > > offsets in case of errors.
> > >
> > > On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> > > <kanchana.p.sridhar@intel.com> wrote:
> > > >
> > > > Added a new procedure zswap_delete_stored_offsets() that can be
> > > > called to delete stored offsets in a folio in case zswap_store()
> > > > fails or zswap is disabled.
> > >
> > > I don't see the value in this helper. It will get called in one place
> > > AFAICT, and it is a bit inconsistent that we have to explicitly loop
> > > in zswap_store() to store pages, but the loop to delete pages upon
> > > failure is hidden in the helper.
> > >
> > > I am not against adding a trivial zswap_tree_delete() helper (or
> > > similar) that calls xa_erase() and  zswap_entry_free() to match
> > > zswap_tree_store() if you prefer that.
> >
> > This is a good point. I had refactored this routine in the context
> > of my code that does batching and the same loop over the mTHP's
> > subpages would get called in multiple error condition cases.
> >
> > I am thinking it might probably make sense for say zswap_tree_delete()
> > to take a "folio" and "tree" and encapsulate deleting all stored offsets
> > for that folio. Since we have already done the computes for finding the
> > "tree", having that as an input parameter is mainly for latency, but if
> > it is cleaner to have "zswap_tree_delete(struct folio *folio)", that should
> > be Ok too. Please let me know your suggestion on this.
> >
> 
> What I meant is "zswap_tree_delete(struct xarray *tree, pgoff_t
> offset)", and to loop and call this in zswap_store(). This would be
> consistent with how we loop and call zswap_store_page().
> 
> But we can keep the helper as-is actually and just rename it to
> zswap_tree_delete() and move the loop inside. No strong preference.

Ok, sounds good.

Thanks,
Kanchana

^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio.
  2024-09-25  0:47       ` Yosry Ahmed
@ 2024-09-25  1:49         ` Sridhar, Kanchana P
  2024-09-25 13:53           ` Johannes Weiner
  0 siblings, 1 reply; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-25  1:49 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, September 24, 2024 5:47 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 5/8] mm: zswap: Compress and store a specific page
> in a folio.
> 
> [..]
> > >
> > > > +{
> > > > +       swp_entry_t swp = folio->swap;
> > > > +       int type = swp_type(swp);
> > > > +       pgoff_t offset = swp_offset(swp) + index;
> > > > +       struct page *page = folio_page(folio, index);
> > > > +       struct xarray *tree = swap_zswap_tree(swp);
> > > > +       struct zswap_entry *entry;
> > > > +
> > > > +       if (objcg)
> > > > +               obj_cgroup_get(objcg);
> > > > +
> > > > +       if (zswap_check_limits())
> > > > +               goto reject;
> > > > +
> > > > +       /* allocate entry */
> > > > +       entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> > > > +       if (!entry) {
> > > > +               zswap_reject_kmemcache_fail++;
> > > > +               goto reject;
> > > > +       }
> > > > +
> > > > +       /* if entry is successfully added, it keeps the reference */
> > > > +       if (!zswap_pool_get(pool))
> > > > +               goto freepage;
> > >
> > > I think we can batch this for all pages in zswap_store(), maybe first
> > > add zswap_pool_get_many().
> > >
> > > I am also wondering if it would be better to batch the limit checking
> > > and allocating the entries, to front load any failures before we start
> > > compression. Not sure if that's overall better though.
> > >
> > > To batch allocate entries we will have to also allocate an array to
> > > hold them. To batch the limit checking we will have to either allow
> > > going further over limit for mTHPs, or check if there is enough
> > > clearance to allow for compressing all the pages. Using the
> > > uncompressed size will lead to false negatives though, so maybe we can
> > > start tracking the average compression ratio for better limit
> > > checking.
> > >
> > > Nhat, Johannes, any thoughts here? I need someone to tell me if I am
> > > overthinking this :)
> >
> > These are all good points. I suppose I was thinking along the same lines
> > of what Nhat mentioned in an earlier comment. I was trying the
> > incremental zswap_pool_get() and limit checks and shrinker invocations
> > in case of (all) error conditions to allow different concurrent stores to make
> > progress, without favoring only one process's mTHP store. I was thinking
> > this would have minimal impact on the process(es) that see the zswap
> > limit being exceeded, and that this would be better than preemptively
> > checking for the entire mTHP and failing (this could also complicate things
> > where no one makes progress because multiple processes run the batch
> > checks and fail, when realistically one/many could have triggered
> > the shrinker before erroring out, and at least one could have made
> > progress).
> 
> On the other hand, if we allow concurrent mTHP swapouts to do limit
> checks incrementally, they may all fail at the last page, while if
> they all do limit checks beforehand, one of them can proceed.

Yes, this is possible too. Although, given the dynamic nature of the
usage, even with a check-before-store strategy for mTHP we could end
up in a situation similar to the optimistic approach, in which we
allow progress until there really is a reason to fail.

> 
> I think we need to agree on a higher-level strategy for limit
> checking, both global and per-memcg. The per-memcg limit should be
> stricter though, so we may end up having different policies.

Sure, this makes sense. One possibility is that we could allow zswap to
follow the "optimistic approach" used currently, while we manage
the limit checking at the memcg level. Something along the lines
of mem_cgroup_handle_over_high(), which gets called every time after
a page-fault is handled, but which instead checks the cgroup's zswap
usage and triggers writeback? This seems like one way of not adding
overhead to the reclaim path (zswap will store the mTHP until the limit
checking causes an error and unwinding of state), while triggering
zswap-LRU based writeback at a higher level to manage the limit.
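
A very rough sketch of that direction, with every name below made up
purely for illustration (this is not an existing API):

    /*
     * Analogous to mem_cgroup_handle_over_high(): runs on return to
     * userspace after a fault, instead of in the reclaim path.
     */
    void mem_cgroup_handle_zswap_over_limit(void)
    {
            struct mem_cgroup *memcg;

            memcg = get_mem_cgroup_from_mm(current->mm);
            if (memcg_zswap_usage(memcg) > memcg_zswap_limit(memcg))
                    zswap_writeback_memcg_lru(memcg);
            mem_cgroup_put(memcg);
    }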

> 
> >
> > Would appreciate your perspectives on how this should be handled,
> > and will implement a solution in v8 accordingly.
> >
> > Thanks,
> > Kanchana
> >
> > >
> > > > +
> > > > +       entry->pool = pool;
> > > > +
> > > > +       if (!zswap_compress(page, entry))
> > > > +               goto put_pool;
> > > > +
> > > > +       entry->swpentry = swp_entry(type, offset);
> > > > +       entry->objcg = objcg;
> > > > +       entry->referenced = true;
> > > > +
> > > > +       if (!zswap_store_entry(tree, entry))
> > > > +               goto store_failed;
> > > > +
> > > > +       if (objcg) {
> > > > +               obj_cgroup_charge_zswap(objcg, entry->length);
> > > > +               count_objcg_event(objcg, ZSWPOUT);
> > > > +       }
> > > > +
> > > > +       /*
> > > > +        * We finish initializing the entry while it's already in xarray.
> > > > +        * This is safe because:
> > > > +        *
> > > > +        * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > > +        *
> > > > +        * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > > +        *    The publishing order matters to prevent writeback from seeing
> > > > +        *    an incoherent entry.
> > > > +        */
> > > > +       if (entry->length) {
> > > > +               INIT_LIST_HEAD(&entry->lru);
> > > > +               zswap_lru_add(&zswap_list_lru, entry);
> > > > +       }
> > > > +
> > > > +       /* update stats */
> > > > +       atomic_inc(&zswap_stored_pages);
> > > > +       count_vm_event(ZSWPOUT);
> > >
> > > We should probably also batch updating the stats. It actually seems
> > > like now we don't handle rolling them back upon failure.
> >
> > Good point! I assume you are referring only to the "ZSWPOUT" vm event stats
> > updates and not the "zswap_stored_pages" (since the latter is used in limit
> > checking)?
> 
> I actually meant both. Do we rollback changes to zswap_stored_pages
> when some stores succeed and some of them fail?

Yes we do. zswap_tree_delete() calls zswap_entry_free() which will
decrement zswap_stored_pages. The only stat that is left in an incorrect
state in this case is the vmstat 'zswpout'.

> 
> I think it's more correct and efficient to update the atomic once
> after all the pages are successfully compressed and stored.

Actually, this would need to correlate with the limit-checking strategy,
because the atomic is used there and needs to be as accurate as possible.

As far as the vmstat 'zswpout' is concerned, the reason I left it as-is
in my patchset was to be more indicative of the actual zswpout compute
events that occurred (for things like getting the compressions count),
regardless of whether or not the overall mTHP store was successful. If
this vmstat needs to reflect only successful zswpout events (i.e.,
represent the zswap usage), I can fix it by updating it once, only if
the mTHP is stored successfully.

Thanks,
Kanchana

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
  2024-09-24  1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
                   ` (8 preceding siblings ...)
  2024-09-24 19:34 ` [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Yosry Ahmed
@ 2024-09-25  6:35 ` Huang, Ying
  2024-09-25 18:39   ` Sridhar, Kanchana P
  9 siblings, 1 reply; 79+ messages in thread
From: Huang, Ying @ 2024-09-25  6:35 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	21cnbao, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal

Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:

[snip]

>
> Case 1: Comparing zswap 4K vs. zswap mTHP
> =========================================
>
> In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in
> 64K/2M (m)THP to be split into 4K folios that get processed by zswap.
>
> The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that results
> in 64K/2M (m)THP to not be split, and processed by zswap.
>
>  64KB mTHP (cgroup memory.high set to 40G):
>  ==========================================
>
>  -------------------------------------------------------------------------------
>                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
>                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
>                                  Baseline
>  -------------------------------------------------------------------------------
>  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>                                       iaa                     iaa            iaa
>  -------------------------------------------------------------------------------
>  Throughput (KB/s)   143,323      125,485     153,550     129,609    7%       3%
>  elapsed time (sec)    24.97        25.42       23.90       25.19    4%       1%
>  sys time (sec)       822.72       750.96      757.70      731.13    8%       3%
>  memcg_high          132,743      169,825     148,075     192,744
>  memcg_swap_fail     639,067      841,553       2,204       2,215
>  pswpin                    0            0           0           0
>  pswpout                   0            0           0           0
>  zswpin                  795          873         760         902
>  zswpout          10,011,266   13,195,137  10,010,017  13,193,554
>  thp_swpout                0            0           0           0
>  thp_swpout_               0            0           0           0
>   fallback
>  64kB-mthp_          639,065      841,553       2,204       2,215
>   swpout_fallback
>  pgmajfault            2,861        2,924       3,054       3,259
>  ZSWPOUT-64kB            n/a          n/a     623,451     822,268
>  SWPOUT-64kB               0            0           0           0
>  -------------------------------------------------------------------------------
>

IIUC, the throughput is the sum of the throughputs of all usemem processes?

One possible issue with the usemem test case is the "imbalance" issue.
That is, some usemem processes may swap-out/swap-in less, so their score
is very high, while some other processes may swap-out/swap-in more, so
their score is very low.  Sometimes the total score decreases, but the
scores of the usemem processes are more balanced, in which case the
performance should be considered better.  In general, we should make the
usemem scores balanced among processes via, say, a longer test time.
Can you check this in your test results?

[snip]

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24 19:38   ` Yosry Ahmed
  2024-09-24 20:51     ` Nhat Pham
  2024-09-24 23:02     ` Sridhar, Kanchana P
@ 2024-09-25 13:40     ` Johannes Weiner
  2024-09-25 18:30       ` Yosry Ahmed
  2 siblings, 1 reply; 79+ messages in thread
From: Johannes Weiner @ 2024-09-25 13:40 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Kanchana P Sridhar, linux-kernel, linux-mm, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
	vinodh.gopal

On Tue, Sep 24, 2024 at 12:38:32PM -0700, Yosry Ahmed wrote:
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > zswap_store() will now store mTHP and PMD-size THP folios by compressing
> > them page by page.
> >
> > This patch provides a sequential implementation of storing an mTHP in
> > zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > Towards this goal, zswap_compress() is modified to take a page instead
> > of a folio as input.
> >
> > Each page's swap offset is stored as a separate zswap entry.
> >
> > If an error is encountered during the store of any page in the mTHP,
> > all previous pages/entries stored will be invalidated. Thus, an mTHP
> > is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
> >
> > This forms the basis for building batching of pages during zswap store
> > of large folios by compressing batches of up to say, 8 pages in an
> > mTHP in parallel in hardware, with the Intel In-Memory Analytics
> > Accelerator (Intel IAA).
> >
> > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> > will enable/disable zswap storing of (m)THP. The corresponding tunable
> > zswap module parameter is "mthp_enabled".
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> >   "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/Kconfig |   8 ++++
> >  mm/zswap.c | 122 +++++++++++++++++++++++++----------------------------
> >  2 files changed, 66 insertions(+), 64 deletions(-)
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 09aebca1cae3..c659fb732ec4 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
> >           reducing the chance that cold pages will reside in the zswap pool
> >           and consume memory indefinitely.
> >
> > +config ZSWAP_STORE_THP_DEFAULT_ON
> > +       bool "Store mTHP and THP folios in zswap"
> > +       depends on ZSWAP
> > +       default n
> > +       help
> > +         If selected, zswap will process mTHP and THP folios by
> > +         compressing and storing each 4K page in the large folio.
> > +
> >  choice
> >         prompt "Default compressor"
> >         depends on ZSWAP
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 8f2e0ab34c84..16ab770546d6 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
> >                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
> >  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
> >
> > +/*
> > + * Enable/disable zswap processing of mTHP folios.
> > + * For now, only zswap_store will process mTHP folios.
> > + */
> > +static bool zswap_mthp_enabled = IS_ENABLED(
> > +               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
> > +module_param_named(mthp_enabled, zswap_mthp_enabled, bool, 0644);
> > +
> >  bool zswap_is_enabled(void)
> >  {
> >         return zswap_enabled;
> > @@ -1471,9 +1479,9 @@ static void zswap_delete_stored_offsets(struct xarray *tree,
> >   * @objcg: The folio's objcg.
> >   * @pool:  The zswap_pool to store the compressed data for the page.
> >   */
> > -static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
> > -                                           struct obj_cgroup *objcg,
> > -                                           struct zswap_pool *pool)
> > +static bool zswap_store_page(struct folio *folio, long index,
> > +                            struct obj_cgroup *objcg,
> > +                            struct zswap_pool *pool)
> 
> As I mentioned earlier, the patch that introduced zswap_store_page()
> should have directly used it in zswap_store(). This would make this
> patch much clearer.
> 
> >  {
> >         swp_entry_t swp = folio->swap;
> >         int type = swp_type(swp);
> > @@ -1551,51 +1559,63 @@ static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
> >         return false;
> >  }
> >
> > +/*
> > + * Modified to store mTHP folios. Each page in the mTHP will be compressed
> > + * and stored sequentially.
> > + */
> >  bool zswap_store(struct folio *folio)
> >  {
> >         long nr_pages = folio_nr_pages(folio);
> >         swp_entry_t swp = folio->swap;
> >         pgoff_t offset = swp_offset(swp);
> >         struct xarray *tree = swap_zswap_tree(swp);
> > -       struct zswap_entry *entry;
> >         struct obj_cgroup *objcg = NULL;
> >         struct mem_cgroup *memcg = NULL;
> > +       struct zswap_pool *pool;
> > +       bool ret = false;
> > +       long index;
> >
> >         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> >
> > -       /* Large folios aren't supported */
> > -       if (folio_test_large(folio))
> > +       /* Storing large folios isn't enabled */
> 
> The comment is now stating the obvious, please remove it.
> 
> > +       if (!zswap_mthp_enabled && folio_test_large(folio))
> >                 return false;
> >
> >         if (!zswap_enabled)
> > -               goto check_old;
> > +               goto reject;
> >
> > -       /* Check cgroup limits */
> > +       /*
> > +        * Check cgroup limits:
> > +        *
> > +        * The cgroup zswap limit check is done once at the beginning of an
> > +        * mTHP store, and not within zswap_store_page() for each page
> > +        * in the mTHP. We do however check the zswap pool limits at the
> > +        * start of zswap_store_page(). What this means is, the cgroup
> > +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > +        * However, the per-store-page zswap pool limits check should
> > +        * hopefully trigger the cgroup aware and zswap LRU aware global
> > +        * reclaim implemented in the shrinker. If this assumption holds,
> > +        * the cgroup exceeding the zswap limits could potentially be
> > +        * resolved before the next zswap_store, and if it is not, the next
> > +        * zswap_store would fail the cgroup zswap limit check at the start.
> > +        */
> 
> I do not really like this. Allowing going one page above the limit is
> one thing, but one THP above the limit seems too much. I also don't
> like relying on the repeated limit checking in zswap_store_page(), if
> anything I think that should be batched too.
> 
> Is it too unreasonable to maintain the average compression ratio and
> use that to estimate limit checking for both memcg and global limits?
> Johannes, Nhat, any thoughts on this?

I honestly don't think it's much of an issue. The global limit is
huge, and the cgroup limit is to the best of my knowledge only used as
a binary switch. Setting a non-binary limit - global or cgroup - seems
like a bit of an obscure usecase to me, because in the vast majority
of cases it's preferable to keep compressing over declaring OOM.

And even if you do have some granular limit, the workload size scales
with it. It's not like you have a thousand THPs in a 10M cgroup.

If this ever becomes an issue, we can handle it in a fastpath-slowpath
scheme: check the limit up front for fast-path failure if we're
already maxed out, just like now; then make obj_cgroup_charge_zswap()
atomically charge against zswap.max and unwind the store if we raced.

For now, I would just keep the simple version we currently have: check
once in zswap_store() and then just go ahead for the whole folio.
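
For the record, the slow-path charge could be a simple try-charge loop
along these lines (sketch only; the real zswap.max accounting would go
through the memcg counters rather than a bare atomic):

    static bool zswap_try_charge(atomic_long_t *usage, long limit, long size)
    {
            long old = atomic_long_read(usage), new;

            do {
                    new = old + size;
                    if (new > limit)
                            return false;  /* raced over the limit: caller unwinds */
            } while (!atomic_long_try_cmpxchg(usage, &old, new));

            return true;
    }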


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio.
  2024-09-25  1:49         ` Sridhar, Kanchana P
@ 2024-09-25 13:53           ` Johannes Weiner
  2024-09-25 18:45             ` Sridhar, Kanchana P
  0 siblings, 1 reply; 79+ messages in thread
From: Johannes Weiner @ 2024-09-25 13:53 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Yosry Ahmed, linux-kernel, linux-mm, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh

On Wed, Sep 25, 2024 at 01:49:03AM +0000, Sridhar, Kanchana P wrote:
> > From: Yosry Ahmed <yosryahmed@google.com>
> > I think it's more correct and efficient to update the atomic once
> > after all the pages are successfully compressed and stored.
> 
> Actually this would need to co-relate with the limits checking strategy,
> because the atomic is used there and needs to be as accurate as possible.

For the limit checks, we use the zpool counters, not zswap_stored_pages.

zswap_stored_pages is used in the zswap shrinker to guesstimate
pressure, so it's likely a good thing to only count entries that are
expected to stay, and not account the ones that might fail just yet.

> As far as the vmstat 'zswpout', the reason I left it as-is in my patchset
> was to be more indicative of the actual zswpout compute events that
> occurred (for things like getting the compressions count), regardless
> of whether or not the overall mTHP store was successful. If this vmstat
> needs to reflect only successful zswpout events (i.e., represent the zswap
> usage), I can fix it by updating it once only if the mTHP is stored successfully.

Yeah, that's fine as well.

I would suggest batching them both at the end of zswap_store().


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors.
  2024-09-25  0:43       ` Yosry Ahmed
  2024-09-25  1:18         ` Sridhar, Kanchana P
@ 2024-09-25 14:11         ` Johannes Weiner
  2024-09-25 18:45           ` Sridhar, Kanchana P
  1 sibling, 1 reply; 79+ messages in thread
From: Johannes Weiner @ 2024-09-25 14:11 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Sridhar, Kanchana P, linux-kernel, linux-mm, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang,
	Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal,
	Vinodh

On Tue, Sep 24, 2024 at 05:43:22PM -0700, Yosry Ahmed wrote:
> What I meant is "zswap_tree_delete(struct xarray *tree, pgoff_t
> offset)", and loop and call this  in zswap_store(). This would be
> consistent on looping and calling zswap_store_page().
> 
> But we can keep the helper as-is actually and just rename it to
> zswap_tree_delete() and move the loop inside. No strong preference.

Both helpers seem unnecessary.

zswap_tree_store() is not called in a loop directly. It's called from
zswap_store_page(), which is essentially what zswap_store() is now,
and that was fine with the open-coded insert.

zswap_tree_delete() just hides what's going on. zswap_store() has the
for-loop to store the subpages, so it makes sense it has the for loop
for unwinding on rejection as well. This makes it easier on the reader
to match up attempt and unwind.

Please just drop both.
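
I.e., unwind in zswap_store() itself, right next to the store loop
(sketch, assuming subpages [0, index) made it into the tree before the
failure):

    for (i = 0; i < index; i++) {
            entry = xa_erase(tree, offset + i);
            if (entry)
                    zswap_entry_free(entry);
    }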


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24  1:17 ` [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store() Kanchana P Sridhar
  2024-09-24 17:33   ` Nhat Pham
  2024-09-24 19:38   ` Yosry Ahmed
@ 2024-09-25 14:27   ` Johannes Weiner
  2024-09-25 18:17     ` Yosry Ahmed
  2024-09-25 18:48     ` Sridhar, Kanchana P
  2 siblings, 2 replies; 79+ messages in thread
From: Johannes Weiner @ 2024-09-25 14:27 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
	akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal

On Mon, Sep 23, 2024 at 06:17:07PM -0700, Kanchana P Sridhar wrote:
> zswap_store() will now store mTHP and PMD-size THP folios by compressing

The hugepage terminology throughout the patches is a bit convoluted.

There is no real distinction in this code between PMD-size THPs and
sub-PMD-sized mTHPs, for example. In particular, I think "mTHP" made
sense when they were added, to distinguish them from conventional
THPs. But using
this term going forward just causes confusion, IMO.

We're going through a big effort in the codebase to call all of these
things simply "folios" - which stands for "one or more pages". If you
want to emphasize the "more than one page", the convention is to call
it a "large folio". (If you need to emphasize that it's PMD size -
which doesn't apply to these patches, but just for the record - the
convention is "pmd-mappable folio".)

So what this patch set does is "support large folios in zswap".

> @@ -1551,51 +1559,63 @@ static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
>  	return false;
>  }
>  
> +/*
> + * Modified to store mTHP folios. Each page in the mTHP will be compressed
> + * and stored sequentially.
> + */

This is a changelog, not a code comment ;) Please delete it.

>  bool zswap_store(struct folio *folio)
>  {
>  	long nr_pages = folio_nr_pages(folio);
>  	swp_entry_t swp = folio->swap;
>  	pgoff_t offset = swp_offset(swp);
>  	struct xarray *tree = swap_zswap_tree(swp);
> -	struct zswap_entry *entry;
>  	struct obj_cgroup *objcg = NULL;
>  	struct mem_cgroup *memcg = NULL;
> +	struct zswap_pool *pool;
> +	bool ret = false;
> +	long index;
>  
>  	VM_WARN_ON_ONCE(!folio_test_locked(folio));
>  	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
>  
> -	/* Large folios aren't supported */
> -	if (folio_test_large(folio))
> +	/* Storing large folios isn't enabled */
> +	if (!zswap_mthp_enabled && folio_test_large(folio))
>  		return false;
>  
>  	if (!zswap_enabled)
> -		goto check_old;
> +		goto reject;
>  
> -	/* Check cgroup limits */
> +	/*
> +	 * Check cgroup limits:
> +	 *
> +	 * The cgroup zswap limit check is done once at the beginning of an
> +	 * mTHP store, and not within zswap_store_page() for each page
> +	 * in the mTHP. We do however check the zswap pool limits at the

Use "folio" and "large folio" as appropriate here and throughout.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-25 14:27   ` Johannes Weiner
@ 2024-09-25 18:17     ` Yosry Ahmed
  2024-09-25 18:48     ` Sridhar, Kanchana P
  1 sibling, 0 replies; 79+ messages in thread
From: Yosry Ahmed @ 2024-09-25 18:17 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Kanchana P Sridhar, linux-kernel, linux-mm, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
	vinodh.gopal

On Wed, Sep 25, 2024 at 7:27 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Sep 23, 2024 at 06:17:07PM -0700, Kanchana P Sridhar wrote:
> > zswap_store() will now store mTHP and PMD-size THP folios by compressing
>
> The hugepage terminology throughout the patches is a bit convoluted.
>
> There is no real distinction in this code between PMD-size THPs and
> sub-PMD-sized mTHPs, for example. In particular, I think "mTHP" made
> sense when they were added, to distinguish them from conventional
> THPs. But using
> this term going forward just causes confusion, IMO.
>
> We're going through a big effort in the codebase to call all of these
> things simply "folios" - which stands for "one or more pages". If you
> want to emphasize the "more than one page", the convention is to call
> it a "large folio". (If you need to emphasize that it's PMD size -
> which doesn't apply to these patches, but just for the record - the
> convention is "pmd-mappable folio".)
>
> So what this patch set does is "support large folios in zswap".

Agreed on all of this, except it should be "support large folios in
zswap _stores". We don't really support loading large folios.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-25 13:40     ` Johannes Weiner
@ 2024-09-25 18:30       ` Yosry Ahmed
  2024-09-25 19:10         ` Sridhar, Kanchana P
  2024-09-25 19:20         ` Johannes Weiner
  0 siblings, 2 replies; 79+ messages in thread
From: Yosry Ahmed @ 2024-09-25 18:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Kanchana P Sridhar, linux-kernel, linux-mm, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
	vinodh.gopal

[..]
> > > +       /*
> > > +        * Check cgroup limits:
> > > +        *
> > > +        * The cgroup zswap limit check is done once at the beginning of an
> > > +        * mTHP store, and not within zswap_store_page() for each page
> > > +        * in the mTHP. We do however check the zswap pool limits at the
> > > +        * start of zswap_store_page(). What this means is, the cgroup
> > > +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > > +        * However, the per-store-page zswap pool limits check should
> > > +        * hopefully trigger the cgroup aware and zswap LRU aware global
> > > +        * reclaim implemented in the shrinker. If this assumption holds,
> > > +        * the cgroup exceeding the zswap limits could potentially be
> > > +        * resolved before the next zswap_store, and if it is not, the next
> > > +        * zswap_store would fail the cgroup zswap limit check at the start.
> > > +        */
> >
> > I do not really like this. Allowing going one page above the limit is
> > one thing, but one THP above the limit seems too much. I also don't
> > like relying on the repeated limit checking in zswap_store_page(), if
> > anything I think that should be batched too.
> >
> > Is it too unreasonable to maintain the average compression ratio and
> > use that to estimate limit checking for both memcg and global limits?
> > Johannes, Nhat, any thoughts on this?
>
> I honestly don't think it's much of an issue. The global limit is
> huge, and the cgroup limit is to the best of my knowledge only used as
> a binary switch. Setting a non-binary limit - global or cgroup - seems
> like a bit of an obscure usecase to me, because in the vast majority
> of cases it's preferable to keep compressing over declaring OOM.
>
> And even if you do have some granular limit, the workload size scales
> with it. It's not like you have a thousand THPs in a 10M cgroup.

The memcg limit and zswap limit can be disproportionate, although that
shouldn't be common.

>
> If this ever becomes an issue, we can handle it in a fastpath-slowpath
> scheme: check the limit up front for fast-path failure if we're
> already maxed out, just like now; then make obj_cgroup_charge_zswap()
> atomically charge against zswap.max and unwind the store if we raced.
>
> For now, I would just keep the simple version we currently have: check
> once in zswap_store() and then just go ahead for the whole folio.

I am not totally against this, but I feel like this is too optimistic.
I think we can keep it simple-ish by maintaining an EWMA for the
compression ratio; we already have primitives for this (see
DECLARE_EWMA).

Then in zswap_store(), we can use the ewma to estimate the compressed
size and use it to do the memcg and global limit checks once, like we
do today. Instead of just checking if we are below the limits, we
check if we have enough headroom for the estimated compressed size.
Then we call zswap_store_page() to do the per-page stuff, then do
batched charging and stats updates.

If you think that's overkill, we can keep doing the limit checks as we
do today, but I would still like to see batching of all the limit
checks, charging, and stats updates. It makes little sense otherwise.
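
For illustration, with DECLARE_EWMA that could look roughly like this
(names made up; tracking average compressed bytes per page rather than
the ratio itself avoids fixed-point math):

    #include <linux/average.h>

    DECLARE_EWMA(zswap_csize, 4, 8);        /* precision 4, weight 8 */
    static struct ewma_zswap_csize zswap_avg_csize;

    /* ewma_zswap_csize_init(&zswap_avg_csize) runs once at init. */

    /* After each successful zswap_compress(): */
    ewma_zswap_csize_add(&zswap_avg_csize, entry->length);

    /* In zswap_store(), before compressing a large folio: */
    estimate = nr_pages * ewma_zswap_csize_read(&zswap_avg_csize);
    if (!zswap_has_headroom(estimate))      /* hypothetical headroom check */
            goto reject;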


^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
  2024-09-25  6:35 ` Huang, Ying
@ 2024-09-25 18:39   ` Sridhar, Kanchana P
  2024-09-26  0:44     ` Huang, Ying
  0 siblings, 1 reply; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-25 18:39 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P

> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Tuesday, September 24, 2024 11:35 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
> 
> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:
> 
> [snip]
> 
> >
> > Case 1: Comparing zswap 4K vs. zswap mTHP
> > =========================================
> >
> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in
> > 64K/2M (m)THP to be split into 4K folios that get processed by zswap.
> >
> > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that
> results
> > in 64K/2M (m)THP to not be split, and processed by zswap.
> >
> >  64KB mTHP (cgroup memory.high set to 40G):
> >  ==========================================
> >
> >  -------------------------------------------------------------------------------
> >                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
> >                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
> >                                  Baseline
> >  -------------------------------------------------------------------------------
> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >                                       iaa                     iaa            iaa
> >  -------------------------------------------------------------------------------
> >  Throughput (KB/s)   143,323      125,485     153,550     129,609    7%       3%
> >  elapsed time (sec)    24.97        25.42       23.90       25.19    4%       1%
> >  sys time (sec)       822.72       750.96      757.70      731.13    8%       3%
> >  memcg_high          132,743      169,825     148,075     192,744
> >  memcg_swap_fail     639,067      841,553       2,204       2,215
> >  pswpin                    0            0           0           0
> >  pswpout                   0            0           0           0
> >  zswpin                  795          873         760         902
> >  zswpout          10,011,266   13,195,137  10,010,017  13,193,554
> >  thp_swpout                0            0           0           0
> >  thp_swpout_               0            0           0           0
> >   fallback
> >  64kB-mthp_          639,065      841,553       2,204       2,215
> >   swpout_fallback
> >  pgmajfault            2,861        2,924       3,054       3,259
> >  ZSWPOUT-64kB            n/a          n/a     623,451     822,268
> >  SWPOUT-64kB               0            0           0           0
> >  -------------------------------------------------------------------------------
> >
> 
> IIUC, the throughput is the sum of throughput of all usemem processes?
> 
> One possible issue of usemem test case is the "imbalance" issue.  That
> is, some usemem processes may swap-out/swap-in less, so the score is
> very high; while some other processes may swap-out/swap-in more, so the
> score is very low.  Sometimes, the total score decreases, but the scores
> of usemem processes are more balanced, so that the performance should be
> considered better.  And, in general, we should make usemem score
> balanced among processes via say longer test time.  Can you check this
> in your test results?

Actually, the throughput data listed in the cover-letter is the average of
all the usemem processes. Your observation about the "imbalance" issue is
right. Some processes see a higher throughput than others. I have noticed
that the throughputs progressively reduce as the individual processes exit
and print their stats.

Listed below are the stats from two runs of usemem70: sleep 10 and sleep 30.
Both are run with a cgroup mem-limit of 40G. Data is with v7, 64K folios are
enabled, zswap uses zstd. 


-----------------------------------------------
               sleep 10           sleep 30
      Throughput (KB/s)  Throughput (KB/s)
 -----------------------------------------------
                181,540            191,686
                179,651            191,459
                179,068            188,834
                177,244            187,568
                177,215            186,703
                176,565            185,584
                176,546            185,370
                176,470            185,021
                176,214            184,303
                176,128            184,040
                175,279            183,932
                174,745            180,831
                173,935            179,418
                161,546            168,014
                160,332            167,540
                160,122            167,364
                159,613            167,020
                159,546            166,590
                159,021            166,483
                158,845            166,418
                158,426            166,264
                158,396            166,066
                158,371            165,944
                158,298            165,866
                158,250            165,884
                158,057            165,533
                158,011            165,532
                157,899            165,457
                157,894            165,424
                157,839            165,410
                157,731            165,407
                157,629            165,273
                157,626            164,867
                157,581            164,636
                157,471            164,266
                157,430            164,225
                157,287            163,290
                156,289            153,597
                153,970            147,494
                148,244            147,102
                142,907            146,111
                142,811            145,789
                139,171            141,168
                136,314            140,714
                133,616            140,111
                132,881            139,636
                132,729            136,943
                132,680            136,844
                132,248            135,726
                132,027            135,384
                131,929            135,270
                131,766            134,748
                131,667            134,733
                131,576            134,582
                131,396            134,302
                131,351            134,160
                131,135            134,102
                130,885            134,097
                130,854            134,058
                130,767            134,006
                130,666            133,960
                130,647            133,894
                130,152            133,837
                130,006            133,747
                129,921            133,679
                129,856            133,666
                129,377            133,564
                128,366            133,331
                127,988            132,938
                126,903            132,746
 -----------------------------------------------
      sum    10,526,916         10,919,561
  average       150,385            155,994
   stddev        17,551             19,633
 -----------------------------------------------
    elapsed       24.40              43.66
 time (sec)
   sys time      806.25             766.05
      (sec)
    zswpout  10,008,713         10,008,407
  64K folio     623,463            623,629
     swpout
 -----------------------------------------------

As we increase the time for which allocations are maintained,
there seems to be a slight improvement in throughput, but the
variance increases as well. The processes with lower throughput
could be the ones that handle the memcg being over limit by
doing reclaim, possibly before they can allocate.

Interestingly, the longer test time does seem to reduce the amount
of reclaim (hence lower sys time), but more 64K large folios seem to
be reclaimed. Could this mean that with longer test time (sleep 30),
more cold memory residing in large folios is getting reclaimed, as
against memory just relinquished by the exiting processes?

Thanks,
Kanchana

> 
> [snip]
> 
> --
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio.
  2024-09-25 13:53           ` Johannes Weiner
@ 2024-09-25 18:45             ` Sridhar, Kanchana P
  0 siblings, 0 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-25 18:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Yosry Ahmed, linux-kernel, linux-mm, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P

> -----Original Message-----
> From: Johannes Weiner <hannes@cmpxchg.org>
> Sent: Wednesday, September 25, 2024 6:54 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Yosry Ahmed <yosryahmed@google.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 5/8] mm: zswap: Compress and store a specific page
> in a folio.
> 
> On Wed, Sep 25, 2024 at 01:49:03AM +0000, Sridhar, Kanchana P wrote:
> > > From: Yosry Ahmed <yosryahmed@google.com>
> > > I think it's more correct and efficient to update the atomic once
> > > after all the pages are successfully compressed and stored.
> >
> > Actually this would need to correlate with the limits checking strategy,
> > because the atomic is used there and needs to be as accurate as possible.
> 
> For the limit checks, we use the zpool counters, not zswap_stored_pages.

Thanks Johannes for your insights and comments. Yes, you are absolutely
right. My apologies.

> 
> zswap_stored_pages is used in the zswap shrinker to guesstimate
> pressure, so it's likely a good thing to only count entries that are
> expected to stay, and not account the ones that might fail just yet.

Sure, makes sense.

> 
> > As far as the vmstat 'zswpout', the reason I left it as-is in my patchset
> > was to be more indicative of the actual zswpout compute events that
> > occurred (for things like getting the compressions count), regardless
> > of whether or not the overall mTHP store was successful. If this vmstat
> > needs to reflect only successful zswpout events (i.e., represent the zswap
> > usage), I can fix it by updating it once only if the mTHP is stored successfully.
> 
> Yeah, that's fine as well.
> 
> I would suggest batching them both at the end of zswap_store().

Ok, will do so in v8.

Thanks,
Kanchana



^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors.
  2024-09-25 14:11         ` Johannes Weiner
@ 2024-09-25 18:45           ` Sridhar, Kanchana P
  0 siblings, 0 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-25 18:45 UTC (permalink / raw)
  To: Johannes Weiner, Yosry Ahmed
  Cc: linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642,
	shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou,
	Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P

> -----Original Message-----
> From: Johannes Weiner <hannes@cmpxchg.org>
> Sent: Wednesday, September 25, 2024 7:11 AM
> To: Yosry Ahmed <yosryahmed@google.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored
> offsets in case of errors.
> 
> On Tue, Sep 24, 2024 at 05:43:22PM -0700, Yosry Ahmed wrote:
> > What I meant is "zswap_tree_delete(struct xarray *tree, pgoff_t
> > offset)", and loop and call this  in zswap_store(). This would be
> > consistent on looping and calling zswap_store_page().
> >
> > But we can keep the helper as-is actually and just rename it to
> > zswap_tree_delete() and move the loop inside. No strong preference.
> 
> Both helpers seem unnecessary.
> 
> zswap_tree_store() is not called in a loop directly. It's called from
> zswap_store_page(), which is essentially what zswap_store() is now,
> and that was fine with the open-coded insert.
> 
> zswap_tree_delete() just hides what's going on. zswap_store() has the
> for-loop to store the subpages, so it makes sense it has the for loop
> for unwinding on rejection as well. This makes it easier on the reader
> to match up attempt and unwind.
> 
> Please just drop both.

Ok, sounds good.
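
For my own reference, the open-coded unwind in zswap_store() could
look roughly like this (sketch only; nr_stored is assumed to count
the subpages stored before the failure, and is not from the actual
patch):

	/*
	 * Unwind on rejection: erase the offsets this folio already
	 * stored, mirroring the store loop above.
	 */
	swp_entry_t swp = folio->swap;
	pgoff_t offset = swp_offset(swp);
	struct xarray *tree = swap_zswap_tree(swp);
	long i;

	for (i = 0; i < nr_stored; i++) {
		struct zswap_entry *entry = xa_erase(tree, offset + i);

		if (entry)
			zswap_entry_free(entry);
	}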



^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-25 14:27   ` Johannes Weiner
  2024-09-25 18:17     ` Yosry Ahmed
@ 2024-09-25 18:48     ` Sridhar, Kanchana P
  1 sibling, 0 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-25 18:48 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P

> -----Original Message-----
> From: Johannes Weiner <hannes@cmpxchg.org>
> Sent: Wednesday, September 25, 2024 7:28 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Mon, Sep 23, 2024 at 06:17:07PM -0700, Kanchana P Sridhar wrote:
> > zswap_store() will now store mTHP and PMD-size THP folios by compressing
> 
> The hugepage terminology throughout the patches is a bit convoluted.
> 
> There is no real distinction in this code between PMD-size THPs and
> sub-PMD-sized mTHPs e.g. In particular, I think "mTHP" made sense when
> they were added, to distinguish them from conventional THPs. But using
> this term going forward just causes confusion, IMO.
> 
> We're going through a big effort in the codebase to call all of these
> things simply "folios" - which stands for "one or more pages". If you
> want to emphasize the "more than one page", the convention is to call
> it a "large folio". (If you need to emphasize that it's PMD size -
> which doesn't apply to these patches, but just for the record - the
> convention is "pmd-mappable folio".)
> 
> So what this patch set does is "support large folios in zswap".

Sure. Will modify this to be "support large folios in zswap _stores"
as per Yosry's follow-up clarification.

> 
> > @@ -1551,51 +1559,63 @@ static bool __maybe_unused
> zswap_store_page(struct folio *folio, long index,
> >  	return false;
> >  }
> >
> > +/*
> > + * Modified to store mTHP folios. Each page in the mTHP will be
> compressed
> > + * and stored sequentially.
> > + */
> 
> This is a changelog, not a code comment ;) Please delete it.

Ok, sure.

> 
> >  bool zswap_store(struct folio *folio)
> >  {
> >  	long nr_pages = folio_nr_pages(folio);
> >  	swp_entry_t swp = folio->swap;
> >  	pgoff_t offset = swp_offset(swp);
> >  	struct xarray *tree = swap_zswap_tree(swp);
> > -	struct zswap_entry *entry;
> >  	struct obj_cgroup *objcg = NULL;
> >  	struct mem_cgroup *memcg = NULL;
> > +	struct zswap_pool *pool;
> > +	bool ret = false;
> > +	long index;
> >
> >  	VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >  	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> >
> > -	/* Large folios aren't supported */
> > -	if (folio_test_large(folio))
> > +	/* Storing large folios isn't enabled */
> > +	if (!zswap_mthp_enabled && folio_test_large(folio))
> >  		return false;
> >
> >  	if (!zswap_enabled)
> > -		goto check_old;
> > +		goto reject;
> >
> > -	/* Check cgroup limits */
> > +	/*
> > +	 * Check cgroup limits:
> > +	 *
> > +	 * The cgroup zswap limit check is done once at the beginning of an
> > +	 * mTHP store, and not within zswap_store_page() for each page
> > +	 * in the mTHP. We do however check the zswap pool limits at the
> 
> Use "folio" and "large folio" as appropriate here and throughout.

Sounds good.

Thanks,
Kanchana



^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-25 18:30       ` Yosry Ahmed
@ 2024-09-25 19:10         ` Sridhar, Kanchana P
  2024-09-25 19:49           ` Yosry Ahmed
  2024-09-25 19:20         ` Johannes Weiner
  1 sibling, 1 reply; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-25 19:10 UTC (permalink / raw)
  To: Yosry Ahmed, Johannes Weiner
  Cc: linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642,
	shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou,
	Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, September 25, 2024 11:31 AM
> To: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> [..]
> > > > +       /*
> > > > +        * Check cgroup limits:
> > > > +        *
> > > > +        * The cgroup zswap limit check is done once at the beginning of an
> > > > +        * mTHP store, and not within zswap_store_page() for each page
> > > > +        * in the mTHP. We do however check the zswap pool limits at the
> > > > +        * start of zswap_store_page(). What this means is, the cgroup
> > > > +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > > > +        * However, the per-store-page zswap pool limits check should
> > > > +        * hopefully trigger the cgroup aware and zswap LRU aware global
> > > > +        * reclaim implemented in the shrinker. If this assumption holds,
> > > > +        * the cgroup exceeding the zswap limits could potentially be
> > > > +        * resolved before the next zswap_store, and if it is not, the next
> > > > +        * zswap_store would fail the cgroup zswap limit check at the start.
> > > > +        */
> > >
> > > I do not really like this. Allowing going one page above the limit is
> > > one thing, but one THP above the limit seems too much. I also don't
> > > like relying on the repeated limit checking in zswap_store_page(), if
> > > anything I think that should be batched too.
> > >
> > > Is it too unreasonable to maintain the average compression ratio and
> > > use that to estimate limit checking for both memcg and global limits?
> > > Johannes, Nhat, any thoughts on this?
> >
> > I honestly don't think it's much of an issue. The global limit is
> > huge, and the cgroup limit is to the best of my knowledge only used as
> > a binary switch. Setting a non-binary limit - global or cgroup - seems
> > like a bit of an obscure usecase to me, because in the vast majority
> > of cases it's preferable to keep compressing over declaring OOM.
> >
> > And even if you do have some granular limit, the workload size scales
> > with it. It's not like you have a thousand THPs in a 10M cgroup.
> 
> The memcg limit and zswap limit can be disproportionate, although that
> shouldn't be common.
> 
> >
> > If this ever becomes an issue, we can handle it in a fastpath-slowpath
> > scheme: check the limit up front for fast-path failure if we're
> > already maxed out, just like now; then make obj_cgroup_charge_zswap()
> > atomically charge against zswap.max and unwind the store if we raced.
> >
> > For now, I would just keep the simple version we currently have: check
> > once in zswap_store() and then just go ahead for the whole folio.
> 
> I am not totally against this but I feel like this is too optimistic.
> I think we can keep it simple-ish by maintaining an ewma for the
> compression ratio, we already have primitives for this (see
> DECLARE_EWMA).
> 
> Then in zswap_store(), we can use the ewma to estimate the compressed
> size and use it to do the memcg and global limit checks once, like we
> do today. Instead of just checking if we are below the limits, we
> check if we have enough headroom for the estimated compressed size.
> Then we call zswap_store_page() to do the per-page stuff, then do
> batched charging and stats updates.
> 
> If you think that's an overkill we can keep doing the limit checks as
> we do today,
> but I would still like to see batching of all the limit checks,
> charging, and stats updates. It makes little sense otherwise.

Thanks Johannes and Yosry for these suggestions and pointers.
I think there is general agreement about the batch charging and
zswap_stored_pages/stats updates. Yosry, does "batching of limit
checks" mean the same as a simple check for being over the cgroup
limit at the start of zswap_store and not doing this check in
zswap_store_page? Does this also imply a zswap_pool_get_many()?
Would appreciate it if you can help clarify.

The main question in my mind about using the EWMA checks is
whether it would add overhead to the normal zswap reclaim path; and
if so, whether a simple limit check at the start of zswap_store, as
suggested by Johannes, would suffice. I can run a few experiments to
quantify this overhead, and maybe we can revisit this?

Thanks,
Kanchana


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-25 18:30       ` Yosry Ahmed
  2024-09-25 19:10         ` Sridhar, Kanchana P
@ 2024-09-25 19:20         ` Johannes Weiner
  2024-09-25 19:39           ` Yosry Ahmed
  1 sibling, 1 reply; 79+ messages in thread
From: Johannes Weiner @ 2024-09-25 19:20 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Kanchana P Sridhar, linux-kernel, linux-mm, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
	vinodh.gopal

On Wed, Sep 25, 2024 at 11:30:34AM -0700, Yosry Ahmed wrote:
> Johannes wrote:
> > If this ever becomes an issue, we can handle it in a fastpath-slowpath
> > scheme: check the limit up front for fast-path failure if we're
> > already maxed out, just like now; then make obj_cgroup_charge_zswap()
> > atomically charge against zswap.max and unwind the store if we raced.
> >
> > For now, I would just keep the simple version we currently have: check
> > once in zswap_store() and then just go ahead for the whole folio.
> 
> I am not totally against this but I feel like this is too optimistic.
> I think we can keep it simple-ish by maintaining an ewma for the
> compression ratio, we already have primitives for this (see
> DECLARE_EWMA).
> 
> Then in zswap_store(), we can use the ewma to estimate the compressed
> size and use it to do the memcg and global limit checks once, like we
> do today. Instead of just checking if we are below the limits, we
> check if we have enough headroom for the estimated compressed size.
> Then we call zswap_store_page() to do the per-page stuff, then do
> batched charging and stats updates.

I'm not sure what you gain from making a non-atomic check precise. You
can get a hundred threads determining down precisely that *their*
store will fit exactly into the last 800kB before the limit.

> If you think that's an overkill we can keep doing the limit checks as
> we do today,

I just don't see how it would make a practical difference.

What would make a difference is atomic transactional charging of the
compressed size, and unwinding on failure - with the upfront check to
avoid pointlessly compressing (outside of race conditions).

And I'm not against doing that in general, I am just against doing it
per default.
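
To illustrate the charge-and-unwind idea (hypothetical sketch only;
the byte counter and limit below are invented, and the real zswap.max
accounting works differently):

	/*
	 * Atomically charge "size" bytes against a limit, failing
	 * instead of overshooting, so the caller can unwind the store
	 * if it raced past the limit.
	 */
	static bool zswap_try_charge_bytes(atomic_long_t *charged,
					   long limit, long size)
	{
		long old = atomic_long_read(charged);

		do {
			if (old + size > limit)
				return false;	/* raced past the limit */
		} while (!atomic_long_try_cmpxchg(charged, &old,
						  old + size));

		return true;
	}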

It's a lot of complexity, and like I said, the practical usecase for
limiting zswap memory to begin with is quite unclear to me. Zswap is
not a limited resource. It's just memory. And you already had the
memory for the uncompressed copy. So it's a bit strange to me to say
"you have compressed your memory enough, so now you get sent to disk
(or we declare OOM)". What would be a reason to limit it?

It sort of makes sense as a binary switch, but I don't get the usecase
for a granular limit. (And I blame my own cowardice for making the
cgroup knob a limit, to keep options open, instead of a switch.)

All that to say, this would be better in a follow-up patch. We allow
overshooting now, it's not clear how overshooting by a larger amount
makes a categorical difference.

> but I would still like to see batching of all the limit checks,
> charging, and stats updates. It makes little sense otherwise.

Definitely. One check, one charge, one stat update per folio.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-25 19:20         ` Johannes Weiner
@ 2024-09-25 19:39           ` Yosry Ahmed
  2024-09-25 20:13             ` Johannes Weiner
  0 siblings, 1 reply; 79+ messages in thread
From: Yosry Ahmed @ 2024-09-25 19:39 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Kanchana P Sridhar, linux-kernel, linux-mm, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
	vinodh.gopal

On Wed, Sep 25, 2024 at 12:20 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Sep 25, 2024 at 11:30:34AM -0700, Yosry Ahmed wrote:
> > Johannes wrote:
> > > If this ever becomes an issue, we can handle it in a fastpath-slowpath
> > > scheme: check the limit up front for fast-path failure if we're
> > > already maxed out, just like now; then make obj_cgroup_charge_zswap()
> > > atomically charge against zswap.max and unwind the store if we raced.
> > >
> > > For now, I would just keep the simple version we currently have: check
> > > once in zswap_store() and then just go ahead for the whole folio.
> >
> > I am not totally against this but I feel like this is too optimistic.
> > I think we can keep it simple-ish by maintaining an ewma for the
> > compression ratio, we already have primitives for this (see
> > DECLARE_EWMA).
> >
> > Then in zswap_store(), we can use the ewma to estimate the compressed
> > size and use it to do the memcg and global limit checks once, like we
> > do today. Instead of just checking if we are below the limits, we
> > check if we have enough headroom for the estimated compressed size.
> > Then we call zswap_store_page() to do the per-page stuff, then do
> > batched charging and stats updates.
>
> I'm not sure what you gain from making a non-atomic check precise. You
> can get a hundred threads determining down precisely that *their*
> store will fit exactly into the last 800kB before the limit.

We just get to avoid overshooting in cases where we know we probably
can't fit it anyway. If we have 4KB left and we are trying to compress
a 2MB THP, for example. It just makes the upfront check to avoid
pointless compression a little bit more meaningful.

>
> > If you think that's an overkill we can keep doing the limit checks as
> > we do today,
>
> I just don't see how it would make a practical difference.
>
> What would make a difference is atomic transactional charging of the
> compressed size, and unwinding on failure - with the upfront check to
> avoid pointlessly compressing (outside of race conditions).
>
> And I'm not against doing that in general, I am just against doing it
> per default.
>
> It's a lot of complexity, and like I said, the practical usecase for
> limiting zswap memory to begin with is quite unclear to me. Zswap is
> not a limited resource. It's just memory. And you already had the
> memory for the uncompressed copy. So it's a bit strange to me to say
> "you have compressed your memory enough, so now you get sent to disk
> (or we declare OOM)". What would be a reason to limit it?

Technically speaking if we have a global zswap limit, it becomes a
limited resource and distributing it across workloads can make sense.
That being said, I am not aware of any existing use cases for that.

The other use case is controlling when writeback kicks in for
different workloads. It may not make sense for limit-based reclaim,
because as you mentioned the memory is limited anyway and workloads
should be free to compress their own memory within their limit as they
please. But it may make sense for proactive reclaim, controlling how
much memory we compress vs how much memory we completely evict to
disk.

Again, not aware of any existing use cases for this as well.

>
> It sort of makes sense as a binary switch, but I don't get the usecase
> for a granular limit. (And I blame my own cowardice for making the
> cgroup knob a limit, to keep options open, instead of a switch.)
>
> All that to say, this would be better in a follow-up patch. We allow
> overshooting now, it's not clear how overshooting by a larger amount
> makes a categorical difference.

I am not against making this a follow-up, if/when the need arises. My
whole point was that using EWMA (or similar) we can make the upfront
check a little bit more meaningful than "We have 1 byte of headroom,
let's go compress a 2MB THP!". I think it's not a lot of complexity to
check for headroom based on an estimated compression size, but I
didn't try to code it, so maybe I am wrong :)

>
> > but I would still like to see batching of all the limit checks,
> > charging, and stats updates. It makes little sense otherwise.
>
> Definitely. One check, one charge, one stat update per folio.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-25 19:10         ` Sridhar, Kanchana P
@ 2024-09-25 19:49           ` Yosry Ahmed
  2024-09-25 20:49             ` Johannes Weiner
  0 siblings, 1 reply; 79+ messages in thread
From: Yosry Ahmed @ 2024-09-25 19:49 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Johannes Weiner, linux-kernel, linux-mm, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh

[..]
> > > > > +       /*
> > > > > +        * Check cgroup limits:
> > > > > +        *
> > > > > +        * The cgroup zswap limit check is done once at the beginning of an
> > > > > +        * mTHP store, and not within zswap_store_page() for each page
> > > > > +        * in the mTHP. We do however check the zswap pool limits at the
> > > > > +        * start of zswap_store_page(). What this means is, the cgroup
> > > > > +        * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
> > > > > +        * However, the per-store-page zswap pool limits check should
> > > > > +        * hopefully trigger the cgroup aware and zswap LRU aware global
> > > > > +        * reclaim implemented in the shrinker. If this assumption holds,
> > > > > +        * the cgroup exceeding the zswap limits could potentially be
> > > > > +        * resolved before the next zswap_store, and if it is not, the next
> > > > > +        * zswap_store would fail the cgroup zswap limit check at the start.
> > > > > +        */
> > > >
> > > > I do not really like this. Allowing going one page above the limit is
> > > > one thing, but one THP above the limit seems too much. I also don't
> > > > like relying on the repeated limit checking in zswap_store_page(), if
> > > > anything I think that should be batched too.
> > > >
> > > > Is it too unreasonable to maintain the average compression ratio and
> > > > use that to estimate limit checking for both memcg and global limits?
> > > > Johannes, Nhat, any thoughts on this?
> > >
> > > I honestly don't think it's much of an issue. The global limit is
> > > huge, and the cgroup limit is to the best of my knowledge only used as
> > > a binary switch. Setting a non-binary limit - global or cgroup - seems
> > > like a bit of an obscure usecase to me, because in the vast majority
> > > of cases it's preferable to keep compressing over declaring OOM.
> > >
> > > And even if you do have some granular limit, the workload size scales
> > > with it. It's not like you have a thousand THPs in a 10M cgroup.
> >
> > The memcg limit and zswap limit can be disproportionate, although that
> > shouldn't be common.
> >
> > >
> > > If this ever becomes an issue, we can handle it in a fastpath-slowpath
> > > scheme: check the limit up front for fast-path failure if we're
> > > already maxed out, just like now; then make obj_cgroup_charge_zswap()
> > > atomically charge against zswap.max and unwind the store if we raced.
> > >
> > > For now, I would just keep the simple version we currently have: check
> > > once in zswap_store() and then just go ahead for the whole folio.
> >
> > I am not totally against this but I feel like this is too optimistic.
> > I think we can keep it simple-ish by maintaining an ewma for the
> > compression ratio, we already have primitives for this (see
> > DECLARE_EWMA).
> >
> > Then in zswap_store(), we can use the ewma to estimate the compressed
> > size and use it to do the memcg and global limit checks once, like we
> > do today. Instead of just checking if we are below the limits, we
> > check if we have enough headroom for the estimated compressed size.
> > Then we call zswap_store_page() to do the per-page stuff, then do
> > batched charging and stats updates.
> >
> > If you think that's an overkill we can keep doing the limit checks as
> > we do today,
> > but I would still like to see batching of all the limit checks,
> > charging, and stats updates. It makes little sense otherwise.
>
> Thanks Johannes and Yosry for these suggestions and pointers.
> I think there is general agreement about the batch charging and
> zswap_stored_pages/stats updates. Yosry, does "batching of limit
> checks" mean the same as a simple check for being over the cgroup
> limit at the start of zswap_store and not doing this check in
> zswap_store_page? Does this also imply a zswap_pool_get_many()?
> Would appreciate it if you can help clarify.

Yes I think we should batch as much as possible in zswap_store(), and
only do the things that are truly per-page in zswap_store_page(). The limit
checks, stats updates, zswap_pool refs, charging, etc. Batching all of
these things should be clear wins.

>
> The main question in my mind about using the EWMA checks is
> whether it would add overhead to the normal zswap reclaim path; and
> if so, whether a simple limit check at the start of zswap_store, as
> suggested by Johannes, would suffice. I can run a few experiments to
> quantify this overhead, and maybe we can revisit this?

If you look at ewma_##name##_add() in include/linux/average.h, it's
really just a bunch of bit shifts, so I am not concerned about runtime
overhead. My discussion with Johannes is more about whether the complexity
is justified; I'd wait for that discussion to settle.

Either way, we should check the limits once in zswap_store().


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-25 19:39           ` Yosry Ahmed
@ 2024-09-25 20:13             ` Johannes Weiner
  2024-09-25 21:06               ` Yosry Ahmed
  0 siblings, 1 reply; 79+ messages in thread
From: Johannes Weiner @ 2024-09-25 20:13 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Kanchana P Sridhar, linux-kernel, linux-mm, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
	vinodh.gopal

On Wed, Sep 25, 2024 at 12:39:02PM -0700, Yosry Ahmed wrote:
> On Wed, Sep 25, 2024 at 12:20 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> >
> > On Wed, Sep 25, 2024 at 11:30:34AM -0700, Yosry Ahmed wrote:
> > > Johannes wrote:
> > > > If this ever becomes an issue, we can handle it in a fastpath-slowpath
> > > > scheme: check the limit up front for fast-path failure if we're
> > > > already maxed out, just like now; then make obj_cgroup_charge_zswap()
> > > > atomically charge against zswap.max and unwind the store if we raced.
> > > >
> > > > For now, I would just keep the simple version we currently have: check
> > > > once in zswap_store() and then just go ahead for the whole folio.
> > >
> > > I am not totally against this but I feel like this is too optimistic.
> > > I think we can keep it simple-ish by maintaining an ewma for the
> > > compression ratio, we already have primitives for this (see
> > > DECLARE_EWMA).
> > >
> > > Then in zswap_store(), we can use the ewma to estimate the compressed
> > > size and use it to do the memcg and global limit checks once, like we
> > > do today. Instead of just checking if we are below the limits, we
> > > check if we have enough headroom for the estimated compressed size.
> > > Then we call zswap_store_page() to do the per-page stuff, then do
> > > batched charging and stats updates.
> >
> > I'm not sure what you gain from making a non-atomic check precise. You
> > can get a hundred threads determining down precisely that *their*
> > store will fit exactly into the last 800kB before the limit.
> 
> We just get to avoid overshooting in cases where we know we probably
> can't fit it anyway. If we have 4KB left and we are trying to compress
> a 2MB THP, for example. It just makes the upfront check to avoid
> pointless compression a little bit more meaningful.

I think I'm missing something. It's not just an upfront check, it's
the only check. The charge down the line doesn't limit anything, it
just counts. So if this check passes, we WILL store the folio. There
is no pointless compression.

We might overshoot the limit by about one folio in a single-threaded
scenario. But that is negligible in comparison to the overshoot we can
get due to race conditions.

Again, I see no practical, meaningful difference in outcome by
making that limit check any more precise. Just keep it as-is.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-25 19:49           ` Yosry Ahmed
@ 2024-09-25 20:49             ` Johannes Weiner
  0 siblings, 0 replies; 79+ messages in thread
From: Johannes Weiner @ 2024-09-25 20:49 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Sridhar, Kanchana P, linux-kernel, linux-mm, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang,
	Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal,
	Vinodh

On Wed, Sep 25, 2024 at 12:49:13PM -0700, Yosry Ahmed wrote:
> Kanchana wrote:
> > The main question in my mind about using the EWMA checks is,
> > will it add overhead to the normal zswap reclaim path; and if so,
> > would a simple limit check at the start of zswap_store as suggested
> > by Johannes suffice. I can run a few experiments to quantify this
> > overhead, and maybe we can revisit this?
> 
> If you look at ewma_##name##_add() in include/linux/average.h, it's
> really just a bunch of bit shifts, so I am not concerned about runtime
> overhead. My discussion with Johannes is more about if the complexity
> is justified, I'd wait for that discussion to settle.

Sorry to be blunt, but "precision" in a non-atomic check like this
makes no sense. The fact that it's not too expensive is irrelevant.
The discussion around this honestly has gone off the rails.

Just leave the limit checks exactly as they are. Check limits and
cgroup_may_zswap() once up front. Compress the subpages. Acquire
references and bump all stats in batches of folio_nr_pages(). You can
add up the subpage compressed bytes in the for-loop and do the
obj_cgroup_charge_zswap() in a single call at the end as well.

That's my suggestion. If that's no good, please ELI5.
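
In rough pseudo-C, the shape I have in mind (sketch only:
zswap_store_page() is assumed to return the compressed size of one
subpage or a negative errno, and the unwind of already-stored
subpages is elided):

bool zswap_store(struct folio *folio)
{
	long nr_pages = folio_nr_pages(folio);
	struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio);
	size_t compressed_bytes = 0;
	ssize_t bytes;
	long index;

	/*
	 * Limit checks and cgroup_may_zswap() once, up front; the
	 * global pool limit check is elided from this sketch.
	 */
	if (objcg && !obj_cgroup_may_zswap(objcg))
		goto put_objcg;

	/* compress and store the subpages */
	for (index = 0; index < nr_pages; ++index) {
		bytes = zswap_store_page(folio, index);
		if (bytes < 0)
			goto put_objcg;
		compressed_bytes += bytes;
	}

	/* one charge, one stat update per folio */
	if (objcg)
		obj_cgroup_charge_zswap(objcg, compressed_bytes);
	atomic_add(nr_pages, &zswap_stored_pages);
	count_vm_events(ZSWPOUT, nr_pages);

	if (objcg)
		obj_cgroup_put(objcg);
	return true;

put_objcg:
	if (objcg)
		obj_cgroup_put(objcg);
	return false;
}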


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-25 20:13             ` Johannes Weiner
@ 2024-09-25 21:06               ` Yosry Ahmed
  2024-09-25 22:29                 ` Sridhar, Kanchana P
  0 siblings, 1 reply; 79+ messages in thread
From: Yosry Ahmed @ 2024-09-25 21:06 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Kanchana P Sridhar, linux-kernel, linux-mm, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
	vinodh.gopal

On Wed, Sep 25, 2024 at 1:13 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Sep 25, 2024 at 12:39:02PM -0700, Yosry Ahmed wrote:
> > On Wed, Sep 25, 2024 at 12:20 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Wed, Sep 25, 2024 at 11:30:34AM -0700, Yosry Ahmed wrote:
> > > > Johannes wrote:
> > > > > If this ever becomes an issue, we can handle it in a fastpath-slowpath
> > > > > scheme: check the limit up front for fast-path failure if we're
> > > > > already maxed out, just like now; then make obj_cgroup_charge_zswap()
> > > > > atomically charge against zswap.max and unwind the store if we raced.
> > > > >
> > > > > For now, I would just keep the simple version we currently have: check
> > > > > once in zswap_store() and then just go ahead for the whole folio.
> > > >
> > > > I am not totally against this but I feel like this is too optimistic.
> > > > I think we can keep it simple-ish by maintaining an ewma for the
> > > > compression ratio, we already have primitives for this (see
> > > > DECLARE_EWMA).
> > > >
> > > > Then in zswap_store(), we can use the ewma to estimate the compressed
> > > > size and use it to do the memcg and global limit checks once, like we
> > > > do today. Instead of just checking if we are below the limits, we
> > > > check if we have enough headroom for the estimated compressed size.
> > > > Then we call zswap_store_page() to do the per-page stuff, then do
> > > > batched charging and stats updates.
> > >
> > > I'm not sure what you gain from making a non-atomic check precise. You
> > > can get a hundred threads determining down precisely that *their*
> > > store will fit exactly into the last 800kB before the limit.
> >
> > We just get to avoid overshooting in cases where we know we probably
> > can't fit it anyway. If we have 4KB left and we are trying to compress
> > a 2MB THP, for example. It just makes the upfront check to avoid
> > pointless compression a little bit more meaningful.
>
> I think I'm missing something. It's not just an upfront check, it's
> the only check. The charge down the line doesn't limit anything, it
> just counts. So if this check passes, we WILL store the folio. There
> is no pointless compression.

I got confused by what you said about the fast-slow path; I thought
you were suggesting we do this now, so I was saying it's better to use
an estimate of the compressed size in the fast path to avoid pointless
compression.

I missed the second paragraph.

>
> We might overshoot the limit by about one folio in a single-threaded
> scenario. But that is negligible in comparison to the overshoot we can
> get due to race conditions.
>
> Again, I see no practical, meaningful difference in outcome by
> making that limit check any more precise. Just keep it as-is.

> Sorry to be blunt, but "precision" in a non-atomic check like this
> makes no sense. The fact that it's not too expensive is irrelevant.
> The discussion around this honestly has gone off the rails.

Yeah, I thought we were talking about the version where we roll back
compressions if we overshoot, my bad. We discussed quite a few things
and I managed to confuse myself.

> Just leave the limit checks exactly as they are. Check limits and
> cgroup_may_zswap() once up front. Compress the subpages. Acquire
> references and bump all stats in batches of folio_nr_pages(). You can
> add up the subpage compressed bytes in the for-loop and do the
> obj_cgroup_charge_zswap() in a single call at the end as well.

We can keep the limit checks as they are for now, and revisit as needed.


^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-25 21:06               ` Yosry Ahmed
@ 2024-09-25 22:29                 ` Sridhar, Kanchana P
  2024-09-26  3:58                   ` Sridhar, Kanchana P
  0 siblings, 1 reply; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-25 22:29 UTC (permalink / raw)
  To: Yosry Ahmed, Johannes Weiner
  Cc: linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642,
	shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou,
	Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, September 25, 2024 2:06 PM
> To: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Wed, Sep 25, 2024 at 1:13 PM Johannes Weiner <hannes@cmpxchg.org>
> wrote:
> >
> > On Wed, Sep 25, 2024 at 12:39:02PM -0700, Yosry Ahmed wrote:
> > > On Wed, Sep 25, 2024 at 12:20 PM Johannes Weiner
> <hannes@cmpxchg.org> wrote:
> > > >
> > > > On Wed, Sep 25, 2024 at 11:30:34AM -0700, Yosry Ahmed wrote:
> > > > > Johannes wrote:
> > > > > > If this ever becomes an issue, we can handle it in a fastpath-
> slowpath
> > > > > > scheme: check the limit up front for fast-path failure if we're
> > > > > > already maxed out, just like now; then make
> obj_cgroup_charge_zswap()
> > > > > > atomically charge against zswap.max and unwind the store if we
> raced.
> > > > > >
> > > > > > For now, I would just keep the simple version we currently have:
> check
> > > > > > once in zswap_store() and then just go ahead for the whole folio.
> > > > >
> > > > > I am not totally against this but I feel like this is too optimistic.
> > > > > I think we can keep it simple-ish by maintaining an ewma for the
> > > > > compression ratio, we already have primitives for this (see
> > > > > DECLARE_EWMA).
> > > > >
> > > > > Then in zswap_store(), we can use the ewma to estimate the
> compressed
> > > > > size and use it to do the memcg and global limit checks once, like we
> > > > > do today. Instead of just checking if we are below the limits, we
> > > > > check if we have enough headroom for the estimated compressed size.
> > > > > Then we call zswap_store_page() to do the per-page stuff, then do
> > > > > batched charging and stats updates.
> > > >
> > > > I'm not sure what you gain from making a non-atomic check precise. You
> > > > can get a hundred threads determining down precisely that *their*
> > > > store will fit exactly into the last 800kB before the limit.
> > >
> > > We just get to avoid overshooting in cases where we know we probably
> > > can't fit it anyway. If we have 4KB left and we are trying to compress
> > > a 2MB THP, for example. It just makes the upfront check to avoid
> > > pointless compression a little bit more meaningful.
> >
> > I think I'm missing something. It's not just an upfront check, it's
> > the only check. The charge down the line doesn't limit anything, it
> > just counts. So if this check passes, we WILL store the folio. There
> > is no pointless compression.
> 
> I got confused by what you said about the fast-slow path, I thought
> you were suggesting we do this now, so I was saying it's better to use
> an estimate of the compressed size in the fast path to avoid pointless
> compression.
> 
> I missed the second paragraph.
> 
> >
> > We might overshoot the limit by about one folio in a single-threaded
> > scenario. But that is negligible in comparison to the overshoot we can
> > get due to race conditions.
> >
> > Again, I see no practical, meaningful difference in outcome by
> > making that limit check any more precise. Just keep it as-is.
> 
> > Sorry to be blunt, but "precision" in a non-atomic check like this
> > makes no sense. The fact that it's not too expensive is irrelevant.
> > The discussion around this honestly has gone off the rails.
> 
> Yeah I thought we were talking about the version where we rollback
> compressions if we overshoot, my bad. We discussed quite a few things
> and I managed to confuse myself.
> 
> > Just leave the limit checks exactly as they are. Check limits and
> > cgroup_may_zswap() once up front. Compress the subpages. Acquire
> > references and bump all stats in batches of folio_nr_pages(). You can
> > add up the subpage compressed bytes in the for-loop and do the
> > obj_cgroup_charge_zswap() in a single call at the end as well.
> 
> We can keep the limit checks as they are for now, and revisit as needed.

Thanks Johannes and Yosry for the discussion! I will proceed as suggested.

Thanks,
Kanchana

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
  2024-09-25 18:39   ` Sridhar, Kanchana P
@ 2024-09-26  0:44     ` Huang, Ying
  2024-09-26  3:48       ` Sridhar, Kanchana P
  0 siblings, 1 reply; 79+ messages in thread
From: Huang, Ying @ 2024-09-26  0:44 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh

"Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:

>> -----Original Message-----
>> From: Huang, Ying <ying.huang@intel.com>
>> Sent: Tuesday, September 24, 2024 11:35 PM
>> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
>> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
>> chengming.zhou@linux.dev; usamaarif642@gmail.com;
>> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
>> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
>> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
>> <vinodh.gopal@intel.com>
>> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
>> 
>> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:
>> 
>> [snip]
>> 
>> >
>> > Case 1: Comparing zswap 4K vs. zswap mTHP
>> > =========================================
>> >
>> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in
>> > 64K/2M (m)THP to be split into 4K folios that get processed by zswap.
>> >
>> > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that
>> results
>> > in 64K/2M (m)THP to not be split, and processed by zswap.
>> >
>> >  64KB mTHP (cgroup memory.high set to 40G):
>> >  ==========================================
>> >
>> >  -------------------------------------------------------------------------------
>> >                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
>> >                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
>> >                                  Baseline
>> >  -------------------------------------------------------------------------------
>> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>> >                                       iaa                     iaa            iaa
>> >  -------------------------------------------------------------------------------
>> >  Throughput (KB/s)   143,323      125,485     153,550     129,609    7%       3%
>> >  elapsed time (sec)    24.97        25.42       23.90       25.19    4%       1%
>> >  sys time (sec)       822.72       750.96      757.70      731.13    8%       3%
>> >  memcg_high          132,743      169,825     148,075     192,744
>> >  memcg_swap_fail     639,067      841,553       2,204       2,215
>> >  pswpin                    0            0           0           0
>> >  pswpout                   0            0           0           0
>> >  zswpin                  795          873         760         902
>> >  zswpout          10,011,266   13,195,137  10,010,017  13,193,554
>> >  thp_swpout                0            0           0           0
>> >  thp_swpout_               0            0           0           0
>> >   fallback
>> >  64kB-mthp_          639,065      841,553       2,204       2,215
>> >   swpout_fallback
>> >  pgmajfault            2,861        2,924       3,054       3,259
>> >  ZSWPOUT-64kB            n/a          n/a     623,451     822,268
>> >  SWPOUT-64kB               0            0           0           0
>> >  -------------------------------------------------------------------------------
>> >
>> 
>> IIUC, the throughput is the sum of throughput of all usemem processes?
>> 
>> One possible issue of usemem test case is the "imbalance" issue.  That
>> is, some usemem processes may swap-out/swap-in less, so the score is
>> very high; while some other processes may swap-out/swap-in more, so the
>> score is very low.  Sometimes, the total score decreases, but the scores
>> of usemem processes are more balanced, so that the performance should be
>> considered better.  And, in general, we should make usemem score
>> balanced among processes via say longer test time.  Can you check this
>> in your test results?
>
> Actually, the throughput data listed in the cover-letter is the average of
> all the usemem processes. Your observation about the "imbalance" issue is
> right. Some processes see a higher throughput than others. I have noticed
> that the throughputs progressively reduce as the individual processes exit
> and print their stats.
>
> Listed below are the stats from two runs of usemem70: sleep 10 and sleep 30.
> Both are run with a cgroup mem-limit of 40G. Data is with v7, 64K folios are
> enabled, zswap uses zstd. 
>
>
> -----------------------------------------------
>                sleep 10           sleep 30
>       Throughput (KB/s)  Throughput (KB/s)
>  -----------------------------------------------
>                 181,540            191,686
>                 179,651            191,459
>                 179,068            188,834
>                 177,244            187,568
>                 177,215            186,703
>                 176,565            185,584
>                 176,546            185,370
>                 176,470            185,021
>                 176,214            184,303
>                 176,128            184,040
>                 175,279            183,932
>                 174,745            180,831
>                 173,935            179,418
>                 161,546            168,014
>                 160,332            167,540
>                 160,122            167,364
>                 159,613            167,020
>                 159,546            166,590
>                 159,021            166,483
>                 158,845            166,418
>                 158,426            166,264
>                 158,396            166,066
>                 158,371            165,944
>                 158,298            165,866
>                 158,250            165,884
>                 158,057            165,533
>                 158,011            165,532
>                 157,899            165,457
>                 157,894            165,424
>                 157,839            165,410
>                 157,731            165,407
>                 157,629            165,273
>                 157,626            164,867
>                 157,581            164,636
>                 157,471            164,266
>                 157,430            164,225
>                 157,287            163,290
>                 156,289            153,597
>                 153,970            147,494
>                 148,244            147,102
>                 142,907            146,111
>                 142,811            145,789
>                 139,171            141,168
>                 136,314            140,714
>                 133,616            140,111
>                 132,881            139,636
>                 132,729            136,943
>                 132,680            136,844
>                 132,248            135,726
>                 132,027            135,384
>                 131,929            135,270
>                 131,766            134,748
>                 131,667            134,733
>                 131,576            134,582
>                 131,396            134,302
>                 131,351            134,160
>                 131,135            134,102
>                 130,885            134,097
>                 130,854            134,058
>                 130,767            134,006
>                 130,666            133,960
>                 130,647            133,894
>                 130,152            133,837
>                 130,006            133,747
>                 129,921            133,679
>                 129,856            133,666
>                 129,377            133,564
>                 128,366            133,331
>                 127,988            132,938
>                 126,903            132,746
>  -----------------------------------------------
>       sum    10,526,916         10,919,561
>   average       150,385            155,994
>    stddev        17,551             19,633
>  -----------------------------------------------
>     elapsed       24.40              43.66
>  time (sec)
>    sys time      806.25             766.05
>       (sec)
>     zswpout  10,008,713         10,008,407
>   64K folio     623,463            623,629
>      swpout
>  -----------------------------------------------

Although there is some imbalance, I don't find it to be too much.  So, I
think the test results are reasonable.  Please pay attention to the
imbalance issue in future tests.

> As we increase the time for which allocations are maintained,
> there seems to be a slight improvement in throughput, but the
> variance increases as well. The processes with lower throughput
> could be the ones that handle the memcg being over limit by
> doing reclaim, possibly before they can allocate.
>
> Interestingly, the longer test time does seem to reduce the amount
> of reclaim (hence lower sys time), but more 64K large folios seem to
> be reclaimed. Could this mean that with longer test time (sleep 30),
> more cold memory residing in large folios is getting reclaimed, as
> against memory just relinquished by the exiting processes?

I don't think a longer sleep time in the test helps much with balance.  Can
you try with fewer processes, and a larger memory size per process?  I guess
that this will improve balance.

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
  2024-09-26  0:44     ` Huang, Ying
@ 2024-09-26  3:48       ` Sridhar, Kanchana P
  2024-09-26  6:47         ` Huang, Ying
  0 siblings, 1 reply; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-26  3:48 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P

Hi Ying,

> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Wednesday, September 25, 2024 5:45 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
> 
> "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:
> 
> >> -----Original Message-----
> >> From: Huang, Ying <ying.huang@intel.com>
> >> Sent: Tuesday, September 24, 2024 11:35 PM
> >> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> >> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> >> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> >> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> >> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
> >> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>;
> Feghali,
> >> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> >> <vinodh.gopal@intel.com>
> >> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
> >>
> >> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:
> >>
> >> [snip]
> >>
> >> >
> >> > Case 1: Comparing zswap 4K vs. zswap mTHP
> >> > =========================================
> >> >
> >> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in
> >> > 64K/2M (m)THP to be split into 4K folios that get processed by zswap.
> >> >
> >> > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that
> >> results
> >> > in 64K/2M (m)THP to not be split, and processed by zswap.
> >> >
> >> >  64KB mTHP (cgroup memory.high set to 40G):
> >> >  ==========================================
> >> >
> >> >  -------------------------------------------------------------------------------
> >> >                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
> >> >                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
> >> >                                  Baseline
> >> >  -------------------------------------------------------------------------------
> >> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >> >                                       iaa                     iaa            iaa
> >> >  -------------------------------------------------------------------------------
> >> >  Throughput (KB/s)   143,323      125,485     153,550     129,609    7%       3%
> >> >  elapsed time (sec)    24.97        25.42       23.90       25.19    4%       1%
> >> >  sys time (sec)       822.72       750.96      757.70      731.13    8%       3%
> >> >  memcg_high          132,743      169,825     148,075     192,744
> >> >  memcg_swap_fail     639,067      841,553       2,204       2,215
> >> >  pswpin                    0            0           0           0
> >> >  pswpout                   0            0           0           0
> >> >  zswpin                  795          873         760         902
> >> >  zswpout          10,011,266   13,195,137  10,010,017  13,193,554
> >> >  thp_swpout                0            0           0           0
> >> >  thp_swpout_               0            0           0           0
> >> >   fallback
> >> >  64kB-mthp_          639,065      841,553       2,204       2,215
> >> >   swpout_fallback
> >> >  pgmajfault            2,861        2,924       3,054       3,259
> >> >  ZSWPOUT-64kB            n/a          n/a     623,451     822,268
> >> >  SWPOUT-64kB               0            0           0           0
> >> >  -------------------------------------------------------------------------------
> >> >
> >>
> >> IIUC, the throughput is the sum of throughput of all usemem processes?
> >>
> >> One possible issue with the usemem test case is the "imbalance" issue.  That
> >> is, some usemem processes may swap out/swap in less, so their scores are
> >> very high, while other processes may swap out/swap in more, so their scores
> >> are very low.  Sometimes the total score decreases but the scores of the
> >> usemem processes are more balanced, in which case the performance should be
> >> considered better.  In general, we should make the usemem scores balanced
> >> among processes via, say, a longer test time.  Can you check this in your
> >> test results?
> >
> > Actually, the throughput data listed in the cover-letter is the average of
> > all the usemem processes. Your observation about the "imbalance" issue is
> > right. Some processes see a higher throughput than others. I have noticed
> > that the throughputs progressively reduce as the individual processes exit
> > and print their stats.
> >
> > Listed below are the stats from two runs of usemem70: sleep 10 and sleep 30.
> > Both are run with a cgroup mem-limit of 40G. Data is with v7, 64K folios are
> > enabled, zswap uses zstd.
> >
> >
> > -----------------------------------------------
> >                sleep 10           sleep 30
> >       Throughput (KB/s)  Throughput (KB/s)
> >  -----------------------------------------------
> >                 181,540            191,686
> >                 179,651            191,459
> >                 179,068            188,834
> >                 177,244            187,568
> >                 177,215            186,703
> >                 176,565            185,584
> >                 176,546            185,370
> >                 176,470            185,021
> >                 176,214            184,303
> >                 176,128            184,040
> >                 175,279            183,932
> >                 174,745            180,831
> >                 173,935            179,418
> >                 161,546            168,014
> >                 160,332            167,540
> >                 160,122            167,364
> >                 159,613            167,020
> >                 159,546            166,590
> >                 159,021            166,483
> >                 158,845            166,418
> >                 158,426            166,264
> >                 158,396            166,066
> >                 158,371            165,944
> >                 158,298            165,866
> >                 158,250            165,884
> >                 158,057            165,533
> >                 158,011            165,532
> >                 157,899            165,457
> >                 157,894            165,424
> >                 157,839            165,410
> >                 157,731            165,407
> >                 157,629            165,273
> >                 157,626            164,867
> >                 157,581            164,636
> >                 157,471            164,266
> >                 157,430            164,225
> >                 157,287            163,290
> >                 156,289            153,597
> >                 153,970            147,494
> >                 148,244            147,102
> >                 142,907            146,111
> >                 142,811            145,789
> >                 139,171            141,168
> >                 136,314            140,714
> >                 133,616            140,111
> >                 132,881            139,636
> >                 132,729            136,943
> >                 132,680            136,844
> >                 132,248            135,726
> >                 132,027            135,384
> >                 131,929            135,270
> >                 131,766            134,748
> >                 131,667            134,733
> >                 131,576            134,582
> >                 131,396            134,302
> >                 131,351            134,160
> >                 131,135            134,102
> >                 130,885            134,097
> >                 130,854            134,058
> >                 130,767            134,006
> >                 130,666            133,960
> >                 130,647            133,894
> >                 130,152            133,837
> >                 130,006            133,747
> >                 129,921            133,679
> >                 129,856            133,666
> >                 129,377            133,564
> >                 128,366            133,331
> >                 127,988            132,938
> >                 126,903            132,746
> >  -----------------------------------------------
> >       sum    10,526,916         10,919,561
> >   average       150,385            155,994
> >    stddev        17,551             19,633
> >  -----------------------------------------------
> >     elapsed       24.40              43.66
> >  time (sec)
> >    sys time      806.25             766.05
> >       (sec)
> >     zswpout  10,008,713         10,008,407
> >   64K folio     623,463            623,629
> >      swpout
> >  -----------------------------------------------
> 
> Although there is some imbalance, I don't find it to be too much.  So, I
> think the test results are reasonable.  Please pay attention to the
> imbalance issue in future tests.

Sure, will do so.

> 
> > As we increase the time for which allocations are maintained,
> > there seems to be a slight improvement in throughput, but the
> > variance increases as well. The processes with lower throughput
> > could be the ones that handle the memcg being over limit by
> > doing reclaim, possibly before they can allocate.
> >
> > Interestingly, the longer test time does seem to reduce the amount
> > of reclaim (hence lower sys time), but more 64K large folios seem to
> > be reclaimed. Could this mean that with longer test time (sleep 30),
> > more cold memory residing in large folios is getting reclaimed, as
> > against memory just relinquished by the exiting processes?
> 
> I don't think a longer sleep time in the test helps much with balance.  Can
> you try with fewer processes, and a larger memory size per process?  I guess
> that this will improve balance.

I tried this, and the data is listed below:

  usemem options:
  ---------------
  30 processes allocate 10G each
  cgroup memory limit = 150G
  sleep 10
  525Gi SSD disk swap partition
  64K large folios enabled      

  Throughput (KB/s) of each of the 30 processes:
 ---------------------------------------------------------------
                      mm-unstable    zswap_store of large folios
                        9-25-2024                v7
 zswap compressor:           zstd         zstd  deflate-iaa
 ---------------------------------------------------------------
                           38,393      234,485      374,427
                           37,283      215,528      314,225
                           37,156      214,942      304,413
                           37,143      213,073      304,146
                           36,814      212,904      290,186
                           36,277      212,304      288,212
                           36,104      212,207      285,682
                           36,000      210,173      270,661
                           35,994      208,487      256,960
                           35,979      207,788      248,313
                           35,967      207,714      235,338
                           35,966      207,703      229,335
                           35,835      207,690      221,697
                           35,793      207,418      221,600
                           35,692      206,160      219,346
                           35,682      206,128      219,162
                           35,681      205,817      219,155
                           35,678      205,546      214,862
                           35,678      205,523      214,710
                           35,677      204,951      214,282
                           35,677      204,283      213,441
                           35,677      203,348      213,011
                           35,675      203,028      212,923
                           35,673      201,922      212,492
                           35,672      201,660      212,225
                           35,672      200,724      211,808
                           35,672      200,324      211,420
                           35,671      199,686      211,413
                           35,667      198,858      211,346
                           35,667      197,590      211,209
 ---------------------------------------------------------------
 sum                     1,081,515    6,217,964    7,268,000
 average                    36,051      207,265      242,267
 stddev                        655        7,010       42,234
 elapsed time (sec)         343.70       107.40        84.34
 sys time (sec)             269.30     2,520.13     1,696.20
 memcg.high breaches       443,672      475,074      623,333
 zswpout                    22,605   48,931,249   54,777,100
 pswpout                40,004,528            0            0
 hugepages-64K zswpout           0    3,057,090    3,421,855
 hugepages-64K swpout    2,500,283            0            0
 ---------------------------------------------------------------

As you can see, this is quite a memory-constrained scenario, where we
are giving 50% of the total memory required as the memory limit for the
cgroup in which the 30 processes are run. This causes significantly more
reclaim activity than the setup I was using thus far (70 processes, 1G
each, 40G limit).

The variance or "imbalance" reduces somewhat for zstd, but not for IAA.

IAA shows really good improvements wrt zstd with zswap_store of large
folios: 17% in throughput, 21% in elapsed time and 33% in sys time.
These are the memory-constrained scenarios in which IAA typically
does really well. IAA verify_compress is enabled, so we also get the
added benefit of data integrity checks with IAA.

I would like to get your and the maintainers' feedback on whether
I should switch to this "usemem30-10G" setup for v8.

Thanks,
Kanchana

> 
> --
> Best Regards,
> Huang, Ying


^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-25 22:29                 ` Sridhar, Kanchana P
@ 2024-09-26  3:58                   ` Sridhar, Kanchana P
  2024-09-26  4:52                     ` Yosry Ahmed
  0 siblings, 1 reply; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-26  3:58 UTC (permalink / raw)
  To: Yosry Ahmed, Johannes Weiner
  Cc: linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642,
	shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou,
	Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P

> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Wednesday, September 25, 2024 3:29 PM
> To: Yosry Ahmed <yosryahmed@google.com>; Johannes Weiner
> <hannes@cmpxchg.org>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>;
> Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Subject: RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@google.com>
> > Sent: Wednesday, September 25, 2024 2:06 PM
> > To: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> > kernel@vger.kernel.org; linux-mm@kvack.org; nphamcs@gmail.com;
> > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org;
> > Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> > zswap_store().
> >
> > On Wed, Sep 25, 2024 at 1:13 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > >
> > > On Wed, Sep 25, 2024 at 12:39:02PM -0700, Yosry Ahmed wrote:
> > > > On Wed, Sep 25, 2024 at 12:20 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > > >
> > > > > On Wed, Sep 25, 2024 at 11:30:34AM -0700, Yosry Ahmed wrote:
> > > > > > Johannes wrote:
> > > > > > > If this ever becomes an issue, we can handle it in a fastpath-slowpath
> > > > > > > scheme: check the limit up front for fast-path failure if we're
> > > > > > > already maxed out, just like now; then make obj_cgroup_charge_zswap()
> > > > > > > atomically charge against zswap.max and unwind the store if we raced.
> > > > > > >
> > > > > > > For now, I would just keep the simple version we currently have: check
> > > > > > > once in zswap_store() and then just go ahead for the whole folio.
> > > > > >
> > > > > > I am not totally against this but I feel like this is too optimistic.
> > > > > > I think we can keep it simple-ish by maintaining an ewma for the
> > > > > > compression ratio, we already have primitives for this (see
> > > > > > DECLARE_EWMA).
> > > > > >
> > > > > > Then in zswap_store(), we can use the ewma to estimate the compressed
> > > > > > size and use it to do the memcg and global limit checks once, like we
> > > > > > do today. Instead of just checking if we are below the limits, we
> > > > > > check if we have enough headroom for the estimated compressed size.
> > > > > > Then we call zswap_store_page() to do the per-page stuff, then do
> > > > > > batched charging and stats updates.
> > > > >
> > > > > I'm not sure what you gain from making a non-atomic check precise. You
> > > > > can get a hundred threads determining down precisely that *their*
> > > > > store will fit exactly into the last 800kB before the limit.
> > > >
> > > > We just get to avoid overshooting in cases where we know we probably
> > > > can't fit it anyway. If we have 4KB left and we are trying to compress
> > > > a 2MB THP, for example. It just makes the upfront check to avoid
> > > > pointless compression a little bit more meaningful.
> > >
> > > I think I'm missing something. It's not just an upfront check, it's
> > > the only check. The charge down the line doesn't limit anything, it
> > > just counts. So if this check passes, we WILL store the folio. There
> > > is no pointless compression.
> >
> > I got confused by what you said about the fast-slow path, I thought
> > you were suggesting we do this now, so I was saying it's better to use
> > an estimate of the compressed size in the fast path to avoid pointless
> > compression.
> >
> > I missed the second paragraph.
> >
> > >
> > > We might overshoot the limit by about one folio in a single-threaded
> > > scenario. But that is negligible in comparison to the overshoot we can
> > > get due to race conditions.
> > >
> > > Again, I see no practical, meaningful difference in outcome by
> > > making that limit check any more precise. Just keep it as-is.
> >
> > > Sorry to be blunt, but "precision" in a non-atomic check like this
> > > makes no sense. The fact that it's not too expensive is irrelevant.
> > > The discussion around this honestly has gone off the rails.
> >
> > Yeah I thought we were talking about the version where we rollback
> > compressions if we overshoot, my bad. We discussed quite a few things
> > and I managed to confuse myself.
> >
> > > Just leave the limit checks exactly as they are. Check limits and
> > > cgroup_may_zswap() once up front. Compress the subpages. Acquire
> > > references and bump all stats in batches of folio_nr_pages(). You can
> > > add up the subpage compressed bytes in the for-loop and do the
> > > obj_cgroup_charge_zswap() in a single call at the end as well.
> >
> > We can keep the limit checks as they are for now, and revisit as needed.
> 
> Thanks Johannes and Yosry for the discussion! I will proceed as suggested.

One thing I realized while reworking the patches for the batched checks is:
within zswap_store_page(), we set entry->objcg and entry->pool before
adding the entry to the xarray. Given this, wouldn't it be safer to get the
objcg and pool references per sub-page, locally in zswap_store_page(), rather
than obtaining batched references at the end if the store is successful? If we
want zswap_store_page() to be self-contained and correct with respect to the
entry being created and added to the xarray, it seems like the right thing to
do. I am a bit apprehensive about an entry being added to the xarray without
references held on the objcg and pool, because any page-faults/writeback that
occur on sub-pages already in the xarray, before the entire folio has been
stored, would run into issues.
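
To illustrate, a rough sketch of the per-page variant I mean (a sketch only,
not the actual v7 code; the swap_zswap_tree()/xa_store() usage and the error
label are assumptions about the surrounding code):

        /* Inside zswap_store_page(), for one sub-page's entry: */
        entry->objcg = objcg;
        entry->pool = pool;
        if (objcg)
                obj_cgroup_get(objcg);  /* ref taken before the entry is published */
        if (!zswap_pool_get(pool))      /* tryget; fails if the pool is dying */
                goto free_entry;        /* hypothetical error label */
        /*
         * Only now is the entry made visible; a racing swapin/writeback
         * that finds it in the xarray sees a referenced objcg and pool.
         */
        old = xa_store(swap_zswap_tree(swp), offset, entry, GFP_KERNEL);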

Just wanted to run this by you. The rest of the batched charging, atomic
and stat updates should be Ok.

Thanks,
Kanchana

> 
> Thanks,
> Kanchana

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-26  3:58                   ` Sridhar, Kanchana P
@ 2024-09-26  4:52                     ` Yosry Ahmed
  2024-09-26 16:40                       ` Sridhar, Kanchana P
  0 siblings, 1 reply; 79+ messages in thread
From: Yosry Ahmed @ 2024-09-26  4:52 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Johannes Weiner, linux-kernel, linux-mm, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh

[..]
>
> One thing I realized while reworking the patches for the batched checks is:
> within zswap_store_page(), we set entry->objcg and entry->pool before
> adding the entry to the xarray. Given this, wouldn't it be safer to get the
> objcg and pool references per sub-page, locally in zswap_store_page(), rather
> than obtaining batched references at the end if the store is successful? If we
> want zswap_store_page() to be self-contained and correct with respect to the
> entry being created and added to the xarray, it seems like the right thing to
> do. I am a bit apprehensive about an entry being added to the xarray without
> references held on the objcg and pool, because any page-faults/writeback that
> occur on sub-pages already in the xarray, before the entire folio has been
> stored, would run into issues.

We definitely should not obtain references to the pool and objcg after
initializing the entries with them. We can obtain all references in
zswap_store() before zswap_store_page(). IOW, the batching in this
case should be done before the per-page operations, not after.
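
Roughly, a sketch of that ordering (obj_cgroup_get_many() is an existing
helper; the zswap_pool_get_many()/zswap_pool_put_many() pair is assumed here
and does not exist yet):

        /* In zswap_store(), before any per-page work; not actual code. */
        long nr_pages = folio_nr_pages(folio);
        long i;

        if (objcg)
                obj_cgroup_get_many(objcg, nr_pages);   /* existing helper */
        if (!zswap_pool_get_many(pool, nr_pages))       /* assumed helper */
                goto put_objcg;

        for (i = 0; i < nr_pages; i++) {
                /* entries are initialized with already-referenced objcg/pool */
                if (!zswap_store_page(folio, i, objcg, pool))
                        goto unwind;    /* entries [0, i) own their refs;
                                         * drop only the unused ones */
        }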

>
> Just wanted to run this by you. The rest of the batched charging, atomic
> and stat updates should be Ok.
>
> Thanks,
> Kanchana
>
> >
> > Thanks,
> > Kanchana


^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
  2024-09-26  3:48       ` Sridhar, Kanchana P
@ 2024-09-26  6:47         ` Huang, Ying
  2024-09-26 21:44           ` Sridhar, Kanchana P
  0 siblings, 1 reply; 79+ messages in thread
From: Huang, Ying @ 2024-09-26  6:47 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh

"Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:

> Hi Ying,
>
>> -----Original Message-----
>> From: Huang, Ying <ying.huang@intel.com>
>> Sent: Wednesday, September 25, 2024 5:45 PM
>> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
>> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
>> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
>> chengming.zhou@linux.dev; usamaarif642@gmail.com;
>> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
>> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
>> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
>> <vinodh.gopal@intel.com>
>> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
>> 
>> "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:
>> 
>> >> -----Original Message-----
>> >> From: Huang, Ying <ying.huang@intel.com>
>> >> Sent: Tuesday, September 24, 2024 11:35 PM
>> >> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
>> >> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
>> >> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
>> >> chengming.zhou@linux.dev; usamaarif642@gmail.com;
>> >> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
>> >> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>;
>> Feghali,
>> >> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
>> >> <vinodh.gopal@intel.com>
>> >> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
>> >>
>> >> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:
>> >>
>> >> [snip]
>> >>
>> >> >
>> >> > Case 1: Comparing zswap 4K vs. zswap mTHP
>> >> > =========================================
>> >> >
>> >> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in
>> >> > 64K/2M (m)THP to be split into 4K folios that get processed by zswap.
>> >> >
>> >> > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that
>> >> results
>> >> > in 64K/2M (m)THP to not be split, and processed by zswap.
>> >> >
>> >> >  64KB mTHP (cgroup memory.high set to 40G):
>> >> >  ==========================================
>> >> >
>> >> >  -------------------------------------------------------------------------------
>> >> >                     mm-unstable 9-23-2024              zswap-mTHP     Change wrt
>> >> >                         CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y       Baseline
>> >> >                                  Baseline
>> >> >  -------------------------------------------------------------------------------
>> >> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
>> >> >                                       iaa                     iaa            iaa
>> >> >  -------------------------------------------------------------------------------
>> >> >  Throughput (KB/s)   143,323      125,485     153,550     129,609    7%       3%
>> >> >  elapsed time (sec)    24.97        25.42       23.90       25.19    4%       1%
>> >> >  sys time (sec)       822.72       750.96      757.70      731.13    8%       3%
>> >> >  memcg_high          132,743      169,825     148,075     192,744
>> >> >  memcg_swap_fail     639,067      841,553       2,204       2,215
>> >> >  pswpin                    0            0           0           0
>> >> >  pswpout                   0            0           0           0
>> >> >  zswpin                  795          873         760         902
>> >> >  zswpout          10,011,266   13,195,137  10,010,017  13,193,554
>> >> >  thp_swpout                0            0           0           0
>> >> >  thp_swpout_               0            0           0           0
>> >> >   fallback
>> >> >  64kB-mthp_          639,065      841,553       2,204       2,215
>> >> >   swpout_fallback
>> >> >  pgmajfault            2,861        2,924       3,054       3,259
>> >> >  ZSWPOUT-64kB            n/a          n/a     623,451     822,268
>> >> >  SWPOUT-64kB               0            0           0           0
>> >> >  -------------------------------------------------------------------------------
>> >> >
>> >>
>> >> IIUC, the throughput is the sum of throughput of all usemem processes?
>> >>
>> >> One possible issue with the usemem test case is the "imbalance" issue.  That
>> >> is, some usemem processes may swap out/swap in less, so their scores are
>> >> very high, while other processes may swap out/swap in more, so their scores
>> >> are very low.  Sometimes the total score decreases but the scores of the
>> >> usemem processes are more balanced, in which case the performance should be
>> >> considered better.  In general, we should make the usemem scores balanced
>> >> among processes via, say, a longer test time.  Can you check this in your
>> >> test results?
>> >
>> > Actually, the throughput data listed in the cover-letter is the average of
>> > all the usemem processes. Your observation about the "imbalance" issue is
>> > right. Some processes see a higher throughput than others. I have noticed
>> > that the throughputs progressively reduce as the individual processes exit
>> > and print their stats.
>> >
>> > Listed below are the stats from two runs of usemem70: sleep 10 and sleep 30.
>> > Both are run with a cgroup mem-limit of 40G. Data is with v7, 64K folios are
>> > enabled, zswap uses zstd.
>> >
>> >
>> > -----------------------------------------------
>> >                sleep 10           sleep 30
>> >       Throughput (KB/s)  Throughput (KB/s)
>> >  -----------------------------------------------
>> >                 181,540            191,686
>> >                 179,651            191,459
>> >                 179,068            188,834
>> >                 177,244            187,568
>> >                 177,215            186,703
>> >                 176,565            185,584
>> >                 176,546            185,370
>> >                 176,470            185,021
>> >                 176,214            184,303
>> >                 176,128            184,040
>> >                 175,279            183,932
>> >                 174,745            180,831
>> >                 173,935            179,418
>> >                 161,546            168,014
>> >                 160,332            167,540
>> >                 160,122            167,364
>> >                 159,613            167,020
>> >                 159,546            166,590
>> >                 159,021            166,483
>> >                 158,845            166,418
>> >                 158,426            166,264
>> >                 158,396            166,066
>> >                 158,371            165,944
>> >                 158,298            165,866
>> >                 158,250            165,884
>> >                 158,057            165,533
>> >                 158,011            165,532
>> >                 157,899            165,457
>> >                 157,894            165,424
>> >                 157,839            165,410
>> >                 157,731            165,407
>> >                 157,629            165,273
>> >                 157,626            164,867
>> >                 157,581            164,636
>> >                 157,471            164,266
>> >                 157,430            164,225
>> >                 157,287            163,290
>> >                 156,289            153,597
>> >                 153,970            147,494
>> >                 148,244            147,102
>> >                 142,907            146,111
>> >                 142,811            145,789
>> >                 139,171            141,168
>> >                 136,314            140,714
>> >                 133,616            140,111
>> >                 132,881            139,636
>> >                 132,729            136,943
>> >                 132,680            136,844
>> >                 132,248            135,726
>> >                 132,027            135,384
>> >                 131,929            135,270
>> >                 131,766            134,748
>> >                 131,667            134,733
>> >                 131,576            134,582
>> >                 131,396            134,302
>> >                 131,351            134,160
>> >                 131,135            134,102
>> >                 130,885            134,097
>> >                 130,854            134,058
>> >                 130,767            134,006
>> >                 130,666            133,960
>> >                 130,647            133,894
>> >                 130,152            133,837
>> >                 130,006            133,747
>> >                 129,921            133,679
>> >                 129,856            133,666
>> >                 129,377            133,564
>> >                 128,366            133,331
>> >                 127,988            132,938
>> >                 126,903            132,746
>> >  -----------------------------------------------
>> >       sum    10,526,916         10,919,561
>> >   average       150,385            155,994
>> >    stddev        17,551             19,633
>> >  -----------------------------------------------
>> >     elapsed       24.40              43.66
>> >  time (sec)
>> >    sys time      806.25             766.05
>> >       (sec)
>> >     zswpout  10,008,713         10,008,407
>> >   64K folio     623,463            623,629
>> >      swpout
>> >  -----------------------------------------------
>> 
>> Although there is some imbalance, I don't find it to be too much.  So, I
>> think the test results are reasonable.  Please pay attention to the
>> imbalance issue in future tests.
>
> Sure, will do so.
>
>> 
>> > As we increase the time for which allocations are maintained,
>> > there seems to be a slight improvement in throughput, but the
>> > variance increases as well. The processes with lower throughput
>> > could be the ones that handle the memcg being over limit by
>> > doing reclaim, possibly before they can allocate.
>> >
>> > Interestingly, the longer test time does seem to reduce the amount
>> > of reclaim (hence lower sys time), but more 64K large folios seem to
>> > be reclaimed. Could this mean that with longer test time (sleep 30),
>> > more cold memory residing in large folios is getting reclaimed, as
>> > against memory just relinquished by the exiting processes?
>> 
>> I don't think a longer sleep time in the test helps much with balance.  Can
>> you try with fewer processes, and a larger memory size per process?  I guess
>> that this will improve balance.
>
> I tried this, and the data is listed below:
>
>   usemem options:
>   ---------------
>   30 processes allocate 10G each
>   cgroup memory limit = 150G
>   sleep 10
>   525Gi SSD disk swap partition
>   64K large folios enabled      
>
>   Throughput (KB/s) of each of the 30 processes:
>  ---------------------------------------------------------------
>                       mm-unstable    zswap_store of large folios
>                         9-25-2024                v7
>  zswap compressor:           zstd         zstd  deflate-iaa
>  ---------------------------------------------------------------
>                            38,393      234,485      374,427
>                            37,283      215,528      314,225
>                            37,156      214,942      304,413
>                            37,143      213,073      304,146
>                            36,814      212,904      290,186
>                            36,277      212,304      288,212
>                            36,104      212,207      285,682
>                            36,000      210,173      270,661
>                            35,994      208,487      256,960
>                            35,979      207,788      248,313
>                            35,967      207,714      235,338
>                            35,966      207,703      229,335
>                            35,835      207,690      221,697
>                            35,793      207,418      221,600
>                            35,692      206,160      219,346
>                            35,682      206,128      219,162
>                            35,681      205,817      219,155
>                            35,678      205,546      214,862
>                            35,678      205,523      214,710
>                            35,677      204,951      214,282
>                            35,677      204,283      213,441
>                            35,677      203,348      213,011
>                            35,675      203,028      212,923
>                            35,673      201,922      212,492
>                            35,672      201,660      212,225
>                            35,672      200,724      211,808
>                            35,672      200,324      211,420
>                            35,671      199,686      211,413
>                            35,667      198,858      211,346
>                            35,667      197,590      211,209
>  ---------------------------------------------------------------
>  sum                     1,081,515    6,217,964    7,268,000
>  average                    36,051      207,265      242,267
>  stddev                        655        7,010       42,234
>  elapsed time (sec)         343.70       107.40        84.34
>  sys time (sec)             269.30     2,520.13     1,696.20
>  memcg.high breaches       443,672      475,074      623,333
>  zswpout                    22,605   48,931,249   54,777,100
>  pswpout                40,004,528            0            0
>  hugepages-64K zswpout           0    3,057,090    3,421,855
>  hugepages-64K swpout    2,500,283            0            0
>  ---------------------------------------------------------------
>
> As you can see, this is quite a memory-constrained scenario, where we
> are giving 50% of the total memory required as the memory limit for the
> cgroup in which the 30 processes are run. This causes significantly more
> reclaim activity than the setup I was using thus far (70 processes, 1G
> each, 40G limit).
>
> The variance or "imbalance" reduces somewhat for zstd, but not for IAA.
>
> IAA shows really good improvements wrt zstd with zswap_store of large
> folios: 17% in throughput, 21% in elapsed time and 33% in sys time.
> These are the memory-constrained scenarios in which IAA typically
> does really well. IAA verify_compress is enabled, so we also get the
> added benefit of data integrity checks with IAA.
>
> I would like to get your and the maintainers' feedback on whether
> I should switch to this "usemem30-10G" setup for v8.

The results look good to me.  I suggest you use it.

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-26  4:52                     ` Yosry Ahmed
@ 2024-09-26 16:40                       ` Sridhar, Kanchana P
  2024-09-26 17:19                         ` Yosry Ahmed
  0 siblings, 1 reply; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-26 16:40 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, linux-kernel, linux-mm, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, September 25, 2024 9:52 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> [..]
> >
> > One thing I realized while reworking the patches for the batched checks is:
> > within zswap_store_page(), we set entry->objcg and entry->pool before
> > adding the entry to the xarray. Given this, wouldn't it be safer to get the
> > objcg and pool references per sub-page, locally in zswap_store_page(), rather
> > than obtaining batched references at the end if the store is successful? If we
> > want zswap_store_page() to be self-contained and correct with respect to the
> > entry being created and added to the xarray, it seems like the right thing to
> > do. I am a bit apprehensive about an entry being added to the xarray without
> > references held on the objcg and pool, because any page-faults/writeback that
> > occur on sub-pages already in the xarray, before the entire folio has been
> > stored, would run into issues.
> 
> We definitely should not obtain references to the pool and objcg after
> initializing the entries with them. We can obtain all references in
> zswap_store() before zswap_store_page(). IOW, the batching in this
> case should be done before the per-page operations, not after.

Thanks Yosry. IIUC, we should obtain all references to the objcg and to the
zswap_pool at the start of zswap_store.

In the case of an error on any sub-page, we will unwind state for potentially
only the stored pages, or for the entire folio if it happened to already be in
zswap and is being re-written. We might need some additional book-keeping to
keep track of how many sub-pages were found in the xarray and had
zswap_entry_free() called (nr_sb). Assuming I define a new
"obj_cgroup_put_many()", I would need to call it with (folio_nr_pages() - nr_sb).
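
For reference, such a helper could mirror the existing obj_cgroup_get_many();
a sketch (this function does not exist as of this thread; obj_cgroup's
refcount is a percpu_ref, and percpu_ref_put_many() is an existing primitive):

        /* Assumed counterpart to obj_cgroup_get_many(): */
        static inline void obj_cgroup_put_many(struct obj_cgroup *objcg,
                                               unsigned long nr)
        {
                if (objcg)
                        percpu_ref_put_many(&objcg->refcnt, nr);
        }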

As for zswap_pool_get(), there is some added complexity if we want to
keep the existing implementation that calls "percpu_ref_tryget()", assuming
it is extended to provide a new "zswap_pool_get_many()" that calls
"percpu_ref_tryget_many()". Is there a reason we use percpu_ref_tryget()
instead of percpu_ref_get()? The reason I ask is that with tryget(), if for
some reason the pool->ref is 0, no further increments will be made. If so,
upon unwinding state in zswap_store(), I would need to special-case to catch
this before calling a new "zswap_pool_put_many()".

Things could be a little simpler if zswap_pool_get() could use
"percpu_ref_get()", which always increments the refcount. Since the zswap
pool->ref is initialized to "1", this seems Ok, but I don't know if there
would be unintended consequences.

Can you please advise on what is the simplest/cleanest approach:

1) Proceed with the above changes without changing percpu_ref_tryget in
   zswap_pool_get. Needs special-casing in zswap_store to detect pool->ref
   being "0" before calling zswap_pool_put[_many].
2) Modify zswap_pool_get/zswap_pool_get_many to use percpu_ref_get_many
   and avoid special-casing to detect pool->ref being "0" before calling
   zswap_pool_put[_many].
3) Keep the approach in v7 where obj_cgroup_get/put is localized to
   zswap_store_page for both success and error conditions, and any unwinding
   of state in zswap_store will take care of dropping references obtained
   from prior successful writes (from this or prior invocations of zswap_store).

Thanks,
Kanchana

> 
> >
> > Just wanted to run this by you. The rest of the batched charging, atomic
> > and stat updates should be Ok.
> >
> > Thanks,
> > Kanchana
> >
> > >
> > > Thanks,
> > > Kanchana

^ permalink raw reply	[flat|nested] 79+ messages in thread

* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-26 16:40                       ` Sridhar, Kanchana P
@ 2024-09-26 17:19                         ` Yosry Ahmed
  2024-09-26 17:29                           ` Sridhar, Kanchana P
  0 siblings, 1 reply; 79+ messages in thread
From: Yosry Ahmed @ 2024-09-26 17:19 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Johannes Weiner, linux-kernel, linux-mm, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh

On Thu, Sep 26, 2024 at 9:40 AM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@google.com>
> > Sent: Wednesday, September 25, 2024 9:52 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-kernel@vger.kernel.org;
> > linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> > zswap_store().
> >
> > [..]
> > >
> > > One thing I realized while reworking the patches for the batched checks is:
> > > within zswap_store_page(), we set entry->objcg and entry->pool before
> > > adding the entry to the xarray. Given this, wouldn't it be safer to get the
> > > objcg and pool references per sub-page, locally in zswap_store_page(), rather
> > > than obtaining batched references at the end if the store is successful? If we
> > > want zswap_store_page() to be self-contained and correct with respect to the
> > > entry being created and added to the xarray, it seems like the right thing to
> > > do. I am a bit apprehensive about an entry being added to the xarray without
> > > references held on the objcg and pool, because any page-faults/writeback that
> > > occur on sub-pages already in the xarray, before the entire folio has been
> > > stored, would run into issues.
> >
> > We definitely should not obtain references to the pool and objcg after
> > initializing the entries with them. We can obtain all references in
> > zswap_store() before zswap_store_page(). IOW, the batching in this
> > case should be done before the per-page operations, not after.
>
> Thanks Yosry. IIUC, we should obtain all references to the objcg and to the
> zswap_pool at the start of zswap_store.
>
> In the case of an error on any sub-page, we will unwind state for potentially
> only the stored pages, or for the entire folio if it happened to already be in
> zswap and is being re-written. We might need some additional book-keeping to
> keep track of how many sub-pages were found in the xarray and had
> zswap_entry_free() called (nr_sb). Assuming I define a new
> "obj_cgroup_put_many()", I would need to call it with (folio_nr_pages() - nr_sb).
>
> As for zswap_pool_get(), there is some added complexity if we want to
> keep the existing implementation that calls "percpu_ref_tryget()", assuming
> it is extended to provide a new "zswap_pool_get_many()" that calls
> "percpu_ref_tryget_many()". Is there a reason we use percpu_ref_tryget()
> instead of percpu_ref_get()? The reason I ask is that with tryget(), if for
> some reason the pool->ref is 0, no further increments will be made. If so,
> upon unwinding state in zswap_store(), I would need to special-case to catch
> this before calling a new "zswap_pool_put_many()".
>
> Things could be a little simpler if zswap_pool_get() could use
> "percpu_ref_get()", which always increments the refcount. Since the zswap
> pool->ref is initialized to "1", this seems Ok, but I don't know if there
> would be unintended consequences.
>
> Can you please advise on what is the simplest/cleanest approach:
>
> 1) Proceed with the above changes without changing percpu_ref_tryget in
>    zswap_pool_get. Needs special-casing in zswap_store to detect pool->ref
>    being "0" before calling zswap_pool_put[_many].

My assumption is that we can reorder the code such that if
zswap_pool_get_many() fails we don't call zswap_pool_put_many() to
begin with (e.g. jump to a label after zswap_pool_put_many()).
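
For example, something of this shape (the labels and the _many() helpers are
made up for illustration):

        if (!zswap_pool_get_many(pool, nr_pages))   /* assumed helper */
                goto put_objcg;     /* jump past the pool puts entirely */

        /* ... per-page stores; the success path returns before the labels ... */

store_failed:
        zswap_pool_put_many(pool, nr_pages);        /* assumed helper */
put_objcg:
        obj_cgroup_put_many(objcg, nr_pages);       /* assumed helper */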

> 2) Modify zswap_pool_get/zswap_pool_get_many to use percpu_ref_get_many
>     and avoid special-casing to detect pool->ref being "0" before calling
>     zswap_pool_put[_many].

I don't think we can simply switch the tryget to a get, as I believe
we can race with the pool being destroyed: once pool->ref has dropped to
zero, the pool may already be on its way to being freed, so a failing
tryget is the safe outcome there, while an unconditional get would bump
the refcount of a pool that is going away.

> 3) Keep the approach in v7 where obj_cgroup_get/put is localized to
>    zswap_store_page for both success and error conditions, and any unwinding
>    of state in zswap_store will take care of dropping references obtained
>    from prior successful writes (from this or prior invocations of zswap_store).

I am also fine with doing that and doing the reference batching as a follow-up.


>
> Thanks,
> Kanchana
>
> >
> > >
> > > Just wanted to run this by you. The rest of the batched charging, atomic
> > > and stat updates should be Ok.
> > >
> > > Thanks,
> > > Kanchana
> > >
> > > >
> > > > Thanks,
> > > > Kanchana


^ permalink raw reply	[flat|nested] 79+ messages in thread

* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-26 17:19                         ` Yosry Ahmed
@ 2024-09-26 17:29                           ` Sridhar, Kanchana P
  2024-09-26 17:34                             ` Yosry Ahmed
  2024-09-26 18:43                             ` Johannes Weiner
  0 siblings, 2 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-26 17:29 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, linux-kernel, linux-mm, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Thursday, September 26, 2024 10:20 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Thu, Sep 26, 2024 at 9:40 AM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosryahmed@google.com>
> > > Sent: Wednesday, September 25, 2024 9:52 PM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-
> kernel@vger.kernel.org;
> > > linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > > usamaarif642@gmail.com; shakeel.butt@linux.dev;
> ryan.roberts@arm.com;
> > > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> > > zswap_store().
> > >
> > > [..]
> > > >
> > > > One thing I realized while reworking the patches for the batched checks is:
> > > > within zswap_store_page(), we set entry->objcg and entry->pool before
> > > > adding the entry to the xarray. Given this, wouldn't it be safer to get the
> > > > objcg and pool references per sub-page, locally in zswap_store_page(), rather
> > > > than obtaining batched references at the end if the store is successful? If we
> > > > want zswap_store_page() to be self-contained and correct with respect to the
> > > > entry being created and added to the xarray, it seems like the right thing to
> > > > do. I am a bit apprehensive about an entry being added to the xarray without
> > > > references held on the objcg and pool, because any page-faults/writeback that
> > > > occur on sub-pages already in the xarray, before the entire folio has been
> > > > stored, would run into issues.
> > >
> > > We definitely should not obtain references to the pool and objcg after
> > > initializing the entries with them. We can obtain all references in
> > > zswap_store() before zswap_store_page(). IOW, the batching in this
> > > case should be done before the per-page operations, not after.
> >
> > Thanks Yosry. IIUC, we should obtain all references to the objcg and to the
> > zswap_pool at the start of zswap_store.
> >
> > In the case of an error on any sub-page, we will unwind state for potentially
> > only the stored pages, or for the entire folio if it happened to already be in
> > zswap and is being re-written. We might need some additional book-keeping to
> > keep track of how many sub-pages were found in the xarray and had
> > zswap_entry_free() called (nr_sb). Assuming I define a new
> > "obj_cgroup_put_many()", I would need to call it with (folio_nr_pages() - nr_sb).
> >
> > As for zswap_pool_get(), there is some added complexity if we want to
> > keep the existing implementation that calls "percpu_ref_tryget()", assuming
> > it is extended to provide a new "zswap_pool_get_many()" that calls
> > "percpu_ref_tryget_many()". Is there a reason we use percpu_ref_tryget()
> > instead of percpu_ref_get()? The reason I ask is that with tryget(), if for
> > some reason the pool->ref is 0, no further increments will be made. If so,
> > upon unwinding state in zswap_store(), I would need to special-case to catch
> > this before calling a new "zswap_pool_put_many()".
> >
> > Things could be a little simpler if zswap_pool_get() could use
> > "percpu_ref_get()", which always increments the refcount. Since the zswap
> > pool->ref is initialized to "1", this seems Ok, but I don't know if there
> > would be unintended consequences.
> >
> > Can you please advise on what is the simplest/cleanest approach:
> >
> > 1) Proceed with the above changes without changing percpu_ref_tryget in
> >    zswap_pool_get. Needs special-casing in zswap_store to detect pool->ref
> >    being "0" before calling zswap_pool_put[_many].
> 
> My assumption is that we can reorder the code such that if
> zswap_pool_get_many() fails we don't call zswap_pool_put_many() to
> begin with (e.g. jump to a label after zswap_pool_put_many()).

However, the pool refcount could change between the start and end of
zswap_store.

> 
> > 2) Modify zswap_pool_get/zswap_pool_get_many to use
> >    percpu_ref_get_many and avoid special-casing to detect pool->ref
> >    being "0" before calling zswap_pool_put[_many].
> 
> I don't think we can simply switch the tryget to a get, as I believe
> we can race with the pool being destroyed.

That was my initial thought as well, but I figured this couldn't happen
since the pool->ref is initialized to "1" in the existing
implementation. In any case, I can understand the intent of the use
of "tryget"; it is just that it adds to the considerations for reference
batching.
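
As a concrete reference, here is a minimal sketch of the hypothetical
"obj_cgroup_put_many()" mentioned above; it simply mirrors the existing
obj_cgroup_get_many() helper in include/linux/memcontrol.h (illustrative
only, not final code):

static inline void obj_cgroup_put_many(struct obj_cgroup *objcg,
				       unsigned long nr)
{
	/* drop nr references at once, like nr calls to obj_cgroup_put() */
	percpu_ref_put_many(&objcg->refcnt, nr);
}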

> 
> > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to
> >    zswap_store_page for both success and error conditions, and any
> >    unwinding state in zswap_store will take care of dropping references
> >    obtained from prior successful writes (from this or prior invocations
> >    of zswap_store).
> 
> I am also fine with doing that and doing the reference batching as a follow up.

I think so too! We could try and improve upon (3) with reference batching
in a follow-up patch.

Thanks,
Kanchana

> 
> 
> >
> > Thanks,
> > Kanchana
> >
> > >
> > > >
> > > > Just wanted to run this by you. The rest of the batched charging, atomic
> > > > and stat updates should be Ok.
> > > >
> > > > Thanks,
> > > > Kanchana
> > > >
> > > > >
> > > > > Thanks,
> > > > > Kanchana


* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-26 17:29                           ` Sridhar, Kanchana P
@ 2024-09-26 17:34                             ` Yosry Ahmed
  2024-09-26 19:36                               ` Sridhar, Kanchana P
  2024-09-26 18:43                             ` Johannes Weiner
  1 sibling, 1 reply; 79+ messages in thread
From: Yosry Ahmed @ 2024-09-26 17:34 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Johannes Weiner, linux-kernel, linux-mm, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh

On Thu, Sep 26, 2024 at 10:29 AM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@google.com>
> > Sent: Thursday, September 26, 2024 10:20 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-kernel@vger.kernel.org;
> > linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> > zswap_store().
> >
> > On Thu, Sep 26, 2024 at 9:40 AM Sridhar, Kanchana P
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > > > -----Original Message-----
> > > > From: Yosry Ahmed <yosryahmed@google.com>
> > > > Sent: Wednesday, September 25, 2024 9:52 PM
> > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > > Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-
> > kernel@vger.kernel.org;
> > > > linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > > > usamaarif642@gmail.com; shakeel.butt@linux.dev;
> > ryan.roberts@arm.com;
> > > > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> > > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > > > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> > > > zswap_store().
> > > >
> > > > [..]
> > > > >
> > > > > One thing I realized while reworking the patches for the batched checks
> > is:
> > > > > within zswap_store_page(), we set the entry->objcg and entry->pool
> > before
> > > > > adding it to the xarray. Given this, wouldn't it be safer to get the objcg
> > > > > and pool reference per sub-page, locally in zswap_store_page(), rather
> > than
> > > > > obtaining batched references at the end if the store is successful? If we
> > > > want
> > > > > zswap_store_page() to be self-contained and correct as far as the entry
> > > > > being created and added to the xarray, it seems like the right thing to
> > do?
> > > > > I am a bit apprehensive about the entry being added to the xarray
> > without
> > > > > a reference obtained on the objcg and pool, because any page-
> > > > faults/writeback
> > > > > that occur on sub-pages added to the xarray before the entire folio has
> > been
> > > > > stored, would run into issues.
> > > >
> > > > We definitely should not obtain references to the pool and objcg after
> > > > initializing the entries with them. We can obtain all references in
> > > > zswap_store() before zswap_store_page(). IOW, the batching in this
> > > > case should be done before the per-page operations, not after.
> > >
> > > Thanks Yosry. IIUC, we should obtain all references to the objcg and to the
> > > zswap_pool at the start of zswap_store.
> > >
> > > In the case of error on any sub-page, we will unwind state for potentially
> > > only the stored pages or the entire folio if it happened to already be in
> > zswap
> > > and is being re-written. We might need some additional book-keeping to
> > > keep track of which sub-pages were found in the xarray and
> > zswap_entry_free()
> > > got called (nr_sb). Assuming I define a new "obj_cgroup_put_many()", I
> > would need
> > > to call this with (folio_nr_pages() - nr_sb).
> > >
> > > As far as zswap_pool_get(), there is some added complexity if we want to
> > > keep the existing implementation that calls "percpu_ref_tryget()", and
> > assuming
> > > this is extended to provide a new "zswap_pool_get_many()" that calls
> > > "percpu_ref_tryget_many()". Is there a reason we use percpu_ref_tryget()
> > instead
> > > of percpu_ref_get()? Reason I ask is, with tryget(), if for some reason the
> > pool->ref
> > > is 0, no further increments will be made. If so, upon unwinding state in
> > > zswap_store(), I would need to special-case to catch this before calling a
> > new
> > > "zswap_pool_put_many()".
> > >
> > > Things could be a little simpler if zswap_pool_get() can use
> > "percpu_ref_get()"
> > > which will always increment the refcount. Since the zswap pool->ref is
> > initialized
> > > to "1", this seems Ok, but I don't know if there will be unintended
> > consequences.
> > >
> > > Can you please advise on what is the simplest/cleanest approach:
> > >
> > > 1) Proceed with the above changes without changing percpu_ref_tryget in
> > >      zswap_pool_get. Needs special-casing in zswap_store to detect pool-
> > >ref
> > >     being "0" before calling zswap_pool_put[_many].
> >
> > My assumption is that we can reorder the code such that if
> > zswap_pool_get_many() fails we don't call zswap_pool_put_many() to
> > begin with (e.g. jump to a label after zswap_pool_put_many()).
>
> However, the pool refcount could change between the start and end of
> zswap_store.

I am not sure what you mean. If zswap_pool_get_many() fails then we
just do not call zswap_pool_put_many() at all and abort.
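
Roughly, as a control-flow sketch (zswap_pool_get_many(),
zswap_pool_put_many() and zswap_store_pages() are the hypothetical
batched helpers from this discussion, not existing code):

	if (!zswap_pool_get_many(pool, nr_pages))
		goto reject;		/* took no refs, so nothing to put */

	if (!zswap_store_pages(folio, objcg, pool))
		goto put_refs;		/* unwind the batched references */

	return true;

put_refs:
	zswap_pool_put_many(pool, nr_pages);
reject:
	return false;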

>
> >
> > > 2) Modify zswap_pool_get/zswap_pool_get_many to use
> > percpu_ref_get_many
> > >     and avoid special-casing to detect pool->ref being "0" before calling
> > >     zswap_pool_put[_many].
> >
> > I don't think we can simply switch the tryget to a get, as I believe
> > we can race with the pool being destroyed.
>
> That was my initial thought as well, but I figured this couldn't happen
> since the pool->ref is initialized to "1", and based on the existing
> implementation. In any case, I can understand the intent of the use
> of "tryget"; it is just that it adds to the considerations for reference
> batching.

The initial ref can be dropped in __zswap_param_set() if a new pool is
created (see the call to percpu_ref_kill()).
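
In other words, the tryget in the existing zswap_pool_get() is what
makes the lookup safe. As a sketch of the idea (pool->ref is the
percpu_ref in the existing zswap code; this is not the verbatim
implementation):

static bool zswap_pool_get(struct zswap_pool *pool)
{
	/*
	 * Once __zswap_param_set() has dropped the initial reference
	 * via percpu_ref_kill(), pool->ref can fall to zero at any
	 * time. percpu_ref_tryget() fails safely in that case, while
	 * a plain percpu_ref_get() would resurrect a dying refcount.
	 */
	return percpu_ref_tryget(&pool->ref);
}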

>
> >
> > > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to
> > >     zswap_store_page for both success and error conditions, and any
> > unwinding
> > >     state in zswap_store will take care of dropping references obtained from
> > >     prior successful writes (from this or prior invocations of zswap_store).
> >
> > I am also fine with doing that and doing the reference batching as a follow up.
>
> I think so too! We could try and improve upon (3) with reference batching
> in a follow-up patch.

SGTM.



* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-26 17:29                           ` Sridhar, Kanchana P
  2024-09-26 17:34                             ` Yosry Ahmed
@ 2024-09-26 18:43                             ` Johannes Weiner
  2024-09-26 18:45                               ` Yosry Ahmed
  2024-09-26 19:39                               ` Sridhar, Kanchana P
  1 sibling, 2 replies; 79+ messages in thread
From: Johannes Weiner @ 2024-09-26 18:43 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Yosry Ahmed, linux-kernel, linux-mm, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh

On Thu, Sep 26, 2024 at 05:29:30PM +0000, Sridhar, Kanchana P wrote:
> > > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to
> > >     zswap_store_page for both success and error conditions, and any
> > unwinding
> > >     state in zswap_store will take care of dropping references obtained from
> > >     prior successful writes (from this or prior invocations of zswap_store).
> > 
> > I am also fine with doing that and doing the reference batching as a follow up.
> 
> I think so too! We could try and improve upon (3) with reference batching
> in a follow-up patch.

Yeah, I agree. The percpu-refcounts are not that expensive; we should
be able to live with per-page ops for now.

One thing you *can* do from the start is tryget a pool reference in
zswap_store(), to prevent the pool's untimely demise while you work on
it, and then in zswap_store_page() you can do gets instead of trygets.

You'd have to rename zswap_pool_get() to zswap_pool_tryget() (which is
probably for the best) and implement the trivial new zswap_pool_get().
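
Presumably something along these lines (a sketch; only the pool->ref
percpu_ref is taken from the existing code, the rest is illustrative):

/* Fallible: for lookups where we do not already hold a reference. */
static bool zswap_pool_tryget(struct zswap_pool *pool)
{
	return percpu_ref_tryget(&pool->ref);
}

/*
 * Infallible: only valid while the caller already holds a reference,
 * e.g. the per-page gets in zswap_store_page() after the tryget in
 * zswap_store().
 */
static void zswap_pool_get(struct zswap_pool *pool)
{
	percpu_ref_get(&pool->ref);
}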



* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-26 18:43                             ` Johannes Weiner
@ 2024-09-26 18:45                               ` Yosry Ahmed
  2024-09-26 19:40                                 ` Sridhar, Kanchana P
  2024-09-26 19:39                               ` Sridhar, Kanchana P
  1 sibling, 1 reply; 79+ messages in thread
From: Yosry Ahmed @ 2024-09-26 18:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Sridhar, Kanchana P, linux-kernel, linux-mm, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang,
	Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal,
	Vinodh

On Thu, Sep 26, 2024 at 11:43 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Thu, Sep 26, 2024 at 05:29:30PM +0000, Sridhar, Kanchana P wrote:
> > > > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to
> > > >     zswap_store_page for both success and error conditions, and any
> > > unwinding
> > > >     state in zswap_store will take care of dropping references obtained from
> > > >     prior successful writes (from this or prior invocations of zswap_store).
> > >
> > > I am also fine with doing that and doing the reference batching as a follow up.
> >
> > I think so too! We could try and improve upon (3) with reference batching
> > in a follow-up patch.
>
> > Yeah, I agree. The percpu-refcounts are not that expensive; we should
> > be able to live with per-page ops for now.
> >
> > One thing you *can* do from the start is tryget a pool reference in
> > zswap_store(), to prevent the pool's untimely demise while you work on
> > it, and then in zswap_store_page() you can do gets instead of trygets.
>
> You'd have to rename zswap_pool_get() to zswap_pool_tryget() (which is
> probably for the best) and implement the trivial new zswap_pool_get().

Yeah, I was actually planning to send a follow-up patch to do exactly
that until we figure out proper batching for the refcounts. Even
better if Kanchana incorporates it in the next version :)



* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-26 17:34                             ` Yosry Ahmed
@ 2024-09-26 19:36                               ` Sridhar, Kanchana P
  0 siblings, 0 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-26 19:36 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, linux-kernel, linux-mm, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Thursday, September 26, 2024 10:35 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Thu, Sep 26, 2024 at 10:29 AM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosryahmed@google.com>
> > > Sent: Thursday, September 26, 2024 10:20 AM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-
> kernel@vger.kernel.org;
> > > linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > > usamaarif642@gmail.com; shakeel.butt@linux.dev;
> ryan.roberts@arm.com;
> > > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> > > zswap_store().
> > >
> > > On Thu, Sep 26, 2024 at 9:40 AM Sridhar, Kanchana P
> > > <kanchana.p.sridhar@intel.com> wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: Yosry Ahmed <yosryahmed@google.com>
> > > > > Sent: Wednesday, September 25, 2024 9:52 PM
> > > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > > > Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-
> > > kernel@vger.kernel.org;
> > > > > linux-mm@kvack.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev;
> > > > > usamaarif642@gmail.com; shakeel.butt@linux.dev;
> > > ryan.roberts@arm.com;
> > > > > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com;
> akpm@linux-
> > > > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > > > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> > > > > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> > > > > zswap_store().
> > > > >
> > > > > [..]
> > > > > >
> > > > > > One thing I realized while reworking the patches for the batched
> checks
> > > is:
> > > > > > within zswap_store_page(), we set the entry->objcg and entry->pool
> > > before
> > > > > > adding it to the xarray. Given this, wouldn't it be safer to get the
> objcg
> > > > > > and pool reference per sub-page, locally in zswap_store_page(),
> rather
> > > than
> > > > > > obtaining batched references at the end if the store is successful? If
> we
> > > > > want
> > > > > > zswap_store_page() to be self-contained and correct as far as the
> entry
> > > > > > being created and added to the xarray, it seems like the right thing to
> > > do?
> > > > > > I am a bit apprehensive about the entry being added to the xarray
> > > without
> > > > > > a reference obtained on the objcg and pool, because any page-
> > > > > faults/writeback
> > > > > > that occur on sub-pages added to the xarray before the entire folio
> has
> > > been
> > > > > > stored, would run into issues.
> > > > >
> > > > > We definitely should not obtain references to the pool and objcg after
> > > > > initializing the entries with them. We can obtain all references in
> > > > > zswap_store() before zswap_store_page(). IOW, the batching in this
> > > > > case should be done before the per-page operations, not after.
> > > >
> > > > Thanks Yosry. IIUC, we should obtain all references to the objcg and to
> the
> > > > zswap_pool at the start of zswap_store.
> > > >
> > > > In the case of error on any sub-page, we will unwind state for potentially
> > > > only the stored pages or the entire folio if it happened to already be in
> > > zswap
> > > > and is being re-written. We might need some additional book-keeping to
> > > > keep track of which sub-pages were found in the xarray and
> > > zswap_entry_free()
> > > > got called (nr_sb). Assuming I define a new "obj_cgroup_put_many()", I
> > > would need
> > > > to call this with (folio_nr_pages() - nr_sb).
> > > >
> > > > As far as zswap_pool_get(), there is some added complexity if we want
> to
> > > > keep the existing implementation that calls "percpu_ref_tryget()", and
> > > assuming
> > > > this is extended to provide a new "zswap_pool_get_many()" that calls
> > > > "percpu_ref_tryget_many()". Is there a reason we use
> percpu_ref_tryget()
> > > instead
> > > > of percpu_ref_get()? Reason I ask is, with tryget(), if for some reason the
> > > pool->ref
> > > > is 0, no further increments will be made. If so, upon unwinding state in
> > > > zswap_store(), I would need to special-case to catch this before calling a
> > > new
> > > > "zswap_pool_put_many()".
> > > >
> > > > Things could be a little simpler if zswap_pool_get() can use
> > > "percpu_ref_get()"
> > > > which will always increment the refcount. Since the zswap pool->ref is
> > > initialized
> > > > to "1", this seems Ok, but I don't know if there will be unintended
> > > consequences.
> > > >
> > > > Can you please advise on what is the simplest/cleanest approach:
> > > >
> > > > 1) Proceed with the above changes without changing percpu_ref_tryget
> in
> > > >      zswap_pool_get. Needs special-casing in zswap_store to detect pool-
> > > >ref
> > > >     being "0" before calling zswap_pool_put[_many].
> > >
> > > My assumption is that we can reorder the code such that if
> > > zswap_pool_get_many() fails we don't call zswap_pool_put_many() to
> > > begin with (e.g. jump to a label after zswap_pool_put_many()).
> >
> > However, the pool refcount could change between the start and end of
> > zswap_store.
> 
> I am not sure what you mean. If zswap_pool_get_many() fails then we
> just do not call zswap_pool_put_many() at all and abort.

I guess I was thinking of a scenario where zswap_pool_get_many() returns
true, and the pool refcount subsequently reaches 0 before the
zswap_pool_put_many(). I just realized this shouldn’t happen, since we
would be holding references for that entire window, so I think we are OK.
Will think about this some more while creating the follow-up patch.
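
Concretely, the ordering that makes this safe (control flow only,
sketched with the current helper names, not the final patch):

bool zswap_store(struct folio *folio)
{
	struct zswap_pool *pool;

	pool = zswap_pool_current_get();	/* tryget: pins the pool */
	if (!pool)
		return false;	/* pool is dying: abort, nothing to put */

	/* ... per-page stores: pool->ref stays >= 1 throughout ... */

	zswap_pool_put(pool);	/* drop the reference pinned above */
	return true;
}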

> 
> >
> > >
> > > > 2) Modify zswap_pool_get/zswap_pool_get_many to use
> > > percpu_ref_get_many
> > > >     and avoid special-casing to detect pool->ref being "0" before calling
> > > >     zswap_pool_put[_many].
> > >
> > > I don't think we can simply switch the tryget to a get, as I believe
> > > we can race with the pool being destroyed.
> >
> > That was my initial thought as well, but I figured this couldn't happen
> > since the pool->ref is initialized to "1" in the existing
> > implementation. In any case, I can understand the intent of the use
> > of "tryget"; it is just that it adds to the considerations for reference
> > batching.
> 
> The initial ref can be dropped in __zswap_param_set() if a new pool is
> created (see the call to percpu_ref_kill()).

I see... this makes sense, thanks Yosry!

> 
> >
> > >
> > > > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to
> > > >     zswap_store_page for both success and error conditions, and any
> > > unwinding
> > > >     state in zswap_store will take care of dropping references obtained
> from
> > > >     prior successful writes (from this or prior invocations of zswap_store).
> > >
> > > I am also fine with doing that and doing the reference batching as a follow
> up.
> >
> > I think so too! We could try and improve upon (3) with reference batching
> > in a follow-up patch.
> 
> SGTM.

Thanks, will proceed!



* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-26 18:43                             ` Johannes Weiner
  2024-09-26 18:45                               ` Yosry Ahmed
@ 2024-09-26 19:39                               ` Sridhar, Kanchana P
  1 sibling, 0 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-26 19:39 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Yosry Ahmed, linux-kernel, linux-mm, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P

> -----Original Message-----
> From: Johannes Weiner <hannes@cmpxchg.org>
> Sent: Thursday, September 26, 2024 11:43 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Yosry Ahmed <yosryahmed@google.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Thu, Sep 26, 2024 at 05:29:30PM +0000, Sridhar, Kanchana P wrote:
> > > > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to
> > > >     zswap_store_page for both success and error conditions, and any
> > > unwinding
> > > >     state in zswap_store will take care of dropping references obtained
> from
> > > >     prior successful writes (from this or prior invocations of zswap_store).
> > >
> > > I am also fine with doing that and doing the reference batching as a follow
> up.
> >
> > I think so too! We could try and improve upon (3) with reference batching
> > in a follow-up patch.
> 
> Yeah, I agree. The percpu-refcounts are not that expensive; we should
> be able to live with per-page ops for now.
> 
> One thing you *can* do from the start is tryget a pool reference in
> zswap_store(), to prevent the pool's untimely demise while you work on
> it, and then in zswap_store_page() you can do gets instead of trygets.

Sure, this sounds good, Johannes; thanks for the suggestion! I already
do a zswap_pool_current_get() at the beginning of zswap_store in the
v7 code, for this purpose.
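
With that in place, the per-page path can use plain gets, along these
lines (a sketch; the zswap_store_page() signature and the infallible
zswap_pool_get() are assumptions based on this discussion):

static bool zswap_store_page(struct page *page, struct obj_cgroup *objcg,
			     struct zswap_pool *pool)
{
	struct zswap_entry *entry;

	entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
	if (!entry)
		return false;

	/*
	 * A plain get is safe here: zswap_store() already pinned the
	 * pool with a tryget, so pool->ref cannot be zero.
	 */
	zswap_pool_get(pool);
	entry->pool = pool;

	if (objcg)
		obj_cgroup_get(objcg);
	entry->objcg = objcg;

	/* ... compress the page and insert the entry into the xarray ... */
	return true;
}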

> 
> You'd have to rename zswap_pool_get() to zswap_pool_tryget() (which is
> probably for the best) and implement the trivial new zswap_pool_get().

Ok, will do so.

Thanks,
Kanchana



* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-26 18:45                               ` Yosry Ahmed
@ 2024-09-26 19:40                                 ` Sridhar, Kanchana P
  0 siblings, 0 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-26 19:40 UTC (permalink / raw)
  To: Yosry Ahmed, Johannes Weiner
  Cc: linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642,
	shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou,
	Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P

> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Thursday, September 26, 2024 11:46 AM
> To: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
> 
> On Thu, Sep 26, 2024 at 11:43 AM Johannes Weiner <hannes@cmpxchg.org>
> wrote:
> >
> > On Thu, Sep 26, 2024 at 05:29:30PM +0000, Sridhar, Kanchana P wrote:
> > > > > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to
> > > > >     zswap_store_page for both success and error conditions, and any
> > > > unwinding
> > > > >     state in zswap_store will take care of dropping references obtained
> from
> > > > >     prior successful writes (from this or prior invocations of
> zswap_store).
> > > >
> > > > I am also fine with doing that and doing the reference batching as a
> follow up.
> > >
> > > I think so too! We could try and improve upon (3) with reference batching
> > > in a follow-up patch.
> >
> > Yeah, I agree. The percpu-refcounts are not that expensive; we should
> > be able to live with per-page ops for now.
> >
> > One thing you *can* do from the start is tryget a pool reference in
> > zswap_store(), to prevent the pool's untimely demise while you work on
> > it, and then in zswap_store_page() you can do gets instead of trygets.
> >
> > You'd have to rename zswap_pool_get() to zswap_pool_tryget() (which is
> > probably for the best) and implement the trivial new zswap_pool_get().
> 
> Yeah, I was actually planning to send a follow-up patch to do exactly
> that until we figure out proper batching for the refcounts. Even
> better if Kanchana incorporates it in the next version :)

Sure, Yosry, I will incorporate it in the next version!

Thanks again,
Kanchana


* RE: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
  2024-09-26  6:47         ` Huang, Ying
@ 2024-09-26 21:44           ` Sridhar, Kanchana P
  0 siblings, 0 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-26 21:44 UTC (permalink / raw)
  To: Huang, Ying
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh,
	Sridhar, Kanchana P

> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Wednesday, September 25, 2024 11:48 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali,
> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
> 
> "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:
> 
> > Hi Ying,
> >
> >> -----Original Message-----
> >> From: Huang, Ying <ying.huang@intel.com>
> >> Sent: Wednesday, September 25, 2024 5:45 PM
> >> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> >> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> >> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com;
> >> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> >> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
> >> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>;
> Feghali,
> >> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> >> <vinodh.gopal@intel.com>
> >> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
> >>
> >> "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:
> >>
> >> >> -----Original Message-----
> >> >> From: Huang, Ying <ying.huang@intel.com>
> >> >> Sent: Tuesday, September 24, 2024 11:35 PM
> >> >> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> >> >> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> >> >> hannes@cmpxchg.org; yosryahmed@google.com;
> nphamcs@gmail.com;
> >> >> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> >> >> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com;
> >> >> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>;
> >> Feghali,
> >> >> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> >> >> <vinodh.gopal@intel.com>
> >> >> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
> >> >>
> >> >> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes:
> >> >>
> >> >> [snip]
> >> >>
> >> >> >
> >> >> > Case 1: Comparing zswap 4K vs. zswap mTHP
> >> >> > =========================================
> >> >> >
> >> >> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, which
> >> >> > results in 64K/2M (m)THP being split into 4K folios that get
> >> >> > processed by zswap.
> >> >> >
> >> >> > The "after" is CONFIG_THP_SWAP set to on, plus this patch-series,
> >> >> > which results in 64K/2M (m)THP not being split, and being processed
> >> >> > by zswap.
> >> >> >
> >> >> >  64KB mTHP (cgroup memory.high set to 40G):
> >> >> >  ==========================================
> >> >> >
> >> >> >  -------------------------------------------------------------------------------
> >> >> >                    mm-unstable 9-23-2024              zswap-mTHP      Change wrt
> >> >> >                        CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y        Baseline
> >> >> >                                 Baseline
> >> >> >  -------------------------------------------------------------------------------
> >> >> >  ZSWAP compressor       zstd     deflate-        zstd    deflate-  zstd deflate-
> >> >> >                                       iaa                     iaa            iaa
> >> >> >  -------------------------------------------------------------------------------
> >> >> >  Throughput (KB/s)   143,323      125,485     153,550     129,609    7%       3%
> >> >> >  elapsed time (sec)    24.97        25.42       23.90       25.19    4%       1%
> >> >> >  sys time (sec)       822.72       750.96      757.70      731.13    8%       3%
> >> >> >  memcg_high          132,743      169,825     148,075     192,744
> >> >> >  memcg_swap_fail     639,067      841,553       2,204       2,215
> >> >> >  pswpin                    0            0           0           0
> >> >> >  pswpout                   0            0           0           0
> >> >> >  zswpin                  795          873         760         902
> >> >> >  zswpout          10,011,266   13,195,137  10,010,017  13,193,554
> >> >> >  thp_swpout                0            0           0           0
> >> >> >  thp_swpout_               0            0           0           0
> >> >> >   fallback
> >> >> >  64kB-mthp_          639,065      841,553       2,204       2,215
> >> >> >   swpout_fallback
> >> >> >  pgmajfault            2,861        2,924       3,054       3,259
> >> >> >  ZSWPOUT-64kB            n/a          n/a     623,451     822,268
> >> >> >  SWPOUT-64kB               0            0           0           0
> >> >> >  -------------------------------------------------------------------------------
> >> >> >
> >> >>
> >> >> IIUC, the throughput is the sum of the throughputs of all usemem
> >> >> processes?
> >> >>
> >> >> One possible issue with the usemem test case is the "imbalance"
> >> >> issue.  That is, some usemem processes may swap-out/swap-in less, so
> >> >> their score is very high, while other processes may swap-out/swap-in
> >> >> more, so their score is very low.  Sometimes the total score
> >> >> decreases, but the scores of the usemem processes are more balanced,
> >> >> so the performance should be considered better.  In general, we
> >> >> should make the usemem scores balanced among processes via, say, a
> >> >> longer test time.  Can you check this in your test results?
> >> >
> >> > Actually, the throughput data listed in the cover-letter is the
> >> > average of all the usemem processes. Your observation about the
> >> > "imbalance" issue is right. Some processes see a higher throughput
> >> > than others. I have noticed that the throughputs progressively reduce
> >> > as the individual processes exit and print their stats.
> >> >
> >> > Listed below are the stats from two runs of usemem70: sleep 10 and
> >> > sleep 30. Both are run with a cgroup mem-limit of 40G. Data is with
> >> > v7, 64K folios are enabled, zswap uses zstd.
> >> >
> >> >
> >> > -----------------------------------------------
> >> >                sleep 10           sleep 30
> >> >       Throughput (KB/s)  Throughput (KB/s)
> >> >  -----------------------------------------------
> >> >                 181,540            191,686
> >> >                 179,651            191,459
> >> >                 179,068            188,834
> >> >                 177,244            187,568
> >> >                 177,215            186,703
> >> >                 176,565            185,584
> >> >                 176,546            185,370
> >> >                 176,470            185,021
> >> >                 176,214            184,303
> >> >                 176,128            184,040
> >> >                 175,279            183,932
> >> >                 174,745            180,831
> >> >                 173,935            179,418
> >> >                 161,546            168,014
> >> >                 160,332            167,540
> >> >                 160,122            167,364
> >> >                 159,613            167,020
> >> >                 159,546            166,590
> >> >                 159,021            166,483
> >> >                 158,845            166,418
> >> >                 158,426            166,264
> >> >                 158,396            166,066
> >> >                 158,371            165,944
> >> >                 158,298            165,866
> >> >                 158,250            165,884
> >> >                 158,057            165,533
> >> >                 158,011            165,532
> >> >                 157,899            165,457
> >> >                 157,894            165,424
> >> >                 157,839            165,410
> >> >                 157,731            165,407
> >> >                 157,629            165,273
> >> >                 157,626            164,867
> >> >                 157,581            164,636
> >> >                 157,471            164,266
> >> >                 157,430            164,225
> >> >                 157,287            163,290
> >> >                 156,289            153,597
> >> >                 153,970            147,494
> >> >                 148,244            147,102
> >> >                 142,907            146,111
> >> >                 142,811            145,789
> >> >                 139,171            141,168
> >> >                 136,314            140,714
> >> >                 133,616            140,111
> >> >                 132,881            139,636
> >> >                 132,729            136,943
> >> >                 132,680            136,844
> >> >                 132,248            135,726
> >> >                 132,027            135,384
> >> >                 131,929            135,270
> >> >                 131,766            134,748
> >> >                 131,667            134,733
> >> >                 131,576            134,582
> >> >                 131,396            134,302
> >> >                 131,351            134,160
> >> >                 131,135            134,102
> >> >                 130,885            134,097
> >> >                 130,854            134,058
> >> >                 130,767            134,006
> >> >                 130,666            133,960
> >> >                 130,647            133,894
> >> >                 130,152            133,837
> >> >                 130,006            133,747
> >> >                 129,921            133,679
> >> >                 129,856            133,666
> >> >                 129,377            133,564
> >> >                 128,366            133,331
> >> >                 127,988            132,938
> >> >                 126,903            132,746
> >> >  -----------------------------------------------
> >> >       sum    10,526,916         10,919,561
> >> >   average       150,385            155,994
> >> >    stddev        17,551             19,633
> >> >  -----------------------------------------------
> >> >     elapsed       24.40              43.66
> >> >  time (sec)
> >> >    sys time      806.25             766.05
> >> >       (sec)
> >> >     zswpout  10,008,713         10,008,407
> >> >   64K folio     623,463            623,629
> >> >      swpout
> >> >  -----------------------------------------------
> >>
> >> Although there is some imbalance, I don't find it to be too much.  So,
> >> I think the test result is reasonable.  Please pay attention to the
> >> imbalance issue in future tests.
> >
> > Sure, will do so.
> >
> >>
> >> > As we increase the time for which allocations are maintained,
> >> > there seems to be a slight improvement in throughput, but the
> >> > variance increases as well. The processes with lower throughput
> >> > could be the ones that handle the memcg being over limit by
> >> > doing reclaim, possibly before they can allocate.
> >> >
> >> > Interestingly, the longer test time does seem to reduce the amount
> >> > of reclaim (hence lower sys time), but more 64K large folios seem to
> >> > be reclaimed. Could this mean that with longer test time (sleep 30),
> >> > more cold memory residing in large folios is getting reclaimed, as
> >> > against memory just relinquished by the exiting processes?
> >>
> >> I don't think a longer sleep time in the test helps much with balance.
> >> Can you try with fewer processes, and a larger memory size per process?
> >> I guess this will improve balance.
> >
> > I tried this, and the data is listed below:
> >
> >   usemem options:
> >   ---------------
> >   30 processes allocate 10G each
> >   cgroup memory limit = 150G
> >   sleep 10
> >   525Gi SSD disk swap partition
> >   64K large folios enabled
> >
> >   Throughput (KB/s) of each of the 30 processes:
> >  ---------------------------------------------------------------
> >                       mm-unstable    zswap_store of large folios
> >                         9-25-2024                v7
> >  zswap compressor:           zstd         zstd  deflate-iaa
> >  ---------------------------------------------------------------
> >                            38,393      234,485      374,427
> >                            37,283      215,528      314,225
> >                            37,156      214,942      304,413
> >                            37,143      213,073      304,146
> >                            36,814      212,904      290,186
> >                            36,277      212,304      288,212
> >                            36,104      212,207      285,682
> >                            36,000      210,173      270,661
> >                            35,994      208,487      256,960
> >                            35,979      207,788      248,313
> >                            35,967      207,714      235,338
> >                            35,966      207,703      229,335
> >                            35,835      207,690      221,697
> >                            35,793      207,418      221,600
> >                            35,692      206,160      219,346
> >                            35,682      206,128      219,162
> >                            35,681      205,817      219,155
> >                            35,678      205,546      214,862
> >                            35,678      205,523      214,710
> >                            35,677      204,951      214,282
> >                            35,677      204,283      213,441
> >                            35,677      203,348      213,011
> >                            35,675      203,028      212,923
> >                            35,673      201,922      212,492
> >                            35,672      201,660      212,225
> >                            35,672      200,724      211,808
> >                            35,672      200,324      211,420
> >                            35,671      199,686      211,413
> >                            35,667      198,858      211,346
> >                            35,667      197,590      211,209
> >  ---------------------------------------------------------------
> >  sum                     1,081,515    6,217,964    7,268,000
> >  average                    36,051      207,265      242,267
> >  stddev                        655        7,010       42,234
> >  elapsed time (sec)         343.70       107.40        84.34
> >  sys time (sec)             269.30     2,520.13     1,696.20
> >  memcg.high breaches       443,672      475,074      623,333
> >  zswpout                    22,605   48,931,249   54,777,100
> >  pswpout                40,004,528            0            0
> >  hugepages-64K zswpout           0    3,057,090    3,421,855
> >  hugepages-64K swpout    2,500,283            0            0
> >  ---------------------------------------------------------------
> >
> > As you can see, this is quite a memory-constrained scenario, where we
> > are giving 50% of the total memory required as the memory limit for the
> > cgroup in which the 30 processes are run. This causes significantly more
> > reclaim activity than the setup I was using thus far (70 processes, 1G,
> > 40G limit).
> >
> > The variance or "imbalance" reduces somewhat for zstd, but not for IAA.
> >
> > IAA shows really good throughput (17%), elapsed time (21%) and sys time
> > (33%) improvements wrt zstd with zswap_store of large folios. These are
> > the memory-constrained scenarios in which IAA typically does really
> > well. IAA verify_compress is enabled, so this is an added data-integrity
> > benefit we get with IAA.
> >
> > I would like to get your and the maintainers' feedback on whether
> > I should switch to this "usemem30-10G" setup for v8?
> 
> The results look good to me.  I suggest you use it.

Ok, sure, thanks Ying.

Thanks,
Kanchana

> 
> --
> Best Regards,
> Huang, Ying



end of thread, other threads:[~2024-09-26 21:44 UTC | newest]

Thread overview: 79+ messages
2024-09-24  1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
2024-09-24  1:17 ` [PATCH v7 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
2024-09-24 16:45   ` Nhat Pham
2024-09-24  1:17 ` [PATCH v7 2/8] mm: zswap: Modify zswap_compress() to accept a page instead of a folio Kanchana P Sridhar
2024-09-24 16:50   ` Nhat Pham
2024-09-24  1:17 ` [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in zswap xarray Kanchana P Sridhar
2024-09-24 17:16   ` Nhat Pham
2024-09-24 20:40     ` Sridhar, Kanchana P
2024-09-24 19:14   ` Yosry Ahmed
2024-09-24 22:22     ` Sridhar, Kanchana P
2024-09-24  1:17 ` [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors Kanchana P Sridhar
2024-09-24 17:25   ` Nhat Pham
2024-09-24 20:41     ` Sridhar, Kanchana P
2024-09-24 19:20   ` Yosry Ahmed
2024-09-24 22:32     ` Sridhar, Kanchana P
2024-09-25  0:43       ` Yosry Ahmed
2024-09-25  1:18         ` Sridhar, Kanchana P
2024-09-25 14:11         ` Johannes Weiner
2024-09-25 18:45           ` Sridhar, Kanchana P
2024-09-24  1:17 ` [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio Kanchana P Sridhar
2024-09-24 19:28   ` Yosry Ahmed
2024-09-24 22:45     ` Sridhar, Kanchana P
2024-09-25  0:47       ` Yosry Ahmed
2024-09-25  1:49         ` Sridhar, Kanchana P
2024-09-25 13:53           ` Johannes Weiner
2024-09-25 18:45             ` Sridhar, Kanchana P
2024-09-24  1:17 ` [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store() Kanchana P Sridhar
2024-09-24 17:33   ` Nhat Pham
2024-09-24 20:51     ` Sridhar, Kanchana P
2024-09-24 21:08       ` Nhat Pham
2024-09-24 21:34         ` Yosry Ahmed
2024-09-24 22:16           ` Nhat Pham
2024-09-24 22:18             ` Sridhar, Kanchana P
2024-09-24 22:28             ` Yosry Ahmed
2024-09-24 22:17           ` Sridhar, Kanchana P
2024-09-24 19:38   ` Yosry Ahmed
2024-09-24 20:51     ` Nhat Pham
2024-09-24 21:38       ` Yosry Ahmed
2024-09-24 23:11         ` Nhat Pham
2024-09-25  0:05           ` Sridhar, Kanchana P
2024-09-25  0:52           ` Yosry Ahmed
2024-09-24 23:21       ` Sridhar, Kanchana P
2024-09-24 23:02     ` Sridhar, Kanchana P
2024-09-25 13:40     ` Johannes Weiner
2024-09-25 18:30       ` Yosry Ahmed
2024-09-25 19:10         ` Sridhar, Kanchana P
2024-09-25 19:49           ` Yosry Ahmed
2024-09-25 20:49             ` Johannes Weiner
2024-09-25 19:20         ` Johannes Weiner
2024-09-25 19:39           ` Yosry Ahmed
2024-09-25 20:13             ` Johannes Weiner
2024-09-25 21:06               ` Yosry Ahmed
2024-09-25 22:29                 ` Sridhar, Kanchana P
2024-09-26  3:58                   ` Sridhar, Kanchana P
2024-09-26  4:52                     ` Yosry Ahmed
2024-09-26 16:40                       ` Sridhar, Kanchana P
2024-09-26 17:19                         ` Yosry Ahmed
2024-09-26 17:29                           ` Sridhar, Kanchana P
2024-09-26 17:34                             ` Yosry Ahmed
2024-09-26 19:36                               ` Sridhar, Kanchana P
2024-09-26 18:43                             ` Johannes Weiner
2024-09-26 18:45                               ` Yosry Ahmed
2024-09-26 19:40                                 ` Sridhar, Kanchana P
2024-09-26 19:39                               ` Sridhar, Kanchana P
2024-09-25 14:27   ` Johannes Weiner
2024-09-25 18:17     ` Yosry Ahmed
2024-09-25 18:48     ` Sridhar, Kanchana P
2024-09-24  1:17 ` [PATCH v7 7/8] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout stats Kanchana P Sridhar
2024-09-24  1:17 ` [PATCH v7 8/8] mm: Document the newly added mTHP zswpout stats, clarify swpout semantics Kanchana P Sridhar
2024-09-24 17:36   ` Nhat Pham
2024-09-24 20:52     ` Sridhar, Kanchana P
2024-09-24 19:34 ` [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Yosry Ahmed
2024-09-24 22:50   ` Sridhar, Kanchana P
2024-09-25  6:35 ` Huang, Ying
2024-09-25 18:39   ` Sridhar, Kanchana P
2024-09-26  0:44     ` Huang, Ying
2024-09-26  3:48       ` Sridhar, Kanchana P
2024-09-26  6:47         ` Huang, Ying
2024-09-26 21:44           ` Sridhar, Kanchana P
