* [PATCH v8 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
2024-09-28 2:16 [PATCH v8 0/8] mm: zswap swap-out of large folios Kanchana P Sridhar
@ 2024-09-28 2:16 ` Kanchana P Sridhar
2024-09-28 2:30 ` Yosry Ahmed
` (2 more replies)
2024-09-28 2:16 ` [PATCH v8 2/8] mm: zswap: Modify zswap_compress() to accept a page instead of a folio Kanchana P Sridhar
` (7 subsequent siblings)
8 siblings, 3 replies; 43+ messages in thread
From: Kanchana P Sridhar @ 2024-09-28 2:16 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This resolves an issue with obj_cgroup_get() not being defined if
CONFIG_MEMCG is not defined.
Before this patch, we would see build errors if obj_cgroup_get() is
called from code that is agnostic of CONFIG_MEMCG.
The zswap_store() changes for large folios in subsequent commits will
require the use of obj_cgroup_get() in zswap code that falls into this
category.
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/memcontrol.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 34d2da05f2f1..15c2716f9aa3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1282,6 +1282,10 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
return NULL;
}
+static inline void obj_cgroup_get(struct obj_cgroup *objcg)
+{
+}
+
static inline void obj_cgroup_put(struct obj_cgroup *objcg)
{
}
--
2.27.0
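As an illustration of the kind of CONFIG_MEMCG-agnostic caller the commit message refers to, code along the following lines now builds both with and without CONFIG_MEMCG (a hypothetical sketch, not code from this series; the helper name is made up):
	/* hypothetical caller, compiled regardless of CONFIG_MEMCG */
	static void example_take_extra_objcg_ref(struct obj_cgroup *objcg)
	{
		if (objcg)
			obj_cgroup_get(objcg);	/* empty stub when !CONFIG_MEMCG */
	}
The matching obj_cgroup_put() already has such a stub, as the diff context shows; this patch adds the missing counterpart for obj_cgroup_get().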
* Re: [PATCH v8 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
2024-09-28 2:16 ` [PATCH v8 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
@ 2024-09-28 2:30 ` Yosry Ahmed
2024-09-28 5:39 ` Chengming Zhou
2024-09-28 13:46 ` Johannes Weiner
2 siblings, 0 replies; 43+ messages in thread
From: Yosry Ahmed @ 2024-09-28 2:30 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal
On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> This resolves an issue with obj_cgroup_get() not being defined if
> CONFIG_MEMCG is not defined.
>
> Before this patch, we would see build errors if obj_cgroup_get() is
> called from code that is agnostic of CONFIG_MEMCG.
>
> The zswap_store() changes for large folios in subsequent commits will
> require the use of obj_cgroup_get() in zswap code that falls into this
> category.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
> ---
> include/linux/memcontrol.h | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 34d2da05f2f1..15c2716f9aa3 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1282,6 +1282,10 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
> return NULL;
> }
>
> +static inline void obj_cgroup_get(struct obj_cgroup *objcg)
> +{
> +}
> +
> static inline void obj_cgroup_put(struct obj_cgroup *objcg)
> {
> }
> --
> 2.27.0
>
* Re: [PATCH v8 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
2024-09-28 2:16 ` [PATCH v8 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
2024-09-28 2:30 ` Yosry Ahmed
@ 2024-09-28 5:39 ` Chengming Zhou
2024-09-28 13:46 ` Johannes Weiner
2 siblings, 0 replies; 43+ messages in thread
From: Chengming Zhou @ 2024-09-28 5:39 UTC (permalink / raw)
To: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, yosryahmed,
nphamcs, usamaarif642, shakeel.butt, ryan.roberts, ying.huang,
21cnbao, akpm
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal
On 2024/9/28 10:16, Kanchana P Sridhar wrote:
> This resolves an issue with obj_cgroup_get() not being defined if
> CONFIG_MEMCG is not defined.
>
> Before this patch, we would see build errors if obj_cgroup_get() is
> called from code that is agnostic of CONFIG_MEMCG.
>
> The zswap_store() changes for large folios in subsequent commits will
> require the use of obj_cgroup_get() in zswap code that falls into this
> category.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
> ---
> include/linux/memcontrol.h | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 34d2da05f2f1..15c2716f9aa3 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1282,6 +1282,10 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
> return NULL;
> }
>
> +static inline void obj_cgroup_get(struct obj_cgroup *objcg)
> +{
> +}
> +
> static inline void obj_cgroup_put(struct obj_cgroup *objcg)
> {
> }
* Re: [PATCH v8 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
2024-09-28 2:16 ` [PATCH v8 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
2024-09-28 2:30 ` Yosry Ahmed
2024-09-28 5:39 ` Chengming Zhou
@ 2024-09-28 13:46 ` Johannes Weiner
2 siblings, 0 replies; 43+ messages in thread
From: Johannes Weiner @ 2024-09-28 13:46 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal
On Fri, Sep 27, 2024 at 07:16:13PM -0700, Kanchana P Sridhar wrote:
> This resolves an issue with obj_cgroup_get() not being defined if
> CONFIG_MEMCG is not defined.
>
> Before this patch, we would see build errors if obj_cgroup_get() is
> called from code that is agnostic of CONFIG_MEMCG.
>
> The zswap_store() changes for large folios in subsequent commits will
> require the use of obj_cgroup_get() in zswap code that falls into this
> category.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* [PATCH v8 2/8] mm: zswap: Modify zswap_compress() to accept a page instead of a folio.
2024-09-28 2:16 [PATCH v8 0/8] mm: zswap swap-out of large folios Kanchana P Sridhar
2024-09-28 2:16 ` [PATCH v8 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
@ 2024-09-28 2:16 ` Kanchana P Sridhar
2024-09-28 5:41 ` Chengming Zhou
2024-09-28 13:46 ` Johannes Weiner
2024-09-28 2:16 ` [PATCH v8 3/8] mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget() Kanchana P Sridhar
` (6 subsequent siblings)
8 siblings, 2 replies; 43+ messages in thread
From: Kanchana P Sridhar @ 2024-09-28 2:16 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
For zswap_store() to be able to store a large folio by compressing it
one page at a time, zswap_compress() needs to accept a page as input.
This will allow us to iterate through each page in the folio in
zswap_store(), compress it and store it in the zpool.
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
---
mm/zswap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index efad4e941e44..fd7a8c14457a 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -875,7 +875,7 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
return 0;
}
-static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
+static bool zswap_compress(struct page *page, struct zswap_entry *entry)
{
struct crypto_acomp_ctx *acomp_ctx;
struct scatterlist input, output;
@@ -893,7 +893,7 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
dst = acomp_ctx->buffer;
sg_init_table(&input, 1);
- sg_set_folio(&input, folio, PAGE_SIZE, 0);
+ sg_set_page(&input, page, PAGE_SIZE, 0);
/*
* We need PAGE_SIZE * 2 here since there maybe over-compression case,
@@ -1456,7 +1456,7 @@ bool zswap_store(struct folio *folio)
mem_cgroup_put(memcg);
}
- if (!zswap_compress(folio, entry))
+ if (!zswap_compress(&folio->page, entry))
goto put_pool;
entry->swpentry = swp;
--
2.27.0
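For context, the per-page iteration that this change enables would look roughly like the following inside zswap_store() (a sketch based on the commit message; entry allocation, zpool storage and error unwinding are omitted, and the later patch in the series may structure this differently):
	long index, nr_pages = folio_nr_pages(folio);

	for (index = 0; index < nr_pages; ++index) {
		struct page *page = folio_page(folio, index);

		if (!zswap_compress(page, entry))
			goto put_pool;
	}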
* Re: [PATCH v8 2/8] mm: zswap: Modify zswap_compress() to accept a page instead of a folio.
2024-09-28 2:16 ` [PATCH v8 2/8] mm: zswap: Modify zswap_compress() to accept a page instead of a folio Kanchana P Sridhar
@ 2024-09-28 5:41 ` Chengming Zhou
2024-09-28 13:46 ` Johannes Weiner
1 sibling, 0 replies; 43+ messages in thread
From: Chengming Zhou @ 2024-09-28 5:41 UTC (permalink / raw)
To: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, yosryahmed,
nphamcs, usamaarif642, shakeel.butt, ryan.roberts, ying.huang,
21cnbao, akpm
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal
On 2024/9/28 10:16, Kanchana P Sridhar wrote:
> For zswap_store() to be able to store a large folio by compressing it
> one page at a time, zswap_compress() needs to accept a page as input.
> This will allow us to iterate through each page in the folio in
> zswap_store(), compress it and store it in the zpool.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
> ---
> mm/zswap.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index efad4e941e44..fd7a8c14457a 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -875,7 +875,7 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
> return 0;
> }
>
> -static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
> +static bool zswap_compress(struct page *page, struct zswap_entry *entry)
> {
> struct crypto_acomp_ctx *acomp_ctx;
> struct scatterlist input, output;
> @@ -893,7 +893,7 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
>
> dst = acomp_ctx->buffer;
> sg_init_table(&input, 1);
> - sg_set_folio(&input, folio, PAGE_SIZE, 0);
> + sg_set_page(&input, page, PAGE_SIZE, 0);
>
> /*
> * We need PAGE_SIZE * 2 here since there maybe over-compression case,
> @@ -1456,7 +1456,7 @@ bool zswap_store(struct folio *folio)
> mem_cgroup_put(memcg);
> }
>
> - if (!zswap_compress(folio, entry))
> + if (!zswap_compress(&folio->page, entry))
> goto put_pool;
>
> entry->swpentry = swp;
* Re: [PATCH v8 2/8] mm: zswap: Modify zswap_compress() to accept a page instead of a folio.
2024-09-28 2:16 ` [PATCH v8 2/8] mm: zswap: Modify zswap_compress() to accept a page instead of a folio Kanchana P Sridhar
2024-09-28 5:41 ` Chengming Zhou
@ 2024-09-28 13:46 ` Johannes Weiner
1 sibling, 0 replies; 43+ messages in thread
From: Johannes Weiner @ 2024-09-28 13:46 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal
On Fri, Sep 27, 2024 at 07:16:14PM -0700, Kanchana P Sridhar wrote:
> For zswap_store() to be able to store a large folio by compressing it
> one page at a time, zswap_compress() needs to accept a page as input.
> This will allow us to iterate through each page in the folio in
> zswap_store(), compress it and store it in the zpool.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* [PATCH v8 3/8] mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget().
2024-09-28 2:16 [PATCH v8 0/8] mm: zswap swap-out of large folios Kanchana P Sridhar
2024-09-28 2:16 ` [PATCH v8 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
2024-09-28 2:16 ` [PATCH v8 2/8] mm: zswap: Modify zswap_compress() to accept a page instead of a folio Kanchana P Sridhar
@ 2024-09-28 2:16 ` Kanchana P Sridhar
2024-09-28 2:29 ` Yosry Ahmed
` (3 more replies)
2024-09-28 2:16 ` [PATCH v8 4/8] mm: Provide a new count_objcg_events() API for batch event updates Kanchana P Sridhar
` (5 subsequent siblings)
8 siblings, 4 replies; 43+ messages in thread
From: Kanchana P Sridhar @ 2024-09-28 2:16 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
Modify the name of the existing zswap_pool_get() to zswap_pool_tryget()
to be representative of the call it makes to percpu_ref_tryget().
A subsequent patch will introduce a new zswap_pool_get() that calls
percpu_ref_get().
The intent behind this change is for higher level zswap API such as
zswap_store() to call zswap_pool_tryget() to check upfront if the pool's
refcount is "0" (which means it could be getting destroyed) and to handle
this as an error condition. zswap_store() would proceed only if
zswap_pool_tryget() returns success, and any additional pool refcounts that
need to be obtained for compressing sub-pages in a large folio could simply
call zswap_pool_get().
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
mm/zswap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index fd7a8c14457a..0f281e50a034 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -403,7 +403,7 @@ static void __zswap_pool_empty(struct percpu_ref *ref)
spin_unlock_bh(&zswap_pools_lock);
}
-static int __must_check zswap_pool_get(struct zswap_pool *pool)
+static int __must_check zswap_pool_tryget(struct zswap_pool *pool)
{
if (!pool)
return 0;
@@ -441,7 +441,7 @@ static struct zswap_pool *zswap_pool_current_get(void)
rcu_read_lock();
pool = __zswap_pool_current();
- if (!zswap_pool_get(pool))
+ if (!zswap_pool_tryget(pool))
pool = NULL;
rcu_read_unlock();
@@ -462,7 +462,7 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor)
if (strcmp(zpool_get_type(pool->zpool), type))
continue;
/* if we can't get it, it's about to be destroyed */
- if (!zswap_pool_get(pool))
+ if (!zswap_pool_tryget(pool))
continue;
return pool;
}
--
2.27.0
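The new zswap_pool_get() the commit message describes would presumably be along these lines (a sketch only; the actual function arrives in a later patch, and the 'ref' field name is assumed from the existing percpu_ref_tryget() user being renamed here):
	/* take another reference on a pool already pinned via zswap_pool_tryget() */
	static void zswap_pool_get(struct zswap_pool *pool)
	{
		percpu_ref_get(&pool->ref);
	}
zswap_store() would then do one zswap_pool_tryget() up front and cheap zswap_pool_get() calls for any additional per-page references.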
* Re: [PATCH v8 3/8] mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget().
2024-09-28 2:16 ` [PATCH v8 3/8] mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget() Kanchana P Sridhar
@ 2024-09-28 2:29 ` Yosry Ahmed
2024-09-28 5:43 ` Chengming Zhou
` (2 subsequent siblings)
3 siblings, 0 replies; 43+ messages in thread
From: Yosry Ahmed @ 2024-09-28 2:29 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal
On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Modify the name of the existing zswap_pool_get() to zswap_pool_tryget()
> to be representative of the call it makes to percpu_ref_tryget().
> A subsequent patch will introduce a new zswap_pool_get() that calls
> percpu_ref_get().
>
> The intent behind this change is for higher level zswap API such as
> zswap_store() to call zswap_pool_tryget() to check upfront if the pool's
> refcount is "0" (which means it could be getting destroyed) and to handle
> this as an error condition. zswap_store() would proceed only if
> zswap_pool_tryget() returns success, and any additional pool refcounts that
> need to be obtained for compressing sub-pages in a large folio could simply
> call zswap_pool_get().
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
> ---
> mm/zswap.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index fd7a8c14457a..0f281e50a034 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -403,7 +403,7 @@ static void __zswap_pool_empty(struct percpu_ref *ref)
> spin_unlock_bh(&zswap_pools_lock);
> }
>
> -static int __must_check zswap_pool_get(struct zswap_pool *pool)
> +static int __must_check zswap_pool_tryget(struct zswap_pool *pool)
> {
> if (!pool)
> return 0;
> @@ -441,7 +441,7 @@ static struct zswap_pool *zswap_pool_current_get(void)
> rcu_read_lock();
>
> pool = __zswap_pool_current();
> - if (!zswap_pool_get(pool))
> + if (!zswap_pool_tryget(pool))
> pool = NULL;
>
> rcu_read_unlock();
> @@ -462,7 +462,7 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor)
> if (strcmp(zpool_get_type(pool->zpool), type))
> continue;
> /* if we can't get it, it's about to be destroyed */
> - if (!zswap_pool_get(pool))
> + if (!zswap_pool_tryget(pool))
> continue;
> return pool;
> }
> --
> 2.27.0
>
* Re: [PATCH v8 3/8] mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget().
2024-09-28 2:16 ` [PATCH v8 3/8] mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget() Kanchana P Sridhar
2024-09-28 2:29 ` Yosry Ahmed
@ 2024-09-28 5:43 ` Chengming Zhou
2024-09-29 21:01 ` Sridhar, Kanchana P
2024-09-28 13:47 ` Johannes Weiner
2024-09-28 23:26 ` Nhat Pham
3 siblings, 1 reply; 43+ messages in thread
From: Chengming Zhou @ 2024-09-28 5:43 UTC (permalink / raw)
To: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, yosryahmed,
nphamcs, usamaarif642, shakeel.butt, ryan.roberts, ying.huang,
21cnbao, akpm
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal
On 2024/9/28 10:16, Kanchana P Sridhar wrote:
> Modify the name of the existing zswap_pool_get() to zswap_pool_tryget()
> to be representative of the call it makes to percpu_ref_tryget().
> A subsequent patch will introduce a new zswap_pool_get() that calls
> percpu_ref_get().
>
> The intent behind this change is for higher level zswap API such as
> zswap_store() to call zswap_pool_tryget() to check upfront if the pool's
> refcount is "0" (which means it could be getting destroyed) and to handle
> this as an error condition. zswap_store() would proceed only if
> zswap_pool_tryget() returns success, and any additional pool refcounts that
> need to be obtained for compressing sub-pages in a large folio could simply
> call zswap_pool_get().
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
> ---
> mm/zswap.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index fd7a8c14457a..0f281e50a034 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -403,7 +403,7 @@ static void __zswap_pool_empty(struct percpu_ref *ref)
> spin_unlock_bh(&zswap_pools_lock);
> }
>
> -static int __must_check zswap_pool_get(struct zswap_pool *pool)
> +static int __must_check zswap_pool_tryget(struct zswap_pool *pool)
> {
> if (!pool)
> return 0;
> @@ -441,7 +441,7 @@ static struct zswap_pool *zswap_pool_current_get(void)
> rcu_read_lock();
>
> pool = __zswap_pool_current();
> - if (!zswap_pool_get(pool))
> + if (!zswap_pool_tryget(pool))
> pool = NULL;
>
> rcu_read_unlock();
> @@ -462,7 +462,7 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor)
> if (strcmp(zpool_get_type(pool->zpool), type))
> continue;
> /* if we can't get it, it's about to be destroyed */
> - if (!zswap_pool_get(pool))
> + if (!zswap_pool_tryget(pool))
> continue;
> return pool;
> }
* RE: [PATCH v8 3/8] mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget().
2024-09-28 5:43 ` Chengming Zhou
@ 2024-09-29 21:01 ` Sridhar, Kanchana P
0 siblings, 0 replies; 43+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-29 21:01 UTC (permalink / raw)
To: Chengming Zhou, linux-kernel, linux-mm, hannes, yosryahmed,
nphamcs, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying,
21cnbao, akpm, Sridhar, Kanchana P
Cc: Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh
> -----Original Message-----
> From: Chengming Zhou <chengming.zhou@linux.dev>
> Sent: Friday, September 27, 2024 10:44 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> yosryahmed@google.com; nphamcs@gmail.com; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org
> Cc: Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 3/8] mm: zswap: Rename zswap_pool_get() to
> zswap_pool_tryget().
>
> On 2024/9/28 10:16, Kanchana P Sridhar wrote:
> > Modify the name of the existing zswap_pool_get() to zswap_pool_tryget()
> > to be representative of the call it makes to percpu_ref_tryget().
> > A subsequent patch will introduce a new zswap_pool_get() that calls
> > percpu_ref_get().
> >
> > The intent behind this change is for higher level zswap API such as
> > zswap_store() to call zswap_pool_tryget() to check upfront if the pool's
> > refcount is "0" (which means it could be getting destroyed) and to handle
> > this as an error condition. zswap_store() would proceed only if
> > zswap_pool_tryget() returns success, and any additional pool refcounts that
> > need to be obtained for compressing sub-pages in a large folio could simply
> > call zswap_pool_get().
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
>
> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Thanks Chengming!
Thanks,
Kanchana
>
> > ---
> > mm/zswap.c | 6 +++---
> > 1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index fd7a8c14457a..0f281e50a034 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -403,7 +403,7 @@ static void __zswap_pool_empty(struct percpu_ref
> *ref)
> > spin_unlock_bh(&zswap_pools_lock);
> > }
> >
> > -static int __must_check zswap_pool_get(struct zswap_pool *pool)
> > +static int __must_check zswap_pool_tryget(struct zswap_pool *pool)
> > {
> > if (!pool)
> > return 0;
> > @@ -441,7 +441,7 @@ static struct zswap_pool
> *zswap_pool_current_get(void)
> > rcu_read_lock();
> >
> > pool = __zswap_pool_current();
> > - if (!zswap_pool_get(pool))
> > + if (!zswap_pool_tryget(pool))
> > pool = NULL;
> >
> > rcu_read_unlock();
> > @@ -462,7 +462,7 @@ static struct zswap_pool
> *zswap_pool_find_get(char *type, char *compressor)
> > if (strcmp(zpool_get_type(pool->zpool), type))
> > continue;
> > /* if we can't get it, it's about to be destroyed */
> > - if (!zswap_pool_get(pool))
> > + if (!zswap_pool_tryget(pool))
> > continue;
> > return pool;
> > }
* Re: [PATCH v8 3/8] mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget().
2024-09-28 2:16 ` [PATCH v8 3/8] mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget() Kanchana P Sridhar
2024-09-28 2:29 ` Yosry Ahmed
2024-09-28 5:43 ` Chengming Zhou
@ 2024-09-28 13:47 ` Johannes Weiner
2024-09-28 23:26 ` Nhat Pham
3 siblings, 0 replies; 43+ messages in thread
From: Johannes Weiner @ 2024-09-28 13:47 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal
On Fri, Sep 27, 2024 at 07:16:15PM -0700, Kanchana P Sridhar wrote:
> Modify the name of the existing zswap_pool_get() to zswap_pool_tryget()
> to be representative of the call it makes to percpu_ref_tryget().
> A subsequent patch will introduce a new zswap_pool_get() that calls
> percpu_ref_get().
>
> The intent behind this change is for higher level zswap API such as
> zswap_store() to call zswap_pool_tryget() to check upfront if the pool's
> refcount is "0" (which means it could be getting destroyed) and to handle
> this as an error condition. zswap_store() would proceed only if
> zswap_pool_tryget() returns success, and any additional pool refcounts that
> need to be obtained for compressing sub-pages in a large folio could simply
> call zswap_pool_get().
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
* Re: [PATCH v8 3/8] mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget().
2024-09-28 2:16 ` [PATCH v8 3/8] mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget() Kanchana P Sridhar
` (2 preceding siblings ...)
2024-09-28 13:47 ` Johannes Weiner
@ 2024-09-28 23:26 ` Nhat Pham
2024-09-29 21:04 ` Sridhar, Kanchana P
3 siblings, 1 reply; 43+ messages in thread
From: Nhat Pham @ 2024-09-28 23:26 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal
On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Modify the name of the existing zswap_pool_get() to zswap_pool_tryget()
> to be representative of the call it makes to percpu_ref_tryget().
> A subsequent patch will introduce a new zswap_pool_get() that calls
> percpu_ref_get().
>
> The intent behind this change is for higher level zswap API such as
> zswap_store() to call zswap_pool_tryget() to check upfront if the pool's
> refcount is "0" (which means it could be getting destroyed) and to handle
> this as an error condition. zswap_store() would proceed only if
> zswap_pool_tryget() returns success, and any additional pool refcounts that
> need to be obtained for compressing sub-pages in a large folio could simply
> call zswap_pool_get().
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
* RE: [PATCH v8 3/8] mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget().
2024-09-28 23:26 ` Nhat Pham
@ 2024-09-29 21:04 ` Sridhar, Kanchana P
0 siblings, 0 replies; 43+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-29 21:04 UTC (permalink / raw)
To: Nhat Pham
Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
Kanchana P
> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Saturday, September 28, 2024 4:27 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 3/8] mm: zswap: Rename zswap_pool_get() to
> zswap_pool_tryget().
>
> On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Modify the name of the existing zswap_pool_get() to zswap_pool_tryget()
> > to be representative of the call it makes to percpu_ref_tryget().
> > A subsequent patch will introduce a new zswap_pool_get() that calls
> > percpu_ref_get().
> >
> > The intent behind this change is for higher level zswap API such as
> > zswap_store() to call zswap_pool_tryget() to check upfront if the pool's
> > refcount is "0" (which means it could be getting destroyed) and to handle
> > this as an error condition. zswap_store() would proceed only if
> > zswap_pool_tryget() returns success, and any additional pool refcounts that
> > need to be obtained for compressing sub-pages in a large folio could simply
> > call zswap_pool_get().
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
>
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Thanks Nhat!
Thanks,
Kanchana
* [PATCH v8 4/8] mm: Provide a new count_objcg_events() API for batch event updates.
2024-09-28 2:16 [PATCH v8 0/8] mm: zswap swap-out of large folios Kanchana P Sridhar
` (2 preceding siblings ...)
2024-09-28 2:16 ` [PATCH v8 3/8] mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget() Kanchana P Sridhar
@ 2024-09-28 2:16 ` Kanchana P Sridhar
2024-09-28 3:02 ` Yosry Ahmed
2024-09-28 2:16 ` [PATCH v8 5/8] mm: zswap: Modify zswap_stored_pages to be atomic_long_t Kanchana P Sridhar
` (4 subsequent siblings)
8 siblings, 1 reply; 43+ messages in thread
From: Kanchana P Sridhar @ 2024-09-28 2:16 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
With the introduction of zswap_store() swapping out large folios,
we need to efficiently update the objcg's memcg events once per
successfully stored folio. For instance, the 'ZSWPOUT' event needs
to be incremented by folio_nr_pages().
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
include/linux/memcontrol.h | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 15c2716f9aa3..f47fd00c5eea 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1778,6 +1778,21 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
rcu_read_unlock();
}
+static inline void count_objcg_events(struct obj_cgroup *objcg,
+ enum vm_event_item idx,
+ unsigned long count)
+{
+ struct mem_cgroup *memcg;
+
+ if (!memcg_kmem_online())
+ return;
+
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(objcg);
+ count_memcg_events(memcg, idx, count);
+ rcu_read_unlock();
+}
+
#else
static inline bool mem_cgroup_kmem_disabled(void)
{
@@ -1834,6 +1849,11 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
{
}
+static inline void count_objcg_events(struct obj_cgroup *objcg,
+ enum vm_event_item idx,
+ unsigned long count)
+{
+}
#endif /* CONFIG_MEMCG */
#if defined(CONFIG_MEMCG) && defined(CONFIG_ZSWAP)
--
2.27.0
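For context, the intended call site would presumably be something like this once a whole large folio has been stored (hypothetical sketch; the actual zswap_store() change is in a later patch of the series):
	if (objcg)
		count_objcg_events(objcg, ZSWPOUT, folio_nr_pages(folio));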
* Re: [PATCH v8 4/8] mm: Provide a new count_objcg_events() API for batch event updates.
2024-09-28 2:16 ` [PATCH v8 4/8] mm: Provide a new count_objcg_events() API for batch event updates Kanchana P Sridhar
@ 2024-09-28 3:02 ` Yosry Ahmed
2024-09-28 5:46 ` Chengming Zhou
2024-09-29 21:00 ` Sridhar, Kanchana P
0 siblings, 2 replies; 43+ messages in thread
From: Yosry Ahmed @ 2024-09-28 3:02 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal
On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> With the introduction of zswap_store() swapping out large folios,
> we need to efficiently update the objcg's memcg events once per
> successfully stored folio. For instance, the 'ZSWPOUT' event needs
> to be incremented by folio_nr_pages().
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
> include/linux/memcontrol.h | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 15c2716f9aa3..f47fd00c5eea 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1778,6 +1778,21 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
> rcu_read_unlock();
> }
>
> +static inline void count_objcg_events(struct obj_cgroup *objcg,
> + enum vm_event_item idx,
> + unsigned long count)
> +{
> + struct mem_cgroup *memcg;
> +
> + if (!memcg_kmem_online())
> + return;
> +
> + rcu_read_lock();
> + memcg = obj_cgroup_memcg(objcg);
> + count_memcg_events(memcg, idx, count);
> + rcu_read_unlock();
> +}
Instead of replicating the code in count_objcg_event(), we should
change count_objcg_event() to become count_objcg_events() (i.e. add a
count parameter). The existing callers can pass in 1, there's only 3
of them anyway (2 after patch 6), and they are all in zswap.
> +
> #else
> static inline bool mem_cgroup_kmem_disabled(void)
> {
> @@ -1834,6 +1849,11 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
> {
> }
>
> +static inline void count_objcg_events(struct obj_cgroup *objcg,
> + enum vm_event_item idx,
> + unsigned long count)
> +{
> +}
> #endif /* CONFIG_MEMCG */
>
> #if defined(CONFIG_MEMCG) && defined(CONFIG_ZSWAP)
> --
> 2.27.0
>
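A minimal sketch of the consolidation suggested above, assuming the existing single-event callers in zswap simply pass a count of 1 (illustrative only; the final v9 code may differ):
	static inline void count_objcg_events(struct obj_cgroup *objcg,
					      enum vm_event_item idx,
					      unsigned long count)
	{
		struct mem_cgroup *memcg;

		if (!memcg_kmem_online())
			return;

		rcu_read_lock();
		memcg = obj_cgroup_memcg(objcg);
		count_memcg_events(memcg, idx, count);
		rcu_read_unlock();
	}

	/* an existing caller, e.g. in zswap_load(), then becomes: */
	count_objcg_events(entry->objcg, ZSWPIN, 1);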
* Re: [PATCH v8 4/8] mm: Provide a new count_objcg_events() API for batch event updates.
2024-09-28 3:02 ` Yosry Ahmed
@ 2024-09-28 5:46 ` Chengming Zhou
2024-09-29 21:00 ` Sridhar, Kanchana P
1 sibling, 0 replies; 43+ messages in thread
From: Chengming Zhou @ 2024-09-28 5:46 UTC (permalink / raw)
To: Yosry Ahmed, Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, nphamcs, usamaarif642,
shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm,
nanhai.zou, wajdi.k.feghali, vinodh.gopal
On 2024/9/28 11:02, Yosry Ahmed wrote:
> On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
>>
>> With the introduction of zswap_store() swapping out large folios,
>> we need to efficiently update the objcg's memcg events once per
>> successfully stored folio. For instance, the 'ZSWPOUT' event needs
>> to be incremented by folio_nr_pages().
>>
>> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
>> ---
>> include/linux/memcontrol.h | 20 ++++++++++++++++++++
>> 1 file changed, 20 insertions(+)
>>
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 15c2716f9aa3..f47fd00c5eea 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -1778,6 +1778,21 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
>> rcu_read_unlock();
>> }
>>
>> +static inline void count_objcg_events(struct obj_cgroup *objcg,
>> + enum vm_event_item idx,
>> + unsigned long count)
>> +{
>> + struct mem_cgroup *memcg;
>> +
>> + if (!memcg_kmem_online())
>> + return;
>> +
>> + rcu_read_lock();
>> + memcg = obj_cgroup_memcg(objcg);
>> + count_memcg_events(memcg, idx, count);
>> + rcu_read_unlock();
>> +}
>
> Instead of replicating the code in count_objcg_event(), we should
> change count_objcg_event() to become count_objcg_events() (i.e. add a
> count parameter). The existing callers can pass in 1, there's only 3
> of them anyway (2 after patch 6), and they are all in zswap.
Right, agree.
>
>> +
>> #else
>> static inline bool mem_cgroup_kmem_disabled(void)
>> {
>> @@ -1834,6 +1849,11 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
>> {
>> }
>>
>> +static inline void count_objcg_events(struct obj_cgroup *objcg,
>> + enum vm_event_item idx,
>> + unsigned long count)
>> +{
>> +}
>> #endif /* CONFIG_MEMCG */
>>
>> #if defined(CONFIG_MEMCG) && defined(CONFIG_ZSWAP)
>> --
>> 2.27.0
>>
* RE: [PATCH v8 4/8] mm: Provide a new count_objcg_events() API for batch event updates.
2024-09-28 3:02 ` Yosry Ahmed
2024-09-28 5:46 ` Chengming Zhou
@ 2024-09-29 21:00 ` Sridhar, Kanchana P
1 sibling, 0 replies; 43+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-29 21:00 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Friday, September 27, 2024 8:02 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 4/8] mm: Provide a new count_objcg_events() API for
> batch event updates.
>
> On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > With the introduction of zswap_store() swapping out large folios,
> > we need to efficiently update the objcg's memcg events once per
> > successfully stored folio. For instance, the 'ZSWPOUT' event needs
> > to be incremented by folio_nr_pages().
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> > include/linux/memcontrol.h | 20 ++++++++++++++++++++
> > 1 file changed, 20 insertions(+)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 15c2716f9aa3..f47fd00c5eea 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -1778,6 +1778,21 @@ static inline void count_objcg_event(struct
> obj_cgroup *objcg,
> > rcu_read_unlock();
> > }
> >
> > +static inline void count_objcg_events(struct obj_cgroup *objcg,
> > + enum vm_event_item idx,
> > + unsigned long count)
> > +{
> > + struct mem_cgroup *memcg;
> > +
> > + if (!memcg_kmem_online())
> > + return;
> > +
> > + rcu_read_lock();
> > + memcg = obj_cgroup_memcg(objcg);
> > + count_memcg_events(memcg, idx, count);
> > + rcu_read_unlock();
> > +}
>
> Instead of replicating the code in count_objcg_event(), we should
> change count_objcg_event() to become count_objcg_events() (i.e. add a
> count parameter). The existing callers can pass in 1, there's only 3
> of them anyway (2 after patch 6), and they are all in zswap.
Thanks Yosry. This makes sense. I will incorporate this in v9.
Thanks,
Kanchana
>
> > +
> > #else
> > static inline bool mem_cgroup_kmem_disabled(void)
> > {
> > @@ -1834,6 +1849,11 @@ static inline void count_objcg_event(struct
> obj_cgroup *objcg,
> > {
> > }
> >
> > +static inline void count_objcg_events(struct obj_cgroup *objcg,
> > + enum vm_event_item idx,
> > + unsigned long count)
> > +{
> > +}
> > #endif /* CONFIG_MEMCG */
> >
> > #if defined(CONFIG_MEMCG) && defined(CONFIG_ZSWAP)
> > --
> > 2.27.0
> >
* [PATCH v8 5/8] mm: zswap: Modify zswap_stored_pages to be atomic_long_t.
2024-09-28 2:16 [PATCH v8 0/8] mm: zswap swap-out of large folios Kanchana P Sridhar
` (3 preceding siblings ...)
2024-09-28 2:16 ` [PATCH v8 4/8] mm: Provide a new count_objcg_events() API for batch event updates Kanchana P Sridhar
@ 2024-09-28 2:16 ` Kanchana P Sridhar
2024-09-28 2:57 ` Yosry Ahmed
` (3 more replies)
2024-09-28 2:16 ` [PATCH v8 6/8] mm: zswap: Support large folios in zswap_store() Kanchana P Sridhar
` (3 subsequent siblings)
8 siblings, 4 replies; 43+ messages in thread
From: Kanchana P Sridhar @ 2024-09-28 2:16 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
For zswap_store() to support large folios, we need to be able to do
a batch update of zswap_stored_pages upon successful store of all pages
in the folio. For this, we need to add folio_nr_pages(), which returns
a long, to zswap_stored_pages.
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
fs/proc/meminfo.c | 2 +-
include/linux/zswap.h | 2 +-
mm/zswap.c | 19 +++++++++++++------
3 files changed, 15 insertions(+), 8 deletions(-)
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 245171d9164b..8ba9b1472390 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -91,7 +91,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
#ifdef CONFIG_ZSWAP
show_val_kb(m, "Zswap: ", zswap_total_pages());
seq_printf(m, "Zswapped: %8lu kB\n",
- (unsigned long)atomic_read(&zswap_stored_pages) <<
+ (unsigned long)atomic_long_read(&zswap_stored_pages) <<
(PAGE_SHIFT - 10));
#endif
show_val_kb(m, "Dirty: ",
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 9cd1beef0654..d961ead91bf1 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -7,7 +7,7 @@
struct lruvec;
-extern atomic_t zswap_stored_pages;
+extern atomic_long_t zswap_stored_pages;
#ifdef CONFIG_ZSWAP
diff --git a/mm/zswap.c b/mm/zswap.c
index 0f281e50a034..43e4e216db41 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -43,7 +43,7 @@
* statistics
**********************************/
/* The number of compressed pages currently stored in zswap */
-atomic_t zswap_stored_pages = ATOMIC_INIT(0);
+atomic_long_t zswap_stored_pages = ATOMIC_INIT(0);
/*
* The statistics below are not protected from concurrent access for
@@ -802,7 +802,7 @@ static void zswap_entry_free(struct zswap_entry *entry)
obj_cgroup_put(entry->objcg);
}
zswap_entry_cache_free(entry);
- atomic_dec(&zswap_stored_pages);
+ atomic_long_dec(&zswap_stored_pages);
}
/*********************************
@@ -1232,7 +1232,7 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
} else {
nr_backing = zswap_total_pages();
- nr_stored = atomic_read(&zswap_stored_pages);
+ nr_stored = atomic_long_read(&zswap_stored_pages);
}
if (!nr_stored)
@@ -1501,7 +1501,7 @@ bool zswap_store(struct folio *folio)
}
/* update stats */
- atomic_inc(&zswap_stored_pages);
+ atomic_long_inc(&zswap_stored_pages);
count_vm_event(ZSWPOUT);
return true;
@@ -1650,6 +1650,13 @@ static int debugfs_get_total_size(void *data, u64 *val)
}
DEFINE_DEBUGFS_ATTRIBUTE(total_size_fops, debugfs_get_total_size, NULL, "%llu\n");
+static int debugfs_get_stored_pages(void *data, u64 *val)
+{
+ *val = atomic_long_read(&zswap_stored_pages);
+ return 0;
+}
+DEFINE_DEBUGFS_ATTRIBUTE(stored_pages_fops, debugfs_get_stored_pages, NULL, "%llu\n");
+
static int zswap_debugfs_init(void)
{
if (!debugfs_initialized())
@@ -1673,8 +1680,8 @@ static int zswap_debugfs_init(void)
zswap_debugfs_root, &zswap_written_back_pages);
debugfs_create_file("pool_total_size", 0444,
zswap_debugfs_root, NULL, &total_size_fops);
- debugfs_create_atomic_t("stored_pages", 0444,
- zswap_debugfs_root, &zswap_stored_pages);
+ debugfs_create_file("stored_pages", 0444,
+ zswap_debugfs_root, NULL, &stored_pages_fops);
return 0;
}
--
2.27.0
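For illustration, the batched update the commit message refers to would presumably become something like this after all pages of the folio have been stored (hypothetical sketch; the call-site change lands in a later patch):
	long nr_pages = folio_nr_pages(folio);

	/* update stats once per successfully stored folio */
	atomic_long_add(nr_pages, &zswap_stored_pages);
	count_vm_events(ZSWPOUT, nr_pages);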
* Re: [PATCH v8 5/8] mm: zswap: Modify zswap_stored_pages to be atomic_long_t.
2024-09-28 2:16 ` [PATCH v8 5/8] mm: zswap: Modify zswap_stored_pages to be atomic_long_t Kanchana P Sridhar
@ 2024-09-28 2:57 ` Yosry Ahmed
2024-09-28 4:50 ` Matthew Wilcox
2024-09-28 8:13 ` Yosry Ahmed
` (2 subsequent siblings)
3 siblings, 1 reply; 43+ messages in thread
From: Yosry Ahmed @ 2024-09-28 2:57 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal
On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> For zswap_store() to support large folios, we need to be able to do
> a batch update of zswap_stored_pages upon successful store of all pages
> in the folio. For this, we need to add folio_nr_pages(), which returns
> a long, to zswap_stored_pages.
Do we really need this? A lot of places in the kernel assign the
result of folio_nr_pages() to an int (thp_nr_pages(),
split_huge_pages_all(), etc). I don't think we need to worry about
folio_nr_pages() exceeding INT_MAX for a while.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
> fs/proc/meminfo.c | 2 +-
> include/linux/zswap.h | 2 +-
> mm/zswap.c | 19 +++++++++++++------
> 3 files changed, 15 insertions(+), 8 deletions(-)
>
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 245171d9164b..8ba9b1472390 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -91,7 +91,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> #ifdef CONFIG_ZSWAP
> show_val_kb(m, "Zswap: ", zswap_total_pages());
> seq_printf(m, "Zswapped: %8lu kB\n",
> - (unsigned long)atomic_read(&zswap_stored_pages) <<
> + (unsigned long)atomic_long_read(&zswap_stored_pages) <<
> (PAGE_SHIFT - 10));
> #endif
> show_val_kb(m, "Dirty: ",
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 9cd1beef0654..d961ead91bf1 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -7,7 +7,7 @@
>
> struct lruvec;
>
> -extern atomic_t zswap_stored_pages;
> +extern atomic_long_t zswap_stored_pages;
>
> #ifdef CONFIG_ZSWAP
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 0f281e50a034..43e4e216db41 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -43,7 +43,7 @@
> * statistics
> **********************************/
> /* The number of compressed pages currently stored in zswap */
> -atomic_t zswap_stored_pages = ATOMIC_INIT(0);
> +atomic_long_t zswap_stored_pages = ATOMIC_INIT(0);
>
> /*
> * The statistics below are not protected from concurrent access for
> @@ -802,7 +802,7 @@ static void zswap_entry_free(struct zswap_entry *entry)
> obj_cgroup_put(entry->objcg);
> }
> zswap_entry_cache_free(entry);
> - atomic_dec(&zswap_stored_pages);
> + atomic_long_dec(&zswap_stored_pages);
> }
>
> /*********************************
> @@ -1232,7 +1232,7 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
> nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> } else {
> nr_backing = zswap_total_pages();
> - nr_stored = atomic_read(&zswap_stored_pages);
> + nr_stored = atomic_long_read(&zswap_stored_pages);
> }
>
> if (!nr_stored)
> @@ -1501,7 +1501,7 @@ bool zswap_store(struct folio *folio)
> }
>
> /* update stats */
> - atomic_inc(&zswap_stored_pages);
> + atomic_long_inc(&zswap_stored_pages);
> count_vm_event(ZSWPOUT);
>
> return true;
> @@ -1650,6 +1650,13 @@ static int debugfs_get_total_size(void *data, u64 *val)
> }
> DEFINE_DEBUGFS_ATTRIBUTE(total_size_fops, debugfs_get_total_size, NULL, "%llu\n");
>
> +static int debugfs_get_stored_pages(void *data, u64 *val)
> +{
> + *val = atomic_long_read(&zswap_stored_pages);
> + return 0;
> +}
> +DEFINE_DEBUGFS_ATTRIBUTE(stored_pages_fops, debugfs_get_stored_pages, NULL, "%llu\n");
> +
> static int zswap_debugfs_init(void)
> {
> if (!debugfs_initialized())
> @@ -1673,8 +1680,8 @@ static int zswap_debugfs_init(void)
> zswap_debugfs_root, &zswap_written_back_pages);
> debugfs_create_file("pool_total_size", 0444,
> zswap_debugfs_root, NULL, &total_size_fops);
> - debugfs_create_atomic_t("stored_pages", 0444,
> - zswap_debugfs_root, &zswap_stored_pages);
> + debugfs_create_file("stored_pages", 0444,
> + zswap_debugfs_root, NULL, &stored_pages_fops);
>
> return 0;
> }
> --
> 2.27.0
>
* Re: [PATCH v8 5/8] mm: zswap: Modify zswap_stored_pages to be atomic_long_t.
2024-09-28 2:57 ` Yosry Ahmed
@ 2024-09-28 4:50 ` Matthew Wilcox
2024-09-28 8:12 ` Yosry Ahmed
0 siblings, 1 reply; 43+ messages in thread
From: Matthew Wilcox @ 2024-09-28 4:50 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
vinodh.gopal
On Fri, Sep 27, 2024 at 07:57:49PM -0700, Yosry Ahmed wrote:
> On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > For zswap_store() to support large folios, we need to be able to do
> > a batch update of zswap_stored_pages upon successful store of all pages
> > in the folio. For this, we need to add folio_nr_pages(), which returns
> > a long, to zswap_stored_pages.
>
> Do we really need this? A lot of places in the kernel assign the
> result of folio_nr_pages() to an int (thp_nr_pages(),
> split_huge_pages_all(), etc). I don't think we need to worry about
> folio_nr_pages() exceeding INT_MAX for a while.
You'd be surprised. Let's assume we add support for PUD-sized pages
(personally I think this is too large to make sense, but some people can't
be told). On arm64, we can have a 64kB page size, so that's 13 bits per
level for a total of 2^26 pages per PUD. That feels uncomfortably close
to 2^32 to me.
Anywhere you've found that's using an int to store folio_nr_pages() is
somewhere we should probably switch to long. And this, btw, is why I've
moved from using an int to store folio_size() to using size_t. A PMD is
already 512MB (with a 64KB page size), and so a PUD will be 4TB.
thp_nr_pages() is not a good example. I'll be happy when we kill it;
we're actually almost there.
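For reference, the arithmetic behind the 2^26 figure, assuming 8-byte page table entries (an assumption of this note): a 64 kB page holds 64 KiB / 8 B = 2^13 entries, so each level covers 13 bits and a PUD maps 2^13 * 2^13 = 2^26 base pages, i.e. 2^26 * 64 kB = 4 TB, matching the sizes quoted above.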
* Re: [PATCH v8 5/8] mm: zswap: Modify zswap_stored_pages to be atomic_long_t.
2024-09-28 4:50 ` Matthew Wilcox
@ 2024-09-28 8:12 ` Yosry Ahmed
0 siblings, 0 replies; 43+ messages in thread
From: Yosry Ahmed @ 2024-09-28 8:12 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
vinodh.gopal
On Fri, Sep 27, 2024 at 9:51 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Sep 27, 2024 at 07:57:49PM -0700, Yosry Ahmed wrote:
> > On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > > For zswap_store() to support large folios, we need to be able to do
> > > a batch update of zswap_stored_pages upon successful store of all pages
> > > in the folio. For this, we need to add folio_nr_pages(), which returns
> > > a long, to zswap_stored_pages.
> >
> > Do we really need this? A lot of places in the kernel assign the
> > result of folio_nr_pages() to an int (thp_nr_pages(),
> > split_huge_pages_all(), etc). I don't think we need to worry about
> > folio_nr_pages() exceeding INT_MAX for a while.
>
> You'd be surprised. Let's assume we add support for PUD-sized pages
> (personally I think this is too large to make sense, but some people can't
> be told). On arm64, we can have a 64kB page size, so that's 13 bits per
> level for a total of 2^26 pages per PUD. That feels uncomfortably close
> to 2^32 to me.
>
> Anywhere you've found that's using an int to store folio_nr_pages() is
> somewhere we should probably switch to long.
There are a lot of them: rmap.c, shmem.c, khugepaged.c, etc.
> And this, btw, is why I've
> moved from using an int to store folio_size() to using size_t. A PMD is
> already 512MB (with a 64KB page size), and so a PUD will be 4TB.
Thanks for pointing this out. I assumed the presence of many places
using int to store folio_nr_pages() means that it's a general
assumption.
Also, if we think it's possible that a single folio size may approach
INT_MAX, then we are in bigger trouble for zswap_stored_pages, because
that's the total number of pages stored in zswap on the entire system.
That's much more likely to exceed INT_MAX than a single folio.
>
> thp_nr_pages() is not a good example. I'll be happy when we kill it;
> we're actually almost there.
Yeah I can only see 2 callers.
* Re: [PATCH v8 5/8] mm: zswap: Modify zswap_stored_pages to be atomic_long_t.
2024-09-28 2:16 ` [PATCH v8 5/8] mm: zswap: Modify zswap_stored_pages to be atomic_long_t Kanchana P Sridhar
2024-09-28 2:57 ` Yosry Ahmed
@ 2024-09-28 8:13 ` Yosry Ahmed
2024-09-29 21:04 ` Sridhar, Kanchana P
2024-09-28 13:53 ` Johannes Weiner
2024-09-28 23:27 ` Nhat Pham
3 siblings, 1 reply; 43+ messages in thread
From: Yosry Ahmed @ 2024-09-28 8:13 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal
On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> For zswap_store() to support large folios, we need to be able to do
> a batch update of zswap_stored_pages upon successful store of all pages
> in the folio. For this, we need to add folio_nr_pages(), which returns
> a long, to zswap_stored_pages.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
> fs/proc/meminfo.c | 2 +-
> include/linux/zswap.h | 2 +-
> mm/zswap.c | 19 +++++++++++++------
> 3 files changed, 15 insertions(+), 8 deletions(-)
>
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 245171d9164b..8ba9b1472390 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -91,7 +91,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> #ifdef CONFIG_ZSWAP
> show_val_kb(m, "Zswap: ", zswap_total_pages());
> seq_printf(m, "Zswapped: %8lu kB\n",
> - (unsigned long)atomic_read(&zswap_stored_pages) <<
> + (unsigned long)atomic_long_read(&zswap_stored_pages) <<
Do we still need this cast? "HardwareCorrupted" seems to be using
atomic_long_read() without a cast.
Otherwise this LGTM:
Acked-by: Yosry Ahmed <yosryahmed@google.com>
> (PAGE_SHIFT - 10));
> #endif
> show_val_kb(m, "Dirty: ",
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 9cd1beef0654..d961ead91bf1 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -7,7 +7,7 @@
>
> struct lruvec;
>
> -extern atomic_t zswap_stored_pages;
> +extern atomic_long_t zswap_stored_pages;
>
> #ifdef CONFIG_ZSWAP
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 0f281e50a034..43e4e216db41 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -43,7 +43,7 @@
> * statistics
> **********************************/
> /* The number of compressed pages currently stored in zswap */
> -atomic_t zswap_stored_pages = ATOMIC_INIT(0);
> +atomic_long_t zswap_stored_pages = ATOMIC_INIT(0);
>
> /*
> * The statistics below are not protected from concurrent access for
> @@ -802,7 +802,7 @@ static void zswap_entry_free(struct zswap_entry *entry)
> obj_cgroup_put(entry->objcg);
> }
> zswap_entry_cache_free(entry);
> - atomic_dec(&zswap_stored_pages);
> + atomic_long_dec(&zswap_stored_pages);
> }
>
> /*********************************
> @@ -1232,7 +1232,7 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
> nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> } else {
> nr_backing = zswap_total_pages();
> - nr_stored = atomic_read(&zswap_stored_pages);
> + nr_stored = atomic_long_read(&zswap_stored_pages);
> }
>
> if (!nr_stored)
> @@ -1501,7 +1501,7 @@ bool zswap_store(struct folio *folio)
> }
>
> /* update stats */
> - atomic_inc(&zswap_stored_pages);
> + atomic_long_inc(&zswap_stored_pages);
> count_vm_event(ZSWPOUT);
>
> return true;
> @@ -1650,6 +1650,13 @@ static int debugfs_get_total_size(void *data, u64 *val)
> }
> DEFINE_DEBUGFS_ATTRIBUTE(total_size_fops, debugfs_get_total_size, NULL, "%llu\n");
>
> +static int debugfs_get_stored_pages(void *data, u64 *val)
> +{
> + *val = atomic_long_read(&zswap_stored_pages);
> + return 0;
> +}
> +DEFINE_DEBUGFS_ATTRIBUTE(stored_pages_fops, debugfs_get_stored_pages, NULL, "%llu\n");
> +
> static int zswap_debugfs_init(void)
> {
> if (!debugfs_initialized())
> @@ -1673,8 +1680,8 @@ static int zswap_debugfs_init(void)
> zswap_debugfs_root, &zswap_written_back_pages);
> debugfs_create_file("pool_total_size", 0444,
> zswap_debugfs_root, NULL, &total_size_fops);
> - debugfs_create_atomic_t("stored_pages", 0444,
> - zswap_debugfs_root, &zswap_stored_pages);
> + debugfs_create_file("stored_pages", 0444,
> + zswap_debugfs_root, NULL, &stored_pages_fops);
>
> return 0;
> }
> --
> 2.27.0
>
* RE: [PATCH v8 5/8] mm: zswap: Modify zswap_stored_pages to be atomic_long_t.
2024-09-28 8:13 ` Yosry Ahmed
@ 2024-09-29 21:04 ` Sridhar, Kanchana P
0 siblings, 0 replies; 43+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-29 21:04 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Saturday, September 28, 2024 1:13 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 5/8] mm: zswap: Modify zswap_stored_pages to be
> atomic_long_t.
>
> On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > For zswap_store() to support large folios, we need to be able to do
> > a batch update of zswap_stored_pages upon successful store of all pages
> > in the folio. For this, we need to add folio_nr_pages(), which returns
> > a long, to zswap_stored_pages.
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> > fs/proc/meminfo.c | 2 +-
> > include/linux/zswap.h | 2 +-
> > mm/zswap.c | 19 +++++++++++++------
> > 3 files changed, 15 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > index 245171d9164b..8ba9b1472390 100644
> > --- a/fs/proc/meminfo.c
> > +++ b/fs/proc/meminfo.c
> > @@ -91,7 +91,7 @@ static int meminfo_proc_show(struct seq_file *m,
> void *v)
> > #ifdef CONFIG_ZSWAP
> > show_val_kb(m, "Zswap: ", zswap_total_pages());
> > seq_printf(m, "Zswapped: %8lu kB\n",
> > - (unsigned long)atomic_read(&zswap_stored_pages) <<
> > + (unsigned long)atomic_long_read(&zswap_stored_pages) <<
>
> Do we still need this cast? "HardwareCorrupted" seems to be using
> atomic_long_read() without a cast.
>
> Otherwise this LGTM:
> Acked-by: Yosry Ahmed <yosryahmed@google.com>
Thanks Yosry for the Acked-by's!
Thanks,
Kanchana
>
> > (PAGE_SHIFT - 10));
> > #endif
> > show_val_kb(m, "Dirty: ",
> > diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> > index 9cd1beef0654..d961ead91bf1 100644
> > --- a/include/linux/zswap.h
> > +++ b/include/linux/zswap.h
> > @@ -7,7 +7,7 @@
> >
> > struct lruvec;
> >
> > -extern atomic_t zswap_stored_pages;
> > +extern atomic_long_t zswap_stored_pages;
> >
> > #ifdef CONFIG_ZSWAP
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 0f281e50a034..43e4e216db41 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -43,7 +43,7 @@
> > * statistics
> > **********************************/
> > /* The number of compressed pages currently stored in zswap */
> > -atomic_t zswap_stored_pages = ATOMIC_INIT(0);
> > +atomic_long_t zswap_stored_pages = ATOMIC_INIT(0);
> >
> > /*
> > * The statistics below are not protected from concurrent access for
> > @@ -802,7 +802,7 @@ static void zswap_entry_free(struct zswap_entry
> *entry)
> > obj_cgroup_put(entry->objcg);
> > }
> > zswap_entry_cache_free(entry);
> > - atomic_dec(&zswap_stored_pages);
> > + atomic_long_dec(&zswap_stored_pages);
> > }
> >
> > /*********************************
> > @@ -1232,7 +1232,7 @@ static unsigned long zswap_shrinker_count(struct
> shrinker *shrinker,
> > nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> > } else {
> > nr_backing = zswap_total_pages();
> > - nr_stored = atomic_read(&zswap_stored_pages);
> > + nr_stored = atomic_long_read(&zswap_stored_pages);
> > }
> >
> > if (!nr_stored)
> > @@ -1501,7 +1501,7 @@ bool zswap_store(struct folio *folio)
> > }
> >
> > /* update stats */
> > - atomic_inc(&zswap_stored_pages);
> > + atomic_long_inc(&zswap_stored_pages);
> > count_vm_event(ZSWPOUT);
> >
> > return true;
> > @@ -1650,6 +1650,13 @@ static int debugfs_get_total_size(void *data,
> u64 *val)
> > }
> > DEFINE_DEBUGFS_ATTRIBUTE(total_size_fops, debugfs_get_total_size,
> NULL, "%llu\n");
> >
> > +static int debugfs_get_stored_pages(void *data, u64 *val)
> > +{
> > + *val = atomic_long_read(&zswap_stored_pages);
> > + return 0;
> > +}
> > +DEFINE_DEBUGFS_ATTRIBUTE(stored_pages_fops,
> debugfs_get_stored_pages, NULL, "%llu\n");
> > +
> > static int zswap_debugfs_init(void)
> > {
> > if (!debugfs_initialized())
> > @@ -1673,8 +1680,8 @@ static int zswap_debugfs_init(void)
> > zswap_debugfs_root, &zswap_written_back_pages);
> > debugfs_create_file("pool_total_size", 0444,
> > zswap_debugfs_root, NULL, &total_size_fops);
> > - debugfs_create_atomic_t("stored_pages", 0444,
> > - zswap_debugfs_root, &zswap_stored_pages);
> > + debugfs_create_file("stored_pages", 0444,
> > + zswap_debugfs_root, NULL, &stored_pages_fops);
> >
> > return 0;
> > }
> > --
> > 2.27.0
> >
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v8 5/8] mm: zswap: Modify zswap_stored_pages to be atomic_long_t.
2024-09-28 2:16 ` [PATCH v8 5/8] mm: zswap: Modify zswap_stored_pages to be atomic_long_t Kanchana P Sridhar
2024-09-28 2:57 ` Yosry Ahmed
2024-09-28 8:13 ` Yosry Ahmed
@ 2024-09-28 13:53 ` Johannes Weiner
2024-09-29 21:03 ` Sridhar, Kanchana P
2024-09-28 23:27 ` Nhat Pham
3 siblings, 1 reply; 43+ messages in thread
From: Johannes Weiner @ 2024-09-28 13:53 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal
On Fri, Sep 27, 2024 at 07:16:17PM -0700, Kanchana P Sridhar wrote:
> For zswap_store() to support large folios, we need to be able to do
> a batch update of zswap_stored_pages upon successful store of all pages
> in the folio. For this, we need to add folio_nr_pages(), which returns
> a long, to zswap_stored_pages.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Long for pages makes sense to me even independent of the large folios
coming in. An int is just 8TB in 4k (base) pages.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
^ permalink raw reply [flat|nested] 43+ messages in thread
* RE: [PATCH v8 5/8] mm: zswap: Modify zswap_stored_pages to be atomic_long_t.
2024-09-28 13:53 ` Johannes Weiner
@ 2024-09-29 21:03 ` Sridhar, Kanchana P
0 siblings, 0 replies; 43+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-29 21:03 UTC (permalink / raw)
To: Johannes Weiner
Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
Kanchana P
> -----Original Message-----
> From: Johannes Weiner <hannes@cmpxchg.org>
> Sent: Saturday, September 28, 2024 6:54 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 5/8] mm: zswap: Modify zswap_stored_pages to be
> atomic_long_t.
>
> On Fri, Sep 27, 2024 at 07:16:17PM -0700, Kanchana P Sridhar wrote:
> > For zswap_store() to support large folios, we need to be able to do
> > a batch update of zswap_stored_pages upon successful store of all pages
> > in the folio. For this, we need to add folio_nr_pages(), which returns
> > a long, to zswap_stored_pages.
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
>
> Long for pages makes sense to me even independent of the large folios
> coming in. An int is just 8TB in 4k (base) pages.
>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Thanks Johannes for the Acked-by's!
Thanks,
Kanchana
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v8 5/8] mm: zswap: Modify zswap_stored_pages to be atomic_long_t.
2024-09-28 2:16 ` [PATCH v8 5/8] mm: zswap: Modify zswap_stored_pages to be atomic_long_t Kanchana P Sridhar
` (2 preceding siblings ...)
2024-09-28 13:53 ` Johannes Weiner
@ 2024-09-28 23:27 ` Nhat Pham
3 siblings, 0 replies; 43+ messages in thread
From: Nhat Pham @ 2024-09-28 23:27 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal
On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> For zswap_store() to support large folios, we need to be able to do
> a batch update of zswap_stored_pages upon successful store of all pages
> in the folio. For this, we need to add folio_nr_pages(), which returns
> a long, to zswap_stored_pages.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH v8 6/8] mm: zswap: Support large folios in zswap_store().
2024-09-28 2:16 [PATCH v8 0/8] mm: zswap swap-out of large folios Kanchana P Sridhar
` (4 preceding siblings ...)
2024-09-28 2:16 ` [PATCH v8 5/8] mm: zswap: Modify zswap_stored_pages to be atomic_long_t Kanchana P Sridhar
@ 2024-09-28 2:16 ` Kanchana P Sridhar
2024-09-28 3:42 ` Yosry Ahmed
2024-09-28 6:05 ` Chengming Zhou
2024-09-28 2:16 ` [PATCH v8 7/8] mm: swap: Count successful large folio zswap stores in hugepage zswpout stats Kanchana P Sridhar
` (2 subsequent siblings)
8 siblings, 2 replies; 43+ messages in thread
From: Kanchana P Sridhar @ 2024-09-28 2:16 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
zswap_store() will store large folios by compressing them page by page.
This patch provides a sequential implementation of storing a large folio
in zswap_store() by iterating through each page in the folio to compress
and store it in the zswap zpool.
Towards this goal, zswap_compress() is modified to take a page instead
of a folio as input.
Each page's swap offset is stored as a separate zswap entry.
We check the cgroup zswap limit and the zpool utilization against
the zswap max/accept_threshold limits once, at the beginning of
zswap_store. We also obtain a percpu_ref_tryget() reference to the
current zswap_pool at the start of zswap_store to prevent it from
being deleted while the folio is being stored.
If these one-time checks pass, we compress the sub-pages of the folio,
while maintaining a running count of compressed bytes for all the folio's
sub-pages. If all pages are successfully compressed and stored, we do the
cgroup zswap charging with the total compressed bytes, and batch update
the zswap_stored_pages atomic/zswpout event stats with folio_nr_pages()
once, before returning from zswap_store.
The patch adds a new zswap_pool_get() function that is called in the
sub-page level "zswap_store_page()" function.
If an error is encountered during the store of any page in the folio,
all pages in that folio currently stored in zswap will be invalidated.
Thus, a folio is either entirely stored in zswap, or entirely not stored
in zswap.
This patch forms the basis for building compress batching of pages in a
large folio in zswap_store by compressing up to say, 8 pages of the folio
in parallel in hardware using the Intel In-Memory Analytics Accelerator
(Intel IAA).
This change reuses and adapts the functionality in Ryan Roberts' RFC
patch [1]:
"[RFC,v1] mm: zswap: Store large folios without splitting"
[1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
Also, addressed some of the RFC comments from the discussion in [1].
Co-developed-by: Ryan Roberts
Signed-off-by:
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
mm/zswap.c | 227 ++++++++++++++++++++++++++++++++++++++---------------
1 file changed, 165 insertions(+), 62 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 43e4e216db41..b8395ddf7b7c 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -411,6 +411,16 @@ static int __must_check zswap_pool_tryget(struct zswap_pool *pool)
return percpu_ref_tryget(&pool->ref);
}
+/*
+ * Note: zswap_pool_get() should only be called after zswap_pool_tryget()
+ * returns success. zswap_pool_tryget() returns success only if the "pool" is
+ * non-NULL and the "&pool->ref" is non-0.
+ */
+static void zswap_pool_get(struct zswap_pool *pool)
+{
+ percpu_ref_get(&pool->ref);
+}
+
static void zswap_pool_put(struct zswap_pool *pool)
{
percpu_ref_put(&pool->ref);
@@ -1402,38 +1412,35 @@ static void shrink_worker(struct work_struct *w)
/*********************************
* main API
**********************************/
-bool zswap_store(struct folio *folio)
+
+/*
+ * Stores the page at specified "index" in a folio.
+ *
+ * @folio: The folio to store in zswap.
+ * @index: Index into the page in the folio that this function will store.
+ * @objcg: The folio's objcg.
+ * @pool: The zswap_pool to store the compressed data for the page.
+ * The caller should have obtained a reference to a valid
+ * zswap_pool by calling zswap_pool_tryget(), to pass as this
+ * argument.
+ * @compressed_bytes: The compressed entry->length value is added
+ * to this, so that the caller can get the total
+ * compressed lengths of all sub-pages in a folio.
+ */
+static bool zswap_store_page(struct folio *folio, long index,
+ struct obj_cgroup *objcg,
+ struct zswap_pool *pool,
+ size_t *compressed_bytes)
{
+ struct page *page = folio_page(folio, index);
swp_entry_t swp = folio->swap;
- pgoff_t offset = swp_offset(swp);
struct xarray *tree = swap_zswap_tree(swp);
+ pgoff_t offset = swp_offset(swp) + index;
struct zswap_entry *entry, *old;
- struct obj_cgroup *objcg = NULL;
- struct mem_cgroup *memcg = NULL;
-
- VM_WARN_ON_ONCE(!folio_test_locked(folio));
- VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
+ int type = swp_type(swp);
- /* Large folios aren't supported */
- if (folio_test_large(folio))
- return false;
-
- if (!zswap_enabled)
- goto check_old;
-
- /* Check cgroup limits */
- objcg = get_obj_cgroup_from_folio(folio);
- if (objcg && !obj_cgroup_may_zswap(objcg)) {
- memcg = get_mem_cgroup_from_objcg(objcg);
- if (shrink_memcg(memcg)) {
- mem_cgroup_put(memcg);
- goto reject;
- }
- mem_cgroup_put(memcg);
- }
-
- if (zswap_check_limits())
- goto reject;
+ if (objcg)
+ obj_cgroup_get(objcg);
/* allocate entry */
entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
@@ -1442,24 +1449,21 @@ bool zswap_store(struct folio *folio)
goto reject;
}
- /* if entry is successfully added, it keeps the reference */
- entry->pool = zswap_pool_current_get();
- if (!entry->pool)
- goto freepage;
+ /*
+ * We get here only after the call to zswap_pool_tryget() in the
+ * caller, zswap_store(), has returned success. Hence it is safe
+ * to call zswap_pool_get().
+ *
+ * if entry is successfully added, it keeps the reference
+ */
+ zswap_pool_get(pool);
- if (objcg) {
- memcg = get_mem_cgroup_from_objcg(objcg);
- if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
- mem_cgroup_put(memcg);
- goto put_pool;
- }
- mem_cgroup_put(memcg);
- }
+ entry->pool = pool;
- if (!zswap_compress(&folio->page, entry))
+ if (!zswap_compress(page, entry))
goto put_pool;
- entry->swpentry = swp;
+ entry->swpentry = swp_entry(type, offset);
entry->objcg = objcg;
entry->referenced = true;
@@ -1480,11 +1484,6 @@ bool zswap_store(struct folio *folio)
if (old)
zswap_entry_free(old);
- if (objcg) {
- obj_cgroup_charge_zswap(objcg, entry->length);
- count_objcg_event(objcg, ZSWPOUT);
- }
-
/*
* We finish initializing the entry while it's already in xarray.
* This is safe because:
@@ -1496,36 +1495,140 @@ bool zswap_store(struct folio *folio)
* an incoherent entry.
*/
if (entry->length) {
+ *compressed_bytes += entry->length;
INIT_LIST_HEAD(&entry->lru);
zswap_lru_add(&zswap_list_lru, entry);
}
- /* update stats */
- atomic_long_inc(&zswap_stored_pages);
- count_vm_event(ZSWPOUT);
-
return true;
store_failed:
zpool_free(entry->pool->zpool, entry->handle);
put_pool:
- zswap_pool_put(entry->pool);
-freepage:
+ zswap_pool_put(pool);
zswap_entry_cache_free(entry);
reject:
obj_cgroup_put(objcg);
- if (zswap_pool_reached_full)
- queue_work(shrink_wq, &zswap_shrink_work);
-check_old:
+ return false;
+}
+
+bool zswap_store(struct folio *folio)
+{
+ long nr_pages = folio_nr_pages(folio);
+ swp_entry_t swp = folio->swap;
+ struct xarray *tree = swap_zswap_tree(swp);
+ pgoff_t offset = swp_offset(swp);
+ struct obj_cgroup *objcg = NULL;
+ struct mem_cgroup *memcg = NULL;
+ struct zswap_pool *pool;
+ size_t compressed_bytes = 0;
+ bool ret = false;
+ long index;
+
+ VM_WARN_ON_ONCE(!folio_test_locked(folio));
+ VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
+
+ if (!zswap_enabled)
+ goto reject;
+
/*
- * If the zswap store fails or zswap is disabled, we must invalidate the
- * possibly stale entry which was previously stored at this offset.
- * Otherwise, writeback could overwrite the new data in the swapfile.
+ * Check cgroup zswap limits:
+ *
+ * The cgroup zswap limit check is done once at the beginning of
+ * zswap_store(). The cgroup charging is done once, at the end
+ * of a successful folio store. What this means is, if the cgroup
+ * was within the zswap_max limit at the beginning of a large folio
+ * store, it could go over the limit by at most (HPAGE_PMD_NR - 1)
+ * pages.
*/
- entry = xa_erase(tree, offset);
- if (entry)
- zswap_entry_free(entry);
- return false;
+ objcg = get_obj_cgroup_from_folio(folio);
+ if (objcg && !obj_cgroup_may_zswap(objcg)) {
+ memcg = get_mem_cgroup_from_objcg(objcg);
+ if (shrink_memcg(memcg)) {
+ mem_cgroup_put(memcg);
+ goto put_objcg;
+ }
+ mem_cgroup_put(memcg);
+ }
+
+ /*
+ * Check zpool utilization against zswap limits:
+ *
+ * The zswap zpool utilization is also checked against the limits
+ * just once, at the start of zswap_store(). If the check passes,
+ * any breaches of the limits set by zswap_max_pages() or
+ * zswap_accept_thr_pages() that may happen while storing this
+ * folio, will only be detected during the next call to
+ * zswap_store() by any process.
+ */
+ if (zswap_check_limits())
+ goto put_objcg;
+
+ pool = zswap_pool_current_get();
+ if (!pool)
+ goto put_objcg;
+
+ if (objcg) {
+ memcg = get_mem_cgroup_from_objcg(objcg);
+ if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
+ mem_cgroup_put(memcg);
+ goto put_pool;
+ }
+ mem_cgroup_put(memcg);
+ }
+
+ /*
+ * Store each page of the folio as a separate entry. If we fail to
+ * store a page, unwind by deleting all the pages for this folio
+ * currently in zswap.
+ */
+ for (index = 0; index < nr_pages; ++index) {
+ if (!zswap_store_page(folio, index, objcg, pool, &compressed_bytes))
+ goto put_pool;
+ }
+
+ /*
+ * All pages in the folio have been successfully stored.
+ * Batch update the cgroup zswap charging, zswap_stored_page atomic,
+ * and ZSWPOUT event stats.
+ */
+ if (objcg) {
+ obj_cgroup_charge_zswap(objcg, compressed_bytes);
+ count_objcg_events(objcg, ZSWPOUT, nr_pages);
+ }
+
+ /* update stats */
+ atomic_long_add(nr_pages, &zswap_stored_pages);
+ count_vm_events(ZSWPOUT, nr_pages);
+
+ ret = true;
+
+put_pool:
+ zswap_pool_put(pool);
+put_objcg:
+ obj_cgroup_put(objcg);
+reject:
+ /*
+ * If the zswap store fails or zswap is disabled, we must invalidate
+ * the possibly stale entries which were previously stored at the
+ * offsets corresponding to each page of the folio. Otherwise,
+ * writeback could overwrite the new data in the swapfile.
+ */
+ if (!ret) {
+ struct zswap_entry *entry;
+ long i;
+
+ for (i = 0; i < nr_pages; ++i) {
+ entry = xa_erase(tree, offset + i);
+ if (entry)
+ zswap_entry_free(entry);
+ }
+
+ if (zswap_pool_reached_full)
+ queue_work(shrink_wq, &zswap_shrink_work);
+ }
+
+ return ret;
}
bool zswap_load(struct folio *folio)
--
2.27.0
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v8 6/8] mm: zswap: Support large folios in zswap_store().
2024-09-28 2:16 ` [PATCH v8 6/8] mm: zswap: Support large folios in zswap_store() Kanchana P Sridhar
@ 2024-09-28 3:42 ` Yosry Ahmed
2024-09-28 14:15 ` Johannes Weiner
2024-09-29 21:24 ` Sridhar, Kanchana P
2024-09-28 6:05 ` Chengming Zhou
1 sibling, 2 replies; 43+ messages in thread
From: Yosry Ahmed @ 2024-09-28 3:42 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal
On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> zswap_store() will store large folios by compressing them page by page.
>
> This patch provides a sequential implementation of storing a large folio
> in zswap_store() by iterating through each page in the folio to compress
> and store it in the zswap zpool.
>
> Towards this goal, zswap_compress() is modified to take a page instead
> of a folio as input.
It is already modified at this point, it's not part of this patch.
>
> Each page's swap offset is stored as a separate zswap entry.
I think "Each page is compressed individually and stored as a separate
zswap entry" is clearer. We do not store the offsets.
>
> We check the cgroup zswap limit and the zpool utilization against
> the zswap max/accept_threshold limits once, at the beginning of
> zswap_store. We also obtain a percpu_ref_tryget() reference to the
> current zswap_pool at the start of zswap_store to prevent it from
> being deleted while the folio is being stored.
This can be clarified further:
We check the global and per-cgroup limits once at the beginning, and
only check that the limit is not reached yet. This is racy and
inaccurate, but it should be sufficient for now. We also obtain
initial references to the relevant objcg and pool to guarantee that
subsequent references can be acquired by zswap_store_page().
>
> If these one-time checks pass, we compress the sub-pages of the folio,
> while maintaining a running count of compressed bytes for all the folio's
> sub-pages. If all pages are successfully compressed and stored, we do the
> cgroup zswap charging with the total compressed bytes, and batch update
> the zswap_stored_pages atomic/zswpout event stats with folio_nr_pages()
> once, before returning from zswap_store.
Please consistently use "page" instead of "sub-page", and add "()" to
the end of function names.
>
> The patch adds a new zswap_pool_get() function that is called in the
> sub-page level "zswap_store_page()" function.
"that is called by zswap_store_page(), which handles compressing and
storing each page in the folio."
>
> If an error is encountered during the store of any page in the folio,
> all pages in that folio currently stored in zswap will be invalidated.
> Thus, a folio is either entirely stored in zswap, or entirely not stored
> in zswap.
>
> This patch forms the basis for building compress batching of pages in a
> large folio in zswap_store by compressing up to say, 8 pages of the folio
> in parallel in hardware using the Intel In-Memory Analytics Accelerator
> (Intel IAA).
This patch also has its own merits, it batches some operations, and
more importantly enables swapping out large folios to zswap without
splitting them.
>
> This change reuses and adapts the functionality in Ryan Roberts' RFC
> patch [1]:
>
> "[RFC,v1] mm: zswap: Store large folios without splitting"
>
> [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Also, addressed some of the RFC comments from the discussion in [1].
>
> Co-developed-by: Ryan Roberts
> Signed-off-by:
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
> mm/zswap.c | 227 ++++++++++++++++++++++++++++++++++++++---------------
> 1 file changed, 165 insertions(+), 62 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 43e4e216db41..b8395ddf7b7c 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -411,6 +411,16 @@ static int __must_check zswap_pool_tryget(struct zswap_pool *pool)
> return percpu_ref_tryget(&pool->ref);
> }
>
> +/*
> + * Note: zswap_pool_get() should only be called after zswap_pool_tryget()
> + * returns success. zswap_pool_tryget() returns success only if the "pool" is
> + * non-NULL and the "&pool->ref" is non-0.
Just use the same statement used by css_get():
"The caller must already have a reference."
> + */
> +static void zswap_pool_get(struct zswap_pool *pool)
> +{
> + percpu_ref_get(&pool->ref);
> +}
> +
> static void zswap_pool_put(struct zswap_pool *pool)
> {
> percpu_ref_put(&pool->ref);
> @@ -1402,38 +1412,35 @@ static void shrink_worker(struct work_struct *w)
> /*********************************
> * main API
> **********************************/
> -bool zswap_store(struct folio *folio)
> +
> +/*
> + * Stores the page at specified "index" in a folio.
> + *
> + * @folio: The folio to store in zswap.
> + * @index: Index into the page in the folio that this function will store.
> + * @objcg: The folio's objcg.
> + * @pool: The zswap_pool to store the compressed data for the page.
> + * The caller should have obtained a reference to a valid
> + * zswap_pool by calling zswap_pool_tryget(), to pass as this
> + * argument.
> + * @compressed_bytes: The compressed entry->length value is added
> + * to this, so that the caller can get the total
> + * compressed lengths of all sub-pages in a folio.
> + */
> +static bool zswap_store_page(struct folio *folio, long index,
> + struct obj_cgroup *objcg,
> + struct zswap_pool *pool,
> + size_t *compressed_bytes)
Would it be clearer to just return the compressed size, and return -1
upon failure?
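For illustration, the caller's loop might then look something like this (just
a sketch of the suggestion, not code from this patch; the negative return
value and the ssize_t type are assumptions):

	for (index = 0; index < nr_pages; ++index) {
		ssize_t bytes = zswap_store_page(folio, index, objcg, pool);

		/* a negative value would signal a failed store of this page */
		if (bytes < 0)
			goto put_pool;
		compressed_bytes += bytes;
	}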
> {
> + struct page *page = folio_page(folio, index);
> swp_entry_t swp = folio->swap;
> - pgoff_t offset = swp_offset(swp);
> struct xarray *tree = swap_zswap_tree(swp);
> + pgoff_t offset = swp_offset(swp) + index;
> struct zswap_entry *entry, *old;
> - struct obj_cgroup *objcg = NULL;
> - struct mem_cgroup *memcg = NULL;
> -
> - VM_WARN_ON_ONCE(!folio_test_locked(folio));
> - VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> + int type = swp_type(swp);
Why do we need type? We use it when initializing entry->swpentry to
reconstruct the swp_entry_t we already have.
>
> - /* Large folios aren't supported */
> - if (folio_test_large(folio))
> - return false;
> -
> - if (!zswap_enabled)
> - goto check_old;
> -
> - /* Check cgroup limits */
> - objcg = get_obj_cgroup_from_folio(folio);
> - if (objcg && !obj_cgroup_may_zswap(objcg)) {
> - memcg = get_mem_cgroup_from_objcg(objcg);
> - if (shrink_memcg(memcg)) {
> - mem_cgroup_put(memcg);
> - goto reject;
> - }
> - mem_cgroup_put(memcg);
> - }
> -
> - if (zswap_check_limits())
> - goto reject;
> + if (objcg)
> + obj_cgroup_get(objcg);
>
> /* allocate entry */
> entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> @@ -1442,24 +1449,21 @@ bool zswap_store(struct folio *folio)
> goto reject;
> }
>
> - /* if entry is successfully added, it keeps the reference */
> - entry->pool = zswap_pool_current_get();
> - if (!entry->pool)
> - goto freepage;
> + /*
> + * We get here only after the call to zswap_pool_tryget() in the
> + * caller, zswap_store(), has returned success. Hence it is safe
> + * to call zswap_pool_get().
> + *
> + * if entry is successfully added, it keeps the reference
> + */
> + zswap_pool_get(pool);
I would move this right above obj_cgroup_get() and have a single
concise comment for both, e.g.:
/* zswap_store() already holds a ref on 'pool' and 'objcg' */
>
> - if (objcg) {
> - memcg = get_mem_cgroup_from_objcg(objcg);
> - if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> - mem_cgroup_put(memcg);
> - goto put_pool;
> - }
> - mem_cgroup_put(memcg);
> - }
> + entry->pool = pool;
>
> - if (!zswap_compress(&folio->page, entry))
> + if (!zswap_compress(page, entry))
> goto put_pool;
>
> - entry->swpentry = swp;
> + entry->swpentry = swp_entry(type, offset);
> entry->objcg = objcg;
> entry->referenced = true;
>
> @@ -1480,11 +1484,6 @@ bool zswap_store(struct folio *folio)
> if (old)
> zswap_entry_free(old);
>
> - if (objcg) {
> - obj_cgroup_charge_zswap(objcg, entry->length);
> - count_objcg_event(objcg, ZSWPOUT);
> - }
> -
> /*
> * We finish initializing the entry while it's already in xarray.
> * This is safe because:
> @@ -1496,36 +1495,140 @@ bool zswap_store(struct folio *folio)
> * an incoherent entry.
> */
> if (entry->length) {
> + *compressed_bytes += entry->length;
> INIT_LIST_HEAD(&entry->lru);
> zswap_lru_add(&zswap_list_lru, entry);
> }
>
> - /* update stats */
> - atomic_long_inc(&zswap_stored_pages);
> - count_vm_event(ZSWPOUT);
> -
> return true;
>
> store_failed:
> zpool_free(entry->pool->zpool, entry->handle);
> put_pool:
> - zswap_pool_put(entry->pool);
> -freepage:
> + zswap_pool_put(pool);
> zswap_entry_cache_free(entry);
> reject:
Please rename this to put_objcg, we are no longer "rejecting" here.
> obj_cgroup_put(objcg);
> - if (zswap_pool_reached_full)
> - queue_work(shrink_wq, &zswap_shrink_work);
> -check_old:
> + return false;
> +}
> +
> +bool zswap_store(struct folio *folio)
> +{
> + long nr_pages = folio_nr_pages(folio);
> + swp_entry_t swp = folio->swap;
> + struct xarray *tree = swap_zswap_tree(swp);
> + pgoff_t offset = swp_offset(swp);
> + struct obj_cgroup *objcg = NULL;
> + struct mem_cgroup *memcg = NULL;
> + struct zswap_pool *pool;
> + size_t compressed_bytes = 0;
Why size_t? entry->length is int.
> + bool ret = false;
> + long index;
> +
> + VM_WARN_ON_ONCE(!folio_test_locked(folio));
> + VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> +
> + if (!zswap_enabled)
> + goto reject;
> +
> /*
> - * If the zswap store fails or zswap is disabled, we must invalidate the
> - * possibly stale entry which was previously stored at this offset.
> - * Otherwise, writeback could overwrite the new data in the swapfile.
> + * Check cgroup zswap limits:
> + *
> + * The cgroup zswap limit check is done once at the beginning of
> + * zswap_store(). The cgroup charging is done once, at the end
> + * of a successful folio store. What this means is, if the cgroup
> + * was within the zswap_max limit at the beginning of a large folio
> + * store, it could go over the limit by at most (HPAGE_PMD_NR - 1)
> + * pages.
Add ".. due to this store", as it can be much more over the limit when
stores are racing. Also, this comment can be slightly generalized and
the comment above zswap_check_limits() omitted.
> */
> - entry = xa_erase(tree, offset);
> - if (entry)
> - zswap_entry_free(entry);
> - return false;
> + objcg = get_obj_cgroup_from_folio(folio);
> + if (objcg && !obj_cgroup_may_zswap(objcg)) {
> + memcg = get_mem_cgroup_from_objcg(objcg);
> + if (shrink_memcg(memcg)) {
> + mem_cgroup_put(memcg);
> + goto put_objcg;
> + }
> + mem_cgroup_put(memcg);
> + }
> +
> + /*
> + * Check zpool utilization against zswap limits:
> + *
> + * The zswap zpool utilization is also checked against the limits
> + * just once, at the start of zswap_store(). If the check passes,
> + * any breaches of the limits set by zswap_max_pages() or
> + * zswap_accept_thr_pages() that may happen while storing this
> + * folio, will only be detected during the next call to
> + * zswap_store() by any process.
> + */
> + if (zswap_check_limits())
> + goto put_objcg;
> +
> + pool = zswap_pool_current_get();
> + if (!pool)
> + goto put_objcg;
> +
> + if (objcg) {
> + memcg = get_mem_cgroup_from_objcg(objcg);
> + if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> + mem_cgroup_put(memcg);
> + goto put_pool;
> + }
> + mem_cgroup_put(memcg);
> + }
> +
> + /*
> + * Store each page of the folio as a separate entry. If we fail to
> + * store a page, unwind by deleting all the pages for this folio
> + * currently in zswap.
> + */
> + for (index = 0; index < nr_pages; ++index) {
> + if (!zswap_store_page(folio, index, objcg, pool, &compressed_bytes))
> + goto put_pool;
> + }
> +
> + /*
> + * All pages in the folio have been successfully stored.
> + * Batch update the cgroup zswap charging, zswap_stored_page atomic,
> + * and ZSWPOUT event stats.
> + */
This comment seems unnecessary.
> + if (objcg) {
> + obj_cgroup_charge_zswap(objcg, compressed_bytes);
> + count_objcg_events(objcg, ZSWPOUT, nr_pages);
> + }
> +
> + /* update stats */
Please delete this comment too.
> + atomic_long_add(nr_pages, &zswap_stored_pages);
> + count_vm_events(ZSWPOUT, nr_pages);
> +
> + ret = true;
> +
> +put_pool:
> + zswap_pool_put(pool);
> +put_objcg:
> + obj_cgroup_put(objcg);
> +reject:
> + /*
> + * If the zswap store fails or zswap is disabled, we must invalidate
> + * the possibly stale entries which were previously stored at the
> + * offsets corresponding to each page of the folio. Otherwise,
> + * writeback could overwrite the new data in the swapfile.
> + */
> + if (!ret) {
> + struct zswap_entry *entry;
> + long i;
> +
> + for (i = 0; i < nr_pages; ++i) {
Just reuse 'index' here.
> + entry = xa_erase(tree, offset + i);
> + if (entry)
> + zswap_entry_free(entry);
I think we need a comment in zswap_store_page() that we shouldn't have
any possibility of failure after the entry is added in the tree.
Otherwise we may duplicate the cleanup work here (e.g. putting refs).
This is already the case in zswap_store(), but it's now more subtle
and easier to break when the code is split and we have two cleanup
paths.
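Something along these lines, perhaps, placed right after the xa_store() in
zswap_store_page() (the wording below is only a suggestion):

	/*
	 * The entry is published in the xarray from this point on, so nothing
	 * below may fail: zswap_store()'s unwind loop will xa_erase() and free
	 * every entry already in the tree, and any local cleanup here (e.g.
	 * dropping the pool/objcg refs) would then be duplicated.
	 */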
> + }
> +
> + if (zswap_pool_reached_full)
> + queue_work(shrink_wq, &zswap_shrink_work);
We are now doing this check even when zswap is disabled. I think this
is a problem.
If zswap gets disabled while 'zswap_pool_reached_full' is true, we
will queue 'zswap_shrink_work' every time zswap_store() is called in
the swapout path, but 'zswap_pool_reached_full' will never become
false again because zswap_check_limits() will never be called.
I think we need to maintain two separate "reject" and "check_old"
labels, and avoid this check when zswap is disabled.
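That is, something like the following (illustrative only, not code from this
series), so the zswap-disabled path jumps straight to the stale-entry cleanup
and never evaluates zswap_pool_reached_full:

	if (!zswap_enabled)
		goto check_old;

	/* ... limit checks, per-page stores, stat updates; return true ... */

reject:
	obj_cgroup_put(objcg);
	if (zswap_pool_reached_full)
		queue_work(shrink_wq, &zswap_shrink_work);
check_old:
	for (index = 0; index < nr_pages; ++index) {
		struct zswap_entry *entry = xa_erase(tree, offset + index);

		if (entry)
			zswap_entry_free(entry);
	}
	return false;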
> + }
> +
> + return ret;
> }
>
> bool zswap_load(struct folio *folio)
> --
> 2.27.0
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v8 6/8] mm: zswap: Support large folios in zswap_store().
2024-09-28 3:42 ` Yosry Ahmed
@ 2024-09-28 14:15 ` Johannes Weiner
2024-09-28 18:11 ` Yosry Ahmed
2024-09-29 21:24 ` Sridhar, Kanchana P
1 sibling, 1 reply; 43+ messages in thread
From: Johannes Weiner @ 2024-09-28 14:15 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Kanchana P Sridhar, linux-kernel, linux-mm, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
vinodh.gopal
On Fri, Sep 27, 2024 at 08:42:16PM -0700, Yosry Ahmed wrote:
> On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
> > {
> > + struct page *page = folio_page(folio, index);
> > swp_entry_t swp = folio->swap;
> > - pgoff_t offset = swp_offset(swp);
> > struct xarray *tree = swap_zswap_tree(swp);
> > + pgoff_t offset = swp_offset(swp) + index;
> > struct zswap_entry *entry, *old;
> > - struct obj_cgroup *objcg = NULL;
> > - struct mem_cgroup *memcg = NULL;
> > -
> > - VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > - VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > + int type = swp_type(swp);
>
> Why do we need type? We use it when initializing entry->swpentry to
> reconstruct the swp_entry_t we already have.
It's not the same entry. folio->swap points to the head entry, this
function has to store swap entries with the offsets of each subpage.
Given the name of this function, it might be better to actually pass a
page pointer to it; do the folio_page() inside zswap_store().
Then do
entry->swpentry = page_swap_entry(page);
below.
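i.e., roughly (a sketch of that shape only, with the rest of the argument
list kept as in this patch):

	static bool zswap_store_page(struct page *page,
				     struct obj_cgroup *objcg,
				     struct zswap_pool *pool,
				     size_t *compressed_bytes)
	{
		swp_entry_t page_swpentry = page_swap_entry(page);
		struct xarray *tree = swap_zswap_tree(page_swpentry);
		pgoff_t offset = swp_offset(page_swpentry);
		...
	}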
> > obj_cgroup_put(objcg);
> > - if (zswap_pool_reached_full)
> > - queue_work(shrink_wq, &zswap_shrink_work);
> > -check_old:
> > + return false;
> > +}
> > +
> > +bool zswap_store(struct folio *folio)
> > +{
> > + long nr_pages = folio_nr_pages(folio);
> > + swp_entry_t swp = folio->swap;
> > + struct xarray *tree = swap_zswap_tree(swp);
> > + pgoff_t offset = swp_offset(swp);
> > + struct obj_cgroup *objcg = NULL;
> > + struct mem_cgroup *memcg = NULL;
> > + struct zswap_pool *pool;
> > + size_t compressed_bytes = 0;
>
> Why size_t? entry->length is int.
In light of Willy's comment, I think size_t is a good idea.
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v8 6/8] mm: zswap: Support large folios in zswap_store().
2024-09-28 14:15 ` Johannes Weiner
@ 2024-09-28 18:11 ` Yosry Ahmed
2024-09-29 21:15 ` Sridhar, Kanchana P
0 siblings, 1 reply; 43+ messages in thread
From: Yosry Ahmed @ 2024-09-28 18:11 UTC (permalink / raw)
To: Johannes Weiner
Cc: Kanchana P Sridhar, linux-kernel, linux-mm, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali,
vinodh.gopal
On Sat, Sep 28, 2024 at 7:15 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Sep 27, 2024 at 08:42:16PM -0700, Yosry Ahmed wrote:
> > On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
> > > {
> > > + struct page *page = folio_page(folio, index);
> > > swp_entry_t swp = folio->swap;
> > > - pgoff_t offset = swp_offset(swp);
> > > struct xarray *tree = swap_zswap_tree(swp);
> > > + pgoff_t offset = swp_offset(swp) + index;
> > > struct zswap_entry *entry, *old;
> > > - struct obj_cgroup *objcg = NULL;
> > > - struct mem_cgroup *memcg = NULL;
> > > -
> > > - VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > > - VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > > + int type = swp_type(swp);
> >
> > Why do we need type? We use it when initializing entry->swpentry to
> > reconstruct the swp_entry_t we already have.
>
> It's not the same entry. folio->swap points to the head entry, this
> function has to store swap entries with the offsets of each subpage.
Duh, yeah, thanks.
>
> Given the name of this function, it might be better to actually pass a
> page pointer to it; do the folio_page() inside zswap_store().
>
> Then do
>
> entry->swpentry = page_swap_entry(page);
>
> below.
That is indeed clearer.
Although this will be adding yet another caller of page_swap_entry()
that already has the folio, yet it calls page_swap_entry() for each
page in the folio, which calls page_folio() inside.
I wonder if we should add (or replace page_swap_entry()) with a
folio_swap_entry(folio, index) helper. This can also be done as a
follow up.
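Something like this, perhaps (purely illustrative; the name and signature are
assumptions, no such helper exists today):

	static inline swp_entry_t folio_swap_entry(struct folio *folio, long index)
	{
		swp_entry_t swp = folio->swap;

		return swp_entry(swp_type(swp), swp_offset(swp) + index);
	}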
>
> > > obj_cgroup_put(objcg);
> > > - if (zswap_pool_reached_full)
> > > - queue_work(shrink_wq, &zswap_shrink_work);
> > > -check_old:
> > > + return false;
> > > +}
> > > +
> > > +bool zswap_store(struct folio *folio)
> > > +{
> > > + long nr_pages = folio_nr_pages(folio);
> > > + swp_entry_t swp = folio->swap;
> > > + struct xarray *tree = swap_zswap_tree(swp);
> > > + pgoff_t offset = swp_offset(swp);
> > > + struct obj_cgroup *objcg = NULL;
> > > + struct mem_cgroup *memcg = NULL;
> > > + struct zswap_pool *pool;
> > > + size_t compressed_bytes = 0;
> >
> > Why size_t? entry->length is int.
>
> In light of Willy's comment, I think size_t is a good idea.
Agreed.
^ permalink raw reply [flat|nested] 43+ messages in thread
* RE: [PATCH v8 6/8] mm: zswap: Support large folios in zswap_store().
2024-09-28 18:11 ` Yosry Ahmed
@ 2024-09-29 21:15 ` Sridhar, Kanchana P
2024-09-30 17:55 ` Sridhar, Kanchana P
0 siblings, 1 reply; 43+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-29 21:15 UTC (permalink / raw)
To: Yosry Ahmed, Johannes Weiner
Cc: linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642,
shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou,
Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Saturday, September 28, 2024 11:11 AM
> To: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 6/8] mm: zswap: Support large folios in zswap_store().
>
> On Sat, Sep 28, 2024 at 7:15 AM Johannes Weiner <hannes@cmpxchg.org>
> wrote:
> >
> > On Fri, Sep 27, 2024 at 08:42:16PM -0700, Yosry Ahmed wrote:
> > > On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
> > > > {
> > > > + struct page *page = folio_page(folio, index);
> > > > swp_entry_t swp = folio->swap;
> > > > - pgoff_t offset = swp_offset(swp);
> > > > struct xarray *tree = swap_zswap_tree(swp);
> > > > + pgoff_t offset = swp_offset(swp) + index;
> > > > struct zswap_entry *entry, *old;
> > > > - struct obj_cgroup *objcg = NULL;
> > > > - struct mem_cgroup *memcg = NULL;
> > > > -
> > > > - VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > > > - VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > > > + int type = swp_type(swp);
> > >
> > > Why do we need type? We use it when initializing entry->swpentry to
> > > reconstruct the swp_entry_t we already have.
> >
> > It's not the same entry. folio->swap points to the head entry, this
> > function has to store swap entries with the offsets of each subpage.
>
> Duh, yeah, thanks.
>
> >
> > Given the name of this function, it might be better to actually pass a
> > page pointer to it; do the folio_page() inside zswap_store().
> >
> > Then do
> >
> > entry->swpentry = page_swap_entry(page);
> >
> > below.
>
> That is indeed clearer.
>
> Although this will be adding yet another caller of page_swap_entry()
> that already has the folio, yet it calls page_swap_entry() for each
> page in the folio, which calls page_folio() inside.
>
> I wonder if we should add (or replace page_swap_entry()) with a
> folio_swap_entry(folio, index) helper. This can also be done as a
> follow up.
Thanks Johannes and Yosry for these comments. I was thinking about
this some more. In its current form, zswap_store_page() is called in the context
of the folio by passing in a [folio, index]. This implies a key assumption about
the existing zswap_store() large folios functionality, i.e., we do the per-page
store for the page at offset "index * PAGE_SIZE" within the folio, and not for
any arbitrary page. Further, we need the folio for folio_nid(); but this can
also be computed from the page. Another reason why I thought the existing
signature might be preferable is that it seems to obtain the entry's
swp_entry_t with fewer computes. Could calling page_swap_entry() add more
computes which, if so, could potentially add up (say, 512 times)?
I would appreciate your thoughts on whether these are valid considerations,
and can proceed accordingly.
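For reference, this is roughly what page_swap_entry() does today (paraphrased
from include/linux/swap.h as of this series; please treat it as approximate):

	static inline swp_entry_t page_swap_entry(struct page *page)
	{
		struct folio *folio = page_folio(page);
		swp_entry_t entry = folio->swap;

		entry.val += folio_page_idx(folio, page);
		return entry;
	}

so the per-page cost relative to the v8 code would essentially be the extra
page_folio() and folio_page_idx() computations.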
>
> >
> > > > obj_cgroup_put(objcg);
> > > > - if (zswap_pool_reached_full)
> > > > - queue_work(shrink_wq, &zswap_shrink_work);
> > > > -check_old:
> > > > + return false;
> > > > +}
> > > > +
> > > > +bool zswap_store(struct folio *folio)
> > > > +{
> > > > + long nr_pages = folio_nr_pages(folio);
> > > > + swp_entry_t swp = folio->swap;
> > > > + struct xarray *tree = swap_zswap_tree(swp);
> > > > + pgoff_t offset = swp_offset(swp);
> > > > + struct obj_cgroup *objcg = NULL;
> > > > + struct mem_cgroup *memcg = NULL;
> > > > + struct zswap_pool *pool;
> > > > + size_t compressed_bytes = 0;
> > >
> > > Why size_t? entry->length is int.
> >
> > In light of Willy's comment, I think size_t is a good idea.
>
> Agreed.
Thanks Yosry, Matthew and Johannes for the resolution on this!
Thanks,
Kanchana
^ permalink raw reply [flat|nested] 43+ messages in thread
* RE: [PATCH v8 6/8] mm: zswap: Support large folios in zswap_store().
2024-09-29 21:15 ` Sridhar, Kanchana P
@ 2024-09-30 17:55 ` Sridhar, Kanchana P
0 siblings, 0 replies; 43+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-30 17:55 UTC (permalink / raw)
To: Yosry Ahmed, Johannes Weiner
Cc: linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642,
shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou,
Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P
> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Sunday, September 29, 2024 2:15 PM
> To: Yosry Ahmed <yosryahmed@google.com>; Johannes Weiner
> <hannes@cmpxchg.org>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>;
> Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Subject: RE: [PATCH v8 6/8] mm: zswap: Support large folios in zswap_store().
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@google.com>
> > Sent: Saturday, September 28, 2024 11:11 AM
> > To: Johannes Weiner <hannes@cmpxchg.org>
> > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> > kernel@vger.kernel.org; linux-mm@kvack.org; nphamcs@gmail.com;
> > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org;
> > Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v8 6/8] mm: zswap: Support large folios in
> zswap_store().
> >
> > On Sat, Sep 28, 2024 at 7:15 AM Johannes Weiner <hannes@cmpxchg.org>
> > wrote:
> > >
> > > On Fri, Sep 27, 2024 at 08:42:16PM -0700, Yosry Ahmed wrote:
> > > > On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
> > > > > {
> > > > > + struct page *page = folio_page(folio, index);
> > > > > swp_entry_t swp = folio->swap;
> > > > > - pgoff_t offset = swp_offset(swp);
> > > > > struct xarray *tree = swap_zswap_tree(swp);
> > > > > + pgoff_t offset = swp_offset(swp) + index;
> > > > > struct zswap_entry *entry, *old;
> > > > > - struct obj_cgroup *objcg = NULL;
> > > > > - struct mem_cgroup *memcg = NULL;
> > > > > -
> > > > > - VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > > > > - VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > > > > + int type = swp_type(swp);
> > > >
> > > > Why do we need type? We use it when initializing entry->swpentry to
> > > > reconstruct the swp_entry_t we already have.
> > >
> > > It's not the same entry. folio->swap points to the head entry, this
> > > function has to store swap entries with the offsets of each subpage.
> >
> > Duh, yeah, thanks.
> >
> > >
> > > Given the name of this function, it might be better to actually pass a
> > > page pointer to it; do the folio_page() inside zswap_store().
> > >
> > > Then do
> > >
> > > entry->swpentry = page_swap_entry(page);
> > >
> > > below.
> >
> > That is indeed clearer.
> >
> > Although this will be adding yet another caller of page_swap_entry()
> > that already has the folio, yet it calls page_swap_entry() for each
> > page in the folio, which calls page_folio() inside.
> >
> > I wonder if we should add (or replace page_swap_entry()) with a
> > folio_swap_entry(folio, index) helper. This can also be done as a
> > follow up.
>
> Thanks Johannes and Yosry for these comments. I was thinking about
> this some more. In its current form, zswap_store_page() is called in the
> context
> of the folio by passing in a [folio, index]. This implies a key assumption about
> the existing zswap_store() large folios functionality, i.e., we do the per-page
> store for the page at offset "index * PAGE_SIZE" within the folio, and not for
> any arbitrary page. Further, we need the folio for folio_nid(); but this can
> also be computed from the page. Another reason why I thought the existing
> signature might be preferable is that it seems to obtain the entry's
> swp_entry_t with fewer computes. Could calling page_swap_entry() add more
> computes which, if so, could potentially add up (say, 512 times)?
I went ahead and quantified this with the v8 signature of zswap_store_page()
and the suggested changes for this function to take a page and use
page_swap_entry(). I ran usemem with 2M pmd-mappable folios enabled.
The results indicate that the page_swap_entry() implementation is slightly
better in throughput and latency:
v8:                              run1        run2        run3     average
--------------------------------------------------------------------------
Total throughput (KB/s):    6,483,835   6,396,760   6,349,532   6,410,042
Average throughput (KB/s):    216,127     213,225     211,651     213,889
elapsed time (sec):            107.75      107.06      109.99      108.87
sys time (sec):              2,476.43    2,453.99    2,551.52    2,513.98
--------------------------------------------------------------------------

page_swap_entry():               run1        run2        run3     average
--------------------------------------------------------------------------
Total throughput (KB/s):    6,462,954   6,396,134   6,418,076   6,425,721
Average throughput (KB/s):    215,431     213,204     213,935     214,683
elapsed time (sec):            108.67      109.46      107.91      108.29
sys time (sec):              2,473.65    2,493.33    2,507.82    2,490.74
--------------------------------------------------------------------------
Based on this, I will go ahead and implement the change suggested
by Johannes and submit a v9.
Thanks,
Kanchana
>
> I would appreciate your thoughts on whether these are valid considerations,
> and can proceed accordingly.
>
> >
> > >
> > > > > obj_cgroup_put(objcg);
> > > > > - if (zswap_pool_reached_full)
> > > > > - queue_work(shrink_wq, &zswap_shrink_work);
> > > > > -check_old:
> > > > > + return false;
> > > > > +}
> > > > > +
> > > > > +bool zswap_store(struct folio *folio)
> > > > > +{
> > > > > + long nr_pages = folio_nr_pages(folio);
> > > > > + swp_entry_t swp = folio->swap;
> > > > > + struct xarray *tree = swap_zswap_tree(swp);
> > > > > + pgoff_t offset = swp_offset(swp);
> > > > > + struct obj_cgroup *objcg = NULL;
> > > > > + struct mem_cgroup *memcg = NULL;
> > > > > + struct zswap_pool *pool;
> > > > > + size_t compressed_bytes = 0;
> > > >
> > > > Why size_t? entry->length is int.
> > >
> > > In light of Willy's comment, I think size_t is a good idea.
> >
> > Agreed.
>
> Thanks Yosry, Matthew and Johannes for the resolution on this!
>
> Thanks,
> Kanchana
^ permalink raw reply [flat|nested] 43+ messages in thread
* RE: [PATCH v8 6/8] mm: zswap: Support large folios in zswap_store().
2024-09-28 3:42 ` Yosry Ahmed
2024-09-28 14:15 ` Johannes Weiner
@ 2024-09-29 21:24 ` Sridhar, Kanchana P
1 sibling, 0 replies; 43+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-29 21:24 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Friday, September 27, 2024 8:42 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 6/8] mm: zswap: Support large folios in zswap_store().
>
> On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > zswap_store() will store large folios by compressing them page by page.
> >
> > This patch provides a sequential implementation of storing a large folio
> > in zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > Towards this goal, zswap_compress() is modified to take a page instead
> > of a folio as input.
>
> It is already modified at this point, it's not part of this patch.
>
> >
> > Each page's swap offset is stored as a separate zswap entry.
>
> I think "Each page is compressed individually and stored as a separate
> zswap entry" is clearer. We do not store the offsets.
>
> >
> > We check the cgroup zswap limit and the zpool utilization against
> > the zswap max/accept_threshold limits once, at the beginning of
> > zswap_store. We also obtain a percpu_ref_tryget() reference to the
> > current zswap_pool at the start of zswap_store to prevent it from
> > being deleted while the folio is being stored.
>
> This can be clarified further:
>
> We check the global and per-cgroup limits once at the beginning, and
> only check that the limit is not reached yet. This is racy and
> inaccurate, but it should be sufficient for now. We also obtain
> initial references to the relevant objcg and pool to guarantee that
> subsequent references can be acquired by zswap_store_page().
>
> >
> > If these one-time checks pass, we compress the sub-pages of the folio,
> > while maintaining a running count of compressed bytes for all the folio's
> > sub-pages. If all pages are successfully compressed and stored, we do the
> > cgroup zswap charging with the total compressed bytes, and batch update
> > the zswap_stored_pages atomic/zswpout event stats with folio_nr_pages()
> > once, before returning from zswap_store.
>
> Please consistently use "page" instead of "sub-page", and add "()" to
> the end of function names.
>
> >
> > The patch adds a new zswap_pool_get() function that is called in the
> > sub-page level "zswap_store_page()" function.
>
> "that is called by zswap_store_page(), which handles compressing and
> storing each page in the folio."
>
> >
> > If an error is encountered during the store of any page in the folio,
> > all pages in that folio currently stored in zswap will be invalidated.
> > Thus, a folio is either entirely stored in zswap, or entirely not stored
> > in zswap.
> >
> > This patch forms the basis for building compress batching of pages in a
> > large folio in zswap_store by compressing up to say, 8 pages of the folio
> > in parallel in hardware using the Intel In-Memory Analytics Accelerator
> > (Intel IAA).
>
> This patch also has its own merits, it batches some operations, and
> more importantly enables swapping out large folios to zswap without
> splitting them.
Thanks for the detailed commit log review comments! Sure, I will incorporate
all of these in v9. I will also modify the latency improvement calculation convention
as you've suggested in another comment on the cover letter. Will re-gather all
the "after" data once v9 is ready.
>
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> > "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> > [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-
> ryan.roberts@arm.com/T/#u
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> > mm/zswap.c | 227 ++++++++++++++++++++++++++++++++++++++----------
> -----
> > 1 file changed, 165 insertions(+), 62 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 43e4e216db41..b8395ddf7b7c 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -411,6 +411,16 @@ static int __must_check zswap_pool_tryget(struct
> zswap_pool *pool)
> > return percpu_ref_tryget(&pool->ref);
> > }
> >
> > +/*
> > + * Note: zswap_pool_get() should only be called after zswap_pool_tryget()
> > + * returns success. zswap_pool_tryget() returns success only if the "pool" is
> > + * non-NULL and the "&pool->ref" is non-0.
>
> Just use the same statement used by css_get():
>
> "The caller must already have a reference."
>
> > + */
> > +static void zswap_pool_get(struct zswap_pool *pool)
> > +{
> > + percpu_ref_get(&pool->ref);
> > +}
> > +
> > static void zswap_pool_put(struct zswap_pool *pool)
> > {
> > percpu_ref_put(&pool->ref);
> > @@ -1402,38 +1412,35 @@ static void shrink_worker(struct work_struct
> *w)
> > /*********************************
> > * main API
> > **********************************/
> > -bool zswap_store(struct folio *folio)
> > +
> > +/*
> > + * Stores the page at specified "index" in a folio.
> > + *
> > + * @folio: The folio to store in zswap.
> > + * @index: Index into the page in the folio that this function will store.
> > + * @objcg: The folio's objcg.
> > + * @pool: The zswap_pool to store the compressed data for the page.
> > + * The caller should have obtained a reference to a valid
> > + * zswap_pool by calling zswap_pool_tryget(), to pass as this
> > + * argument.
> > + * @compressed_bytes: The compressed entry->length value is added
> > + * to this, so that the caller can get the total
> > + * compressed lengths of all sub-pages in a folio.
> > + */
> > +static bool zswap_store_page(struct folio *folio, long index,
> > + struct obj_cgroup *objcg,
> > + struct zswap_pool *pool,
> > + size_t *compressed_bytes)
>
> Would it be clearer to just return the compressed size, and return -1
> upon failure?
>
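For illustration, a minimal sketch of the calling convention being suggested
here, assuming a hypothetical ssize_t return from zswap_store_page(); this is
not what the v8 patch implements:

	/* Hypothetical: zswap_store_page() returns the compressed length, or -1 on error. */
	for (index = 0; index < nr_pages; ++index) {
		ssize_t bytes = zswap_store_page(folio, index, objcg, pool);

		if (bytes < 0)
			goto put_pool;
		compressed_bytes += bytes;
	}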
> > {
> > + struct page *page = folio_page(folio, index);
> > swp_entry_t swp = folio->swap;
> > - pgoff_t offset = swp_offset(swp);
> > struct xarray *tree = swap_zswap_tree(swp);
> > + pgoff_t offset = swp_offset(swp) + index;
> > struct zswap_entry *entry, *old;
> > - struct obj_cgroup *objcg = NULL;
> > - struct mem_cgroup *memcg = NULL;
> > -
> > - VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > - VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > + int type = swp_type(swp);
>
> Why do we need type? We use it when initializing entry->swpentry to
> reconstruct the swp_entry_t we already have.
>
> >
> > - /* Large folios aren't supported */
> > - if (folio_test_large(folio))
> > - return false;
> > -
> > - if (!zswap_enabled)
> > - goto check_old;
> > -
> > - /* Check cgroup limits */
> > - objcg = get_obj_cgroup_from_folio(folio);
> > - if (objcg && !obj_cgroup_may_zswap(objcg)) {
> > - memcg = get_mem_cgroup_from_objcg(objcg);
> > - if (shrink_memcg(memcg)) {
> > - mem_cgroup_put(memcg);
> > - goto reject;
> > - }
> > - mem_cgroup_put(memcg);
> > - }
> > -
> > - if (zswap_check_limits())
> > - goto reject;
> > + if (objcg)
> > + obj_cgroup_get(objcg);
> >
> > /* allocate entry */
> > entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> > @@ -1442,24 +1449,21 @@ bool zswap_store(struct folio *folio)
> > goto reject;
> > }
> >
> > - /* if entry is successfully added, it keeps the reference */
> > - entry->pool = zswap_pool_current_get();
> > - if (!entry->pool)
> > - goto freepage;
> > + /*
> > + * We get here only after the call to zswap_pool_tryget() in the
> > + * caller, zswap_store(), has returned success. Hence it is safe
> > + * to call zswap_pool_get().
> > + *
> > + * if entry is successfully added, it keeps the reference
> > + */
> > + zswap_pool_get(pool);
>
> I would move this right above obj_cgroup_get() and have a single
> concise comment for both, e.g.:
>
> /* zswap_store() already holds a ref on 'pool' and 'objcg' */
>
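As an illustration of the grouping being suggested, one possible shape in
zswap_store_page() could be the following sketch (hypothetical; the actual v9
code may differ):

	/* zswap_store() already holds a ref on 'pool' and 'objcg' */
	zswap_pool_get(pool);
	if (objcg)
		obj_cgroup_get(objcg);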
> >
> > - if (objcg) {
> > - memcg = get_mem_cgroup_from_objcg(objcg);
> > - if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> > - mem_cgroup_put(memcg);
> > - goto put_pool;
> > - }
> > - mem_cgroup_put(memcg);
> > - }
> > + entry->pool = pool;
> >
> > - if (!zswap_compress(&folio->page, entry))
> > + if (!zswap_compress(page, entry))
> > goto put_pool;
> >
> > - entry->swpentry = swp;
> > + entry->swpentry = swp_entry(type, offset);
> > entry->objcg = objcg;
> > entry->referenced = true;
> >
> > @@ -1480,11 +1484,6 @@ bool zswap_store(struct folio *folio)
> > if (old)
> > zswap_entry_free(old);
> >
> > - if (objcg) {
> > - obj_cgroup_charge_zswap(objcg, entry->length);
> > - count_objcg_event(objcg, ZSWPOUT);
> > - }
> > -
> > /*
> > * We finish initializing the entry while it's already in xarray.
> > * This is safe because:
> > @@ -1496,36 +1495,140 @@ bool zswap_store(struct folio *folio)
> > * an incoherent entry.
> > */
> > if (entry->length) {
> > + *compressed_bytes += entry->length;
> > INIT_LIST_HEAD(&entry->lru);
> > zswap_lru_add(&zswap_list_lru, entry);
> > }
> >
> > - /* update stats */
> > - atomic_long_inc(&zswap_stored_pages);
> > - count_vm_event(ZSWPOUT);
> > -
> > return true;
> >
> > store_failed:
> > zpool_free(entry->pool->zpool, entry->handle);
> > put_pool:
> > - zswap_pool_put(entry->pool);
> > -freepage:
> > + zswap_pool_put(pool);
> > zswap_entry_cache_free(entry);
> > reject:
>
> Please rename this to put_objcg, we are no longer "rejecting" here.
I have reworked getting the refs on pool and objcg within zswap_store_page(),
as per your suggestion. With these modifications, the "reject" label is only for
the case where we return "false" from zswap_store_page(), so it seems more
appropriate? Anyway, please take another look once v9 is posted, and I can make any
necessary changes.
>
> > obj_cgroup_put(objcg);
> > - if (zswap_pool_reached_full)
> > - queue_work(shrink_wq, &zswap_shrink_work);
> > -check_old:
> > + return false;
> > +}
> > +
> > +bool zswap_store(struct folio *folio)
> > +{
> > + long nr_pages = folio_nr_pages(folio);
> > + swp_entry_t swp = folio->swap;
> > + struct xarray *tree = swap_zswap_tree(swp);
> > + pgoff_t offset = swp_offset(swp);
> > + struct obj_cgroup *objcg = NULL;
> > + struct mem_cgroup *memcg = NULL;
> > + struct zswap_pool *pool;
> > + size_t compressed_bytes = 0;
>
> Why size_t? entry->length is int.
>
> > + bool ret = false;
> > + long index;
> > +
> > + VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > + VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > +
> > + if (!zswap_enabled)
> > + goto reject;
> > +
> > /*
> > - * If the zswap store fails or zswap is disabled, we must invalidate the
> > - * possibly stale entry which was previously stored at this offset.
> > - * Otherwise, writeback could overwrite the new data in the swapfile.
> > + * Check cgroup zswap limits:
> > + *
> > + * The cgroup zswap limit check is done once at the beginning of
> > + * zswap_store(). The cgroup charging is done once, at the end
> > + * of a successful folio store. What this means is, if the cgroup
> > + * was within the zswap_max limit at the beginning of a large folio
> > + * store, it could go over the limit by at most (HPAGE_PMD_NR - 1)
> > + * pages.
>
> Add ".. due to this store", as it can be much more over the limit when
> stores are racing. Also, this comment can be slightly generalized and
> the comment above zswap_check_limits() omitted.
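(For concreteness: with 4K base pages, HPAGE_PMD_NR is 512, so a single 2M
folio store that passes the initial check could go over the cgroup zswap limit
by up to 511 pages' worth of compressed data before the batched charge at the
end of zswap_store(); racing stores can push it further, as noted above.)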
>
> > */
> > - entry = xa_erase(tree, offset);
> > - if (entry)
> > - zswap_entry_free(entry);
> > - return false;
> > + objcg = get_obj_cgroup_from_folio(folio);
> > + if (objcg && !obj_cgroup_may_zswap(objcg)) {
> > + memcg = get_mem_cgroup_from_objcg(objcg);
> > + if (shrink_memcg(memcg)) {
> > + mem_cgroup_put(memcg);
> > + goto put_objcg;
> > + }
> > + mem_cgroup_put(memcg);
> > + }
> > +
> > + /*
> > + * Check zpool utilization against zswap limits:
> > + *
> > + * The zswap zpool utilization is also checked against the limits
> > + * just once, at the start of zswap_store(). If the check passes,
> > + * any breaches of the limits set by zswap_max_pages() or
> > + * zswap_accept_thr_pages() that may happen while storing this
> > + * folio, will only be detected during the next call to
> > + * zswap_store() by any process.
> > + */
> > + if (zswap_check_limits())
> > + goto put_objcg;
> > +
> > + pool = zswap_pool_current_get();
> > + if (!pool)
> > + goto put_objcg;
> > +
> > + if (objcg) {
> > + memcg = get_mem_cgroup_from_objcg(objcg);
> > + if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> > + mem_cgroup_put(memcg);
> > + goto put_pool;
> > + }
> > + mem_cgroup_put(memcg);
> > + }
> > +
> > + /*
> > + * Store each page of the folio as a separate entry. If we fail to
> > + * store a page, unwind by deleting all the pages for this folio
> > + * currently in zswap.
> > + */
> > + for (index = 0; index < nr_pages; ++index) {
> > + if (!zswap_store_page(folio, index, objcg, pool, &compressed_bytes))
> > + goto put_pool;
> > + }
> > +
> > + /*
> > + * All pages in the folio have been successfully stored.
> > + * Batch update the cgroup zswap charging, zswap_stored_page atomic,
> > + * and ZSWPOUT event stats.
> > + */
>
> This comment seems unnecessary.
>
> > + if (objcg) {
> > + obj_cgroup_charge_zswap(objcg, compressed_bytes);
> > + count_objcg_events(objcg, ZSWPOUT, nr_pages);
> > + }
> > +
> > + /* update stats */
>
> Please delete this comment too.
>
> > + atomic_long_add(nr_pages, &zswap_stored_pages);
> > + count_vm_events(ZSWPOUT, nr_pages);
> > +
> > + ret = true;
> > +
> > +put_pool:
> > + zswap_pool_put(pool);
> > +put_objcg:
> > + obj_cgroup_put(objcg);
> > +reject:
> > + /*
> > + * If the zswap store fails or zswap is disabled, we must invalidate
> > + * the possibly stale entries which were previously stored at the
> > + * offsets corresponding to each page of the folio. Otherwise,
> > + * writeback could overwrite the new data in the swapfile.
> > + */
> > + if (!ret) {
> > + struct zswap_entry *entry;
> > + long i;
> > +
> > + for (i = 0; i < nr_pages; ++i) {
>
> Just reuse 'index' here.
>
> > + entry = xa_erase(tree, offset + i);
> > + if (entry)
> > + zswap_entry_free(entry);
>
> I think we need a comment in zswap_store_page() that we shouldn't have
> any possibility of failure after the entry is added in the tree.
> Otherwise we may duplicate the cleanup work here (e.g. putting refs).
> This is already the case in zswap_store(), but it's now more subtle
> and easier to break when the code is split and we have two cleanup
> paths.
I have incorporated all these comments, thanks!
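A possible wording for the requested comment, as a sketch only (the actual v9
text may differ), placed in zswap_store_page() right before the successful
return:

	/*
	 * The entry is now in the xarray; nothing after this point may fail.
	 * Otherwise zswap_store()'s unwind loop would xa_erase() and free the
	 * entry again, duplicating the cleanup (e.g. putting refs twice).
	 */
	return true;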
>
> > + }
> > +
> > + if (zswap_pool_reached_full)
> > + queue_work(shrink_wq, &zswap_shrink_work);
>
> We are now doing this check even when zswap is disabled. I think this
> is a problem.
>
> If zswap gets disabled while 'zswap_pool_reached_full' is true, we
> will queue 'zswap_shrink_work' every time zswap_store() is called in
> the swapout path, but 'zswap_pool_reached_full' will never become
> false again become zswap_check_limits() will never be called.
>
> I think we need to maintain two separate "reject" and "check_old"
> labels, and avoid this check when zswap is disabled.
Thanks Yosry, great catch! I have fixed this in the v9 that I am working on.
Thanks again,
Kanchana
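For illustration, a minimal sketch of the label split being discussed, so that
the zswap_pool_reached_full check is skipped when zswap is disabled
(hypothetical; the actual v9 code may differ):

	if (!zswap_enabled)
		goto check_old;
	...
put_pool:
	zswap_pool_put(pool);
put_objcg:
	obj_cgroup_put(objcg);
	if (!ret && zswap_pool_reached_full)
		queue_work(shrink_wq, &zswap_shrink_work);
check_old:
	if (!ret) {
		/* Invalidate possibly stale entries at this folio's offsets. */
		for (index = 0; index < nr_pages; ++index) {
			struct zswap_entry *entry = xa_erase(tree, offset + index);

			if (entry)
				zswap_entry_free(entry);
		}
	}
	return ret;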
>
> > + }
> > +
> > + return ret;
> > }
> >
> > bool zswap_load(struct folio *folio)
> > --
> > 2.27.0
> >
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v8 6/8] mm: zswap: Support large folios in zswap_store().
2024-09-28 2:16 ` [PATCH v8 6/8] mm: zswap: Support large folios in zswap_store() Kanchana P Sridhar
2024-09-28 3:42 ` Yosry Ahmed
@ 2024-09-28 6:05 ` Chengming Zhou
1 sibling, 0 replies; 43+ messages in thread
From: Chengming Zhou @ 2024-09-28 6:05 UTC (permalink / raw)
To: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, yosryahmed,
nphamcs, usamaarif642, shakeel.butt, ryan.roberts, ying.huang,
21cnbao, akpm
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal
On 2024/9/28 10:16, Kanchana P Sridhar wrote:
> zswap_store() will store large folios by compressing them page by page.
>
> This patch provides a sequential implementation of storing a large folio
> in zswap_store() by iterating through each page in the folio to compress
> and store it in the zswap zpool.
>
> Towards this goal, zswap_compress() is modified to take a page instead
> of a folio as input.
>
> Each page's swap offset is stored as a separate zswap entry.
>
> We check the cgroup zswap limit and the zpool utilization against
> the zswap max/accept_threshold limits once, at the beginning of
> zswap_store. We also obtain a percpu_ref_tryget() reference to the
> current zswap_pool at the start of zswap_store to prevent it from
> being deleted while the folio is being stored.
>
> If these one-time checks pass, we compress the sub-pages of the folio,
> while maintaining a running count of compressed bytes for all the folio's
> sub-pages. If all pages are successfully compressed and stored, we do the
> cgroup zswap charging with the total compressed bytes, and batch update
> the zswap_stored_pages atomic/zswpout event stats with folio_nr_pages()
> once, before returning from zswap_store.
>
> The patch adds a new zswap_pool_get() function that is called in the
> sub-page level "zswap_store_page()" function.
>
> If an error is encountered during the store of any page in the folio,
> all pages in that folio currently stored in zswap will be invalidated.
> Thus, a folio is either entirely stored in zswap, or entirely not stored
> in zswap.
>
> This patch forms the basis for building compress batching of pages in a
> large folio in zswap_store by compressing up to say, 8 pages of the folio
> in parallel in hardware using the Intel In-Memory Analytics Accelerator
> (Intel IAA).
>
> This change reuses and adapts the functionality in Ryan Roberts' RFC
> patch [1]:
>
> "[RFC,v1] mm: zswap: Store large folios without splitting"
>
> [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Also, addressed some of the RFC comments from the discussion in [1].
>
> Co-developed-by: Ryan Roberts
> Signed-off-by:
This doesn't seem right.
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
> mm/zswap.c | 227 ++++++++++++++++++++++++++++++++++++++---------------
> 1 file changed, 165 insertions(+), 62 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 43e4e216db41..b8395ddf7b7c 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -411,6 +411,16 @@ static int __must_check zswap_pool_tryget(struct zswap_pool *pool)
> return percpu_ref_tryget(&pool->ref);
> }
>
> +/*
> + * Note: zswap_pool_get() should only be called after zswap_pool_tryget()
> + * returns success. zswap_pool_tryget() returns success only if the "pool" is
> + * non-NULL and the "&pool->ref" is non-0.
> + */
> +static void zswap_pool_get(struct zswap_pool *pool)
> +{
> + percpu_ref_get(&pool->ref);
> +}
percpu_ref_tryget_many() could be considered? But this looks ok too.
Thanks.
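A rough sketch of the alternative mentioned here, assuming zswap_store() took
all of the folio's per-page pool references in one call (hypothetical; the v8
patch keeps the simpler per-page zswap_pool_get()):

	/* Hypothetical: batch the per-page pool refs for the whole folio. */
	if (!percpu_ref_tryget_many(&pool->ref, nr_pages))
		goto put_pool;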
> +
> static void zswap_pool_put(struct zswap_pool *pool)
> {
> percpu_ref_put(&pool->ref);
> @@ -1402,38 +1412,35 @@ static void shrink_worker(struct work_struct *w)
> /*********************************
> * main API
> **********************************/
> -bool zswap_store(struct folio *folio)
> +
> +/*
> + * Stores the page at specified "index" in a folio.
> + *
> + * @folio: The folio to store in zswap.
> + * @index: Index into the page in the folio that this function will store.
> + * @objcg: The folio's objcg.
> + * @pool: The zswap_pool to store the compressed data for the page.
> + * The caller should have obtained a reference to a valid
> + * zswap_pool by calling zswap_pool_tryget(), to pass as this
> + * argument.
> + * @compressed_bytes: The compressed entry->length value is added
> + * to this, so that the caller can get the total
> + * compressed lengths of all sub-pages in a folio.
> + */
> +static bool zswap_store_page(struct folio *folio, long index,
> + struct obj_cgroup *objcg,
> + struct zswap_pool *pool,
> + size_t *compressed_bytes)
> {
> + struct page *page = folio_page(folio, index);
> swp_entry_t swp = folio->swap;
> - pgoff_t offset = swp_offset(swp);
> struct xarray *tree = swap_zswap_tree(swp);
> + pgoff_t offset = swp_offset(swp) + index;
> struct zswap_entry *entry, *old;
> - struct obj_cgroup *objcg = NULL;
> - struct mem_cgroup *memcg = NULL;
> -
> - VM_WARN_ON_ONCE(!folio_test_locked(folio));
> - VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> + int type = swp_type(swp);
>
> - /* Large folios aren't supported */
> - if (folio_test_large(folio))
> - return false;
> -
> - if (!zswap_enabled)
> - goto check_old;
> -
> - /* Check cgroup limits */
> - objcg = get_obj_cgroup_from_folio(folio);
> - if (objcg && !obj_cgroup_may_zswap(objcg)) {
> - memcg = get_mem_cgroup_from_objcg(objcg);
> - if (shrink_memcg(memcg)) {
> - mem_cgroup_put(memcg);
> - goto reject;
> - }
> - mem_cgroup_put(memcg);
> - }
> -
> - if (zswap_check_limits())
> - goto reject;
> + if (objcg)
> + obj_cgroup_get(objcg);
>
> /* allocate entry */
> entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> @@ -1442,24 +1449,21 @@ bool zswap_store(struct folio *folio)
> goto reject;
> }
>
> - /* if entry is successfully added, it keeps the reference */
> - entry->pool = zswap_pool_current_get();
> - if (!entry->pool)
> - goto freepage;
> + /*
> + * We get here only after the call to zswap_pool_tryget() in the
> + * caller, zswap_store(), has returned success. Hence it is safe
> + * to call zswap_pool_get().
> + *
> + * if entry is successfully added, it keeps the reference
> + */
> + zswap_pool_get(pool);
>
> - if (objcg) {
> - memcg = get_mem_cgroup_from_objcg(objcg);
> - if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> - mem_cgroup_put(memcg);
> - goto put_pool;
> - }
> - mem_cgroup_put(memcg);
> - }
> + entry->pool = pool;
>
> - if (!zswap_compress(&folio->page, entry))
> + if (!zswap_compress(page, entry))
> goto put_pool;
>
> - entry->swpentry = swp;
> + entry->swpentry = swp_entry(type, offset);
> entry->objcg = objcg;
> entry->referenced = true;
>
> @@ -1480,11 +1484,6 @@ bool zswap_store(struct folio *folio)
> if (old)
> zswap_entry_free(old);
>
> - if (objcg) {
> - obj_cgroup_charge_zswap(objcg, entry->length);
> - count_objcg_event(objcg, ZSWPOUT);
> - }
> -
> /*
> * We finish initializing the entry while it's already in xarray.
> * This is safe because:
> @@ -1496,36 +1495,140 @@ bool zswap_store(struct folio *folio)
> * an incoherent entry.
> */
> if (entry->length) {
> + *compressed_bytes += entry->length;
> INIT_LIST_HEAD(&entry->lru);
> zswap_lru_add(&zswap_list_lru, entry);
> }
>
> - /* update stats */
> - atomic_long_inc(&zswap_stored_pages);
> - count_vm_event(ZSWPOUT);
> -
> return true;
>
> store_failed:
> zpool_free(entry->pool->zpool, entry->handle);
> put_pool:
> - zswap_pool_put(entry->pool);
> -freepage:
> + zswap_pool_put(pool);
> zswap_entry_cache_free(entry);
> reject:
> obj_cgroup_put(objcg);
> - if (zswap_pool_reached_full)
> - queue_work(shrink_wq, &zswap_shrink_work);
> -check_old:
> + return false;
> +}
> +
> +bool zswap_store(struct folio *folio)
> +{
> + long nr_pages = folio_nr_pages(folio);
> + swp_entry_t swp = folio->swap;
> + struct xarray *tree = swap_zswap_tree(swp);
> + pgoff_t offset = swp_offset(swp);
> + struct obj_cgroup *objcg = NULL;
> + struct mem_cgroup *memcg = NULL;
> + struct zswap_pool *pool;
> + size_t compressed_bytes = 0;
> + bool ret = false;
> + long index;
> +
> + VM_WARN_ON_ONCE(!folio_test_locked(folio));
> + VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> +
> + if (!zswap_enabled)
> + goto reject;
> +
> /*
> - * If the zswap store fails or zswap is disabled, we must invalidate the
> - * possibly stale entry which was previously stored at this offset.
> - * Otherwise, writeback could overwrite the new data in the swapfile.
> + * Check cgroup zswap limits:
> + *
> + * The cgroup zswap limit check is done once at the beginning of
> + * zswap_store(). The cgroup charging is done once, at the end
> + * of a successful folio store. What this means is, if the cgroup
> + * was within the zswap_max limit at the beginning of a large folio
> + * store, it could go over the limit by at most (HPAGE_PMD_NR - 1)
> + * pages.
> */
> - entry = xa_erase(tree, offset);
> - if (entry)
> - zswap_entry_free(entry);
> - return false;
> + objcg = get_obj_cgroup_from_folio(folio);
> + if (objcg && !obj_cgroup_may_zswap(objcg)) {
> + memcg = get_mem_cgroup_from_objcg(objcg);
> + if (shrink_memcg(memcg)) {
> + mem_cgroup_put(memcg);
> + goto put_objcg;
> + }
> + mem_cgroup_put(memcg);
> + }
> +
> + /*
> + * Check zpool utilization against zswap limits:
> + *
> + * The zswap zpool utilization is also checked against the limits
> + * just once, at the start of zswap_store(). If the check passes,
> + * any breaches of the limits set by zswap_max_pages() or
> + * zswap_accept_thr_pages() that may happen while storing this
> + * folio, will only be detected during the next call to
> + * zswap_store() by any process.
> + */
> + if (zswap_check_limits())
> + goto put_objcg;
> +
> + pool = zswap_pool_current_get();
> + if (!pool)
> + goto put_objcg;
> +
> + if (objcg) {
> + memcg = get_mem_cgroup_from_objcg(objcg);
> + if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> + mem_cgroup_put(memcg);
> + goto put_pool;
> + }
> + mem_cgroup_put(memcg);
> + }
> +
> + /*
> + * Store each page of the folio as a separate entry. If we fail to
> + * store a page, unwind by deleting all the pages for this folio
> + * currently in zswap.
> + */
> + for (index = 0; index < nr_pages; ++index) {
> + if (!zswap_store_page(folio, index, objcg, pool, &compressed_bytes))
> + goto put_pool;
> + }
> +
> + /*
> + * All pages in the folio have been successfully stored.
> + * Batch update the cgroup zswap charging, zswap_stored_page atomic,
> + * and ZSWPOUT event stats.
> + */
> + if (objcg) {
> + obj_cgroup_charge_zswap(objcg, compressed_bytes);
> + count_objcg_events(objcg, ZSWPOUT, nr_pages);
> + }
> +
> + /* update stats */
> + atomic_long_add(nr_pages, &zswap_stored_pages);
> + count_vm_events(ZSWPOUT, nr_pages);
> +
> + ret = true;
> +
> +put_pool:
> + zswap_pool_put(pool);
> +put_objcg:
> + obj_cgroup_put(objcg);
> +reject:
> + /*
> + * If the zswap store fails or zswap is disabled, we must invalidate
> + * the possibly stale entries which were previously stored at the
> + * offsets corresponding to each page of the folio. Otherwise,
> + * writeback could overwrite the new data in the swapfile.
> + */
> + if (!ret) {
> + struct zswap_entry *entry;
> + long i;
> +
> + for (i = 0; i < nr_pages; ++i) {
> + entry = xa_erase(tree, offset + i);
> + if (entry)
> + zswap_entry_free(entry);
> + }
> +
> + if (zswap_pool_reached_full)
> + queue_work(shrink_wq, &zswap_shrink_work);
> + }
> +
> + return ret;
> }
>
> bool zswap_load(struct folio *folio)
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH v8 7/8] mm: swap: Count successful large folio zswap stores in hugepage zswpout stats.
2024-09-28 2:16 [PATCH v8 0/8] mm: zswap swap-out of large folios Kanchana P Sridhar
` (5 preceding siblings ...)
2024-09-28 2:16 ` [PATCH v8 6/8] mm: zswap: Support large folios in zswap_store() Kanchana P Sridhar
@ 2024-09-28 2:16 ` Kanchana P Sridhar
2024-09-28 2:16 ` [PATCH v8 8/8] mm: Document the newly added sysfs large folios " Kanchana P Sridhar
2024-09-28 2:25 ` [PATCH v8 0/8] mm: zswap swap-out of large folios Yosry Ahmed
8 siblings, 0 replies; 43+ messages in thread
From: Kanchana P Sridhar @ 2024-09-28 2:16 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
Added a new MTHP_STAT_ZSWPOUT entry to the sysfs transparent_hugepage
stats so that successful large folio zswap stores can be accounted under
the per-order sysfs "zswpout" stats:
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
Other non-zswap swap device swap-out events will be counted under
the existing sysfs "swpout" stats:
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
include/linux/huge_mm.h | 1 +
mm/huge_memory.c | 3 +++
mm/page_io.c | 1 +
3 files changed, 5 insertions(+)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5eb4b0376c7d..3eca60f3d512 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -119,6 +119,7 @@ enum mthp_stat_item {
MTHP_STAT_ANON_FAULT_ALLOC,
MTHP_STAT_ANON_FAULT_FALLBACK,
MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
+ MTHP_STAT_ZSWPOUT,
MTHP_STAT_SWPOUT,
MTHP_STAT_SWPOUT_FALLBACK,
MTHP_STAT_SHMEM_ALLOC,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 13bf59b84075..f596f57a3a90 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -611,6 +611,7 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT);
DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK);
#ifdef CONFIG_SHMEM
@@ -629,6 +630,7 @@ static struct attribute *anon_stats_attrs[] = {
&anon_fault_fallback_attr.attr,
&anon_fault_fallback_charge_attr.attr,
#ifndef CONFIG_SHMEM
+ &zswpout_attr.attr,
&swpout_attr.attr,
&swpout_fallback_attr.attr,
#endif
@@ -659,6 +661,7 @@ static struct attribute_group file_stats_attr_grp = {
static struct attribute *any_stats_attrs[] = {
#ifdef CONFIG_SHMEM
+ &zswpout_attr.attr,
&swpout_attr.attr,
&swpout_fallback_attr.attr,
#endif
diff --git a/mm/page_io.c b/mm/page_io.c
index bc1183299a7d..4aa34862676f 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -269,6 +269,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
swap_zeromap_folio_clear(folio);
}
if (zswap_store(folio)) {
+ count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
folio_unlock(folio);
return 0;
}
--
2.27.0
^ permalink raw reply [flat|nested] 43+ messages in thread
* [PATCH v8 8/8] mm: Document the newly added sysfs large folios zswpout stats.
2024-09-28 2:16 [PATCH v8 0/8] mm: zswap swap-out of large folios Kanchana P Sridhar
` (6 preceding siblings ...)
2024-09-28 2:16 ` [PATCH v8 7/8] mm: swap: Count successful large folio zswap stores in hugepage zswpout stats Kanchana P Sridhar
@ 2024-09-28 2:16 ` Kanchana P Sridhar
2024-09-29 22:34 ` Nhat Pham
2024-09-28 2:25 ` [PATCH v8 0/8] mm: zswap swap-out of large folios Yosry Ahmed
8 siblings, 1 reply; 43+ messages in thread
From: Kanchana P Sridhar @ 2024-09-28 2:16 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
Added documentation for the newly added sysfs per-order hugepage
"zswpout" stats.
Clarified that only non-zswap swapouts will be accounted in the existing
"swpout" stats.
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
Documentation/admin-guide/mm/transhuge.rst | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index cfdd16a52e39..2a171ed5206e 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -530,10 +530,14 @@ anon_fault_fallback_charge
instead falls back to using huge pages with lower orders or
small pages even though the allocation was successful.
-swpout
- is incremented every time a huge page is swapped out in one
+zswpout
+ is incremented every time a huge page is swapped out to zswap in one
piece without splitting.
+swpout
+ is incremented every time a huge page is swapped out to a non-zswap
+ swap device in one piece without splitting.
+
swpout_fallback
is incremented if a huge page has to be split before swapout.
Usually because failed to allocate some continuous swap space
--
2.27.0
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v8 8/8] mm: Document the newly added sysfs large folios zswpout stats.
2024-09-28 2:16 ` [PATCH v8 8/8] mm: Document the newly added sysfs large folios " Kanchana P Sridhar
@ 2024-09-29 22:34 ` Nhat Pham
2024-09-30 0:56 ` Sridhar, Kanchana P
0 siblings, 1 reply; 43+ messages in thread
From: Nhat Pham @ 2024-09-29 22:34 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal
On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Added documentation for the newly added sysfs per-order hugepage
> "zswpout" stats.
>
> Clarified that only non-zswap swapouts will be accounted in the existing
> "swpout" stats.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
LGTM.
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
I think you can squash this into the last commit (i.e. adding the new
stats and the documentation for those stats at the same time). But no
strong opinions :)
^ permalink raw reply [flat|nested] 43+ messages in thread
* RE: [PATCH v8 8/8] mm: Document the newly added sysfs large folios zswpout stats.
2024-09-29 22:34 ` Nhat Pham
@ 2024-09-30 0:56 ` Sridhar, Kanchana P
0 siblings, 0 replies; 43+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-30 0:56 UTC (permalink / raw)
To: Nhat Pham
Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
Kanchana P
> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Sunday, September 29, 2024 3:35 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 8/8] mm: Document the newly added sysfs large folios
> zswpout stats.
>
> On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Added documentation for the newly added sysfs per-order hugepage
> > "zswpout" stats.
> >
> > Clarified that only non-zswap swapouts will be accounted in the existing
> > "swpout" stats.
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
>
> LGTM.
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
>
> I think you can squash this to the last commit (i.e adding the new
> stats and the documentation for that stats at the same time). But no
> strong opinions :)
Sure, this sounds good, Nhat!
Thanks,
Kanchana
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v8 0/8] mm: zswap swap-out of large folios
2024-09-28 2:16 [PATCH v8 0/8] mm: zswap swap-out of large folios Kanchana P Sridhar
` (7 preceding siblings ...)
2024-09-28 2:16 ` [PATCH v8 8/8] mm: Document the newly added sysfs large folios " Kanchana P Sridhar
@ 2024-09-28 2:25 ` Yosry Ahmed
2024-09-28 2:36 ` Sridhar, Kanchana P
8 siblings, 1 reply; 43+ messages in thread
From: Yosry Ahmed @ 2024-09-28 2:25 UTC (permalink / raw)
To: Kanchana P Sridhar, Ryan Roberts
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ying.huang, 21cnbao, akpm,
nanhai.zou, wajdi.k.feghali, vinodh.gopal
On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Hi All,
>
> This patch-series enables zswap_store() to accept and store large
> folios. The most significant contribution in this series is from the
> earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> migrated to mm-unstable as of 9-27-2024 in patch 6 of this series, and
> adapted based on code review comments received for v7 of the current
> patch-series.
>
> [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> The first few patches do the prep work for supporting large folios in
> zswap_store. Patch 6 provides the main functionality to swap-out large
> folios in zswap. Patch 7 adds sysfs per-order hugepages "zswpout" counters
> that get incremented upon successful zswap_store of large folios:
>
> /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
>
> Patch 8 updates the documentation for the new sysfs "zswpout" counters.
>
> This patch-series is a pre-requisite for zswap compress batching of large
> folio swap-out and decompress batching of swap-ins based on
> swapin_readahead(), using Intel IAA hardware acceleration, which we would
> like to submit in subsequent patch-series, with performance improvement
> data.
>
> Thanks to Ying Huang for pre-posting review feedback and suggestions!
>
> Thanks also to Nhat, Yosry, Johannes, Barry, Chengming, Usama and Ying for
> their helpful feedback, data reviews and suggestions!
>
> Co-development signoff request:
> ===============================
> I would like to thank Ryan Roberts for his original RFC [1] and request
> his co-developer signoff on patch 6 in this series. Thanks Ryan!
Ryan, could you help Kanchana out with a Signed-off-by please :)
>
>
>
> System setup for testing:
> =========================
> Testing of this patch-series was done with mm-unstable as of 9-27-2024,
> commit de2fbaa6d9c3576ec7133ed02a370ec9376bf000. Data was gathered
> without/with this patch-series, on an Intel Sapphire Rapids server,
> dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB RAM and
> 525G SSD disk partition swap. Core frequency was fixed at 2500MHz.
>
> The vm-scalability "usemem" test was run in a cgroup whose memory.high
> was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
> processes were run, each allocating and writing 10G of memory, and sleeping
> for 10 sec before exiting:
>
> usemem --init-time -w -O -s 10 -n 30 10g
>
> Other kernel configuration parameters:
>
> zswap compressors : zstd, deflate-iaa
> zswap allocator : zsmalloc
> vm.page-cluster : 2
>
> In the experiments where "deflate-iaa" is used as the zswap compressor,
> IAA "compression verification" is enabled by default
> (cat /sys/bus/dsa/drivers/crypto/verify_compress). Hence each IAA
> compression will be decompressed internally by the "iaa_crypto" driver, the
> crc-s returned by the hardware will be compared and errors reported in case
> of mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> compared to the software compressors, and the experimental data listed
> below is with verify_compress set to "1".
>
> Total and average throughput are derived from the individual 30 processes'
> throughputs reported by usemem. elapsed/sys times are measured with perf.
>
> The vm stats and sysfs hugepages stats included with the performance data
> provide details on the swapout activity to zswap/swap device.
>
>
> Testing labels used in data summaries:
> ======================================
> The data refers to these test configurations and the before/after
> comparisons that they do:
>
> before-case1:
> -------------
> mm-unstable 9-27-2024, CONFIG_THP_SWAP=N (compares zswap 4K vs. zswap 64K)
>
> In this scenario, CONFIG_THP_SWAP=N results in 64K/2M folios being split
> into 4K folios that get processed by zswap.
>
> before-case2:
> -------------
> mm-unstable 9-27-2024, CONFIG_THP_SWAP=Y (compares SSD swap large folios vs. zswap large folios)
>
> In this scenario, CONFIG_THP_SWAP=Y results in zswap rejecting large
> folios, which will then be stored by the SSD swap device.
>
> after:
> ------
> v8 of this patch-series, CONFIG_THP_SWAP=Y
>
> The "after" is CONFIG_THP_SWAP=Y and v8 of this patch-series, that results
> in 64K/2M folios to not be split, and to be processed by zswap_store.
>
>
> Regression Testing:
> ===================
> I ran vm-scalability usemem without large folios, i.e., only 4K folios with
> mm-unstable and this patch-series. The main goal was to make sure that
> there is no functional or performance regression wrt the earlier zswap
> behavior for 4K folios, now that 4K folios will be processed by the new
> zswap_store() code.
>
> The data indicates there is no significant regression.
>
> -------------------------------------------------------------------------------
> 4K folios:
> ==========
>
> zswap compressor zstd zstd zstd zstd v8 zstd v8
> before-case1 before-case2 after vs. vs.
> case1 case2
> -------------------------------------------------------------------------------
> Total throughput (KB/s) 4,793,363 4,880,978 4,813,151 0% -1%
> Average throughput (KB/s) 159,778 162,699 160,438 0% -1%
> elapsed time (sec) 130.14 123.17 127.21 2% -3%
> sys time (sec) 3,135.53 2,985.64 3,110.53 1% -4%
>
> memcg_high 446,826 444,626 448,231
> memcg_swap_fail 0 0 0
> pswpout 0 0 0
> pswpin 0 0 0
> zswpout 48,932,107 48,931,971 48,931,584
> zswpin 383 386 388
> thp_swpout 0 0 0
> thp_swpout_fallback 0 0 0
> 64kB-mthp_swpout_fallback 0 0 0
> pgmajfault 3,063 3,077 3,082
> swap_ra 93 94 93
> swap_ra_hit 47 47 47
> ZSWPOUT-64kB n/a n/a 0
> SWPOUT-64kB 0 0 0
> -------------------------------------------------------------------------------
>
>
> Performance Testing:
> ====================
>
> We list the data for 64K folios with before/after data per-compressor,
> followed by the same for 2M pmd-mappable folios.
>
>
> -------------------------------------------------------------------------------
> 64K folios: zstd:
> =================
>
> zswap compressor zstd zstd zstd zstd v8
> before-case1 before-case2 after vs. vs.
> case1 case2
> -------------------------------------------------------------------------------
> Total throughput (KB/s) 5,222,213 1,076,611 6,227,367 19% 478%
> Average throughput (KB/s) 174,073 35,887 207,578 19% 478%
> elapsed time (sec) 120.50 347.16 109.21 9% 69%
The diff here is supposed to be negative, right?
(Same for the below results)
Otherwise the results are looking really good, we have come a long way
since the first version :)
Thanks for working on this! I will look at individual patches later
today or early next week.
>
> sys time (sec) 2,930.33 248.16 2,609.22 11% -951%
> memcg_high 416,773 552,200 482,703
> memcg_swap_fail 3,192,906 1,293 944
> pswpout 0 40,778,448 0
> pswpin 0 16 0
> zswpout 48,931,583 20,903 48,931,271
> zswpin 384 363 392
> thp_swpout 0 0 0
> thp_swpout_fallback 0 0 0
> 64kB-mthp_swpout_fallback 3,192,906 1,293 944
> pgmajfault 3,452 3,072 3,095
> swap_ra 90 87 100
> swap_ra_hit 42 43 56
> ZSWPOUT-64kB n/a n/a 3,057,260
> SWPOUT-64kB 0 2,548,653 0
> -------------------------------------------------------------------------------
>
>
> -------------------------------------------------------------------------------
> 64K folios: deflate-iaa:
> ========================
>
> zswap compressor deflate-iaa deflate-iaa deflate-iaa deflate-iaa v8
> before-case1 before-case2 after vs. vs.
> case1 case2
> -------------------------------------------------------------------------------
> Total throughput (KB/s) 5,652,608 1,089,180 6,315,000 12% 480%
> Average throughput (KB/s) 188,420 36,306 210,500 12% 480%
> elapsed time (sec) 102.90 343.35 91.11 11% 73%
>
>
> sys time (sec) 2,246.86 213.53 1,939.31 14% -808%
> memcg_high 576,104 502,907 612,505
> memcg_swap_fail 4,016,117 1,407 1,660
> pswpout 0 40,862,080 0
> pswpin 0 20 0
> zswpout 61,163,423 22,444 57,317,607
> zswpin 401 368 449
> thp_swpout 0 0 0
> thp_swpout_fallback 0 0 0
> 64kB-mthp_swpout_fallback 4,016,117 1,407 1,660
> pgmajfault 3,063 3,153 3,167
> swap_ra 96 93 149
> swap_ra_hit 46 45 89
> ZSWPOUT-64kB n/a n/a 3,580,673
> SWPOUT-64kB 0 2,553,880 0
> -------------------------------------------------------------------------------
>
>
> -------------------------------------------------------------------------------
> 2M folios: zstd:
> ================
>
> zswap compressor zstd zstd zstd zstd v8
> before-case1 before-case2 after vs. vs.
> case1 case2
> -------------------------------------------------------------------------------
> Total throughput (KB/s) 5,895,500 1,109,694 6,460,111 10% 482%
> Average throughput (KB/s) 196,516 36,989 215,337 10% 482%
> elapsed time (sec) 108.77 334.28 105.92 3% 68%
>
>
> sys time (sec) 2,657.14 94.88 2,436.24 8% -2468%
> memcg_high 64,200 66,316 60,300
> memcg_swap_fail 101,182 70 30
> pswpout 0 40,166,400 0
> pswpin 0 0 0
> zswpout 48,931,499 36,507 48,869,236
> zswpin 380 379 397
> thp_swpout 0 78,450 0
> thp_swpout_fallback 101,182 70 30
> 2MB-mthp_swpout_fallback 0 0 0
> pgmajfault 3,067 3,417 4,765
> swap_ra 91 90 5,073
> swap_ra_hit 45 45 5,024
> ZSWPOUT-2MB n/a n/a 95,408
> SWPOUT-2MB 0 78,450 0
> -------------------------------------------------------------------------------
>
>
> -------------------------------------------------------------------------------
> 2M folios: deflate-iaa:
> =======================
>
> zswap compressor deflate-iaa deflate-iaa deflate-iaa deflate-iaa v8
> before-case1 before-case2 after vs. vs.
> case1 case2
> -------------------------------------------------------------------------------
> Total throughput (KB/s) 6,286,587 1,126,785 7,569,560 20% 572%
> Average throughput (KB/s) 209,552 37,559 252,318 20% 572%
> elapsed time (sec) 96.19 333.03 81.96 15% 75%
>
> sys time (sec) 2,141.44 99.96 1,768.41 17% -1669%
> memcg_high 99,253 64,666 75,139
> memcg_swap_fail 129,074 53 73
> pswpout 0 40,048,128 0
> pswpin 0 0 0
> zswpout 61,312,794 28,321 57,083,119
> zswpin 383 406 447
> thp_swpout 0 78,219 0
> thp_swpout_fallback 129,074 53 73
> 2MB-mthp_swpout_fallback 0 0 0
> pgmajfault 3,430 3,077 7,133
> swap_ra 91 103 11,978
> swap_ra_hit 47 46 11,920
> ZSWPOUT-2MB n/a n/a 111,390
> SWPOUT-2MB 0 78,219 0
> -------------------------------------------------------------------------------
>
> And finally, this is a comparison of deflate-iaa vs. zstd with v8 of this
> patch-series:
>
> ---------------------------------------------
> zswap_store large folios v8
> Impr w/ deflate-iaa vs. zstd
>
> 64K folios 2M folios
> ---------------------------------------------
> Throughput (KB/s) 1% 17%
> elapsed time (sec) 17% 23%
> sys time (sec) 26% 27%
> ---------------------------------------------
>
>
> Conclusions based on the performance results:
> =============================================
>
> v8 wrt before-case1:
> --------------------
> We see significant improvements in throughput, elapsed and sys time for
> zstd and deflate-iaa, when comparing before-case1 (THP_SWAP=N) vs. after
> (THP_SWAP=Y) with zswap_store large folios.
>
> v8 wrt before-case2:
> --------------------
> We see even more significant improvements in throughput and elapsed time
> for zstd and deflate-iaa, when comparing before-case2 (large-folio-SSD)
> vs. after (large-folio-zswap). The sys time increases with
> large-folio-zswap as expected, due to the CPU compression time
> vs. asynchronous disk write times, as pointed out by Ying and Yosry.
>
> In before-case2, when zswap does not store large folios, only allocations
> and cgroup charging due to 4K folio zswap stores count towards the cgroup
> memory limit. However, in the after scenario, with the introduction of
> zswap_store() of large folios, there is an added component of the zswap
> compressed pool usage from large folio stores from potentially all 30
> processes, that gets counted towards the memory limit. As a result, we see
> higher swapout activity in the "after" data.
>
>
> Summary:
> ========
> The v8 data presented above shows that zswap_store of large folios
> demonstrates good throughput/performance improvements compared to
> conventional SSD swap of large folios with a sufficiently large 525G SSD
> swap device. Hence, it seems reasonable for zswap_store to support large
> folios, so that further performance improvements can be implemented.
>
> In the experimental setup used in this patchset, we have enabled IAA
> compress verification to ensure additional hardware data integrity CRC
> checks not currently done by the software compressors. We see good
> throughput/latency improvements with deflate-iaa vs. zstd with zswap_store
> of large folios.
>
> Some of the ideas for further reducing latency that have shown promise in
> our experiments, are:
>
> 1) IAA compress/decompress batching.
> 2) Distributing compress jobs across all IAA devices on the socket.
>
> > The tests run for this patchset use only 1 IAA device per core, which
> > avails of the 2 compress engines on that device.
> batching, we distribute compress jobs from all cores to the 8 compress
> engines available per socket. We further compress the pages in each folio
> in parallel in the accelerator. As a result, we improve compress latency
> and reclaim throughput.
>
> In decompress batching, we use swapin_readahead to generate a prefetch
> batch of 4K folios that we decompress in parallel in IAA.
>
> ------------------------------------------------------------------------------
> IAA compress/decompress batching
> Further improvements wrt v8 zswap_store Sequential
> subpage store using "deflate-iaa":
>
> "deflate-iaa" Batching "deflate-iaa-canned" [2] Batching
> Additional Impr Additional Impr
> 64K folios 2M folios 64K folios 2M folios
> ------------------------------------------------------------------------------
> Throughput (KB/s) 35% 34% 44% 44%
> elapsed time (sec) 9% 10% 14% 17%
> sys time (sec) 0.4% 4% 8% 15%
> ------------------------------------------------------------------------------
>
>
> With zswap IAA compress/decompress batching, we are able to demonstrate
> significant performance improvements and memory savings in server
> scalability experiments in highly contended system scenarios under
> > significant memory pressure, as compared to software compressors. We hope
> to submit this work in subsequent patch series. The current patch-series is
> > a prerequisite for these future submissions.
>
> Thanks,
> Kanchana
>
>
> [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> [2] https://patchwork.kernel.org/project/linux-crypto/cover/cover.1710969449.git.andre.glover@linux.intel.com/
>
>
> Changes since v7:
> =================
> 1) Rebased to mm-unstable as of 9-27-2024,
> commit de2fbaa6d9c3576ec7133ed02a370ec9376bf000.
> 2) Added Nhat's 'Reviewed-by' to patches 1 and 2. Thanks Nhat!
> 3) Implemented one-time obj_cgroup_may_zswap and zswap_check_limits at the
> start of zswap_store. Implemented one-time batch updates to cgroup zswap
> charging (with total compressed bytes), zswap_stored_pages and the
> memcg/vm zswpout event stats (with folio_nr_pages()) only for successful
> stores at the end of zswap_store. Thanks Yosry and Johannes for guidance
> on this!
> 4) Changed the existing zswap_pool_get() to zswap_pool_tryget(). Modified
> zswap_pool_current_get() and zswap_pool_find_get() to call
> zswap_pool_tryget(). Furthermore, zswap_store() obtains a reference to a
> valid zswap_pool upfront by calling zswap_pool_tryget(), and errors out
> if the tryget fails. Added a new zswap_pool_get() that calls
> "percpu_ref_get(&pool->ref)" and is called in zswap_store_page(), as
> suggested by Johannes & Yosry. Thanks both!
> 5) Provided a new count_objcg_events() API for batch event updates.
> 6) Changed "zswap_stored_pages" to atomic_long_t to support adding
> folio_nr_pages() to it once a large folio is stored successfully.
> 7) Deleted the refactoring done in v7 for the xarray updates in
> zswap_store_page(); and unwinding of stored offsets in zswap_store() in
> case of errors, as suggested by Johannes.
> 8) Deleted the CONFIG_ZSWAP_STORE_THP_DEFAULT_ON config option and
> "zswap_mthp_enabled" tunable, as recommended by Yosry, Johannes and
> Nhat.
> 9) Replaced references to "mTHP" with "large folios"; organized
> before/after data per-compressor for easier visual comparisons;
> incorporated Nhat's feedback in the documentation updates; moved
> changelog to the end. Thanks Johannes, Yosry and Nhat!
> 10) Moved the usemem testing configuration to 30 processes, each allocating
> 10G within a 150G memory-limit constrained cgroup, maintaining the
> allocated memory for 10 sec before exiting. Thanks Ying for this
> suggestion!
>
> Changes since v6:
> =================
> 1) Rebased to mm-unstable as of 9-23-2024,
> commit acfabf7e197f7a5bedf4749dac1f39551417b049.
> 2) Refactored into smaller commits, as suggested by Yosry and
> Chengming. Thanks both!
> 3) Reworded the commit log for patches 5 and 6 as per Yosry's
> suggestion. Thanks Yosry!
> 4) Gathered data on a Sapphire Rapids server that has 823GiB SSD swap disk
> partition. Also, all experiments are run with usemem --sleep 10, so that
> the memory allocated by the 70 processes remains in memory
> longer. Posted elapsed and sys times. Thanks to Yosry, Nhat and Ying for
> their help with refining the performance characterization methodology.
> 5) Updated Documentation/admin-guide/mm/transhuge.rst as suggested by
> Nhat. Thanks Nhat!
>
> Changes since v5:
> =================
> 1) Rebased to mm-unstable as of 8/29/2024,
> commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
> 2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
> enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
> suggestion to add a knob by which users can enable/disable this
> change. Nhat, I hope this is along the lines of what you were
> thinking.
> 3) Added vm-scalability usemem data with 4K folios with
> CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
> there is no regression with this change.
> 4) Added data with usemem with 64K and 2M THP for an alternate view of
> before/after, as suggested by Yosry, so we can understand the impact
> of when mTHPs are split into 4K folios in shrink_folio_list()
> (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
> in zswap. Thanks Yosry for this suggestion.
>
> Changes since v4:
> =================
> 1) Published before/after data with zstd, as suggested by Nhat (Thanks
> Nhat for the data reviews!).
> 2) Rebased to mm-unstable from 8/27/2024,
> commit b659edec079c90012cf8d05624e312d1062b8b87.
> 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
> CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
> robot; as per Nhat's and Michal's suggestion to not require a separate
> patch to fix the build errors (thanks both!).
> 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
> suggested by Yosry (Thanks Yosry!).
> 5) Squashed the commits that define new mthp zswpout stat counters, and
> invoke count_mthp_stat() after successful zswap_store()s; into a single
> commit. Thanks Yosry for this suggestion!
>
> Changes since v3:
> =================
> 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> changes to count_mthp_stat() so that it's always defined, even when THP
> is disabled. Barry, I have also made one other change in page_io.c
> where count_mthp_stat() is called by count_swpout_vm_event(). I would
> appreciate it if you can review this. Thanks!
> Hopefully this should resolve the kernel robot build errors.
>
> Changes since v2:
> =================
> 1) Gathered usemem data using SSD as the backing swap device for zswap,
> as suggested by Ying Huang. Ying, I would appreciate it if you can
> review the latest data. Thanks!
> 2) Generated the base commit info in the patches to attempt to address
> the kernel test robot build errors.
> 3) No code changes to the individual patches themselves.
>
> Changes since RFC v1:
> =====================
>
> 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> Thanks Barry!
> 2) Addressed some of the code review comments that Nhat Pham provided in
> Ryan's initial RFC [1]:
> > - Added a comment about the cgroup zswap limit checks occurring once per
> folio at the beginning of zswap_store().
> Nhat, Ryan, please do let me know if the comments convey the summary
> from the RFC discussion. Thanks!
> - Posted data on running the cgroup suite's zswap kselftest.
> 3) Rebased to v6.11-rc3.
> 4) Gathered performance data with usemem and the rebased patch-series.
>
>
>
> Kanchana P Sridhar (8):
> mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
> mm: zswap: Modify zswap_compress() to accept a page instead of a
> folio.
> mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget().
> mm: Provide a new count_objcg_events() API for batch event updates.
> mm: zswap: Modify zswap_stored_pages to be atomic_long_t.
> mm: zswap: Support large folios in zswap_store().
> mm: swap: Count successful large folio zswap stores in hugepage
> zswpout stats.
> mm: Document the newly added sysfs large folios zswpout stats.
>
> Documentation/admin-guide/mm/transhuge.rst | 8 +-
> fs/proc/meminfo.c | 2 +-
> include/linux/huge_mm.h | 1 +
> include/linux/memcontrol.h | 24 ++
> include/linux/zswap.h | 2 +-
> mm/huge_memory.c | 3 +
> mm/page_io.c | 1 +
> mm/zswap.c | 254 +++++++++++++++------
> 8 files changed, 219 insertions(+), 76 deletions(-)
>
>
> base-commit: de2fbaa6d9c3576ec7133ed02a370ec9376bf000
> --
> 2.27.0
>
^ permalink raw reply [flat|nested] 43+ messages in thread
* RE: [PATCH v8 0/8] mm: zswap swap-out of large folios
2024-09-28 2:25 ` [PATCH v8 0/8] mm: zswap swap-out of large folios Yosry Ahmed
@ 2024-09-28 2:36 ` Sridhar, Kanchana P
2024-09-28 3:00 ` Yosry Ahmed
0 siblings, 1 reply; 43+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-28 2:36 UTC (permalink / raw)
To: Yosry Ahmed, Ryan Roberts
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, Huang, Ying, 21cnbao, akpm, Zou,
Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P
Hi Yosry,
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Friday, September 27, 2024 7:25 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; Ryan Roberts
> <ryan.roberts@arm.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v8 0/8] mm: zswap swap-out of large folios
>
> On Fri, Sep 27, 2024 at 7:16 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Hi All,
> >
> > This patch-series enables zswap_store() to accept and store large
> > folios. The most significant contribution in this series is from the
> > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
> > migrated to mm-unstable as of 9-27-2024 in patch 6 of this series, and
> > adapted based on code review comments received for v7 of the current
> > patch-series.
> >
> > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
> > https://lore.kernel.org/linux-mm/20231019110543.3284654-1-
> ryan.roberts@arm.com/T/#u
> >
> > The first few patches do the prep work for supporting large folios in
> > zswap_store. Patch 6 provides the main functionality to swap-out large
> > folios in zswap. Patch 7 adds sysfs per-order hugepages "zswpout" counters
> > that get incremented upon successful zswap_store of large folios:
> >
> > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
> >
> > Patch 8 updates the documentation for the new sysfs "zswpout" counters.
> >
> > This patch-series is a pre-requisite for zswap compress batching of large
> > folio swap-out and decompress batching of swap-ins based on
> > swapin_readahead(), using Intel IAA hardware acceleration, which we
> would
> > like to submit in subsequent patch-series, with performance improvement
> > data.
> >
> > Thanks to Ying Huang for pre-posting review feedback and suggestions!
> >
> > Thanks also to Nhat, Yosry, Johannes, Barry, Chengming, Usama and Ying for
> > their helpful feedback, data reviews and suggestions!
> >
> > Co-development signoff request:
> > ===============================
> > I would like to thank Ryan Roberts for his original RFC [1] and request
> > his co-developer signoff on patch 6 in this series. Thanks Ryan!
>
>
> Ryan, could you help Kanchana out with a Signed-off-by please :)
>
> >
> >
> >
> > System setup for testing:
> > =========================
> > Testing of this patch-series was done with mm-unstable as of 9-27-2024,
> > commit de2fbaa6d9c3576ec7133ed02a370ec9376bf000. Data was
> gathered
> > without/with this patch-series, on an Intel Sapphire Rapids server,
> > dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB RAM and
> > 525G SSD disk partition swap. Core frequency was fixed at 2500MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
> > processes were run, each allocating and writing 10G of memory, and
> sleeping
> > for 10 sec before exiting:
> >
> > usemem --init-time -w -O -s 10 -n 30 10g
> >
> > Other kernel configuration parameters:
> >
> > zswap compressors : zstd, deflate-iaa
> > zswap allocator : zsmalloc
> > vm.page-cluster : 2
> >
> > In the experiments where "deflate-iaa" is used as the zswap compressor,
> > IAA "compression verification" is enabled by default
> > (cat /sys/bus/dsa/drivers/crypto/verify_compress). Hence each IAA
> > compression will be decompressed internally by the "iaa_crypto" driver, the
> > CRCs returned by the hardware will be compared and errors reported in case
> > of mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > compared to the software compressors, and the experimental data listed
> > below is with verify_compress set to "1".
> >
> > Total and average throughput are derived from the individual 30 processes'
> > throughputs reported by usemem. Elapsed/sys times are measured with perf.
> >
> > The vm stats and sysfs hugepages stats included with the performance data
> > provide details on the swapout activity to zswap/swap device.
> >
> >
> > Testing labels used in data summaries:
> > ======================================
> > The data refers to these test configurations and the before/after
> > comparisons that they enable:
> >
> > before-case1:
> > -------------
> > mm-unstable 9-27-2024, CONFIG_THP_SWAP=N (compares zswap 4K vs. zswap 64K)
> >
> > In this scenario, CONFIG_THP_SWAP=N results in 64K/2M folios being split
> > into 4K folios that get processed by zswap.
> >
> > before-case2:
> > -------------
> > mm-unstable 9-27-2024, CONFIG_THP_SWAP=Y (compares SSD swap large folios vs. zswap large folios)
> >
> > In this scenario, CONFIG_THP_SWAP=Y results in zswap rejecting large
> > folios, which will then be stored by the SSD swap device.
> >
> > after:
> > ------
> > v8 of this patch-series, CONFIG_THP_SWAP=Y
> >
> > The "after" is CONFIG_THP_SWAP=Y and v8 of this patch-series, that
> results
> > in 64K/2M folios to not be split, and to be processed by zswap_store.
> >
> >
> > Regression Testing:
> > ===================
> > I ran vm-scalability usemem without large folios, i.e., only 4K folios with
> > mm-unstable and this patch-series. The main goal was to make sure that
> > there is no functional or performance regression wrt the earlier zswap
> > behavior for 4K folios, now that 4K folios will be processed by the new
> > zswap_store() code.
> >
> > The data indicates there is no significant regression.
> >
> > -------------------------------------------------------------------------------
> > 4K folios:
> > ==========
> >
> > zswap compressor zstd zstd zstd zstd v8 zstd v8
> > before-case1 before-case2 after vs. vs.
> > case1 case2
> > -------------------------------------------------------------------------------
> > Total throughput (KB/s) 4,793,363 4,880,978 4,813,151 0% -1%
> > Average throughput (KB/s) 159,778 162,699 160,438 0% -1%
> > elapsed time (sec) 130.14 123.17 127.21 2% -3%
> > sys time (sec) 3,135.53 2,985.64 3,110.53 1% -4%
> >
> > memcg_high 446,826 444,626 448,231
> > memcg_swap_fail 0 0 0
> > pswpout 0 0 0
> > pswpin 0 0 0
> > zswpout 48,932,107 48,931,971 48,931,584
> > zswpin 383 386 388
> > thp_swpout 0 0 0
> > thp_swpout_fallback 0 0 0
> > 64kB-mthp_swpout_fallback 0 0 0
> > pgmajfault 3,063 3,077 3,082
> > swap_ra 93 94 93
> > swap_ra_hit 47 47 47
> > ZSWPOUT-64kB n/a n/a 0
> > SWPOUT-64kB 0 0 0
> > -------------------------------------------------------------------------------
> >
> >
> > Performance Testing:
> > ====================
> >
> > We list the data for 64K folios with before/after data per-compressor,
> > followed by the same for 2M pmd-mappable folios.
> >
> >
> > -------------------------------------------------------------------------------
> > 64K folios: zstd:
> > =================
> >
> > zswap compressor zstd zstd zstd zstd v8
> > before-case1 before-case2 after vs. vs.
> > case1 case2
> > -------------------------------------------------------------------------------
> > Total throughput (KB/s) 5,222,213 1,076,611 6,227,367 19% 478%
> > Average throughput (KB/s) 174,073 35,887 207,578 19% 478%
> > elapsed time (sec) 120.50 347.16 109.21 9% 69%
>
>
> The diff here is supposed to be negative, right?
> (Same for the below results)

So this is supposed to be positive, to indicate the throughput improvement
[(new-old)/old] with v8 as compared to before-case1 and before-case2.
For latency, a positive value indicates a reduction in latency, since I
calculate [(old-new)/old]. This is the metric used throughout.

Based on this convention, positive percentages are improvements in both
throughput and latency.
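
To make the convention concrete with the zstd 64K numbers above: the
throughput improvement vs. case2 is (6,227,367 - 1,076,611) / 1,076,611,
i.e. ~478%, and the elapsed time improvement vs. case2 is
(347.16 - 109.21) / 347.16, i.e. ~69%.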
>
> Otherwise the results are looking really good, we have come a long way
> since the first version :)
>
> Thanks for working on this! I will look at individual patches later
> today or early next week.

Many thanks Yosry :) I immensely appreciate your, Nhat's, Johannes', Ying's
and others' help in getting here!

Sure, this sounds good.
Thanks,
Kanchana
>
> >
> > sys time (sec) 2,930.33 248.16 2,609.22 11% -951%
> > memcg_high 416,773 552,200 482,703
> > memcg_swap_fail 3,192,906 1,293 944
> > pswpout 0 40,778,448 0
> > pswpin 0 16 0
> > zswpout 48,931,583 20,903 48,931,271
> > zswpin 384 363 392
> > thp_swpout 0 0 0
> > thp_swpout_fallback 0 0 0
> > 64kB-mthp_swpout_fallback 3,192,906 1,293 944
> > pgmajfault 3,452 3,072 3,095
> > swap_ra 90 87 100
> > swap_ra_hit 42 43 56
> > ZSWPOUT-64kB n/a n/a 3,057,260
> > SWPOUT-64kB 0 2,548,653 0
> > -------------------------------------------------------------------------------
> >
> >
> > -------------------------------------------------------------------------------
> > 64K folios: deflate-iaa:
> > ========================
> >
> > zswap compressor deflate-iaa deflate-iaa deflate-iaa deflate-iaa v8
> > before-case1 before-case2 after vs. vs.
> > case1 case2
> > -------------------------------------------------------------------------------
> > Total throughput (KB/s) 5,652,608 1,089,180 6,315,000 12% 480%
> > Average throughput (KB/s) 188,420 36,306 210,500 12% 480%
> > elapsed time (sec) 102.90 343.35 91.11 11% 73%
> >
> >
> > sys time (sec) 2,246.86 213.53 1,939.31 14% -808%
> > memcg_high 576,104 502,907 612,505
> > memcg_swap_fail 4,016,117 1,407 1,660
> > pswpout 0 40,862,080 0
> > pswpin 0 20 0
> > zswpout 61,163,423 22,444 57,317,607
> > zswpin 401 368 449
> > thp_swpout 0 0 0
> > thp_swpout_fallback 0 0 0
> > 64kB-mthp_swpout_fallback 4,016,117 1,407 1,660
> > pgmajfault 3,063 3,153 3,167
> > swap_ra 96 93 149
> > swap_ra_hit 46 45 89
> > ZSWPOUT-64kB n/a n/a 3,580,673
> > SWPOUT-64kB 0 2,553,880 0
> > -------------------------------------------------------------------------------
> >
> >
> > -------------------------------------------------------------------------------
> > 2M folios: zstd:
> > ================
> >
> > zswap compressor zstd zstd zstd zstd v8
> > before-case1 before-case2 after vs. vs.
> > case1 case2
> > -------------------------------------------------------------------------------
> > Total throughput (KB/s) 5,895,500 1,109,694 6,460,111 10% 482%
> > Average throughput (KB/s) 196,516 36,989 215,337 10% 482%
> > elapsed time (sec) 108.77 334.28 105.92 3% 68%
> >
> >
> > sys time (sec) 2,657.14 94.88 2,436.24 8% -2468%
> > memcg_high 64,200 66,316 60,300
> > memcg_swap_fail 101,182 70 30
> > pswpout 0 40,166,400 0
> > pswpin 0 0 0
> > zswpout 48,931,499 36,507 48,869,236
> > zswpin 380 379 397
> > thp_swpout 0 78,450 0
> > thp_swpout_fallback 101,182 70 30
> > 2MB-mthp_swpout_fallback 0 0 0
> > pgmajfault 3,067 3,417 4,765
> > swap_ra 91 90 5,073
> > swap_ra_hit 45 45 5,024
> > ZSWPOUT-2MB n/a n/a 95,408
> > SWPOUT-2MB 0 78,450 0
> > -------------------------------------------------------------------------------
> >
> >
> > -------------------------------------------------------------------------------
> > 2M folios: deflate-iaa:
> > =======================
> >
> > zswap compressor deflate-iaa deflate-iaa deflate-iaa deflate-iaa v8
> > before-case1 before-case2 after vs. vs.
> > case1 case2
> > -------------------------------------------------------------------------------
> > Total throughput (KB/s) 6,286,587 1,126,785 7,569,560 20% 572%
> > Average throughput (KB/s) 209,552 37,559 252,318 20% 572%
> > elapsed time (sec) 96.19 333.03 81.96 15% 75%
> >
> > sys time (sec) 2,141.44 99.96 1,768.41 17% -1669%
> > memcg_high 99,253 64,666 75,139
> > memcg_swap_fail 129,074 53 73
> > pswpout 0 40,048,128 0
> > pswpin 0 0 0
> > zswpout 61,312,794 28,321 57,083,119
> > zswpin 383 406 447
> > thp_swpout 0 78,219 0
> > thp_swpout_fallback 129,074 53 73
> > 2MB-mthp_swpout_fallback 0 0 0
> > pgmajfault 3,430 3,077 7,133
> > swap_ra 91 103 11,978
> > swap_ra_hit 47 46 11,920
> > ZSWPOUT-2MB n/a n/a 111,390
> > SWPOUT-2MB 0 78,219 0
> > -------------------------------------------------------------------------------
> >
> > And finally, this is a comparison of deflate-iaa vs. zstd with v8 of this
> > patch-series:
> >
> > ---------------------------------------------
> > zswap_store large folios v8
> > Impr w/ deflate-iaa vs. zstd
> >
> > 64K folios 2M folios
> > ---------------------------------------------
> > Throughput (KB/s) 1% 17%
> > elapsed time (sec) 17% 23%
> > sys time (sec) 26% 27%
> > ---------------------------------------------
> >
> >
> > Conclusions based on the performance results:
> > =============================================
> >
> > v8 wrt before-case1:
> > --------------------
> > We see significant improvements in throughput, elapsed and sys time for
> > zstd and deflate-iaa, when comparing before-case1 (THP_SWAP=N) vs. after
> > (THP_SWAP=Y) with zswap_store large folios.
> >
> > v8 wrt before-case2:
> > --------------------
> > We see even more significant improvements in throughput and elapsed time
> > for zstd and deflate-iaa, when comparing before-case2 (large-folio-SSD)
> > vs. after (large-folio-zswap). The sys time increases with
> > large-folio-zswap as expected, due to the CPU compression time
> > vs. asynchronous disk write times, as pointed out by Ying and Yosry.
> >
> > In before-case2, when zswap does not store large folios, only allocations
> > and cgroup charging due to 4K folio zswap stores count towards the cgroup
> > memory limit. However, in the after scenario, with the introduction of
> > zswap_store() of large folios, there is an added component of the zswap
> > compressed pool usage from large folio stores from potentially all 30
> > processes, which gets counted towards the memory limit. As a result, we see
> > higher swapout activity in the "after" data.
> >
> >
> > Summary:
> > ========
> > The v8 data presented above shows that zswap_store of large folios
> > demonstrates good throughput/performance improvements compared to
> > conventional SSD swap of large folios with a sufficiently large 525G SSD
> > swap device. Hence, it seems reasonable for zswap_store to support large
> > folios, so that further performance improvements can be implemented.
> >
> > In the experimental setup used in this patchset, we have enabled IAA
> > compress verification to ensure additional hardware data integrity CRC
> > checks not currently done by the software compressors. We see good
> > throughput/latency improvements with deflate-iaa vs. zstd with zswap_store
> > of large folios.
> >
> > Some of the ideas for further reducing latency that have shown promise in
> > our experiments are:
> >
> > 1) IAA compress/decompress batching.
> > 2) Distributing compress jobs across all IAA devices on the socket.
> >
> > The tests run for this patchset use only 1 IAA device per core, which
> > avails of the 2 compress engines on the device. In our experiments with IAA
> > batching, we distribute compress jobs from all cores to the 8 compress
> > engines available per socket. We further compress the pages in each folio
> > in parallel in the accelerator. As a result, we improve compress latency
> > and reclaim throughput.
> >
> > In decompress batching, we use swapin_readahead to generate a prefetch
> > batch of 4K folios that we decompress in parallel in IAA.
> >
> > ------------------------------------------------------------------------------
> > IAA compress/decompress batching
> > Further improvements wrt v8 zswap_store Sequential
> > subpage store using "deflate-iaa":
> >
> > "deflate-iaa" Batching "deflate-iaa-canned" [2] Batching
> > Additional Impr Additional Impr
> > 64K folios 2M folios 64K folios 2M folios
> > ------------------------------------------------------------------------------
> > Throughput (KB/s) 35% 34% 44% 44%
> > elapsed time (sec) 9% 10% 14% 17%
> > sys time (sec) 0.4% 4% 8% 15%
> > ------------------------------------------------------------------------------
> >
> >
> > With zswap IAA compress/decompress batching, we are able to demonstrate
> > significant performance improvements and memory savings in server
> > scalability experiments in highly contended system scenarios under
> > significant memory pressure, as compared to software compressors. We hope
> > to submit this work in subsequent patch series. The current patch-series is
> > a prerequisite for these future submissions.
> >
> > Thanks,
> > Kanchana
> >
> >
> > [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> > [2] https://patchwork.kernel.org/project/linux-crypto/cover/cover.1710969449.git.andre.glover@linux.intel.com/
> >
> >
> > Changes since v7:
> > =================
> > 1) Rebased to mm-unstable as of 9-27-2024,
> > commit de2fbaa6d9c3576ec7133ed02a370ec9376bf000.
> > 2) Added Nhat's 'Reviewed-by' to patches 1 and 2. Thanks Nhat!
> > 3) Implemented one-time obj_cgroup_may_zswap and zswap_check_limits at the
> > start of zswap_store. Implemented one-time batch updates to cgroup zswap
> > charging (with total compressed bytes), zswap_stored_pages and the
> > memcg/vm zswpout event stats (with folio_nr_pages()) only for successful
> > stores at the end of zswap_store (see the sketch after this v7 changelog).
> > Thanks Yosry and Johannes for guidance on this!
> > 4) Changed the existing zswap_pool_get() to zswap_pool_tryget(). Modified
> > zswap_pool_current_get() and zswap_pool_find_get() to call
> > zswap_pool_tryget(). Furthermore, zswap_store() obtains a reference to a
> > valid zswap_pool upfront by calling zswap_pool_tryget(), and errors out
> > if the tryget fails. Added a new zswap_pool_get() that calls
> > "percpu_ref_get(&pool->ref)" and is called in zswap_store_page(), as
> > suggested by Johannes & Yosry. Thanks both!
> > 5) Provided a new count_objcg_events() API for batch event updates.
> > 6) Changed "zswap_stored_pages" to atomic_long_t to support adding
> > folio_nr_pages() to it once a large folio is stored successfully.
> > 7) Deleted the refactoring done in v7 for the xarray updates in
> > zswap_store_page(); and unwinding of stored offsets in zswap_store() in
> > case of errors, as suggested by Johannes.
> > 8) Deleted the CONFIG_ZSWAP_STORE_THP_DEFAULT_ON config option and
> > "zswap_mthp_enabled" tunable, as recommended by Yosry, Johannes and
> > Nhat.
> > 9) Replaced references to "mTHP" with "large folios"; organized
> > before/after data per-compressor for easier visual comparisons;
> > incorporated Nhat's feedback in the documentation updates; moved
> > changelog to the end. Thanks Johannes, Yosry and Nhat!
> > 10) Moved the usemem testing configuration to 30 processes, each allocating
> > 10G within a 150G memory-limit constrained cgroup, maintaining the
> > allocated memory for 10 sec before exiting. Thanks Ying for this
> > suggestion!
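> >
> > As referenced in (3) above, here is a minimal sketch of the shape of the
> > one-time batch updates at the end of a successful zswap_store() of an
> > nr_pages folio. This is an illustration of the idea only, not the actual
> > diff; variable names are assumptions based on the descriptions above:
> >
> > 	/*
> > 	 * nr_pages = folio_nr_pages(folio); objcg and total_compressed_bytes
> > 	 * were gathered while storing the folio's individual pages.
> > 	 */
> > 	if (objcg) {
> > 		obj_cgroup_charge_zswap(objcg, total_compressed_bytes);
> > 		count_objcg_events(objcg, ZSWPOUT, nr_pages);
> > 	}
> > 	atomic_long_add(nr_pages, &zswap_stored_pages);
> > 	count_vm_events(ZSWPOUT, nr_pages);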
> >
> > Changes since v6:
> > =================
> > 1) Rebased to mm-unstable as of 9-23-2024,
> > commit acfabf7e197f7a5bedf4749dac1f39551417b049.
> > 2) Refactored into smaller commits, as suggested by Yosry and
> > Chengming. Thanks both!
> > 3) Reworded the commit log for patches 5 and 6 as per Yosry's
> > suggestion. Thanks Yosry!
> > 4) Gathered data on a Sapphire Rapids server that has 823GiB SSD swap disk
> > partition. Also, all experiments are run with usemem --sleep 10, so that
> > the memory allocated by the 70 processes remains in memory
> > longer. Posted elapsed and sys times. Thanks to Yosry, Nhat and Ying for
> > their help with refining the performance characterization methodology.
> > 5) Updated Documentation/admin-guide/mm/transhuge.rst as suggested by
> > Nhat. Thanks Nhat!
> >
> > Changes since v5:
> > =================
> > 1) Rebased to mm-unstable as of 8/29/2024,
> > commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
> > 2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
> > enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
> > suggestion to add a knob by which users can enable/disable this
> > change. Nhat, I hope this is along the lines of what you were
> > thinking.
> > 3) Added vm-scalability usemem data with 4K folios with
> > CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
> > there is no regression with this change.
> > 4) Added data with usemem with 64K and 2M THP for an alternate view of
> > before/after, as suggested by Yosry, so we can understand the impact
> > of when mTHPs are split into 4K folios in shrink_folio_list()
> > (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
> > in zswap. Thanks Yosry for this suggestion.
> >
> > Changes since v4:
> > =================
> > 1) Published before/after data with zstd, as suggested by Nhat (Thanks
> > Nhat for the data reviews!).
> > 2) Rebased to mm-unstable from 8/27/2024,
> > commit b659edec079c90012cf8d05624e312d1062b8b87.
> > 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
> > CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
> > robot; as per Nhat's and Michal's suggestion to not require a separate
> > patch to fix the build errors (thanks both!).
> > 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
> > suggested by Yosry (Thanks Yosry!).
> > 5) Squashed the commits that define new mthp zswpout stat counters, and
> > invoke count_mthp_stat() after successful zswap_store()s; into a single
> > commit. Thanks Yosry for this suggestion!
> >
> > Changes since v3:
> > =================
> > 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
> > Thanks to Barry for suggesting aligning with Ryan Roberts' latest
> > changes to count_mthp_stat() so that it's always defined, even when THP
> > is disabled. Barry, I have also made one other change in page_io.c
> > where count_mthp_stat() is called by count_swpout_vm_event(). I would
> > appreciate it if you can review this. Thanks!
> > Hopefully this should resolve the kernel robot build errors.
> >
> > Changes since v2:
> > =================
> > 1) Gathered usemem data using SSD as the backing swap device for zswap,
> > as suggested by Ying Huang. Ying, I would appreciate it if you can
> > review the latest data. Thanks!
> > 2) Generated the base commit info in the patches to attempt to address
> > the kernel test robot build errors.
> > 3) No code changes to the individual patches themselves.
> >
> > Changes since RFC v1:
> > =====================
> >
> > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
> > Thanks Barry!
> > 2) Addressed some of the code review comments that Nhat Pham provided in
> > Ryan's initial RFC [1]:
> > - Added a comment about the cgroup zswap limit checks occurring once per
> > folio at the beginning of zswap_store().
> > Nhat, Ryan, please do let me know if the comments convey the summary
> > from the RFC discussion. Thanks!
> > - Posted data on running the cgroup suite's zswap kselftest.
> > 3) Rebased to v6.11-rc3.
> > 4) Gathered performance data with usemem and the rebased patch-series.
> >
> >
> >
> > Kanchana P Sridhar (8):
> > mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
> > mm: zswap: Modify zswap_compress() to accept a page instead of a
> > folio.
> > mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget().
> > mm: Provide a new count_objcg_events() API for batch event updates.
> > mm: zswap: Modify zswap_stored_pages to be atomic_long_t.
> > mm: zswap: Support large folios in zswap_store().
> > mm: swap: Count successful large folio zswap stores in hugepage
> > zswpout stats.
> > mm: Document the newly added sysfs large folios zswpout stats.
> >
> > Documentation/admin-guide/mm/transhuge.rst | 8 +-
> > fs/proc/meminfo.c | 2 +-
> > include/linux/huge_mm.h | 1 +
> > include/linux/memcontrol.h | 24 ++
> > include/linux/zswap.h | 2 +-
> > mm/huge_memory.c | 3 +
> > mm/page_io.c | 1 +
> > mm/zswap.c | 254 +++++++++++++++------
> > 8 files changed, 219 insertions(+), 76 deletions(-)
> >
> >
> > base-commit: de2fbaa6d9c3576ec7133ed02a370ec9376bf000
> > --
> > 2.27.0
> >
^ permalink raw reply [flat|nested] 43+ messages in thread
* Re: [PATCH v8 0/8] mm: zswap swap-out of large folios
2024-09-28 2:36 ` Sridhar, Kanchana P
@ 2024-09-28 3:00 ` Yosry Ahmed
0 siblings, 0 replies; 43+ messages in thread
From: Yosry Ahmed @ 2024-09-28 3:00 UTC (permalink / raw)
To: Sridhar, Kanchana P
Cc: Ryan Roberts, linux-kernel, linux-mm, hannes, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, Huang, Ying, 21cnbao,
akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh
[..]
> > > Performance Testing:
> > > ====================
> > >
> > > We list the data for 64K folios with before/after data per-compressor,
> > > followed by the same for 2M pmd-mappable folios.
> > >
> > >
> > > -------------------------------------------------------------------------------
> > > 64K folios: zstd:
> > > =================
> > >
> > > zswap compressor zstd zstd zstd zstd v8
> > > before-case1 before-case2 after vs. vs.
> > > case1 case2
> > > -------------------------------------------------------------------------------
> > > Total throughput (KB/s) 5,222,213 1,076,611 6,227,367 19% 478%
> > > Average throughput (KB/s) 174,073 35,887 207,578 19% 478%
> > > elapsed time (sec) 120.50 347.16 109.21 9% 69%
> >
> >
> > The diff here is supposed to be negative, right?
> > (Same for the below results)
>
> So this is supposed to be positive, to indicate the throughput improvement
> [(new-old)/old] with v8 as compared to before-case1 and before-case2.
> For latency, a positive value indicates a reduction in latency, since I
> calculate [(old-new)/old]. This is the metric used throughout.
>
> Based on this convention, positive percentages are improvements in both
> throughput and latency.

But you use negative percentages for sys time; we should at least be
consistent with this.
^ permalink raw reply [flat|nested] 43+ messages in thread