* [PATCH v9 0/7] mm: zswap swap-out of large folios
@ 2024-09-30 22:12 Kanchana P Sridhar
2024-09-30 22:12 ` [PATCH v9 1/7] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
` (6 more replies)
0 siblings, 7 replies; 29+ messages in thread
From: Kanchana P Sridhar @ 2024-09-30 22:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm, willy
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
Hi All,
This patch-series enables zswap_store() to accept and store large
folios. The most significant contribution in this series is from the
earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
migrated to mm-unstable as of 9-27-2024 as patch 6 of this series, and
adapted based on code review comments received for v7 of the current
patch-series.
[1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
The first few patches do the prep work for supporting large folios in
zswap_store. Patch 6 provides the main functionality to swap-out large
folios in zswap. Patch 7 adds sysfs per-order hugepages "zswpout" counters
that get incremented upon successful zswap_store of large folios, and updates
the documentation for these new counters:
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
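For instance, with 64K mTHP enabled, the new per-order counter could be read
as follows (illustrative; the hugepages-64kB directory name depends on the
mTHP orders configured on the system):

  cat /sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/zswpout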
This patch-series is a prerequisite for zswap compress batching of large
folio swap-outs and decompress batching of swap-ins based on
swapin_readahead(), using Intel IAA hardware acceleration. We would like to
submit that work, along with performance improvement data, in subsequent
patch-series.
Thanks to Ying Huang for pre-posting review feedback and suggestions!
Thanks also to Nhat, Yosry, Johannes, Barry, Chengming, Usama, Ying and
Matthew for their helpful feedback, code/data reviews and suggestions!
Co-development signoff request:
===============================
I would like to thank Ryan Roberts for his original RFC [1] and request
his co-developer signoff on patch 6 in this series. Thanks Ryan!
System setup for testing:
=========================
Testing of this patch-series was done with mm-unstable as of 9-27-2024,
commit de2fbaa6d9c3576ec7133ed02a370ec9376bf000 (without this patch-series)
and mm-unstable 9-30-2024 commit 66af62407e82647ec5b44462dc29d50ba03fdb22
(with this patch-series). Data was gathered on an Intel Sapphire Rapids
server, dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB
RAM and 525G SSD disk partition swap. Core frequency was fixed at 2500MHz.
The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 150G. There is no swap limit set for the cgroup. 30 usemem
processes were run, each allocating and writing 10G of memory, and sleeping
for 10 sec before exiting:
usemem --init-time -w -O -s 10 -n 30 10g
Other kernel configuration parameters:
zswap compressors : zstd, deflate-iaa
zswap allocator : zsmalloc
vm.page-cluster : 2
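As a reference, here is a minimal sketch of how such a run could be set up on
a cgroup-v2 system. The cgroup name "test" and the shell steps are
illustrative and not part of this series; only the zswap/sysctl knobs and the
150G memory.high limit reflect the configuration described above:

  # zswap configuration (compressor is zstd or deflate-iaa per experiment)
  echo zstd     > /sys/module/zswap/parameters/compressor
  echo zsmalloc > /sys/module/zswap/parameters/zpool
  echo Y        > /sys/module/zswap/parameters/enabled
  sysctl vm.page-cluster=2

  # illustrative cgroup setup: 150G memory.high limit, no zswap/swap limits
  mkdir -p /sys/fs/cgroup/test
  echo 150G > /sys/fs/cgroup/test/memory.high

  # run the 30 usemem processes from a shell placed in the cgroup
  echo $$ > /sys/fs/cgroup/test/cgroup.procs
  usemem --init-time -w -O -s 10 -n 30 10g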
In the experiments where "deflate-iaa" is used as the zswap compressor,
IAA "compression verification" is enabled by default
(cat /sys/bus/dsa/drivers/crypto/verify_compress). Hence each IAA
compression will be decompressed internally by the "iaa_crypto" driver, the
CRCs returned by the hardware will be compared, and errors will be reported in
case of mismatches. Thus "deflate-iaa" helps ensure better data integrity as
compared to the software compressors, and the experimental data listed
below is with verify_compress set to "1".
Metrics reporting methodology:
==============================
Total and average throughput are derived from the individual throughputs
reported by the 30 usemem processes. Elapsed/sys times are measured with perf.
All percentage changes are computed as "new" vs. "old"; hence a positive value
denotes an increase in the metric, whether it is throughput or latency, and a
negative value denotes a reduction. Positive throughput change percentages and
negative latency change percentages denote improvements. For example, elapsed
time going from 130.14 sec (before-case1) to 125.90 sec (after) in the 4K data
below is reported as -3%, i.e., an improvement.
The vm stats and sysfs hugepages stats included with the performance data
provide details on the swapout activity to zswap/swap device.
Testing labels used in data summaries:
======================================
The data summaries refer to the following test configurations and the
before/after comparisons they represent:
before-case1:
-------------
mm-unstable 9-27-2024, CONFIG_THP_SWAP=N (compares zswap 4K vs. zswap 64K)
In this scenario, CONFIG_THP_SWAP=N results in 64K/2M folios being split
into 4K folios that are then processed by zswap.
before-case2:
-------------
mm-unstable 9-27-2024, CONFIG_THP_SWAP=Y (compares SSD swap large folios vs. zswap large folios)
In this scenario, CONFIG_THP_SWAP=Y results in zswap rejecting large
folios, which are then stored by the SSD swap device.
after:
------
v9 of this patch-series, CONFIG_THP_SWAP=Y
The "after" is CONFIG_THP_SWAP=Y and v9 of this patch-series, that results
in 64K/2M folios to not be split, and to be processed by zswap_store.
Regression Testing:
===================
I ran vm-scalability usemem without large folios, i.e., with only 4K folios,
on both mm-unstable and this patch-series. The main goal was to make sure that
there is no functional or performance regression wrt the earlier zswap
behavior for 4K folios, now that 4K folios will be processed by the new
zswap_store() code.
The data indicates there is no significant regression.
-------------------------------------------------------------------------------
4K folios:
==========
 zswap compressor          zstd            zstd            zstd         zstd v9      zstd v9
                           before-case1    before-case2    after        vs. case1    vs. case2
-------------------------------------------------------------------------------
Total throughput (KB/s) 4,793,363 4,880,978 4,850,564 1% -1%
Average throughput (KB/s) 159,778 162,699 161,685 1% -1%
elapsed time (sec) 130.14 123.17 125.90 -3% 2%
sys time (sec) 3,135.53 2,985.64 3,078.31 -2% 3%
memcg_high 446,826 444,626 443,925
memcg_swap_fail 0 0 0
zswpout 48,932,107 48,931,971 48,932,186
zswpin 383 386 392
pswpout 0 0 0
pswpin 0 0 0
thp_swpout 0 0 0
thp_swpout_fallback 0 0 0
64kB-mthp_swpout_fallback 0 0 0
pgmajfault 3,063 3,077 3,075
swap_ra 93 94 98
swap_ra_hit 47 47 48
ZSWPOUT-64kB n/a n/a 0
SWPOUT-64kB 0 0 0
-------------------------------------------------------------------------------
Performance Testing:
====================
We list the data for 64K folios with before/after data per-compressor,
followed by the same for 2M pmd-mappable folios.
-------------------------------------------------------------------------------
64K folios: zstd:
=================
 zswap compressor          zstd            zstd            zstd         zstd v9      zstd v9
                           before-case1    before-case2    after        vs. case1    vs. case2
-------------------------------------------------------------------------------
Total throughput (KB/s) 5,222,213 1,076,611 6,284,884 20% 484%
Average throughput (KB/s) 174,073 35,887 209,496 20% 484%
elapsed time (sec) 120.50 347.16 109.12 -9% -69%
sys time (sec) 2,930.33 248.16 2,603.12 -11% 949%
memcg_high 416,773 552,200 481,632
memcg_swap_fail 3,192,906 1,293 749
zswpout 48,931,583 20,903 48,932,292
zswpin 384 363 395
pswpout 0 40,778,448 0
pswpin 0 16 0
thp_swpout 0 0 0
thp_swpout_fallback 0 0 0
64kB-mthp_swpout_fallback 3,192,906 1,293 749
pgmajfault 3,452 3,072 3,048
swap_ra 90 87 100
swap_ra_hit 42 43 53
ZSWPOUT-64kB n/a n/a 3,057,504
SWPOUT-64kB 0 2,548,653 0
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
64K folios: deflate-iaa:
========================
 zswap compressor          deflate-iaa     deflate-iaa     deflate-iaa    v9           v9
                           before-case1    before-case2    after          vs. case1    vs. case2
-------------------------------------------------------------------------------
Total throughput (KB/s) 5,652,608 1,089,180 7,046,527 25% 547%
Average throughput (KB/s) 188,420 36,306 234,884 25% 547%
elapsed time (sec) 102.90 343.35 88.29 -14% -74%
sys time (sec) 2,246.86 213.53 1,869.78 -17% 776%
memcg_high 576,104 502,907 657,505
memcg_swap_fail 4,016,117 1,407 1,618
zswpout 61,163,423 22,444 56,358,604
zswpin 401 368 447
pswpout 0 40,862,080 0
pswpin 0 20 0
thp_swpout 0 0 0
thp_swpout_fallback 0 0 0
64kB-mthp_swpout_fallback 4,016,117 1,407 1,618
pgmajfault 3,063 3,153 3,145
swap_ra 96 93 132
swap_ra_hit 46 45 57
ZSWPOUT-64kB n/a n/a 3,520,777
SWPOUT-64kB 0 2,553,880 0
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
2M folios: zstd:
================
 zswap compressor          zstd            zstd            zstd         zstd v9      zstd v9
                           before-case1    before-case2    after        vs. case1    vs. case2
-------------------------------------------------------------------------------
Total throughput (KB/s) 5,895,500 1,109,694 6,684,792 13% 502%
Average throughput (KB/s) 196,516 36,989 222,826 13% 502%
elapsed time (sec) 108.77 334.28 103.47 -5% -69%
sys time (sec) 2,657.14 94.88 2,334.70 -12% 2361%
memcg_high 64,200 66,316 61,944
memcg_swap_fail 101,182 70 39
zswpout 48,931,499 36,507 48,785,119
zswpin 380 379 405
pswpout 0 40,166,400 0
pswpin 0 0 0
thp_swpout 0 78,450 0
thp_swpout_fallback 101,182 70 39
2MB-mthp_swpout_fallback 0 0 39
pgmajfault 3,067 3,417 5,513
swap_ra 91 90 6,228
swap_ra_hit 45 45 6,174
ZSWPOUT-2MB n/a n/a 95,234
SWPOUT-2MB 0 78,450 0
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
2M folios: deflate-iaa:
=======================
 zswap compressor          deflate-iaa     deflate-iaa     deflate-iaa    v9           v9
                           before-case1    before-case2    after          vs. case1    vs. case2
-------------------------------------------------------------------------------
Total throughput (KB/s) 6,286,587 1,126,785 7,372,721 17% 554%
Average throughput (KB/s) 209,552 37,559 245,757 17% 554%
elapsed time (sec) 96.19 333.03 84.72 -12% -75%
sys time (sec) 2,141.44 99.96 1,748.87 -18% 1650%
memcg_high 99,253 64,666 82,564
memcg_swap_fail 129,074 53 107
zswpout 61,312,794 28,321 60,323,493
zswpin 383 406 426
pswpout 0 40,048,128 0
pswpin 0 0 0
thp_swpout 0 78,219 0
thp_swpout_fallback 129,074 53 107
2MB-mthp_swpout_fallback 0 0 107
pgmajfault 3,430 3,077 9,077
swap_ra 91 103 17,725
swap_ra_hit 47 46 17,658
ZSWPOUT-2MB n/a n/a 117,670
SWPOUT-2MB 0 78,219 0
-------------------------------------------------------------------------------
And finally, this is a comparison of deflate-iaa vs. zstd with v9 of this
patch-series:
---------------------------------------------
zswap_store large folios v9
Impr w/ deflate-iaa vs. zstd
64K folios 2M folios
---------------------------------------------
Throughput (KB/s) 12% 10%
elapsed time (sec) -19% -18%
sys time (sec) -28% -25%
---------------------------------------------
Conclusions based on the performance results:
=============================================
v9 wrt before-case1:
--------------------
We see significant improvements in throughput, elapsed and sys time for
zstd and deflate-iaa, when comparing before-case1 (THP_SWAP=N) vs. after
(THP_SWAP=Y) with zswap_store large folios.
v9 wrt before-case2:
--------------------
We see even more significant improvements in throughput and elapsed time
for zstd and deflate-iaa, when comparing before-case2 (large-folio-SSD)
vs. after (large-folio-zswap). The sys time increases with
large-folio-zswap as expected, due to the CPU compression time
vs. asynchronous disk write times, as pointed out by Ying and Yosry.
In before-case2, when zswap does not store large folios, only the allocations
and cgroup charging due to 4K folio zswap stores count towards the cgroup
memory limit. In the after scenario, however, with the introduction of
zswap_store() of large folios, the zswap compressed pool usage from large
folio stores from potentially all 30 processes is an added component that
gets counted towards the memory limit. As a result, we see higher swapout
activity in the "after" data.
Summary:
========
The v9 data presented above shows that zswap_store of large folios
demonstrates good throughput/performance improvements compared to
conventional SSD swap of large folios with a sufficiently large 525G SSD
swap device. Hence, it seems reasonable for zswap_store to support large
folios, so that further performance improvements can be implemented.
In the experimental setup used in this patchset, we have enabled IAA
compress verification to ensure additional hardware data integrity CRC
checks not currently done by the software compressors. We see good
throughput/latency improvements with deflate-iaa vs. zstd with zswap_store
of large folios.
Some of the ideas for further reducing latency that have shown promise in
our experiments are:
1) IAA compress/decompress batching.
2) Distributing compress jobs across all IAA devices on the socket.
The tests run for this patchset use only 1 IAA device per core, which
provides 2 compress engines on that device. In our experiments with IAA
batching, we distribute compress jobs from all cores to the 8 compress
engines available per socket, and further compress the pages in each folio
in parallel in the accelerator. As a result, we improve compress latency
and reclaim throughput.
In decompress batching, we use swapin_readahead to generate a prefetch
batch of 4K folios that we decompress in parallel in IAA.
------------------------------------------------------------------------------
IAA compress/decompress batching
Further improvements wrt v9 zswap_store Sequential
subpage store using "deflate-iaa":
"deflate-iaa" Batching "deflate-iaa-canned" [2] Batching
Additional Impr Additional Impr
64K folios 2M folios 64K folios 2M folios
------------------------------------------------------------------------------
Throughput (KB/s) 21% 37% 29% 48%
elapsed time (sec) -6% -13% -11% -20%
sys time (sec) 3% -3% -5% -14%
------------------------------------------------------------------------------
With zswap IAA compress/decompress batching, we are able to demonstrate
significant performance improvements and memory savings, as compared to
software compressors, in server scalability experiments under highly
contended system scenarios with significant memory pressure. We hope to
submit this work in subsequent patch series. The current patch-series is
a prerequisite for these future submissions.
Thanks,
Kanchana
[1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
[2] https://patchwork.kernel.org/project/linux-crypto/cover/cover.1710969449.git.andre.glover@linux.intel.com/
Changes since v8:
=================
1) Rebased to mm-unstable as of 9-30-2024,
commit 66af62407e82647ec5b44462dc29d50ba03fdb22.
2) Changed count_objcg_event() to become count_objcg_events() that accepts
a "count" parameter. Modified existing calls to count_objcg_event() to
instead call count_objcg_events() with a "count" of 1. Thanks Yosry for
this suggestion!
3) Avoid the check for "zswap_pool_reached_full" if zswap is disabled, as
pointed out by Yosry. Thanks Yosry!
4) Modified zswap_store_page() to take a page as input, and to derive the
swp_entry_t for the page by calling page_swap_entry() as suggested by
Johannes. Thanks Johannes!
5) Code cleanup and code comments clarifications suggested by Yosry in
zswap_store_page() and zswap_store(). Also, more concise commit log for
patch 6 as suggested by Yosry. Thanks Yosry!
6) Squashed the commits related to the sysfs hugepage zswpout stats
addition and the documentation update into a single commit, as suggested
by Nhat. Thanks Nhat!
7) Modified the latency metric change reporting convention to be "new"
vs. "old", as per Yosry's suggestion.
8) Changes in the cover-letter suggested by Yosry. Thanks Yosry!
9) Added "Acked-by" and "Reviewed-by" from Johannes, Yosry, Nhat and
Chengming. Thanks to all!
Changes since v7:
=================
1) Rebased to mm-unstable as of 9-27-2024,
commit de2fbaa6d9c3576ec7133ed02a370ec9376bf000.
2) Added Nhat's 'Reviewed-by' to patches 1 and 2. Thanks Nhat!
3) Implemented one-time obj_cgroup_may_zswap and zswap_check_limits at the
start of zswap_store. Implemented one-time batch updates to cgroup zswap
charging (with total compressed bytes), zswap_stored_pages and the
memcg/vm zswpout event stats (with folio_nr_pages()) only for successful
stores at the end of zswap_store. Thanks Yosry and Johannes for guidance
on this!
4) Changed the existing zswap_pool_get() to zswap_pool_tryget(). Modified
zswap_pool_current_get() and zswap_pool_find_get() to call
zswap_pool_tryget(). Furthermore, zswap_store() obtains a reference to a
valid zswap_pool upfront by calling zswap_pool_tryget(), and errors out
if the tryget fails. Added a new zswap_pool_get() that calls
"percpu_ref_get(&pool->ref)" and is called in zswap_store_page(), as
suggested by Johannes & Yosry. Thanks both!
5) Provided a new count_objcg_events() API for batch event updates.
6) Changed "zswap_stored_pages" to atomic_long_t to support adding
folio_nr_pages() to it once a large folio is stored successfully.
7) Deleted the refactoring done in v7 for the xarray updates in
zswap_store_page(); and unwinding of stored offsets in zswap_store() in
case of errors, as suggested by Johannes.
8) Deleted the CONFIG_ZSWAP_STORE_THP_DEFAULT_ON config option and
"zswap_mthp_enabled" tunable, as recommended by Yosry, Johannes and
Nhat.
9) Replaced references to "mTHP" with "large folios"; organized
before/after data per-compressor for easier visual comparisons;
incorporated Nhat's feedback in the documentation updates; moved
changelog to the end. Thanks Johannes, Yosry and Nhat!
10) Moved the usemem testing configuration to 30 processes, each allocating
10G within a 150G memory-limit constrained cgroup, maintaining the
allocated memory for 10 sec before exiting. Thanks Ying for this
suggestion!
Changes since v6:
=================
1) Rebased to mm-unstable as of 9-23-2024,
commit acfabf7e197f7a5bedf4749dac1f39551417b049.
2) Refactored into smaller commits, as suggested by Yosry and
Chengming. Thanks both!
3) Reworded the commit log for patches 5 and 6 as per Yosry's
suggestion. Thanks Yosry!
4) Gathered data on a Sapphire Rapids server that has 823GiB SSD swap disk
partition. Also, all experiments are run with usemem --sleep 10, so that
the memory allocated by the 70 processes remains in memory
longer. Posted elapsed and sys times. Thanks to Yosry, Nhat and Ying for
their help with refining the performance characterization methodology.
5) Updated Documentation/admin-guide/mm/transhuge.rst as suggested by
Nhat. Thanks Nhat!
Changes since v5:
=================
1) Rebased to mm-unstable as of 8/29/2024,
commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
suggestion to add a knob by which users can enable/disable this
change. Nhat, I hope this is along the lines of what you were
thinking.
3) Added vm-scalability usemem data with 4K folios with
CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
there is no regression with this change.
4) Added data with usemem with 64K and 2M THP for an alternate view of
before/after, as suggested by Yosry, so we can understand the impact
of when mTHPs are split into 4K folios in shrink_folio_list()
(CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
in zswap. Thanks Yosry for this suggestion.
Changes since v4:
=================
1) Published before/after data with zstd, as suggested by Nhat (Thanks
Nhat for the data reviews!).
2) Rebased to mm-unstable from 8/27/2024,
commit b659edec079c90012cf8d05624e312d1062b8b87.
3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
robot; as per Nhat's and Michal's suggestion to not require a separate
patch to fix the build errors (thanks both!).
4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
suggested by Yosry (Thanks Yosry!).
5) Squashed the commits that define new mthp zswpout stat counters, and
invoke count_mthp_stat() after successful zswap_store()s; into a single
commit. Thanks Yosry for this suggestion!
Changes since v3:
=================
1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
Thanks to Barry for suggesting aligning with Ryan Roberts' latest
changes to count_mthp_stat() so that it's always defined, even when THP
is disabled. Barry, I have also made one other change in page_io.c
where count_mthp_stat() is called by count_swpout_vm_event(). I would
appreciate it if you can review this. Thanks!
Hopefully this should resolve the kernel robot build errors.
Changes since v2:
=================
1) Gathered usemem data using SSD as the backing swap device for zswap,
as suggested by Ying Huang. Ying, I would appreciate it if you can
review the latest data. Thanks!
2) Generated the base commit info in the patches to attempt to address
the kernel test robot build errors.
3) No code changes to the individual patches themselves.
Changes since RFC v1:
=====================
1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
Thanks Barry!
2) Addressed some of the code review comments that Nhat Pham provided in
Ryan's initial RFC [1]:
- Added a comment about the cgroup zswap limit checks occurring once per
folio at the beginning of zswap_store().
Nhat, Ryan, please do let me know if the comments convey the summary
from the RFC discussion. Thanks!
- Posted data on running the cgroup suite's zswap kselftest.
3) Rebased to v6.11-rc3.
4) Gathered performance data with usemem and the rebased patch-series.
Kanchana P Sridhar (7):
mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
mm: zswap: Modify zswap_compress() to accept a page instead of a
folio.
mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget().
mm: Change count_objcg_event() to count_objcg_events() for batch event
updates.
mm: zswap: Modify zswap_stored_pages to be atomic_long_t.
mm: zswap: Support large folios in zswap_store().
mm: swap: Count successful large folio zswap stores in hugepage
zswpout stats.
Documentation/admin-guide/mm/transhuge.rst | 8 +-
fs/proc/meminfo.c | 2 +-
include/linux/huge_mm.h | 1 +
include/linux/memcontrol.h | 16 +-
include/linux/zswap.h | 2 +-
mm/huge_memory.c | 3 +
mm/page_io.c | 1 +
mm/zswap.c | 251 ++++++++++++++-------
8 files changed, 196 insertions(+), 88 deletions(-)
base-commit: 66af62407e82647ec5b44462dc29d50ba03fdb22
--
2.27.0
* [PATCH v9 1/7] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
2024-09-30 22:12 [PATCH v9 0/7] mm: zswap swap-out of large folios Kanchana P Sridhar
@ 2024-09-30 22:12 ` Kanchana P Sridhar
2024-09-30 22:12 ` [PATCH v9 2/7] mm: zswap: Modify zswap_compress() to accept a page instead of a folio Kanchana P Sridhar
` (5 subsequent siblings)
6 siblings, 0 replies; 29+ messages in thread
From: Kanchana P Sridhar @ 2024-09-30 22:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm, willy
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
This resolves an issue with obj_cgroup_get() not being defined if
CONFIG_MEMCG is not defined.
Before this patch, we would see build errors if obj_cgroup_get() is
called from code that is agnostic of CONFIG_MEMCG.
The zswap_store() changes for large folios in subsequent commits will
require the use of obj_cgroup_get() in zswap code that falls into this
category.
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
include/linux/memcontrol.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 34d2da05f2f1..15c2716f9aa3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1282,6 +1282,10 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
return NULL;
}
+static inline void obj_cgroup_get(struct obj_cgroup *objcg)
+{
+}
+
static inline void obj_cgroup_put(struct obj_cgroup *objcg)
{
}
--
2.27.0
* [PATCH v9 2/7] mm: zswap: Modify zswap_compress() to accept a page instead of a folio.
2024-09-30 22:12 [PATCH v9 0/7] mm: zswap swap-out of large folios Kanchana P Sridhar
2024-09-30 22:12 ` [PATCH v9 1/7] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
@ 2024-09-30 22:12 ` Kanchana P Sridhar
2024-09-30 22:16 ` Yosry Ahmed
2024-09-30 22:12 ` [PATCH v9 3/7] mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget() Kanchana P Sridhar
` (4 subsequent siblings)
6 siblings, 1 reply; 29+ messages in thread
From: Kanchana P Sridhar @ 2024-09-30 22:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm, willy
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
For zswap_store() to be able to store a large folio by compressing it
one page at a time, zswap_compress() needs to accept a page as input.
This will allow us to iterate through each page in the folio in
zswap_store(), compress it and store it in the zpool.
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/zswap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index efad4e941e44..fd7a8c14457a 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -875,7 +875,7 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
return 0;
}
-static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
+static bool zswap_compress(struct page *page, struct zswap_entry *entry)
{
struct crypto_acomp_ctx *acomp_ctx;
struct scatterlist input, output;
@@ -893,7 +893,7 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
dst = acomp_ctx->buffer;
sg_init_table(&input, 1);
- sg_set_folio(&input, folio, PAGE_SIZE, 0);
+ sg_set_page(&input, page, PAGE_SIZE, 0);
/*
* We need PAGE_SIZE * 2 here since there maybe over-compression case,
@@ -1456,7 +1456,7 @@ bool zswap_store(struct folio *folio)
mem_cgroup_put(memcg);
}
- if (!zswap_compress(folio, entry))
+ if (!zswap_compress(&folio->page, entry))
goto put_pool;
entry->swpentry = swp;
--
2.27.0
* [PATCH v9 3/7] mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget().
2024-09-30 22:12 [PATCH v9 0/7] mm: zswap swap-out of large folios Kanchana P Sridhar
2024-09-30 22:12 ` [PATCH v9 1/7] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
2024-09-30 22:12 ` [PATCH v9 2/7] mm: zswap: Modify zswap_compress() to accept a page instead of a folio Kanchana P Sridhar
@ 2024-09-30 22:12 ` Kanchana P Sridhar
2024-09-30 22:12 ` [PATCH v9 4/7] mm: Change count_objcg_event() to count_objcg_events() for batch event updates Kanchana P Sridhar
` (3 subsequent siblings)
6 siblings, 0 replies; 29+ messages in thread
From: Kanchana P Sridhar @ 2024-09-30 22:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm, willy
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
Modify the name of the existing zswap_pool_get() to zswap_pool_tryget()
to be representative of the call it makes to percpu_ref_tryget().
A subsequent patch will introduce a new zswap_pool_get() that calls
percpu_ref_get().
The intent behind this change is for higher level zswap API such as
zswap_store() to call zswap_pool_tryget() to check upfront if the pool's
refcount is "0" (which means it could be getting destroyed) and to handle
this as an error condition. zswap_store() would proceed only if
zswap_pool_tryget() returns success, and any additional pool refcounts that
need to be obtained for compressing sub-pages in a large folio could simply
call zswap_pool_get().
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
---
mm/zswap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index fd7a8c14457a..0f281e50a034 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -403,7 +403,7 @@ static void __zswap_pool_empty(struct percpu_ref *ref)
spin_unlock_bh(&zswap_pools_lock);
}
-static int __must_check zswap_pool_get(struct zswap_pool *pool)
+static int __must_check zswap_pool_tryget(struct zswap_pool *pool)
{
if (!pool)
return 0;
@@ -441,7 +441,7 @@ static struct zswap_pool *zswap_pool_current_get(void)
rcu_read_lock();
pool = __zswap_pool_current();
- if (!zswap_pool_get(pool))
+ if (!zswap_pool_tryget(pool))
pool = NULL;
rcu_read_unlock();
@@ -462,7 +462,7 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor)
if (strcmp(zpool_get_type(pool->zpool), type))
continue;
/* if we can't get it, it's about to be destroyed */
- if (!zswap_pool_get(pool))
+ if (!zswap_pool_tryget(pool))
continue;
return pool;
}
--
2.27.0
* [PATCH v9 4/7] mm: Change count_objcg_event() to count_objcg_events() for batch event updates.
2024-09-30 22:12 [PATCH v9 0/7] mm: zswap swap-out of large folios Kanchana P Sridhar
` (2 preceding siblings ...)
2024-09-30 22:12 ` [PATCH v9 3/7] mm: zswap: Rename zswap_pool_get() to zswap_pool_tryget() Kanchana P Sridhar
@ 2024-09-30 22:12 ` Kanchana P Sridhar
2024-09-30 22:18 ` Yosry Ahmed
` (2 more replies)
2024-09-30 22:12 ` [PATCH v9 5/7] mm: zswap: Modify zswap_stored_pages to be atomic_long_t Kanchana P Sridhar
` (2 subsequent siblings)
6 siblings, 3 replies; 29+ messages in thread
From: Kanchana P Sridhar @ 2024-09-30 22:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm, willy
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
With the introduction of zswap_store() swapping out large folios,
we need to efficiently update the objcg's memcg events once per
successfully stored folio. For instance, the 'ZSWPOUT' event needs
to be incremented by folio_nr_pages().
To facilitate this, the existing count_objcg_event() API is modified
to be count_objcg_events() that additionally accepts a count parameter.
The only existing calls to count_objcg_event() are in zswap.c - these
have been modified to call count_objcg_events() with a count of 1.
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
include/linux/memcontrol.h | 12 +++++++-----
mm/zswap.c | 6 +++---
2 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 15c2716f9aa3..524006313b0d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1764,8 +1764,9 @@ static inline int memcg_kmem_id(struct mem_cgroup *memcg)
struct mem_cgroup *mem_cgroup_from_slab_obj(void *p);
-static inline void count_objcg_event(struct obj_cgroup *objcg,
- enum vm_event_item idx)
+static inline void count_objcg_events(struct obj_cgroup *objcg,
+ enum vm_event_item idx,
+ unsigned long count)
{
struct mem_cgroup *memcg;
@@ -1774,7 +1775,7 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
rcu_read_lock();
memcg = obj_cgroup_memcg(objcg);
- count_memcg_events(memcg, idx, 1);
+ count_memcg_events(memcg, idx, count);
rcu_read_unlock();
}
@@ -1829,8 +1830,9 @@ static inline struct mem_cgroup *mem_cgroup_from_slab_obj(void *p)
return NULL;
}
-static inline void count_objcg_event(struct obj_cgroup *objcg,
- enum vm_event_item idx)
+static inline void count_objcg_events(struct obj_cgroup *objcg,
+ enum vm_event_item idx,
+ unsigned long count)
{
}
diff --git a/mm/zswap.c b/mm/zswap.c
index 0f281e50a034..69b9c025fd47 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1053,7 +1053,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
count_vm_event(ZSWPWB);
if (entry->objcg)
- count_objcg_event(entry->objcg, ZSWPWB);
+ count_objcg_events(entry->objcg, ZSWPWB, 1);
zswap_entry_free(entry);
@@ -1482,7 +1482,7 @@ bool zswap_store(struct folio *folio)
if (objcg) {
obj_cgroup_charge_zswap(objcg, entry->length);
- count_objcg_event(objcg, ZSWPOUT);
+ count_objcg_events(objcg, ZSWPOUT, 1);
}
/*
@@ -1576,7 +1576,7 @@ bool zswap_load(struct folio *folio)
count_vm_event(ZSWPIN);
if (entry->objcg)
- count_objcg_event(entry->objcg, ZSWPIN);
+ count_objcg_events(entry->objcg, ZSWPIN, 1);
if (swapcache) {
zswap_entry_free(entry);
--
2.27.0
* [PATCH v9 5/7] mm: zswap: Modify zswap_stored_pages to be atomic_long_t.
2024-09-30 22:12 [PATCH v9 0/7] mm: zswap swap-out of large folios Kanchana P Sridhar
` (3 preceding siblings ...)
2024-09-30 22:12 ` [PATCH v9 4/7] mm: Change count_objcg_event() to count_objcg_events() for batch event updates Kanchana P Sridhar
@ 2024-09-30 22:12 ` Kanchana P Sridhar
2024-09-30 22:12 ` [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store() Kanchana P Sridhar
2024-09-30 22:12 ` [PATCH v9 7/7] mm: swap: Count successful large folio zswap stores in hugepage zswpout stats Kanchana P Sridhar
6 siblings, 0 replies; 29+ messages in thread
From: Kanchana P Sridhar @ 2024-09-30 22:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm, willy
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
For zswap_store() to support large folios, we need to be able to do
a batch update of zswap_stored_pages upon successful store of all pages
in the folio. For this, we need to add folio_nr_pages(), which returns
a long, to zswap_stored_pages.
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
---
fs/proc/meminfo.c | 2 +-
include/linux/zswap.h | 2 +-
mm/zswap.c | 19 +++++++++++++------
3 files changed, 15 insertions(+), 8 deletions(-)
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 245171d9164b..8ba9b1472390 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -91,7 +91,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
#ifdef CONFIG_ZSWAP
show_val_kb(m, "Zswap: ", zswap_total_pages());
seq_printf(m, "Zswapped: %8lu kB\n",
- (unsigned long)atomic_read(&zswap_stored_pages) <<
+ (unsigned long)atomic_long_read(&zswap_stored_pages) <<
(PAGE_SHIFT - 10));
#endif
show_val_kb(m, "Dirty: ",
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index 9cd1beef0654..d961ead91bf1 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -7,7 +7,7 @@
struct lruvec;
-extern atomic_t zswap_stored_pages;
+extern atomic_long_t zswap_stored_pages;
#ifdef CONFIG_ZSWAP
diff --git a/mm/zswap.c b/mm/zswap.c
index 69b9c025fd47..2b8da50f6322 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -43,7 +43,7 @@
* statistics
**********************************/
/* The number of compressed pages currently stored in zswap */
-atomic_t zswap_stored_pages = ATOMIC_INIT(0);
+atomic_long_t zswap_stored_pages = ATOMIC_INIT(0);
/*
* The statistics below are not protected from concurrent access for
@@ -802,7 +802,7 @@ static void zswap_entry_free(struct zswap_entry *entry)
obj_cgroup_put(entry->objcg);
}
zswap_entry_cache_free(entry);
- atomic_dec(&zswap_stored_pages);
+ atomic_long_dec(&zswap_stored_pages);
}
/*********************************
@@ -1232,7 +1232,7 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
} else {
nr_backing = zswap_total_pages();
- nr_stored = atomic_read(&zswap_stored_pages);
+ nr_stored = atomic_long_read(&zswap_stored_pages);
}
if (!nr_stored)
@@ -1501,7 +1501,7 @@ bool zswap_store(struct folio *folio)
}
/* update stats */
- atomic_inc(&zswap_stored_pages);
+ atomic_long_inc(&zswap_stored_pages);
count_vm_event(ZSWPOUT);
return true;
@@ -1650,6 +1650,13 @@ static int debugfs_get_total_size(void *data, u64 *val)
}
DEFINE_DEBUGFS_ATTRIBUTE(total_size_fops, debugfs_get_total_size, NULL, "%llu\n");
+static int debugfs_get_stored_pages(void *data, u64 *val)
+{
+ *val = atomic_long_read(&zswap_stored_pages);
+ return 0;
+}
+DEFINE_DEBUGFS_ATTRIBUTE(stored_pages_fops, debugfs_get_stored_pages, NULL, "%llu\n");
+
static int zswap_debugfs_init(void)
{
if (!debugfs_initialized())
@@ -1673,8 +1680,8 @@ static int zswap_debugfs_init(void)
zswap_debugfs_root, &zswap_written_back_pages);
debugfs_create_file("pool_total_size", 0444,
zswap_debugfs_root, NULL, &total_size_fops);
- debugfs_create_atomic_t("stored_pages", 0444,
- zswap_debugfs_root, &zswap_stored_pages);
+ debugfs_create_file("stored_pages", 0444,
+ zswap_debugfs_root, NULL, &stored_pages_fops);
return 0;
}
--
2.27.0
* [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
2024-09-30 22:12 [PATCH v9 0/7] mm: zswap swap-out of large folios Kanchana P Sridhar
` (4 preceding siblings ...)
2024-09-30 22:12 ` [PATCH v9 5/7] mm: zswap: Modify zswap_stored_pages to be atomic_long_t Kanchana P Sridhar
@ 2024-09-30 22:12 ` Kanchana P Sridhar
2024-09-30 23:09 ` Yosry Ahmed
` (2 more replies)
2024-09-30 22:12 ` [PATCH v9 7/7] mm: swap: Count successful large folio zswap stores in hugepage zswpout stats Kanchana P Sridhar
6 siblings, 3 replies; 29+ messages in thread
From: Kanchana P Sridhar @ 2024-09-30 22:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm, willy
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
zswap_store() will store large folios by compressing them page by page.
This patch provides a sequential implementation of storing a large folio
in zswap_store() by iterating through each page in the folio to compress
and store it in the zswap zpool.
zswap_store() calls the newly added zswap_store_page() function for each
page in the folio. zswap_store_page() handles compressing and storing each
page.
We check the global and per-cgroup limits once at the beginning of
zswap_store(), and only check that the limit is not reached yet. This is
racy and inaccurate, but it should be sufficient for now. We also obtain
initial references to the relevant objcg and pool to guarantee that
subsequent references can be acquired by zswap_store_page(). A new function
zswap_pool_get() is added to facilitate this.
If these one-time checks pass, we compress the pages of the folio, while
maintaining a running count of compressed bytes for all the folio's pages.
If all pages are successfully compressed and stored, we do the cgroup
zswap charging with the total compressed bytes, and batch update the
zswap_stored_pages atomic/zswpout event stats with folio_nr_pages() once,
before returning from zswap_store().
If an error is encountered during the store of any page in the folio,
all pages in that folio currently stored in zswap will be invalidated.
Thus, a folio is either entirely stored in zswap, or entirely not stored
in zswap.
The most important value provided by this patch is that it enables swapping out
large folios to zswap without splitting them. Furthermore, it batches some
operations while doing so (cgroup charging, stats updates).
This patch also forms the basis for building compress batching of pages in
a large folio in zswap_store() by compressing up to say, 8 pages of the
folio in parallel in hardware using the Intel In-Memory Analytics
Accelerator (Intel IAA).
This change reuses and adapts the functionality in Ryan Roberts' RFC
patch [1]:
"[RFC,v1] mm: zswap: Store large folios without splitting"
[1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
Also, addressed some of the RFC comments from the discussion in [1].
Co-developed-by: Ryan Roberts
Signed-off-by:
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
mm/zswap.c | 220 +++++++++++++++++++++++++++++++++++++----------------
1 file changed, 153 insertions(+), 67 deletions(-)
diff --git a/mm/zswap.c b/mm/zswap.c
index 2b8da50f6322..b74c8de99646 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -411,6 +411,12 @@ static int __must_check zswap_pool_tryget(struct zswap_pool *pool)
return percpu_ref_tryget(&pool->ref);
}
+/* The caller must already have a reference. */
+static void zswap_pool_get(struct zswap_pool *pool)
+{
+ percpu_ref_get(&pool->ref);
+}
+
static void zswap_pool_put(struct zswap_pool *pool)
{
percpu_ref_put(&pool->ref);
@@ -1402,68 +1408,52 @@ static void shrink_worker(struct work_struct *w)
/*********************************
* main API
**********************************/
-bool zswap_store(struct folio *folio)
+
+/*
+ * Stores the given page of a folio in zswap.
+ *
+ * @page: The page to store in zswap.
+ * @objcg: The folio's objcg. Caller has a reference.
+ * @pool: The zswap_pool to store the compressed data for the page.
+ * The caller should have obtained a reference to a valid
+ * zswap_pool by calling zswap_pool_tryget(), to pass as this
+ * argument.
+ * @tree: The xarray for the @page's folio's swap.
+ * @compressed_bytes: The compressed entry->length value is added
+ * to this, so that the caller can get the total
+ * compressed lengths of all sub-pages in a folio.
+ */
+static bool zswap_store_page(struct page *page,
+ struct obj_cgroup *objcg,
+ struct zswap_pool *pool,
+ struct xarray *tree,
+ size_t *compressed_bytes)
{
- swp_entry_t swp = folio->swap;
- pgoff_t offset = swp_offset(swp);
- struct xarray *tree = swap_zswap_tree(swp);
struct zswap_entry *entry, *old;
- struct obj_cgroup *objcg = NULL;
- struct mem_cgroup *memcg = NULL;
-
- VM_WARN_ON_ONCE(!folio_test_locked(folio));
- VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
-
- /* Large folios aren't supported */
- if (folio_test_large(folio))
- return false;
-
- if (!zswap_enabled)
- goto check_old;
-
- /* Check cgroup limits */
- objcg = get_obj_cgroup_from_folio(folio);
- if (objcg && !obj_cgroup_may_zswap(objcg)) {
- memcg = get_mem_cgroup_from_objcg(objcg);
- if (shrink_memcg(memcg)) {
- mem_cgroup_put(memcg);
- goto reject;
- }
- mem_cgroup_put(memcg);
- }
-
- if (zswap_check_limits())
- goto reject;
/* allocate entry */
- entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
+ entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(page_folio(page)));
if (!entry) {
zswap_reject_kmemcache_fail++;
goto reject;
}
- /* if entry is successfully added, it keeps the reference */
- entry->pool = zswap_pool_current_get();
- if (!entry->pool)
- goto freepage;
+ /* zswap_store() already holds a ref on 'objcg' and 'pool' */
+ if (objcg)
+ obj_cgroup_get(objcg);
+ zswap_pool_get(pool);
- if (objcg) {
- memcg = get_mem_cgroup_from_objcg(objcg);
- if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
- mem_cgroup_put(memcg);
- goto put_pool;
- }
- mem_cgroup_put(memcg);
- }
+ /* if entry is successfully added, it keeps the reference */
+ entry->pool = pool;
- if (!zswap_compress(&folio->page, entry))
- goto put_pool;
+ if (!zswap_compress(page, entry))
+ goto put_pool_objcg;
- entry->swpentry = swp;
+ entry->swpentry = page_swap_entry(page);
entry->objcg = objcg;
entry->referenced = true;
- old = xa_store(tree, offset, entry, GFP_KERNEL);
+ old = xa_store(tree, swp_offset(entry->swpentry), entry, GFP_KERNEL);
if (xa_is_err(old)) {
int err = xa_err(old);
@@ -1480,11 +1470,6 @@ bool zswap_store(struct folio *folio)
if (old)
zswap_entry_free(old);
- if (objcg) {
- obj_cgroup_charge_zswap(objcg, entry->length);
- count_objcg_events(objcg, ZSWPOUT, 1);
- }
-
/*
* We finish initializing the entry while it's already in xarray.
* This is safe because:
@@ -1496,36 +1481,137 @@ bool zswap_store(struct folio *folio)
* an incoherent entry.
*/
if (entry->length) {
+ *compressed_bytes += entry->length;
INIT_LIST_HEAD(&entry->lru);
zswap_lru_add(&zswap_list_lru, entry);
}
- /* update stats */
- atomic_long_inc(&zswap_stored_pages);
- count_vm_event(ZSWPOUT);
-
+ /*
+ * We shouldn't have any possibility of failure after the entry is
+ * added in the xarray. The pool/objcg refs obtained here will only
+ * be dropped if/when zswap_entry_free() gets called.
+ */
return true;
store_failed:
zpool_free(entry->pool->zpool, entry->handle);
-put_pool:
- zswap_pool_put(entry->pool);
-freepage:
+put_pool_objcg:
+ zswap_pool_put(pool);
+ obj_cgroup_put(objcg);
zswap_entry_cache_free(entry);
reject:
+ return false;
+}
+
+bool zswap_store(struct folio *folio)
+{
+ long nr_pages = folio_nr_pages(folio);
+ swp_entry_t swp = folio->swap;
+ struct xarray *tree = swap_zswap_tree(swp);
+ struct obj_cgroup *objcg = NULL;
+ struct mem_cgroup *memcg = NULL;
+ struct zswap_pool *pool;
+ size_t compressed_bytes = 0;
+ bool ret = false;
+ long index;
+
+ VM_WARN_ON_ONCE(!folio_test_locked(folio));
+ VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
+
+ if (!zswap_enabled)
+ goto check_old;
+
+ /*
+ * Check cgroup zswap limits:
+ *
+ * The cgroup zswap limit check is done once at the beginning of
+ * zswap_store(). The cgroup charging is done once, at the end
+ * of a successful folio store. What this means is, if the cgroup
+ * was within the zswap_max limit at the beginning of a large folio
+ * store, it could go over the limit by at most (HPAGE_PMD_NR - 1)
+ * pages due to this store.
+ */
+ objcg = get_obj_cgroup_from_folio(folio);
+ if (objcg && !obj_cgroup_may_zswap(objcg)) {
+ memcg = get_mem_cgroup_from_objcg(objcg);
+ if (shrink_memcg(memcg)) {
+ mem_cgroup_put(memcg);
+ goto put_objcg;
+ }
+ mem_cgroup_put(memcg);
+ }
+
+ /*
+ * Check zpool utilization against zswap limits:
+ *
+ * The zswap zpool utilization is also checked against the limits
+ * just once, at the start of zswap_store(). If the check passes,
+ * any breaches of the limits set by zswap_max_pages() or
+ * zswap_accept_thr_pages() that may happen while storing this
+ * folio, will only be detected during the next call to
+ * zswap_store() by any process.
+ */
+ if (zswap_check_limits())
+ goto put_objcg;
+
+ pool = zswap_pool_current_get();
+ if (!pool)
+ goto put_objcg;
+
+ if (objcg) {
+ memcg = get_mem_cgroup_from_objcg(objcg);
+ if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
+ mem_cgroup_put(memcg);
+ goto put_pool;
+ }
+ mem_cgroup_put(memcg);
+ }
+
+ /*
+ * Store each page of the folio as a separate entry. If we fail to
+ * store a page, unwind by deleting all the pages for this folio
+ * currently in zswap.
+ */
+ for (index = 0; index < nr_pages; ++index) {
+ if (!zswap_store_page(folio_page(folio, index), objcg, pool, tree, &compressed_bytes))
+ goto put_pool;
+ }
+
+ if (objcg) {
+ obj_cgroup_charge_zswap(objcg, compressed_bytes);
+ count_objcg_events(objcg, ZSWPOUT, nr_pages);
+ }
+
+ atomic_long_add(nr_pages, &zswap_stored_pages);
+ count_vm_events(ZSWPOUT, nr_pages);
+
+ ret = true;
+
+put_pool:
+ zswap_pool_put(pool);
+put_objcg:
obj_cgroup_put(objcg);
- if (zswap_pool_reached_full)
+ if (!ret && zswap_pool_reached_full)
queue_work(shrink_wq, &zswap_shrink_work);
check_old:
/*
- * If the zswap store fails or zswap is disabled, we must invalidate the
- * possibly stale entry which was previously stored at this offset.
- * Otherwise, writeback could overwrite the new data in the swapfile.
+ * If the zswap store fails or zswap is disabled, we must invalidate
+ * the possibly stale entries which were previously stored at the
+ * offsets corresponding to each page of the folio. Otherwise,
+ * writeback could overwrite the new data in the swapfile.
*/
- entry = xa_erase(tree, offset);
- if (entry)
- zswap_entry_free(entry);
- return false;
+ if (!ret) {
+ pgoff_t offset = swp_offset(swp);
+ struct zswap_entry *entry;
+
+ for (index = 0; index < nr_pages; ++index) {
+ entry = xa_erase(tree, offset + index);
+ if (entry)
+ zswap_entry_free(entry);
+ }
+ }
+
+ return ret;
}
bool zswap_load(struct folio *folio)
--
2.27.0
* [PATCH v9 7/7] mm: swap: Count successful large folio zswap stores in hugepage zswpout stats.
2024-09-30 22:12 [PATCH v9 0/7] mm: zswap swap-out of large folios Kanchana P Sridhar
` (5 preceding siblings ...)
2024-09-30 22:12 ` [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store() Kanchana P Sridhar
@ 2024-09-30 22:12 ` Kanchana P Sridhar
6 siblings, 0 replies; 29+ messages in thread
From: Kanchana P Sridhar @ 2024-09-30 22:12 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm, willy
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
Added a new MTHP_STAT_ZSWPOUT entry to the sysfs transparent_hugepage
stats so that successful large folio zswap stores can be accounted under
the per-order sysfs "zswpout" stats:
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
Other non-zswap swap device swap-out events will be counted under
the existing sysfs "swpout" stats:
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout
Also, added documentation for the newly added sysfs per-order hugepage
"zswpout" stats. The documentation clarifies that only non-zswap swapouts
will be accounted in the existing "swpout" stats.
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
---
Documentation/admin-guide/mm/transhuge.rst | 8 ++++++--
include/linux/huge_mm.h | 1 +
mm/huge_memory.c | 3 +++
mm/page_io.c | 1 +
4 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index cfdd16a52e39..2a171ed5206e 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -530,10 +530,14 @@ anon_fault_fallback_charge
instead falls back to using huge pages with lower orders or
small pages even though the allocation was successful.
-swpout
- is incremented every time a huge page is swapped out in one
+zswpout
+ is incremented every time a huge page is swapped out to zswap in one
piece without splitting.
+swpout
+ is incremented every time a huge page is swapped out to a non-zswap
+ swap device in one piece without splitting.
+
swpout_fallback
is incremented if a huge page has to be split before swapout.
Usually because failed to allocate some continuous swap space
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5eb4b0376c7d..3eca60f3d512 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -119,6 +119,7 @@ enum mthp_stat_item {
MTHP_STAT_ANON_FAULT_ALLOC,
MTHP_STAT_ANON_FAULT_FALLBACK,
MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE,
+ MTHP_STAT_ZSWPOUT,
MTHP_STAT_SWPOUT,
MTHP_STAT_SWPOUT_FALLBACK,
MTHP_STAT_SHMEM_ALLOC,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 13bf59b84075..f596f57a3a90 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -611,6 +611,7 @@ static struct kobj_attribute _name##_attr = __ATTR_RO(_name)
DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC);
DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK);
DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
+DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT);
DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT);
DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK);
#ifdef CONFIG_SHMEM
@@ -629,6 +630,7 @@ static struct attribute *anon_stats_attrs[] = {
&anon_fault_fallback_attr.attr,
&anon_fault_fallback_charge_attr.attr,
#ifndef CONFIG_SHMEM
+ &zswpout_attr.attr,
&swpout_attr.attr,
&swpout_fallback_attr.attr,
#endif
@@ -659,6 +661,7 @@ static struct attribute_group file_stats_attr_grp = {
static struct attribute *any_stats_attrs[] = {
#ifdef CONFIG_SHMEM
+ &zswpout_attr.attr,
&swpout_attr.attr,
&swpout_fallback_attr.attr,
#endif
diff --git a/mm/page_io.c b/mm/page_io.c
index bc1183299a7d..4aa34862676f 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -269,6 +269,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
swap_zeromap_folio_clear(folio);
}
if (zswap_store(folio)) {
+ count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT);
folio_unlock(folio);
return 0;
}
--
2.27.0
* Re: [PATCH v9 2/7] mm: zswap: Modify zswap_compress() to accept a page instead of a folio.
2024-09-30 22:12 ` [PATCH v9 2/7] mm: zswap: Modify zswap_compress() to accept a page instead of a folio Kanchana P Sridhar
@ 2024-09-30 22:16 ` Yosry Ahmed
0 siblings, 0 replies; 29+ messages in thread
From: Yosry Ahmed @ 2024-09-30 22:16 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
akpm, willy, nanhai.zou, wajdi.k.feghali, vinodh.gopal
On Mon, Sep 30, 2024 at 3:12 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> For zswap_store() to be able to store a large folio by compressing it
> one page at a time, zswap_compress() needs to accept a page as input.
> This will allow us to iterate through each page in the folio in
> zswap_store(), compress it and store it in the zpool.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
> ---
> mm/zswap.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index efad4e941e44..fd7a8c14457a 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -875,7 +875,7 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
> return 0;
> }
>
> -static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
> +static bool zswap_compress(struct page *page, struct zswap_entry *entry)
> {
> struct crypto_acomp_ctx *acomp_ctx;
> struct scatterlist input, output;
> @@ -893,7 +893,7 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
>
> dst = acomp_ctx->buffer;
> sg_init_table(&input, 1);
> - sg_set_folio(&input, folio, PAGE_SIZE, 0);
> + sg_set_page(&input, page, PAGE_SIZE, 0);
>
> /*
> * We need PAGE_SIZE * 2 here since there maybe over-compression case,
> @@ -1456,7 +1456,7 @@ bool zswap_store(struct folio *folio)
> mem_cgroup_put(memcg);
> }
>
> - if (!zswap_compress(folio, entry))
> + if (!zswap_compress(&folio->page, entry))
> goto put_pool;
>
> entry->swpentry = swp;
> --
> 2.27.0
>
* Re: [PATCH v9 4/7] mm: Change count_objcg_event() to count_objcg_events() for batch event updates.
2024-09-30 22:12 ` [PATCH v9 4/7] mm: Change count_objcg_event() to count_objcg_events() for batch event updates Kanchana P Sridhar
@ 2024-09-30 22:18 ` Yosry Ahmed
2024-09-30 22:59 ` Nhat Pham
2024-10-01 0:22 ` Johannes Weiner
2 siblings, 0 replies; 29+ messages in thread
From: Yosry Ahmed @ 2024-09-30 22:18 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
akpm, willy, nanhai.zou, wajdi.k.feghali, vinodh.gopal
On Mon, Sep 30, 2024 at 3:12 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> With the introduction of zswap_store() swapping out large folios,
> we need to efficiently update the objcg's memcg events once per
> successfully stored folio. For instance, the 'ZSWPOUT' event needs
> to be incremented by folio_nr_pages().
>
> To facilitate this, the existing count_objcg_event() API is modified
> to be count_objcg_events() that additionally accepts a count parameter.
> The only existing calls to count_objcg_event() are in zswap.c - these
> have been modified to call count_objcg_events() with a count of 1.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
> ---
> include/linux/memcontrol.h | 12 +++++++-----
> mm/zswap.c | 6 +++---
> 2 files changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 15c2716f9aa3..524006313b0d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1764,8 +1764,9 @@ static inline int memcg_kmem_id(struct mem_cgroup *memcg)
>
> struct mem_cgroup *mem_cgroup_from_slab_obj(void *p);
>
> -static inline void count_objcg_event(struct obj_cgroup *objcg,
> - enum vm_event_item idx)
> +static inline void count_objcg_events(struct obj_cgroup *objcg,
> + enum vm_event_item idx,
> + unsigned long count)
> {
> struct mem_cgroup *memcg;
>
> @@ -1774,7 +1775,7 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
>
> rcu_read_lock();
> memcg = obj_cgroup_memcg(objcg);
> - count_memcg_events(memcg, idx, 1);
> + count_memcg_events(memcg, idx, count);
> rcu_read_unlock();
> }
>
> @@ -1829,8 +1830,9 @@ static inline struct mem_cgroup *mem_cgroup_from_slab_obj(void *p)
> return NULL;
> }
>
> -static inline void count_objcg_event(struct obj_cgroup *objcg,
> - enum vm_event_item idx)
> +static inline void count_objcg_events(struct obj_cgroup *objcg,
> + enum vm_event_item idx,
> + unsigned long count)
> {
> }
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 0f281e50a034..69b9c025fd47 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1053,7 +1053,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>
> count_vm_event(ZSWPWB);
> if (entry->objcg)
> - count_objcg_event(entry->objcg, ZSWPWB);
> + count_objcg_events(entry->objcg, ZSWPWB, 1);
>
> zswap_entry_free(entry);
>
> @@ -1482,7 +1482,7 @@ bool zswap_store(struct folio *folio)
>
> if (objcg) {
> obj_cgroup_charge_zswap(objcg, entry->length);
> - count_objcg_event(objcg, ZSWPOUT);
> + count_objcg_events(objcg, ZSWPOUT, 1);
> }
>
> /*
> @@ -1576,7 +1576,7 @@ bool zswap_load(struct folio *folio)
>
> count_vm_event(ZSWPIN);
> if (entry->objcg)
> - count_objcg_event(entry->objcg, ZSWPIN);
> + count_objcg_events(entry->objcg, ZSWPIN, 1);
>
> if (swapcache) {
> zswap_entry_free(entry);
> --
> 2.27.0
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v9 4/7] mm: Change count_objcg_event() to count_objcg_events() for batch event updates.
2024-09-30 22:12 ` [PATCH v9 4/7] mm: Change count_objcg_event() to count_objcg_events() for batch event updates Kanchana P Sridhar
2024-09-30 22:18 ` Yosry Ahmed
@ 2024-09-30 22:59 ` Nhat Pham
2024-10-01 0:22 ` Johannes Weiner
2 siblings, 0 replies; 29+ messages in thread
From: Nhat Pham @ 2024-09-30 22:59 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
akpm, willy, nanhai.zou, wajdi.k.feghali, vinodh.gopal
On Mon, Sep 30, 2024 at 3:12 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> With the introduction of zswap_store() swapping out large folios,
> we need to efficiently update the objcg's memcg events once per
> successfully stored folio. For instance, the 'ZSWPOUT' event needs
> to be incremented by folio_nr_pages().
>
> To facilitate this, the existing count_objcg_event() API is modified
> to be count_objcg_events() that additionally accepts a count parameter.
> The only existing calls to count_objcg_event() are in zswap.c - these
> have been modified to call count_objcg_events() with a count of 1.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
LGTM!
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
2024-09-30 22:12 ` [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store() Kanchana P Sridhar
@ 2024-09-30 23:09 ` Yosry Ahmed
2024-10-01 0:35 ` Sridhar, Kanchana P
2024-09-30 23:11 ` Nhat Pham
2024-10-01 1:14 ` Johannes Weiner
2 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2024-09-30 23:09 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
akpm, willy, nanhai.zou, wajdi.k.feghali, vinodh.gopal
On Mon, Sep 30, 2024 at 3:12 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> zswap_store() will store large folios by compressing them page by page.
>
> This patch provides a sequential implementation of storing a large folio
> in zswap_store() by iterating through each page in the folio to compress
> and store it in the zswap zpool.
>
> zswap_store() calls the newly added zswap_store_page() function for each
> page in the folio. zswap_store_page() handles compressing and storing each
> page.
>
> We check the global and per-cgroup limits once at the beginning of
> zswap_store(), and only check that the limit is not reached yet. This is
> racy and inaccurate, but it should be sufficient for now. We also obtain
> initial references to the relevant objcg and pool to guarantee that
> subsequent references can be acquired by zswap_store_page(). A new function
> zswap_pool_get() is added to facilitate this.
>
> If these one-time checks pass, we compress the pages of the folio, while
> maintaining a running count of compressed bytes for all the folio's pages.
> If all pages are successfully compressed and stored, we do the cgroup
> zswap charging with the total compressed bytes, and batch update the
> zswap_stored_pages atomic/zswpout event stats with folio_nr_pages() once,
> before returning from zswap_store().
>
> If an error is encountered during the store of any page in the folio,
> all pages in that folio currently stored in zswap will be invalidated.
> Thus, a folio is either entirely stored in zswap, or entirely not stored
> in zswap.
>
> The most important value provided by this patch is it enables swapping out
> large folios to zswap without splitting them. Furthermore, it batches some
> operations while doing so (cgroup charging, stats updates).
>
> This patch also forms the basis for building compress batching of pages in
> a large folio in zswap_store() by compressing up to say, 8 pages of the
> folio in parallel in hardware using the Intel In-Memory Analytics
> Accelerator (Intel IAA).
>
> This change reuses and adapts the functionality in Ryan Roberts' RFC
> patch [1]:
>
> "[RFC,v1] mm: zswap: Store large folios without splitting"
>
> [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Also, addressed some of the RFC comments from the discussion in [1].
>
> Co-developed-by: Ryan Roberts
> Signed-off-by:
We haven't been able to get a signoff from Ryan. Andrew, what's the policy here?
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
> mm/zswap.c | 220 +++++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 153 insertions(+), 67 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 2b8da50f6322..b74c8de99646 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -411,6 +411,12 @@ static int __must_check zswap_pool_tryget(struct zswap_pool *pool)
> return percpu_ref_tryget(&pool->ref);
> }
>
> +/* The caller must already have a reference. */
> +static void zswap_pool_get(struct zswap_pool *pool)
> +{
> + percpu_ref_get(&pool->ref);
> +}
> +
> static void zswap_pool_put(struct zswap_pool *pool)
> {
> percpu_ref_put(&pool->ref);
> @@ -1402,68 +1408,52 @@ static void shrink_worker(struct work_struct *w)
> /*********************************
> * main API
> **********************************/
> -bool zswap_store(struct folio *folio)
> +
> +/*
> + * Stores the page at specified "index" in a folio.
> + *
> + * @page: The page to store in zswap.
> + * @objcg: The folio's objcg. Caller has a reference.
> + * @pool: The zswap_pool to store the compressed data for the page.
> + * The caller should have obtained a reference to a valid
> + * zswap_pool by calling zswap_pool_tryget(), to pass as this
> + * argument.
> + * @tree: The xarray for the @page's folio's swap.
> + * @compressed_bytes: The compressed entry->length value is added
> + * to this, so that the caller can get the total
> + * compressed lengths of all sub-pages in a folio.
> + */
> +static bool zswap_store_page(struct page *page,
> + struct obj_cgroup *objcg,
> + struct zswap_pool *pool,
> + struct xarray *tree,
> + size_t *compressed_bytes)
> {
> - swp_entry_t swp = folio->swap;
> - pgoff_t offset = swp_offset(swp);
> - struct xarray *tree = swap_zswap_tree(swp);
> struct zswap_entry *entry, *old;
> - struct obj_cgroup *objcg = NULL;
> - struct mem_cgroup *memcg = NULL;
> -
> - VM_WARN_ON_ONCE(!folio_test_locked(folio));
> - VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> -
> - /* Large folios aren't supported */
> - if (folio_test_large(folio))
> - return false;
> -
> - if (!zswap_enabled)
> - goto check_old;
> -
> - /* Check cgroup limits */
> - objcg = get_obj_cgroup_from_folio(folio);
> - if (objcg && !obj_cgroup_may_zswap(objcg)) {
> - memcg = get_mem_cgroup_from_objcg(objcg);
> - if (shrink_memcg(memcg)) {
> - mem_cgroup_put(memcg);
> - goto reject;
> - }
> - mem_cgroup_put(memcg);
> - }
> -
> - if (zswap_check_limits())
> - goto reject;
>
> /* allocate entry */
> - entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> + entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(page_folio(page)));
Can we just use page_to_nid() here? I think the node info exists even
for tail pages, right?
> if (!entry) {
> zswap_reject_kmemcache_fail++;
> goto reject;
> }
>
> - /* if entry is successfully added, it keeps the reference */
> - entry->pool = zswap_pool_current_get();
> - if (!entry->pool)
> - goto freepage;
> + /* zswap_store() already holds a ref on 'objcg' and 'pool' */
> + if (objcg)
> + obj_cgroup_get(objcg);
> + zswap_pool_get(pool);
>
> - if (objcg) {
> - memcg = get_mem_cgroup_from_objcg(objcg);
> - if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> - mem_cgroup_put(memcg);
> - goto put_pool;
> - }
> - mem_cgroup_put(memcg);
> - }
> + /* if entry is successfully added, it keeps the reference */
> + entry->pool = pool;
>
> - if (!zswap_compress(&folio->page, entry))
> - goto put_pool;
> + if (!zswap_compress(page, entry))
> + goto put_pool_objcg;
>
> - entry->swpentry = swp;
> + entry->swpentry = page_swap_entry(page);
> entry->objcg = objcg;
> entry->referenced = true;
>
> - old = xa_store(tree, offset, entry, GFP_KERNEL);
> + old = xa_store(tree, swp_offset(entry->swpentry), entry, GFP_KERNEL);
> if (xa_is_err(old)) {
> int err = xa_err(old);
>
> @@ -1480,11 +1470,6 @@ bool zswap_store(struct folio *folio)
> if (old)
> zswap_entry_free(old);
>
> - if (objcg) {
> - obj_cgroup_charge_zswap(objcg, entry->length);
> - count_objcg_events(objcg, ZSWPOUT, 1);
> - }
> -
> /*
> * We finish initializing the entry while it's already in xarray.
> * This is safe because:
> @@ -1496,36 +1481,137 @@ bool zswap_store(struct folio *folio)
> * an incoherent entry.
> */
> if (entry->length) {
> + *compressed_bytes += entry->length;
> INIT_LIST_HEAD(&entry->lru);
> zswap_lru_add(&zswap_list_lru, entry);
> }
>
> - /* update stats */
> - atomic_long_inc(&zswap_stored_pages);
> - count_vm_event(ZSWPOUT);
> -
> + /*
> + * We shouldn't have any possibility of failure after the entry is
> + * added in the xarray. The pool/objcg refs obtained here will only
> + * be dropped if/when zswap_entry_free() gets called.
> + */
> return true;
>
> store_failed:
> zpool_free(entry->pool->zpool, entry->handle);
> -put_pool:
> - zswap_pool_put(entry->pool);
> -freepage:
> +put_pool_objcg:
> + zswap_pool_put(pool);
> + obj_cgroup_put(objcg);
I think if we reorder the function we can drop these calls, position
the comments a bit better, and centralize the entry
initializations. I am also not a fan of passing a semi-initialized
entry to zswap_compress() to get the pool pointer.
Does the following diff improve things or did I miss something?
diff --git a/mm/zswap.c b/mm/zswap.c
index b74c8de996468..eac1f846886a6 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -881,7 +881,8 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
return 0;
}
-static bool zswap_compress(struct page *page, struct zswap_entry *entry)
+static bool zswap_compress(struct page *page, struct zswap_entry *entry,
+ struct zswap_pool *pool)
{
struct crypto_acomp_ctx *acomp_ctx;
struct scatterlist input, output;
@@ -893,7 +894,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry)
gfp_t gfp;
u8 *dst;
- acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
+ acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
mutex_lock(&acomp_ctx->mutex);
@@ -926,7 +927,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry)
if (comp_ret)
goto unlock;
- zpool = entry->pool->zpool;
+ zpool = pool->zpool;
gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
if (zpool_malloc_support_movable(zpool))
gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
@@ -1435,23 +1436,11 @@ static bool zswap_store_page(struct page *page,
entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(page_folio(page)));
if (!entry) {
zswap_reject_kmemcache_fail++;
- goto reject;
+ return false;
}
- /* zswap_store() already holds a ref on 'objcg' and 'pool' */
- if (objcg)
- obj_cgroup_get(objcg);
- zswap_pool_get(pool);
-
- /* if entry is successfully added, it keeps the reference */
- entry->pool = pool;
-
- if (!zswap_compress(page, entry))
- goto put_pool_objcg;
-
- entry->swpentry = page_swap_entry(page);
- entry->objcg = objcg;
- entry->referenced = true;
+ if (!zswap_compress(page, entry, pool))
+ goto compress_failed;
old = xa_store(tree, swp_offset(entry->swpentry), entry, GFP_KERNEL);
if (xa_is_err(old)) {
@@ -1470,6 +1459,16 @@ static bool zswap_store_page(struct page *page,
if (old)
zswap_entry_free(old);
+ /*
+ * The entry is successfully compressed and stored in the tree, there is
+ * no further possibility of failure. Grab refs to the pool and objcg.
+ * These refs will be dropped by zswap_entry_free() when the entry is
+ * removed from the tree.
+ */
+ zswap_pool_get(pool);
+ if (objcg)
+ obj_cgroup_get(objcg);
+
/*
* We finish initializing the entry while it's already in xarray.
* This is safe because:
@@ -1480,26 +1479,22 @@ static bool zswap_store_page(struct page *page,
* The publishing order matters to prevent writeback from seeing
* an incoherent entry.
*/
+ entry->pool = pool;
+ entry->swpentry = page_swap_entry(page);
+ entry->objcg = objcg;
+ entry->referenced = true;
if (entry->length) {
*compressed_bytes += entry->length;
INIT_LIST_HEAD(&entry->lru);
zswap_lru_add(&zswap_list_lru, entry);
}
- /*
- * We shouldn't have any possibility of failure after the entry is
- * added in the xarray. The pool/objcg refs obtained here will only
- * be dropped if/when zswap_entry_free() gets called.
- */
return true;
store_failed:
- zpool_free(entry->pool->zpool, entry->handle);
-put_pool_objcg:
- zswap_pool_put(pool);
- obj_cgroup_put(objcg);
+ zpool_free(pool->zpool, entry->handle);
+compress_failed:
zswap_entry_cache_free(entry);
-reject:
return false;
}
> zswap_entry_cache_free(entry);
> reject:
> + return false;
> +}
> +
> +bool zswap_store(struct folio *folio)
> +{
> + long nr_pages = folio_nr_pages(folio);
> + swp_entry_t swp = folio->swap;
> + struct xarray *tree = swap_zswap_tree(swp);
> + struct obj_cgroup *objcg = NULL;
> + struct mem_cgroup *memcg = NULL;
> + struct zswap_pool *pool;
> + size_t compressed_bytes = 0;
> + bool ret = false;
> + long index;
> +
> + VM_WARN_ON_ONCE(!folio_test_locked(folio));
> + VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> +
> + if (!zswap_enabled)
> + goto check_old;
> +
> + /*
> + * Check cgroup zswap limits:
> + *
> + * The cgroup zswap limit check is done once at the beginning of
> + * zswap_store(). The cgroup charging is done once, at the end
> + * of a successful folio store. What this means is, if the cgroup
> + * was within the zswap_max limit at the beginning of a large folio
> + * store, it could go over the limit by at most (HPAGE_PMD_NR - 1)
> + * pages due to this store.
> + */
> + objcg = get_obj_cgroup_from_folio(folio);
> + if (objcg && !obj_cgroup_may_zswap(objcg)) {
> + memcg = get_mem_cgroup_from_objcg(objcg);
> + if (shrink_memcg(memcg)) {
> + mem_cgroup_put(memcg);
> + goto put_objcg;
> + }
> + mem_cgroup_put(memcg);
> + }
> +
> + /*
> + * Check zpool utilization against zswap limits:
> + *
> + * The zswap zpool utilization is also checked against the limits
> + * just once, at the start of zswap_store(). If the check passes,
> + * any breaches of the limits set by zswap_max_pages() or
> + * zswap_accept_thr_pages() that may happen while storing this
> + * folio, will only be detected during the next call to
> + * zswap_store() by any process.
> + */
This is essentially a rephrased repetition of the last comment, just
refer to it instead. Something like:
/*
* Check zpool utilization against zswap limits. The possibility of
* going overlimit is the same as the cgroup limit check.
*/
> + if (zswap_check_limits())
> + goto put_objcg;
> +
> + pool = zswap_pool_current_get();
> + if (!pool)
> + goto put_objcg;
> +
> + if (objcg) {
> + memcg = get_mem_cgroup_from_objcg(objcg);
> + if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> + mem_cgroup_put(memcg);
> + goto put_pool;
> + }
> + mem_cgroup_put(memcg);
> + }
> +
> + /*
> + * Store each page of the folio as a separate entry. If we fail to
> + * store a page, unwind by deleting all the pages for this folio
> + * currently in zswap.
> + */
> + for (index = 0; index < nr_pages; ++index) {
> + if (!zswap_store_page(folio_page(folio, index), objcg, pool, tree, &compressed_bytes))
> + goto put_pool;
> + }
> +
> + if (objcg) {
> + obj_cgroup_charge_zswap(objcg, compressed_bytes);
> + count_objcg_events(objcg, ZSWPOUT, nr_pages);
> + }
> +
> + atomic_long_add(nr_pages, &zswap_stored_pages);
> + count_vm_events(ZSWPOUT, nr_pages);
> +
> + ret = true;
> +
> +put_pool:
> + zswap_pool_put(pool);
> +put_objcg:
> obj_cgroup_put(objcg);
> - if (zswap_pool_reached_full)
> + if (!ret && zswap_pool_reached_full)
> queue_work(shrink_wq, &zswap_shrink_work);
> check_old:
> /*
> - * If the zswap store fails or zswap is disabled, we must invalidate the
> - * possibly stale entry which was previously stored at this offset.
> - * Otherwise, writeback could overwrite the new data in the swapfile.
> + * If the zswap store fails or zswap is disabled, we must invalidate
> + * the possibly stale entries which were previously stored at the
> + * offsets corresponding to each page of the folio. Otherwise,
> + * writeback could overwrite the new data in the swapfile.
> */
> - entry = xa_erase(tree, offset);
> - if (entry)
> - zswap_entry_free(entry);
> - return false;
> + if (!ret) {
> + pgoff_t offset = swp_offset(swp);
> + struct zswap_entry *entry;
> +
> + for (index = 0; index < nr_pages; ++index) {
> + entry = xa_erase(tree, offset + index);
> + if (entry)
> + zswap_entry_free(entry);
> + }
> + }
> +
> + return ret;
> }
>
> bool zswap_load(struct folio *folio)
> --
> 2.27.0
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
2024-09-30 22:12 ` [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store() Kanchana P Sridhar
2024-09-30 23:09 ` Yosry Ahmed
@ 2024-09-30 23:11 ` Nhat Pham
2024-09-30 23:19 ` Yosry Ahmed
2024-10-01 1:14 ` Johannes Weiner
2 siblings, 1 reply; 29+ messages in thread
From: Nhat Pham @ 2024-09-30 23:11 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
akpm, willy, nanhai.zou, wajdi.k.feghali, vinodh.gopal
On Mon, Sep 30, 2024 at 3:12 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> zswap_store() will store large folios by compressing them page by page.
>
> This patch provides a sequential implementation of storing a large folio
> in zswap_store() by iterating through each page in the folio to compress
> and store it in the zswap zpool.
>
> zswap_store() calls the newly added zswap_store_page() function for each
> page in the folio. zswap_store_page() handles compressing and storing each
> page.
>
> We check the global and per-cgroup limits once at the beginning of
> zswap_store(), and only check that the limit is not reached yet. This is
> racy and inaccurate, but it should be sufficient for now. We also obtain
> initial references to the relevant objcg and pool to guarantee that
> subsequent references can be acquired by zswap_store_page(). A new function
> zswap_pool_get() is added to facilitate this.
>
> If these one-time checks pass, we compress the pages of the folio, while
> maintaining a running count of compressed bytes for all the folio's pages.
> If all pages are successfully compressed and stored, we do the cgroup
> zswap charging with the total compressed bytes, and batch update the
> zswap_stored_pages atomic/zswpout event stats with folio_nr_pages() once,
> before returning from zswap_store().
>
> If an error is encountered during the store of any page in the folio,
> all pages in that folio currently stored in zswap will be invalidated.
> Thus, a folio is either entirely stored in zswap, or entirely not stored
> in zswap.
>
> The most important value provided by this patch is it enables swapping out
> large folios to zswap without splitting them. Furthermore, it batches some
> operations while doing so (cgroup charging, stats updates).
>
> This patch also forms the basis for building compress batching of pages in
> a large folio in zswap_store() by compressing up to say, 8 pages of the
> folio in parallel in hardware using the Intel In-Memory Analytics
> Accelerator (Intel IAA).
>
> This change reuses and adapts the functionality in Ryan Roberts' RFC
> patch [1]:
>
> "[RFC,v1] mm: zswap: Store large folios without splitting"
>
> [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Also, addressed some of the RFC comments from the discussion in [1].
>
> Co-developed-by: Ryan Roberts
> Signed-off-by:
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
> mm/zswap.c | 220 +++++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 153 insertions(+), 67 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 2b8da50f6322..b74c8de99646 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -411,6 +411,12 @@ static int __must_check zswap_pool_tryget(struct zswap_pool *pool)
> return percpu_ref_tryget(&pool->ref);
> }
>
> +/* The caller must already have a reference. */
> +static void zswap_pool_get(struct zswap_pool *pool)
> +{
> + percpu_ref_get(&pool->ref);
> +}
> +
> static void zswap_pool_put(struct zswap_pool *pool)
> {
> percpu_ref_put(&pool->ref);
> @@ -1402,68 +1408,52 @@ static void shrink_worker(struct work_struct *w)
> /*********************************
> * main API
> **********************************/
> -bool zswap_store(struct folio *folio)
> +
> +/*
> + * Stores the page at specified "index" in a folio.
> + *
> + * @page: The page to store in zswap.
> + * @objcg: The folio's objcg. Caller has a reference.
> + * @pool: The zswap_pool to store the compressed data for the page.
> + * The caller should have obtained a reference to a valid
> + * zswap_pool by calling zswap_pool_tryget(), to pass as this
> + * argument.
> + * @tree: The xarray for the @page's folio's swap.
> + * @compressed_bytes: The compressed entry->length value is added
> + * to this, so that the caller can get the total
> + * compressed lengths of all sub-pages in a folio.
> + */
> +static bool zswap_store_page(struct page *page,
> + struct obj_cgroup *objcg,
> + struct zswap_pool *pool,
> + struct xarray *tree,
> + size_t *compressed_bytes)
> {
> - swp_entry_t swp = folio->swap;
> - pgoff_t offset = swp_offset(swp);
> - struct xarray *tree = swap_zswap_tree(swp);
> struct zswap_entry *entry, *old;
> - struct obj_cgroup *objcg = NULL;
> - struct mem_cgroup *memcg = NULL;
> -
> - VM_WARN_ON_ONCE(!folio_test_locked(folio));
> - VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> -
> - /* Large folios aren't supported */
> - if (folio_test_large(folio))
> - return false;
> -
> - if (!zswap_enabled)
> - goto check_old;
> -
> - /* Check cgroup limits */
> - objcg = get_obj_cgroup_from_folio(folio);
> - if (objcg && !obj_cgroup_may_zswap(objcg)) {
> - memcg = get_mem_cgroup_from_objcg(objcg);
> - if (shrink_memcg(memcg)) {
> - mem_cgroup_put(memcg);
> - goto reject;
> - }
> - mem_cgroup_put(memcg);
> - }
> -
> - if (zswap_check_limits())
> - goto reject;
>
> /* allocate entry */
> - entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> + entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(page_folio(page)));
> if (!entry) {
> zswap_reject_kmemcache_fail++;
> goto reject;
> }
>
> - /* if entry is successfully added, it keeps the reference */
> - entry->pool = zswap_pool_current_get();
> - if (!entry->pool)
> - goto freepage;
> + /* zswap_store() already holds a ref on 'objcg' and 'pool' */
> + if (objcg)
> + obj_cgroup_get(objcg);
> + zswap_pool_get(pool);
Should we also batch-get references to the pool as well? i.e add a
helper function:
/* The caller must already have a reference. */
static void zswap_pool_get_many(struct zswap_pool *pool, unsigned long nr)
{
percpu_ref_get_many(&pool->ref, nr);
}
then do it in one fell swoop after you're done storing all individual subpages
(near atomic_long_add(nr_pages, &zswap_stored_pages)).
Do double-check that it is safe - I think it should be, since we have
the folio locked in the swapcache, so there should not be any shenanigans
(e.g. no race with a concurrent free or writeback).
Perhaps a fixlet suffices?
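For concreteness, a rough sketch of the call site this would give in
zswap_store(), assuming the per-page zswap_pool_get() in zswap_store_page()
is dropped and using the zswap_pool_get_many() helper proposed above. This
is illustrative only, not part of the posted patch, and the error path
would need the extra care discussed in the replies below:

	for (index = 0; index < nr_pages; ++index) {
		if (!zswap_store_page(folio_page(folio, index), objcg, pool,
				      tree, &compressed_bytes))
			/*
			 * Caveat: entries already in the tree hold no pool
			 * ref yet under this scheme, so the cleanup path
			 * would need adjusting -- see the replies below.
			 */
			goto put_pool;
	}

	/*
	 * All entries are in the tree; take the per-entry pool refs in one
	 * go. The initial zswap_pool_tryget() ref is still dropped at the
	 * existing put_pool label.
	 */
	zswap_pool_get_many(pool, nr_pages);

	atomic_long_add(nr_pages, &zswap_stored_pages);
	count_vm_events(ZSWPOUT, nr_pages);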
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
2024-09-30 23:11 ` Nhat Pham
@ 2024-09-30 23:19 ` Yosry Ahmed
2024-09-30 23:29 ` Nhat Pham
0 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2024-09-30 23:19 UTC (permalink / raw)
To: Nhat Pham
Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm, willy, nanhai.zou, wajdi.k.feghali,
vinodh.gopal
On Mon, Sep 30, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Sep 30, 2024 at 3:12 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > zswap_store() will store large folios by compressing them page by page.
> >
> > This patch provides a sequential implementation of storing a large folio
> > in zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > zswap_store() calls the newly added zswap_store_page() function for each
> > page in the folio. zswap_store_page() handles compressing and storing each
> > page.
> >
> > We check the global and per-cgroup limits once at the beginning of
> > zswap_store(), and only check that the limit is not reached yet. This is
> > racy and inaccurate, but it should be sufficient for now. We also obtain
> > initial references to the relevant objcg and pool to guarantee that
> > subsequent references can be acquired by zswap_store_page(). A new function
> > zswap_pool_get() is added to facilitate this.
> >
> > If these one-time checks pass, we compress the pages of the folio, while
> > maintaining a running count of compressed bytes for all the folio's pages.
> > If all pages are successfully compressed and stored, we do the cgroup
> > zswap charging with the total compressed bytes, and batch update the
> > zswap_stored_pages atomic/zswpout event stats with folio_nr_pages() once,
> > before returning from zswap_store().
> >
> > If an error is encountered during the store of any page in the folio,
> > all pages in that folio currently stored in zswap will be invalidated.
> > Thus, a folio is either entirely stored in zswap, or entirely not stored
> > in zswap.
> >
> > The most important value provided by this patch is it enables swapping out
> > large folios to zswap without splitting them. Furthermore, it batches some
> > operations while doing so (cgroup charging, stats updates).
> >
> > This patch also forms the basis for building compress batching of pages in
> > a large folio in zswap_store() by compressing up to say, 8 pages of the
> > folio in parallel in hardware using the Intel In-Memory Analytics
> > Accelerator (Intel IAA).
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> > "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> > [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> > mm/zswap.c | 220 +++++++++++++++++++++++++++++++++++++----------------
> > 1 file changed, 153 insertions(+), 67 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 2b8da50f6322..b74c8de99646 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -411,6 +411,12 @@ static int __must_check zswap_pool_tryget(struct zswap_pool *pool)
> > return percpu_ref_tryget(&pool->ref);
> > }
> >
> > +/* The caller must already have a reference. */
> > +static void zswap_pool_get(struct zswap_pool *pool)
> > +{
> > + percpu_ref_get(&pool->ref);
> > +}
> > +
> > static void zswap_pool_put(struct zswap_pool *pool)
> > {
> > percpu_ref_put(&pool->ref);
> > @@ -1402,68 +1408,52 @@ static void shrink_worker(struct work_struct *w)
> > /*********************************
> > * main API
> > **********************************/
> > -bool zswap_store(struct folio *folio)
> > +
> > +/*
> > + * Stores the page at specified "index" in a folio.
> > + *
> > + * @page: The page to store in zswap.
> > + * @objcg: The folio's objcg. Caller has a reference.
> > + * @pool: The zswap_pool to store the compressed data for the page.
> > + * The caller should have obtained a reference to a valid
> > + * zswap_pool by calling zswap_pool_tryget(), to pass as this
> > + * argument.
> > + * @tree: The xarray for the @page's folio's swap.
> > + * @compressed_bytes: The compressed entry->length value is added
> > + * to this, so that the caller can get the total
> > + * compressed lengths of all sub-pages in a folio.
> > + */
> > +static bool zswap_store_page(struct page *page,
> > + struct obj_cgroup *objcg,
> > + struct zswap_pool *pool,
> > + struct xarray *tree,
> > + size_t *compressed_bytes)
> > {
> > - swp_entry_t swp = folio->swap;
> > - pgoff_t offset = swp_offset(swp);
> > - struct xarray *tree = swap_zswap_tree(swp);
> > struct zswap_entry *entry, *old;
> > - struct obj_cgroup *objcg = NULL;
> > - struct mem_cgroup *memcg = NULL;
> > -
> > - VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > - VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > -
> > - /* Large folios aren't supported */
> > - if (folio_test_large(folio))
> > - return false;
> > -
> > - if (!zswap_enabled)
> > - goto check_old;
> > -
> > - /* Check cgroup limits */
> > - objcg = get_obj_cgroup_from_folio(folio);
> > - if (objcg && !obj_cgroup_may_zswap(objcg)) {
> > - memcg = get_mem_cgroup_from_objcg(objcg);
> > - if (shrink_memcg(memcg)) {
> > - mem_cgroup_put(memcg);
> > - goto reject;
> > - }
> > - mem_cgroup_put(memcg);
> > - }
> > -
> > - if (zswap_check_limits())
> > - goto reject;
> >
> > /* allocate entry */
> > - entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> > + entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(page_folio(page)));
> > if (!entry) {
> > zswap_reject_kmemcache_fail++;
> > goto reject;
> > }
> >
> > - /* if entry is successfully added, it keeps the reference */
> > - entry->pool = zswap_pool_current_get();
> > - if (!entry->pool)
> > - goto freepage;
> > + /* zswap_store() already holds a ref on 'objcg' and 'pool' */
> > + if (objcg)
> > + obj_cgroup_get(objcg);
> > + zswap_pool_get(pool);
>
> Should we also batch-get references to the pool as well? i.e add a
> helper function:
>
> /* The caller must already have a reference. */
> static void zswap_pool_get_many(struct zswap_pool *pool, unsigned long nr)
> {
> percpu_ref_get_many(&pool->ref, nr);
> }
>
> then do it in a fell swoop after you're done storing all individual subpages
> (near atomic_long_add(nr_pages, &zswap_stored_pages)).
>
> Do double check that it is safe - I think it should be, since we have
> the folio locked in swapcache, so there should not be any shenanigans
> (for e.g no race with concurrent free or writeback).
>
> Perhaps a fixlet suffices?
I suggested this in a previous version, and Kanchana faced some
complexities implementing it:
https://lore.kernel.org/lkml/SJ0PR11MB56785027ED6FCF673A84CEE6C96A2@SJ0PR11MB5678.namprd11.prod.outlook.com/
Basically, if we batch get the refs after the store I think it's not
safe, because once an entry is published to writeback it can be
written back and freed, and a ref that we never acquired would be
dropped.
Getting refs before the store would work, but then if the store fails
at an arbitrary page, we need to only drop refs on the pool for pages
that were not added to the tree, as the cleanup loop with
zswap_entry_free() at the end of zswap_store() will drop the ref for
those that were added to the tree.
We agreed to (potentially) do the batching for refcounts as a followup.
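For reference, the "refs up front" option would look roughly like the
sketch below. It is illustrative only: zswap_pool_get_many()/
zswap_pool_put_many() and obj_cgroup_get_many()/obj_cgroup_put_many() are
assumed batched counterparts in the spirit of Nhat's helper, and
zswap_store_page() is assumed not to take its own per-page refs:

	zswap_pool_get_many(pool, nr_pages);
	if (objcg)
		obj_cgroup_get_many(objcg, nr_pages);

	for (index = 0; index < nr_pages; ++index) {
		if (!zswap_store_page(folio_page(folio, index), objcg, pool,
				      tree, &compressed_bytes)) {
			/*
			 * Entries already in the tree give their refs back
			 * via zswap_entry_free() in the cleanup loop; only
			 * the refs for the pages that never made it into
			 * the xarray are dropped here.
			 */
			zswap_pool_put_many(pool, nr_pages - index);
			if (objcg)
				obj_cgroup_put_many(objcg, nr_pages - index);
			goto put_pool;
		}
	}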
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
2024-09-30 23:19 ` Yosry Ahmed
@ 2024-09-30 23:29 ` Nhat Pham
2024-09-30 23:33 ` Yosry Ahmed
0 siblings, 1 reply; 29+ messages in thread
From: Nhat Pham @ 2024-09-30 23:29 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm, willy, nanhai.zou, wajdi.k.feghali,
vinodh.gopal
On Mon, Sep 30, 2024 at 4:20 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, Sep 30, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> I suggested this in a previous version, and Kanchana faced some
> complexities implementing it:
> https://lore.kernel.org/lkml/SJ0PR11MB56785027ED6FCF673A84CEE6C96A2@SJ0PR11MB5678.namprd11.prod.outlook.com/
Sorry, I missed that conversation.
>
> Basically, if we batch get the refs after the store I think it's not
> safe, because once an entry is published to writeback it can be
> written back and freed, and a ref that we never acquired would be
> dropped.
Hmmm. I don't think writeback could touch any individual subpage just yet, no?
Before doing any work, zswap writeback would attempt to add the
subpage to the swap cache (via __read_swap_cache_async()). However,
all subpages will have already been added to the swap cache, and point to
the (large) folio. So zswap_writeback_entry() should short-circuit
here (the if (!page_allocated) case).
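For reference, the short-circuit being referred to looks roughly like this
in zswap_writeback_entry() (abbreviated from recent mm/zswap.c; treat the
exact argument list as approximate):

	folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
					NO_INTERLEAVE_INDEX,
					&folio_was_allocated, true);
	if (!folio)
		return -ENOMEM;

	/*
	 * A subpage of a folio that is being stored is already in the swap
	 * cache, so no new folio is allocated here and writeback bails out
	 * before touching the entry's compressed data.
	 */
	if (!folio_was_allocated) {
		folio_put(folio);
		return -EEXIST;
	}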
>
> Getting refs before the store would work, but then if the store fails
> at an arbitrary page, we need to only drop refs on the pool for pages
> that were not added to the tree, as the cleanup loop with
> zswap_entry_free() at the end of zswap_store() will drop the ref for
> those that were added to the tree.
>
> We agreed to (potentially) do the batching for refcounts as a followup.
But yeah no biggie. Not a dealbreaker for me tbh. I thought it was a
quick change (hence the fixlet suggestion), but if not then let's do
it as a follow-up.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
2024-09-30 23:29 ` Nhat Pham
@ 2024-09-30 23:33 ` Yosry Ahmed
2024-09-30 23:43 ` Nhat Pham
0 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2024-09-30 23:33 UTC (permalink / raw)
To: Nhat Pham
Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm, willy, nanhai.zou, wajdi.k.feghali,
vinodh.gopal
On Mon, Sep 30, 2024 at 4:29 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Sep 30, 2024 at 4:20 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > On Mon, Sep 30, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > I suggested this in a previous version, and Kanchana faced some
> > complexities implementing it:
> > https://lore.kernel.org/lkml/SJ0PR11MB56785027ED6FCF673A84CEE6C96A2@SJ0PR11MB5678.namprd11.prod.outlook.com/
>
> Sorry, I missed that conversation.
>
> >
> > Basically, if we batch get the refs after the store I think it's not
> > safe, because once an entry is published to writeback it can be
> > written back and freed, and a ref that we never acquired would be
> > dropped.
>
> Hmmm. I don't think writeback could touch any individual subpage just yet, no?
>
> Before doing any work, zswap writeback would attempt to add the
> subpage to the swap cache (via __read_swap_cache_async()). However,
> all subpage will have already been added to swap cache, and point to
> the (large) folio. So zswap_writeback_entry() should short circuit
> here (the if (!page_allocated) case).
If it's safe to take the refs after all calls to zswap_store_page()
are successful, then yeah that should be possible, for both the pool
and objcg. I didn't look closely though.
Just to clarify, you mean grab one ref first, then do the
compressions, then grab the remaining refs, right?
The other thing I would be worried about is code clarity; I am not sure
whether it would be confusing to initialize the entries in zswap_store_page()
and then grab the refs later in zswap_store().
>
> >
> > Getting refs before the store would work, but then if the store fails
> > at an arbitrary page, we need to only drop refs on the pool for pages
> > that were not added to the tree, as the cleanup loop with
> > zswap_entry_free() at the end of zswap_store() will drop the ref for
> > those that were added to the tree.
> >
> > We agreed to (potentially) do the batching for refcounts as a followup.
>
> But yeah no biggie. Not a dealbreaker for me tbh. I thought it was a
> quick change (hence the fixlet suggestion), but if not then let's do
> it as a follow-up.
I don't feel strongly either way.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
2024-09-30 23:33 ` Yosry Ahmed
@ 2024-09-30 23:43 ` Nhat Pham
2024-10-01 0:47 ` Sridhar, Kanchana P
0 siblings, 1 reply; 29+ messages in thread
From: Nhat Pham @ 2024-09-30 23:43 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm, willy, nanhai.zou, wajdi.k.feghali,
vinodh.gopal
On Mon, Sep 30, 2024 at 4:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Mon, Sep 30, 2024 at 4:29 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Mon, Sep 30, 2024 at 4:20 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > On Mon, Sep 30, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > I suggested this in a previous version, and Kanchana faced some
> > > complexities implementing it:
> > > https://lore.kernel.org/lkml/SJ0PR11MB56785027ED6FCF673A84CEE6C96A2@SJ0PR11MB5678.namprd11.prod.outlook.com/
> >
> > Sorry, I missed that conversation.
> >
> > >
> > > Basically, if we batch get the refs after the store I think it's not
> > > safe, because once an entry is published to writeback it can be
> > > written back and freed, and a ref that we never acquired would be
> > > dropped.
> >
> > Hmmm. I don't think writeback could touch any individual subpage just yet, no?
> >
> > Before doing any work, zswap writeback would attempt to add the
> > subpage to the swap cache (via __read_swap_cache_async()). However,
> > all subpage will have already been added to swap cache, and point to
> > the (large) folio. So zswap_writeback_entry() should short circuit
> > here (the if (!page_allocated) case).
>
> If it's safe to take the refs after all calls to zswap_store_page()
> are successful, then yeah that should be possible, for both the pool
> and objcg. I didn't look closely though.
>
> Just to clarify, you mean grab one ref first, then do the
> compressions, then grab the remaining refs, right?
Ah yeah, that's what I meant. We can perform one of the following
sequences: grab 1 -> grab nr -> drop 1, or grab 1 -> grab nr - 1 if
successful, drop 1 if failed.
Seems straightforward to me, but yeah it seems a bit hair-splitting of
me to die on this hill :) Just thought it was weird seeing the other
parts batchified, and one part wasn't.
The rest LGTM - I'll defer to you and Johannes for further review.
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v9 4/7] mm: Change count_objcg_event() to count_objcg_events() for batch event updates.
2024-09-30 22:12 ` [PATCH v9 4/7] mm: Change count_objcg_event() to count_objcg_events() for batch event updates Kanchana P Sridhar
2024-09-30 22:18 ` Yosry Ahmed
2024-09-30 22:59 ` Nhat Pham
@ 2024-10-01 0:22 ` Johannes Weiner
2 siblings, 0 replies; 29+ messages in thread
From: Johannes Weiner @ 2024-10-01 0:22 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
akpm, willy, nanhai.zou, wajdi.k.feghali, vinodh.gopal
On Mon, Sep 30, 2024 at 03:12:18PM -0700, Kanchana P Sridhar wrote:
> With the introduction of zswap_store() swapping out large folios,
> we need to efficiently update the objcg's memcg events once per
> successfully stored folio. For instance, the 'ZSWPOUT' event needs
> to be incremented by folio_nr_pages().
>
> To facilitate this, the existing count_objcg_event() API is modified
> to be count_objcg_events() that additionally accepts a count parameter.
> The only existing calls to count_objcg_event() are in zswap.c - these
> have been modified to call count_objcg_events() with a count of 1.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
2024-09-30 23:09 ` Yosry Ahmed
@ 2024-10-01 0:35 ` Sridhar, Kanchana P
2024-10-01 6:00 ` Yosry Ahmed
0 siblings, 1 reply; 29+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-01 0:35 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
akpm, willy, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh,
Sridhar, Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Monday, September 30, 2024 4:09 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; willy@infradead.org; Zou, Nanhai <nanhai.zou@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
>
> On Mon, Sep 30, 2024 at 3:12 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > zswap_store() will store large folios by compressing them page by page.
> >
> > This patch provides a sequential implementation of storing a large folio
> > in zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > zswap_store() calls the newly added zswap_store_page() function for each
> > page in the folio. zswap_store_page() handles compressing and storing each
> > page.
> >
> > We check the global and per-cgroup limits once at the beginning of
> > zswap_store(), and only check that the limit is not reached yet. This is
> > racy and inaccurate, but it should be sufficient for now. We also obtain
> > initial references to the relevant objcg and pool to guarantee that
> > subsequent references can be acquired by zswap_store_page(). A new
> function
> > zswap_pool_get() is added to facilitate this.
> >
> > If these one-time checks pass, we compress the pages of the folio, while
> > maintaining a running count of compressed bytes for all the folio's pages.
> > If all pages are successfully compressed and stored, we do the cgroup
> > zswap charging with the total compressed bytes, and batch update the
> > zswap_stored_pages atomic/zswpout event stats with folio_nr_pages()
> once,
> > before returning from zswap_store().
> >
> > If an error is encountered during the store of any page in the folio,
> > all pages in that folio currently stored in zswap will be invalidated.
> > Thus, a folio is either entirely stored in zswap, or entirely not stored
> > in zswap.
> >
> > The most important value provided by this patch is it enables swapping out
> > large folios to zswap without splitting them. Furthermore, it batches some
> > operations while doing so (cgroup charging, stats updates).
> >
> > This patch also forms the basis for building compress batching of pages in
> > a large folio in zswap_store() by compressing up to say, 8 pages of the
> > folio in parallel in hardware using the Intel In-Memory Analytics
> > Accelerator (Intel IAA).
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> > "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> > [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-
> ryan.roberts@arm.com/T/#u
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
>
> We haven't been able to get a signoff from Ryan. Andrew, what's the policy
> here?
>
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> > mm/zswap.c | 220 +++++++++++++++++++++++++++++++++++++-----------
> -----
> > 1 file changed, 153 insertions(+), 67 deletions(-)
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 2b8da50f6322..b74c8de99646 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -411,6 +411,12 @@ static int __must_check zswap_pool_tryget(struct
> zswap_pool *pool)
> > return percpu_ref_tryget(&pool->ref);
> > }
> >
> > +/* The caller must already have a reference. */
> > +static void zswap_pool_get(struct zswap_pool *pool)
> > +{
> > + percpu_ref_get(&pool->ref);
> > +}
> > +
> > static void zswap_pool_put(struct zswap_pool *pool)
> > {
> > percpu_ref_put(&pool->ref);
> > @@ -1402,68 +1408,52 @@ static void shrink_worker(struct work_struct
> *w)
> > /*********************************
> > * main API
> > **********************************/
> > -bool zswap_store(struct folio *folio)
> > +
> > +/*
> > + * Stores the page at specified "index" in a folio.
> > + *
> > + * @page: The page to store in zswap.
> > + * @objcg: The folio's objcg. Caller has a reference.
> > + * @pool: The zswap_pool to store the compressed data for the page.
> > + * The caller should have obtained a reference to a valid
> > + * zswap_pool by calling zswap_pool_tryget(), to pass as this
> > + * argument.
> > + * @tree: The xarray for the @page's folio's swap.
> > + * @compressed_bytes: The compressed entry->length value is added
> > + * to this, so that the caller can get the total
> > + * compressed lengths of all sub-pages in a folio.
> > + */
> > +static bool zswap_store_page(struct page *page,
> > + struct obj_cgroup *objcg,
> > + struct zswap_pool *pool,
> > + struct xarray *tree,
> > + size_t *compressed_bytes)
> > {
> > - swp_entry_t swp = folio->swap;
> > - pgoff_t offset = swp_offset(swp);
> > - struct xarray *tree = swap_zswap_tree(swp);
> > struct zswap_entry *entry, *old;
> > - struct obj_cgroup *objcg = NULL;
> > - struct mem_cgroup *memcg = NULL;
> > -
> > - VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > - VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > -
> > - /* Large folios aren't supported */
> > - if (folio_test_large(folio))
> > - return false;
> > -
> > - if (!zswap_enabled)
> > - goto check_old;
> > -
> > - /* Check cgroup limits */
> > - objcg = get_obj_cgroup_from_folio(folio);
> > - if (objcg && !obj_cgroup_may_zswap(objcg)) {
> > - memcg = get_mem_cgroup_from_objcg(objcg);
> > - if (shrink_memcg(memcg)) {
> > - mem_cgroup_put(memcg);
> > - goto reject;
> > - }
> > - mem_cgroup_put(memcg);
> > - }
> > -
> > - if (zswap_check_limits())
> > - goto reject;
> >
> > /* allocate entry */
> > - entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> > + entry = zswap_entry_cache_alloc(GFP_KERNEL,
> folio_nid(page_folio(page)));
>
> Can we just use page_to_nid() here? I think the node info exists even
> for tail pages, right?
I wasn't sure about this, and figured it would be safer to use folio_nid()
which always gets the node id from the head page. We would need
Johannes/Matthew to give their recommendations.
>
> > if (!entry) {
> > zswap_reject_kmemcache_fail++;
> > goto reject;
> > }
> >
> > - /* if entry is successfully added, it keeps the reference */
> > - entry->pool = zswap_pool_current_get();
> > - if (!entry->pool)
> > - goto freepage;
> > + /* zswap_store() already holds a ref on 'objcg' and 'pool' */
> > + if (objcg)
> > + obj_cgroup_get(objcg);
> > + zswap_pool_get(pool);
> >
> > - if (objcg) {
> > - memcg = get_mem_cgroup_from_objcg(objcg);
> > - if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> > - mem_cgroup_put(memcg);
> > - goto put_pool;
> > - }
> > - mem_cgroup_put(memcg);
> > - }
> > + /* if entry is successfully added, it keeps the reference */
> > + entry->pool = pool;
> >
> > - if (!zswap_compress(&folio->page, entry))
> > - goto put_pool;
> > + if (!zswap_compress(page, entry))
> > + goto put_pool_objcg;
> >
> > - entry->swpentry = swp;
> > + entry->swpentry = page_swap_entry(page);
> > entry->objcg = objcg;
> > entry->referenced = true;
> >
> > - old = xa_store(tree, offset, entry, GFP_KERNEL);
> > + old = xa_store(tree, swp_offset(entry->swpentry), entry, GFP_KERNEL);
> > if (xa_is_err(old)) {
> > int err = xa_err(old);
> >
> > @@ -1480,11 +1470,6 @@ bool zswap_store(struct folio *folio)
> > if (old)
> > zswap_entry_free(old);
> >
> > - if (objcg) {
> > - obj_cgroup_charge_zswap(objcg, entry->length);
> > - count_objcg_events(objcg, ZSWPOUT, 1);
> > - }
> > -
> > /*
> > * We finish initializing the entry while it's already in xarray.
> > * This is safe because:
> > @@ -1496,36 +1481,137 @@ bool zswap_store(struct folio *folio)
> > * an incoherent entry.
> > */
> > if (entry->length) {
> > + *compressed_bytes += entry->length;
> > INIT_LIST_HEAD(&entry->lru);
> > zswap_lru_add(&zswap_list_lru, entry);
> > }
> >
> > - /* update stats */
> > - atomic_long_inc(&zswap_stored_pages);
> > - count_vm_event(ZSWPOUT);
> > -
> > + /*
> > + * We shouldn't have any possibility of failure after the entry is
> > + * added in the xarray. The pool/objcg refs obtained here will only
> > + * be dropped if/when zswap_entry_free() gets called.
> > + */
> > return true;
> >
> > store_failed:
> > zpool_free(entry->pool->zpool, entry->handle);
> > -put_pool:
> > - zswap_pool_put(entry->pool);
> > -freepage:
> > +put_pool_objcg:
> > + zswap_pool_put(pool);
> > + obj_cgroup_put(objcg);
>
> I think if we reorder the function we can drop these calls, position
> the comments a bit better, and centralize the entry
> initializations. I am also not a fan of passing a semi-initialized
> entry to zswap_compress() to get the pool pointer.
>
> Does the following diff improve things or did I miss something?
We shouldn’t be adding the entry to the xarray before initializing its pool
and objcg, right? Please let me know if I am misunderstanding what you're
proposing in the diff.
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index b74c8de996468..eac1f846886a6 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -881,7 +881,8 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
> return 0;
> }
>
> -static bool zswap_compress(struct page *page, struct zswap_entry *entry)
> +static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> + struct zswap_pool *pool)
> {
> struct crypto_acomp_ctx *acomp_ctx;
> struct scatterlist input, output;
> @@ -893,7 +894,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry)
> gfp_t gfp;
> u8 *dst;
>
> - acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> + acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
>
> mutex_lock(&acomp_ctx->mutex);
>
> @@ -926,7 +927,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry)
> if (comp_ret)
> goto unlock;
>
> - zpool = entry->pool->zpool;
> + zpool = pool->zpool;
> gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
> if (zpool_malloc_support_movable(zpool))
> gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
> @@ -1435,23 +1436,11 @@ static bool zswap_store_page(struct page *page,
> entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(page_folio(page)));
> if (!entry) {
> zswap_reject_kmemcache_fail++;
> - goto reject;
> + return false;
> }
>
> - /* zswap_store() already holds a ref on 'objcg' and 'pool' */
> - if (objcg)
> - obj_cgroup_get(objcg);
> - zswap_pool_get(pool);
> -
> - /* if entry is successfully added, it keeps the reference */
> - entry->pool = pool;
> -
> - if (!zswap_compress(page, entry))
> - goto put_pool_objcg;
> -
> - entry->swpentry = page_swap_entry(page);
> - entry->objcg = objcg;
> - entry->referenced = true;
> + if (!zswap_compress(page, entry, pool))
> + goto compress_failed;
>
> old = xa_store(tree, swp_offset(entry->swpentry), entry, GFP_KERNEL);
> if (xa_is_err(old)) {
> @@ -1470,6 +1459,16 @@ static bool zswap_store_page(struct page *page,
> if (old)
> zswap_entry_free(old);
>
> + /*
> + * The entry is successfully compressed and stored in the tree, there is
> + * no further possibility of failure. Grab refs to the pool and objcg.
> + * These refs will be dropped by zswap_entry_free() when the entry is
> + * removed from the tree.
> + */
> + zswap_pool_get(pool);
> + if (objcg)
> + obj_cgroup_get(objcg);
> +
> /*
> * We finish initializing the entry while it's already in xarray.
> * This is safe because:
> @@ -1480,26 +1479,22 @@ static bool zswap_store_page(struct page *page,
> * The publishing order matters to prevent writeback from seeing
> * an incoherent entry.
> */
> + entry->pool = pool;
> + entry->swpentry = page_swap_entry(page);
> + entry->objcg = objcg;
> + entry->referenced = true;
> if (entry->length) {
> *compressed_bytes += entry->length;
> INIT_LIST_HEAD(&entry->lru);
> zswap_lru_add(&zswap_list_lru, entry);
> }
>
> - /*
> - * We shouldn't have any possibility of failure after the entry is
> - * added in the xarray. The pool/objcg refs obtained here will only
> - * be dropped if/when zswap_entry_free() gets called.
> - */
> return true;
>
> store_failed:
> - zpool_free(entry->pool->zpool, entry->handle);
> -put_pool_objcg:
> - zswap_pool_put(pool);
> - obj_cgroup_put(objcg);
> + zpool_free(pool->zpool, entry->handle);
> +compress_failed:
> zswap_entry_cache_free(entry);
> -reject:
> return false;
> }
>
>
> > zswap_entry_cache_free(entry);
> > reject:
> > + return false;
> > +}
> > +
> > +bool zswap_store(struct folio *folio)
> > +{
> > + long nr_pages = folio_nr_pages(folio);
> > + swp_entry_t swp = folio->swap;
> > + struct xarray *tree = swap_zswap_tree(swp);
> > + struct obj_cgroup *objcg = NULL;
> > + struct mem_cgroup *memcg = NULL;
> > + struct zswap_pool *pool;
> > + size_t compressed_bytes = 0;
> > + bool ret = false;
> > + long index;
> > +
> > + VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > + VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > +
> > + if (!zswap_enabled)
> > + goto check_old;
> > +
> > + /*
> > + * Check cgroup zswap limits:
> > + *
> > + * The cgroup zswap limit check is done once at the beginning of
> > + * zswap_store(). The cgroup charging is done once, at the end
> > + * of a successful folio store. What this means is, if the cgroup
> > + * was within the zswap_max limit at the beginning of a large folio
> > + * store, it could go over the limit by at most (HPAGE_PMD_NR - 1)
> > + * pages due to this store.
> > + */
> > + objcg = get_obj_cgroup_from_folio(folio);
> > + if (objcg && !obj_cgroup_may_zswap(objcg)) {
> > + memcg = get_mem_cgroup_from_objcg(objcg);
> > + if (shrink_memcg(memcg)) {
> > + mem_cgroup_put(memcg);
> > + goto put_objcg;
> > + }
> > + mem_cgroup_put(memcg);
> > + }
> > +
> > + /*
> > + * Check zpool utilization against zswap limits:
> > + *
> > + * The zswap zpool utilization is also checked against the limits
> > + * just once, at the start of zswap_store(). If the check passes,
> > + * any breaches of the limits set by zswap_max_pages() or
> > + * zswap_accept_thr_pages() that may happen while storing this
> > + * folio, will only be detected during the next call to
> > + * zswap_store() by any process.
> > + */
>
> This is essentially a rephrased repetition of the last comment, just
> refer to it instead. Something like:
>
> /*
> * Check zpool utilization against zswap limits. The possibility of
> * going overlimit is the same as the cgroup limit check.
> */
Sure, I will make this change.
Thanks,
Kanchana
>
> > + if (zswap_check_limits())
> > + goto put_objcg;
> > +
> > + pool = zswap_pool_current_get();
> > + if (!pool)
> > + goto put_objcg;
> > +
> > + if (objcg) {
> > + memcg = get_mem_cgroup_from_objcg(objcg);
> > + if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> > + mem_cgroup_put(memcg);
> > + goto put_pool;
> > + }
> > + mem_cgroup_put(memcg);
> > + }
> > +
> > + /*
> > + * Store each page of the folio as a separate entry. If we fail to
> > + * store a page, unwind by deleting all the pages for this folio
> > + * currently in zswap.
> > + */
> > + for (index = 0; index < nr_pages; ++index) {
> > + if (!zswap_store_page(folio_page(folio, index), objcg, pool, tree, &compressed_bytes))
> > + goto put_pool;
> > + }
> > +
> > + if (objcg) {
> > + obj_cgroup_charge_zswap(objcg, compressed_bytes);
> > + count_objcg_events(objcg, ZSWPOUT, nr_pages);
> > + }
> > +
> > + atomic_long_add(nr_pages, &zswap_stored_pages);
> > + count_vm_events(ZSWPOUT, nr_pages);
> > +
> > + ret = true;
> > +
> > +put_pool:
> > + zswap_pool_put(pool);
> > +put_objcg:
> > obj_cgroup_put(objcg);
> > - if (zswap_pool_reached_full)
> > + if (!ret && zswap_pool_reached_full)
> > queue_work(shrink_wq, &zswap_shrink_work);
> > check_old:
> > /*
> > - * If the zswap store fails or zswap is disabled, we must invalidate the
> > - * possibly stale entry which was previously stored at this offset.
> > - * Otherwise, writeback could overwrite the new data in the swapfile.
> > + * If the zswap store fails or zswap is disabled, we must invalidate
> > + * the possibly stale entries which were previously stored at the
> > + * offsets corresponding to each page of the folio. Otherwise,
> > + * writeback could overwrite the new data in the swapfile.
> > */
> > - entry = xa_erase(tree, offset);
> > - if (entry)
> > - zswap_entry_free(entry);
> > - return false;
> > + if (!ret) {
> > + pgoff_t offset = swp_offset(swp);
> > + struct zswap_entry *entry;
> > +
> > + for (index = 0; index < nr_pages; ++index) {
> > + entry = xa_erase(tree, offset + index);
> > + if (entry)
> > + zswap_entry_free(entry);
> > + }
> > + }
> > +
> > + return ret;
> > }
> >
> > bool zswap_load(struct folio *folio)
> > --
> > 2.27.0
> >
* RE: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
2024-09-30 23:43 ` Nhat Pham
@ 2024-10-01 0:47 ` Sridhar, Kanchana P
0 siblings, 0 replies; 29+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-01 0:47 UTC (permalink / raw)
To: Nhat Pham, Yosry Ahmed
Cc: linux-kernel, linux-mm, hannes, chengming.zhou, usamaarif642,
shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, willy,
Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
Kanchana P
> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Monday, September 30, 2024 4:43 PM
> To: Yosry Ahmed <yosryahmed@google.com>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux-
> kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> willy@infradead.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
>
> On Mon, Sep 30, 2024 at 4:34 PM Yosry Ahmed <yosryahmed@google.com>
> wrote:
> >
> > On Mon, Sep 30, 2024 at 4:29 PM Nhat Pham <nphamcs@gmail.com>
> wrote:
> > >
> > > On Mon, Sep 30, 2024 at 4:20 PM Yosry Ahmed
> <yosryahmed@google.com> wrote:
> > > >
> > > > On Mon, Sep 30, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com>
> wrote:
> > > >
> > > > I suggested this in a previous version, and Kanchana faced some
> > > > complexities implementing it:
> > > >
> > > > https://lore.kernel.org/lkml/SJ0PR11MB56785027ED6FCF673A84CEE6C96A2@SJ0PR11MB5678.namprd11.prod.outlook.com/
> > >
> > > Sorry, I missed that conversation.
> > >
> > > >
> > > > Basically, if we batch get the refs after the store I think it's not
> > > > safe, because once an entry is published to writeback it can be
> > > > written back and freed, and a ref that we never acquired would be
> > > > dropped.
> > >
> > > Hmmm. I don't think writeback could touch any individual subpage just
> yet, no?
> > >
> > > Before doing any work, zswap writeback would attempt to add the
> > > subpage to the swap cache (via __read_swap_cache_async()). However,
> > > all subpage will have already been added to swap cache, and point to
> > > the (large) folio. So zswap_writeback_entry() should short circuit
> > > here (the if (!page_allocated) case).
> >
> > If it's safe to take the refs after all calls to zswap_store_page()
> > are successful, then yeah that should be possible, for both the pool
> > and objcg. I didn't look closely though.
> >
> > Just to clarify, you mean grab one ref first, then do the
> > compressions, then grab the remaining refs, right?
>
> Ah yeah, that's what I meant. We can either perform one of the
> following sequences: grab 1 -> grab nr -> drop 1, or grab 1 -> grab nr
> - 1 if successful, drop 1 if failed.
>
> Seems straightforward to me, but yeah it seems a bit hair-splitting of
> me to die on this hill :) Just thought it was weird seeing the other
> parts batchified, and one part wasn't.
>
> The rest LGTM - I'll defer to you and Johannes for further review.
>
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Thanks Nhat! Sure, this sounds good. I thought I would clarify my current
thinking on this. As Yosry also mentioned, batching the pool refs requires
some more thought. It is tied to the design approach of obtaining the
objcg/pool refs per page before adding them to the entry, which in turn
needs to happen before adding the entry to the xarray.
I think there is some value in doing this incrementally because, at any
point, if storing a page fails:
1) All of the folio's entries already in the xarray will have a valid
pool/objcg whose refs can be dropped when we unwind state.
2) There are no excess refs, obtained up front by batching, that would need
to be dropped (handling that might make the code a little messy, imho).
For comparison, a rough sketch of the batched ordering follows.
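This is only illustrative and not what the series does; zswap_pool_get_many()
below is a hypothetical helper, while obj_cgroup_get_many() is an existing API:

	for (index = 0; index < nr_pages; ++index) {
		/* zswap_store_page() would no longer take per-page refs here */
		if (!zswap_store_page(folio_page(folio, index), objcg, pool,
				      tree, &compressed_bytes))
			goto put_pool;
	}

	/*
	 * zswap_store() already holds one ref each on 'pool' and 'objcg';
	 * take the remaining nr_pages - 1 refs only after every page of
	 * the folio has been stored.
	 */
	if (nr_pages > 1) {
		zswap_pool_get_many(pool, nr_pages - 1);	/* hypothetical */
		if (objcg)
			obj_cgroup_get_many(objcg, nr_pages - 1);
	}

The unwind path would then have to special-case entries whose refs were
never taken, which is the messiness referred to in 2) above.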
Thanks,
Kanchana
* Re: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
2024-09-30 22:12 ` [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store() Kanchana P Sridhar
2024-09-30 23:09 ` Yosry Ahmed
2024-09-30 23:11 ` Nhat Pham
@ 2024-10-01 1:14 ` Johannes Weiner
2024-10-01 1:53 ` Sridhar, Kanchana P
2 siblings, 1 reply; 29+ messages in thread
From: Johannes Weiner @ 2024-10-01 1:14 UTC (permalink / raw)
To: Kanchana P Sridhar
Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
akpm, willy, nanhai.zou, wajdi.k.feghali, vinodh.gopal
On Mon, Sep 30, 2024 at 03:12:20PM -0700, Kanchana P Sridhar wrote:
> /*********************************
> * main API
> **********************************/
> -bool zswap_store(struct folio *folio)
> +
> +/*
> + * Stores the page at specified "index" in a folio.
There is no more index and no folio in this function.
> + *
> + * @page: The page to store in zswap.
> + * @objcg: The folio's objcg. Caller has a reference.
> + * @pool: The zswap_pool to store the compressed data for the page.
> + * The caller should have obtained a reference to a valid
> + * zswap_pool by calling zswap_pool_tryget(), to pass as this
> + * argument.
> + * @tree: The xarray for the @page's folio's swap.
This doesn't look safe.
If the entries were to span a SWAP_ADDRESS_SPACE_SHIFT boundary, the
subpage entries would need to be spread out to different trees also.
Otherwise, it would break loading and writeback down the line.
I *think* it works right now, but at best it's not very future proof.
Please look up the tree inside the function for the specific
swp_entry_t that is being stored.
Same for the unwind/check_old: section.
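As a rough sketch of what I mean (illustrative only, not a tested diff):

	static bool zswap_store_page(struct page *page,
				     struct obj_cgroup *objcg,
				     struct zswap_pool *pool,
				     size_t *compressed_bytes)
	{
		swp_entry_t page_swpentry = page_swap_entry(page);
		/* derive the tree from this page's own swap entry */
		struct xarray *tree = swap_zswap_tree(page_swpentry);
		struct zswap_entry *entry, *old;

		...

		old = xa_store(tree, swp_offset(page_swpentry), entry, GFP_KERNEL);
		...
	}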
> + * @compressed_bytes: The compressed entry->length value is added
> + * to this, so that the caller can get the total
> + * compressed lengths of all sub-pages in a folio.
> + */
With just one caller, IMO the function comment can be dropped...
> /* allocate entry */
> - entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> + entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(page_folio(page)));
page_to_nid() is safe to use here.
> +bool zswap_store(struct folio *folio)
> +{
> + long nr_pages = folio_nr_pages(folio);
> + swp_entry_t swp = folio->swap;
> + struct xarray *tree = swap_zswap_tree(swp);
> + struct obj_cgroup *objcg = NULL;
> + struct mem_cgroup *memcg = NULL;
> + struct zswap_pool *pool;
> + size_t compressed_bytes = 0;
> + bool ret = false;
> + long index;
> +
> + VM_WARN_ON_ONCE(!folio_test_locked(folio));
> + VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> +
> + if (!zswap_enabled)
> + goto check_old;
> +
> + /*
> + * Check cgroup zswap limits:
> + *
> + * The cgroup zswap limit check is done once at the beginning of
> + * zswap_store(). The cgroup charging is done once, at the end
> + * of a successful folio store. What this means is, if the cgroup
> + * was within the zswap_max limit at the beginning of a large folio
> + * store, it could go over the limit by at most (HPAGE_PMD_NR - 1)
> + * pages due to this store.
> + */
> + objcg = get_obj_cgroup_from_folio(folio);
> + if (objcg && !obj_cgroup_may_zswap(objcg)) {
> + memcg = get_mem_cgroup_from_objcg(objcg);
> + if (shrink_memcg(memcg)) {
> + mem_cgroup_put(memcg);
> + goto put_objcg;
> + }
> + mem_cgroup_put(memcg);
> + }
> +
> + /*
> + * Check zpool utilization against zswap limits:
> + *
> + * The zswap zpool utilization is also checked against the limits
> + * just once, at the start of zswap_store(). If the check passes,
> + * any breaches of the limits set by zswap_max_pages() or
> + * zswap_accept_thr_pages() that may happen while storing this
> + * folio, will only be detected during the next call to
> + * zswap_store() by any process.
> + */
> + if (zswap_check_limits())
> + goto put_objcg;
There has been some back and forth on those comments. Both checks are
non-atomic and subject to races, so mentioning the HPAGE_PMD_NR - 1
overrun is somewhat misleading - it's much higher in the worst case.
Honestly, I would just get rid of the comments. You're not changing
anything fundamental in this regard, so I don't think there is a
burden to add new comments either.
> +
> + pool = zswap_pool_current_get();
> + if (!pool)
> + goto put_objcg;
> +
> + if (objcg) {
> + memcg = get_mem_cgroup_from_objcg(objcg);
> + if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> + mem_cgroup_put(memcg);
> + goto put_pool;
> + }
> + mem_cgroup_put(memcg);
> + }
> +
> + /*
> + * Store each page of the folio as a separate entry. If we fail to
> + * store a page, unwind by deleting all the pages for this folio
> + * currently in zswap.
> + */
The first sentence explains something that is internal to
zswap_store_page(). The second sentence IMO is obvious from the code
itself. I think you can delete this comment.
> + for (index = 0; index < nr_pages; ++index) {
> + if (!zswap_store_page(folio_page(folio, index), objcg, pool, tree, &compressed_bytes))
> + goto put_pool;
Hah, I'm not a stickler for the 80 column line limit, but this is
pushing it ;)
Please grab the page up front.
Yosry had also suggested replacing the compressed_bytes return
parameter with an actual return value. Basically, return compressed
bytes on success, -errno on error. I think this comment was missed
among the page_swap_entry() discussion.
for (index = 0; index < nr_pages; index++) {
struct page *page = folio_page(folio, index);
int bytes;
bytes = zswap_store_page(page, object, pool, tree);
if (bytes < 0)
goto put_pool;
total_bytes += bytes;
}
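And the callee would be shaped roughly like this (a sketch only; the exact
error codes are placeholders):

	static int zswap_store_page(struct page *page,
				    struct obj_cgroup *objcg,
				    struct zswap_pool *pool,
				    struct xarray *tree)
	{
		struct zswap_entry *entry;

		entry = zswap_entry_cache_alloc(GFP_KERNEL, page_to_nid(page));
		if (!entry) {
			zswap_reject_kmemcache_fail++;
			return -ENOMEM;
		}

		/* ... compress, xa_store, entry init as before, with
		 * goto compress_failed / store_failed on errors ... */

		return entry->length;	/* compressed bytes on success */

	store_failed:
		zpool_free(pool->zpool, entry->handle);
	compress_failed:
		zswap_entry_cache_free(entry);
		return -EINVAL;		/* placeholder errno */
	}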
* RE: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
2024-10-01 1:14 ` Johannes Weiner
@ 2024-10-01 1:53 ` Sridhar, Kanchana P
0 siblings, 0 replies; 29+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-01 1:53 UTC (permalink / raw)
To: Johannes Weiner
Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
akpm, willy, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh,
Sridhar, Kanchana P
> -----Original Message-----
> From: Johannes Weiner <hannes@cmpxchg.org>
> Sent: Monday, September 30, 2024 6:14 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> yosryahmed@google.com; nphamcs@gmail.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> willy@infradead.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
>
> On Mon, Sep 30, 2024 at 03:12:20PM -0700, Kanchana P Sridhar wrote:
> > /*********************************
> > * main API
> > **********************************/
> > -bool zswap_store(struct folio *folio)
> > +
> > +/*
> > + * Stores the page at specified "index" in a folio.
>
> There is no more index and no folio in this function.
>
> > + *
> > + * @page: The page to store in zswap.
> > + * @objcg: The folio's objcg. Caller has a reference.
> > + * @pool: The zswap_pool to store the compressed data for the page.
> > + * The caller should have obtained a reference to a valid
> > + * zswap_pool by calling zswap_pool_tryget(), to pass as this
> > + * argument.
> > + * @tree: The xarray for the @page's folio's swap.
>
> This doesn't look safe.
>
> If the entries were to span a SWAP_ADDRESS_SPACE_SHIFT boundary, the
> subpage entries would need to be spread out to different trees also.
> Otherwise, it would break loading and writeback down the line.
>
> I *think* it works right now, but at best it's not very future proof.
> Please look up the tree inside the function for the specific
> swp_entry_t that is being stored.
>
> Same for the unwind/check_old: section.
>
> > + * @compressed_bytes: The compressed entry->length value is added
> > + * to this, so that the caller can get the total
> > + * compressed lengths of all sub-pages in a folio.
> > + */
>
> With just one caller, IMO the function comment can be dropped...
>
> > /* allocate entry */
> > - entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
> > + entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(page_folio(page)));
>
> page_to_nid() is safe to use here.
>
> > +bool zswap_store(struct folio *folio)
> > +{
> > + long nr_pages = folio_nr_pages(folio);
> > + swp_entry_t swp = folio->swap;
> > + struct xarray *tree = swap_zswap_tree(swp);
> > + struct obj_cgroup *objcg = NULL;
> > + struct mem_cgroup *memcg = NULL;
> > + struct zswap_pool *pool;
> > + size_t compressed_bytes = 0;
> > + bool ret = false;
> > + long index;
> > +
> > + VM_WARN_ON_ONCE(!folio_test_locked(folio));
> > + VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> > +
> > + if (!zswap_enabled)
> > + goto check_old;
> > +
> > + /*
> > + * Check cgroup zswap limits:
> > + *
> > + * The cgroup zswap limit check is done once at the beginning of
> > + * zswap_store(). The cgroup charging is done once, at the end
> > + * of a successful folio store. What this means is, if the cgroup
> > + * was within the zswap_max limit at the beginning of a large folio
> > + * store, it could go over the limit by at most (HPAGE_PMD_NR - 1)
> > + * pages due to this store.
> > + */
> > + objcg = get_obj_cgroup_from_folio(folio);
> > + if (objcg && !obj_cgroup_may_zswap(objcg)) {
> > + memcg = get_mem_cgroup_from_objcg(objcg);
> > + if (shrink_memcg(memcg)) {
> > + mem_cgroup_put(memcg);
> > + goto put_objcg;
> > + }
> > + mem_cgroup_put(memcg);
> > + }
> > +
> > + /*
> > + * Check zpool utilization against zswap limits:
> > + *
> > + * The zswap zpool utilization is also checked against the limits
> > + * just once, at the start of zswap_store(). If the check passes,
> > + * any breaches of the limits set by zswap_max_pages() or
> > + * zswap_accept_thr_pages() that may happen while storing this
> > + * folio, will only be detected during the next call to
> > + * zswap_store() by any process.
> > + */
> > + if (zswap_check_limits())
> > + goto put_objcg;
>
> There has been some back and forth on those comments. Both checks are
> non-atomic and subject to races, so mentioning the HPAGE_PMD_NR - 1
> overrun is somewhat misleading - it's much higher in the worst case.
>
> Honestly, I would just get rid of the comments. You're not changing
> anything fundamental in this regard, so I don't think there is a
> burden to add new comments either.
>
> > +
> > + pool = zswap_pool_current_get();
> > + if (!pool)
> > + goto put_objcg;
> > +
> > + if (objcg) {
> > + memcg = get_mem_cgroup_from_objcg(objcg);
> > + if (memcg_list_lru_alloc(memcg, &zswap_list_lru, GFP_KERNEL)) {
> > + mem_cgroup_put(memcg);
> > + goto put_pool;
> > + }
> > + mem_cgroup_put(memcg);
> > + }
> > +
> > + /*
> > + * Store each page of the folio as a separate entry. If we fail to
> > + * store a page, unwind by deleting all the pages for this folio
> > + * currently in zswap.
> > + */
>
> The first sentence explains something that is internal to
> zswap_store_page(). The second sentence IMO is obvious from the code
> itself. I think you can delete this comment.
>
> > + for (index = 0; index < nr_pages; ++index) {
> > + if (!zswap_store_page(folio_page(folio, index), objcg, pool, tree, &compressed_bytes))
> > + goto put_pool;
>
> Hah, I'm not a stickler for the 80 column line limit, but this is
> pushing it ;)
>
> Please grab the page up front.
>
> Yosry had also suggested replacing the compressed_bytes return
> parameter with an actual return value. Basically, return compressed
> bytes on success, -errno on error. I think this comment was missed
> among the page_swap_entry() discussion.
>
> for (index = 0; index < nr_pages; index++) {
> struct page *page = folio_page(folio, index);
> int bytes;
>
> bytes = zswap_store_page(page, object, pool, tree);
> if (bytes < 0)
> goto put_pool;
> total_bytes += bytes;
> }
Thanks Johannes! Appreciate your detailed code review comments.
I will incorporate all the comments and submit v10.
Thanks,
Kanchana
* Re: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
2024-10-01 0:35 ` Sridhar, Kanchana P
@ 2024-10-01 6:00 ` Yosry Ahmed
2024-10-01 16:58 ` Sridhar, Kanchana P
0 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2024-10-01 6:00 UTC (permalink / raw)
To: Sridhar, Kanchana P
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
akpm, willy, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh
[..]
> > > store_failed:
> > > zpool_free(entry->pool->zpool, entry->handle);
> > > -put_pool:
> > > - zswap_pool_put(entry->pool);
> > > -freepage:
> > > +put_pool_objcg:
> > > + zswap_pool_put(pool);
> > > + obj_cgroup_put(objcg);
> >
> > I think if we reorder the function we can drop these calls, make the
> > comments positioned a bit better, and centralize the entry
> > initializations. I am also not a fan of passing a semi-initialized
> > entry to zswap_compress() to get the pool pointer.
> >
> > Does the following diff improve things or did I miss something?
>
> We shouldn’t be adding the entry to the xarray before initializing its pool
> and objcg, right? Please let me know if I am misunderstanding what you're
> proposing in the diff.
It should be safe. We already initialize entry->lru after we insert
the entry in the tree. See the comment above the call to
zswap_lru_add(). Basically we are protected against concurrent
stores/loads through the folio lock, and are protected against
writeback because the entry is not on the LRU yet.
>
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index b74c8de996468..eac1f846886a6 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -881,7 +881,8 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
> > return 0;
> > }
> >
> > -static bool zswap_compress(struct page *page, struct zswap_entry *entry)
> > +static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > + struct zswap_pool *pool)
> > {
> > struct crypto_acomp_ctx *acomp_ctx;
> > struct scatterlist input, output;
> > @@ -893,7 +894,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry)
> > gfp_t gfp;
> > u8 *dst;
> >
> > - acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > + acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> >
> > mutex_lock(&acomp_ctx->mutex);
> >
> > @@ -926,7 +927,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry)
> > if (comp_ret)
> > goto unlock;
> >
> > - zpool = entry->pool->zpool;
> > + zpool = pool->zpool;
> > gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
> > if (zpool_malloc_support_movable(zpool))
> > gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
> > @@ -1435,23 +1436,11 @@ static bool zswap_store_page(struct page *page,
> > entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(page_folio(page)));
> > if (!entry) {
> > zswap_reject_kmemcache_fail++;
> > - goto reject;
> > + return false;
> > }
> >
> > - /* zswap_store() already holds a ref on 'objcg' and 'pool' */
> > - if (objcg)
> > - obj_cgroup_get(objcg);
> > - zswap_pool_get(pool);
> > -
> > - /* if entry is successfully added, it keeps the reference */
> > - entry->pool = pool;
> > -
> > - if (!zswap_compress(page, entry))
> > - goto put_pool_objcg;
> > -
> > - entry->swpentry = page_swap_entry(page);
> > - entry->objcg = objcg;
> > - entry->referenced = true;
> > + if (!zswap_compress(page, entry, pool))
> > + goto compress_failed;
> >
> > old = xa_store(tree, swp_offset(entry->swpentry), entry, GFP_KERNEL);
> > if (xa_is_err(old)) {
> > @@ -1470,6 +1459,16 @@ static bool zswap_store_page(struct page *page,
> > if (old)
> > zswap_entry_free(old);
> >
> > + /*
> > + * The entry is successfully compressed and stored in the tree, there is
> > + * no further possibility of failure. Grab refs to the pool and objcg.
> > + * These refs will be dropped by zswap_entry_free() when the entry is
> > + * removed from the tree.
> > + */
> > + zswap_pool_get(pool);
> > + if (objcg)
> > + obj_cgroup_get(objcg);
> > +
> > /*
> > * We finish initializing the entry while it's already in xarray.
> > * This is safe because:
> > @@ -1480,26 +1479,22 @@ static bool zswap_store_page(struct page *page,
> > * The publishing order matters to prevent writeback from seeing
> > * an incoherent entry.
> > */
I am referring to the comment here ^
> > + entry->pool = pool;
> > + entry->swpentry = page_swap_entry(page);
> > + entry->objcg = objcg;
> > + entry->referenced = true;
> > if (entry->length) {
> > *compressed_bytes += entry->length;
> > INIT_LIST_HEAD(&entry->lru);
> > zswap_lru_add(&zswap_list_lru, entry);
> > }
* RE: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
2024-10-01 6:00 ` Yosry Ahmed
@ 2024-10-01 16:58 ` Sridhar, Kanchana P
2024-10-01 17:01 ` Yosry Ahmed
0 siblings, 1 reply; 29+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-01 16:58 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
akpm, willy, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh,
Sridhar, Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Monday, September 30, 2024 11:00 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; willy@infradead.org; Zou, Nanhai <nanhai.zou@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
>
> [..]
> > > > store_failed:
> > > > zpool_free(entry->pool->zpool, entry->handle);
> > > > -put_pool:
> > > > - zswap_pool_put(entry->pool);
> > > > -freepage:
> > > > +put_pool_objcg:
> > > > + zswap_pool_put(pool);
> > > > + obj_cgroup_put(objcg);
> > >
> > > I think if we reorder the function we can drop these calls, make the
> > > comments positioned a bit better, and centralize the entry
> > > initializations. I am also not a fan of passing a semi-initialized
> > > entry to zswap_compress() to get the pool pointer.
> > >
> > > Does the following diff improve things or did I miss something?
> >
> > We shouldn’t be adding the entry to the xarray before initializing its pool
> > and objcg, right? Please let me know if I am misunderstanding what you're
> > proposing in the diff.
>
> It should be safe. We already initialize entry->lru after we insert
> the entry in the tree. See the comment above the call to
> zswap_lru_add(). Basically we are protected against concurrent
> stores/loads through the folio lock, and are protected against
> writeback because the entry is not on the LRU yet.
Thanks for the clarification, Yosry. Since this changes the entry
initialization with respect to the mainline, is it OK if this is done in a follow-up patch?
Thanks,
Kanchana
>
> >
> > >
> > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > index b74c8de996468..eac1f846886a6 100644
> > > --- a/mm/zswap.c
> > > +++ b/mm/zswap.c
> > > @@ -881,7 +881,8 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
> > > return 0;
> > > }
> > >
> > > -static bool zswap_compress(struct page *page, struct zswap_entry *entry)
> > > +static bool zswap_compress(struct page *page, struct zswap_entry *entry,
> > > + struct zswap_pool *pool)
> > > {
> > > struct crypto_acomp_ctx *acomp_ctx;
> > > struct scatterlist input, output;
> > > @@ -893,7 +894,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry)
> > > gfp_t gfp;
> > > u8 *dst;
> > >
> > > - acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > > + acomp_ctx = raw_cpu_ptr(pool->acomp_ctx);
> > >
> > > mutex_lock(&acomp_ctx->mutex);
> > >
> > > @@ -926,7 +927,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry)
> > > if (comp_ret)
> > > goto unlock;
> > >
> > > - zpool = entry->pool->zpool;
> > > + zpool = pool->zpool;
> > > gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;
> > > if (zpool_malloc_support_movable(zpool))
> > > gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
> > > @@ -1435,23 +1436,11 @@ static bool zswap_store_page(struct page *page,
> > > entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(page_folio(page)));
> > > if (!entry) {
> > > zswap_reject_kmemcache_fail++;
> > > - goto reject;
> > > + return false;
> > > }
> > >
> > > - /* zswap_store() already holds a ref on 'objcg' and 'pool' */
> > > - if (objcg)
> > > - obj_cgroup_get(objcg);
> > > - zswap_pool_get(pool);
> > > -
> > > - /* if entry is successfully added, it keeps the reference */
> > > - entry->pool = pool;
> > > -
> > > - if (!zswap_compress(page, entry))
> > > - goto put_pool_objcg;
> > > -
> > > - entry->swpentry = page_swap_entry(page);
> > > - entry->objcg = objcg;
> > > - entry->referenced = true;
> > > + if (!zswap_compress(page, entry, pool))
> > > + goto compress_failed;
> > >
> > > old = xa_store(tree, swp_offset(entry->swpentry), entry, GFP_KERNEL);
> > > if (xa_is_err(old)) {
> > > @@ -1470,6 +1459,16 @@ static bool zswap_store_page(struct page *page,
> > > if (old)
> > > zswap_entry_free(old);
> > >
> > > + /*
> > > + * The entry is successfully compressed and stored in the tree, there is
> > > + * no further possibility of failure. Grab refs to the pool and objcg.
> > > + * These refs will be dropped by zswap_entry_free() when the entry is
> > > + * removed from the tree.
> > > + */
> > > + zswap_pool_get(pool);
> > > + if (objcg)
> > > + obj_cgroup_get(objcg);
> > > +
> > > /*
> > > * We finish initializing the entry while it's already in xarray.
> > > * This is safe because:
> > > @@ -1480,26 +1479,22 @@ static bool zswap_store_page(struct page *page,
> > > * The publishing order matters to prevent writeback from seeing
> > > * an incoherent entry.
> > > */
>
> I am referring to the comment here ^
>
> > > + entry->pool = pool;
> > > + entry->swpentry = page_swap_entry(page);
> > > + entry->objcg = objcg;
> > > + entry->referenced = true;
> > > if (entry->length) {
> > > *compressed_bytes += entry->length;
> > > INIT_LIST_HEAD(&entry->lru);
> > > zswap_lru_add(&zswap_list_lru, entry);
> > > }
* Re: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
2024-10-01 16:58 ` Sridhar, Kanchana P
@ 2024-10-01 17:01 ` Yosry Ahmed
2024-10-01 17:09 ` Sridhar, Kanchana P
0 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2024-10-01 17:01 UTC (permalink / raw)
To: Sridhar, Kanchana P
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
akpm, willy, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh
On Tue, Oct 1, 2024 at 9:58 AM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@google.com>
> > Sent: Monday, September 30, 2024 11:00 PM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> > foundation.org; willy@infradead.org; Zou, Nanhai <nanhai.zou@intel.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
> >
> > [..]
> > > > > store_failed:
> > > > > zpool_free(entry->pool->zpool, entry->handle);
> > > > > -put_pool:
> > > > > - zswap_pool_put(entry->pool);
> > > > > -freepage:
> > > > > +put_pool_objcg:
> > > > > + zswap_pool_put(pool);
> > > > > + obj_cgroup_put(objcg);
> > > >
> > > > I think if we reorder the function we can drop these calls, make the
> > > > comments positioned a bit better, and centralize the entry
> > > > initializations. I am also not a fan of passing a semi-initialized
> > > > entry to zswap_compress() to get the pool pointer.
> > > >
> > > > Does the following diff improve things or did I miss something?
> > >
> > > We shouldn’t be adding the entry to the xarray before initializing its pool
> > > and objcg, right? Please let me know if I am misunderstanding what you're
> > > proposing in the diff.
> >
> > It should be safe. We already initialize entry->lru after we insert
> > the entry in the tree. See the comment above the call to
> > zswap_lru_add(). Basically we are protected against concurrent
> > stores/loads through the folio lock, and are protected against
> > writeback because the entry is not on the LRU yet.
>
> Thanks for the clarification, Yosry. Since this is a change in the entry
> initialization wrt the mainline, is it Ok if this is done in a follow-up patch?
Sure. We can discuss it separately. Do you want me to send a patch or
do you intend to?
* RE: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
2024-10-01 17:01 ` Yosry Ahmed
@ 2024-10-01 17:09 ` Sridhar, Kanchana P
2024-10-01 21:53 ` Sridhar, Kanchana P
0 siblings, 1 reply; 29+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-01 17:09 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
akpm, willy, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh,
Sridhar, Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, October 1, 2024 10:01 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; willy@infradead.org; Zou, Nanhai <nanhai.zou@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
>
> On Tue, Oct 1, 2024 at 9:58 AM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > > -----Original Message-----
> > > From: Yosry Ahmed <yosryahmed@google.com>
> > > Sent: Monday, September 30, 2024 11:00 PM
> > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > hannes@cmpxchg.org; nphamcs@gmail.com;
> chengming.zhou@linux.dev;
> > > usamaarif642@gmail.com; shakeel.butt@linux.dev;
> ryan.roberts@arm.com;
> > > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> > > foundation.org; willy@infradead.org; Zou, Nanhai
> <nanhai.zou@intel.com>;
> > > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > > <vinodh.gopal@intel.com>
> > > Subject: Re: [PATCH v9 6/7] mm: zswap: Support large folios in
> zswap_store().
> > >
> > > [..]
> > > > > > store_failed:
> > > > > > zpool_free(entry->pool->zpool, entry->handle);
> > > > > > -put_pool:
> > > > > > - zswap_pool_put(entry->pool);
> > > > > > -freepage:
> > > > > > +put_pool_objcg:
> > > > > > + zswap_pool_put(pool);
> > > > > > + obj_cgroup_put(objcg);
> > > > >
> > > > > I think if we reorder the function we can drop these calls, make the
> > > > > comments positioned a bit better, and centralize the entry
> > > > > initializations. I am also not a fan of passing a semi-initialized
> > > > > entry to zswap_compress() to get the pool pointer.
> > > > >
> > > > > Does the following diff improve things or did I miss something?
> > > >
> > > > We shouldn’t be adding the entry to the xarray before initializing its pool
> > > > and objcg, right? Please let me know if I am misunderstanding what
> you're
> > > > proposing in the diff.
> > >
> > > It should be safe. We already initialize entry->lru after we insert
> > > the entry in the tree. See the comment above the call to
> > > zswap_lru_add(). Basically we are protected against concurrent
> > > stores/loads through the folio lock, and are protected against
> > > writeback because the entry is not on the LRU yet.
> >
> > Thanks for the clarification, Yosry. Since this is a change in the entry
> > initialization wrt the mainline, is it Ok if this is done in a follow-up patch?
>
> Sure. We can discuss it separately. Do you want me to send a patch or
> do you intend to?
Thanks Yosry! I will send the patch separately.
* RE: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
2024-10-01 17:09 ` Sridhar, Kanchana P
@ 2024-10-01 21:53 ` Sridhar, Kanchana P
2024-10-01 23:38 ` Yosry Ahmed
0 siblings, 1 reply; 29+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-01 21:53 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
akpm, willy, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh,
Sridhar, Kanchana P
> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Tuesday, October 1, 2024 10:09 AM
> To: Yosry Ahmed <yosryahmed@google.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; willy@infradead.org; Zou, Nanhai <nanhai.zou@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>; Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com>
> Subject: RE: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
>
> > -----Original Message-----
> > From: Yosry Ahmed <yosryahmed@google.com>
> > Sent: Tuesday, October 1, 2024 10:01 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> > usamaarif642@gmail.com; shakeel.butt@linux.dev;
> ryan.roberts@arm.com;
> > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> > foundation.org; willy@infradead.org; Zou, Nanhai <nanhai.zou@intel.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v9 6/7] mm: zswap: Support large folios in
> zswap_store().
> >
> > On Tue, Oct 1, 2024 at 9:58 AM Sridhar, Kanchana P
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > > > -----Original Message-----
> > > > From: Yosry Ahmed <yosryahmed@google.com>
> > > > Sent: Monday, September 30, 2024 11:00 PM
> > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > > > hannes@cmpxchg.org; nphamcs@gmail.com;
> > chengming.zhou@linux.dev;
> > > > usamaarif642@gmail.com; shakeel.butt@linux.dev;
> > ryan.roberts@arm.com;
> > > > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com;
> akpm@linux-
> > > > foundation.org; willy@infradead.org; Zou, Nanhai
> > <nanhai.zou@intel.com>;
> > > > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > > > <vinodh.gopal@intel.com>
> > > > Subject: Re: [PATCH v9 6/7] mm: zswap: Support large folios in
> > zswap_store().
> > > >
> > > > [..]
> > > > > > > store_failed:
> > > > > > > zpool_free(entry->pool->zpool, entry->handle);
> > > > > > > -put_pool:
> > > > > > > - zswap_pool_put(entry->pool);
> > > > > > > -freepage:
> > > > > > > +put_pool_objcg:
> > > > > > > + zswap_pool_put(pool);
> > > > > > > + obj_cgroup_put(objcg);
> > > > > >
> > > > > > I think if we reorder the function we can drop these calls, make the
> > > > > > comments positioned a bit better, and centralize the entry
> > > > > > initializations. I am also not a fan of passing a semi-initialized
> > > > > > entry to zswap_compress() to get the pool pointer.
> > > > > >
> > > > > > Does the following diff improve things or did I miss something?
> > > > >
> > > > > We shouldn’t be adding the entry to the xarray before initializing its
> pool
> > > > > and objcg, right? Please let me know if I am misunderstanding what
> > you're
> > > > > proposing in the diff.
> > > >
> > > > It should be safe. We already initialize entry->lru after we insert
> > > > the entry in the tree. See the comment above the call to
> > > > zswap_lru_add(). Basically we are protected against concurrent
> > > > stores/loads through the folio lock, and are protected against
> > > > writeback because the entry is not on the LRU yet.
> > >
> > > Thanks for the clarification, Yosry. Since this is a change in the entry
> > > initialization wrt the mainline, is it Ok if this is done in a follow-up patch?
> >
> > Sure. We can discuss it separately. Do you want me to send a patch or
> > do you intend to?
>
> Thanks Yosry! I will send the patch separately.
Hi Yosry,
I am preparing the follow-up patch so I can submit it once this patch-series
is merged to mm-unstable (since these changes depend on my existing patch).
Is my understanding correct that the folio lock also protects against swapoff
happening between the addition of the entry to the xarray and the
initialization of its members, which need to be valid for the
swapoff --> ... --> free_swap_slot() --> zswap_invalidate() path? I would
appreciate it if you could confirm.
Thanks,
Kanchana
* Re: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
2024-10-01 21:53 ` Sridhar, Kanchana P
@ 2024-10-01 23:38 ` Yosry Ahmed
2024-10-01 23:44 ` Sridhar, Kanchana P
0 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2024-10-01 23:38 UTC (permalink / raw)
To: Sridhar, Kanchana P
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
akpm, willy, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh
[..]
> > > > > > > > store_failed:
> > > > > > > > zpool_free(entry->pool->zpool, entry->handle);
> > > > > > > > -put_pool:
> > > > > > > > - zswap_pool_put(entry->pool);
> > > > > > > > -freepage:
> > > > > > > > +put_pool_objcg:
> > > > > > > > + zswap_pool_put(pool);
> > > > > > > > + obj_cgroup_put(objcg);
> > > > > > >
> > > > > > > I think if we reorder the function we can drop these calls, make the
> > > > > > > comments positioned a bit better, and centralize the entry
> > > > > > > initializations. I am also not a fan of passing a semi-initialized
> > > > > > > entry to zswap_compress() to get the pool pointer.
> > > > > > >
> > > > > > > Does the following diff improve things or did I miss something?
> > > > > >
> > > > > > We shouldn’t be adding the entry to the xarray before initializing its
> > pool
> > > > > > and objcg, right? Please let me know if I am misunderstanding what
> > > you're
> > > > > > proposing in the diff.
> > > > >
> > > > > It should be safe. We already initialize entry->lru after we insert
> > > > > the entry in the tree. See the comment above the call to
> > > > > zswap_lru_add(). Basically we are protected against concurrent
> > > > > stores/loads through the folio lock, and are protected against
> > > > > writeback because the entry is not on the LRU yet.
> > > >
> > > > Thanks for the clarification, Yosry. Since this is a change in the entry
> > > > initialization wrt the mainline, is it Ok if this is done in a follow-up patch?
> > >
> > > Sure. We can discuss it separately. Do you want me to send a patch or
> > > do you intend to?
> >
> > Thanks Yosry! I will send the patch separately.
>
> Hi Yosry,
>
> I am preparing the follow-up patch so I can submit it once this patch-series is
> merged to mm-unstable (since these changes have dependencies on my
> existing patch).
>
> Is my understanding correct that the folio lock also protects against swapoff
> happening in between addition of the entry to the xarray and initializing its
> members, which will need to be valid for
> swapoff --> ... -> free_swap_slot() --> zswap_invalidate() ? Would appreciate
> it if you can confirm.
Yes, the folio lock should protect against swapoff, as the folio must
be in the swapcache.
For shmem, try_to_unuse() (called by swapoff()) will end up calling
shmem_swapin_folio(), which will look up the folio in the swapcache,
find it, then lock it before proceeding to delete it from the swap
cache and ultimately freeing the swap entry.
For anonymous memory, try_to_unuse() will call unuse_mm() -> .. ->
unuse_pte_range(), which will also look up the folio and lock it before
deleting it from the swap cache and freeing the entry.
try_to_unuse() will also loop over any remaining swapcache entries,
lock the folios, and then try to free their swap entries.
* RE: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
2024-10-01 23:38 ` Yosry Ahmed
@ 2024-10-01 23:44 ` Sridhar, Kanchana P
0 siblings, 0 replies; 29+ messages in thread
From: Sridhar, Kanchana P @ 2024-10-01 23:44 UTC (permalink / raw)
To: Yosry Ahmed
Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou,
usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
akpm, willy, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh,
Sridhar, Kanchana P
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, October 1, 2024 4:39 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; willy@infradead.org; Zou, Nanhai <nanhai.zou@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v9 6/7] mm: zswap: Support large folios in zswap_store().
>
> [..]
> > > > > > > > > store_failed:
> > > > > > > > > zpool_free(entry->pool->zpool, entry->handle);
> > > > > > > > > -put_pool:
> > > > > > > > > - zswap_pool_put(entry->pool);
> > > > > > > > > -freepage:
> > > > > > > > > +put_pool_objcg:
> > > > > > > > > + zswap_pool_put(pool);
> > > > > > > > > + obj_cgroup_put(objcg);
> > > > > > > >
> > > > > > > > I think if we reorder the function we can drop these calls, make
> the
> > > > > > > > comments positioned a bit better, and centralize the entry
> > > > > > > > initializations. I am also not a fan of passing a semi-initialized
> > > > > > > > entry to zswap_compress() to get the pool pointer.
> > > > > > > >
> > > > > > > > Does the following diff improve things or did I miss something?
> > > > > > >
> > > > > > > We shouldn’t be adding the entry to the xarray before initializing its
> > > pool
> > > > > > > and objcg, right? Please let me know if I am misunderstanding
> what
> > > > you're
> > > > > > > proposing in the diff.
> > > > > >
> > > > > > It should be safe. We already initialize entry->lru after we insert
> > > > > > the entry in the tree. See the comment above the call to
> > > > > > zswap_lru_add(). Basically we are protected against concurrent
> > > > > > stores/loads through the folio lock, and are protected against
> > > > > > writeback because the entry is not on the LRU yet.
> > > > >
> > > > > Thanks for the clarification, Yosry. Since this is a change in the entry
> > > > > initialization wrt the mainline, is it Ok if this is done in a follow-up
> patch?
> > > >
> > > > Sure. We can discuss it separately. Do you want me to send a patch or
> > > > do you intend to?
> > >
> > > Thanks Yosry! I will send the patch separately.
> >
> > Hi Yosry,
> >
> > I am preparing the follow-up patch so I can submit it once this patch-series
> is
> > merged to mm-unstable (since these changes have dependencies on my
> > existing patch).
> >
> > Is my understanding correct that the folio lock also protects against swapoff
> > happening in between addition of the entry to the xarray and initializing its
> > members, which will need to be valid for
> > swapoff --> ... -> free_swap_slot() --> zswap_invalidate() ? Would
> appreciate
> > it if you can confirm.
>
> Yes, the folio lock should protect against swapoff, as the folio must
> be in the swapcache.
>
> For shmem, try_to_unuse() (called by swapoff()) will end up calling
> shmem_swapin_folio(), which will lookup the folio in the swapcache,
> find it, then lock it before proceeding to delete it from the swap
> cache and ultimately freeing the swap entry.
>
> For anonymous memory, try_to_unuse() will call unuse_mm() -> .. ->
> unuse_pte_range(), which will also lookup the folio and lock it before
> deleting it from the swap cache and freeing the entry.
>
> try_to_unuse() will also loop over any remaining swapcache entries,
> lock the folios and then try to free the swap entry.
Sounds good Yosry. Thanks for the explanations!
Thanks,
Kanchana