* [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
@ 2024-09-24 1:17 Kanchana P Sridhar
2024-09-24 1:17 ` [PATCH v7 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
` (9 more replies)
0 siblings, 10 replies; 79+ messages in thread
From: Kanchana P Sridhar @ 2024-09-24 1:17 UTC (permalink / raw)
To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
ying.huang, 21cnbao, akpm
Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar
Hi All,
This patch-series enables zswap_store() to accept and store mTHP
folios. The most significant contribution in this series is from the
earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been
migrated to mm-unstable as of 9-23-2024 in patches 5,6 of this series.
[1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting
https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
Additionally, there is an attempt to modularize some of the functionality
in zswap_store(), to make it more amenable to supporting any-order
mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry
in the xarray. Likewise, zswap_delete_stored_offsets() can be used to
delete all offsets corresponding to a higher order folio stored in zswap.
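The intended composition of these helpers can be sketched as a userspace
analog (Python used purely for illustration — this is not the kernel code;
"compress", "tree" and the offset arithmetic are stand-ins for
zswap_compress(), the zswap xarray, and swap offsets):

```python
import zlib

def store_folio(pages, tree, base_offset, compress=zlib.compress):
    """Userspace analog of the proposed zswap_store() flow for an mTHP:
    compress and store each page of the folio; if any page fails, delete
    every offset already stored (the zswap_delete_stored_offsets() role)
    so the whole folio can fall back to the backing swap device."""
    stored = []
    for i, page in enumerate(pages):
        try:
            entry = compress(page)        # zswap_compress() analog
        except Exception:
            entry = None
        if entry is None:
            for off in stored:            # zswap_delete_stored_offsets() analog
                del tree[off]
            return False
        tree[base_offset + i] = entry     # zswap_store_entry() analog
        stored.append(base_offset + i)
    return True
```

The point of the sketch is the all-or-nothing semantics: either every
subpage of the folio ends up in the tree, or none do.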
For accounting purposes, the patch-series adds per-order mTHP sysfs
"zswpout" counters that get incremented upon successful zswap_store of
an mTHP folio:
/sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout
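For monitoring, the per-order counters can be collected with a small
userspace helper (a sketch, assuming the sysfs layout above; the glob is
simply empty on kernels without these stats):

```python
import glob
import os

def read_zswpout_counters(root="/sys/kernel/mm/transparent_hugepage"):
    """Collect the per-order zswpout counters, keyed by hugepage size
    in kB. Returns an empty dict if the stats files are absent."""
    counts = {}
    for path in glob.glob(os.path.join(root, "hugepages-*kB/stats/zswpout")):
        size_kb = int(path.split("hugepages-")[1].split("kB")[0])
        with open(path) as f:
            counts[size_kb] = int(f.read().strip())
    return counts
```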
A new config variable, CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default),
enables/disables zswap storing of (m)THP. When disabled, zswap falls
back to rejecting the mTHP folio, which is then processed by the backing
swap device.
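The knob might be declared along these lines (a sketch only; the exact
Kconfig entry, dependencies and help text are in the mm/Kconfig hunk of
this series):

```
config ZSWAP_STORE_THP_DEFAULT_ON
	bool "Store mTHP and THP folios in zswap"
	depends on ZSWAP
	default n
```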
This patch-series is a prerequisite for ZSWAP compress batching of mTHP
swap-out and decompress batching of swap-ins based on swapin_readahead(),
using Intel IAA hardware acceleration, which we would like to submit in
subsequent patch-series, with performance improvement data.
Thanks to Ying Huang for pre-posting review feedback and suggestions!
Thanks also to Nhat, Yosry, Barry, Chengming, Usama and Ying for their
helpful feedback, data reviews and suggestions!
Co-development signoff request:
===============================
I would like to request Ryan Roberts' co-developer signoff on patches
5 and 6 in this series. Thanks Ryan!
Changes since v6:
=================
1) Rebased to mm-unstable as of 9-23-2024,
commit acfabf7e197f7a5bedf4749dac1f39551417b049.
2) Refactored into smaller commits, as suggested by Yosry and
Chengming. Thanks both!
3) Reworded the commit log for patches 5 and 6 as per Yosry's
suggestion. Thanks Yosry!
4) Gathered data on a Sapphire Rapids server that has 823GiB SSD swap disk
partition. Also, all experiments are run with usemem --sleep 10, so that
the memory allocated by the 70 processes remains in memory
longer. Posted elapsed and sys times. Thanks to Yosry, Nhat and Ying for
their help with refining the performance characterization methodology.
5) Updated Documentation/admin-guide/mm/transhuge.rst as suggested by
Nhat. Thanks Nhat!
Changes since v5:
=================
1) Rebased to mm-unstable as of 8/29/2024,
commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642.
2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to
enable/disable zswap_store() of mTHP folios. Thanks Nhat for the
suggestion to add a knob by which users can enable/disable this
change. Nhat, I hope this is along the lines of what you were
thinking.
3) Added vm-scalability usemem data with 4K folios with
CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure
there is no regression with this change.
4) Added data with usemem with 64K and 2M THP for an alternate view of
before/after, as suggested by Yosry, so we can understand the impact
of when mTHPs are split into 4K folios in shrink_folio_list()
(CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored
in zswap. Thanks Yosry for this suggestion.
Changes since v4:
=================
1) Published before/after data with zstd, as suggested by Nhat (Thanks
Nhat for the data reviews!).
2) Rebased to mm-unstable from 8/27/2024,
commit b659edec079c90012cf8d05624e312d1062b8b87.
3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if
CONFIG_MEMCG is not defined, to resolve build errors reported by kernel
robot; as per Nhat's and Michal's suggestion to not require a separate
patch to fix the build errors (thanks both!).
4) Deleted all same-filled folio processing in zswap_store() of mTHP, as
suggested by Yosry (Thanks Yosry!).
5) Squashed the commits that define new mthp zswpout stat counters, and
invoke count_mthp_stat() after successful zswap_store()s; into a single
commit. Thanks Yosry for this suggestion!
Changes since v3:
=================
1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444.
Thanks to Barry for suggesting aligning with Ryan Roberts' latest
changes to count_mthp_stat() so that it's always defined, even when THP
is disabled. Barry, I have also made one other change in page_io.c
where count_mthp_stat() is called by count_swpout_vm_event(). I would
appreciate it if you can review this. Thanks!
Hopefully this should resolve the kernel robot build errors.
Changes since v2:
=================
1) Gathered usemem data using SSD as the backing swap device for zswap,
as suggested by Ying Huang. Ying, I would appreciate it if you can
review the latest data. Thanks!
2) Generated the base commit info in the patches to attempt to address
the kernel test robot build errors.
3) No code changes to the individual patches themselves.
Changes since RFC v1:
=====================
1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion.
Thanks Barry!
2) Addressed some of the code review comments that Nhat Pham provided in
Ryan's initial RFC [1]:
- Added a comment about the cgroup zswap limit checks occurring once per
folio at the beginning of zswap_store().
Nhat, Ryan, please do let me know if the comments convey the summary
from the RFC discussion. Thanks!
- Posted data on running the cgroup suite's zswap kselftest.
3) Rebased to v6.11-rc3.
4) Gathered performance data with usemem and the rebased patch-series.
Regression Testing:
===================
I ran vm-scalability usemem with 70 processes without mTHP, i.e., only 4K
folios, with both mm-unstable and this patch-series. The main goal was
to make sure that there is no functional or performance regression
wrt the earlier zswap behavior for 4K folios:
CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set, and zswap_store() of 4K
pages goes through the newly added code path [zswap_store(),
zswap_store_page()].
The data indicates there is no regression.
------------------------------------------------------------------------------
mm-unstable 8-28-2024 zswap-mTHP v6
CONFIG_ZSWAP_STORE_THP_DEFAULT_ON
is not set
------------------------------------------------------------------------------
ZSWAP compressor zstd deflate- zstd deflate-
iaa iaa
------------------------------------------------------------------------------
Throughput (KB/s) 110,775 113,010 111,550 121,937
sys time (sec) 1,141.72 954.87 1,131.95 828.47
memcg_high 140,500 153,737 139,772 134,129
memcg_swap_high 0 0 0 0
memcg_swap_fail 0 0 0 0
pswpin 0 0 0 0
pswpout 0 0 0 0
zswpin 675 690 682 684
zswpout 9,552,298 10,603,271 9,566,392 9,267,213
thp_swpout 0 0 0 0
thp_swpout_ 0 0 0 0
fallback
pgmajfault 3,453 3,468 3,841 3,487
ZSWPOUT-64kB-mTHP n/a n/a 0 0
SWPOUT-64kB-mTHP 0 0 0 0
------------------------------------------------------------------------------
Performance Testing:
====================
Testing of this patch-series was done with mm-unstable as of 9-23-2024,
commit acfabf7e197f7a5bedf4749dac1f39551417b049. Data was gathered
without/with this patch-series, on an Intel Sapphire Rapids server,
dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB RAM and
823G SSD disk partition swap. Core frequency was fixed at 2500MHz.
The vm-scalability "usemem" test was run in a cgroup whose memory.high
was fixed at 40G. There is no swap limit set for the cgroup. Following a
similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting"
series [2], 70 usemem processes were run, each allocating and writing 1G of
memory, and sleeping for 10 sec before exiting:
usemem --init-time -w -O -s 10 -n 70 1g
The vm/sysfs mTHP stats included with the performance data provide details
on the swapout activity to ZSWAP/swap.
Other kernel configuration parameters:
ZSWAP Compressors : zstd, deflate-iaa
ZSWAP Allocator : zsmalloc
SWAP page-cluster : 2
In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
IAA "compression verification" is enabled. Hence, each IAA compression
is decompressed internally by the "iaa_crypto" driver, the CRCs
returned by the hardware are compared, and errors are reported in case
of mismatches. Thus, "deflate-iaa" helps ensure better data integrity
than the software compressors.
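The verify-after-compress idea can be mimicked in userspace (a sketch of
the concept only, using zlib; the real check runs inside the iaa_crypto
driver against hardware-computed CRCs):

```python
import zlib

def compress_with_verify(data: bytes) -> bytes:
    """Compress, then decompress the result and compare CRC32s of the
    input and the round-tripped output, mimicking the integrity check
    performed when IAA compression verification is enabled."""
    comp = zlib.compress(data)
    if zlib.crc32(zlib.decompress(comp)) != zlib.crc32(data):
        raise ValueError("compression verification failed")
    return comp
```

The extra decompress pass is the price of the check, which is why the
deflate-iaa numbers in the tables below already include this overhead.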
Throughput is derived by averaging the individual 70 processes' throughputs
reported by usemem. elapsed/sys times are measured with perf. All data
points per compressor/kernel/mTHP configuration are averaged across 3 runs.
Case 1: Comparing zswap 4K vs. zswap mTHP
=========================================
In this scenario, the "before" is CONFIG_THP_SWAP set to off, which causes
64K/2M (m)THP to be split into 4K folios that get processed by zswap.
The "after" is CONFIG_THP_SWAP set to on, plus this patch-series, which
allows 64K/2M (m)THP to be stored in zswap without splitting.
64KB mTHP (cgroup memory.high set to 40G):
==========================================
-------------------------------------------------------------------------------
mm-unstable 9-23-2024 zswap-mTHP Change wrt
CONFIG_THP_SWAP=N CONFIG_THP_SWAP=Y Baseline
Baseline
-------------------------------------------------------------------------------
ZSWAP compressor zstd deflate- zstd deflate- zstd deflate-
iaa iaa iaa
-------------------------------------------------------------------------------
Throughput (KB/s) 143,323 125,485 153,550 129,609 7% 3%
elapsed time (sec) 24.97 25.42 23.90 25.19 4% 1%
sys time (sec) 822.72 750.96 757.70 731.13 8% 3%
memcg_high 132,743 169,825 148,075 192,744
memcg_swap_fail 639,067 841,553 2,204 2,215
pswpin 0 0 0 0
pswpout 0 0 0 0
zswpin 795 873 760 902
zswpout 10,011,266 13,195,137 10,010,017 13,193,554
thp_swpout 0 0 0 0
thp_swpout_ 0 0 0 0
fallback
64kB-mthp_ 639,065 841,553 2,204 2,215
swpout_fallback
pgmajfault 2,861 2,924 3,054 3,259
ZSWPOUT-64kB n/a n/a 623,451 822,268
SWPOUT-64kB 0 0 0 0
-------------------------------------------------------------------------------
2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
=======================================================
-------------------------------------------------------------------------------
mm-unstable 9-23-2024 zswap-mTHP Change wrt
CONFIG_THP_SWAP=N CONFIG_THP_SWAP=Y Baseline
Baseline
-------------------------------------------------------------------------------
ZSWAP compressor zstd deflate- zstd deflate- zstd deflate-
iaa iaa iaa
-------------------------------------------------------------------------------
Throughput (KB/s) 145,616 139,640 169,404 141,168 16% 1%
elapsed time (sec) 25.05 23.85 23.02 23.37 8% 2%
sys time (sec) 790.53 676.34 613.26 677.83 22% -0.2%
memcg_high 16,702 25,197 17,374 23,890
memcg_swap_fail 21,485 27,814 114 144
pswpin 0 0 0 0
pswpout 0 0 0 0
zswpin 793 852 778 922
zswpout 10,011,709 13,186,882 10,010,893 13,195,600
thp_swpout 0 0 0 0
thp_swpout_ 21,485 27,814 114 144
fallback
2048kB-mthp_ n/a n/a 0 0
swpout_fallback
pgmajfault 2,701 2,822 4,151 5,066
ZSWPOUT-2048kB n/a n/a 19,442 25,615
SWPOUT-2048kB 0 0 0 0
-------------------------------------------------------------------------------
We mostly see improvements in throughput, elapsed and sys time for zstd and
deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y).
Case 2: Comparing SSD swap mTHP vs. zswap mTHP
==============================================
In this scenario, CONFIG_THP_SWAP is enabled in "before" and "after"
experiments. The "before" represents zswap rejecting mTHP, and the mTHP
being stored by the 823G SSD swap. The "after" represents data with this
patch-series, which results in 64K/2M (m)THP being processed and stored by
zswap.
64KB mTHP (cgroup memory.high set to 40G):
==========================================
-------------------------------------------------------------------------------
mm-unstable 9-23-2024 zswap-mTHP Change wrt
CONFIG_THP_SWAP=Y CONFIG_THP_SWAP=Y Baseline
Baseline
-------------------------------------------------------------------------------
ZSWAP compressor zstd deflate- zstd deflate- zstd deflate-
iaa iaa iaa
-------------------------------------------------------------------------------
Throughput (KB/s) 20,265 20,696 153,550 129,609 658% 526%
elapsed time (sec) 72.44 70.86 23.90 25.19 67% 64%
sys time (sec) 77.95 77.99 757.70 731.13 -872% -837%
memcg_high 115,811 113,277 148,075 192,744
memcg_swap_fail 2,386 2,425 2,204 2,215
pswpin 16 16 0 0
pswpout 7,774,235 7,616,069 0 0
zswpin 728 749 760 902
zswpout 38,424 39,022 10,010,017 13,193,554
thp_swpout 0 0 0 0
thp_swpout_ 0 0 0 0
fallback
64kB-mthp_ 2,386 2,425 2,204 2,215
swpout_fallback
pgmajfault 2,757 2,860 3,054 3,259
ZSWPOUT-64kB n/a n/a 623,451 822,268
SWPOUT-64kB 485,890 476,004 0 0
-------------------------------------------------------------------------------
2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
=======================================================
-------------------------------------------------------------------------------
mm-unstable 9-23-2024 zswap-mTHP Change wrt
CONFIG_THP_SWAP=Y CONFIG_THP_SWAP=Y Baseline
Baseline
-------------------------------------------------------------------------------
ZSWAP compressor zstd deflate- zstd deflate- zstd deflate-
iaa iaa iaa
-------------------------------------------------------------------------------
Throughput (KB/s) 24,347 35,971 169,404 141,168 596% 292%
elapsed time (sec) 63.52 64.59 23.02 23.37 64% 64%
sys time (sec) 27.91 27.01 613.26 677.83 -2098% -2410%
memcg_high 13,576 13,467 17,374 23,890
memcg_swap_fail 162 124 114 144
pswpin 0 0 0 0
pswpout 7,003,307 7,168,853 0 0
zswpin 741 722 778 922
zswpout 84,429 65,315 10,010,893 13,195,600
thp_swpout 13,678 14,002 0 0
thp_swpout_ 162 124 114 144
fallback
2048kB-mthp_ n/a n/a 0 0
swpout_fallback
pgmajfault 3,345 2,903 4,151 5,066
ZSWPOUT-2048kB n/a n/a 19,442 25,615
SWPOUT-2048kB 13,678 14,002 0 0
-------------------------------------------------------------------------------
We see significant improvements in throughput and elapsed time for zstd and
deflate-iaa, when comparing before (mTHP-SSD) vs. after (mTHP-ZSWAP). The
sys time increases with mTHP-ZSWAP as expected, due to the CPU compression
time vs. asynchronous disk write times, as pointed out by Ying and Yosry.
In the "Before" scenario, when zswap does not store mTHP, only allocations
count towards the cgroup memory limit. However, in the "After" scenario,
with the introduction of zswap_store() of mTHP, both the allocations and
the zswap compressed pool usage from all 70 processes are counted towards
the memory limit. As a result, we see higher swapout activity in the
"After" data. Hence, more time is spent doing reclaim as the zswap cgroup
charge leads to more frequent memory.high breaches.
Summary:
========
The v7 data presented above comparing zswap-mTHP with a conventional 823G
SSD swap demonstrates good performance improvements with zswap-mTHP. Hence,
it seems reasonable for zswap_store to support (m)THP, so that further
performance improvements can be implemented.
Some of the ideas that have shown promise in our experiments are:
1) IAA compress/decompress batching.
2) Distributing compress jobs across all IAA devices on the socket.
In the experimental setup used in this patchset, we have enabled
IAA compress verification to ensure additional hardware data integrity CRC
checks not currently done by the software compressors. The tests run for
this patchset also use only 1 IAA device per core, which makes use of the 2
compress engines on the device. In our experiments with IAA batching, we
distribute compress jobs from all cores to the 8 compress engines available
per socket. We further compress the pages in each mTHP in parallel in the
accelerator. As a result, we improve compress latency and reclaim
throughput.
The following compares the same usemem workload characteristics between:
1) zstd (v7 experiments)
2) deflate-iaa "Fixed mode" (v7 experiments)
3) deflate-iaa with batching
4) deflate-iaa-canned "Canned mode" [3] with batching
vm.page-cluster is set to "2" for all runs.
64K mTHP ZSWAP:
===============
-------------------------------------------------------------------------------
ZSWAP zstd IAA Fixed IAA Fixed IAA Canned IAA IAA IAA
compressor (v7) (v7) + Batching + Batching Batch Canned Canned
vs. vs. Batch
64K mTHP Seqtl Fixed vs.
ZSTD
-------------------------------------------------------------------------------
Throughput 153,550 129,609 156,215 166,975 21% 7% 9%
(KB/s)
elapsed time 23.90 25.19 22.46 21.38 11% 5% 11%
(sec)
sys time 757.70 731.13 715.62 648.83 2% 9% 14%
(sec)
memcg_high 148,075 192,744 197,548 181,734
memcg_swap_ 2,204 2,215 2,293 2,263
fail
pswpin 0 0 0 0
pswpout 0 0 0 0
zswpin 760 902 774 833
zswpout 10,010,017 13,193,554 13,193,176 12,125,616
thp_swpout 0 0 0 0
thp_swpout_ 0 0 0 0
fallback
64kB-mthp_ 2,204 2,215 2,293 2,263
swpout_
fallback
pgmajfault 3,054 3,259 3,545 3,516
ZSWPOUT-64kB 623,451 822,268 822,176 755,480
SWPOUT-64kB 0 0 0 0
swap_ra 146 161 152 159
swap_ra_hit 64 121 68 88
-------------------------------------------------------------------------------
2M THP ZSWAP:
=============
-------------------------------------------------------------------------------
ZSWAP zstd IAA Fixed IAA Fixed IAA Canned IAA IAA IAA
compressor (v7) (v7) + Batching + Batching Batch Canned Canned
vs. vs. Batch
2M THP Seqtl Fixed vs.
ZSTD
-------------------------------------------------------------------------------
Throughput 169,404 141,168 175,089 193,407 24% 10% 14%
(KB/s)
elapsed time 23.02 23.37 21.13 19.97 10% 5% 13%
(sec)
sys time 613.26 677.83 630.51 533.80 7% 15% 13%
(sec)
memcg_high 17,374 23,890 24,349 22,374
memcg_swap_ 114 144 102 88
fail
pswpin 0 0 0 0
pswpout 0 0 0 0
zswpin 778 922 6,492 6,642
zswpout 10,010,893 13,195,600 13,199,907 12,132,265
thp_swpout 0 0 0 0
thp_swpout_ 114 144 102 88
fallback
pgmajfault 4,151 5,066 5,032 4,999
ZSWPOUT-2MB 19,442 25,615 25,666 23,594
SWPOUT-2MB 0 0 0 0
swap_ra 3 9 4,383 4,494
swap_ra_hit 2 6 4,298 4,412
-------------------------------------------------------------------------------
With ZSWAP IAA compress/decompress batching, we are able to demonstrate
significant performance improvements and memory savings in scalability
experiments under memory pressure, as compared to software compressors. We
hope to submit this work in subsequent patch series.
Thanks,
Kanchana
[1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
[2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
[3] https://patchwork.kernel.org/project/linux-crypto/cover/cover.1710969449.git.andre.glover@linux.intel.com/
Kanchana P Sridhar (8):
mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
mm: zswap: Modify zswap_compress() to accept a page instead of a
folio.
mm: zswap: Refactor code to store an entry in zswap xarray.
mm: zswap: Refactor code to delete stored offsets in case of errors.
mm: zswap: Compress and store a specific page in a folio.
mm: zswap: Support mTHP swapout in zswap_store().
mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout
stats.
mm: Document the newly added mTHP zswpout stats, clarify swpout
semantics.
Documentation/admin-guide/mm/transhuge.rst | 8 +-
include/linux/huge_mm.h | 1 +
include/linux/memcontrol.h | 4 +
mm/Kconfig | 8 +
mm/huge_memory.c | 3 +
mm/page_io.c | 1 +
mm/zswap.c | 248 ++++++++++++++++-----
7 files changed, 210 insertions(+), 63 deletions(-)
base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
--
2.27.0
^ permalink raw reply	[flat|nested] 79+ messages in thread

* [PATCH v7 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
  2024-09-24  1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
@ 2024-09-24  1:17 ` Kanchana P Sridhar
  2024-09-24 16:45   ` Nhat Pham
  2024-09-24  1:17 ` [PATCH v7 2/8] mm: zswap: Modify zswap_compress() to accept a page instead of a folio Kanchana P Sridhar
  ` (8 subsequent siblings)
  9 siblings, 1 reply; 79+ messages in thread
From: Kanchana P Sridhar @ 2024-09-24  1:17 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

This resolves an issue with obj_cgroup_get() not being defined if
CONFIG_MEMCG is not defined.

Before this patch, we would see build errors if obj_cgroup_get() is
called from code that is agnostic of CONFIG_MEMCG.

The zswap_store() changes for mTHP in subsequent commits will require
the use of obj_cgroup_get() in zswap code that falls into this category.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/memcontrol.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 34d2da05f2f1..15c2716f9aa3 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1282,6 +1282,10 @@ struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
 	return NULL;
 }
 
+static inline void obj_cgroup_get(struct obj_cgroup *objcg)
+{
+}
+
 static inline void obj_cgroup_put(struct obj_cgroup *objcg)
 {
 }
-- 
2.27.0
* Re: [PATCH v7 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
  2024-09-24  1:17 ` [PATCH v7 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
@ 2024-09-24 16:45   ` Nhat Pham
  0 siblings, 0 replies; 79+ messages in thread
From: Nhat Pham @ 2024-09-24 16:45 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
	akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal

On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> This resolves an issue with obj_cgroup_get() not being defined if
> CONFIG_MEMCG is not defined.
>
> Before this patch, we would see build errors if obj_cgroup_get() is
> called from code that is agnostic of CONFIG_MEMCG.
>
> The zswap_store() changes for mTHP in subsequent commits will require
> the use of obj_cgroup_get() in zswap code that falls into this category.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---

LGTM.
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
* [PATCH v7 2/8] mm: zswap: Modify zswap_compress() to accept a page instead of a folio.
  2024-09-24  1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
  2024-09-24  1:17 ` [PATCH v7 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
@ 2024-09-24  1:17 ` Kanchana P Sridhar
  2024-09-24 16:50   ` Nhat Pham
  2024-09-24  1:17 ` [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in zswap xarray Kanchana P Sridhar
  ` (7 subsequent siblings)
  9 siblings, 1 reply; 79+ messages in thread
From: Kanchana P Sridhar @ 2024-09-24  1:17 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

For zswap_store() to be able to store an mTHP by compressing
it one page at a time, zswap_compress() needs to accept a page
as input. This will allow us to iterate through each page in
the mTHP in zswap_store(), compress it and store it in the zpool.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 449914ea9919..59b7733a62d3 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -876,7 +876,7 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 	return 0;
 }
 
-static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
+static bool zswap_compress(struct page *page, struct zswap_entry *entry)
 {
 	struct crypto_acomp_ctx *acomp_ctx;
 	struct scatterlist input, output;
@@ -894,7 +894,7 @@ static bool zswap_compress(struct folio *folio, struct zswap_entry *entry)
 	dst = acomp_ctx->buffer;
 
 	sg_init_table(&input, 1);
-	sg_set_folio(&input, folio, PAGE_SIZE, 0);
+	sg_set_page(&input, page, PAGE_SIZE, 0);
 
 	/*
 	 * We need PAGE_SIZE * 2 here since there maybe over-compression case,
@@ -1458,7 +1458,7 @@ bool zswap_store(struct folio *folio)
 		mem_cgroup_put(memcg);
 	}
 
-	if (!zswap_compress(folio, entry))
+	if (!zswap_compress(&folio->page, entry))
 		goto put_pool;
 
 	entry->swpentry = swp;
-- 
2.27.0
* Re: [PATCH v7 2/8] mm: zswap: Modify zswap_compress() to accept a page instead of a folio.
  2024-09-24  1:17 ` [PATCH v7 2/8] mm: zswap: Modify zswap_compress() to accept a page instead of a folio Kanchana P Sridhar
@ 2024-09-24 16:50   ` Nhat Pham
  0 siblings, 0 replies; 79+ messages in thread
From: Nhat Pham @ 2024-09-24 16:50 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
	akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal

On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> For zswap_store() to be able to store an mTHP by compressing
> it one page at a time, zswap_compress() needs to accept a page
> as input. This will allow us to iterate through each page in
> the mTHP in zswap_store(), compress it and store it in the zpool.
>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>

Reviewed-by: Nhat Pham <nphamcs@gmail.com>
* [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in zswap xarray.
  2024-09-24  1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
  2024-09-24  1:17 ` [PATCH v7 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar
  2024-09-24  1:17 ` [PATCH v7 2/8] mm: zswap: Modify zswap_compress() to accept a page instead of a folio Kanchana P Sridhar
@ 2024-09-24  1:17 ` Kanchana P Sridhar
  2024-09-24 17:16   ` Nhat Pham
  2024-09-24 19:14   ` Yosry Ahmed
  2024-09-24  1:17 ` [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors Kanchana P Sridhar
  ` (6 subsequent siblings)
  9 siblings, 2 replies; 79+ messages in thread
From: Kanchana P Sridhar @ 2024-09-24  1:17 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

Added a new procedure zswap_store_entry() that refactors the code
currently in zswap_store() to store an entry in the zswap xarray.
This will allow us to call this procedure for each storing the swap
offset of each page in an mTHP in the xarray, as part of zswap_store()
supporting mTHP.

Also, made a minor edit in the comments for 'struct zswap_entry' to delete
the description of the 'value' member that was deleted in commit
20a5532ffa53d6ecf41ded920a7b0ff9c65a7dcf ("mm: remove code to handle
same filled pages").

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/zswap.c | 51 ++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 34 insertions(+), 17 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 59b7733a62d3..fd35a81b6e36 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -190,7 +190,6 @@ static struct shrinker *zswap_shrinker;
  * section for context.
  * pool - the zswap_pool the entry's data is in
  * handle - zpool allocation handle that stores the compressed page data
- * value - value of the same-value filled pages which have same content
  * objcg - the obj_cgroup that the compressed memory is charged to
  * lru - handle to the pool's lru used to evict pages.
  */
@@ -1404,12 +1403,44 @@ static void shrink_worker(struct work_struct *w)
 /*********************************
 * main API
 **********************************/
+
+/*
+ * Returns true if the entry was successfully
+ * stored in the xarray, and false otherwise.
+ */
+static bool zswap_store_entry(struct xarray *tree,
+			      struct zswap_entry *entry)
+{
+	struct zswap_entry *old;
+	pgoff_t offset = swp_offset(entry->swpentry);
+
+	old = xa_store(tree, offset, entry, GFP_KERNEL);
+
+	if (xa_is_err(old)) {
+		int err = xa_err(old);
+
+		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
+		zswap_reject_alloc_fail++;
+		return false;
+	}
+
+	/*
+	 * We may have had an existing entry that became stale when
+	 * the folio was redirtied and now the new version is being
+	 * swapped out. Get rid of the old.
+	 */
+	if (old)
+		zswap_entry_free(old);
+
+	return true;
+}
+
 bool zswap_store(struct folio *folio)
 {
 	swp_entry_t swp = folio->swap;
 	pgoff_t offset = swp_offset(swp);
 	struct xarray *tree = swap_zswap_tree(swp);
-	struct zswap_entry *entry, *old;
+	struct zswap_entry *entry;
 	struct obj_cgroup *objcg = NULL;
 	struct mem_cgroup *memcg = NULL;
@@ -1465,22 +1496,8 @@ bool zswap_store(struct folio *folio)
 	entry->objcg = objcg;
 	entry->referenced = true;
 
-	old = xa_store(tree, offset, entry, GFP_KERNEL);
-	if (xa_is_err(old)) {
-		int err = xa_err(old);
-
-		WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err);
-		zswap_reject_alloc_fail++;
+	if (!zswap_store_entry(tree, entry))
 		goto store_failed;
-	}
-
-	/*
-	 * We may have had an existing entry that became stale when
-	 * the folio was redirtied and now the new version is being
-	 * swapped out. Get rid of the old.
-	 */
-	if (old)
-		zswap_entry_free(old);
 
 	if (objcg) {
 		obj_cgroup_charge_zswap(objcg, entry->length);
-- 
2.27.0
* Re: [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in zswap xarray.
  2024-09-24  1:17 ` [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in zswap xarray Kanchana P Sridhar
@ 2024-09-24 17:16   ` Nhat Pham
  2024-09-24 20:40     ` Sridhar, Kanchana P
  2024-09-24 19:14   ` Yosry Ahmed
  1 sibling, 1 reply; 79+ messages in thread
From: Nhat Pham @ 2024-09-24 17:16 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
	akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal

On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> Added a new procedure zswap_store_entry() that refactors the code
> currently in zswap_store() to store an entry in the zswap xarray.
> This will allow us to call this procedure for each storing the swap
> offset of each page in an mTHP in the xarray, as part of zswap_store()
> supporting mTHP.
>
> Also, made a minor edit in the comments for 'struct zswap_entry' to delete
> the description of the 'value' member that was deleted in commit
> 20a5532ffa53d6ecf41ded920a7b0ff9c65a7dcf ("mm: remove code to handle
> same filled pages").

nit: This probably should be a separate patch...

>
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>

Otherwise, LGTM :)

Reviewed-by: Nhat Pham <nphamcs@gmail.com>
* RE: [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in zswap xarray.
  2024-09-24 17:16   ` Nhat Pham
@ 2024-09-24 20:40     ` Sridhar, Kanchana P
  0 siblings, 0 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-24 20:40 UTC (permalink / raw)
  To: Nhat Pham
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, September 24, 2024 10:17 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in
> zswap xarray.
>
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > Added a new procedure zswap_store_entry() that refactors the code
> > currently in zswap_store() to store an entry in the zswap xarray.
> > This will allow us to call this procedure for each storing the swap
> > offset of each page in an mTHP in the xarray, as part of zswap_store()
> > supporting mTHP.
> >
> > Also, made a minor edit in the comments for 'struct zswap_entry' to delete
> > the description of the 'value' member that was deleted in commit
> > 20a5532ffa53d6ecf41ded920a7b0ff9c65a7dcf ("mm: remove code to
> handle
> > same filled pages").
>
> nit: This probably should be a separate patch...

Sure, will delete this change in v8.

>
> >
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
>
> Otherwise, LGTM :)
>
> Reviewed-by: Nhat Pham <nphamcs@gmail.com>

Thanks Nhat!
* Re: [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in zswap xarray. 2024-09-24 1:17 ` [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in zswap xarray Kanchana P Sridhar 2024-09-24 17:16 ` Nhat Pham @ 2024-09-24 19:14 ` Yosry Ahmed 2024-09-24 22:22 ` Sridhar, Kanchana P 1 sibling, 1 reply; 79+ messages in thread From: Yosry Ahmed @ 2024-09-24 19:14 UTC (permalink / raw) To: Kanchana P Sridhar Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar <kanchana.p.sridhar@intel.com> wrote: > > Added a new procedure zswap_store_entry() that refactors the code > currently in zswap_store() to store an entry in the zswap xarray. > This will allow us to call this procedure for each storing the swap > offset of each page in an mTHP in the xarray, as part of zswap_store() > supporting mTHP. > > Also, made a minor edit in the comments for 'struct zswap_entry' to delete > the description of the 'value' member that was deleted in commit > 20a5532ffa53d6ecf41ded920a7b0ff9c65a7dcf ("mm: remove code to handle > same filled pages"). > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > --- > mm/zswap.c | 51 ++++++++++++++++++++++++++++++++++----------------- > 1 file changed, 34 insertions(+), 17 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index 59b7733a62d3..fd35a81b6e36 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -190,7 +190,6 @@ static struct shrinker *zswap_shrinker; > * section for context. > * pool - the zswap_pool the entry's data is in > * handle - zpool allocation handle that stores the compressed page data > - * value - value of the same-value filled pages which have same content > * objcg - the obj_cgroup that the compressed memory is charged to > * lru - handle to the pool's lru used to evict pages. 
> */ > @@ -1404,12 +1403,44 @@ static void shrink_worker(struct work_struct *w) > /********************************* > * main API > **********************************/ > + > +/* > + * Returns true if the entry was successfully > + * stored in the xarray, and false otherwise. > + */ > +static bool zswap_store_entry(struct xarray *tree, > + struct zswap_entry *entry) I think zswap_tree_store() is a more descriptive name. > > +{ > + struct zswap_entry *old; > + pgoff_t offset = swp_offset(entry->swpentry); Reverse xmas tree where possible please (longest to shortest declarations). > > + > + old = xa_store(tree, offset, entry, GFP_KERNEL); > + No need for the blank line here. > + if (xa_is_err(old)) { > + int err = xa_err(old); > + > + WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err); > + zswap_reject_alloc_fail++; > + return false; > + } > + > + /* > + * We may have had an existing entry that became stale when > + * the folio was redirtied and now the new version is being > + * swapped out. Get rid of the old. > + */ > + if (old) > + zswap_entry_free(old); > + > + return true; > +} > + > bool zswap_store(struct folio *folio) > { > swp_entry_t swp = folio->swap; > pgoff_t offset = swp_offset(swp); > struct xarray *tree = swap_zswap_tree(swp); > - struct zswap_entry *entry, *old; > + struct zswap_entry *entry; > struct obj_cgroup *objcg = NULL; > struct mem_cgroup *memcg = NULL; > > @@ -1465,22 +1496,8 @@ bool zswap_store(struct folio *folio) > entry->objcg = objcg; > entry->referenced = true; > > - old = xa_store(tree, offset, entry, GFP_KERNEL); > - if (xa_is_err(old)) { > - int err = xa_err(old); > - > - WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", err); > - zswap_reject_alloc_fail++; > + if (!zswap_store_entry(tree, entry)) > goto store_failed; > - } > - > - /* > - * We may have had an existing entry that became stale when > - * the folio was redirtied and now the new version is being > - * swapped out. Get rid of the old. 
> - */ > - if (old) > - zswap_entry_free(old); > > if (objcg) { > obj_cgroup_charge_zswap(objcg, entry->length); > -- > 2.27.0 > ^ permalink raw reply [flat|nested] 79+ messages in thread
* RE: [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in zswap xarray. 2024-09-24 19:14 ` Yosry Ahmed @ 2024-09-24 22:22 ` Sridhar, Kanchana P 0 siblings, 0 replies; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-24 22:22 UTC (permalink / raw) To: Yosry Ahmed Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Yosry Ahmed <yosryahmed@google.com> > Sent: Tuesday, September 24, 2024 12:15 PM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com; > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in > zswap xarray. > > On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar > <kanchana.p.sridhar@intel.com> wrote: > > > > Added a new procedure zswap_store_entry() that refactors the code > > currently in zswap_store() to store an entry in the zswap xarray. > > This will allow us to call this procedure for each storing the swap > > offset of each page in an mTHP in the xarray, as part of zswap_store() > > supporting mTHP. > > > > Also, made a minor edit in the comments for 'struct zswap_entry' to delete > > the description of the 'value' member that was deleted in commit > > 20a5532ffa53d6ecf41ded920a7b0ff9c65a7dcf ("mm: remove code to > handle > > same filled pages"). 
> > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > --- > > mm/zswap.c | 51 ++++++++++++++++++++++++++++++++++----------------- > > 1 file changed, 34 insertions(+), 17 deletions(-) > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > index 59b7733a62d3..fd35a81b6e36 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -190,7 +190,6 @@ static struct shrinker *zswap_shrinker; > > * section for context. > > * pool - the zswap_pool the entry's data is in > > * handle - zpool allocation handle that stores the compressed page data > > - * value - value of the same-value filled pages which have same content > > * objcg - the obj_cgroup that the compressed memory is charged to > > * lru - handle to the pool's lru used to evict pages. > > */ > > @@ -1404,12 +1403,44 @@ static void shrink_worker(struct work_struct > *w) > > /********************************* > > * main API > > **********************************/ > > + > > +/* > > + * Returns true if the entry was successfully > > + * stored in the xarray, and false otherwise. > > + */ > > +static bool zswap_store_entry(struct xarray *tree, > > + struct zswap_entry *entry) > > > I think zswap_tree_store() is a more descriptive name. Thanks Yosry for the code review comments! Sure, will change this to zswap_tree_store() in v8. > > > > > +{ > > + struct zswap_entry *old; > > + pgoff_t offset = swp_offset(entry->swpentry); > > > Reverse xmas tree where possible please (longest to shortest declarations). > > > > > + > > + old = xa_store(tree, offset, entry, GFP_KERNEL); > > + > > No need for the blank line here. Ok, will fix in v8. 
Thanks, Kanchana > > > + if (xa_is_err(old)) { > > + int err = xa_err(old); > > + > > + WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", > err); > > + zswap_reject_alloc_fail++; > > + return false; > > + } > > + > > + /* > > + * We may have had an existing entry that became stale when > > + * the folio was redirtied and now the new version is being > > + * swapped out. Get rid of the old. > > + */ > > + if (old) > > + zswap_entry_free(old); > > + > > + return true; > > +} > > + > > bool zswap_store(struct folio *folio) > > { > > swp_entry_t swp = folio->swap; > > pgoff_t offset = swp_offset(swp); > > struct xarray *tree = swap_zswap_tree(swp); > > - struct zswap_entry *entry, *old; > > + struct zswap_entry *entry; > > struct obj_cgroup *objcg = NULL; > > struct mem_cgroup *memcg = NULL; > > > > @@ -1465,22 +1496,8 @@ bool zswap_store(struct folio *folio) > > entry->objcg = objcg; > > entry->referenced = true; > > > > - old = xa_store(tree, offset, entry, GFP_KERNEL); > > - if (xa_is_err(old)) { > > - int err = xa_err(old); > > - > > - WARN_ONCE(err != -ENOMEM, "unexpected xarray error: %d\n", > err); > > - zswap_reject_alloc_fail++; > > + if (!zswap_store_entry(tree, entry)) > > goto store_failed; > > - } > > - > > - /* > > - * We may have had an existing entry that became stale when > > - * the folio was redirtied and now the new version is being > > - * swapped out. Get rid of the old. > > - */ > > - if (old) > > - zswap_entry_free(old); > > > > if (objcg) { > > obj_cgroup_charge_zswap(objcg, entry->length); > > -- > > 2.27.0 > > ^ permalink raw reply [flat|nested] 79+ messages in thread
* [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors. 2024-09-24 1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar ` (2 preceding siblings ...) 2024-09-24 1:17 ` [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in zswap xarray Kanchana P Sridhar @ 2024-09-24 1:17 ` Kanchana P Sridhar 2024-09-24 17:25 ` Nhat Pham 2024-09-24 19:20 ` Yosry Ahmed 2024-09-24 1:17 ` [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio Kanchana P Sridhar ` (5 subsequent siblings) 9 siblings, 2 replies; 79+ messages in thread From: Kanchana P Sridhar @ 2024-09-24 1:17 UTC (permalink / raw) To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar Added a new procedure zswap_delete_stored_offsets() that can be called to delete stored offsets in a folio in case zswap_store() fails or zswap is disabled. Refactored the code in zswap_store() that handles these cases, to call zswap_delete_stored_offsets(). Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> --- mm/zswap.c | 33 ++++++++++++++++++++++++++++++--- 1 file changed, 30 insertions(+), 3 deletions(-) diff --git a/mm/zswap.c b/mm/zswap.c index fd35a81b6e36..9bea948d653e 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -1435,8 +1435,37 @@ static bool zswap_store_entry(struct xarray *tree, return true; } +/* + * If the zswap store fails or zswap is disabled, we must invalidate the + * possibly stale entries which were previously stored at the offsets + * corresponding to each page of the folio. Otherwise, writeback could + * overwrite the new data in the swapfile. + * + * This is called after the store of an offset in a large folio has failed. + * All zswap entries in the folio must be deleted. 
This helps make sure + * that a swapped-out mTHP is either entirely stored in zswap, or entirely + * not stored in zswap. + * + * This is also called if zswap_store() is invoked, but zswap is not enabled. + * All offsets for the folio are deleted from zswap in this case. + */ +static void zswap_delete_stored_offsets(struct xarray *tree, + pgoff_t offset, + long nr_pages) +{ + struct zswap_entry *entry; + long i; + + for (i = 0; i < nr_pages; ++i) { + entry = xa_erase(tree, offset + i); + if (entry) + zswap_entry_free(entry); + } +} + bool zswap_store(struct folio *folio) { + long nr_pages = folio_nr_pages(folio); swp_entry_t swp = folio->swap; pgoff_t offset = swp_offset(swp); struct xarray *tree = swap_zswap_tree(swp); @@ -1541,9 +1570,7 @@ bool zswap_store(struct folio *folio) * possibly stale entry which was previously stored at this offset. * Otherwise, writeback could overwrite the new data in the swapfile. */ - entry = xa_erase(tree, offset); - if (entry) - zswap_entry_free(entry); + zswap_delete_stored_offsets(tree, offset, nr_pages); return false; } -- 2.27.0 ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors. 2024-09-24 1:17 ` [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors Kanchana P Sridhar @ 2024-09-24 17:25 ` Nhat Pham 2024-09-24 20:41 ` Sridhar, Kanchana P 2024-09-24 19:20 ` Yosry Ahmed 1 sibling, 1 reply; 79+ messages in thread From: Nhat Pham @ 2024-09-24 17:25 UTC (permalink / raw) To: Kanchana P Sridhar Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar <kanchana.p.sridhar@intel.com> wrote: > > Added a new procedure zswap_delete_stored_offsets() that can be > called to delete stored offsets in a folio in case zswap_store() > fails or zswap is disabled. > > Refactored the code in zswap_store() that handles these cases, > to call zswap_delete_stored_offsets(). > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > --- > mm/zswap.c | 33 ++++++++++++++++++++++++++++++--- > 1 file changed, 30 insertions(+), 3 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index fd35a81b6e36..9bea948d653e 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -1435,8 +1435,37 @@ static bool zswap_store_entry(struct xarray *tree, > return true; > } > > +/* > + * If the zswap store fails or zswap is disabled, we must invalidate the > + * possibly stale entries which were previously stored at the offsets > + * corresponding to each page of the folio. Otherwise, writeback could > + * overwrite the new data in the swapfile. > + * > + * This is called after the store of an offset in a large folio has failed. "store of a subpage" rather than "stored of an offset"? > + * All zswap entries in the folio must be deleted. This helps make sure > + * that a swapped-out mTHP is either entirely stored in zswap, or entirely > + * not stored in zswap. 
> + * > + * This is also called if zswap_store() is invoked, but zswap is not enabled. > + * All offsets for the folio are deleted from zswap in this case. > + */ > +static void zswap_delete_stored_offsets(struct xarray *tree, > + pgoff_t offset, > + long nr_pages) > +{ > + struct zswap_entry *entry; > + long i; > + > + for (i = 0; i < nr_pages; ++i) { > + entry = xa_erase(tree, offset + i); > + if (entry) > + zswap_entry_free(entry); > + } > +} > + ^ permalink raw reply [flat|nested] 79+ messages in thread
* RE: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors. 2024-09-24 17:25 ` Nhat Pham @ 2024-09-24 20:41 ` Sridhar, Kanchana P 0 siblings, 0 replies; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-24 20:41 UTC (permalink / raw) To: Nhat Pham Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Nhat Pham <nphamcs@gmail.com> > Sent: Tuesday, September 24, 2024 10:25 AM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; yosryahmed@google.com; > chengming.zhou@linux.dev; usamaarif642@gmail.com; > shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org; > Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored > offsets in case of errors. > > On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar > <kanchana.p.sridhar@intel.com> wrote: > > > > Added a new procedure zswap_delete_stored_offsets() that can be > > called to delete stored offsets in a folio in case zswap_store() > > fails or zswap is disabled. > > > > Refactored the code in zswap_store() that handles these cases, > > to call zswap_delete_stored_offsets(). 
> > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > --- > > mm/zswap.c | 33 ++++++++++++++++++++++++++++++--- > > 1 file changed, 30 insertions(+), 3 deletions(-) > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > index fd35a81b6e36..9bea948d653e 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -1435,8 +1435,37 @@ static bool zswap_store_entry(struct xarray > *tree, > > return true; > > } > > > > +/* > > + * If the zswap store fails or zswap is disabled, we must invalidate the > > + * possibly stale entries which were previously stored at the offsets > > + * corresponding to each page of the folio. Otherwise, writeback could > > + * overwrite the new data in the swapfile. > > + * > > + * This is called after the store of an offset in a large folio has failed. > > "store of a subpage" rather than "stored of an offset"? Sure, I will make this change in v8. > > > > + * All zswap entries in the folio must be deleted. This helps make sure > > + * that a swapped-out mTHP is either entirely stored in zswap, or entirely > > + * not stored in zswap. > > + * > > + * This is also called if zswap_store() is invoked, but zswap is not enabled. > > + * All offsets for the folio are deleted from zswap in this case. > > + */ > > +static void zswap_delete_stored_offsets(struct xarray *tree, > > + pgoff_t offset, > > + long nr_pages) > > +{ > > + struct zswap_entry *entry; > > + long i; > > + > > + for (i = 0; i < nr_pages; ++i) { > > + entry = xa_erase(tree, offset + i); > > + if (entry) > > + zswap_entry_free(entry); > > + } > > +} > > + ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors. 2024-09-24 1:17 ` [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors Kanchana P Sridhar 2024-09-24 17:25 ` Nhat Pham @ 2024-09-24 19:20 ` Yosry Ahmed 2024-09-24 22:32 ` Sridhar, Kanchana P 1 sibling, 1 reply; 79+ messages in thread From: Yosry Ahmed @ 2024-09-24 19:20 UTC (permalink / raw) To: Kanchana P Sridhar Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar <kanchana.p.sridhar@intel.com> wrote: > > Added a new procedure zswap_delete_stored_offsets() that can be > called to delete stored offsets in a folio in case zswap_store() > fails or zswap is disabled. I don't see the value in this helper. It will get called in one place AFAICT, and it is a bit inconsistent that we have to explicitly loop in zswap_store() to store pages, but the loop to delete pages upon failure is hidden in the helper. I am not against adding a trivial zswap_tree_delete() helper (or similar) that calls xa_erase() and zswap_entry_free() to match zswap_tree_store() if you prefer that. > > Refactored the code in zswap_store() that handles these cases, > to call zswap_delete_stored_offsets(). > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > --- > mm/zswap.c | 33 ++++++++++++++++++++++++++++++--- > 1 file changed, 30 insertions(+), 3 deletions(-) > > diff --git a/mm/zswap.c b/mm/zswap.c > index fd35a81b6e36..9bea948d653e 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -1435,8 +1435,37 @@ static bool zswap_store_entry(struct xarray *tree, > return true; > } > > +/* > + * If the zswap store fails or zswap is disabled, we must invalidate the > + * possibly stale entries which were previously stored at the offsets > + * corresponding to each page of the folio. 
Otherwise, writeback could > + * overwrite the new data in the swapfile. > + * > + * This is called after the store of an offset in a large folio has failed. > + * All zswap entries in the folio must be deleted. This helps make sure > + * that a swapped-out mTHP is either entirely stored in zswap, or entirely > + * not stored in zswap. > + * > + * This is also called if zswap_store() is invoked, but zswap is not enabled. > + * All offsets for the folio are deleted from zswap in this case. > + */ > +static void zswap_delete_stored_offsets(struct xarray *tree, > + pgoff_t offset, > + long nr_pages) > +{ > + struct zswap_entry *entry; > + long i; > + > + for (i = 0; i < nr_pages; ++i) { > + entry = xa_erase(tree, offset + i); > + if (entry) > + zswap_entry_free(entry); > + } > +} > + > bool zswap_store(struct folio *folio) > { > + long nr_pages = folio_nr_pages(folio); > swp_entry_t swp = folio->swap; > pgoff_t offset = swp_offset(swp); > struct xarray *tree = swap_zswap_tree(swp); > @@ -1541,9 +1570,7 @@ bool zswap_store(struct folio *folio) > * possibly stale entry which was previously stored at this offset. > * Otherwise, writeback could overwrite the new data in the swapfile. > */ > - entry = xa_erase(tree, offset); > - if (entry) > - zswap_entry_free(entry); > + zswap_delete_stored_offsets(tree, offset, nr_pages); > return false; > } > > -- > 2.27.0 > ^ permalink raw reply [flat|nested] 79+ messages in thread
* RE: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors. 2024-09-24 19:20 ` Yosry Ahmed @ 2024-09-24 22:32 ` Sridhar, Kanchana P 2024-09-25 0:43 ` Yosry Ahmed 0 siblings, 1 reply; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-24 22:32 UTC (permalink / raw) To: Yosry Ahmed Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Yosry Ahmed <yosryahmed@google.com> > Sent: Tuesday, September 24, 2024 12:20 PM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com; > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored > offsets in case of errors. > > On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar > <kanchana.p.sridhar@intel.com> wrote: > > > > Added a new procedure zswap_delete_stored_offsets() that can be > > called to delete stored offsets in a folio in case zswap_store() > > fails or zswap is disabled. > > I don't see the value in this helper. It will get called in one place > AFAICT, and it is a bit inconsistent that we have to explicitly loop > in zswap_store() to store pages, but the loop to delete pages upon > failure is hidden in the helper. > > I am not against adding a trivial zswap_tree_delete() helper (or > similar) that calls xa_erase() and zswap_entry_free() to match > zswap_tree_store() if you prefer that. This is a good point. 
I had refactored this routine in the context of my code that does batching and the same loop over the mTHP's subpages would get called in multiple error condition cases. I am thinking it might probably make sense for say zswap_tree_delete() to take a "folio" and "tree" and encapsulate deleting all stored offsets for that folio. Since we have already done the computes for finding the "tree", having that as an input parameter is mainly for latency, but if it is cleaner to have "zswap_tree_delete(struct folio *folio)", that should be Ok too. Please let me know your suggestion on this. Thanks, Kanchana > > > > > Refactored the code in zswap_store() that handles these cases, > > to call zswap_delete_stored_offsets(). > > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > --- > > mm/zswap.c | 33 ++++++++++++++++++++++++++++++--- > > 1 file changed, 30 insertions(+), 3 deletions(-) > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > index fd35a81b6e36..9bea948d653e 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -1435,8 +1435,37 @@ static bool zswap_store_entry(struct xarray > *tree, > > return true; > > } > > > > +/* > > + * If the zswap store fails or zswap is disabled, we must invalidate the > > + * possibly stale entries which were previously stored at the offsets > > + * corresponding to each page of the folio. Otherwise, writeback could > > + * overwrite the new data in the swapfile. > > + * > > + * This is called after the store of an offset in a large folio has failed. > > + * All zswap entries in the folio must be deleted. This helps make sure > > + * that a swapped-out mTHP is either entirely stored in zswap, or entirely > > + * not stored in zswap. > > + * > > + * This is also called if zswap_store() is invoked, but zswap is not enabled. > > + * All offsets for the folio are deleted from zswap in this case. 
> > + */ > > +static void zswap_delete_stored_offsets(struct xarray *tree, > > + pgoff_t offset, > > + long nr_pages) > > +{ > > + struct zswap_entry *entry; > > + long i; > > + > > + for (i = 0; i < nr_pages; ++i) { > > + entry = xa_erase(tree, offset + i); > > + if (entry) > > + zswap_entry_free(entry); > > + } > > +} > > + > > bool zswap_store(struct folio *folio) > > { > > + long nr_pages = folio_nr_pages(folio); > > swp_entry_t swp = folio->swap; > > pgoff_t offset = swp_offset(swp); > > struct xarray *tree = swap_zswap_tree(swp); > > @@ -1541,9 +1570,7 @@ bool zswap_store(struct folio *folio) > > * possibly stale entry which was previously stored at this offset. > > * Otherwise, writeback could overwrite the new data in the swapfile. > > */ > > - entry = xa_erase(tree, offset); > > - if (entry) > > - zswap_entry_free(entry); > > + zswap_delete_stored_offsets(tree, offset, nr_pages); > > return false; > > } > > > > -- > > 2.27.0 > > ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors. 2024-09-24 22:32 ` Sridhar, Kanchana P @ 2024-09-25 0:43 ` Yosry Ahmed 2024-09-25 1:18 ` Sridhar, Kanchana P 2024-09-25 14:11 ` Johannes Weiner 0 siblings, 2 replies; 79+ messages in thread From: Yosry Ahmed @ 2024-09-25 0:43 UTC (permalink / raw) To: Sridhar, Kanchana P Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh On Tue, Sep 24, 2024 at 3:33 PM Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> wrote: > > > -----Original Message----- > > From: Yosry Ahmed <yosryahmed@google.com> > > Sent: Tuesday, September 24, 2024 12:20 PM > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > > usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com; > > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > > Subject: Re: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored > > offsets in case of errors. > > > > On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar > > <kanchana.p.sridhar@intel.com> wrote: > > > > > > Added a new procedure zswap_delete_stored_offsets() that can be > > > called to delete stored offsets in a folio in case zswap_store() > > > fails or zswap is disabled. > > > > I don't see the value in this helper. It will get called in one place > > AFAICT, and it is a bit inconsistent that we have to explicitly loop > > in zswap_store() to store pages, but the loop to delete pages upon > > failure is hidden in the helper. 
> > > > I am not against adding a trivial zswap_tree_delete() helper (or > > similar) that calls xa_erase() and zswap_entry_free() to match > > zswap_tree_store() if you prefer that. > > This is a good point. I had refactored this routine in the context > of my code that does batching and the same loop over the mTHP's > subpages would get called in multiple error condition cases. > > I am thinking it might probably make sense for say zswap_tree_delete() > to take a "folio" and "tree" and encapsulate deleting all stored offsets > for that folio. Since we have already done the computes for finding the > "tree", having that as an input parameter is mainly for latency, but if > it is cleaner to have "zswap_tree_delete(struct folio *folio)", that should > be Ok too. Please let me know your suggestion on this. > What I meant is "zswap_tree_delete(struct xarray *tree, pgoff_t offset)", and loop and call this in zswap_store(). This would be consistent on looping and calling zswap_store_page(). But we can keep the helper as-is actually and just rename it to zswap_tree_delete() and move the loop inside. No strong preference. ^ permalink raw reply [flat|nested] 79+ messages in thread
* RE: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors. 2024-09-25 0:43 ` Yosry Ahmed @ 2024-09-25 1:18 ` Sridhar, Kanchana P 2024-09-25 14:11 ` Johannes Weiner 1 sibling, 0 replies; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-25 1:18 UTC (permalink / raw) To: Yosry Ahmed Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Yosry Ahmed <yosryahmed@google.com> > Sent: Tuesday, September 24, 2024 5:43 PM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com; > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored > offsets in case of errors. 
> > On Tue, Sep 24, 2024 at 3:33 PM Sridhar, Kanchana P > <kanchana.p.sridhar@intel.com> wrote: > > > > > -----Original Message----- > > > From: Yosry Ahmed <yosryahmed@google.com> > > > Sent: Tuesday, September 24, 2024 12:20 PM > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > > > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > > > hannes@cmpxchg.org; nphamcs@gmail.com; > chengming.zhou@linux.dev; > > > usamaarif642@gmail.com; shakeel.butt@linux.dev; > ryan.roberts@arm.com; > > > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > > > Subject: Re: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored > > > offsets in case of errors. > > > > > > On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar > > > <kanchana.p.sridhar@intel.com> wrote: > > > > > > > > Added a new procedure zswap_delete_stored_offsets() that can be > > > > called to delete stored offsets in a folio in case zswap_store() > > > > fails or zswap is disabled. > > > > > > I don't see the value in this helper. It will get called in one place > > > AFAICT, and it is a bit inconsistent that we have to explicitly loop > > > in zswap_store() to store pages, but the loop to delete pages upon > > > failure is hidden in the helper. > > > > > > I am not against adding a trivial zswap_tree_delete() helper (or > > > similar) that calls xa_erase() and zswap_entry_free() to match > > > zswap_tree_store() if you prefer that. > > > > This is a good point. I had refactored this routine in the context > > of my code that does batching and the same loop over the mTHP's > > subpages would get called in multiple error condition cases. > > > > I am thinking it might probably make sense for say zswap_tree_delete() > > to take a "folio" and "tree" and encapsulate deleting all stored offsets > > for that folio. 
Since we have already done the computes for finding the > > "tree", having that as an input parameter is mainly for latency, but if > > it is cleaner to have "zswap_tree_delete(struct folio *folio)", that should > > be Ok too. Please let me know your suggestion on this. > > > > What I meant is "zswap_tree_delete(struct xarray *tree, pgoff_t > offset)", and loop and call this in zswap_store(). This would be > consistent on looping and calling zswap_store_page(). > > But we can keep the helper as-is actually and just rename it to > zswap_tree_delete() and move the loop inside. No strong preference. Ok, sounds good. Thanks, Kanchana ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors. 2024-09-25 0:43 ` Yosry Ahmed 2024-09-25 1:18 ` Sridhar, Kanchana P @ 2024-09-25 14:11 ` Johannes Weiner 2024-09-25 18:45 ` Sridhar, Kanchana P 1 sibling, 1 reply; 79+ messages in thread From: Johannes Weiner @ 2024-09-25 14:11 UTC (permalink / raw) To: Yosry Ahmed Cc: Sridhar, Kanchana P, linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh On Tue, Sep 24, 2024 at 05:43:22PM -0700, Yosry Ahmed wrote: > What I meant is "zswap_tree_delete(struct xarray *tree, pgoff_t > offset)", and loop and call this in zswap_store(). This would be > consistent on looping and calling zswap_store_page(). > > But we can keep the helper as-is actually and just rename it to > zswap_tree_delete() and move the loop inside. No strong preference. Both helpers seem unnecessary. zswap_tree_store() is not called in a loop directly. It's called from zswap_store_page(), which is essentially what zswap_store() is now, and that was fine with the open-coded insert. zswap_tree_delete() just hides what's going on. zswap_store() has the for-loop to store the subpages, so it makes sense it has the for loop for unwinding on rejection as well. This makes it easier on the reader to match up attempt and unwind. Please just drop both. ^ permalink raw reply [flat|nested] 79+ messages in thread
* RE: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors. 2024-09-25 14:11 ` Johannes Weiner @ 2024-09-25 18:45 ` Sridhar, Kanchana P 0 siblings, 0 replies; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-25 18:45 UTC (permalink / raw) To: Johannes Weiner, Yosry Ahmed Cc: linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Johannes Weiner <hannes@cmpxchg.org> > Sent: Wednesday, September 25, 2024 7:11 AM > To: Yosry Ahmed <yosryahmed@google.com> > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux- > kernel@vger.kernel.org; linux-mm@kvack.org; nphamcs@gmail.com; > chengming.zhou@linux.dev; usamaarif642@gmail.com; > shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org; > Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 4/8] mm: zswap: Refactor code to delete stored > offsets in case of errors. > > On Tue, Sep 24, 2024 at 05:43:22PM -0700, Yosry Ahmed wrote: > > What I meant is "zswap_tree_delete(struct xarray *tree, pgoff_t > > offset)", and loop and call this in zswap_store(). This would be > > consistent on looping and calling zswap_store_page(). > > > > But we can keep the helper as-is actually and just rename it to > > zswap_tree_delete() and move the loop inside. No strong preference. > > Both helpers seem unnecesary. > > zswap_tree_store() is not called in a loop directly. It's called from > zswap_store_page(), which is essentially what zswap_store() is now, > and that was fine with the open-coded insert. > > zswap_tree_delete() just hides what's going on. 
zswap_store() has the > for-loop to store the subpages, so it makes sense it has the for loop > for unwinding on rejection as well. This makes it easier on the reader > to match up attempt and unwind. > > Please just drop both. Ok, sounds good. ^ permalink raw reply [flat|nested] 79+ messages in thread
* [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio. 2024-09-24 1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar ` (3 preceding siblings ...) 2024-09-24 1:17 ` [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors Kanchana P Sridhar @ 2024-09-24 1:17 ` Kanchana P Sridhar 2024-09-24 19:28 ` Yosry Ahmed 2024-09-24 1:17 ` [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store() Kanchana P Sridhar ` (4 subsequent siblings) 9 siblings, 1 reply; 79+ messages in thread From: Kanchana P Sridhar @ 2024-09-24 1:17 UTC (permalink / raw) To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar For zswap_store() to handle mTHP folios, we need to iterate through each page in the mTHP, compress it and store it in the zswap pool. This patch introduces an auxiliary function zswap_store_page() that provides this functionality. The function signature reflects the design intent, namely, for it to be invoked by zswap_store() per-page in an mTHP. Hence, the folio's objcg and the zswap_pool to use are input parameters for sake of efficiency and consistency. The functionality in zswap_store_page() is reused and adapted from Ryan Roberts' RFC patch [1]: "[RFC,v1] mm: zswap: Store large folios without splitting" [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u Co-developed-by: Ryan Roberts Signed-off-by: Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> --- mm/zswap.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 88 insertions(+) diff --git a/mm/zswap.c b/mm/zswap.c index 9bea948d653e..8f2e0ab34c84 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -1463,6 +1463,94 @@ static void zswap_delete_stored_offsets(struct xarray *tree, } } +/* + * Stores the page at specified "index" in a folio. 
+ * + * @folio: The folio to store in zswap. + * @index: Index into the page in the folio that this function will store. + * @objcg: The folio's objcg. + * @pool: The zswap_pool to store the compressed data for the page. + */ +static bool __maybe_unused zswap_store_page(struct folio *folio, long index, + struct obj_cgroup *objcg, + struct zswap_pool *pool) +{ + swp_entry_t swp = folio->swap; + int type = swp_type(swp); + pgoff_t offset = swp_offset(swp) + index; + struct page *page = folio_page(folio, index); + struct xarray *tree = swap_zswap_tree(swp); + struct zswap_entry *entry; + + if (objcg) + obj_cgroup_get(objcg); + + if (zswap_check_limits()) + goto reject; + + /* allocate entry */ + entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio)); + if (!entry) { + zswap_reject_kmemcache_fail++; + goto reject; + } + + /* if entry is successfully added, it keeps the reference */ + if (!zswap_pool_get(pool)) + goto freepage; + + entry->pool = pool; + + if (!zswap_compress(page, entry)) + goto put_pool; + + entry->swpentry = swp_entry(type, offset); + entry->objcg = objcg; + entry->referenced = true; + + if (!zswap_store_entry(tree, entry)) + goto store_failed; + + if (objcg) { + obj_cgroup_charge_zswap(objcg, entry->length); + count_objcg_event(objcg, ZSWPOUT); + } + + /* + * We finish initializing the entry while it's already in xarray. + * This is safe because: + * + * 1. Concurrent stores and invalidations are excluded by folio lock. + * + * 2. Writeback is excluded by the entry not being on the LRU yet. + * The publishing order matters to prevent writeback from seeing + * an incoherent entry. 
+ */ + if (entry->length) { + INIT_LIST_HEAD(&entry->lru); + zswap_lru_add(&zswap_list_lru, entry); + } + + /* update stats */ + atomic_inc(&zswap_stored_pages); + count_vm_event(ZSWPOUT); + + return true; + +store_failed: + zpool_free(entry->pool->zpool, entry->handle); +put_pool: + zswap_pool_put(pool); +freepage: + zswap_entry_cache_free(entry); +reject: + obj_cgroup_put(objcg); + if (zswap_pool_reached_full) + queue_work(shrink_wq, &zswap_shrink_work); + + return false; +} + bool zswap_store(struct folio *folio) { long nr_pages = folio_nr_pages(folio); -- 2.27.0 ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio. 2024-09-24 1:17 ` [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio Kanchana P Sridhar @ 2024-09-24 19:28 ` Yosry Ahmed 2024-09-24 22:45 ` Sridhar, Kanchana P 0 siblings, 1 reply; 79+ messages in thread From: Yosry Ahmed @ 2024-09-24 19:28 UTC (permalink / raw) To: Kanchana P Sridhar Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar <kanchana.p.sridhar@intel.com> wrote: > > For zswap_store() to handle mTHP folios, we need to iterate through each > page in the mTHP, compress it and store it in the zswap pool. This patch > introduces an auxiliary function zswap_store_page() that provides this > functionality. > > The function signature reflects the design intent, namely, for it > to be invoked by zswap_store() per-page in an mTHP. Hence, the folio's > objcg and the zswap_pool to use are input parameters for sake of > efficiency and consistency. > > The functionality in zswap_store_page() is reused and adapted from > Ryan Roberts' RFC patch [1]: > > "[RFC,v1] mm: zswap: Store large folios without splitting" > > [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u > > Co-developed-by: Ryan Roberts > Signed-off-by: > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > --- > mm/zswap.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 88 insertions(+) > > diff --git a/mm/zswap.c b/mm/zswap.c > index 9bea948d653e..8f2e0ab34c84 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -1463,6 +1463,94 @@ static void zswap_delete_stored_offsets(struct xarray *tree, > } > } > > +/* > + * Stores the page at specified "index" in a folio. > + * > + * @folio: The folio to store in zswap. 
> + * @index: Index into the page in the folio that this function will store. > + * @objcg: The folio's objcg. > + * @pool: The zswap_pool to store the compressed data for the page. > + */ > +static bool __maybe_unused zswap_store_page(struct folio *folio, long index, > + struct obj_cgroup *objcg, > + struct zswap_pool *pool) Why are we adding an unused function that duplicates code in zswap_store(), then using it in the following patch? This makes it difficult to see that the function does the same thing. This patch should be refactoring the per-page code out of zswap_store() into zswap_store_page(), and directly calling zswap_store_page() from zswap_store(). > +{ > + swp_entry_t swp = folio->swap; > + int type = swp_type(swp); > + pgoff_t offset = swp_offset(swp) + index; > + struct page *page = folio_page(folio, index); > + struct xarray *tree = swap_zswap_tree(swp); > + struct zswap_entry *entry; > + > + if (objcg) > + obj_cgroup_get(objcg); > + > + if (zswap_check_limits()) > + goto reject; > + > + /* allocate entry */ > + entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio)); > + if (!entry) { > + zswap_reject_kmemcache_fail++; > + goto reject; > + } > + > + /* if entry is successfully added, it keeps the reference */ > + if (!zswap_pool_get(pool)) > + goto freepage; I think we can batch this for all pages in zswap_store(), maybe first add zswap_pool_get_many(). I am also wondering if it would be better to batch the limit checking and allocating the entries, to front load any failures before we start compression. Not sure if that's overall better though. To batch allocate entries we will have to also allocate an array to hold them. To batch the limit checking we will have to either allow going further over limit for mTHPs, or check if there is enough clearance to allow for compressing all the pages. Using the uncompressed size will lead to false negatives though, so maybe we can start tracking the average compression ratio for better limit checking. 
Nhat, Johannes, any thoughts here? I need someone to tell me if I am overthinking this :) > + > + entry->pool = pool; > + > + if (!zswap_compress(page, entry)) > + goto put_pool; > + > + entry->swpentry = swp_entry(type, offset); > + entry->objcg = objcg; > + entry->referenced = true; > + > + if (!zswap_store_entry(tree, entry)) > + goto store_failed; > + > + if (objcg) { > + obj_cgroup_charge_zswap(objcg, entry->length); > + count_objcg_event(objcg, ZSWPOUT); > + } > + > + /* > + * We finish initializing the entry while it's already in xarray. > + * This is safe because: > + * > + * 1. Concurrent stores and invalidations are excluded by folio lock. > + * > + * 2. Writeback is excluded by the entry not being on the LRU yet. > + * The publishing order matters to prevent writeback from seeing > + * an incoherent entry. > + */ > + if (entry->length) { > + INIT_LIST_HEAD(&entry->lru); > + zswap_lru_add(&zswap_list_lru, entry); > + } > + > + /* update stats */ > + atomic_inc(&zswap_stored_pages); > + count_vm_event(ZSWPOUT); We should probably also batch updating the stats. It actually seems like now we don't handle rolling them back upon failure. > + > + return true; > + > +store_failed: > + zpool_free(entry->pool->zpool, entry->handle); > +put_pool: > + zswap_pool_put(pool); > +freepage: > + zswap_entry_cache_free(entry); > +reject: > + obj_cgroup_put(objcg); > + if (zswap_pool_reached_full) > + queue_work(shrink_wq, &zswap_shrink_work); > + > + return false; > +} > + > bool zswap_store(struct folio *folio) > { > long nr_pages = folio_nr_pages(folio); > -- > 2.27.0 > ^ permalink raw reply [flat|nested] 79+ messages in thread
* RE: [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio. 2024-09-24 19:28 ` Yosry Ahmed @ 2024-09-24 22:45 ` Sridhar, Kanchana P 2024-09-25 0:47 ` Yosry Ahmed 0 siblings, 1 reply; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-24 22:45 UTC (permalink / raw) To: Yosry Ahmed Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Yosry Ahmed <yosryahmed@google.com> > Sent: Tuesday, September 24, 2024 12:29 PM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com; > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 5/8] mm: zswap: Compress and store a specific page > in a folio. > > On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar > <kanchana.p.sridhar@intel.com> wrote: > > > > For zswap_store() to handle mTHP folios, we need to iterate through each > > page in the mTHP, compress it and store it in the zswap pool. This patch > > introduces an auxiliary function zswap_store_page() that provides this > > functionality. > > > > The function signature reflects the design intent, namely, for it > > to be invoked by zswap_store() per-page in an mTHP. Hence, the folio's > > objcg and the zswap_pool to use are input parameters for sake of > > efficiency and consistency. 
> > > > The functionality in zswap_store_page() is reused and adapted from > > Ryan Roberts' RFC patch [1]: > > > > "[RFC,v1] mm: zswap: Store large folios without splitting" > > > > [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1- > ryan.roberts@arm.com/T/#u > > > > Co-developed-by: Ryan Roberts > > Signed-off-by: > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > --- > > mm/zswap.c | 88 > ++++++++++++++++++++++++++++++++++++++++++++++++++++++ > > 1 file changed, 88 insertions(+) > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > index 9bea948d653e..8f2e0ab34c84 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -1463,6 +1463,94 @@ static void zswap_delete_stored_offsets(struct > xarray *tree, > > } > > } > > > > +/* > > + * Stores the page at specified "index" in a folio. > > + * > > + * @folio: The folio to store in zswap. > > + * @index: Index into the page in the folio that this function will store. > > + * @objcg: The folio's objcg. > > + * @pool: The zswap_pool to store the compressed data for the page. > > + */ > > +static bool __maybe_unused zswap_store_page(struct folio *folio, long > index, > > + struct obj_cgroup *objcg, > > + struct zswap_pool *pool) > > Why are we adding an unused function that duplicates code in > zswap_store(), then using it in the following patch? This makes it > difficult to see that the function does the same thing. This patch > should be refactoring the per-page code out of zswap_store() into > zswap_store_page(), and directly calling zswap_store_page() from > zswap_store(). Sure, thanks Yosry for this suggestion. Will fix in v8. 
> > > +{ > > + swp_entry_t swp = folio->swap; > > + int type = swp_type(swp); > > + pgoff_t offset = swp_offset(swp) + index; > > + struct page *page = folio_page(folio, index); > > + struct xarray *tree = swap_zswap_tree(swp); > > + struct zswap_entry *entry; > > + > > + if (objcg) > > + obj_cgroup_get(objcg); > > + > > + if (zswap_check_limits()) > > + goto reject; > > + > > + /* allocate entry */ > > + entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio)); > > + if (!entry) { > > + zswap_reject_kmemcache_fail++; > > + goto reject; > > + } > > + > > + /* if entry is successfully added, it keeps the reference */ > > + if (!zswap_pool_get(pool)) > > + goto freepage; > > I think we can batch this for all pages in zswap_store(), maybe first > add zswap_pool_get_many(). > > I am also wondering if it would be better to batch the limit checking > and allocating the entries, to front load any failures before we start > compression. Not sure if that's overall better though. > > To batch allocate entries we will have to also allocate an array to > hold them. To batch the limit checking we will have to either allow > going further over limit for mTHPs, or check if there is enough > clearance to allow for compressing all the pages. Using the > uncompressed size will lead to false negatives though, so maybe we can > start tracking the average compression ratio for better limit > checking. > > Nhat, Johannes, any thoughts here? I need someone to tell me if I am > overthinking this :) These are all good points. I suppose I was thinking along the same lines of what Nhat mentioned in an earlier comment. I was trying the incremental zswap_pool_get() and limit checks and shrinker invocations in case of (all) error conditions to allow different concurrent stores to make progress, without favoring only one process's mTHP store. 
I was thinking this would have minimal impact on the process(es) that see the zswap limit being exceeded, and that this would be better than preemptively checking for the entire mTHP and failing (this could also complicate things where no one makes progress because multiple processes run the batch checks and fail, when realistically one/many could have triggered the shrinker before erroring out, and at least one could have made progress). Would appreciate your perspectives on how this should be handled, and will implement a solution in v8 accordingly. Thanks, Kanchana > > > + > > + entry->pool = pool; > > + > > + if (!zswap_compress(page, entry)) > > + goto put_pool; > > + > > + entry->swpentry = swp_entry(type, offset); > > + entry->objcg = objcg; > > + entry->referenced = true; > > + > > + if (!zswap_store_entry(tree, entry)) > > + goto store_failed; > > + > > + if (objcg) { > > + obj_cgroup_charge_zswap(objcg, entry->length); > > + count_objcg_event(objcg, ZSWPOUT); > > + } > > + > > + /* > > + * We finish initializing the entry while it's already in xarray. > > + * This is safe because: > > + * > > + * 1. Concurrent stores and invalidations are excluded by folio lock. > > + * > > + * 2. Writeback is excluded by the entry not being on the LRU yet. > > + * The publishing order matters to prevent writeback from seeing > > + * an incoherent entry. > > + */ > > + if (entry->length) { > > + INIT_LIST_HEAD(&entry->lru); > > + zswap_lru_add(&zswap_list_lru, entry); > > + } > > + > > + /* update stats */ > > + atomic_inc(&zswap_stored_pages); > > + count_vm_event(ZSWPOUT); > > We should probably also batch updating the stats. It actually seems > like now we don't handle rolling them back upon failure. Good point! I assume you are referring only to the "ZSWPOUT" vm event stats updates and not the "zswap_stored_pages" (since latter is used in limit checking)? I will fix this in v8. 
Thanks, Kanchana > > > > + > > + return true; > > + > > +store_failed: > > + zpool_free(entry->pool->zpool, entry->handle); > > +put_pool: > > + zswap_pool_put(pool); > > +freepage: > > + zswap_entry_cache_free(entry); > > +reject: > > + obj_cgroup_put(objcg); > > + if (zswap_pool_reached_full) > > + queue_work(shrink_wq, &zswap_shrink_work); > > + > > + return false; > > +} > > + > > bool zswap_store(struct folio *folio) > > { > > long nr_pages = folio_nr_pages(folio); > > -- > > 2.27.0 > > ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio. 2024-09-24 22:45 ` Sridhar, Kanchana P @ 2024-09-25 0:47 ` Yosry Ahmed 2024-09-25 1:49 ` Sridhar, Kanchana P 0 siblings, 1 reply; 79+ messages in thread From: Yosry Ahmed @ 2024-09-25 0:47 UTC (permalink / raw) To: Sridhar, Kanchana P Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh [..] > > > > > +{ > > > + swp_entry_t swp = folio->swap; > > > + int type = swp_type(swp); > > > + pgoff_t offset = swp_offset(swp) + index; > > > + struct page *page = folio_page(folio, index); > > > + struct xarray *tree = swap_zswap_tree(swp); > > > + struct zswap_entry *entry; > > > + > > > + if (objcg) > > > + obj_cgroup_get(objcg); > > > + > > > + if (zswap_check_limits()) > > > + goto reject; > > > + > > > + /* allocate entry */ > > > + entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio)); > > > + if (!entry) { > > > + zswap_reject_kmemcache_fail++; > > > + goto reject; > > > + } > > > + > > > + /* if entry is successfully added, it keeps the reference */ > > > + if (!zswap_pool_get(pool)) > > > + goto freepage; > > > > I think we can batch this for all pages in zswap_store(), maybe first > > add zswap_pool_get_many(). > > > > I am also wondering if it would be better to batch the limit checking > > and allocating the entries, to front load any failures before we start > > compression. Not sure if that's overall better though. > > > > To batch allocate entries we will have to also allocate an array to > > hold them. To batch the limit checking we will have to either allow > > going further over limit for mTHPs, or check if there is enough > > clearance to allow for compressing all the pages. Using the > > uncompressed size will lead to false negatives though, so maybe we can > > start tracking the average compression ratio for better limit > > checking. 
> > > > Nhat, Johannes, any thoughts here? I need someone to tell me if I am > > overthinking this :) > > These are all good points. I suppose I was thinking along the same lines > of what Nhat mentioned in an earlier comment. I was trying the > incremental zswap_pool_get() and limit checks and shrinker invocations > in case of (all) error conditions to allow different concurrent stores to make > progress, without favoring only one process's mTHP store. I was thinking > this would have minimal impact on the process(es) that see the zswap > limit being exceeded, and that this would be better than preemptively > checking for the entire mTHP and failing (this could also complicate things > where no one makes progress because multiple processes run the batch > checks and fail, when realistically one/many could have triggered > the shrinker before erroring out, and at least one could have made > progress). On the other hand, if we allow concurrent mTHP swapouts to do limit checks incrementally, they may all fail at the last page. While if they all do limit checks beforehand, one of them can proceed. I think we need to agree on a higher-level strategy for limit checking, both global and per-memcg. The per-memcg limit should be stricter though, so we may end up having different policies. > > Would appreciate your perspectives on how this should be handled, > and will implement a solution in v8 accordingly. > > Thanks, > Kanchana > > > > > > + > > > + entry->pool = pool; > > > + > > > + if (!zswap_compress(page, entry)) > > > + goto put_pool; > > > + > > > + entry->swpentry = swp_entry(type, offset); > > > + entry->objcg = objcg; > > > + entry->referenced = true; > > > + > > > + if (!zswap_store_entry(tree, entry)) > > > + goto store_failed; > > > + > > > + if (objcg) { > > > + obj_cgroup_charge_zswap(objcg, entry->length); > > > + count_objcg_event(objcg, ZSWPOUT); > > > + } > > > + > > > + /* > > > + * We finish initializing the entry while it's already in xarray. 
> > > + * This is safe because: > > > + * > > > + * 1. Concurrent stores and invalidations are excluded by folio lock. > > > + * > > > + * 2. Writeback is excluded by the entry not being on the LRU yet. > > > + * The publishing order matters to prevent writeback from seeing > > > + * an incoherent entry. > > > + */ > > > + if (entry->length) { > > > + INIT_LIST_HEAD(&entry->lru); > > > + zswap_lru_add(&zswap_list_lru, entry); > > > + } > > > + > > > + /* update stats */ > > > + atomic_inc(&zswap_stored_pages); > > > + count_vm_event(ZSWPOUT); > > > > We should probably also batch updating the stats. It actually seems > > like now we don't handle rolling them back upon failure. > > Good point! I assume you are referring only to the "ZSWPOUT" vm event stats > updates and not the "zswap_stored_pages" (since latter is used in limit checking)? I actually meant both. Do we rollback changes to zswap_stored_pages when some stores succeed and some of them fail? I think it's more correct and efficient to update the atomic once after all the pages are successfully compressed and stored. ^ permalink raw reply [flat|nested] 79+ messages in thread
* RE: [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio. 2024-09-25 0:47 ` Yosry Ahmed @ 2024-09-25 1:49 ` Sridhar, Kanchana P 2024-09-25 13:53 ` Johannes Weiner 0 siblings, 1 reply; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-25 1:49 UTC (permalink / raw) To: Yosry Ahmed Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Yosry Ahmed <yosryahmed@google.com> > Sent: Tuesday, September 24, 2024 5:47 PM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com; > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 5/8] mm: zswap: Compress and store a specific page > in a folio. > > [..] 
> > > > > > > +{ > > > > + swp_entry_t swp = folio->swap; > > > > + int type = swp_type(swp); > > > > + pgoff_t offset = swp_offset(swp) + index; > > > > + struct page *page = folio_page(folio, index); > > > > + struct xarray *tree = swap_zswap_tree(swp); > > > > + struct zswap_entry *entry; > > > > + > > > > + if (objcg) > > > > + obj_cgroup_get(objcg); > > > > + > > > > + if (zswap_check_limits()) > > > > + goto reject; > > > > + > > > > + /* allocate entry */ > > > > + entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio)); > > > > + if (!entry) { > > > > + zswap_reject_kmemcache_fail++; > > > > + goto reject; > > > > + } > > > > + > > > > + /* if entry is successfully added, it keeps the reference */ > > > > + if (!zswap_pool_get(pool)) > > > > + goto freepage; > > > > > > I think we can batch this for all pages in zswap_store(), maybe first > > > add zswap_pool_get_many(). > > > > > > I am also wondering if it would be better to batch the limit checking > > > and allocating the entries, to front load any failures before we start > > > compression. Not sure if that's overall better though. > > > > > > To batch allocate entries we will have to also allocate an array to > > > hold them. To batch the limit checking we will have to either allow > > > going further over limit for mTHPs, or check if there is enough > > > clearance to allow for compressing all the pages. Using the > > > uncompressed size will lead to false negatives though, so maybe we can > > > start tracking the average compression ratio for better limit > > > checking. > > > > > > Nhat, Johannes, any thoughts here? I need someone to tell me if I am > > > overthinking this :) > > > > These are all good points. I suppose I was thinking along the same lines > > of what Nhat mentioned in an earlier comment. 
I was trying the > > incremental zswap_pool_get() and limit checks and shrinker invocations > > in case of (all) error conditions to allow different concurrent stores to make > > progress, without favoring only one process's mTHP store. I was thinking > > this would have minimal impact on the process(es) that see the zswap > > limit being exceeded, and that this would be better than preemptively > > checking for the entire mTHP and failing (this could also complicate things > > where no one makes progress because multiple processes run the batch > > checks and fail, when realistically one/many could have triggered > > the shrinker before erroring out, and at least one could have made > > progress). > > On the other hand, if we allow concurrent mTHP swapouts to do limit > checks incrementally, they may all fail at the last page. While if > they all do limit checks beforehand, one of them can proceed. Yes, this is possible too. Although, given the dynamic nature of the usage, even with a check-before-store strategy for mTHP we could end up in a similar situation as the optimistic approach in which we allowed progress until there really was a reason to fail. > > I think we need to agree on a higher-level strategy for limit > checking, both global and per-memcg. The per-memcg limit should be > stricter though, so we may end up having different policies. Sure, this makes sense. One possibility is we could allow zswap to follow the "optimistic approach" used currently, while we manage the limits checking at the memcg level? Something along the lines of mem_cgroup_handle_over_high() that gets called every time after a page-fault is handled; instead checks the cgroup's zswap usage and triggers writeback? This seems like one way of not adding overhead to the reclaim path (zswap will store mTHP until the limit checking causes error and unwinding state), while triggering zswap-LRU based writeback at a higher level to manage the limit. 
> >
> > Would appreciate your perspectives on how this should be handled,
> > and will implement a solution in v8 accordingly.
> >
> > Thanks,
> > Kanchana
>
> > > > +
> > > > +	entry->pool = pool;
> > > > +
> > > > +	if (!zswap_compress(page, entry))
> > > > +		goto put_pool;
> > > > +
> > > > +	entry->swpentry = swp_entry(type, offset);
> > > > +	entry->objcg = objcg;
> > > > +	entry->referenced = true;
> > > > +
> > > > +	if (!zswap_store_entry(tree, entry))
> > > > +		goto store_failed;
> > > > +
> > > > +	if (objcg) {
> > > > +		obj_cgroup_charge_zswap(objcg, entry->length);
> > > > +		count_objcg_event(objcg, ZSWPOUT);
> > > > +	}
> > > > +
> > > > +	/*
> > > > +	 * We finish initializing the entry while it's already in xarray.
> > > > +	 * This is safe because:
> > > > +	 *
> > > > +	 * 1. Concurrent stores and invalidations are excluded by folio lock.
> > > > +	 *
> > > > +	 * 2. Writeback is excluded by the entry not being on the LRU yet.
> > > > +	 *    The publishing order matters to prevent writeback from seeing
> > > > +	 *    an incoherent entry.
> > > > +	 */
> > > > +	if (entry->length) {
> > > > +		INIT_LIST_HEAD(&entry->lru);
> > > > +		zswap_lru_add(&zswap_list_lru, entry);
> > > > +	}
> > > > +
> > > > +	/* update stats */
> > > > +	atomic_inc(&zswap_stored_pages);
> > > > +	count_vm_event(ZSWPOUT);
> > >
> > > We should probably also batch updating the stats. It actually seems
> > > like now we don't handle rolling them back upon failure.
> >
> > Good point! I assume you are referring only to the "ZSWPOUT" vm event
> > stats updates and not "zswap_stored_pages" (since the latter is used
> > in limit checking)?
>
> I actually meant both. Do we roll back changes to zswap_stored_pages
> when some stores succeed and some of them fail?

Yes, we do. zswap_tree_delete() calls zswap_entry_free(), which will
decrement zswap_stored_pages. The only stat that is left in an
incorrect state in this case is the vmstat 'zswpout'.
>
> I think it's more correct and efficient to update the atomic once
> after all the pages are successfully compressed and stored.

Actually, this would need to correlate with the limits-checking
strategy, because the atomic is used there and needs to be as accurate
as possible.

As far as the vmstat 'zswpout' goes, the reason I left it as-is in my
patchset was to be more indicative of the actual zswpout compute events
that occurred (for things like getting the compressions count),
regardless of whether or not the overall mTHP store was successful. If
this vmstat needs to reflect only successful zswpout events (i.e.,
represent the zswap usage), I can fix it by updating it once, only if
the mTHP is stored successfully.

Thanks,
Kanchana

^ permalink raw reply	[flat|nested] 79+ messages in thread
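The incremental-versus-upfront trade-off debated above can be sketched in plain userspace C. All names below are hypothetical stand-ins for zswap's pool accounting (not kernel APIs), and the "concurrent" stores are simulated by sequential interleaving purely for illustration:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for zswap's pool usage accounting. */
static long pool_used;
static const long pool_limit = 6;

/* Incremental strategy: charge one page at a time, as in the patch. */
static bool charge_one(void)
{
	if (pool_used + 1 > pool_limit)
		return false;
	pool_used++;
	return true;
}

/* Upfront strategy: reserve room for all nr pages, or none at all. */
static bool reserve_all(long nr)
{
	if (pool_used + nr > pool_limit)
		return false;
	pool_used += nr;
	return true;
}

/*
 * Two interleaved 4-page stores against a 6-page limit: each store
 * charges pages alternately, both hit the limit before finishing,
 * and both must unwind -- neither makes progress.
 */
static bool incremental_interleaved(void)
{
	long a = 0, b = 0, i;

	pool_used = 0;
	for (i = 0; i < 4; i++) {
		if (charge_one())
			a++;
		if (charge_one())
			b++;
	}
	if (a < 4)
		pool_used -= a;		/* unwind store A */
	if (b < 4)
		pool_used -= b;		/* unwind store B */
	return a == 4 || b == 4;	/* did either store succeed? */
}

/* Upfront: the first reservation wins; the second fails cleanly. */
static bool upfront(void)
{
	bool a, b;

	pool_used = 0;
	a = reserve_all(4);
	b = reserve_all(4);
	return a || b;
}
```

Here incremental_interleaved() yields no winner while upfront() lets one store through, which is exactly the failure mode pointed out above; the counterargument in the reply is that real usage is dynamic, so upfront checks can be pessimistic in other schedules.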
* Re: [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio.
  2024-09-25  1:49           ` Sridhar, Kanchana P
@ 2024-09-25 13:53             ` Johannes Weiner
  2024-09-25 18:45               ` Sridhar, Kanchana P
  0 siblings, 1 reply; 79+ messages in thread
From: Johannes Weiner @ 2024-09-25 13:53 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: Yosry Ahmed, linux-kernel, linux-mm, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh

On Wed, Sep 25, 2024 at 01:49:03AM +0000, Sridhar, Kanchana P wrote:
> > From: Yosry Ahmed <yosryahmed@google.com>
> > I think it's more correct and efficient to update the atomic once
> > after all the pages are successfully compressed and stored.
>
> Actually this would need to correlate with the limits checking strategy,
> because the atomic is used there and needs to be as accurate as possible.

For the limit checks, we use the zpool counters, not zswap_stored_pages.

zswap_stored_pages is used in the zswap shrinker to guesstimate
pressure, so it's likely a good thing to only count entries that are
expected to stay, and not account the ones that might fail just yet.

> As far as the vmstat 'zswpout', the reason I left it as-is in my patchset
> was to be more indicative of the actual zswpout compute events that
> occurred (for things like getting the compressions count), regardless
> of whether or not the overall mTHP store was successful. If this vmstat
> needs to reflect only successful zswpout events (i.e., represent the zswap
> usage), I can fix it by updating it once only if the mTHP is stored
> successfully.

Yeah, that's fine as well.

I would suggest batching them both at the end of zswap_store().

^ permalink raw reply	[flat|nested] 79+ messages in thread
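The "batch them both at the end of zswap_store()" suggestion can be sketched as follows in userspace C. The globals stand in for the atomic zswap_stored_pages counter and the ZSWPOUT vmstat; all names are invented for illustration, and a real implementation would use atomic operations:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for atomic_t zswap_stored_pages and the
 * ZSWPOUT vmstat counter. */
static long stored_pages;
static long zswpout_events;

/* Simulated per-page compression that fails at index fail_at
 * (pass -1 for no failure). */
static bool compress_page(long index, long fail_at)
{
	return index != fail_at;
}

/*
 * Batched accounting: bump the global counters exactly once, after
 * every page of the folio has been compressed and stored. A partial
 * failure then needs no stats rollback at all, since the counters
 * were never touched.
 */
static bool zswap_store_sim(long nr_pages, long fail_at)
{
	long i;

	for (i = 0; i < nr_pages; i++) {
		if (!compress_page(i, fail_at))
			return false;	/* counters never touched */
	}
	stored_pages += nr_pages;
	zswpout_events += nr_pages;
	return true;
}
```

With per-page increments, a failure at page k would require decrementing both counters k times during unwind; batching at the end removes that rollback entirely.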
* RE: [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio.
  2024-09-25 13:53             ` Johannes Weiner
@ 2024-09-25 18:45               ` Sridhar, Kanchana P
  0 siblings, 0 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-25 18:45 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Yosry Ahmed, linux-kernel, linux-mm, nphamcs, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P

> -----Original Message-----
> From: Johannes Weiner <hannes@cmpxchg.org>
> Sent: Wednesday, September 25, 2024 6:54 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Yosry Ahmed <yosryahmed@google.com>; linux-kernel@vger.kernel.org;
> linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com;
> Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-
> foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 5/8] mm: zswap: Compress and store a specific page
> in a folio.
>
> On Wed, Sep 25, 2024 at 01:49:03AM +0000, Sridhar, Kanchana P wrote:
> > > From: Yosry Ahmed <yosryahmed@google.com>
> > > I think it's more correct and efficient to update the atomic once
> > > after all the pages are successfully compressed and stored.
> >
> > Actually this would need to correlate with the limits checking strategy,
> > because the atomic is used there and needs to be as accurate as possible.
>
> For the limit checks, we use the zpool counters, not zswap_stored_pages.

Thanks Johannes for your insights and comments. Yes, you are absolutely
right. My apologies.

> zswap_stored_pages is used in the zswap shrinker to guesstimate
> pressure, so it's likely a good thing to only count entries that are
> expected to stay, and not account the ones that might fail just yet.

Sure, makes sense.
>
> > As far as the vmstat 'zswpout', the reason I left it as-is in my patchset
> > was to be more indicative of the actual zswpout compute events that
> > occurred (for things like getting the compressions count), regardless
> > of whether or not the overall mTHP store was successful. If this vmstat
> > needs to reflect only successful zswpout events (i.e., represent the zswap
> > usage), I can fix it by updating it once only if the mTHP is stored
> > successfully.
>
> Yeah, that's fine as well.
>
> I would suggest batching them both at the end of zswap_store().

Ok, will do so in v8.

Thanks,
Kanchana

^ permalink raw reply	[flat|nested] 79+ messages in thread
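Yosry's earlier suggestion in the thread to batch the pool reference acquisition with a hypothetical zswap_pool_get_many() might look roughly like this userspace sketch. All names and semantics here are assumptions for illustration, not existing kernel APIs:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical pool refcount; the pool starts with one base reference. */
static long pool_refs = 1;

/* Assumed semantics of zswap_pool_get_many(): one bump covering nr
 * future entries, failing if the pool is already being destroyed. */
static bool pool_get_many(long nr)
{
	if (pool_refs == 0)
		return false;
	pool_refs += nr;
	return true;
}

static void pool_put_many(long nr)
{
	pool_refs -= nr;
}

/*
 * Take references for all pages of the folio up front instead of one
 * zswap_pool_get() per page. On success each stored entry keeps one
 * reference; on failure all nr references are dropped in one call.
 */
static bool store_folio_refs(long nr_pages, long fail_at)
{
	long stored = 0, i;

	if (!pool_get_many(nr_pages))
		return false;
	for (i = 0; i < nr_pages; i++) {
		if (i == fail_at)
			break;		/* simulated per-page failure */
		stored++;
	}
	if (stored < nr_pages) {
		pool_put_many(nr_pages);	/* unwind every reference */
		return false;
	}
	return true;
}
```

On the success path the entries would release their references later, when they are freed, via a matching pool_put_many(nr_pages); the base reference is never touched.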
* [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24  1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar
                   ` (4 preceding siblings ...)
  2024-09-24  1:17 ` [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio Kanchana P Sridhar
@ 2024-09-24  1:17 ` Kanchana P Sridhar
  2024-09-24 17:33   ` Nhat Pham
                     ` (2 more replies)
  2024-09-24  1:17 ` [PATCH v7 7/8] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout stats Kanchana P Sridhar
                   ` (3 subsequent siblings)
  9 siblings, 3 replies; 79+ messages in thread
From: Kanchana P Sridhar @ 2024-09-24  1:17 UTC (permalink / raw)
  To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts,
	ying.huang, 21cnbao, akpm
  Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar

zswap_store() will now store mTHP and PMD-size THP folios by compressing
them page by page.

This patch provides a sequential implementation of storing an mTHP in
zswap_store() by iterating through each page in the folio to compress
and store it in the zswap zpool.

Towards this goal, zswap_compress() is modified to take a page instead
of a folio as input.

Each page's swap offset is stored as a separate zswap entry.

If an error is encountered during the store of any page in the mTHP,
all previous pages/entries stored will be invalidated. Thus, an mTHP
is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.

This forms the basis for building batching of pages during zswap store
of large folios by compressing batches of up to say, 8 pages in an
mTHP in parallel in hardware, with the Intel In-Memory Analytics
Accelerator (Intel IAA).

A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
will enable/disable zswap storing of (m)THP. The corresponding tunable
zswap module parameter is "mthp_enabled".
This change reuses and adapts the functionality in Ryan Roberts' RFC
patch [1]:

  "[RFC,v1] mm: zswap: Store large folios without splitting"

  [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u

Also, addressed some of the RFC comments from the discussion in [1].

Co-developed-by: Ryan Roberts
Signed-off-by:
Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 mm/Kconfig |   8 ++++
 mm/zswap.c | 122 +++++++++++++++++++++++++----------------------------
 2 files changed, 66 insertions(+), 64 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 09aebca1cae3..c659fb732ec4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
 	  reducing the chance that cold pages will reside in the zswap pool
 	  and consume memory indefinitely.
 
+config ZSWAP_STORE_THP_DEFAULT_ON
+	bool "Store mTHP and THP folios in zswap"
+	depends on ZSWAP
+	default n
+	help
+	  If selected, zswap will process mTHP and THP folios by
+	  compressing and storing each 4K page in the large folio.
+
 choice
 	prompt "Default compressor"
 	depends on ZSWAP
diff --git a/mm/zswap.c b/mm/zswap.c
index 8f2e0ab34c84..16ab770546d6 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
 		CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
 module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
 
+/*
+ * Enable/disable zswap processing of mTHP folios.
+ * For now, only zswap_store will process mTHP folios.
+ */
+static bool zswap_mthp_enabled = IS_ENABLED(
+		CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
+module_param_named(mthp_enabled, zswap_mthp_enabled, bool, 0644);
+
 bool zswap_is_enabled(void)
 {
 	return zswap_enabled;
@@ -1471,9 +1479,9 @@ static void zswap_delete_stored_offsets(struct xarray *tree,
  * @objcg: The folio's objcg.
  * @pool: The zswap_pool to store the compressed data for the page.
  */
-static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
-					    struct obj_cgroup *objcg,
-					    struct zswap_pool *pool)
+static bool zswap_store_page(struct folio *folio, long index,
+			     struct obj_cgroup *objcg,
+			     struct zswap_pool *pool)
 {
 	swp_entry_t swp = folio->swap;
 	int type = swp_type(swp);
@@ -1551,51 +1559,63 @@ static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
 	return false;
 }
 
+/*
+ * Modified to store mTHP folios. Each page in the mTHP will be compressed
+ * and stored sequentially.
+ */
 bool zswap_store(struct folio *folio)
 {
 	long nr_pages = folio_nr_pages(folio);
 	swp_entry_t swp = folio->swap;
 	pgoff_t offset = swp_offset(swp);
 	struct xarray *tree = swap_zswap_tree(swp);
-	struct zswap_entry *entry;
 	struct obj_cgroup *objcg = NULL;
 	struct mem_cgroup *memcg = NULL;
+	struct zswap_pool *pool;
+	bool ret = false;
+	long index;
 
 	VM_WARN_ON_ONCE(!folio_test_locked(folio));
 	VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
 
-	/* Large folios aren't supported */
-	if (folio_test_large(folio))
+	/* Storing large folios isn't enabled */
+	if (!zswap_mthp_enabled && folio_test_large(folio))
 		return false;
 
 	if (!zswap_enabled)
-		goto check_old;
+		goto reject;
 
-	/* Check cgroup limits */
+	/*
+	 * Check cgroup limits:
+	 *
+	 * The cgroup zswap limit check is done once at the beginning of an
+	 * mTHP store, and not within zswap_store_page() for each page
+	 * in the mTHP. We do however check the zswap pool limits at the
+	 * start of zswap_store_page(). What this means is, the cgroup
+	 * could go over the limits by at most (HPAGE_PMD_NR - 1) pages.
+	 * However, the per-store-page zswap pool limits check should
+	 * hopefully trigger the cgroup aware and zswap LRU aware global
+	 * reclaim implemented in the shrinker.
+	 * If this assumption holds,
+	 * the cgroup exceeding the zswap limits could potentially be
+	 * resolved before the next zswap_store, and if it is not, the next
+	 * zswap_store would fail the cgroup zswap limit check at the start.
+	 */
 	objcg = get_obj_cgroup_from_folio(folio);
 	if (objcg && !obj_cgroup_may_zswap(objcg)) {
 		memcg = get_mem_cgroup_from_objcg(objcg);
 		if (shrink_memcg(memcg)) {
 			mem_cgroup_put(memcg);
-			goto reject;
+			goto put_objcg;
 		}
 		mem_cgroup_put(memcg);
 	}
 
 	if (zswap_check_limits())
-		goto reject;
-
-	/* allocate entry */
-	entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio));
-	if (!entry) {
-		zswap_reject_kmemcache_fail++;
-		goto reject;
-	}
+		goto put_objcg;
 
-	/* if entry is successfully added, it keeps the reference */
-	entry->pool = zswap_pool_current_get();
-	if (!entry->pool)
-		goto freepage;
+	pool = zswap_pool_current_get();
+	if (!pool)
+		goto put_objcg;
 
 	if (objcg) {
 		memcg = get_mem_cgroup_from_objcg(objcg);
@@ -1606,60 +1626,34 @@ bool zswap_store(struct folio *folio)
 		mem_cgroup_put(memcg);
 	}
 
-	if (!zswap_compress(&folio->page, entry))
-		goto put_pool;
-
-	entry->swpentry = swp;
-	entry->objcg = objcg;
-	entry->referenced = true;
-
-	if (!zswap_store_entry(tree, entry))
-		goto store_failed;
-
-	if (objcg) {
-		obj_cgroup_charge_zswap(objcg, entry->length);
-		count_objcg_event(objcg, ZSWPOUT);
-	}
-
 	/*
-	 * We finish initializing the entry while it's already in xarray.
-	 * This is safe because:
-	 *
-	 * 1. Concurrent stores and invalidations are excluded by folio lock.
-	 *
-	 * 2. Writeback is excluded by the entry not being on the LRU yet.
-	 *    The publishing order matters to prevent writeback from seeing
-	 *    an incoherent entry.
+	 * Store each page of the folio as a separate entry. If we fail to store
+	 * a page, unwind by removing all the previous pages we stored.
 	 */
-	if (entry->length) {
-		INIT_LIST_HEAD(&entry->lru);
-		zswap_lru_add(&zswap_list_lru, entry);
+	for (index = 0; index < nr_pages; ++index) {
+		if (!zswap_store_page(folio, index, objcg, pool))
+			goto put_pool;
 	}
 
-	/* update stats */
-	atomic_inc(&zswap_stored_pages);
-	count_vm_event(ZSWPOUT);
-
-	return true;
+	ret = true;
 
-store_failed:
-	zpool_free(entry->pool->zpool, entry->handle);
 put_pool:
-	zswap_pool_put(entry->pool);
-freepage:
-	zswap_entry_cache_free(entry);
-reject:
+	zswap_pool_put(pool);
+put_objcg:
 	obj_cgroup_put(objcg);
 	if (zswap_pool_reached_full)
 		queue_work(shrink_wq, &zswap_shrink_work);
-check_old:
+reject:
 	/*
-	 * If the zswap store fails or zswap is disabled, we must invalidate the
-	 * possibly stale entry which was previously stored at this offset.
-	 * Otherwise, writeback could overwrite the new data in the swapfile.
+	 * If the zswap store fails or zswap is disabled, we must invalidate
+	 * the possibly stale entries which were previously stored at the
+	 * offsets corresponding to each page of the folio. Otherwise,
+	 * writeback could overwrite the new data in the swapfile.
 	 */
-	zswap_delete_stored_offsets(tree, offset, nr_pages);
-	return false;
+	if (!ret)
+		zswap_delete_stored_offsets(tree, offset, nr_pages);
+
+	return ret;
 }
 
 bool zswap_load(struct folio *folio)
-- 
2.27.0

^ permalink raw reply	[flat|nested] 79+ messages in thread
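The all-or-nothing store loop introduced by this patch can be modeled in userspace C. The array below is a toy stand-in for the xarray of swap offsets; all names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_OFFSETS 64

/* Toy stand-in for the xarray: one slot per swap offset. */
static bool toy_tree[NR_OFFSETS];

/* Simulated zswap_store_page(): fails at offset fail_at
 * (pass -1 for no failure). */
static bool toy_store_page(long offset, long fail_at)
{
	if (offset == fail_at)
		return false;
	toy_tree[offset] = true;
	return true;
}

/* Mirrors zswap_delete_stored_offsets(): drop every per-page entry. */
static void toy_delete_stored_offsets(long offset, long nr_pages)
{
	long i;

	for (i = 0; i < nr_pages; i++)
		toy_tree[offset + i] = false;
}

/*
 * Mirrors the zswap_store() loop in the patch: after the call, a large
 * folio is either entirely present or entirely absent.
 */
static bool toy_store_folio(long offset, long nr_pages, long fail_at)
{
	bool ret = true;
	long i;

	for (i = 0; i < nr_pages; i++) {
		if (!toy_store_page(offset + i, fail_at)) {
			ret = false;
			break;
		}
	}
	if (!ret)
		toy_delete_stored_offsets(offset, nr_pages);
	return ret;
}
```

The unwind deliberately covers all nr_pages offsets, not just those stored in this call, matching the patch's handling of possibly stale entries from an earlier store.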
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24  1:17 ` [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store() Kanchana P Sridhar
@ 2024-09-24 17:33   ` Nhat Pham
  2024-09-24 20:51     ` Sridhar, Kanchana P
  2024-09-24 19:38   ` Yosry Ahmed
  2024-09-25 14:27   ` Johannes Weiner
  2 siblings, 1 reply; 79+ messages in thread
From: Nhat Pham @ 2024-09-24 17:33 UTC (permalink / raw)
  To: Kanchana P Sridhar
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao,
	akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal

On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> zswap_store() will now store mTHP and PMD-size THP folios by compressing
> them page by page.
>
> This patch provides a sequential implementation of storing an mTHP in
> zswap_store() by iterating through each page in the folio to compress
> and store it in the zswap zpool.
>
> Towards this goal, zswap_compress() is modified to take a page instead
> of a folio as input.
>
> Each page's swap offset is stored as a separate zswap entry.
>
> If an error is encountered during the store of any page in the mTHP,
> all previous pages/entries stored will be invalidated. Thus, an mTHP
> is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
>
> This forms the basis for building batching of pages during zswap store
> of large folios by compressing batches of up to say, 8 pages in an
> mTHP in parallel in hardware, with the Intel In-Memory Analytics
> Accelerator (Intel IAA).
>
> A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default)
> will enable/disable zswap storing of (m)THP. The corresponding tunable
> zswap module parameter is "mthp_enabled".
>
> This change reuses and adapts the functionality in Ryan Roberts' RFC
> patch [1]:
>
>   "[RFC,v1] mm: zswap: Store large folios without splitting"
>
>   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
>
> Also, addressed some of the RFC comments from the discussion in [1].
>
> Co-developed-by: Ryan Roberts
> Signed-off-by:
> Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> ---
>  mm/Kconfig |   8 ++++
>  mm/zswap.c | 122 +++++++++++++++++++++++++----------------------------
>  2 files changed, 66 insertions(+), 64 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 09aebca1cae3..c659fb732ec4 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
>            reducing the chance that cold pages will reside in the zswap pool
>            and consume memory indefinitely.
>
> +config ZSWAP_STORE_THP_DEFAULT_ON
> +       bool "Store mTHP and THP folios in zswap"
> +       depends on ZSWAP
> +       default n
> +       help
> +         If selected, zswap will process mTHP and THP folios by
> +         compressing and storing each 4K page in the large folio.
> +
>  choice
>         prompt "Default compressor"
>         depends on ZSWAP
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 8f2e0ab34c84..16ab770546d6 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
>                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
>  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
>
> +/*
> + * Enable/disable zswap processing of mTHP folios.
> + * For now, only zswap_store will process mTHP folios.
> + */
> +static bool zswap_mthp_enabled = IS_ENABLED(
> +               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
> +module_param_named(mthp_enabled, zswap_mthp_enabled, bool, 0644);
> +

Hmm, so this is a runtime knob. Also, should this be zswap_thp_enabled?
:)

> bool zswap_is_enabled(void)
> {
>         return zswap_enabled;
> @@ -1471,9 +1479,9 @@ static void zswap_delete_stored_offsets(struct xarray *tree,
>   * @objcg: The folio's objcg.
>   * @pool: The zswap_pool to store the compressed data for the page.
>   */
> -static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
> -                                           struct obj_cgroup *objcg,
> -                                           struct zswap_pool *pool)
> +static bool zswap_store_page(struct folio *folio, long index,
> +                            struct obj_cgroup *objcg,
> +                            struct zswap_pool *pool)
> {
>         swp_entry_t swp = folio->swap;
>         int type = swp_type(swp);
> @@ -1551,51 +1559,63 @@ static bool __maybe_unused zswap_store_page(struct folio *folio, long index,
>         return false;
> }
>
> +/*
> + * Modified to store mTHP folios. Each page in the mTHP will be compressed
> + * and stored sequentially.
> + */
> bool zswap_store(struct folio *folio)
> {
>         long nr_pages = folio_nr_pages(folio);
>         swp_entry_t swp = folio->swap;
>         pgoff_t offset = swp_offset(swp);
>         struct xarray *tree = swap_zswap_tree(swp);
> -       struct zswap_entry *entry;
>         struct obj_cgroup *objcg = NULL;
>         struct mem_cgroup *memcg = NULL;
> +       struct zswap_pool *pool;
> +       bool ret = false;
> +       long index;
>
>         VM_WARN_ON_ONCE(!folio_test_locked(folio));
>         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
>
> -       /* Large folios aren't supported */
> -       if (folio_test_large(folio))
> +       /* Storing large folios isn't enabled */
> +       if (!zswap_mthp_enabled && folio_test_large(folio))
>                 return false;

Hmm, can this go wrong somehow? Can we have a case where we enable
zswap_mthp_enabled, have a large folio written to zswap, disable
zswap_mthp_enabled, and attempt to store that folio to zswap again?

Now, we have a stale copy in zswap that is not invalidated...?

Or am I missing something here :)

^ permalink raw reply	[flat|nested] 79+ messages in thread
* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24 17:33   ` Nhat Pham
@ 2024-09-24 20:51     ` Sridhar, Kanchana P
  2024-09-24 21:08       ` Nhat Pham
  0 siblings, 1 reply; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-24 20:51 UTC (permalink / raw)
  To: Nhat Pham
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar,
	Kanchana P

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Tuesday, September 24, 2024 10:34 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying
> <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org;
> Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in
> zswap_store().
>
> On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > zswap_store() will now store mTHP and PMD-size THP folios by compressing
> > them page by page.
> >
> > This patch provides a sequential implementation of storing an mTHP in
> > zswap_store() by iterating through each page in the folio to compress
> > and store it in the zswap zpool.
> >
> > Towards this goal, zswap_compress() is modified to take a page instead
> > of a folio as input.
> >
> > Each page's swap offset is stored as a separate zswap entry.
> >
> > If an error is encountered during the store of any page in the mTHP,
> > all previous pages/entries stored will be invalidated. Thus, an mTHP
> > is either entirely stored in ZSWAP, or entirely not stored in ZSWAP.
> >
> > This forms the basis for building batching of pages during zswap store
> > of large folios by compressing batches of up to say, 8 pages in an
> > mTHP in parallel in hardware, with the Intel In-Memory Analytics
> > Accelerator (Intel IAA).
> >
> > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by
> default)
> > will enable/disable zswap storing of (m)THP. The corresponding tunable
> > zswap module parameter is "mthp_enabled".
> >
> > This change reuses and adapts the functionality in Ryan Roberts' RFC
> > patch [1]:
> >
> >   "[RFC,v1] mm: zswap: Store large folios without splitting"
> >
> >   [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-
> ryan.roberts@arm.com/T/#u
> >
> > Also, addressed some of the RFC comments from the discussion in [1].
> >
> > Co-developed-by: Ryan Roberts
> > Signed-off-by:
> > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
> > ---
> >  mm/Kconfig |   8 ++++
> >  mm/zswap.c | 122 +++++++++++++++++++++++++----------------------------
> >  2 files changed, 66 insertions(+), 64 deletions(-)
> >
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 09aebca1cae3..c659fb732ec4 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON
> >            reducing the chance that cold pages will reside in the zswap pool
> >            and consume memory indefinitely.
> >
> > +config ZSWAP_STORE_THP_DEFAULT_ON
> > +       bool "Store mTHP and THP folios in zswap"
> > +       depends on ZSWAP
> > +       default n
> > +       help
> > +         If selected, zswap will process mTHP and THP folios by
> > +         compressing and storing each 4K page in the large folio.
> > +
> > choice
> >         prompt "Default compressor"
> >         depends on ZSWAP
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 8f2e0ab34c84..16ab770546d6 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
> >                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
> > module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool,
> 0644);
> >
> > +/*
> > + * Enable/disable zswap processing of mTHP folios.
> > + * For now, only zswap_store will process mTHP folios.
> > + */
> > +static bool zswap_mthp_enabled = IS_ENABLED(
> > +               CONFIG_ZSWAP_STORE_THP_DEFAULT_ON);
> > +module_param_named(mthp_enabled, zswap_mthp_enabled, bool,
> 0644);
> > +
>
> Hmm, so this is a runtime knob. Also, should this be zswap_thp_enabled? :)

Agreed, zswap_thp_enabled is a better name. I will make the change in v8.
More comments below as to the runtime knob.

>
> > bool zswap_is_enabled(void)
> > {
> >         return zswap_enabled;
> > @@ -1471,9 +1479,9 @@ static void zswap_delete_stored_offsets(struct
> xarray *tree,
> >   * @objcg: The folio's objcg.
> >   * @pool: The zswap_pool to store the compressed data for the page.
> >   */
> > -static bool __maybe_unused zswap_store_page(struct folio *folio, long
> index,
> > -                                           struct obj_cgroup *objcg,
> > -                                           struct zswap_pool *pool)
> > +static bool zswap_store_page(struct folio *folio, long index,
> > +                            struct obj_cgroup *objcg,
> > +                            struct zswap_pool *pool)
> > {
> >         swp_entry_t swp = folio->swap;
> >         int type = swp_type(swp);
> > @@ -1551,51 +1559,63 @@ static bool __maybe_unused
> zswap_store_page(struct folio *folio, long index,
> >         return false;
> > }
> >
> > +/*
> > + * Modified to store mTHP folios. Each page in the mTHP will be
> compressed
> > + * and stored sequentially.
> > + */
> > bool zswap_store(struct folio *folio)
> > {
> >         long nr_pages = folio_nr_pages(folio);
> >         swp_entry_t swp = folio->swap;
> >         pgoff_t offset = swp_offset(swp);
> >         struct xarray *tree = swap_zswap_tree(swp);
> > -       struct zswap_entry *entry;
> >         struct obj_cgroup *objcg = NULL;
> >         struct mem_cgroup *memcg = NULL;
> > +       struct zswap_pool *pool;
> > +       bool ret = false;
> > +       long index;
> >
> >         VM_WARN_ON_ONCE(!folio_test_locked(folio));
> >         VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
> >
> > -       /* Large folios aren't supported */
> > -       if (folio_test_large(folio))
> > +       /* Storing large folios isn't enabled */
> > +       if (!zswap_mthp_enabled && folio_test_large(folio))
> >                 return false;
>
> Hmm can this go wrong somehow? Can we have a case where we enable
> zswap_mthp_enabled, have a large folio written to zswap, disable
> zswap_mthp_enabled, and attempt to store that folio to zswap again.
>
> Now, we have a stale copy in zswap that is not invalidated...?
>
> Or am I missing something here :)

This is an excellent point. Thanks Nhat for catching this! I can see
two options for solving this:

Option 1: If zswap_mthp_enabled is "false", delete all stored offsets
for the mTHP in zswap before exiting. This could race with writeback
(either one or more subpages could be written back before zswap_store
acquires the tree lock); however, I don't think it will cause data
inconsistencies. Any offsets for subpages not written back will be
deleted from zswap, zswap_store() will return false, and the backing
swap device's subsequent swapout will overwrite the data that zswap
wrote back. Could anything go wrong with this?

Option 2: Only provide a build config option,
CONFIG_ZSWAP_STORE_THP_DEFAULT_ON, that cannot be dynamically changed.

Would appreciate suggestions on these, and other potential solutions.

Thanks,
Kanchana

^ permalink raw reply	[flat|nested] 79+ messages in thread
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24 20:51     ` Sridhar, Kanchana P
@ 2024-09-24 21:08       ` Nhat Pham
  2024-09-24 21:34         ` Yosry Ahmed
  0 siblings, 1 reply; 79+ messages in thread
From: Nhat Pham @ 2024-09-24 21:08 UTC (permalink / raw)
  To: Sridhar, Kanchana P
  Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou,
	usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao,
	akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh

On Tue, Sep 24, 2024 at 1:51 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
>
> This is an excellent point. Thanks Nhat for catching this! I can see two
> options for solving this:
>
> Option 1: If zswap_mthp_enabled is "false", delete all stored offsets
> for the mTHP in zswap before exiting. This could race with writeback
> (either one or more subpages could be written back before zswap_store
> acquires the tree lock); however, I don't think it will cause data
> inconsistencies. Any offsets for subpages not written back will be
> deleted from zswap, zswap_store() will return false, and the backing
> swap device's subsequent swapout will overwrite the data that zswap
> wrote back. Could anything go wrong with this?

I think this should be safe, albeit a bit awkward.

At this point (zswap_store()), we should have the folio added to the
swap cache, and locked. All the associated swap entries will point to
this same (large) folio.

Any concurrent zswap writeback attempt, even on a tail page, should
get that folio when it calls __read_swap_cache_async(), with
page_allocated == false, and should short-circuit.

So I don't think we will race with zswap_writeback().

Yosry, Chengming, Johannes, any thoughts?

> Option 2: Only provide a build config option,
> CONFIG_ZSWAP_STORE_THP_DEFAULT_ON, that cannot be dynamically changed.

This can be a last resort thing, if the above doesn't work.
Not the end of the world, but not ideal :)

> Would appreciate suggestions on these, and other potential solutions.
>
> Thanks,
> Kanchana

^ permalink raw reply	[flat|nested] 79+ messages in thread
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store().
  2024-09-24 21:08       ` Nhat Pham
@ 2024-09-24 21:34         ` Yosry Ahmed
  2024-09-24 22:16           ` Nhat Pham
  2024-09-24 22:17           ` Sridhar, Kanchana P
  0 siblings, 2 replies; 79+ messages in thread
From: Yosry Ahmed @ 2024-09-24 21:34 UTC (permalink / raw)
  To: Nhat Pham
  Cc: Sridhar, Kanchana P, linux-kernel, linux-mm, hannes,
	chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang,
	Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh

On Tue, Sep 24, 2024 at 2:08 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Sep 24, 2024 at 1:51 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> >
> > This is an excellent point. Thanks Nhat for catching this! I can see two
> > options for solving this:
> >
> > Option 1: If zswap_mthp_enabled is "false", delete all stored offsets
> > for the mTHP in zswap before exiting. This could race with writeback
> > (either one or more subpages could be written back before zswap_store
> > acquires the tree lock); however, I don't think it will cause data
> > inconsistencies. Any offsets for subpages not written back will be
> > deleted from zswap, zswap_store() will return false, and the backing
> > swap device's subsequent swapout will overwrite the data that zswap
> > wrote back. Could anything go wrong with this?
>
> I think this should be safe, albeit a bit awkward.
>
> At this point (zswap_store()), we should have the folio added to the
> swap cache, and locked. All the associated swap entries will point to
> this same (large) folio.
>
> Any concurrent zswap writeback attempt, even on a tail page, should
> get that folio when it calls __read_swap_cache_async(), with
> page_allocated == false, and should short-circuit.
>
> So I don't think we will race with zswap_writeback().
>
> Yosry, Chengming, Johannes, any thoughts?

Why can't we just handle it the same way as we handle zswap
disablement?
If it is disabled, we invalidate any old entries for the offsets and return false to swapout to disk. Taking a step back, why do we need the runtime knob and config option? Are there cases where we think zswapout of mTHPs will perform badly, or is it just due to lack of confidence in the feature? > > > > > Option 2: Only provide a build config option, > > CONFIG_ZSWAP_STORE_THP_DEFAULT_ON, that cannot be dynamically changed. > > This can be a last resort thing, if the above doesn't work. Not the > end of the world, but not ideal :) > > > > > Would appreciate suggestions on these, and other potential solutions. > > > > Thanks, > > Kanchana ^ permalink raw reply [flat|nested] 79+ messages in thread
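The invariant Yosry is pointing at — if the store path rejects a large folio for any reason, it must still invalidate any stale entries at all of the folio's offsets before returning false, or a later writeback of the old entries could overwrite the new data in the swapfile — can be modeled in a compilable userspace sketch. All names below are invented for illustration (the kernel uses an xarray and struct zswap_entry, not this toy table):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the zswap xarray: tree[i] != 0 means offset i is stored. */
#define NR_OFFSETS 1024
static int tree[NR_OFFSETS];

static bool zswap_mthp_enabled; /* models the proposed runtime knob */

/* Delete any (possibly stale) entries covering the folio's offsets. */
static void delete_stored_offsets(size_t offset, size_t nr_pages)
{
    for (size_t i = 0; i < nr_pages; i++)
        tree[offset + i] = 0;
}

/*
 * Models the rejection path under discussion: even when large-folio
 * storing is disabled, stale entries must be invalidated before
 * falling back to the backing swap device.
 */
static bool store_large_folio(size_t offset, size_t nr_pages)
{
    if (!zswap_mthp_enabled && nr_pages > 1) {
        delete_stored_offsets(offset, nr_pages);
        return false; /* caller falls back to the swap device */
    }
    for (size_t i = 0; i < nr_pages; i++)
        tree[offset + i] = 1; /* pretend compression succeeded */
    return true;
}
```

This mirrors how zswap already handles plain disablement: reject, but clean up first.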
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-24 21:34 ` Yosry Ahmed @ 2024-09-24 22:16 ` Nhat Pham 2024-09-24 22:18 ` Sridhar, Kanchana P 2024-09-24 22:28 ` Yosry Ahmed 1 sibling, 2 replies; 79+ messages in thread From: Nhat Pham @ 2024-09-24 22:16 UTC (permalink / raw) To: Yosry Ahmed Cc: Sridhar, Kanchana P, linux-kernel, linux-mm, hannes, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh On Tue, Sep 24, 2024 at 2:34 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > Why can't we just handle it the same way as we handle zswap > disablement? If it is disabled, we invalidate any old entries for the > offsets and return false to swapout to disk. I think that was the suggestion. > > Taking a step back, why do we need the runtime knob and config option? > Are there cases where we think zswapout of mTHPs will perform badly, > or is it just due to lack of confidence in the feature? Fair point. I think the reason why I suggested this knob was because we observed so many regressions in earlier benchmarks, especially in the software compressor column. But now that we've reworked the benchmark + use zstd for the software compressor, I think we can get rid of this knob/config option, and simplify things. ^ permalink raw reply [flat|nested] 79+ messages in thread
* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-24 22:16 ` Nhat Pham @ 2024-09-24 22:18 ` Sridhar, Kanchana P 2024-09-24 22:28 ` Yosry Ahmed 1 sibling, 0 replies; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-24 22:18 UTC (permalink / raw) To: Nhat Pham, Yosry Ahmed Cc: linux-kernel, linux-mm, hannes, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Nhat Pham <nphamcs@gmail.com> > Sent: Tuesday, September 24, 2024 3:16 PM > To: Yosry Ahmed <yosryahmed@google.com> > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux- > kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org; > chengming.zhou@linux.dev; usamaarif642@gmail.com; > shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org; > Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in > zswap_store(). > > On Tue, Sep 24, 2024 at 2:34 PM Yosry Ahmed <yosryahmed@google.com> > wrote: > > > > > > Why can't we just handle it the same way as we handle zswap > > disablement? If it is disabled, we invalidate any old entries for the > > offsets and return false to swapout to disk. > > I think that was the suggestion. > > > > > Taking a step back, why do we need the runtime knob and config option? > > Are there cases where we think zswapout of mTHPs will perform badly, > > or is it just due to lack of confidence in the feature? > > Fair point. I think the reason why I suggested this knob was because > we observe so much regressions in earlier benchmarks, and especially > on the software compressor column. 
> > But now that we've reworked the benchmark + use zstd for software > compressor, I think we can get rid of this knob/config option, and > simplify things. I agree, thanks Nhat! Will fix this in v8. Thanks, Kanchana ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-24 22:16 ` Nhat Pham 2024-09-24 22:18 ` Sridhar, Kanchana P @ 2024-09-24 22:28 ` Yosry Ahmed 1 sibling, 0 replies; 79+ messages in thread From: Yosry Ahmed @ 2024-09-24 22:28 UTC (permalink / raw) To: Nhat Pham Cc: Sridhar, Kanchana P, linux-kernel, linux-mm, hannes, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh On Tue, Sep 24, 2024 at 3:16 PM Nhat Pham <nphamcs@gmail.com> wrote: > > On Tue, Sep 24, 2024 at 2:34 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > Why can't we just handle it the same way as we handle zswap > > disablement? If it is disabled, we invalidate any old entries for the > > offsets and return false to swapout to disk. > > I think that was the suggestion. Hmm I may be reading this wrong, but my understanding was that the suggestion is to synchronously remove all entries of large folios from zswap when zswap_mthp is disabled. What I am suggesting is to do the same thing we do in zswap_store() when zswap is disabled. Anyway, if we are removing the knob this is not relevant anymore. ^ permalink raw reply [flat|nested] 79+ messages in thread
* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-24 21:34 ` Yosry Ahmed 2024-09-24 22:16 ` Nhat Pham @ 2024-09-24 22:17 ` Sridhar, Kanchana P 1 sibling, 0 replies; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-24 22:17 UTC (permalink / raw) To: Yosry Ahmed, Nhat Pham Cc: linux-kernel, linux-mm, hannes, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Yosry Ahmed <yosryahmed@google.com> > Sent: Tuesday, September 24, 2024 2:34 PM > To: Nhat Pham <nphamcs@gmail.com> > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux- > kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org; > chengming.zhou@linux.dev; usamaarif642@gmail.com; > shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org; > Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in > zswap_store(). > > On Tue, Sep 24, 2024 at 2:08 PM Nhat Pham <nphamcs@gmail.com> wrote: > > > > On Tue, Sep 24, 2024 at 1:51 PM Sridhar, Kanchana P > > <kanchana.p.sridhar@intel.com> wrote: > > > > > > > > > This is an excellent point. Thanks Nhat for catching this! I can see two > > > options to solving this: > > > > > > Option 1: If zswap_mthp_enabled is "false", delete all stored offsets > > > for the mTHP in zswap before exiting. This could race with writeback > > > (either one or more subpages could be written back before zswap_store > > > acquires the tree lock), however, I don't think it will cause data > inconsistencies. 
> > > Any offsets for subpages not written back will be deleted from zswap, > > > zswap_store() will return false, and the backing swap device's subsequent > > > swapout will over-write the zswap write-back data. Could anything go > wrong > > > with this? > > > > I think this should be safe, albeit a bit awkward. > > > > At this point (zswap_store()), we should have the folio added to to > > swap cache, and locked. All the associated swap entries will point to > > this same (large) folio. > > > > Any concurrent zswap writeback attempt, even on a tail page, should > > get that folio when it calls __read_swap_cache_async(), and with > > page_allocated == false, and should short circuit. > > > > So I don't think we will race with zswap_writeback(). > > > > Yosry, Chengming, Johannes, any thoughts? > > Why can't we just handle it the same way as we handle zswap > disablement? If it is disabled, we invalidate any old entries for the > offsets and return false to swapout to disk. > > Taking a step back, why do we need the runtime knob and config option? > Are there cases where we think zswapout of mTHPs will perform badly, > or is it just due to lack of confidence in the feature? Thanks Nhat and Yosry for the suggestions/comments. If I recall correctly, the topic of adding a config option/knob came up based on earlier data I had collected with a zram backing device setup, which showed a performance degradation with zstd, but not with deflate-iaa. Since the v7 data collected with an 823G SSD swap disk partition indicates that we get good throughput and latency improvements with zswap-mTHP with zstd and deflate-iaa, I am not sure if the knob is still required (if this is representative of most of the setups that use mTHP). I am confident about the zswap-mTHP feature itself, and don’t think the knob is needed from that perspective. 
I think the question is really about having the ability to disable zswap-mTHP in some existing setup where having mTHP enabled performs worse with this patchset than without. I am Ok with having the knob and handling it using Option 1, or, not having a knob. Thanks, Kanchana > > > > > > > > > Option 2: Only provide a build config option, > > > CONFIG_ZSWAP_STORE_THP_DEFAULT_ON, that cannot be dynamically > changed. > > > > This can be a last resort thing, if the above doesn't work. Not the > > end of the world, but not ideal :) > > > > > > > > Would appreciate suggestions on these, and other potential solutions. > > > > > > Thanks, > > > Kanchana ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-24 1:17 ` [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store() Kanchana P Sridhar 2024-09-24 17:33 ` Nhat Pham @ 2024-09-24 19:38 ` Yosry Ahmed 2024-09-24 20:51 ` Nhat Pham ` (2 more replies) 2024-09-25 14:27 ` Johannes Weiner 2 siblings, 3 replies; 79+ messages in thread From: Yosry Ahmed @ 2024-09-24 19:38 UTC (permalink / raw) To: Kanchana P Sridhar Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar <kanchana.p.sridhar@intel.com> wrote: > > zswap_store() will now store mTHP and PMD-size THP folios by compressing > them page by page. > > This patch provides a sequential implementation of storing an mTHP in > zswap_store() by iterating through each page in the folio to compress > and store it in the zswap zpool. > > Towards this goal, zswap_compress() is modified to take a page instead > of a folio as input. > > Each page's swap offset is stored as a separate zswap entry. > > If an error is encountered during the store of any page in the mTHP, > all previous pages/entries stored will be invalidated. Thus, an mTHP > is either entirely stored in ZSWAP, or entirely not stored in ZSWAP. > > This forms the basis for building batching of pages during zswap store > of large folios by compressing batches of up to say, 8 pages in an > mTHP in parallel in hardware, with the Intel In-Memory Analytics > Accelerator (Intel IAA). > > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) > will enable/disable zswap storing of (m)THP. The corresponding tunable > zswap module parameter is "mthp_enabled". 
> > This change reuses and adapts the functionality in Ryan Roberts' RFC > patch [1]: > > "[RFC,v1] mm: zswap: Store large folios without splitting" > > [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u > > Also, addressed some of the RFC comments from the discussion in [1]. > > Co-developed-by: Ryan Roberts > Signed-off-by: > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > --- > mm/Kconfig | 8 ++++ > mm/zswap.c | 122 +++++++++++++++++++++++++---------------------------- > 2 files changed, 66 insertions(+), 64 deletions(-) > > diff --git a/mm/Kconfig b/mm/Kconfig > index 09aebca1cae3..c659fb732ec4 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON > reducing the chance that cold pages will reside in the zswap pool > and consume memory indefinitely. > > +config ZSWAP_STORE_THP_DEFAULT_ON > + bool "Store mTHP and THP folios in zswap" > + depends on ZSWAP > + default n > + help > + If selected, zswap will process mTHP and THP folios by > + compressing and storing each 4K page in the large folio. > + > choice > prompt "Default compressor" > depends on ZSWAP > diff --git a/mm/zswap.c b/mm/zswap.c > index 8f2e0ab34c84..16ab770546d6 100644 > --- a/mm/zswap.c > +++ b/mm/zswap.c > @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled = IS_ENABLED( > CONFIG_ZSWAP_SHRINKER_DEFAULT_ON); > module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644); > > +/* > + * Enable/disable zswap processing of mTHP folios. > + * For now, only zswap_store will process mTHP folios. > + */ > +static bool zswap_mthp_enabled = IS_ENABLED( > + CONFIG_ZSWAP_STORE_THP_DEFAULT_ON); > +module_param_named(mthp_enabled, zswap_mthp_enabled, bool, 0644); > + > bool zswap_is_enabled(void) > { > return zswap_enabled; > @@ -1471,9 +1479,9 @@ static void zswap_delete_stored_offsets(struct xarray *tree, > * @objcg: The folio's objcg. 
> * @pool: The zswap_pool to store the compressed data for the page. > */ > -static bool __maybe_unused zswap_store_page(struct folio *folio, long index, > - struct obj_cgroup *objcg, > - struct zswap_pool *pool) > +static bool zswap_store_page(struct folio *folio, long index, > + struct obj_cgroup *objcg, > + struct zswap_pool *pool) As I mentioned earlier, the patch that introduced zswap_store_page() should have directly used it in zswap_store(). This would make this patch much clearer. > { > swp_entry_t swp = folio->swap; > int type = swp_type(swp); > @@ -1551,51 +1559,63 @@ static bool __maybe_unused zswap_store_page(struct folio *folio, long index, > return false; > } > > +/* > + * Modified to store mTHP folios. Each page in the mTHP will be compressed > + * and stored sequentially. > + */ > bool zswap_store(struct folio *folio) > { > long nr_pages = folio_nr_pages(folio); > swp_entry_t swp = folio->swap; > pgoff_t offset = swp_offset(swp); > struct xarray *tree = swap_zswap_tree(swp); > - struct zswap_entry *entry; > struct obj_cgroup *objcg = NULL; > struct mem_cgroup *memcg = NULL; > + struct zswap_pool *pool; > + bool ret = false; > + long index; > > VM_WARN_ON_ONCE(!folio_test_locked(folio)); > VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); > > - /* Large folios aren't supported */ > - if (folio_test_large(folio)) > + /* Storing large folios isn't enabled */ The comment is now stating the obvious, please remove it. > + if (!zswap_mthp_enabled && folio_test_large(folio)) > return false; > > if (!zswap_enabled) > - goto check_old; > + goto reject; > > - /* Check cgroup limits */ > + /* > + * Check cgroup limits: > + * > + * The cgroup zswap limit check is done once at the beginning of an > + * mTHP store, and not within zswap_store_page() for each page > + * in the mTHP. We do however check the zswap pool limits at the > + * start of zswap_store_page(). What this means is, the cgroup > + * could go over the limits by at most (HPAGE_PMD_NR - 1) pages. 
> + * However, the per-store-page zswap pool limits check should > + * hopefully trigger the cgroup aware and zswap LRU aware global > + * reclaim implemented in the shrinker. If this assumption holds, > + * the cgroup exceeding the zswap limits could potentially be > + * resolved before the next zswap_store, and if it is not, the next > + * zswap_store would fail the cgroup zswap limit check at the start. > + */ I do not really like this. Allowing going one page above the limit is one thing, but one THP above the limit seems too much. I also don't like relying on the repeated limit checking in zswap_store_page(), if anything I think that should be batched too. Is it too unreasonable to maintain the average compression ratio and use that to estimate limit checking for both memcg and global limits? Johannes, Nhat, any thoughts on this? > objcg = get_obj_cgroup_from_folio(folio); > if (objcg && !obj_cgroup_may_zswap(objcg)) { > memcg = get_mem_cgroup_from_objcg(objcg); > if (shrink_memcg(memcg)) { > mem_cgroup_put(memcg); > - goto reject; > + goto put_objcg; > } > mem_cgroup_put(memcg); > } > > if (zswap_check_limits()) > - goto reject; > - > - /* allocate entry */ > - entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio)); > - if (!entry) { > - zswap_reject_kmemcache_fail++; > - goto reject; > - } > + goto put_objcg; > > - /* if entry is successfully added, it keeps the reference */ > - entry->pool = zswap_pool_current_get(); > - if (!entry->pool) > - goto freepage; > + pool = zswap_pool_current_get(); > + if (!pool) > + goto put_objcg; > > if (objcg) { > memcg = get_mem_cgroup_from_objcg(objcg); > @@ -1606,60 +1626,34 @@ bool zswap_store(struct folio *folio) > mem_cgroup_put(memcg); > } > > - if (!zswap_compress(&folio->page, entry)) > - goto put_pool; > - > - entry->swpentry = swp; > - entry->objcg = objcg; > - entry->referenced = true; > - > - if (!zswap_store_entry(tree, entry)) > - goto store_failed; > - > - if (objcg) { > - 
obj_cgroup_charge_zswap(objcg, entry->length); > - count_objcg_event(objcg, ZSWPOUT); > - } > - > /* > - * We finish initializing the entry while it's already in xarray. > - * This is safe because: > - * > - * 1. Concurrent stores and invalidations are excluded by folio lock. > - * > - * 2. Writeback is excluded by the entry not being on the LRU yet. > - * The publishing order matters to prevent writeback from seeing > - * an incoherent entry. > + * Store each page of the folio as a separate entry. If we fail to store > + * a page, unwind by removing all the previous pages we stored. > */ > - if (entry->length) { > - INIT_LIST_HEAD(&entry->lru); > - zswap_lru_add(&zswap_list_lru, entry); > + for (index = 0; index < nr_pages; ++index) { > + if (!zswap_store_page(folio, index, objcg, pool)) > + goto put_pool; > } > > - /* update stats */ > - atomic_inc(&zswap_stored_pages); > - count_vm_event(ZSWPOUT); > - > - return true; > + ret = true; > > -store_failed: > - zpool_free(entry->pool->zpool, entry->handle); > put_pool: > - zswap_pool_put(entry->pool); > -freepage: > - zswap_entry_cache_free(entry); > -reject: > + zswap_pool_put(pool); > +put_objcg: > obj_cgroup_put(objcg); > if (zswap_pool_reached_full) > queue_work(shrink_wq, &zswap_shrink_work); > -check_old: > +reject: > /* > - * If the zswap store fails or zswap is disabled, we must invalidate the > - * possibly stale entry which was previously stored at this offset. > - * Otherwise, writeback could overwrite the new data in the swapfile. > + * If the zswap store fails or zswap is disabled, we must invalidate > + * the possibly stale entries which were previously stored at the > + * offsets corresponding to each page of the folio. Otherwise, > + * writeback could overwrite the new data in the swapfile. 
> */ > - zswap_delete_stored_offsets(tree, offset, nr_pages); > - return false; > + if (!ret) > + zswap_delete_stored_offsets(tree, offset, nr_pages); > + > + return ret; > } > > bool zswap_load(struct folio *folio) > -- > 2.27.0 > ^ permalink raw reply [flat|nested] 79+ messages in thread
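The all-or-nothing semantics the patch introduces — store each page of the folio, and on any failure unwind so the folio is either entirely in zswap or entirely not — can be demonstrated with a small userspace model. This is a sketch with invented names, not the kernel code; the failure injection exists only so the unwind path can be exercised:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_SLOTS 64
static int stored[NR_SLOTS];

/* Models a per-page store; 'fail_at' injects a failure for testing. */
static long fail_at = -1;

static bool store_page(size_t offset, long index)
{
    if (index == fail_at)
        return false;
    stored[offset + index] = 1;
    return true;
}

static void delete_stored_offsets(size_t offset, long nr_pages)
{
    for (long i = 0; i < nr_pages; i++)
        stored[offset + i] = 0;
}

/* Either the whole folio lands in the (toy) pool, or none of it does. */
static bool store_folio(size_t offset, long nr_pages)
{
    bool ret = true;

    for (long index = 0; index < nr_pages; ++index) {
        if (!store_page(offset, index)) {
            ret = false;
            break;
        }
    }
    /* Unwind: also deletes any stale entries from a previous store. */
    if (!ret)
        delete_stored_offsets(offset, nr_pages);
    return ret;
}
```

Deleting across the full range on failure (rather than only the pages stored so far) is what makes the cleanup double as invalidation of possibly-stale older entries, as the patch's comment explains.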
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-24 19:38 ` Yosry Ahmed @ 2024-09-24 20:51 ` Nhat Pham 2024-09-24 21:38 ` Yosry Ahmed 2024-09-24 23:21 ` Sridhar, Kanchana P 2024-09-24 23:02 ` Sridhar, Kanchana P 2024-09-25 13:40 ` Johannes Weiner 2 siblings, 2 replies; 79+ messages in thread From: Nhat Pham @ 2024-09-24 20:51 UTC (permalink / raw) To: Yosry Ahmed Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal On Tue, Sep 24, 2024 at 12:39 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar > > + * The cgroup zswap limit check is done once at the beginning of an > > + * mTHP store, and not within zswap_store_page() for each page > > + * in the mTHP. We do however check the zswap pool limits at the > > + * start of zswap_store_page(). What this means is, the cgroup > > + * could go over the limits by at most (HPAGE_PMD_NR - 1) pages. > > + * However, the per-store-page zswap pool limits check should > > + * hopefully trigger the cgroup aware and zswap LRU aware global > > + * reclaim implemented in the shrinker. If this assumption holds, > > + * the cgroup exceeding the zswap limits could potentially be > > + * resolved before the next zswap_store, and if it is not, the next > > + * zswap_store would fail the cgroup zswap limit check at the start. > > + */ > > I do not really like this. Allowing going one page above the limit is > one thing, but one THP above the limit seems too much. I also don't Hmm what if you have multiple concurrent zswap stores, from different tasks but the same cgroup? If none of them has charged, they would all get greenlit, and charge towards the cgroup... So technically the zswap limit checking is already best-effort only. But now, instead of one page per violation, it's 512 pages per violation :) Yeah this can be bad. 
I think this is only safe if you only use zswap.max as a binary knob (0 or max)... > like relying on the repeated limit checking in zswap_store_page(), if > anything I think that should be batched too. > > Is it too unreasonable to maintain the average compression ratio and > use that to estimate limit checking for both memcg and global limits? > Johannes, Nhat, any thoughts on this? I remember asking about this, but past Nhat might have relented :) https://lore.kernel.org/linux-mm/CAKEwX=PfAMZ2qJtwKwJsVx3TZWxV5z2ZaU1Epk1UD=DBdMsjFA@mail.gmail.com/ We can do limit checking and charging after compression is done, but that's a lot of code change (might not even be possible)... It will, however, allow us to do charging + checking in one go (rather than doing it 8, 16, or 512 times) Another thing we can do is to register a zswap writeback after the zswap store attempts to clean up excess capacity. Not sure what will happen if zswap writeback is disabled for the cgroup though :) If it's too hard, the average estimate could be a decent compromise, until we figure something smarter. ^ permalink raw reply [flat|nested] 79+ messages in thread
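The "average estimate" compromise Nhat mentions could look roughly like the sketch below. It assumes zswap tracks cumulative compressed/uncompressed byte counts (the names and the check are hypothetical, not current kernel API): estimate the incoming folio's compressed footprint from the pool's historical ratio and check it against the limit once, up front, instead of once per subpage.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SIZE 4096UL

/* Running totals a real implementation would update on every store. */
static unsigned long total_uncompressed; /* bytes handed to the compressor */
static unsigned long total_compressed;   /* bytes actually stored */

/* Estimate the folio's compressed footprint from the historical ratio. */
static unsigned long estimate_compressed(unsigned long nr_pages)
{
    unsigned long bytes = nr_pages * PAGE_SIZE;

    /* No history yet: be conservative and assume no compression at all. */
    if (!total_compressed || !total_uncompressed)
        return bytes;
    return bytes * total_compressed / total_uncompressed;
}

/* One up-front check for the whole folio instead of one per subpage. */
static bool may_store(unsigned long usage, unsigned long limit,
                      unsigned long nr_pages)
{
    return usage + estimate_compressed(nr_pages) <= limit;
}
```

The estimate can of course be wrong for an individual folio, so this bounds the overshoot only statistically — which is the trade-off being debated in this subthread.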
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-24 20:51 ` Nhat Pham @ 2024-09-24 21:38 ` Yosry Ahmed 2024-09-24 23:11 ` Nhat Pham 2024-09-24 23:21 ` Sridhar, Kanchana P 1 sibling, 1 reply; 79+ messages in thread From: Yosry Ahmed @ 2024-09-24 21:38 UTC (permalink / raw) To: Nhat Pham Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal On Tue, Sep 24, 2024 at 1:51 PM Nhat Pham <nphamcs@gmail.com> wrote: > > On Tue, Sep 24, 2024 at 12:39 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar > > > + * The cgroup zswap limit check is done once at the beginning of an > > > + * mTHP store, and not within zswap_store_page() for each page > > > + * in the mTHP. We do however check the zswap pool limits at the > > > + * start of zswap_store_page(). What this means is, the cgroup > > > + * could go over the limits by at most (HPAGE_PMD_NR - 1) pages. > > > + * However, the per-store-page zswap pool limits check should > > > + * hopefully trigger the cgroup aware and zswap LRU aware global > > > + * reclaim implemented in the shrinker. If this assumption holds, > > > + * the cgroup exceeding the zswap limits could potentially be > > > + * resolved before the next zswap_store, and if it is not, the next > > > + * zswap_store would fail the cgroup zswap limit check at the start. > > > + */ > > > > I do not really like this. Allowing going one page above the limit is > > one thing, but one THP above the limit seems too much. I also don't > > Hmm what if you have multiple concurrent zswap stores, from different > tasks but the same cgroup? If none of them has charged, they would all > get greenlit, and charge towards the cgroup... > > So technically the zswap limit checking is already best-effort only. 
> But now, instead of one page per violation, it's 512 pages per > violation :) Yeah good point about concurrent operations, we can go 512 pages above limit * number of concurrent swapouts. That can be a lot of memory. > > Yeah this can be bad. I think this is only safe if you only use > zswap.max as a binary knob (0 or max)... > > > like relying on the repeated limit checking in zswap_store_page(), if > > anything I think that should be batched too. > > > > Is it too unreasonable to maintain the average compression ratio and > > use that to estimate limit checking for both memcg and global limits? > > Johannes, Nhat, any thoughts on this? > > I remember asking about this, but past Nhat might have relented :) > > https://lore.kernel.org/linux-mm/CAKEwX=PfAMZ2qJtwKwJsVx3TZWxV5z2ZaU1Epk1UD=DBdMsjFA@mail.gmail.com/ > > We can do limit checking and charging after compression is done, but > that's a lot of code change (might not even be possible)... It will, > however, allow us to do charging + checking in one go (rather than > doing it 8, 16, or 512 times) > > Another thing we can do is to register a zswap writeback after the > zswap store attempts to clean up excess capacity. Not sure what will > happen if zswap writeback is disabled for the cgroup though :) > > If it's too hard, the average estimate could be a decent compromise, > until we figure something smarter. We can also do what we discussed before about double charging. The pages that are being reclaimed are already charged, so technically we don't need to charge them again. We can uncharge the difference between compressed and uncompressed sizes after compression and call it a day. This fixes the limit checking and the double charging in one go. I am a little bit nervous though about zswap uncharging the pages from under reclaim, there are likely further accesses of the page memcg after zswap. Maybe we can plumb the info back to reclaim or set a flag on the page to avoid uncharging it when it's freed. 
^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-24 21:38 ` Yosry Ahmed @ 2024-09-24 23:11 ` Nhat Pham 2024-09-25 0:05 ` Sridhar, Kanchana P 2024-09-25 0:52 ` Yosry Ahmed 0 siblings, 2 replies; 79+ messages in thread From: Nhat Pham @ 2024-09-24 23:11 UTC (permalink / raw) To: Yosry Ahmed Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal, joshua.hahnjy On Tue, Sep 24, 2024 at 2:38 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > We can also do what we discussed before about double charging. The > pages that are being reclaimed are already charged, so technically we > don't need to charge them again. We can uncharge the difference > between compressed and uncompressed sizes after compression and call > it a day. This fixes the limit checking and the double charging in one > go. > I am a little bit nervous though about zswap uncharging the pages from > under reclaim, there are likely further accesses of the page memcg > after zswap. Maybe we can plumb the info back to reclaim or set a flag > on the page to avoid uncharging it when it's freed. Hmm this is just for memory usage charging, no? The problem here is the zswap usage (zswap.current), and its relation to the limit. One thing we can do is check the zswap usage against the limit for every subpage, but that's likely expensive...? With the new atomic counters Joshua is working on, we can check-and-charge at the same time, after we have compressed the whole large folio, like this:

for (memcg = original_memcg; !mem_cgroup_is_root(memcg);
     memcg = parent_mem_cgroup(memcg)) {
	old_usage = atomic_read(&memcg->zswap);

	do {
		new_usage = old_usage + size;
		if (new_usage > limit) {
			/* undo charging of descendants, then return false */
		}
	} while (!atomic_try_cmpxchg(&memcg->zswap, &old_usage, new_usage));
}

But I don't know what we can do in the current design. 
I gave it some more thought, and even if we only check after we know the size, we can still potentially overshoot the limit :( ^ permalink raw reply [flat|nested] 79+ messages in thread
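A compilable userspace model of the check-and-charge loop sketched above, with C11 stdatomic standing in for the kernel's atomic_t (the memcg hierarchy, fields, and names here are all invented for illustration): walk from the charging group up to the root, and at each level either reserve the bytes with a compare-and-swap or undo the levels already charged and fail.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct memcg {
    struct memcg *parent;        /* NULL at the root */
    _Atomic long zswap_usage;    /* bytes charged at this level */
    long zswap_limit;            /* zswap.max for this level */
};

/* Roll back charges on [from, stop) after a mid-walk failure. */
static void uncharge_up_to(struct memcg *from, struct memcg *stop, long size)
{
    for (struct memcg *m = from; m != stop; m = m->parent)
        atomic_fetch_sub(&m->zswap_usage, size);
}

/* Charge 'size' bytes against every ancestor, all-or-nothing. */
static bool zswap_charge(struct memcg *memcg, long size)
{
    for (struct memcg *m = memcg; m; m = m->parent) {
        long old = atomic_load(&m->zswap_usage);

        do {
            if (old + size > m->zswap_limit) {
                /* undo the descendants we already charged */
                uncharge_up_to(memcg, m, size);
                return false;
            }
            /* on CAS failure, 'old' is refreshed and we recheck */
        } while (!atomic_compare_exchange_weak(&m->zswap_usage,
                                               &old, old + size));
    }
    return true;
}
```

As Nhat notes, this only makes the check race-free per level once the compressed size is known; it does not by itself prevent concurrent stores from collectively landing near the limit at the same time.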
* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-24 23:11 ` Nhat Pham @ 2024-09-25 0:05 ` Sridhar, Kanchana P 2024-09-25 0:52 ` Yosry Ahmed 1 sibling, 0 replies; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-25 0:05 UTC (permalink / raw) To: Nhat Pham, Yosry Ahmed Cc: linux-kernel, linux-mm, hannes, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, joshua.hahnjy, Sridhar, Kanchana P > -----Original Message----- > From: Nhat Pham <nphamcs@gmail.com> > Sent: Tuesday, September 24, 2024 4:11 PM > To: Yosry Ahmed <yosryahmed@google.com> > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux- > kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org; > chengming.zhou@linux.dev; usamaarif642@gmail.com; > shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org; > Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>; > joshua.hahnjy@gmail.com > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in > zswap_store(). > > On Tue, Sep 24, 2024 at 2:38 PM Yosry Ahmed <yosryahmed@google.com> > wrote: > > > > > > We can also do what we discussed before about double charging. The > > pages that are being reclaimed are already charged, so technically we > > don't need to charge them again. We can uncharge the difference > > between compressed and uncompressed sizes after compression and call > > it a day. This fixes the limit checking and the double charging in one > > go. > > I am a little bit nervous though about zswap uncharing the pages from > > under reclaim, there are likely further accesses of the page memcg > > after zswap. Maybe we can plumb the info back to reclaim or set a flag > > on the page to avoid uncharging it when it's freed. > > Hmm this is just for memory usage charging, no? 
The problem here is > the zswap usage (zswap.current), and its relation to the limit. > > One thing we can do is check the zswap usage against the limit for > every subpage, but that's likely expensive...? This is the approach currently implemented in v7. Data gathered doesn’t indicate a performance issue with this specific workload in the two scenarios validated, namely, zswap-4K vs. zswap-mTHP and SSD-mTHP vs. zswap-mTHP (we only see performance gains with explainable sys time increase). Of course, the existing implementation could be a baseline for validating performance of other approaches, e.g., checking zswap usage per mTHP. However, these other approaches would also need to be evaluated for more global multi-instance implications as far as all processes being able to make progress. > > With the new atomic counters Joshua is working on, we can > check-and-charge at the same time, after we have compressed the whole > large folio, like this: > > for (memcg = original_memcg; !mem_cgroup_is_root(memcg); > memcg = parent_mem_cgroup(memcg)); > old_usage = atomic_read(&memcg->zswap); > > do { > new_usage = old_usage + size; > if (new_usage > limit) { > /* undo charging of descendants, then return false */ > } > } while (!atomic_try_cmpxchg(&memcg->zswap, old_usage, new_usage)) > } > > But I don't know what we can do in the current design. I gave it some > more thought, and even if we only check after we know the size, we can > still potentially overshoot the limit :( I agree. Moreover, these checks based on estimated ratio or compressed size could also add overhead in the normal case where we are not near the usage limits. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-24 23:11 ` Nhat Pham 2024-09-25 0:05 ` Sridhar, Kanchana P @ 2024-09-25 0:52 ` Yosry Ahmed 1 sibling, 0 replies; 79+ messages in thread From: Yosry Ahmed @ 2024-09-25 0:52 UTC (permalink / raw) To: Nhat Pham Cc: Kanchana P Sridhar, linux-kernel, linux-mm, hannes, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal, joshua.hahnjy On Tue, Sep 24, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote: > > On Tue, Sep 24, 2024 at 2:38 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > > > We can also do what we discussed before about double charging. The > > pages that are being reclaimed are already charged, so technically we > > don't need to charge them again. We can uncharge the difference > > between compressed and uncompressed sizes after compression and call > > it a day. This fixes the limit checking and the double charging in one > > go. > > I am a little bit nervous though about zswap uncharing the pages from > > under reclaim, there are likely further accesses of the page memcg > > after zswap. Maybe we can plumb the info back to reclaim or set a flag > > on the page to avoid uncharging it when it's freed. > > Hmm this is just for memory usage charging, no? The problem here is > the zswap usage (zswap.current), and its relation to the limit. > > One thing we can do is check the zswap usage against the limit for > every subpage, but that's likely expensive...? Ah yes, I totally missed this. 
> > With the new atomic counters Joshua is working on, we can > check-and-charge at the same time, after we have compressed the whole > large folio, like this: > > for (memcg = original_memcg; !mem_cgroup_is_root(memcg); > memcg = parent_mem_cgroup(memcg)); > old_usage = atomic_read(&memcg->zswap); > > do { > new_usage = old_usage + size; > if (new_usage > limit) { > /* undo charging of descendants, then return false */ > } > } while (!atomic_try_cmpxchg(&memcg->zswap, old_usage, new_usage)) > } > > But I don't know what we can do in the current design. I gave it some > more thought, and even if we only check after we know the size, we can > still potentially overshoot the limit :( Yeah it's difficult because if we check the limit before compressing, we have to estimate the compressed size or check using the uncompressed size. If we wait until after compression we will either overshoot the limit or free the compressed page and fallback to swap. Maybe a good compromise is to do the check before compression with an estimate based on historical compression ratio, and then do the actual charging after the compression and allow overshooting, hopefully it's not too much if our estimate is good. We can also improve this later by adding a backoff mechanism where we make more conservative estimates the more we overshoot the limit. ^ permalink raw reply [flat|nested] 79+ messages in thread
* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-24 20:51 ` Nhat Pham 2024-09-24 21:38 ` Yosry Ahmed @ 2024-09-24 23:21 ` Sridhar, Kanchana P 1 sibling, 0 replies; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-24 23:21 UTC (permalink / raw) To: Nhat Pham, Yosry Ahmed Cc: linux-kernel, linux-mm, hannes, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Nhat Pham <nphamcs@gmail.com> > Sent: Tuesday, September 24, 2024 1:51 PM > To: Yosry Ahmed <yosryahmed@google.com> > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux- > kernel@vger.kernel.org; linux-mm@kvack.org; hannes@cmpxchg.org; > chengming.zhou@linux.dev; usamaarif642@gmail.com; > shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org; > Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in > zswap_store(). > > On Tue, Sep 24, 2024 at 12:39 PM Yosry Ahmed <yosryahmed@google.com> > wrote: > > > > On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar > > > + * The cgroup zswap limit check is done once at the beginning of an > > > + * mTHP store, and not within zswap_store_page() for each page > > > + * in the mTHP. We do however check the zswap pool limits at the > > > + * start of zswap_store_page(). What this means is, the cgroup > > > + * could go over the limits by at most (HPAGE_PMD_NR - 1) pages. > > > + * However, the per-store-page zswap pool limits check should > > > + * hopefully trigger the cgroup aware and zswap LRU aware global > > > + * reclaim implemented in the shrinker. 
If this assumption holds, > > > + * the cgroup exceeding the zswap limits could potentially be > > > + * resolved before the next zswap_store, and if it is not, the next > > > + * zswap_store would fail the cgroup zswap limit check at the start. > > > + */ > > > > I do not really like this. Allowing going one page above the limit is > > one thing, but one THP above the limit seems too much. I also don't > > Hmm what if you have multiple concurrent zswap stores, from different > tasks but the same cgroup? If none of them has charged, they would all > get greenlit, and charge towards the cgroup... > > So technically the zswap limit checking is already best-effort only. > But now, instead of one page per violation, it's 512 pages per > violation :) > > Yeah this can be bad. I think this is only safe if you only use > zswap.max as a binary knob (0 or max)... > > > like relying on the repeated limit checking in zswap_store_page(), if > > anything I think that should be batched too. > > > > Is it too unreasonable to maintain the average compression ratio and > > use that to estimate limit checking for both memcg and global limits? > > Johannes, Nhat, any thoughts on this? > > I remember asking about this, but past Nhat might have relented :) > > https://lore.kernel.org/linux- > mm/CAKEwX=PfAMZ2qJtwKwJsVx3TZWxV5z2ZaU1Epk1UD=DBdMsjFA@mail > .gmail.com/ > > We can do limit checking and charging after compression is done, but > that's a lot of code change (might not even be possible)... It will, > however, allow us to do charging + checking in one go (rather than > doing it 8, 16, or 512 times) > > Another thing we can do is to register a zswap writeback after the > zswap store attempts to clean up excess capacity. Not sure what will > happen if zswap writeback is disabled for the cgroup though :) > > If it's too hard, the average estimate could be a decent compromise, > until we figure something smarter. Thanks Yosry and Nhat for these insights. 
This is how I was viewing this scenario: I thought of incrementally (per subpage store) calling zswap_pool_get() and limit checks followed by shrinker invocations in case of error conditions to allow different concurrent stores to make progress, without favoring only one process's mTHP store based on there being enough zpool space available (for e.g. based on compression ratio estimate). Besides simplicity and no added overhead in the regular cases, I was thinking this approach would have minimal impact on the process(es) that see the zswap limit being exceeded, and that this would be better than preemptively checking for the entire mTHP and failing (this could also complicate things where no one makes progress because multiple processes run the batch checks and fail, when realistically one/many could have triggered the shrinker before erroring out, and at least one/few could have made progress). Another potential solution for this could be based on experimental data for a given setup, on mTHP swapout failures and say, reducing the zswap zpool max-limit and/or acceptance threshold perhaps? Would appreciate your suggestions on how to proceed as far as the limit checks. Thanks, Kanchana ^ permalink raw reply [flat|nested] 79+ messages in thread
* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-24 19:38 ` Yosry Ahmed 2024-09-24 20:51 ` Nhat Pham @ 2024-09-24 23:02 ` Sridhar, Kanchana P 2024-09-25 13:40 ` Johannes Weiner 2 siblings, 0 replies; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-24 23:02 UTC (permalink / raw) To: Yosry Ahmed Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Yosry Ahmed <yosryahmed@google.com> > Sent: Tuesday, September 24, 2024 12:39 PM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com; > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in > zswap_store(). > > On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar > <kanchana.p.sridhar@intel.com> wrote: > > > > zswap_store() will now store mTHP and PMD-size THP folios by compressing > > them page by page. > > > > This patch provides a sequential implementation of storing an mTHP in > > zswap_store() by iterating through each page in the folio to compress > > and store it in the zswap zpool. > > > > Towards this goal, zswap_compress() is modified to take a page instead > > of a folio as input. > > > > Each page's swap offset is stored as a separate zswap entry. > > > > If an error is encountered during the store of any page in the mTHP, > > all previous pages/entries stored will be invalidated. Thus, an mTHP > > is either entirely stored in ZSWAP, or entirely not stored in ZSWAP. 
> > > > This forms the basis for building batching of pages during zswap store > > of large folios by compressing batches of up to say, 8 pages in an > > mTHP in parallel in hardware, with the Intel In-Memory Analytics > > Accelerator (Intel IAA). > > > > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by > default) > > will enable/disable zswap storing of (m)THP. The corresponding tunable > > zswap module parameter is "mthp_enabled". > > > > This change reuses and adapts the functionality in Ryan Roberts' RFC > > patch [1]: > > > > "[RFC,v1] mm: zswap: Store large folios without splitting" > > > > [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1- > ryan.roberts@arm.com/T/#u > > > > Also, addressed some of the RFC comments from the discussion in [1]. > > > > Co-developed-by: Ryan Roberts > > Signed-off-by: > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > --- > > mm/Kconfig | 8 ++++ > > mm/zswap.c | 122 +++++++++++++++++++++++++---------------------------- > > 2 files changed, 66 insertions(+), 64 deletions(-) > > > > diff --git a/mm/Kconfig b/mm/Kconfig > > index 09aebca1cae3..c659fb732ec4 100644 > > --- a/mm/Kconfig > > +++ b/mm/Kconfig > > @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON > > reducing the chance that cold pages will reside in the zswap pool > > and consume memory indefinitely. > > > > +config ZSWAP_STORE_THP_DEFAULT_ON > > + bool "Store mTHP and THP folios in zswap" > > + depends on ZSWAP > > + default n > > + help > > + If selected, zswap will process mTHP and THP folios by > > + compressing and storing each 4K page in the large folio. 
> > + > > choice > > prompt "Default compressor" > > depends on ZSWAP > > diff --git a/mm/zswap.c b/mm/zswap.c > > index 8f2e0ab34c84..16ab770546d6 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled = > IS_ENABLED( > > CONFIG_ZSWAP_SHRINKER_DEFAULT_ON); > > module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, > 0644); > > > > +/* > > + * Enable/disable zswap processing of mTHP folios. > > + * For now, only zswap_store will process mTHP folios. > > + */ > > +static bool zswap_mthp_enabled = IS_ENABLED( > > + CONFIG_ZSWAP_STORE_THP_DEFAULT_ON); > > +module_param_named(mthp_enabled, zswap_mthp_enabled, bool, > 0644); > > + > > bool zswap_is_enabled(void) > > { > > return zswap_enabled; > > @@ -1471,9 +1479,9 @@ static void zswap_delete_stored_offsets(struct > xarray *tree, > > * @objcg: The folio's objcg. > > * @pool: The zswap_pool to store the compressed data for the page. > > */ > > -static bool __maybe_unused zswap_store_page(struct folio *folio, long > index, > > - struct obj_cgroup *objcg, > > - struct zswap_pool *pool) > > +static bool zswap_store_page(struct folio *folio, long index, > > + struct obj_cgroup *objcg, > > + struct zswap_pool *pool) > > As I mentioned earlier, the patch that introduced zswap_store_page() > should have directly used it in zswap_store(). This would make this > patch much clearer. Sure. I will fix this in v8. > > > { > > swp_entry_t swp = folio->swap; > > int type = swp_type(swp); > > @@ -1551,51 +1559,63 @@ static bool __maybe_unused > zswap_store_page(struct folio *folio, long index, > > return false; > > } > > > > +/* > > + * Modified to store mTHP folios. Each page in the mTHP will be > compressed > > + * and stored sequentially. 
> > + */ > > bool zswap_store(struct folio *folio) > > { > > long nr_pages = folio_nr_pages(folio); > > swp_entry_t swp = folio->swap; > > pgoff_t offset = swp_offset(swp); > > struct xarray *tree = swap_zswap_tree(swp); > > - struct zswap_entry *entry; > > struct obj_cgroup *objcg = NULL; > > struct mem_cgroup *memcg = NULL; > > + struct zswap_pool *pool; > > + bool ret = false; > > + long index; > > > > VM_WARN_ON_ONCE(!folio_test_locked(folio)); > > VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); > > > > - /* Large folios aren't supported */ > > - if (folio_test_large(folio)) > > + /* Storing large folios isn't enabled */ > > The comment is now stating the obvious, please remove it. Ok. I suppose this check will also no longer be needed based on the config knob not being needed. > > > + if (!zswap_mthp_enabled && folio_test_large(folio)) > > return false; > > > > if (!zswap_enabled) > > - goto check_old; > > + goto reject; > > > > - /* Check cgroup limits */ > > + /* > > + * Check cgroup limits: > > + * > > + * The cgroup zswap limit check is done once at the beginning of an > > + * mTHP store, and not within zswap_store_page() for each page > > + * in the mTHP. We do however check the zswap pool limits at the > > + * start of zswap_store_page(). What this means is, the cgroup > > + * could go over the limits by at most (HPAGE_PMD_NR - 1) pages. > > + * However, the per-store-page zswap pool limits check should > > + * hopefully trigger the cgroup aware and zswap LRU aware global > > + * reclaim implemented in the shrinker. If this assumption holds, > > + * the cgroup exceeding the zswap limits could potentially be > > + * resolved before the next zswap_store, and if it is not, the next > > + * zswap_store would fail the cgroup zswap limit check at the start. > > + */ > > I do not really like this. Allowing going one page above the limit is > one thing, but one THP above the limit seems too much. 
I also don't > like relying on the repeated limit checking in zswap_store_page(), if > anything I think that should be batched too. > > Is it too unreasonable to maintain the average compression ratio and > use that to estimate limit checking for both memcg and global limits? > Johannes, Nhat, any thoughts on this? I see that Nhat has responded. Hopefully we can discuss this in the follow-up to Nhat's comments. Thanks, Kanchana > > > objcg = get_obj_cgroup_from_folio(folio); > > if (objcg && !obj_cgroup_may_zswap(objcg)) { > > memcg = get_mem_cgroup_from_objcg(objcg); > > if (shrink_memcg(memcg)) { > > mem_cgroup_put(memcg); > > - goto reject; > > + goto put_objcg; > > } > > mem_cgroup_put(memcg); > > } > > > > if (zswap_check_limits()) > > - goto reject; > > - > > - /* allocate entry */ > > - entry = zswap_entry_cache_alloc(GFP_KERNEL, folio_nid(folio)); > > - if (!entry) { > > - zswap_reject_kmemcache_fail++; > > - goto reject; > > - } > > + goto put_objcg; > > > > - /* if entry is successfully added, it keeps the reference */ > > - entry->pool = zswap_pool_current_get(); > > - if (!entry->pool) > > - goto freepage; > > + pool = zswap_pool_current_get(); > > + if (!pool) > > + goto put_objcg; > > > > if (objcg) { > > memcg = get_mem_cgroup_from_objcg(objcg); > > @@ -1606,60 +1626,34 @@ bool zswap_store(struct folio *folio) > > mem_cgroup_put(memcg); > > } > > > > - if (!zswap_compress(&folio->page, entry)) > > - goto put_pool; > > - > > - entry->swpentry = swp; > > - entry->objcg = objcg; > > - entry->referenced = true; > > - > > - if (!zswap_store_entry(tree, entry)) > > - goto store_failed; > > - > > - if (objcg) { > > - obj_cgroup_charge_zswap(objcg, entry->length); > > - count_objcg_event(objcg, ZSWPOUT); > > - } > > - > > /* > > - * We finish initializing the entry while it's already in xarray. > > - * This is safe because: > > - * > > - * 1. Concurrent stores and invalidations are excluded by folio lock. > > - * > > - * 2. 
Writeback is excluded by the entry not being on the LRU yet. > > - * The publishing order matters to prevent writeback from seeing > > - * an incoherent entry. > > + * Store each page of the folio as a separate entry. If we fail to store > > + * a page, unwind by removing all the previous pages we stored. > > */ > > - if (entry->length) { > > - INIT_LIST_HEAD(&entry->lru); > > - zswap_lru_add(&zswap_list_lru, entry); > > + for (index = 0; index < nr_pages; ++index) { > > + if (!zswap_store_page(folio, index, objcg, pool)) > > + goto put_pool; > > } > > > > - /* update stats */ > > - atomic_inc(&zswap_stored_pages); > > - count_vm_event(ZSWPOUT); > > - > > - return true; > > + ret = true; > > > > -store_failed: > > - zpool_free(entry->pool->zpool, entry->handle); > > put_pool: > > - zswap_pool_put(entry->pool); > > -freepage: > > - zswap_entry_cache_free(entry); > > -reject: > > + zswap_pool_put(pool); > > +put_objcg: > > obj_cgroup_put(objcg); > > if (zswap_pool_reached_full) > > queue_work(shrink_wq, &zswap_shrink_work); > > -check_old: > > +reject: > > /* > > - * If the zswap store fails or zswap is disabled, we must invalidate the > > - * possibly stale entry which was previously stored at this offset. > > - * Otherwise, writeback could overwrite the new data in the swapfile. > > + * If the zswap store fails or zswap is disabled, we must invalidate > > + * the possibly stale entries which were previously stored at the > > + * offsets corresponding to each page of the folio. Otherwise, > > + * writeback could overwrite the new data in the swapfile. > > */ > > - zswap_delete_stored_offsets(tree, offset, nr_pages); > > - return false; > > + if (!ret) > > + zswap_delete_stored_offsets(tree, offset, nr_pages); > > + > > + return ret; > > } > > > > bool zswap_load(struct folio *folio) > > -- > > 2.27.0 > > ^ permalink raw reply [flat|nested] 79+ messages in thread
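The all-or-nothing store flow in the patch quoted above — store each subpage, and on any failure invalidate every offset of the folio — can be sketched in userspace as follows. `stored[]`, `store_page()`, and the `fail_at` injection knob are toy stand-ins for the xarray tree and `zswap_store_page()`, not the kernel API:

```c
#include <stdbool.h>

#define MAX_OFFS 1024

/* An offset is "in zswap" when stored[off] is true. */
static bool stored[MAX_OFFS];

/* Per-subpage store; fail_at lets a test inject a mid-folio failure. */
static bool store_page(long off, long fail_at)
{
	if (off == fail_at)
		return false;
	stored[off] = true;
	return true;
}

/*
 * Mirrors the zswap_store() flow from the patch: store each subpage
 * sequentially, and on failure delete all the folio's offsets, so the
 * folio is either entirely in zswap or entirely not.
 */
static bool store_folio(long offset, long nr_pages, long fail_at)
{
	bool ret = false;
	long i;

	for (i = 0; i < nr_pages; i++)
		if (!store_page(offset + i, fail_at))
			goto out;
	ret = true;
out:
	if (!ret)	/* unwind, like zswap_delete_stored_offsets() */
		for (i = 0; i < nr_pages; i++)
			stored[offset + i] = false;
	return ret;
}
```

Note that in the real code the unwind also covers the zswap-disabled path, since stale entries from an earlier store at these offsets must not survive a failed or bypassed store.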
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-24 19:38 ` Yosry Ahmed 2024-09-24 20:51 ` Nhat Pham 2024-09-24 23:02 ` Sridhar, Kanchana P @ 2024-09-25 13:40 ` Johannes Weiner 2024-09-25 18:30 ` Yosry Ahmed 2 siblings, 1 reply; 79+ messages in thread From: Johannes Weiner @ 2024-09-25 13:40 UTC (permalink / raw) To: Yosry Ahmed Cc: Kanchana P Sridhar, linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal On Tue, Sep 24, 2024 at 12:38:32PM -0700, Yosry Ahmed wrote: > On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar > <kanchana.p.sridhar@intel.com> wrote: > > > > zswap_store() will now store mTHP and PMD-size THP folios by compressing > > them page by page. > > > > This patch provides a sequential implementation of storing an mTHP in > > zswap_store() by iterating through each page in the folio to compress > > and store it in the zswap zpool. > > > > Towards this goal, zswap_compress() is modified to take a page instead > > of a folio as input. > > > > Each page's swap offset is stored as a separate zswap entry. > > > > If an error is encountered during the store of any page in the mTHP, > > all previous pages/entries stored will be invalidated. Thus, an mTHP > > is either entirely stored in ZSWAP, or entirely not stored in ZSWAP. > > > > This forms the basis for building batching of pages during zswap store > > of large folios by compressing batches of up to say, 8 pages in an > > mTHP in parallel in hardware, with the Intel In-Memory Analytics > > Accelerator (Intel IAA). > > > > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) > > will enable/disable zswap storing of (m)THP. The corresponding tunable > > zswap module parameter is "mthp_enabled". 
> > > > This change reuses and adapts the functionality in Ryan Roberts' RFC > > patch [1]: > > > > "[RFC,v1] mm: zswap: Store large folios without splitting" > > > > [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u > > > > Also, addressed some of the RFC comments from the discussion in [1]. > > > > Co-developed-by: Ryan Roberts > > Signed-off-by: > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > --- > > mm/Kconfig | 8 ++++ > > mm/zswap.c | 122 +++++++++++++++++++++++++---------------------------- > > 2 files changed, 66 insertions(+), 64 deletions(-) > > > > diff --git a/mm/Kconfig b/mm/Kconfig > > index 09aebca1cae3..c659fb732ec4 100644 > > --- a/mm/Kconfig > > +++ b/mm/Kconfig > > @@ -59,6 +59,14 @@ config ZSWAP_SHRINKER_DEFAULT_ON > > reducing the chance that cold pages will reside in the zswap pool > > and consume memory indefinitely. > > > > +config ZSWAP_STORE_THP_DEFAULT_ON > > + bool "Store mTHP and THP folios in zswap" > > + depends on ZSWAP > > + default n > > + help > > + If selected, zswap will process mTHP and THP folios by > > + compressing and storing each 4K page in the large folio. > > + > > choice > > prompt "Default compressor" > > depends on ZSWAP > > diff --git a/mm/zswap.c b/mm/zswap.c > > index 8f2e0ab34c84..16ab770546d6 100644 > > --- a/mm/zswap.c > > +++ b/mm/zswap.c > > @@ -127,6 +127,14 @@ static bool zswap_shrinker_enabled = IS_ENABLED( > > CONFIG_ZSWAP_SHRINKER_DEFAULT_ON); > > module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644); > > > > +/* > > + * Enable/disable zswap processing of mTHP folios. > > + * For now, only zswap_store will process mTHP folios. 
> > + */ > > +static bool zswap_mthp_enabled = IS_ENABLED( > > + CONFIG_ZSWAP_STORE_THP_DEFAULT_ON); > > +module_param_named(mthp_enabled, zswap_mthp_enabled, bool, 0644); > > + > > bool zswap_is_enabled(void) > > { > > return zswap_enabled; > > @@ -1471,9 +1479,9 @@ static void zswap_delete_stored_offsets(struct xarray *tree, > > * @objcg: The folio's objcg. > > * @pool: The zswap_pool to store the compressed data for the page. > > */ > > -static bool __maybe_unused zswap_store_page(struct folio *folio, long index, > > - struct obj_cgroup *objcg, > > - struct zswap_pool *pool) > > +static bool zswap_store_page(struct folio *folio, long index, > > + struct obj_cgroup *objcg, > > + struct zswap_pool *pool) > > As I mentioned earlier, the patch that introduced zswap_store_page() > should have directly used it in zswap_store(). This would make this > patch much clearer. > > > { > > swp_entry_t swp = folio->swap; > > int type = swp_type(swp); > > @@ -1551,51 +1559,63 @@ static bool __maybe_unused zswap_store_page(struct folio *folio, long index, > > return false; > > } > > > > +/* > > + * Modified to store mTHP folios. Each page in the mTHP will be compressed > > + * and stored sequentially. > > + */ > > bool zswap_store(struct folio *folio) > > { > > long nr_pages = folio_nr_pages(folio); > > swp_entry_t swp = folio->swap; > > pgoff_t offset = swp_offset(swp); > > struct xarray *tree = swap_zswap_tree(swp); > > - struct zswap_entry *entry; > > struct obj_cgroup *objcg = NULL; > > struct mem_cgroup *memcg = NULL; > > + struct zswap_pool *pool; > > + bool ret = false; > > + long index; > > > > VM_WARN_ON_ONCE(!folio_test_locked(folio)); > > VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); > > > > - /* Large folios aren't supported */ > > - if (folio_test_large(folio)) > > + /* Storing large folios isn't enabled */ > > The comment is now stating the obvious, please remove it. 
> > > + if (!zswap_mthp_enabled && folio_test_large(folio)) > > return false; > > > > if (!zswap_enabled) > > - goto check_old; > > + goto reject; > > > > - /* Check cgroup limits */ > > + /* > > + * Check cgroup limits: > > + * > > + * The cgroup zswap limit check is done once at the beginning of an > > + * mTHP store, and not within zswap_store_page() for each page > > + * in the mTHP. We do however check the zswap pool limits at the > > + * start of zswap_store_page(). What this means is, the cgroup > > + * could go over the limits by at most (HPAGE_PMD_NR - 1) pages. > > + * However, the per-store-page zswap pool limits check should > > + * hopefully trigger the cgroup aware and zswap LRU aware global > > + * reclaim implemented in the shrinker. If this assumption holds, > > + * the cgroup exceeding the zswap limits could potentially be > > + * resolved before the next zswap_store, and if it is not, the next > > + * zswap_store would fail the cgroup zswap limit check at the start. > > + */ > > I do not really like this. Allowing going one page above the limit is > one thing, but one THP above the limit seems too much. I also don't > like relying on the repeated limit checking in zswap_store_page(), if > anything I think that should be batched too. > > Is it too unreasonable to maintain the average compression ratio and > use that to estimate limit checking for both memcg and global limits? > Johannes, Nhat, any thoughts on this? I honestly don't think it's much of an issue. The global limit is huge, and the cgroup limit is to the best of my knowledge only used as a binary switch. Setting a non-binary limit - global or cgroup - seems like a bit of an obscure usecase to me, because in the vast majority of cases it's preferable to keep compresing over declaring OOM. And even if you do have some granular limit, the workload size scales with it. It's not like you have a thousand THPs in a 10M cgroup. 
If this ever becomes an issue, we can handle it in a fastpath-slowpath scheme: check the limit up front for fast-path failure if we're already maxed out, just like now; then make obj_cgroup_charge_zswap() atomically charge against zswap.max and unwind the store if we raced. For now, I would just keep the simple version we currently have: check once in zswap_store() and then just go ahead for the whole folio. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-25 13:40 ` Johannes Weiner @ 2024-09-25 18:30 ` Yosry Ahmed 2024-09-25 19:10 ` Sridhar, Kanchana P 2024-09-25 19:20 ` Johannes Weiner 0 siblings, 2 replies; 79+ messages in thread From: Yosry Ahmed @ 2024-09-25 18:30 UTC (permalink / raw) To: Johannes Weiner Cc: Kanchana P Sridhar, linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal [..] > > > + /* > > > + * Check cgroup limits: > > > + * > > > + * The cgroup zswap limit check is done once at the beginning of an > > > + * mTHP store, and not within zswap_store_page() for each page > > > + * in the mTHP. We do however check the zswap pool limits at the > > > + * start of zswap_store_page(). What this means is, the cgroup > > > + * could go over the limits by at most (HPAGE_PMD_NR - 1) pages. > > > + * However, the per-store-page zswap pool limits check should > > > + * hopefully trigger the cgroup aware and zswap LRU aware global > > > + * reclaim implemented in the shrinker. If this assumption holds, > > > + * the cgroup exceeding the zswap limits could potentially be > > > + * resolved before the next zswap_store, and if it is not, the next > > > + * zswap_store would fail the cgroup zswap limit check at the start. > > > + */ > > > > I do not really like this. Allowing going one page above the limit is > > one thing, but one THP above the limit seems too much. I also don't > > like relying on the repeated limit checking in zswap_store_page(), if > > anything I think that should be batched too. > > > > Is it too unreasonable to maintain the average compression ratio and > > use that to estimate limit checking for both memcg and global limits? > > Johannes, Nhat, any thoughts on this? > > I honestly don't think it's much of an issue. 
The global limit is > huge, and the cgroup limit is to the best of my knowledge only used as > a binary switch. Setting a non-binary limit - global or cgroup - seems > like a bit of an obscure usecase to me, because in the vast majority > of cases it's preferable to keep compresing over declaring OOM. > > And even if you do have some granular limit, the workload size scales > with it. It's not like you have a thousand THPs in a 10M cgroup. The memcg limit and zswap limit can be disproportionate, although that shouldn't be common. > > If this ever becomes an issue, we can handle it in a fastpath-slowpath > scheme: check the limit up front for fast-path failure if we're > already maxed out, just like now; then make obj_cgroup_charge_zswap() > atomically charge against zswap.max and unwind the store if we raced. > > For now, I would just keep the simple version we currently have: check > once in zswap_store() and then just go ahead for the whole folio. I am not totally against this but I feel like this is too optimistic. I think we can keep it simple-ish by maintaining an ewma for the compression ratio, we already have primitives for this (see DECLARE_EWMA). Then in zswap_store(), we can use the ewma to estimate the compressed size and use it to do the memcg and global limit checks once, like we do today. Instead of just checking if we are below the limits, we check if we have enough headroom for the estimated compressed size. Then we call zswap_store_page() to do the per-page stuff, then do batched charging and stats updates. If you think that's an overkill we can keep doing the limit checks as we do today, but I would still like to see batching of all the limit checks, charging, and stats updates. It makes little sense otherwise. ^ permalink raw reply [flat|nested] 79+ messages in thread
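As a rough illustration of the EWMA-estimate idea proposed above, here is a userspace sketch loosely modeled on the fixed-point averaging of the kernel's DECLARE_EWMA() (include/linux/average.h). The constants, names, and the single up-front headroom check are assumptions for illustration, not the proposed implementation:

```c
#include <stdbool.h>

#define EWMA_SCALE	1024	/* fixed-point precision: ratio in 1/1024ths */
#define EWMA_WEIGHT	8	/* 1/8 of each new sample mixes into the avg */

/* Running average of compressed/uncompressed ratio; assume ~3:1 to start. */
static unsigned long ratio_avg = EWMA_SCALE / 3;

/* Fold one page's actual compression result into the running ratio. */
static void ratio_add(unsigned long comp_len, unsigned long page_size)
{
	unsigned long sample = comp_len * EWMA_SCALE / page_size;

	ratio_avg = ratio_avg - ratio_avg / EWMA_WEIGHT
		    + sample / EWMA_WEIGHT;
}

/*
 * One up-front headroom check for the whole folio: estimate the
 * compressed size from the running ratio instead of re-checking the
 * limit per subpage. Overshoot is bounded by the estimate's error.
 */
static bool have_headroom(unsigned long usage, unsigned long limit,
			  unsigned long nr_pages, unsigned long page_size)
{
	unsigned long est = nr_pages * page_size * ratio_avg / EWMA_SCALE;

	return usage + est <= limit;
}
```

The actual charging would still use the real compressed sizes after the fact; the EWMA only gates admission, which is why a backoff on repeated overshoot (as suggested above) composes naturally with it.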
* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-25 18:30 ` Yosry Ahmed @ 2024-09-25 19:10 ` Sridhar, Kanchana P 2024-09-25 19:49 ` Yosry Ahmed 2024-09-25 19:20 ` Johannes Weiner 1 sibling, 1 reply; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-25 19:10 UTC (permalink / raw) To: Yosry Ahmed, Johannes Weiner Cc: linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Yosry Ahmed <yosryahmed@google.com> > Sent: Wednesday, September 25, 2024 11:31 AM > To: Johannes Weiner <hannes@cmpxchg.org> > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux- > kernel@vger.kernel.org; linux-mm@kvack.org; nphamcs@gmail.com; > chengming.zhou@linux.dev; usamaarif642@gmail.com; > shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org; > Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in > zswap_store(). > > [..] > > > > + /* > > > > + * Check cgroup limits: > > > > + * > > > > + * The cgroup zswap limit check is done once at the beginning of an > > > > + * mTHP store, and not within zswap_store_page() for each page > > > > + * in the mTHP. We do however check the zswap pool limits at the > > > > + * start of zswap_store_page(). What this means is, the cgroup > > > > + * could go over the limits by at most (HPAGE_PMD_NR - 1) pages. > > > > + * However, the per-store-page zswap pool limits check should > > > > + * hopefully trigger the cgroup aware and zswap LRU aware global > > > > + * reclaim implemented in the shrinker. 
If this assumption holds, > > > > + * the cgroup exceeding the zswap limits could potentially be > > > > + * resolved before the next zswap_store, and if it is not, the next > > > > + * zswap_store would fail the cgroup zswap limit check at the start. > > > > + */ > > > > > > I do not really like this. Allowing going one page above the limit is > > > one thing, but one THP above the limit seems too much. I also don't > > > like relying on the repeated limit checking in zswap_store_page(), if > > > anything I think that should be batched too. > > > > > > Is it too unreasonable to maintain the average compression ratio and > > > use that to estimate limit checking for both memcg and global limits? > > > Johannes, Nhat, any thoughts on this? > > > > I honestly don't think it's much of an issue. The global limit is > > huge, and the cgroup limit is to the best of my knowledge only used as > > a binary switch. Setting a non-binary limit - global or cgroup - seems > > like a bit of an obscure usecase to me, because in the vast majority > > of cases it's preferable to keep compresing over declaring OOM. > > > > And even if you do have some granular limit, the workload size scales > > with it. It's not like you have a thousand THPs in a 10M cgroup. > > The memcg limit and zswap limit can be disproportionate, although that > shouldn't be common. > > > > > If this ever becomes an issue, we can handle it in a fastpath-slowpath > > scheme: check the limit up front for fast-path failure if we're > > already maxed out, just like now; then make obj_cgroup_charge_zswap() > > atomically charge against zswap.max and unwind the store if we raced. > > > > For now, I would just keep the simple version we currently have: check > > once in zswap_store() and then just go ahead for the whole folio. > > I am not totally against this but I feel like this is too optimistic. 
> I think we can keep it simple-ish by maintaining an ewma for the > compression ratio, we already have primitives for this (see > DECLARE_EWMA). > > Then in zswap_store(), we can use the ewma to estimate the compressed > size and use it to do the memcg and global limit checks once, like we > do today. Instead of just checking if we are below the limits, we > check if we have enough headroom for the estimated compressed size. > Then we call zswap_store_page() to do the per-page stuff, then do > batched charging and stats updates. > > If you think that's an overkill we can keep doing the limit checks as > we do today, > but I would still like to see batching of all the limit checks, > charging, and stats updates. It makes little sense otherwise. Thanks Johannes and Yosry for these suggestions and pointers. I think there is general agreement about the batch charging and zswap_stored_pages/stats updates. Yosry, does "batching of limit checks" imply the same as a simple check for being over the cgroup limit at the start of zswap_store and not doing this check in zswap_store_page? Does this also imply a zswap_pool_get_many()? Would appreciate it if you can help clarify. The main question in my mind about using the EWMA checks is, will it add overhead to the normal zswap reclaim path; and if so, would a simple limit check at the start of zswap_store as suggested by Johannes suffice. I can run a few experiments to quantify this overhead, and maybe we can revisit this? Thanks, Kanchana ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-25 19:10 ` Sridhar, Kanchana P @ 2024-09-25 19:49 ` Yosry Ahmed 2024-09-25 20:49 ` Johannes Weiner 0 siblings, 1 reply; 79+ messages in thread From: Yosry Ahmed @ 2024-09-25 19:49 UTC (permalink / raw) To: Sridhar, Kanchana P Cc: Johannes Weiner, linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh [..] > > > > > + /* > > > > > + * Check cgroup limits: > > > > > + * > > > > > + * The cgroup zswap limit check is done once at the beginning of an > > > > > + * mTHP store, and not within zswap_store_page() for each page > > > > > + * in the mTHP. We do however check the zswap pool limits at the > > > > > + * start of zswap_store_page(). What this means is, the cgroup > > > > > + * could go over the limits by at most (HPAGE_PMD_NR - 1) pages. > > > > > + * However, the per-store-page zswap pool limits check should > > > > > + * hopefully trigger the cgroup aware and zswap LRU aware global > > > > > + * reclaim implemented in the shrinker. If this assumption holds, > > > > > + * the cgroup exceeding the zswap limits could potentially be > > > > > + * resolved before the next zswap_store, and if it is not, the next > > > > > + * zswap_store would fail the cgroup zswap limit check at the start. > > > > > + */ > > > > > > > > I do not really like this. Allowing going one page above the limit is > > > > one thing, but one THP above the limit seems too much. I also don't > > > > like relying on the repeated limit checking in zswap_store_page(), if > > > > anything I think that should be batched too. > > > > > > > > Is it too unreasonable to maintain the average compression ratio and > > > > use that to estimate limit checking for both memcg and global limits? > > > > Johannes, Nhat, any thoughts on this? > > > > > > I honestly don't think it's much of an issue. 
The global limit is > > > huge, and the cgroup limit is to the best of my knowledge only used as > > > a binary switch. Setting a non-binary limit - global or cgroup - seems > > > like a bit of an obscure usecase to me, because in the vast majority > > > of cases it's preferable to keep compresing over declaring OOM. > > > > > > And even if you do have some granular limit, the workload size scales > > > with it. It's not like you have a thousand THPs in a 10M cgroup. > > > > The memcg limit and zswap limit can be disproportionate, although that > > shouldn't be common. > > > > > > > > If this ever becomes an issue, we can handle it in a fastpath-slowpath > > > scheme: check the limit up front for fast-path failure if we're > > > already maxed out, just like now; then make obj_cgroup_charge_zswap() > > > atomically charge against zswap.max and unwind the store if we raced. > > > > > > For now, I would just keep the simple version we currently have: check > > > once in zswap_store() and then just go ahead for the whole folio. > > > > I am not totally against this but I feel like this is too optimistic. > > I think we can keep it simple-ish by maintaining an ewma for the > > compression ratio, we already have primitives for this (see > > DECLARE_EWMA). > > > > Then in zswap_store(), we can use the ewma to estimate the compressed > > size and use it to do the memcg and global limit checks once, like we > > do today. Instead of just checking if we are below the limits, we > > check if we have enough headroom for the estimated compressed size. > > Then we call zswap_store_page() to do the per-page stuff, then do > > batched charging and stats updates. > > > > If you think that's an overkill we can keep doing the limit checks as > > we do today, > > but I would still like to see batching of all the limit checks, > > charging, and stats updates. It makes little sense otherwise. > > Thanks Johannes and Yosry for these suggestions and pointers. 
> I think there is general agreement about the batch charging and > zswap_stored_pages/stats updates. Yosry, does "batching of limit > checks" imply the same as a simple check for being over the cgroup > limit at the start of zswap_store and not doing this check in > zswap_store_page? Does this also imply a zswap_pool_get_many()? > Would appreciate it if you can help clarify. Yes I think we should batch as much as possible in zswap_store(), and only do the things that are truly per-page in zswap_store_page(). The limit checks, stats updates, zswap_pool refs, charging etc. Batching all of these things should be clear wins. > > The main question in my mind about using the EWMA checks is, > will it add overhead to the normal zswap reclaim path; and if so, > would a simple limit check at the start of zswap_store as suggested > by Johannes suffice. I can run a few experiments to quantify this > overhead, and maybe we can revisit this? If you look at ewma_##name##_add() in include/linux/average.h, it's really just a bunch of bit shifts, so I am not concerned about runtime overhead. My discussion with Johannes is more about if the complexity is justified, I'd wait for that discussion to settle. Either way, we should check the limits once in zswap_store(). ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-25 19:49 ` Yosry Ahmed @ 2024-09-25 20:49 ` Johannes Weiner 0 siblings, 0 replies; 79+ messages in thread From: Johannes Weiner @ 2024-09-25 20:49 UTC (permalink / raw) To: Yosry Ahmed Cc: Sridhar, Kanchana P, linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh On Wed, Sep 25, 2024 at 12:49:13PM -0700, Yosry Ahmed wrote: > Kanchana wrote: > > The main question in my mind about using the EWMA checks is, > > will it add overhead to the normal zswap reclaim path; and if so, > > would a simple limit check at the start of zswap_store as suggested > > by Johannes suffice. I can run a few experiments to quantify this > > overhead, and maybe we can revisit this? > > If you look at ewma_##name##_add() in include/linux/average.h, it's > really just a bunch of bit shifts, so I am not concerned about runtime > overhead. My discussion with Johannes is more about if the complexity > is justified, I'd wait for that discussion to settle. Sorry to be blunt, but "precision" in a non-atomic check like this makes no sense. The fact that it's not too expensive is irrelevant. This discussion around this honestly has gone off the rails. Just leave the limit checks exactly as they are. Check limits and cgroup_may_zswap() once up front. Compress the subpages. Acquire references and bump all stats in batches of folio_nr_pages(). You can add up the subpage compressed bytes in the for-loop and do the obj_cgroup_charge_zswap() in a single call at the end as well. That's my suggestion. If that's no good, please ELI5. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-25 18:30 ` Yosry Ahmed 2024-09-25 19:10 ` Sridhar, Kanchana P @ 2024-09-25 19:20 ` Johannes Weiner 2024-09-25 19:39 ` Yosry Ahmed 1 sibling, 1 reply; 79+ messages in thread From: Johannes Weiner @ 2024-09-25 19:20 UTC (permalink / raw) To: Yosry Ahmed Cc: Kanchana P Sridhar, linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal On Wed, Sep 25, 2024 at 11:30:34AM -0700, Yosry Ahmed wrote: > Johannes wrote: > > If this ever becomes an issue, we can handle it in a fastpath-slowpath > > scheme: check the limit up front for fast-path failure if we're > > already maxed out, just like now; then make obj_cgroup_charge_zswap() > > atomically charge against zswap.max and unwind the store if we raced. > > > > For now, I would just keep the simple version we currently have: check > > once in zswap_store() and then just go ahead for the whole folio. > > I am not totally against this but I feel like this is too optimistic. > I think we can keep it simple-ish by maintaining an ewma for the > compression ratio, we already have primitives for this (see > DECLARE_EWMA). > > Then in zswap_store(), we can use the ewma to estimate the compressed > size and use it to do the memcg and global limit checks once, like we > do today. Instead of just checking if we are below the limits, we > check if we have enough headroom for the estimated compressed size. > Then we call zswap_store_page() to do the per-page stuff, then do > batched charging and stats updates. I'm not sure what you gain from making a non-atomic check precise. You can get a hundred threads determining down precisely that *their* store will fit exactly into the last 800kB before the limit. > If you think that's an overkill we can keep doing the limit checks as > we do today, I just don't see how it would make a practical difference. 
What would make a difference is atomic transactional charging of the compressed size, and unwinding on failure - with the upfront check to avoid pointlessly compressing (outside of race conditions). And I'm not against doing that in general, I am just against doing it per default. It's a lot of complexity, and like I said, the practical usecase for limiting zswap memory to begin with is quite unclear to me. Zswap is not a limited resource. It's just memory. And you already had the memory for the uncompressed copy. So it's a bit strange to me to say "you have compressed your memory enough, so now you get sent to disk (or we declare OOM)". What would be a reason to limit it? It sort of makes sense as a binary switch, but I don't get the usecase for a granular limit. (And I blame my own cowardice for making the cgroup knob a limit, to keep options open, instead of a switch.) All that to say, this would be better in a follow-up patch. We allow overshooting now, it's not clear how overshooting by a larger amount makes a categorical difference. > but I would still like to see batching of all the limit checks, > charging, and stats updates. It makes little sense otherwise. Definitely. One check, one charge, one stat update per folio. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-25 19:20 ` Johannes Weiner @ 2024-09-25 19:39 ` Yosry Ahmed 2024-09-25 20:13 ` Johannes Weiner 0 siblings, 1 reply; 79+ messages in thread From: Yosry Ahmed @ 2024-09-25 19:39 UTC (permalink / raw) To: Johannes Weiner Cc: Kanchana P Sridhar, linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal On Wed, Sep 25, 2024 at 12:20 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Wed, Sep 25, 2024 at 11:30:34AM -0700, Yosry Ahmed wrote: > > Johannes wrote: > > > If this ever becomes an issue, we can handle it in a fastpath-slowpath > > > scheme: check the limit up front for fast-path failure if we're > > > already maxed out, just like now; then make obj_cgroup_charge_zswap() > > > atomically charge against zswap.max and unwind the store if we raced. > > > > > > For now, I would just keep the simple version we currently have: check > > > once in zswap_store() and then just go ahead for the whole folio. > > > > I am not totally against this but I feel like this is too optimistic. > > I think we can keep it simple-ish by maintaining an ewma for the > > compression ratio, we already have primitives for this (see > > DECLARE_EWMA). > > > > Then in zswap_store(), we can use the ewma to estimate the compressed > > size and use it to do the memcg and global limit checks once, like we > > do today. Instead of just checking if we are below the limits, we > > check if we have enough headroom for the estimated compressed size. > > Then we call zswap_store_page() to do the per-page stuff, then do > > batched charging and stats updates. > > I'm not sure what you gain from making a non-atomic check precise. You > can get a hundred threads determining down precisely that *their* > store will fit exactly into the last 800kB before the limit. 
We just get to avoid overshooting in cases where we know we probably can't fit it anyway. If we have 4KB left and we are trying to compress a 2MB THP, for example. It just makes the upfront check to avoid pointless compression a little bit more meaningful. > > > If you think that's an overkill we can keep doing the limit checks as > > we do today, > > I just don't see how it would make a practical difference. > > What would make a difference is atomic transactional charging of the > compressed size, and unwinding on failure - with the upfront check to > avoid pointlessly compressing (outside of race conditions). > > And I'm not against doing that in general, I am just against doing it > per default. > > It's a lot of complexity, and like I said, the practical usecase for > limiting zswap memory to begin with is quite unclear to me. Zswap is > not a limited resource. It's just memory. And you already had the > memory for the uncompressed copy. So it's a bit strange to me to say > "you have compressed your memory enough, so now you get sent to disk > (or we declare OOM)". What would be a reason to limit it? Technically speaking if we have a global zswap limit, it becomes a limited resource and distributing it across workloads can make sense. That being said, I am not aware of any existing use cases for that. The other use case is controlling when writeback kicks in for different workloads. It may not make sense for limit-based reclaim, because as you mentioned the memory is limited anyway and workloads should be free to compress their own memory within their limit as they please. But it may make sense for proactive reclaim, controlling how much memory we compress vs how much memory we completely evict to disk. Again, not aware of any existing use cases for this as well. > > It sort of makes sense as a binary switch, but I don't get the usecase > for a granular limit. 
(And I blame my own cowardice for making the > cgroup knob a limit, to keep options open, instead of a switch.) > > All that to say, this would be better in a follow-up patch. We allow > overshooting now, it's not clear how overshooting by a larger amount > makes a categorical difference. I am not against making this a follow-up, if/when the need arises. My whole point was that using EWMA (or similar) we can make the upfront check a little bit more meaningful than "We have 1 byte of headroom, let's go compress a 2MB THP!". I think it's not a lot of complexity to check for headroom based on an estimated compression size, but I didn't try to code it, so maybe I am wrong :) > > > but I would still like to see batching of all the limit checks, > > charging, and stats updates. It makes little sense otherwise. > > Definitely. One check, one charge, one stat update per folio. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-25 19:39 ` Yosry Ahmed @ 2024-09-25 20:13 ` Johannes Weiner 2024-09-25 21:06 ` Yosry Ahmed 0 siblings, 1 reply; 79+ messages in thread From: Johannes Weiner @ 2024-09-25 20:13 UTC (permalink / raw) To: Yosry Ahmed Cc: Kanchana P Sridhar, linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal On Wed, Sep 25, 2024 at 12:39:02PM -0700, Yosry Ahmed wrote: > On Wed, Sep 25, 2024 at 12:20 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > On Wed, Sep 25, 2024 at 11:30:34AM -0700, Yosry Ahmed wrote: > > > Johannes wrote: > > > > If this ever becomes an issue, we can handle it in a fastpath-slowpath > > > > scheme: check the limit up front for fast-path failure if we're > > > > already maxed out, just like now; then make obj_cgroup_charge_zswap() > > > > atomically charge against zswap.max and unwind the store if we raced. > > > > > > > > For now, I would just keep the simple version we currently have: check > > > > once in zswap_store() and then just go ahead for the whole folio. > > > > > > I am not totally against this but I feel like this is too optimistic. > > > I think we can keep it simple-ish by maintaining an ewma for the > > > compression ratio, we already have primitives for this (see > > > DECLARE_EWMA). > > > > > > Then in zswap_store(), we can use the ewma to estimate the compressed > > > size and use it to do the memcg and global limit checks once, like we > > > do today. Instead of just checking if we are below the limits, we > > > check if we have enough headroom for the estimated compressed size. > > > Then we call zswap_store_page() to do the per-page stuff, then do > > > batched charging and stats updates. > > > > I'm not sure what you gain from making a non-atomic check precise. 
You > > can get a hundred threads determining down precisely that *their* > > store will fit exactly into the last 800kB before the limit. > > We just get to avoid overshooting in cases where we know we probably > can't fit it anyway. If we have 4KB left and we are trying to compress > a 2MB THP, for example. It just makes the upfront check to avoid > pointless compression a little bit more meaningful. I think I'm missing something. It's not just an upfront check, it's the only check. The charge down the line doesn't limit anything, it just counts. So if this check passes, we WILL store the folio. There is no pointless compression. We might overshoot the limit by about one folio in a single-threaded scenario. But that is negligible in comparison to the overshoot we can get due to race conditions. Again, I see no practical, meaningful difference in outcome by making that limit check any more precise. Just keep it as-is. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-25 20:13 ` Johannes Weiner @ 2024-09-25 21:06 ` Yosry Ahmed 2024-09-25 22:29 ` Sridhar, Kanchana P 0 siblings, 1 reply; 79+ messages in thread From: Yosry Ahmed @ 2024-09-25 21:06 UTC (permalink / raw) To: Johannes Weiner Cc: Kanchana P Sridhar, linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal On Wed, Sep 25, 2024 at 1:13 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Wed, Sep 25, 2024 at 12:39:02PM -0700, Yosry Ahmed wrote: > > On Wed, Sep 25, 2024 at 12:20 PM Johannes Weiner <hannes@cmpxchg.org> wrote: > > > > > > On Wed, Sep 25, 2024 at 11:30:34AM -0700, Yosry Ahmed wrote: > > > > Johannes wrote: > > > > > If this ever becomes an issue, we can handle it in a fastpath-slowpath > > > > > scheme: check the limit up front for fast-path failure if we're > > > > > already maxed out, just like now; then make obj_cgroup_charge_zswap() > > > > > atomically charge against zswap.max and unwind the store if we raced. > > > > > > > > > > For now, I would just keep the simple version we currently have: check > > > > > once in zswap_store() and then just go ahead for the whole folio. > > > > > > > > I am not totally against this but I feel like this is too optimistic. > > > > I think we can keep it simple-ish by maintaining an ewma for the > > > > compression ratio, we already have primitives for this (see > > > > DECLARE_EWMA). > > > > > > > > Then in zswap_store(), we can use the ewma to estimate the compressed > > > > size and use it to do the memcg and global limit checks once, like we > > > > do today. Instead of just checking if we are below the limits, we > > > > check if we have enough headroom for the estimated compressed size. > > > > Then we call zswap_store_page() to do the per-page stuff, then do > > > > batched charging and stats updates. 
> > > > > > I'm not sure what you gain from making a non-atomic check precise. You > > > can get a hundred threads determining down precisely that *their* > > > store will fit exactly into the last 800kB before the limit. > > > > We just get to avoid overshooting in cases where we know we probably > > can't fit it anyway. If we have 4KB left and we are trying to compress > > a 2MB THP, for example. It just makes the upfront check to avoid > > pointless compression a little bit more meaningful. > > I think I'm missing something. It's not just an upfront check, it's > the only check. The charge down the line doesn't limit anything, it > just counts. So if this check passes, we WILL store the folio. There > is no pointless compression. I got confused by what you said about the fast-slow path, I thought you were suggesting we do this now, so I was saying it's better to use an estimate of the compressed size in the fast path to avoid pointless compression. I missed the second paragraph. > > We might overshoot the limit by about one folio in a single-threaded > scenario. But that is negligible in comparison to the overshoot we can > get due to race conditions. > > Again, I see no no practical, meaningful difference in outcome by > making that limit check any more precise. Just keep it as-is. > Sorry to be blunt, but "precision" in a non-atomic check like this? > makes no sense. The fact that it's not too expensive is irrelevant. > This discussion around this honestly has gone off the rails. Yeah I thought we were talking about the version where we rollback compressions if we overshoot, my bad. We discussed quite a few things and I managed to confuse myself. > Just leave the limit checks exactly as they are. Check limits and > cgroup_may_zswap() once up front. Compress the subpages. Acquire > references and bump all stats in batches of folio_nr_pages(). 
You can > add up the subpage compressed bytes in the for-loop and do the > obj_cgroup_charge_zswap() in a single call at the end as well. We can keep the limit checks as they are for now, and revisit as needed. ^ permalink raw reply [flat|nested] 79+ messages in thread
* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-25 21:06 ` Yosry Ahmed @ 2024-09-25 22:29 ` Sridhar, Kanchana P 2024-09-26 3:58 ` Sridhar, Kanchana P 0 siblings, 1 reply; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-25 22:29 UTC (permalink / raw) To: Yosry Ahmed, Johannes Weiner Cc: linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Yosry Ahmed <yosryahmed@google.com> > Sent: Wednesday, September 25, 2024 2:06 PM > To: Johannes Weiner <hannes@cmpxchg.org> > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux- > kernel@vger.kernel.org; linux-mm@kvack.org; nphamcs@gmail.com; > chengming.zhou@linux.dev; usamaarif642@gmail.com; > shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org; > Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in > zswap_store(). > > On Wed, Sep 25, 2024 at 1:13 PM Johannes Weiner <hannes@cmpxchg.org> > wrote: > > > > On Wed, Sep 25, 2024 at 12:39:02PM -0700, Yosry Ahmed wrote: > > > On Wed, Sep 25, 2024 at 12:20 PM Johannes Weiner > <hannes@cmpxchg.org> wrote: > > > > > > > > On Wed, Sep 25, 2024 at 11:30:34AM -0700, Yosry Ahmed wrote: > > > > > Johannes wrote: > > > > > > If this ever becomes an issue, we can handle it in a fastpath- > slowpath > > > > > > scheme: check the limit up front for fast-path failure if we're > > > > > > already maxed out, just like now; then make > obj_cgroup_charge_zswap() > > > > > > atomically charge against zswap.max and unwind the store if we > raced. 
> > > > > > > > > > > > For now, I would just keep the simple version we currently have: > check > > > > > > once in zswap_store() and then just go ahead for the whole folio. > > > > > > > > > > I am not totally against this but I feel like this is too optimistic. > > > > > I think we can keep it simple-ish by maintaining an ewma for the > > > > > compression ratio, we already have primitives for this (see > > > > > DECLARE_EWMA). > > > > > > > > > > Then in zswap_store(), we can use the ewma to estimate the > compressed > > > > > size and use it to do the memcg and global limit checks once, like we > > > > > do today. Instead of just checking if we are below the limits, we > > > > > check if we have enough headroom for the estimated compressed size. > > > > > Then we call zswap_store_page() to do the per-page stuff, then do > > > > > batched charging and stats updates. > > > > > > > > I'm not sure what you gain from making a non-atomic check precise. You > > > > can get a hundred threads determining down precisely that *their* > > > > store will fit exactly into the last 800kB before the limit. > > > > > > We just get to avoid overshooting in cases where we know we probably > > > can't fit it anyway. If we have 4KB left and we are trying to compress > > > a 2MB THP, for example. It just makes the upfront check to avoid > > > pointless compression a little bit more meaningful. > > > > I think I'm missing something. It's not just an upfront check, it's > > the only check. The charge down the line doesn't limit anything, it > > just counts. So if this check passes, we WILL store the folio. There > > is no pointless compression. > > I got confused by what you said about the fast-slow path, I thought > you were suggesting we do this now, so I was saying it's better to use > an estimate of the compressed size in the fast path to avoid pointless > compression. > > I missed the second paragraph. 
> > > > > We might overshoot the limit by about one folio in a single-threaded > > scenario. But that is negligible in comparison to the overshoot we can > > get due to race conditions. > > > > Again, I see no no practical, meaningful difference in outcome by > > making that limit check any more precise. Just keep it as-is. > > > Sorry to be blunt, but "precision" in a non-atomic check like this? > > makes no sense. The fact that it's not too expensive is irrelevant. > > This discussion around this honestly has gone off the rails. > > Yeah I thought we were talking about the version where we rollback > compressions if we overshoot, my bad. We discussed quite a few things > and I managed to confuse myself. > > > Just leave the limit checks exactly as they are. Check limits and > > cgroup_may_zswap() once up front. Compress the subpages. Acquire > > references and bump all stats in batches of folio_nr_pages(). You can > > add up the subpage compressed bytes in the for-loop and do the > > obj_cgroup_charge_zswap() in a single call at the end as well. > > We can keep the limit checks as they are for now, and revisit as needed. Thanks Johannes and Yosry for the discussion! I will proceed as suggested. Thanks, Kanchana ^ permalink raw reply [flat|nested] 79+ messages in thread
* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-25 22:29 ` Sridhar, Kanchana P @ 2024-09-26 3:58 ` Sridhar, Kanchana P 2024-09-26 4:52 ` Yosry Ahmed 0 siblings, 1 reply; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-26 3:58 UTC (permalink / raw) To: Yosry Ahmed, Johannes Weiner Cc: linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Sent: Wednesday, September 25, 2024 3:29 PM > To: Yosry Ahmed <yosryahmed@google.com>; Johannes Weiner > <hannes@cmpxchg.org> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > nphamcs@gmail.com; chengming.zhou@linux.dev; > usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com; > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>; > Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Subject: RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in > zswap_store(). > > > -----Original Message----- > > From: Yosry Ahmed <yosryahmed@google.com> > > Sent: Wednesday, September 25, 2024 2:06 PM > > To: Johannes Weiner <hannes@cmpxchg.org> > > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux- > > kernel@vger.kernel.org; linux-mm@kvack.org; nphamcs@gmail.com; > > chengming.zhou@linux.dev; usamaarif642@gmail.com; > > shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying > > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > foundation.org; > > Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in > > zswap_store(). 
> > > > On Wed, Sep 25, 2024 at 1:13 PM Johannes Weiner > <hannes@cmpxchg.org> > > wrote: > > > > > > On Wed, Sep 25, 2024 at 12:39:02PM -0700, Yosry Ahmed wrote: > > > > On Wed, Sep 25, 2024 at 12:20 PM Johannes Weiner > > <hannes@cmpxchg.org> wrote: > > > > > > > > > > On Wed, Sep 25, 2024 at 11:30:34AM -0700, Yosry Ahmed wrote: > > > > > > Johannes wrote: > > > > > > > If this ever becomes an issue, we can handle it in a fastpath- > > slowpath > > > > > > > scheme: check the limit up front for fast-path failure if we're > > > > > > > already maxed out, just like now; then make > > obj_cgroup_charge_zswap() > > > > > > > atomically charge against zswap.max and unwind the store if we > > raced. > > > > > > > > > > > > > > For now, I would just keep the simple version we currently have: > > check > > > > > > > once in zswap_store() and then just go ahead for the whole folio. > > > > > > > > > > > > I am not totally against this but I feel like this is too optimistic. > > > > > > I think we can keep it simple-ish by maintaining an ewma for the > > > > > > compression ratio, we already have primitives for this (see > > > > > > DECLARE_EWMA). > > > > > > > > > > > > Then in zswap_store(), we can use the ewma to estimate the > > compressed > > > > > > size and use it to do the memcg and global limit checks once, like we > > > > > > do today. Instead of just checking if we are below the limits, we > > > > > > check if we have enough headroom for the estimated compressed > size. > > > > > > Then we call zswap_store_page() to do the per-page stuff, then do > > > > > > batched charging and stats updates. > > > > > > > > > > I'm not sure what you gain from making a non-atomic check precise. > You > > > > > can get a hundred threads determining down precisely that *their* > > > > > store will fit exactly into the last 800kB before the limit. > > > > > > > > We just get to avoid overshooting in cases where we know we probably > > > > can't fit it anyway. 
If we have 4KB left and we are trying to compress > > > > a 2MB THP, for example. It just makes the upfront check to avoid > > > > pointless compression a little bit more meaningful. > > > > > > I think I'm missing something. It's not just an upfront check, it's > > > the only check. The charge down the line doesn't limit anything, it > > > just counts. So if this check passes, we WILL store the folio. There > > > is no pointless compression. > > > > I got confused by what you said about the fast-slow path, I thought > > you were suggesting we do this now, so I was saying it's better to use > > an estimate of the compressed size in the fast path to avoid pointless > > compression. > > > > I missed the second paragraph. > > > > > > > > We might overshoot the limit by about one folio in a single-threaded > > > scenario. But that is negligible in comparison to the overshoot we can > > > get due to race conditions. > > > > > > Again, I see no practical, meaningful difference in outcome by > > > making that limit check any more precise. Just keep it as-is. > > > > > Sorry to be blunt, but "precision" in a non-atomic check like this? > > > makes no sense. The fact that it's not too expensive is irrelevant. > > > This discussion around this honestly has gone off the rails. > > > > Yeah I thought we were talking about the version where we rollback > > compressions if we overshoot, my bad. We discussed quite a few things > > and I managed to confuse myself. > > > > > Just leave the limit checks exactly as they are. Check limits and > > > cgroup_may_zswap() once up front. Compress the subpages. Acquire > > > references and bump all stats in batches of folio_nr_pages(). You can > > > add up the subpage compressed bytes in the for-loop and do the > > > obj_cgroup_charge_zswap() in a single call at the end as well. > > > > We can keep the limit checks as they are for now, and revisit as needed. > > Thanks Johannes and Yosry for the discussion! I will proceed as suggested. 
One thing I realized while reworking the patches for the batched checks is: within zswap_store_page(), we set the entry->objcg and entry->pool before adding it to the xarray. Given this, wouldn't it be safer to get the objcg and pool reference per sub-page, locally in zswap_store_page(), rather than obtaining batched references at the end if the store is successful? If we want zswap_store_page() to be self-contained and correct as far as the entry being created and added to the xarray, it seems like the right thing to do? I am a bit apprehensive about the entry being added to the xarray without a reference obtained on the objcg and pool, because any page-faults/writeback that occur on sub-pages added to the xarray before the entire folio has been stored, would run into issues. Just wanted to run this by you. The rest of the batched charging, atomic and stat updates should be Ok. Thanks, Kanchana > > Thanks, > Kanchana ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-26 3:58 ` Sridhar, Kanchana P @ 2024-09-26 4:52 ` Yosry Ahmed 2024-09-26 16:40 ` Sridhar, Kanchana P 0 siblings, 1 reply; 79+ messages in thread From: Yosry Ahmed @ 2024-09-26 4:52 UTC (permalink / raw) To: Sridhar, Kanchana P Cc: Johannes Weiner, linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh [..] > > One thing I realized while reworking the patches for the batched checks is: > within zswap_store_page(), we set the entry->objcg and entry->pool before > adding it to the xarray. Given this, wouldn't it be safer to get the objcg > and pool reference per sub-page, locally in zswap_store_page(), rather than > obtaining batched references at the end if the store is successful? If we want > zswap_store_page() to be self-contained and correct as far as the entry > being created and added to the xarray, it seems like the right thing to do? > I am a bit apprehensive about the entry being added to the xarray without > a reference obtained on the objcg and pool, because any page-faults/writeback > that occur on sub-pages added to the xarray before the entire folio has been > stored, would run into issues. We definitely should not obtain references to the pool and objcg after initializing the entries with them. We can obtain all references in zswap_store() before zswap_store_page(). IOW, the batching in this case should be done before the per-page operations, not after. > > Just wanted to run this by you. The rest of the batched charging, atomic > and stat updates should be Ok. > > Thanks, > Kanchana > > > > > Thanks, > > Kanchana ^ permalink raw reply [flat|nested] 79+ messages in thread
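Yosry's point above — take all references in zswap_store() before any per-page work, so no entry is ever published in the xarray while pointing at an unreferenced objcg — can be modeled in a few lines of plain C. A toy sketch with hypothetical names, not the real zswap/objcg API:

```c
#include <assert.h>

static int objcg_refs = 1;                  /* base reference (model) */

static void objcg_get_many(int n) { objcg_refs += n; }
static void objcg_put_many(int n) { objcg_refs -= n; }

/* Pretend page 'fail_at' fails to compress; returns pages stored. */
static int store_pages(int nr_pages, int fail_at)
{
        int i;

        for (i = 0; i < nr_pages; i++)
                if (i == fail_at)
                        return i;
        return nr_pages;
}

/* Returns 0 on full success, -1 after unwinding a partial store. */
static int store_folio(int nr_pages, int fail_at)
{
        int stored;

        objcg_get_many(nr_pages);           /* batch refs up front */
        stored = store_pages(nr_pages, fail_at);
        if (stored < nr_pages) {
                objcg_put_many(stored);     /* dropped by entry teardown */
                objcg_put_many(nr_pages - stored); /* never handed to entries */
                return -1;
        }
        return 0;                           /* stored entries keep their refs */
}
```

On success the stored entries keep their references; on failure the refcount returns exactly to where it started, so every published entry always holds a live reference.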
* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-26 4:52 ` Yosry Ahmed @ 2024-09-26 16:40 ` Sridhar, Kanchana P 2024-09-26 17:19 ` Yosry Ahmed 0 siblings, 1 reply; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-26 16:40 UTC (permalink / raw) To: Yosry Ahmed Cc: Johannes Weiner, linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Yosry Ahmed <yosryahmed@google.com> > Sent: Wednesday, September 25, 2024 9:52 PM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-kernel@vger.kernel.org; > linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com; > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in > zswap_store(). > > [..] > > > > One thing I realized while reworking the patches for the batched checks is: > > within zswap_store_page(), we set the entry->objcg and entry->pool before > > adding it to the xarray. Given this, wouldn't it be safer to get the objcg > > and pool reference per sub-page, locally in zswap_store_page(), rather than > > obtaining batched references at the end if the store is successful? If we > want > > zswap_store_page() to be self-contained and correct as far as the entry > > being created and added to the xarray, it seems like the right thing to do? 
> > I am a bit apprehensive about the entry being added to the xarray without > > a reference obtained on the objcg and pool, because any page- > faults/writeback > > that occur on sub-pages added to the xarray before the entire folio has been > > stored, would run into issues. > > We definitely should not obtain references to the pool and objcg after > initializing the entries with them. We can obtain all references in > zswap_store() before zswap_store_page(). IOW, the batching in this > case should be done before the per-page operations, not after. Thanks Yosry. IIUC, we should obtain all references to the objcg and to the zswap_pool at the start of zswap_store. In the case of error on any sub-page, we will unwind state for potentially only the stored pages or the entire folio if it happened to already be in zswap and is being re-written. We might need some additional book-keeping to keep track of which sub-pages were found in the xarray and zswap_entry_free() got called (nr_sb). Assuming I define a new "obj_cgroup_put_many()", I would need to call this with (folio_nr_pages() - nr_sb). As far as zswap_pool_get(), there is some added complexity if we want to keep the existing implementation that calls "percpu_ref_tryget()", and assuming this is extended to provide a new "zswap_pool_get_many()" that calls "percpu_ref_tryget_many()". Is there a reason we use percpu_ref_tryget() instead of percpu_ref_get()? Reason I ask is, with tryget(), if for some reason the pool->ref is 0, no further increments will be made. If so, upon unwinding state in zswap_store(), I would need to special-case to catch this before calling a new "zswap_pool_put_many()". Things could be a little simpler if zswap_pool_get() can use "percpu_ref_get()" which will always increment the refcount. Since the zswap pool->ref is initialized to "1", this seems Ok, but I don't know if there will be unintended consequences. 
Can you please advise on what is the simplest/cleanest approach: 1) Proceed with the above changes without changing percpu_ref_tryget in zswap_pool_get. Needs special-casing in zswap_store to detect pool->ref being "0" before calling zswap_pool_put[_many]. 2) Modify zswap_pool_get/zswap_pool_get_many to use percpu_ref_get_many and avoid special-casing to detect pool->ref being "0" before calling zswap_pool_put[_many]. 3) Keep the approach in v7 where obj_cgroup_get/put is localized to zswap_store_page for both success and error conditions, and any unwinding state in zswap_store will take care of dropping references obtained from prior successful writes (from this or prior invocations of zswap_store). Thanks, Kanchana > > > > > Just wanted to run this by you. The rest of the batched charging, atomic > > and stat updates should be Ok. > > > > Thanks, > > Kanchana > > > > > > > > Thanks, > > > Kanchana ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-26 16:40 ` Sridhar, Kanchana P @ 2024-09-26 17:19 ` Yosry Ahmed 2024-09-26 17:29 ` Sridhar, Kanchana P 0 siblings, 1 reply; 79+ messages in thread From: Yosry Ahmed @ 2024-09-26 17:19 UTC (permalink / raw) To: Sridhar, Kanchana P Cc: Johannes Weiner, linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh On Thu, Sep 26, 2024 at 9:40 AM Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> wrote: > > > -----Original Message----- > > From: Yosry Ahmed <yosryahmed@google.com> > > Sent: Wednesday, September 25, 2024 9:52 PM > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > > Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-kernel@vger.kernel.org; > > linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > > usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com; > > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in > > zswap_store(). > > > > [..] > > > > > > One thing I realized while reworking the patches for the batched checks is: > > > within zswap_store_page(), we set the entry->objcg and entry->pool before > > > adding it to the xarray. Given this, wouldn't it be safer to get the objcg > > > and pool reference per sub-page, locally in zswap_store_page(), rather than > > > obtaining batched references at the end if the store is successful? If we > > want > > > zswap_store_page() to be self-contained and correct as far as the entry > > > being created and added to the xarray, it seems like the right thing to do? 
> > > I am a bit apprehensive about the entry being added to the xarray without > > > a reference obtained on the objcg and pool, because any page- > > faults/writeback > > > that occur on sub-pages added to the xarray before the entire folio has been > > > stored, would run into issues. > > > > We definitely should not obtain references to the pool and objcg after > > initializing the entries with them. We can obtain all references in > > zswap_store() before zswap_store_page(). IOW, the batching in this > > case should be done before the per-page operations, not after. > > Thanks Yosry. IIUC, we should obtain all references to the objcg and to the > zswap_pool at the start of zswap_store. > > In the case of error on any sub-page, we will unwind state for potentially > only the stored pages or the entire folio if it happened to already be in zswap > and is being re-written. We might need some additional book-keeping to > keep track of which sub-pages were found in the xarray and zswap_entry_free() > got called (nr_sb). Assuming I define a new "obj_cgroup_put_many()", I would need > to call this with (folio_nr_pages() - nr_sb). > > As far as zswap_pool_get(), there is some added complexity if we want to > keep the existing implementation that calls "percpu_ref_tryget()", and assuming > this is extended to provide a new "zswap_pool_get_many()" that calls > "percpu_ref_tryget_many()". Is there a reason we use percpu_ref_tryget() instead > of percpu_ref_get()? Reason I ask is, with tryget(), if for some reason the pool->ref > is 0, no further increments will be made. If so, upon unwinding state in > zswap_store(), I would need to special-case to catch this before calling a new > "zswap_pool_put_many()". > > Things could be a little simpler if zswap_pool_get() can use "percpu_ref_get()" > which will always increment the refcount. Since the zswap pool->ref is initialized > to "1", this seems Ok, but I don't know if there will be unintended consequences. 
> > Can you please advise on what is the simplest/cleanest approach: > > 1) Proceed with the above changes without changing percpu_ref_tryget in > zswap_pool_get. Needs special-casing in zswap_store to detect pool->ref > being "0" before calling zswap_pool_put[_many]. My assumption is that we can reorder the code such that if zswap_pool_get_many() fails we don't call zswap_pool_put_many() to begin with (e.g. jump to a label after zswap_pool_put_many()). > 2) Modify zswap_pool_get/zswap_pool_get_many to use percpu_ref_get_many > and avoid special-casing to detect pool->ref being "0" before calling > zswap_pool_put[_many]. I don't think we can simply switch the tryget to a get, as I believe we can race with the pool being destroyed. > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to > zswap_store_page for both success and error conditions, and any unwinding > state in zswap_store will take care of dropping references obtained from > prior successful writes (from this or prior invocations of zswap_store). I am also fine with doing that and doing the reference batching as a follow up. > > Thanks, > Kanchana > > > > > > > > > Just wanted to run this by you. The rest of the batched charging, atomic > > > and stat updates should be Ok. > > > > > > Thanks, > > > Kanchana > > > > > > > > > > > Thanks, > > > > Kanchana ^ permalink raw reply [flat|nested] 79+ messages in thread
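The reordering suggested above for option (1) — if the batched tryget fails, jump to a label past the put so the put is never reached — might look like the following toy model. Names are hypothetical, and a real zswap_pool reference would be handed to the stored entries rather than dropped at the end; the sketch only shows the error-path ordering.

```c
#include <assert.h>
#include <stdbool.h>

static int pool_ref = 1;  /* 0 models a pool killed by percpu_ref_kill() */
static int bad_puts;      /* puts that would underflow a dead pool */

static bool pool_tryget_many(int n)
{
        if (pool_ref == 0)
                return false;   /* dying pool: refuse new references */
        pool_ref += n;
        return true;
}

static void pool_put_many(int n)
{
        if (pool_ref < n)
                bad_puts++;     /* caller bug: put without a matching get */
        else
                pool_ref -= n;
}

static int store_folio(int nr_pages)
{
        int err = -1;

        if (!pool_tryget_many(nr_pages))
                goto out;       /* no refs taken: skip the put entirely */

        /* ... per-page stores would go here ... */

        pool_put_many(nr_pages); /* balanced only on the path that got refs */
        err = 0;
out:
        return err;
}
```

No special-casing of pool->ref is needed in the caller: the failed tryget path simply never reaches the put.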
* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-26 17:19 ` Yosry Ahmed @ 2024-09-26 17:29 ` Sridhar, Kanchana P 2024-09-26 17:34 ` Yosry Ahmed 2024-09-26 18:43 ` Johannes Weiner 0 siblings, 2 replies; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-26 17:29 UTC (permalink / raw) To: Yosry Ahmed Cc: Johannes Weiner, linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Yosry Ahmed <yosryahmed@google.com> > Sent: Thursday, September 26, 2024 10:20 AM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-kernel@vger.kernel.org; > linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com; > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in > zswap_store(). 
> > On Thu, Sep 26, 2024 at 9:40 AM Sridhar, Kanchana P > <kanchana.p.sridhar@intel.com> wrote: > > > [..] > > > We definitely should not obtain references to the pool and objcg after > > > initializing the entries with them. We can obtain all references in > > > zswap_store() before zswap_store_page(). 
IOW, the batching in this > > > case should be done before the per-page operations, not after. > > > > Thanks Yosry. IIUC, we should obtain all references to the objcg and to the > > zswap_pool at the start of zswap_store. > > > > In the case of error on any sub-page, we will unwind state for potentially > > only the stored pages or the entire folio if it happened to already be in > zswap > > and is being re-written. We might need some additional book-keeping to > > keep track of which sub-pages were found in the xarray and > zswap_entry_free() > > got called (nr_sb). Assuming I define a new "obj_cgroup_put_many()", I > would need > > to call this with (folio_nr_pages() - nr_sb). > > > > As far as zswap_pool_get(), there is some added complexity if we want to > > keep the existing implementation that calls "percpu_ref_tryget()", and > assuming > > this is extended to provide a new "zswap_pool_get_many()" that calls > > "percpu_ref_tryget_many()". Is there a reason we use percpu_ref_tryget() > instead > > of percpu_ref_get()? Reason I ask is, with tryget(), if for some reason the > pool->ref > > is 0, no further increments will be made. If so, upon unwinding state in > > zswap_store(), I would need to special-case to catch this before calling a > new > > "zswap_pool_put_many()". > > > > Things could be a little simpler if zswap_pool_get() can use > "percpu_ref_get()" > > which will always increment the refcount. Since the zswap pool->ref is > initialized > > to "1", this seems Ok, but I don't know if there will be unintended > consequences. > > > > Can you please advise on what is the simplest/cleanest approach: > > > > 1) Proceed with the above changes without changing percpu_ref_tryget in > > zswap_pool_get. Needs special-casing in zswap_store to detect pool- > >ref > > being "0" before calling zswap_pool_put[_many]. 
> > My assumption is that we can reorder the code such that if > zswap_pool_get_many() fails we don't call zswap_pool_put_many() to > begin with (e.g. jump to a label after zswap_pool_put_many()). However, the pool refcount could change between the start and end of zswap_store. > > > 2) Modify zswap_pool_get/zswap_pool_get_many to use > percpu_ref_get_many > > and avoid special-casing to detect pool->ref being "0" before calling > > zswap_pool_put[_many]. > > I don't think we can simply switch the tryget to a get, as I believe > we can race with the pool being destroyed. That was my initial thought as well, but I figured this couldn't happen since the pool->ref is initialized to "1", and based on the existing implementation. In any case, I can understand the intent of the use of "tryget"; it is just that it adds to the considerations for reference batching. > > > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to > > zswap_store_page for both success and error conditions, and any > unwinding > > state in zswap_store will take care of dropping references obtained from > > prior successful writes (from this or prior invocations of zswap_store). > > I am also fine with doing that and doing the reference batching as a follow up. I think so too! We could try and improve upon (3) with reference batching in a follow-up patch. Thanks, Kanchana > > > > > > Thanks, > > Kanchana > > > > > > > > > > > > > Just wanted to run this by you. The rest of the batched charging, atomic > > > > and stat updates should be Ok. > > > > > > > > Thanks, > > > > Kanchana > > > > > > > > > > > > > > Thanks, > > > > > Kanchana ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-26 17:29 ` Sridhar, Kanchana P @ 2024-09-26 17:34 ` Yosry Ahmed 2024-09-26 19:36 ` Sridhar, Kanchana P 2024-09-26 18:43 ` Johannes Weiner 1 sibling, 1 reply; 79+ messages in thread From: Yosry Ahmed @ 2024-09-26 17:34 UTC (permalink / raw) To: Sridhar, Kanchana P Cc: Johannes Weiner, linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh On Thu, Sep 26, 2024 at 10:29 AM Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> wrote: > > > -----Original Message----- > > From: Yosry Ahmed <yosryahmed@google.com> > > Sent: Thursday, September 26, 2024 10:20 AM > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > > Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-kernel@vger.kernel.org; > > linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > > usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com; > > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in > > zswap_store(). 
> > > > On Thu, Sep 26, 2024 at 9:40 AM Sridhar, Kanchana P > > > <kanchana.p.sridhar@intel.com> wrote: > > > > > [..] > > > > > We definitely should not obtain references to the pool and objcg after > > > > > initializing the entries with them. 
We can obtain all references in > > > > zswap_store() before zswap_store_page(). IOW, the batching in this > > > > case should be done before the per-page operations, not after. > > > > > > Thanks Yosry. IIUC, we should obtain all references to the objcg and to the > > > zswap_pool at the start of zswap_store. > > > > > > In the case of error on any sub-page, we will unwind state for potentially > > > only the stored pages or the entire folio if it happened to already be in > > zswap > > > and is being re-written. We might need some additional book-keeping to > > > keep track of which sub-pages were found in the xarray and > > zswap_entry_free() > > > got called (nr_sb). Assuming I define a new "obj_cgroup_put_many()", I > > would need > > > to call this with (folio_nr_pages() - nr_sb). > > > > > > As far as zswap_pool_get(), there is some added complexity if we want to > > > keep the existing implementation that calls "percpu_ref_tryget()", and > > assuming > > > this is extended to provide a new "zswap_pool_get_many()" that calls > > > "percpu_ref_tryget_many()". Is there a reason we use percpu_ref_tryget() > > instead > > > of percpu_ref_get()? Reason I ask is, with tryget(), if for some reason the > > pool->ref > > > is 0, no further increments will be made. If so, upon unwinding state in > > > zswap_store(), I would need to special-case to catch this before calling a > > new > > > "zswap_pool_put_many()". > > > > > > Things could be a little simpler if zswap_pool_get() can use > > "percpu_ref_get()" > > > which will always increment the refcount. Since the zswap pool->ref is > > initialized > > > to "1", this seems Ok, but I don't know if there will be unintended > > consequences. > > > > > > Can you please advise on what is the simplest/cleanest approach: > > > > > > 1) Proceed with the above changes without changing percpu_ref_tryget in > > > zswap_pool_get. 
Needs special-casing in zswap_store to detect pool->ref > > > being "0" before calling zswap_pool_put[_many]. > > > > My assumption is that we can reorder the code such that if > > zswap_pool_get_many() fails we don't call zswap_pool_put_many() to > > begin with (e.g. jump to a label after zswap_pool_put_many()). > > However, the pool refcount could change between the start and end of > zswap_store. I am not sure what you mean. If zswap_pool_get_many() fails then we just do not call zswap_pool_put_many() at all and abort. > > > > > > 2) Modify zswap_pool_get/zswap_pool_get_many to use > > percpu_ref_get_many > > > and avoid special-casing to detect pool->ref being "0" before calling > > > zswap_pool_put[_many]. > > > > I don't think we can simply switch the tryget to a get, as I believe > > we can race with the pool being destroyed. > > That was my initial thought as well, but I figured this couldn't happen > > since the pool->ref is initialized to "1", and based on the existing > > implementation. In any case, I can understand the intent of the use > > of "tryget"; it is just that it adds to the considerations for reference > > batching. The initial ref can be dropped in __zswap_param_set() if a new pool is created (see the call to percpu_ref_kill()). > > > > > > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to > > > zswap_store_page for both success and error conditions, and any > > unwinding > > > state in zswap_store will take care of dropping references obtained from > > > prior successful writes (from this or prior invocations of zswap_store). > > > > I am also fine with doing that and doing the reference batching as a follow up. > > I think so too! We could try and improve upon (3) with reference batching > in a follow-up patch. SGTM. ^ permalink raw reply [flat|nested] 79+ messages in thread
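The outcome converged on here — keep v7's localized per-page get/put, where each zswap_store_page() takes its own reference before publishing the entry, and the unwind path drops exactly the references of the pages actually stored — can be sketched as a toy model (hypothetical names, not the kernel API):

```c
#include <assert.h>

static int refs = 1;            /* base reference (model) */
static int stored[64];          /* 1 = entry published (model xarray) */

static int store_page(int i, int fail_at)
{
        if (i == fail_at)
                return -1;
        refs++;                 /* per-page get, before publishing */
        stored[i] = 1;
        return 0;
}

static void delete_page(int i)
{
        if (stored[i]) {
                stored[i] = 0;
                refs--;         /* entry teardown drops its ref */
        }
}

static int store_folio(int nr, int fail_at)
{
        int i;

        for (i = 0; i < nr; i++) {
                if (store_page(i, fail_at)) {
                        while (i--)
                                delete_page(i); /* unwind prior subpages */
                        return -1;
                }
        }
        return 0;
}
```

A failed store leaves the refcount exactly at its starting value, so the agreed follow-up (batching the gets) would change only where the references are taken, not this invariant.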
* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-26 17:34 ` Yosry Ahmed @ 2024-09-26 19:36 ` Sridhar, Kanchana P 0 siblings, 0 replies; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-26 19:36 UTC (permalink / raw) To: Yosry Ahmed Cc: Johannes Weiner, linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Yosry Ahmed <yosryahmed@google.com> > Sent: Thursday, September 26, 2024 10:35 AM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: Johannes Weiner <hannes@cmpxchg.org>; linux-kernel@vger.kernel.org; > linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com; > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in > zswap_store(). > > On Thu, Sep 26, 2024 at 10:29 AM Sridhar, Kanchana P > <kanchana.p.sridhar@intel.com> wrote: > > > > > -----Original Message----- > > > From: Yosry Ahmed <yosryahmed@google.com> > > > Sent: Thursday, September 26, 2024 10:20 AM > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > > > Cc: Johannes Weiner <hannes@cmpxchg.org>; linux- > kernel@vger.kernel.org; > > > linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > > > usamaarif642@gmail.com; shakeel.butt@linux.dev; > ryan.roberts@arm.com; > > > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > > > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in > > > zswap_store(). 
> > > > > > On Thu, Sep 26, 2024 at 9:40 AM Sridhar, Kanchana P > > > <kanchana.p.sridhar@intel.com> wrote: > > > > > > > > > -----Original Message----- > > > > > From: Yosry Ahmed <yosryahmed@google.com> > > > > > Sent: Wednesday, September 25, 2024 9:52 PM > > > > > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > > > > > Cc: Johannes Weiner <hannes@cmpxchg.org>; linux- > > > kernel@vger.kernel.org; > > > > > linux-mm@kvack.org; nphamcs@gmail.com; > chengming.zhou@linux.dev; > > > > > usamaarif642@gmail.com; shakeel.butt@linux.dev; > > > ryan.roberts@arm.com; > > > > > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; > akpm@linux- > > > > > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > > > > > <wajdi.k.feghali@intel.com>; Gopal, Vinodh > <vinodh.gopal@intel.com> > > > > > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in > > > > > zswap_store(). > > > > > > > > > > [..] > > > > > > > > > > > > One thing I realized while reworking the patches for the batched > checks > > > is: > > > > > > within zswap_store_page(), we set the entry->objcg and entry->pool > > > before > > > > > > adding it to the xarray. Given this, wouldn't it be safer to get the > objcg > > > > > > and pool reference per sub-page, locally in zswap_store_page(), > rather > > > than > > > > > > obtaining batched references at the end if the store is successful? If > we > > > > > want > > > > > > zswap_store_page() to be self-contained and correct as far as the > entry > > > > > > being created and added to the xarray, it seems like the right thing to > > > do? > > > > > > I am a bit apprehensive about the entry being added to the xarray > > > without > > > > > > a reference obtained on the objcg and pool, because any page- > > > > > faults/writeback > > > > > > that occur on sub-pages added to the xarray before the entire folio > has > > > been > > > > > > stored, would run into issues. 
> > > > > > > > > > We definitely should not obtain references to the pool and objcg after > > > > > initializing the entries with them. We can obtain all references in > > > > > zswap_store() before zswap_store_page(). IOW, the batching in this > > > > > case should be done before the per-page operations, not after. > > > > > > > > Thanks Yosry. IIUC, we should obtain all references to the objcg and to > the > > > > zswap_pool at the start of zswap_store. > > > > > > > > In the case of error on any sub-page, we will unwind state for potentially > > > > only the stored pages or the entire folio if it happened to already be in > > > zswap > > > > and is being re-written. We might need some additional book-keeping to > > > > keep track of which sub-pages were found in the xarray and > > > zswap_entry_free() > > > > got called (nr_sb). Assuming I define a new "obj_cgroup_put_many()", I > > > would need > > > > to call this with (folio_nr_pages() - nr_sb). > > > > > > > > As far as zswap_pool_get(), there is some added complexity if we want > to > > > > keep the existing implementation that calls "percpu_ref_tryget()", and > > > assuming > > > > this is extended to provide a new "zswap_pool_get_many()" that calls > > > > "percpu_ref_tryget_many()". Is there a reason we use > percpu_ref_tryget() > > > instead > > > > of percpu_ref_get()? Reason I ask is, with tryget(), if for some reason the > > > pool->ref > > > > is 0, no further increments will be made. If so, upon unwinding state in > > > > zswap_store(), I would need to special-case to catch this before calling a > > > new > > > > "zswap_pool_put_many()". > > > > > > > > Things could be a little simpler if zswap_pool_get() can use > > > "percpu_ref_get()" > > > > which will always increment the refcount. Since the zswap pool->ref is > > > initialized > > > > to "1", this seems Ok, but I don't know if there will be unintended > > > consequences. 
> > > > > > > > Can you please advise on what is the simplest/cleanest approach: > > > > > > > > 1) Proceed with the above changes without changing percpu_ref_tryget in > > > > zswap_pool_get. Needs special-casing in zswap_store to detect pool->ref > > > > being "0" before calling zswap_pool_put[_many]. > > > > > > My assumption is that we can reorder the code such that if > > > zswap_pool_get_many() fails we don't call zswap_pool_put_many() to > > > begin with (e.g. jump to a label after zswap_pool_put_many()). > > > > However, the pool refcount could change between the start and end of > > zswap_store. > > I am not sure what you mean. If zswap_pool_get_many() fails then we > just do not call zswap_pool_put_many() at all and abort. I guess I was thinking of a scenario where zswap_pool_get_many() returns true; subsequently, the pool refcount reaches 0 before the zswap_pool_put_many(). I just realized this shouldn’t happen, so I think we are Ok. Will think about this some more while creating the follow-up patch. > > > > > > > > > > 2) Modify zswap_pool_get/zswap_pool_get_many to use > > > percpu_ref_get_many > > > > and avoid special-casing to detect pool->ref being "0" before calling > > > > zswap_pool_put[_many]. > > > > > > I don't think we can simply switch the tryget to a get, as I believe > > > we can race with the pool being destroyed. > > > > That was my initial thought as well, but I figured this couldn't happen > > since the pool->ref is initialized to "1", and based on the existing > > implementation. In any case, I can understand the intent of the use > > of "tryget"; it is just that it adds to the considerations for reference > > batching. > > The initial ref can be dropped in __zswap_param_set() if a new pool is > created (see the call to percpu_ref_kill()). I see.. this makes sense, thanks Yosry! 
> > > > > > > > > > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to > > > > zswap_store_page for both success and error conditions, and any > > > unwinding > > > > state in zswap_store will take care of dropping references obtained > from > > > > prior successful writes (from this or prior invocations of zswap_store). > > > > > > I am also fine with doing that and doing the reference batching as a follow > up. > > > > I think so too! We could try and improve upon (3) with reference batching > > in a follow-up patch. > > SGTM. Thanks, will proceed!
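Option (3) agreed upon above — per-page reference get/put localized in zswap_store_page(), with zswap_store() unwinding the references owned by sub-pages that were already stored — can be sketched as a small user-space model. This is purely illustrative: a plain C11 atomic counter stands in for the kernel's obj_cgroup/percpu_ref machinery, and the names (`store_page()`, `store_folio()`, `refs`, `fail_at`) are hypothetical, not the actual zswap API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the objcg/pool refcount (not the kernel type). */
static atomic_long refs = 1;	/* "initial" reference, as with pool->ref */

static void ref_get(void) { atomic_fetch_add(&refs, 1); }
static void ref_put(void) { atomic_fetch_sub(&refs, 1); }

/*
 * Model of zswap_store_page(): take a reference for the entry being
 * created; drop it again if storing this page fails.
 */
static bool store_page(long index, long fail_at)
{
	ref_get();
	if (index == fail_at) {		/* simulated compression/xarray failure */
		ref_put();
		return false;
	}
	return true;			/* ref is now owned by the stored entry */
}

/*
 * Model of zswap_store(): on failure, unwind by dropping the references
 * owned by the sub-pages that were already stored successfully.
 */
static bool store_folio(long nr_pages, long fail_at)
{
	long index;

	for (index = 0; index < nr_pages; index++) {
		if (!store_page(index, fail_at))
			goto unwind;
	}
	return true;
unwind:
	while (index--)
		ref_put();		/* what freeing each stored entry would do */
	return false;
}
```

In this model, a failure at page 2 of a 4-page folio drops exactly the two references taken for pages 0 and 1, so the count ends where it started; the batching discussed above would replace the per-page `ref_get()` calls with one `get_many()` up front.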
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-26 17:29 ` Sridhar, Kanchana P 2024-09-26 17:34 ` Yosry Ahmed @ 2024-09-26 18:43 ` Johannes Weiner 2024-09-26 18:45 ` Yosry Ahmed 2024-09-26 19:39 ` Sridhar, Kanchana P 1 sibling, 2 replies; 79+ messages in thread From: Johannes Weiner @ 2024-09-26 18:43 UTC (permalink / raw) To: Sridhar, Kanchana P Cc: Yosry Ahmed, linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh On Thu, Sep 26, 2024 at 05:29:30PM +0000, Sridhar, Kanchana P wrote: > > > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to > > > zswap_store_page for both success and error conditions, and any > > unwinding > > > state in zswap_store will take care of dropping references obtained from > > > prior successful writes (from this or prior invocations of zswap_store). > > > > I am also fine with doing that and doing the reference batching as a follow up. > > I think so too! We could try and improve upon (3) with reference batching > in a follow-up patch. Yeah, I agree. The percpu-refcounts are not that expensive, we should be able to live with per-page ops for now. One thing you *can* do from the start is tryget a pool reference in zswap_store(), to prevent the pool's untimely demise while you work on it, and then in zswap_store_page() you can do gets instead of trygets. You'd have to rename zswap_pool_get() to zswap_pool_tryget() (which is probably for the best) and implement the trivial new zswap_pool_get().
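The rename suggested here — zswap_pool_tryget() for the conditional acquire, plus a trivial unconditional zswap_pool_get() — hinges on the difference between the two operations. A rough user-space model of that difference (a toy atomic counter standing in for percpu_ref; the struct and function names are hypothetical, not the kernel code) might look like:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for percpu_ref; not the kernel implementation. */
struct pool_ref { atomic_long count; };

/*
 * Model of the proposed zswap_pool_tryget(): fails once the count has
 * dropped to zero, i.e. the pool is already being destroyed.
 */
static bool pool_tryget(struct pool_ref *ref)
{
	long c = atomic_load(&ref->count);

	do {
		if (c == 0)
			return false;	/* last reference already gone */
	} while (!atomic_compare_exchange_weak(&ref->count, &c, c + 1));
	return true;
}

/*
 * Model of the proposed (trivial) zswap_pool_get(): unconditional, and
 * therefore only safe while the caller already holds a reference.
 */
static void pool_get(struct pool_ref *ref)
{
	atomic_fetch_add(&ref->count, 1);
}

static void pool_put(struct pool_ref *ref)
{
	atomic_fetch_sub(&ref->count, 1);
}
```

The point of the pattern: one successful tryget at the start of zswap_store() keeps the count nonzero for the duration of the store, so the unconditional per-page gets issued from zswap_store_page() can never race with the pool's teardown.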
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-26 18:43 ` Johannes Weiner @ 2024-09-26 18:45 ` Yosry Ahmed 2024-09-26 19:40 ` Sridhar, Kanchana P 2024-09-26 19:39 ` Sridhar, Kanchana P 1 sibling, 1 reply; 79+ messages in thread From: Yosry Ahmed @ 2024-09-26 18:45 UTC (permalink / raw) To: Johannes Weiner Cc: Sridhar, Kanchana P, linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh On Thu, Sep 26, 2024 at 11:43 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Thu, Sep 26, 2024 at 05:29:30PM +0000, Sridhar, Kanchana P wrote: > > > > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to > > > > zswap_store_page for both success and error conditions, and any > > > unwinding > > > > state in zswap_store will take care of dropping references obtained from > > > > prior successful writes (from this or prior invocations of zswap_store). > > > > > > I am also fine with doing that and doing the reference batching as a follow up. > > > > I think so too! We could try and improve upon (3) with reference batching > > in a follow-up patch. > > Yeah, I agree. The percpu-refcounts are not that expensive, we should > be able to live with per-page ops for now. > > One thing you *can* do from the start is tryget a pool reference in > zswap_store(), to prevent the pool's untimely demise while you work on > it, and then in zswap_store_page() you can do gets instead of trygets. > > You'd have to rename zswap_pool_get() to zswap_pool_tryget() (which is > probably for the best) and implement the trivial new zswap_pool_get(). Yeah I was actually planning to send a follow-up patch to do exactly that until we figure out proper batching for the refcounts. Even better if Kanchana incorporates it in the next version :)
* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-26 18:45 ` Yosry Ahmed @ 2024-09-26 19:40 ` Sridhar, Kanchana P 0 siblings, 0 replies; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-26 19:40 UTC (permalink / raw) To: Yosry Ahmed, Johannes Weiner Cc: linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Yosry Ahmed <yosryahmed@google.com> > Sent: Thursday, September 26, 2024 11:46 AM > To: Johannes Weiner <hannes@cmpxchg.org> > Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; linux- > kernel@vger.kernel.org; linux-mm@kvack.org; nphamcs@gmail.com; > chengming.zhou@linux.dev; usamaarif642@gmail.com; > shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org; > Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in > zswap_store(). > > On Thu, Sep 26, 2024 at 11:43 AM Johannes Weiner <hannes@cmpxchg.org> > wrote: > > > > On Thu, Sep 26, 2024 at 05:29:30PM +0000, Sridhar, Kanchana P wrote: > > > > > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to > > > > > zswap_store_page for both success and error conditions, and any > > > > unwinding > > > > > state in zswap_store will take care of dropping references obtained > from > > > > > prior successful writes (from this or prior invocations of > zswap_store). > > > > > > > > I am also fine with doing that and doing the reference batching as a > follow up. > > > > > > I think so too! We could try and improve upon (3) with reference batching > > > in a follow-up patch. > > > > Yeah, I agree. The percpu-refcounts are not that expensive, we should > > be able to live with per-page ops for now. 
> > > > One thing you *can* do from the start is tryget a pool reference in > > zswap_store(), to prevent the pools untimely demise while you work on > > it, and then in zswap_store_page() you can do gets instead of trygets. > > > > You'd have to rename zswap_pool_get() to zswap_pool_tryget() (which is > > probably for the best) and implement the trivial new zswap_pool_get(). > > Yeah I was actually planning to send a follow-up patch to do exactly > that until we figure out proper patching for the refcounts. Even > better if Kanchana incorporates it in the next version :) Sure, Yosry, I will incorporate it in the next version! Thanks again, Kanchana ^ permalink raw reply [flat|nested] 79+ messages in thread
* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-26 18:43 ` Johannes Weiner 2024-09-26 18:45 ` Yosry Ahmed @ 2024-09-26 19:39 ` Sridhar, Kanchana P 1 sibling, 0 replies; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-26 19:39 UTC (permalink / raw) To: Johannes Weiner Cc: Yosry Ahmed, linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Johannes Weiner <hannes@cmpxchg.org> > Sent: Thursday, September 26, 2024 11:43 AM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: Yosry Ahmed <yosryahmed@google.com>; linux-kernel@vger.kernel.org; > linux-mm@kvack.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com; > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in > zswap_store(). > > On Thu, Sep 26, 2024 at 05:29:30PM +0000, Sridhar, Kanchana P wrote: > > > > 3) Keep the approach in v7 where obj_cgroup_get/put is localized to > > > > zswap_store_page for both success and error conditions, and any > > > unwinding > > > > state in zswap_store will take care of dropping references obtained > from > > > > prior successful writes (from this or prior invocations of zswap_store). > > > > > > I am also fine with doing that and doing the reference batching as a follow > up. > > > > I think so too! We could try and improve upon (3) with reference batching > > in a follow-up patch. > > Yeah, I agree. The percpu-refcounts are not that expensive, we should > be able to live with per-page ops for now. 
> > One thing you *can* do from the start is tryget a pool reference in > zswap_store(), to prevent the pool's untimely demise while you work on > it, and then in zswap_store_page() you can do gets instead of trygets. Sure, this sounds good Johannes, thanks for the suggestion! I already do a zswap_pool_current_get() at the beginning of zswap_store in the v7 code, for this purpose. > > You'd have to rename zswap_pool_get() to zswap_pool_tryget() (which is > probably for the best) and implement the trivial new zswap_pool_get(). Ok, will do so. Thanks, Kanchana
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-24 1:17 ` [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store() Kanchana P Sridhar 2024-09-24 17:33 ` Nhat Pham 2024-09-24 19:38 ` Yosry Ahmed @ 2024-09-25 14:27 ` Johannes Weiner 2024-09-25 18:17 ` Yosry Ahmed 2024-09-25 18:48 ` Sridhar, Kanchana P 2 siblings, 2 replies; 79+ messages in thread From: Johannes Weiner @ 2024-09-25 14:27 UTC (permalink / raw) To: Kanchana P Sridhar Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal On Mon, Sep 23, 2024 at 06:17:07PM -0700, Kanchana P Sridhar wrote: > zswap_store() will now store mTHP and PMD-size THP folios by compressing The hugepage terminology throughout the patches is a bit convoluted. There is no real distinction in this code between PMD-size THPs and sub-PMD-sized mTHPs e.g. In particular, I think "mTHP" made sense when they were added, to distinguish them from conventional THPs. But using this term going forward just causes confusion, IMO. We're going through a big effort in the codebase to call all of these things simply "folios" - which stands for "one or more pages". If you want to emphasize the "more than one page", the convention is to call it a "large folio". (If you need to emphasize that it's PMD size - which doesn't apply to these patches, but just for the record - the convention is "pmd-mappable folio".) So what this patch set does is "support large folios in zswap". > @@ -1551,51 +1559,63 @@ static bool __maybe_unused zswap_store_page(struct folio *folio, long index, > return false; > } > > +/* > + * Modified to store mTHP folios. Each page in the mTHP will be compressed > + * and stored sequentially. > + */ This is a changelog, not a code comment ;) Please delete it. 
> bool zswap_store(struct folio *folio) > { > long nr_pages = folio_nr_pages(folio); > swp_entry_t swp = folio->swap; > pgoff_t offset = swp_offset(swp); > struct xarray *tree = swap_zswap_tree(swp); > - struct zswap_entry *entry; > struct obj_cgroup *objcg = NULL; > struct mem_cgroup *memcg = NULL; > + struct zswap_pool *pool; > + bool ret = false; > + long index; > > VM_WARN_ON_ONCE(!folio_test_locked(folio)); > VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); > > - /* Large folios aren't supported */ > - if (folio_test_large(folio)) > + /* Storing large folios isn't enabled */ > + if (!zswap_mthp_enabled && folio_test_large(folio)) > return false; > > if (!zswap_enabled) > - goto check_old; > + goto reject; > > - /* Check cgroup limits */ > + /* > + * Check cgroup limits: > + * > + * The cgroup zswap limit check is done once at the beginning of an > + * mTHP store, and not within zswap_store_page() for each page > + * in the mTHP. We do however check the zswap pool limits at the Use "folio" and "large folio" as appropriate here and throughout. ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-25 14:27 ` Johannes Weiner @ 2024-09-25 18:17 ` Yosry Ahmed 2024-09-25 18:48 ` Sridhar, Kanchana P 1 sibling, 0 replies; 79+ messages in thread From: Yosry Ahmed @ 2024-09-25 18:17 UTC (permalink / raw) To: Johannes Weiner Cc: Kanchana P Sridhar, linux-kernel, linux-mm, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal On Wed, Sep 25, 2024 at 7:27 AM Johannes Weiner <hannes@cmpxchg.org> wrote: > > On Mon, Sep 23, 2024 at 06:17:07PM -0700, Kanchana P Sridhar wrote: > > zswap_store() will now store mTHP and PMD-size THP folios by compressing > > The hugepage terminology throughout the patches is a bit convoluted. > > There is no real distinction in this code between PMD-size THPs and > sub-PMD-sized mTHPs e.g. In particular, I think "mTHP" made sense when > they were added, to distinguish them from conventional THPs. But using > this term going forward just causes confusion, IMO. > > We're going through a big effort in the codebase to call all of these > things simply "folios" - which stands for "one or more pages". If you > want to emphasize the "more than one page", the convention is to call > it a "large folio". (If you need to emphasize that it's PMD size - > which doesn't apply to these patches, but just for the record - the > convention is "pmd-mappable folio".) > > So what this patch set does is "support large folios in zswap". Agreed on all of this, except it should be "support large folios in zswap _stores". We don't really support loading large folios. ^ permalink raw reply [flat|nested] 79+ messages in thread
* RE: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store(). 2024-09-25 14:27 ` Johannes Weiner 2024-09-25 18:17 ` Yosry Ahmed @ 2024-09-25 18:48 ` Sridhar, Kanchana P 1 sibling, 0 replies; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-25 18:48 UTC (permalink / raw) To: Johannes Weiner Cc: linux-kernel, linux-mm, yosryahmed, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Johannes Weiner <hannes@cmpxchg.org> > Sent: Wednesday, September 25, 2024 7:28 AM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > yosryahmed@google.com; nphamcs@gmail.com; > chengming.zhou@linux.dev; usamaarif642@gmail.com; > shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org; > Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 6/8] mm: zswap: Support mTHP swapout in > zswap_store(). > > On Mon, Sep 23, 2024 at 06:17:07PM -0700, Kanchana P Sridhar wrote: > > zswap_store() will now store mTHP and PMD-size THP folios by compressing > > The hugepage terminology throughout the patches is a bit convoluted. > > There is no real distinction in this code between PMD-size THPs and > sub-PMD-sized mTHPs e.g. In particular, I think "mTHP" made sense when > they were added, to distinguish them from conventional THPs. But using > this term going forward just causes confusion, IMO. > > We're going through a big effort in the codebase to call all of these > things simply "folios" - which stands for "one or more pages". If you > want to emphasize the "more than one page", the convention is to call > it a "large folio". 
(If you need to emphasize that it's PMD size - > which doesn't apply to these patches, but just for the record - the > convention is "pmd-mappable folio".) > > So what this patch set does is "support large folios in zswap". Sure. Will modify this to be "support large folios in zswap _stores" as per Yosry's follow-up clarification. > > > @@ -1551,51 +1559,63 @@ static bool __maybe_unused > zswap_store_page(struct folio *folio, long index, > > return false; > > } > > > > +/* > > + * Modified to store mTHP folios. Each page in the mTHP will be > compressed > > + * and stored sequentially. > > + */ > > This is a changelog, not a code comment ;) Please delete it. Ok, sure. > > > bool zswap_store(struct folio *folio) > > { > > long nr_pages = folio_nr_pages(folio); > > swp_entry_t swp = folio->swap; > > pgoff_t offset = swp_offset(swp); > > struct xarray *tree = swap_zswap_tree(swp); > > - struct zswap_entry *entry; > > struct obj_cgroup *objcg = NULL; > > struct mem_cgroup *memcg = NULL; > > + struct zswap_pool *pool; > > + bool ret = false; > > + long index; > > > > VM_WARN_ON_ONCE(!folio_test_locked(folio)); > > VM_WARN_ON_ONCE(!folio_test_swapcache(folio)); > > > > - /* Large folios aren't supported */ > > - if (folio_test_large(folio)) > > + /* Storing large folios isn't enabled */ > > + if (!zswap_mthp_enabled && folio_test_large(folio)) > > return false; > > > > if (!zswap_enabled) > > - goto check_old; > > + goto reject; > > > > - /* Check cgroup limits */ > > + /* > > + * Check cgroup limits: > > + * > > + * The cgroup zswap limit check is done once at the beginning of an > > + * mTHP store, and not within zswap_store_page() for each page > > + * in the mTHP. We do however check the zswap pool limits at the > > Use "folio" and "large folio" as appropriate here and throughout. Sounds good. Thanks, Kanchana ^ permalink raw reply [flat|nested] 79+ messages in thread
* [PATCH v7 7/8] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout stats. 2024-09-24 1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar ` (5 preceding siblings ...) 2024-09-24 1:17 ` [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store() Kanchana P Sridhar @ 2024-09-24 1:17 ` Kanchana P Sridhar 2024-09-24 1:17 ` [PATCH v7 8/8] mm: Document the newly added mTHP zswpout stats, clarify swpout semantics Kanchana P Sridhar ` (2 subsequent siblings) 9 siblings, 0 replies; 79+ messages in thread From: Kanchana P Sridhar @ 2024-09-24 1:17 UTC (permalink / raw) To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar Add a new MTHP_STAT_ZSWPOUT entry to the sysfs mTHP stats so that per-order mTHP folio ZSWAP stores can be accounted. If zswap_store() successfully swaps out an mTHP, it will be counted under the per-order sysfs "zswpout" stats: /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout Other block dev/fs mTHP swap-out events will be counted under the existing sysfs "swpout" stats: /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/swpout Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> --- include/linux/huge_mm.h | 1 + mm/huge_memory.c | 3 +++ mm/page_io.c | 1 + 3 files changed, 5 insertions(+) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 0b0539f4ee1a..ab95b94e9627 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -118,6 +118,7 @@ enum mthp_stat_item { MTHP_STAT_ANON_FAULT_ALLOC, MTHP_STAT_ANON_FAULT_FALLBACK, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE, + MTHP_STAT_ZSWPOUT, MTHP_STAT_SWPOUT, MTHP_STAT_SWPOUT_FALLBACK, MTHP_STAT_SHMEM_ALLOC, diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 4e34b7f89daf..7d8ce7891ba8 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -612,6 +612,7 @@ static 
struct kobj_attribute _name##_attr = __ATTR_RO(_name) DEFINE_MTHP_STAT_ATTR(anon_fault_alloc, MTHP_STAT_ANON_FAULT_ALLOC); DEFINE_MTHP_STAT_ATTR(anon_fault_fallback, MTHP_STAT_ANON_FAULT_FALLBACK); DEFINE_MTHP_STAT_ATTR(anon_fault_fallback_charge, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE); +DEFINE_MTHP_STAT_ATTR(zswpout, MTHP_STAT_ZSWPOUT); DEFINE_MTHP_STAT_ATTR(swpout, MTHP_STAT_SWPOUT); DEFINE_MTHP_STAT_ATTR(swpout_fallback, MTHP_STAT_SWPOUT_FALLBACK); #ifdef CONFIG_SHMEM @@ -630,6 +631,7 @@ static struct attribute *anon_stats_attrs[] = { &anon_fault_fallback_attr.attr, &anon_fault_fallback_charge_attr.attr, #ifndef CONFIG_SHMEM + &zswpout_attr.attr, &swpout_attr.attr, &swpout_fallback_attr.attr, #endif @@ -660,6 +662,7 @@ static struct attribute_group file_stats_attr_grp = { static struct attribute *any_stats_attrs[] = { #ifdef CONFIG_SHMEM + &zswpout_attr.attr, &swpout_attr.attr, &swpout_fallback_attr.attr, #endif diff --git a/mm/page_io.c b/mm/page_io.c index bc1183299a7d..4aa34862676f 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -269,6 +269,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc) swap_zeromap_folio_clear(folio); } if (zswap_store(folio)) { + count_mthp_stat(folio_order(folio), MTHP_STAT_ZSWPOUT); folio_unlock(folio); return 0; } -- 2.27.0 ^ permalink raw reply [flat|nested] 79+ messages in thread
* [PATCH v7 8/8] mm: Document the newly added mTHP zswpout stats, clarify swpout semantics. 2024-09-24 1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar ` (6 preceding siblings ...) 2024-09-24 1:17 ` [PATCH v7 7/8] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout stats Kanchana P Sridhar @ 2024-09-24 1:17 ` Kanchana P Sridhar 2024-09-24 17:36 ` Nhat Pham 2024-09-24 19:34 ` [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Yosry Ahmed 2024-09-25 6:35 ` Huang, Ying 9 siblings, 1 reply; 79+ messages in thread From: Kanchana P Sridhar @ 2024-09-24 1:17 UTC (permalink / raw) To: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm Cc: nanhai.zou, wajdi.k.feghali, vinodh.gopal, kanchana.p.sridhar Added documentation for the newly added sysfs mTHP "zswpout" stats. Clarified that only non-ZSWAP mTHP swapouts will be accounted in the mTHP "swpout" stats. Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> --- Documentation/admin-guide/mm/transhuge.rst | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst index cfdd16a52e39..a65f905e9ca7 100644 --- a/Documentation/admin-guide/mm/transhuge.rst +++ b/Documentation/admin-guide/mm/transhuge.rst @@ -530,10 +530,14 @@ anon_fault_fallback_charge instead falls back to using huge pages with lower orders or small pages even though the allocation was successful. -swpout - is incremented every time a huge page is swapped out in one +zswpout + is incremented every time a huge page is swapped out to ZSWAP in one piece without splitting. +swpout + is incremented every time a huge page is swapped out to a non-ZSWAP + swap entity in one piece without splitting. + swpout_fallback is incremented if a huge page has to be split before swapout. 
Usually because failed to allocate some continuous swap space -- 2.27.0 ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 8/8] mm: Document the newly added mTHP zswpout stats, clarify swpout semantics. 2024-09-24 1:17 ` [PATCH v7 8/8] mm: Document the newly added mTHP zswpout stats, clarify swpout semantics Kanchana P Sridhar @ 2024-09-24 17:36 ` Nhat Pham 2024-09-24 20:52 ` Sridhar, Kanchana P 0 siblings, 1 reply; 79+ messages in thread From: Nhat Pham @ 2024-09-24 17:36 UTC (permalink / raw) To: Kanchana P Sridhar Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar <kanchana.p.sridhar@intel.com> wrote: > > Added documentation for the newly added sysfs mTHP "zswpout" stats. > > Clarified that only non-ZSWAP mTHP swapouts will be accounted in the mTHP > "swpout" stats. > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > --- > Documentation/admin-guide/mm/transhuge.rst | 8 ++++++-- > 1 file changed, 6 insertions(+), 2 deletions(-) > > diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst > index cfdd16a52e39..a65f905e9ca7 100644 > --- a/Documentation/admin-guide/mm/transhuge.rst > +++ b/Documentation/admin-guide/mm/transhuge.rst > @@ -530,10 +530,14 @@ anon_fault_fallback_charge > instead falls back to using huge pages with lower orders or > small pages even though the allocation was successful. > > -swpout > - is incremented every time a huge page is swapped out in one > +zswpout > + is incremented every time a huge page is swapped out to ZSWAP in one > piece without splitting. nit: a bit weird to capitalize ZSWAP no? :) > > +swpout > + is incremented every time a huge page is swapped out to a non-ZSWAP > + swap entity in one piece without splitting. > + nit: "non-zswap swap entity" is a bit awkward. Maybe swapped out to a non-zswap swap device? ^ permalink raw reply [flat|nested] 79+ messages in thread
* RE: [PATCH v7 8/8] mm: Document the newly added mTHP zswpout stats, clarify swpout semantics. 2024-09-24 17:36 ` Nhat Pham @ 2024-09-24 20:52 ` Sridhar, Kanchana P 0 siblings, 0 replies; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-24 20:52 UTC (permalink / raw) To: Nhat Pham Cc: linux-kernel, linux-mm, hannes, yosryahmed, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Nhat Pham <nphamcs@gmail.com> > Sent: Tuesday, September 24, 2024 10:37 AM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; yosryahmed@google.com; > chengming.zhou@linux.dev; usamaarif642@gmail.com; > shakeel.butt@linux.dev; ryan.roberts@arm.com; Huang, Ying > <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux-foundation.org; > Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 8/8] mm: Document the newly added mTHP zswpout > stats, clarify swpout semantics. > > On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar > <kanchana.p.sridhar@intel.com> wrote: > > > > Added documentation for the newly added sysfs mTHP "zswpout" stats. > > > > Clarified that only non-ZSWAP mTHP swapouts will be accounted in the > mTHP > > "swpout" stats. 
> > > > Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com> > > --- > > Documentation/admin-guide/mm/transhuge.rst | 8 ++++++-- > > 1 file changed, 6 insertions(+), 2 deletions(-) > > > > diff --git a/Documentation/admin-guide/mm/transhuge.rst > b/Documentation/admin-guide/mm/transhuge.rst > > index cfdd16a52e39..a65f905e9ca7 100644 > > --- a/Documentation/admin-guide/mm/transhuge.rst > > +++ b/Documentation/admin-guide/mm/transhuge.rst > > @@ -530,10 +530,14 @@ anon_fault_fallback_charge > > instead falls back to using huge pages with lower orders or > > small pages even though the allocation was successful. > > > > -swpout > > - is incremented every time a huge page is swapped out in one > > +zswpout > > + is incremented every time a huge page is swapped out to ZSWAP in > one > > piece without splitting. > > nit: a bit weird to capitalize ZSWAP no? :) No problem :). Will fix in v8. > > > > > +swpout > > + is incremented every time a huge page is swapped out to a non-ZSWAP > > + swap entity in one piece without splitting. > > + > > nit: "non-zswap swap entity" is a bit awkward. Maybe swapped out to a > non-zswap swap device? Sure, will make this change in v8. Thanks Nhat! ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios 2024-09-24 1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar ` (7 preceding siblings ...) 2024-09-24 1:17 ` [PATCH v7 8/8] mm: Document the newly added mTHP zswpout stats, clarify swpout semantics Kanchana P Sridhar @ 2024-09-24 19:34 ` Yosry Ahmed 2024-09-24 22:50 ` Sridhar, Kanchana P 2024-09-25 6:35 ` Huang, Ying 9 siblings, 1 reply; 79+ messages in thread From: Yosry Ahmed @ 2024-09-24 19:34 UTC (permalink / raw) To: Kanchana P Sridhar Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, ying.huang, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar <kanchana.p.sridhar@intel.com> wrote: > > Hi All, > > This patch-series enables zswap_store() to accept and store mTHP > folios. The most significant contribution in this series is from the > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been > migrated to mm-unstable as of 9-23-2024 in patches 5,6 of this series. > > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting > https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u > > Additionally, there is an attempt to modularize some of the functionality > in zswap_store(), to make it more amenable to supporting any-order > mTHPs. For instance, the function zswap_store_entry() stores a zswap_entry > in the xarray. Likewise, zswap_delete_stored_offsets() can be used to > delete all offsets corresponding to a higher order folio stored in zswap. These are implementation details that are not very useful here, you can just mention that the first few patches do refactoring prep work. 
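To make the store/rollback flow described in the quoted cover letter concrete: every subpage of a large folio is stored at its own swap offset, and a failure part-way through must delete every offset already stored. A minimal Python model of that invariant (not kernel code; the names echo zswap_store_entry()/zswap_delete_stored_offsets(), a dict stands in for the xarray, and the compress callback is hypothetical):

```python
def store_folio(tree, base_offset, pages, compress):
    """Store each page of a folio; on any failure, roll back all offsets."""
    stored = 0
    for i, page in enumerate(pages):
        data = compress(page)
        if data is None:                      # compression rejected this page
            delete_stored_offsets(tree, base_offset, stored)
            return False                      # whole folio falls back
        tree[base_offset + i] = data          # analogous to zswap_store_entry()
        stored += 1
    return True

def delete_stored_offsets(tree, base_offset, count):
    """Analogous to zswap_delete_stored_offsets(): drop partial stores."""
    for i in range(count):
        tree.pop(base_offset + i, None)
```

The key property, for any order: after store_folio() returns, the tree holds either all of the folio's offsets or none of them.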
> > For accounting purposes, the patch-series adds per-order mTHP sysfs > "zswpout" counters that get incremented upon successful zswap_store of > an mTHP folio: > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout > > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) > will enable/disable zswap storing of (m)THP. When disabled, zswap will > fallback to rejecting the mTHP folio, to be processed by the backing > swap device. Why is this needed? Do we just not have enough confidence in the feature yet, or are there some cases that regress from enabling mTHP for zswapout? Does generic mTHP swapout/swapin also use config options? > > This patch-series is a pre-requisite for ZSWAP compress batching of mTHP > swap-out and decompress batching of swap-ins based on swapin_readahead(), > using Intel IAA hardware acceleration, which we would like to submit in > subsequent patch-series, with performance improvement data. > > Thanks to Ying Huang for pre-posting review feedback and suggestions! > > Thanks also to Nhat, Yosry, Barry, Chengming, Usama and Ying for their > helpful feedback, data reviews and suggestions! > > Co-development signoff request: > =============================== > I would like to request Ryan Roberts' co-developer signoff on patches > 5 and 6 in this series. Thanks Ryan! > > Changes since v6: > ================= Please put the changelog at the very end, I almost missed the performance evaluation. > 1) Rebased to mm-unstable as of 9-23-2024, > commit acfabf7e197f7a5bedf4749dac1f39551417b049. > 2) Refactored into smaller commits, as suggested by Yosry and > Chengming. Thanks both! > 3) Reworded the commit log for patches 5 and 6 as per Yosry's > suggestion. Thanks Yosry! > 4) Gathered data on a Sapphire Rapids server that has 823GiB SSD swap disk > partition. Also, all experiments are run with usemem --sleep 10, so that > the memory allocated by the 70 processes remains in memory > longer. Posted elapsed and sys times. 
Thanks to Yosry, Nhat and Ying for > their help with refining the performance characterization methodology. > 5) Updated Documentation/admin-guide/mm/transhuge.rst as suggested by > Nhat. Thanks Nhat! > > Changes since v5: > ================= > 1) Rebased to mm-unstable as of 8/29/2024, > commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642. > 2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to > enable/disable zswap_store() of mTHP folios. Thanks Nhat for the > suggestion to add a knob by which users can enable/disable this > change. Nhat, I hope this is along the lines of what you were > thinking. > 3) Added vm-scalability usemem data with 4K folios with > CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make sure > there is no regression with this change. > 4) Added data with usemem with 64K and 2M THP for an alternate view of > before/after, as suggested by Yosry, so we can understand the impact > of when mTHPs are split into 4K folios in shrink_folio_list() > (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored > in zswap. Thanks Yosry for this suggestion. > > Changes since v4: > ================= > 1) Published before/after data with zstd, as suggested by Nhat (Thanks > Nhat for the data reviews!). > 2) Rebased to mm-unstable from 8/27/2024, > commit b659edec079c90012cf8d05624e312d1062b8b87. > 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if > CONFIG_MEMCG is not defined, to resolve build errors reported by kernel > robot; as per Nhat's and Michal's suggestion to not require a separate > patch to fix the build errors (thanks both!). > 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as > suggested by Yosry (Thanks Yosry!). > 5) Squashed the commits that define new mthp zswpout stat counters, and > invoke count_mthp_stat() after successful zswap_store()s; into a single > commit. Thanks Yosry for this suggestion! 
> > Changes since v3: > ================= > 1) Rebased to mm-unstable commit 8c0b4f7b65fd1ca7af01267f491e815a40d77444. > Thanks to Barry for suggesting aligning with Ryan Roberts' latest > changes to count_mthp_stat() so that it's always defined, even when THP > is disabled. Barry, I have also made one other change in page_io.c > where count_mthp_stat() is called by count_swpout_vm_event(). I would > appreciate it if you can review this. Thanks! > Hopefully this should resolve the kernel robot build errors. > > Changes since v2: > ================= > 1) Gathered usemem data using SSD as the backing swap device for zswap, > as suggested by Ying Huang. Ying, I would appreciate it if you can > review the latest data. Thanks! > 2) Generated the base commit info in the patches to attempt to address > the kernel test robot build errors. > 3) No code changes to the individual patches themselves. > > Changes since RFC v1: > ===================== > > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion. > Thanks Barry! > 2) Addressed some of the code review comments that Nhat Pham provided in > Ryan's initial RFC [1]: > - Added a comment about the cgroup zswap limit checks occurring once per > folio at the beginning of zswap_store(). > Nhat, Ryan, please do let me know if the comments convey the summary > from the RFC discussion. Thanks! > - Posted data on running the cgroup suite's zswap kselftest. > 3) Rebased to v6.11-rc3. > 4) Gathered performance data with usemem and the rebased patch-series. > > > Regression Testing: > =================== > I ran vm-scalability usemem 70 processes without mTHP, i.e., only 4K > folios with mm-unstable and with this patch-series. The main goal was > to make sure that there is no functional or performance regression > wrt the earlier zswap behavior for 4K folios, > CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set, and zswap_store() of 4K > pages goes through the newly added code path [zswap_store(), > zswap_store_page()]. 
> > The data indicates there is no regression. > > ------------------------------------------------------------------------------ > mm-unstable 8-28-2024 zswap-mTHP v6 > CONFIG_ZSWAP_STORE_THP_DEFAULT_ON > is not set > ------------------------------------------------------------------------------ > ZSWAP compressor zstd deflate- zstd deflate- > iaa iaa > ------------------------------------------------------------------------------ > Throughput (KB/s) 110,775 113,010 111,550 121,937 > sys time (sec) 1,141.72 954.87 1,131.95 828.47 > memcg_high 140,500 153,737 139,772 134,129 > memcg_swap_high 0 0 0 0 > memcg_swap_fail 0 0 0 0 > pswpin 0 0 0 0 > pswpout 0 0 0 0 > zswpin 675 690 682 684 > zswpout 9,552,298 10,603,271 9,566,392 9,267,213 > thp_swpout 0 0 0 0 > thp_swpout_ 0 0 0 0 > fallback > pgmajfault 3,453 3,468 3,841 3,487 > ZSWPOUT-64kB-mTHP n/a n/a 0 0 > SWPOUT-64kB-mTHP 0 0 0 0 > ------------------------------------------------------------------------------ It's probably better to put the zstd columns next to each other, and the deflate-iaa columns next to each other, for easier visual comparisons. > > > Performance Testing: > ==================== > Testing of this patch-series was done with mm-unstable as of 9-23-2024, > commit acfabf7e197f7a5bedf4749dac1f39551417b049. Data was gathered > without/with this patch-series, on an Intel Sapphire Rapids server, > dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB RAM and > 823G SSD disk partition swap. Core frequency was fixed at 2500MHz. > > The vm-scalability "usemem" test was run in a cgroup whose memory.high > was fixed at 40G. There is no swap limit set for the cgroup. 
Following a > similar methodology as in Ryan Roberts' "Swap-out mTHP without splitting" > series [2], 70 usemem processes were run, each allocating and writing 1G of > memory, and sleeping for 10 sec before exiting: > > usemem --init-time -w -O -s 10 -n 70 1g > > The vm/sysfs mTHP stats included with the performance data provide details > on the swapout activity to ZSWAP/swap. > > Other kernel configuration parameters: > > ZSWAP Compressors : zstd, deflate-iaa > ZSWAP Allocator : zsmalloc > SWAP page-cluster : 2 > > In the experiments where "deflate-iaa" is used as the ZSWAP compressor, > IAA "compression verification" is enabled. Hence each IAA compression > will be decompressed internally by the "iaa_crypto" driver, the crc-s > returned by the hardware will be compared and errors reported in case of > mismatches. Thus "deflate-iaa" helps ensure better data integrity as > compared to the software compressors. > > Throughput is derived by averaging the individual 70 processes' throughputs > reported by usemem. elapsed/sys times are measured with perf. All data > points per compressor/kernel/mTHP configuration are averaged across 3 runs. > > Case 1: Comparing zswap 4K vs. zswap mTHP > ========================================= > > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in > 64K/2M (m)THP to be split into 4K folios that get processed by zswap. > > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that results > in 64K/2M (m)THP to not be split, and processed by zswap. 
> > 64KB mTHP (cgroup memory.high set to 40G): > ========================================== > > ------------------------------------------------------------------------------- > mm-unstable 9-23-2024 zswap-mTHP Change wrt > CONFIG_THP_SWAP=N CONFIG_THP_SWAP=Y Baseline > Baseline > ------------------------------------------------------------------------------- > ZSWAP compressor zstd deflate- zstd deflate- zstd deflate- > iaa iaa iaa > ------------------------------------------------------------------------------- > Throughput (KB/s) 143,323 125,485 153,550 129,609 7% 3% > elapsed time (sec) 24.97 25.42 23.90 25.19 4% 1% > sys time (sec) 822.72 750.96 757.70 731.13 8% 3% > memcg_high 132,743 169,825 148,075 192,744 > memcg_swap_fail 639,067 841,553 2,204 2,215 > pswpin 0 0 0 0 > pswpout 0 0 0 0 > zswpin 795 873 760 902 > zswpout 10,011,266 13,195,137 10,010,017 13,193,554 > thp_swpout 0 0 0 0 > thp_swpout_ 0 0 0 0 > fallback > 64kB-mthp_ 639,065 841,553 2,204 2,215 > swpout_fallback > pgmajfault 2,861 2,924 3,054 3,259 > ZSWPOUT-64kB n/a n/a 623,451 822,268 > SWPOUT-64kB 0 0 0 0 > ------------------------------------------------------------------------------- > > > 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G): > ======================================================= > > ------------------------------------------------------------------------------- > mm-unstable 9-23-2024 zswap-mTHP Change wrt > CONFIG_THP_SWAP=N CONFIG_THP_SWAP=Y Baseline > Baseline > ------------------------------------------------------------------------------- > ZSWAP compressor zstd deflate- zstd deflate- zstd deflate- > iaa iaa iaa > ------------------------------------------------------------------------------- > Throughput (KB/s) 145,616 139,640 169,404 141,168 16% 1% > elapsed time (sec) 25.05 23.85 23.02 23.37 8% 2% > sys time (sec) 790.53 676.34 613.26 677.83 22% -0.2% > memcg_high 16,702 25,197 17,374 23,890 > memcg_swap_fail 21,485 27,814 114 144 > pswpin 0 0 0 0 > pswpout 0 
0 0 0 > zswpin 793 852 778 922 > zswpout 10,011,709 13,186,882 10,010,893 13,195,600 > thp_swpout 0 0 0 0 > thp_swpout_ 21,485 27,814 114 144 > fallback > 2048kB-mthp_ n/a n/a 0 0 > swpout_fallback > pgmajfault 2,701 2,822 4,151 5,066 > ZSWPOUT-2048kB n/a n/a 19,442 25,615 > SWPOUT-2048kB 0 0 0 0 > ------------------------------------------------------------------------------- > > We mostly see improvements in throughput, elapsed and sys time for zstd and > deflate-iaa, when comparing before (THP_SWAP=N) vs. after (THP_SWAP=Y). > > > Case 2: Comparing SSD swap mTHP vs. zswap mTHP > ============================================== > > In this scenario, CONFIG_THP_SWAP is enabled in "before" and "after" > experiments. The "before" represents zswap rejecting mTHP, and the mTHP > being stored by the 823G SSD swap. The "after" represents data with this > patch-series, that results in 64K/2M (m)THP being processed and stored by > zswap. > > 64KB mTHP (cgroup memory.high set to 40G): > ========================================== > > ------------------------------------------------------------------------------- > mm-unstable 9-23-2024 zswap-mTHP Change wrt > CONFIG_THP_SWAP=Y CONFIG_THP_SWAP=Y Baseline > Baseline > ------------------------------------------------------------------------------- > ZSWAP compressor zstd deflate- zstd deflate- zstd deflate- > iaa iaa iaa > ------------------------------------------------------------------------------- > Throughput (KB/s) 20,265 20,696 153,550 129,609 658% 526% > elapsed time (sec) 72.44 70.86 23.90 25.19 67% 64% > sys time (sec) 77.95 77.99 757.70 731.13 -872% -837% > memcg_high 115,811 113,277 148,075 192,744 > memcg_swap_fail 2,386 2,425 2,204 2,215 > pswpin 16 16 0 0 > pswpout 7,774,235 7,616,069 0 0 > zswpin 728 749 760 902 > zswpout 38,424 39,022 10,010,017 13,193,554 > thp_swpout 0 0 0 0 > thp_swpout_ 0 0 0 0 > fallback > 64kB-mthp_ 2,386 2,425 2,204 2,215 > swpout_fallback > pgmajfault 2,757 2,860 3,054 3,259 > 
ZSWPOUT-64kB n/a n/a 623,451 822,268 > SWPOUT-64kB 485,890 476,004 0 0 > ------------------------------------------------------------------------------- > > > 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G): > ======================================================= > > ------------------------------------------------------------------------------- > mm-unstable 9-23-2024 zswap-mTHP Change wrt > CONFIG_THP_SWAP=Y CONFIG_THP_SWAP=Y Baseline > Baseline > ------------------------------------------------------------------------------- > ZSWAP compressor zstd deflate- zstd deflate- zstd deflate- > iaa iaa iaa > ------------------------------------------------------------------------------- > Throughput (KB/s) 24,347 35,971 169,404 141,168 596% 292% > elapsed time (sec) 63.52 64.59 23.02 23.37 64% 64% > sys time (sec) 27.91 27.01 613.26 677.83 -2098% -2410% > memcg_high 13,576 13,467 17,374 23,890 > memcg_swap_fail 162 124 114 144 > pswpin 0 0 0 0 > pswpout 7,003,307 7,168,853 0 0 > zswpin 741 722 778 922 > zswpout 84,429 65,315 10,010,893 13,195,600 > thp_swpout 13,678 14,002 0 0 > thp_swpout_ 162 124 114 144 > fallback > 2048kB-mthp_ n/a n/a 0 0 > swpout_fallback > pgmajfault 3,345 2,903 4,151 5,066 > ZSWPOUT-2048kB n/a n/a 19,442 25,615 > SWPOUT-2048kB 13,678 14,002 0 0 > ------------------------------------------------------------------------------- > > We see significant improvements in throughput and elapsed time for zstd and > deflate-iaa, when comparing before (mTHP-SSD) vs. after (mTHP-ZSWAP). The > sys time increases with mTHP-ZSWAP as expected, due to the CPU compression > time vs. asynchronous disk write times, as pointed out by Ying and Yosry. > > In the "Before" scenario, when zswap does not store mTHP, only allocations > count towards the cgroup memory limit. 
However, in the "After" scenario, > with the introduction of zswap_store() mTHP, both, allocations as well as > the zswap compressed pool usage from all 70 processes are counted towards > the memory limit. As a result, we see higher swapout activity in the > "After" data. Hence, more time is spent doing reclaim as the zswap cgroup > charge leads to more frequent memory.high breaches. > > Summary: > ======== > The v7 data presented above comparing zswap-mTHP with a conventional 823G > SSD swap demonstrates good performance improvements with zswap-mTHP. Hence, > it seems reasonable for zswap_store to support (m)THP, so that further > performance improvements can be implemented. > > Some of the ideas that have shown promise in our experiments are: > > 1) IAA compress/decompress batching. > 2) Distributing compress jobs across all IAA devices on the socket. > > In the experimental setup used in this patchset, we have enabled > IAA compress verification to ensure additional hardware data integrity CRC > checks not currently done by the software compressors. The tests run for > this patchset are also using only 1 IAA device per core, that avails of 2 > compress engines on the device. In our experiments with IAA batching, we > distribute compress jobs from all cores to the 8 compress engines available > per socket. We further compress the pages in each mTHP in parallel in the > accelerator. As a result, we improve compress latency and reclaim > throughput. > > The following compares the same usemem workload characteristics between: > > 1) zstd (v7 experiments) > 2) deflate-iaa "Fixed mode" (v7 experiments) > 3) deflate-iaa with batching > 4) deflate-iaa-canned "Canned mode" [3] with batching > > vm.page-cluster is set to "2" for all runs. 
> > 64K mTHP ZSWAP: > =============== > > ------------------------------------------------------------------------------- > ZSWAP zstd IAA Fixed IAA Fixed IAA Canned IAA IAA IAA > compressor (v7) (v7) + Batching + Batching Batch Canned Canned > vs. vs. Batch > 64K mTHP Seqtl Fixed vs. > ZSTD > ------------------------------------------------------------------------------- > Throughput 153,550 129,609 156,215 166,975 21% 7% 9% > (KB/s) > elapsed time 23.90 25.19 22.46 21.38 11% 5% 11% > (sec) > sys time 757.70 731.13 715.62 648.83 2% 9% 14% > (sec) > memcg_high 148,075 192,744 197,548 181,734 > memcg_swap_ 2,204 2,215 2,293 2,263 > fail > pswpin 0 0 0 0 > pswpout 0 0 0 0 > zswpin 760 902 774 833 > zswpout 10,010,017 13,193,554 13,193,176 12,125,616 > thp_swpout 0 0 0 0 > thp_swpout_ 0 0 0 0 > fallback > 64kB-mthp_ 2,204 2,215 2,293 2,263 > swpout_ > fallback > pgmajfault 3,054 3,259 3,545 3,516 > ZSWPOUT-64kB 623,451 822,268 822,176 755,480 > SWPOUT-64kB 0 0 0 0 > swap_ra 146 161 152 159 > swap_ra_hit 64 121 68 88 > ------------------------------------------------------------------------------- > > > 2M THP ZSWAP: > ============= > > ------------------------------------------------------------------------------- > ZSWAP zstd IAA Fixed IAA Fixed IAA Canned IAA IAA IAA > compressor (v7) (v7) + Batching + Batching Batch Canned Canned > vs. vs. Batch > 2M THP Seqtl Fixed vs. 
> ZSTD > ------------------------------------------------------------------------------- > Throughput 169,404 141,168 175,089 193,407 24% 10% 14% > (KB/s) > elapsed time 23.02 23.37 21.13 19.97 10% 5% 13% > (sec) > sys time 613.26 677.83 630.51 533.80 7% 15% 13% > (sec) > memcg_high 17,374 23,890 24,349 22,374 > memcg_swap_ 114 144 102 88 > fail > pswpin 0 0 0 0 > pswpout 0 0 0 0 > zswpin 778 922 6,492 6,642 > zswpout 10,010,893 13,195,600 13,199,907 12,132,265 > thp_swpout 0 0 0 0 > thp_swpout_ 114 144 102 88 > fallback > pgmajfault 4,151 5,066 5,032 4,999 > ZSWPOUT-2MB 19,442 25,615 25,666 23,594 > SWPOUT-2MB 0 0 0 0 > swap_ra 3 9 4,383 4,494 > swap_ra_hit 2 6 4,298 4,412 > ------------------------------------------------------------------------------- > > > With ZSWAP IAA compress/decompress batching, we are able to demonstrate > significant performance improvements and memory savings in scalability > experiments under memory pressure, as compared to software compressors. We > hope to submit this work in subsequent patch series. Honestly I would remove the detailed results of the followup series for batching, it should be enough to mention a single figure for further expected improvement from ongoing work that depends on this. > > Thanks, > Kanchana > > [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u > [2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/ > [3] https://patchwork.kernel.org/project/linux-crypto/cover/cover.1710969449.git.andre.glover@linux.intel.com/ > > > Kanchana P Sridhar (8): > mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined. > mm: zswap: Modify zswap_compress() to accept a page instead of a > folio. > mm: zswap: Refactor code to store an entry in zswap xarray. > mm: zswap: Refactor code to delete stored offsets in case of errors. > mm: zswap: Compress and store a specific page in a folio. > mm: zswap: Support mTHP swapout in zswap_store(). 
> mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout > stats. > mm: Document the newly added mTHP zswpout stats, clarify swpout > semantics. > > Documentation/admin-guide/mm/transhuge.rst | 8 +- > include/linux/huge_mm.h | 1 + > include/linux/memcontrol.h | 4 + > mm/Kconfig | 8 + > mm/huge_memory.c | 3 + > mm/page_io.c | 1 + > mm/zswap.c | 248 ++++++++++++++++----- > 7 files changed, 210 insertions(+), 63 deletions(-) > > > base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049 > -- > 2.27.0 > ^ permalink raw reply [flat|nested] 79+ messages in thread
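For anyone re-deriving the "Change wrt Baseline" columns in the tables above: the sign conventions appear to be (after - before) / before for throughput, and the reduction (before - after) / before for elapsed/sys time, so a time increase shows up as a negative change. A quick Python check against the quoted zstd numbers:

```python
def throughput_change(before, after):
    """Percent gain in throughput relative to the baseline run."""
    return round((after - before) / before * 100)

def time_change(before, after):
    """Percent reduction in elapsed/sys time; negative means a regression."""
    return round((before - after) / before * 100)

# Case 1, 64K mTHP, zstd: throughput 143,323 -> 153,550 KB/s (table: 7%),
# sys time 822.72 -> 757.70 sec (table: 8%)
print(throughput_change(143323, 153550))  # 7
print(time_change(822.72, 757.70))        # 8
# Case 2, 64K mTHP, zstd sys time: 77.95 -> 757.70 sec (table: -872%)
print(time_change(77.95, 757.70))         # -872
```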
* RE: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios 2024-09-24 19:34 ` [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Yosry Ahmed @ 2024-09-24 22:50 ` Sridhar, Kanchana P 0 siblings, 0 replies; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-24 22:50 UTC (permalink / raw) To: Yosry Ahmed Cc: linux-kernel, linux-mm, hannes, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, Huang, Ying, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Yosry Ahmed <yosryahmed@google.com> > Sent: Tuesday, September 24, 2024 12:35 PM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev; > usamaarif642@gmail.com; shakeel.butt@linux.dev; ryan.roberts@arm.com; > Huang, Ying <ying.huang@intel.com>; 21cnbao@gmail.com; akpm@linux- > foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, Wajdi K > <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios > > On Mon, Sep 23, 2024 at 6:17 PM Kanchana P Sridhar > <kanchana.p.sridhar@intel.com> wrote: > > > > Hi All, > > > > This patch-series enables zswap_store() to accept and store mTHP > > folios. The most significant contribution in this series is from the > > earlier RFC submitted by Ryan Roberts [1]. Ryan's original RFC has been > > migrated to mm-unstable as of 9-23-2024 in patches 5,6 of this series. > > > > [1]: [RFC PATCH v1] mm: zswap: Store large folios without splitting > > https://lore.kernel.org/linux-mm/20231019110543.3284654-1- > ryan.roberts@arm.com/T/#u > > > > Additionally, there is an attempt to modularize some of the functionality > > in zswap_store(), to make it more amenable to supporting any-order > > mTHPs. For instance, the function zswap_store_entry() stores a > zswap_entry > > in the xarray. 
Likewise, zswap_delete_stored_offsets() can be used to > > delete all offsets corresponding to a higher order folio stored in zswap. > > These are implementation details that are not very useful here, you > can just mention that the first few patches do refactoring prep work. Thanks Yosry for the comments! Sure, I will reword this as you've suggested in v8. > > > > > For accounting purposes, the patch-series adds per-order mTHP sysfs > > "zswpout" counters that get incremented upon successful zswap_store of > > an mTHP folio: > > > > /sys/kernel/mm/transparent_hugepage/hugepages-*kB/stats/zswpout > > > > A new config variable CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by > default) > > will enable/disable zswap storing of (m)THP. When disabled, zswap will > > fallback to rejecting the mTHP folio, to be processed by the backing > > swap device. > > Why is this needed? Do we just not have enough confidence in the > feature yet, or are there some cases that regress from enabling mTHP > for zswapout? > > Does generic mTHP swapout/swapin also use config options? As discussed in the other comments' follow-up, I will delete the config option and runtime knob. > > > > > This patch-series is a pre-requisite for ZSWAP compress batching of mTHP > > swap-out and decompress batching of swap-ins based on > swapin_readahead(), > > using Intel IAA hardware acceleration, which we would like to submit in > > subsequent patch-series, with performance improvement data. > > > > Thanks to Ying Huang for pre-posting review feedback and suggestions! > > > > Thanks also to Nhat, Yosry, Barry, Chengming, Usama and Ying for their > > helpful feedback, data reviews and suggestions! > > > > Co-development signoff request: > > =============================== > > I would like to request Ryan Roberts' co-developer signoff on patches > > 5 and 6 in this series. Thanks Ryan! 
> > > > Changes since v6: > > ================= > > Please put the changelog at the very end, I almost missed the > performance evaluation. Sure, will fix this. > > > 1) Rebased to mm-unstable as of 9-23-2024, > > commit acfabf7e197f7a5bedf4749dac1f39551417b049. > > 2) Refactored into smaller commits, as suggested by Yosry and > > Chengming. Thanks both! > > 3) Reworded the commit log for patches 5 and 6 as per Yosry's > > suggestion. Thanks Yosry! > > 4) Gathered data on a Sapphire Rapids server that has 823GiB SSD swap disk > > partition. Also, all experiments are run with usemem --sleep 10, so that > > the memory allocated by the 70 processes remains in memory > > longer. Posted elapsed and sys times. Thanks to Yosry, Nhat and Ying for > > their help with refining the performance characterization methodology. > > 5) Updated Documentation/admin-guide/mm/transhuge.rst as suggested > by > > Nhat. Thanks Nhat! > > > > Changes since v5: > > ================= > > 1) Rebased to mm-unstable as of 8/29/2024, > > commit 9287e4adbc6ab8fa04d25eb82e097fed877a4642. > > 2) Added CONFIG_ZSWAP_STORE_THP_DEFAULT_ON (off by default) to > > enable/disable zswap_store() of mTHP folios. Thanks Nhat for the > > suggestion to add a knob by which users can enable/disable this > > change. Nhat, I hope this is along the lines of what you were > > thinking. > > 3) Added vm-scalability usemem data with 4K folios with > > CONFIG_ZSWAP_STORE_THP_DEFAULT_ON off, that I gathered to make > sure > > there is no regression with this change. > > 4) Added data with usemem with 64K and 2M THP for an alternate view of > > before/after, as suggested by Yosry, so we can understand the impact > > of when mTHPs are split into 4K folios in shrink_folio_list() > > (CONFIG_THP_SWAP off) vs. not split (CONFIG_THP_SWAP on) and stored > > in zswap. Thanks Yosry for this suggestion. 
> > > > Changes since v4: > > ================= > > 1) Published before/after data with zstd, as suggested by Nhat (Thanks > > Nhat for the data reviews!). > > 2) Rebased to mm-unstable from 8/27/2024, > > commit b659edec079c90012cf8d05624e312d1062b8b87. > > 3) Incorporated the change in memcontrol.h that defines obj_cgroup_get() if > > CONFIG_MEMCG is not defined, to resolve build errors reported by kernel > > robot; as per Nhat's and Michal's suggestion to not require a separate > > patch to fix the build errors (thanks both!). > > 4) Deleted all same-filled folio processing in zswap_store() of mTHP, as > > suggested by Yosry (Thanks Yosry!). > > 5) Squashed the commits that define new mthp zswpout stat counters, and > > invoke count_mthp_stat() after successful zswap_store()s; into a single > > commit. Thanks Yosry for this suggestion! > > > > Changes since v3: > > ================= > > 1) Rebased to mm-unstable commit > 8c0b4f7b65fd1ca7af01267f491e815a40d77444. > > Thanks to Barry for suggesting aligning with Ryan Roberts' latest > > changes to count_mthp_stat() so that it's always defined, even when THP > > is disabled. Barry, I have also made one other change in page_io.c > > where count_mthp_stat() is called by count_swpout_vm_event(). I would > > appreciate it if you can review this. Thanks! > > Hopefully this should resolve the kernel robot build errors. > > > > Changes since v2: > > ================= > > 1) Gathered usemem data using SSD as the backing swap device for zswap, > > as suggested by Ying Huang. Ying, I would appreciate it if you can > > review the latest data. Thanks! > > 2) Generated the base commit info in the patches to attempt to address > > the kernel test robot build errors. > > 3) No code changes to the individual patches themselves. > > > > Changes since RFC v1: > > ===================== > > > > 1) Use sysfs for zswpout mTHP stats, as per Barry Song's suggestion. > > Thanks Barry! 
> > 2) Addressed some of the code review comments that Nhat Pham provided > in > > Ryan's initial RFC [1]: > > - Added a comment about the cgroup zswap limit checks occuring once > per > > folio at the beginning of zswap_store(). > > Nhat, Ryan, please do let me know if the comments convey the summary > > from the RFC discussion. Thanks! > > - Posted data on running the cgroup suite's zswap kselftest. > > 3) Rebased to v6.11-rc3. > > 4) Gathered performance data with usemem and the rebased patch-series. > > > > > > Regression Testing: > > =================== > > I ran vm-scalability usemem 70 processes without mTHP, i.e., only 4K > > folios with mm-unstable and with this patch-series. The main goal was > > to make sure that there is no functional or performance regression > > wrt the earlier zswap behavior for 4K folios, > > CONFIG_ZSWAP_STORE_THP_DEFAULT_ON is not set, and zswap_store() of > 4K > > pages goes through the newly added code path [zswap_store(), > > zswap_store_page()]. > > > > The data indicates there is no regression. 
> >
> > ------------------------------------------------------------------------------
> >                    mm-unstable 8-28-2024          zswap-mTHP v6
> >                                          CONFIG_ZSWAP_STORE_THP_DEFAULT_ON
> >                                                    is not set
> > ------------------------------------------------------------------------------
> > ZSWAP compressor        zstd    deflate-       zstd    deflate-
> >                                      iaa                    iaa
> > ------------------------------------------------------------------------------
> > Throughput (KB/s)    110,775     113,010    111,550     121,937
> > sys time (sec)      1,141.72      954.87   1,131.95      828.47
> > memcg_high           140,500     153,737    139,772     134,129
> > memcg_swap_high            0           0          0           0
> > memcg_swap_fail            0           0          0           0
> > pswpin                     0           0          0           0
> > pswpout                    0           0          0           0
> > zswpin                   675         690        682         684
> > zswpout            9,552,298  10,603,271  9,566,392   9,267,213
> > thp_swpout                 0           0          0           0
> > thp_swpout_                0           0          0           0
> >  fallback
> > pgmajfault             3,453       3,468      3,841       3,487
> > ZSWPOUT-64kB-mTHP        n/a         n/a          0           0
> > SWPOUT-64kB-mTHP           0           0          0           0
> > ------------------------------------------------------------------------------

>
> It's probably better to put the zstd columns next to each other, and
> the deflate-iaa columns next to each other, for easier visual
> comparisons.

Sure. Will change this accordingly, in v8.

> >
> >
> > Performance Testing:
> > ====================
> > Testing of this patch-series was done with mm-unstable as of 9-23-2024,
> > commit acfabf7e197f7a5bedf4749dac1f39551417b049. Data was gathered
> > without/with this patch-series, on an Intel Sapphire Rapids server,
> > dual-socket 56 cores per socket, 4 IAA devices per socket, 503 GiB RAM and
> > 823G SSD disk partition swap. Core frequency was fixed at 2500MHz.
> >
> > The vm-scalability "usemem" test was run in a cgroup whose memory.high
> > was fixed at 40G. There is no swap limit set for the cgroup.
> > Following a similar methodology as in Ryan Roberts' "Swap-out mTHP
> > without splitting" series [2], 70 usemem processes were run, each
> > allocating and writing 1G of memory, and sleeping for 10 sec before
> > exiting:
> >
> >   usemem --init-time -w -O -s 10 -n 70 1g
> >
> > The vm/sysfs mTHP stats included with the performance data provide
> > details on the swapout activity to ZSWAP/swap.
> >
> > Other kernel configuration parameters:
> >
> >   ZSWAP Compressors : zstd, deflate-iaa
> >   ZSWAP Allocator   : zsmalloc
> >   SWAP page-cluster : 2
> >
> > In the experiments where "deflate-iaa" is used as the ZSWAP compressor,
> > IAA "compression verification" is enabled. Hence each IAA compression
> > will be decompressed internally by the "iaa_crypto" driver, the crc-s
> > returned by the hardware will be compared and errors reported in case of
> > mismatches. Thus "deflate-iaa" helps ensure better data integrity as
> > compared to the software compressors.
> >
> > Throughput is derived by averaging the individual 70 processes' throughputs
> > reported by usemem. elapsed/sys times are measured with perf. All data
> > points per compressor/kernel/mTHP configuration are averaged across 3
> > runs.
> >
> > Case 1: Comparing zswap 4K vs. zswap mTHP
> > =========================================
> >
> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in
> > 64K/2M (m)THP to be split into 4K folios that get processed by zswap.
> >
> > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that
> > results in 64K/2M (m)THP to not be split, and processed by zswap.
> >
> > 64KB mTHP (cgroup memory.high set to 40G):
> > ==========================================
> >
> > -------------------------------------------------------------------------------
> >                    mm-unstable 9-23-2024       zswap-mTHP         Change wrt
> >                     CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y      Baseline
> >                          Baseline
> > -------------------------------------------------------------------------------
> > ZSWAP compressor        zstd    deflate-       zstd    deflate-   zstd deflate-
> >                                      iaa                    iaa            iaa
> > -------------------------------------------------------------------------------
> > Throughput (KB/s)    143,323     125,485    153,550     129,609     7%      3%
> > elapsed time (sec)     24.97       25.42      23.90       25.19     4%      1%
> > sys time (sec)        822.72      750.96     757.70      731.13     8%      3%
> > memcg_high           132,743     169,825    148,075     192,744
> > memcg_swap_fail      639,067     841,553      2,204       2,215
> > pswpin                     0           0          0           0
> > pswpout                    0           0          0           0
> > zswpin                   795         873        760         902
> > zswpout           10,011,266  13,195,137 10,010,017  13,193,554
> > thp_swpout                 0           0          0           0
> > thp_swpout_                0           0          0           0
> >  fallback
> > 64kB-mthp_           639,065     841,553      2,204       2,215
> >  swpout_fallback
> > pgmajfault             2,861       2,924      3,054       3,259
> > ZSWPOUT-64kB             n/a         n/a    623,451     822,268
> > SWPOUT-64kB                0           0          0           0
> > -------------------------------------------------------------------------------
> >
> >
> > 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> > =======================================================
> >
> > -------------------------------------------------------------------------------
> >                    mm-unstable 9-23-2024       zswap-mTHP         Change wrt
> >                     CONFIG_THP_SWAP=N       CONFIG_THP_SWAP=Y      Baseline
> >                          Baseline
> > -------------------------------------------------------------------------------
> > ZSWAP compressor        zstd    deflate-       zstd    deflate-   zstd deflate-
> >                                      iaa                    iaa            iaa
> > -------------------------------------------------------------------------------
> > Throughput (KB/s)    145,616     139,640    169,404     141,168    16%      1%
> > elapsed time (sec)     25.05       23.85      23.02       23.37     8%      2%
> > sys time (sec)        790.53      676.34     613.26      677.83    22%   -0.2%
> > memcg_high            16,702      25,197     17,374      23,890
> > memcg_swap_fail       21,485      27,814        114         144
> > pswpin                     0           0          0           0
> > pswpout                    0           0          0           0
> > zswpin                   793         852        778         922
> > zswpout           10,011,709  13,186,882 10,010,893  13,195,600
> > thp_swpout                 0           0          0           0
> > thp_swpout_           21,485      27,814        114         144
> >  fallback
> > 2048kB-mthp_             n/a         n/a          0           0
> >  swpout_fallback
> > pgmajfault             2,701       2,822      4,151       5,066
> > ZSWPOUT-2048kB           n/a         n/a     19,442      25,615
> > SWPOUT-2048kB              0           0          0           0
> > -------------------------------------------------------------------------------
> >
> > We mostly see improvements in throughput, elapsed and sys time for zstd
> > and deflate-iaa, when comparing before (THP_SWAP=N) vs. after
> > (THP_SWAP=Y).
> >
> >
> > Case 2: Comparing SSD swap mTHP vs. zswap mTHP
> > ==============================================
> >
> > In this scenario, CONFIG_THP_SWAP is enabled in "before" and "after"
> > experiments. The "before" represents zswap rejecting mTHP, and the mTHP
> > being stored by the 823G SSD swap. The "after" represents data with this
> > patch-series, that results in 64K/2M (m)THP being processed and stored by
> > zswap.
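[As a reading aid for these tables: the "Change wrt Baseline" columns follow a sign convention where positive always means an improvement — throughput uses (after - before) / before, while the time metrics use (before - after) / before. A minimal sketch with the 64K zstd values hard-coded from the Case 1 table; the helper name is ours, not part of the series or its test harness:]

```python
# Case 1, 64K mTHP, zstd columns from the cover letter's table.
before = {"throughput": 143_323, "elapsed": 24.97, "sys": 822.72}
after = {"throughput": 153_550, "elapsed": 23.90, "sys": 757.70}

def change_pct(metric, higher_is_better):
    b, a = before[metric], after[metric]
    # Throughput improves when it goes up; elapsed/sys improve when down.
    delta = (a - b) if higher_is_better else (b - a)
    return round(100 * delta / b)

print(change_pct("throughput", True))   # table reports 7%
print(change_pct("elapsed", False))     # table reports 4%
print(change_pct("sys", False))         # table reports 8%
```

The same convention explains the large negative sys-time deltas in the Case 2 tables below: sys time grows when compression moves onto the CPU, so its (before - after) / before change goes strongly negative.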
> >
> > 64KB mTHP (cgroup memory.high set to 40G):
> > ==========================================
> >
> > -------------------------------------------------------------------------------
> >                    mm-unstable 9-23-2024       zswap-mTHP         Change wrt
> >                     CONFIG_THP_SWAP=Y       CONFIG_THP_SWAP=Y      Baseline
> >                          Baseline
> > -------------------------------------------------------------------------------
> > ZSWAP compressor        zstd    deflate-       zstd    deflate-   zstd deflate-
> >                                      iaa                    iaa            iaa
> > -------------------------------------------------------------------------------
> > Throughput (KB/s)     20,265      20,696    153,550     129,609   658%    526%
> > elapsed time (sec)     72.44       70.86      23.90       25.19    67%     64%
> > sys time (sec)         77.95       77.99     757.70      731.13  -872%   -837%
> > memcg_high           115,811     113,277    148,075     192,744
> > memcg_swap_fail        2,386       2,425      2,204       2,215
> > pswpin                    16          16          0           0
> > pswpout            7,774,235   7,616,069          0           0
> > zswpin                   728         749        760         902
> > zswpout               38,424      39,022 10,010,017  13,193,554
> > thp_swpout                 0           0          0           0
> > thp_swpout_                0           0          0           0
> >  fallback
> > 64kB-mthp_             2,386       2,425      2,204       2,215
> >  swpout_fallback
> > pgmajfault             2,757       2,860      3,054       3,259
> > ZSWPOUT-64kB             n/a         n/a    623,451     822,268
> > SWPOUT-64kB          485,890     476,004          0           0
> > -------------------------------------------------------------------------------
> >
> >
> > 2MB PMD-THP/2048K mTHP (cgroup memory.high set to 40G):
> > =======================================================
> >
> > -------------------------------------------------------------------------------
> >                    mm-unstable 9-23-2024       zswap-mTHP         Change wrt
> >                     CONFIG_THP_SWAP=Y       CONFIG_THP_SWAP=Y      Baseline
> >                          Baseline
> > -------------------------------------------------------------------------------
> > ZSWAP compressor        zstd    deflate-       zstd    deflate-   zstd deflate-
> >                                      iaa                    iaa            iaa
> > -------------------------------------------------------------------------------
> > Throughput (KB/s)     24,347      35,971    169,404     141,168   596%    292%
> > elapsed time (sec)     63.52       64.59      23.02       23.37    64%     64%
> > sys time (sec)         27.91       27.01     613.26      677.83 -2098%  -2410%
> > memcg_high            13,576      13,467     17,374      23,890
> > memcg_swap_fail          162         124        114         144
> > pswpin                     0           0          0           0
> > pswpout            7,003,307   7,168,853          0           0
> > zswpin                   741         722        778         922
> > zswpout               84,429      65,315 10,010,893  13,195,600
> > thp_swpout            13,678      14,002          0           0
> > thp_swpout_              162         124        114         144
> >  fallback
> > 2048kB-mthp_             n/a         n/a          0           0
> >  swpout_fallback
> > pgmajfault             3,345       2,903      4,151       5,066
> > ZSWPOUT-2048kB           n/a         n/a     19,442      25,615
> > SWPOUT-2048kB         13,678      14,002          0           0
> > -------------------------------------------------------------------------------
> >
> > We see significant improvements in throughput and elapsed time for zstd
> > and deflate-iaa, when comparing before (mTHP-SSD) vs. after (mTHP-ZSWAP).
> > The sys time increases with mTHP-ZSWAP as expected, due to the CPU
> > compression time vs. asynchronous disk write times, as pointed out by
> > Ying and Yosry.
> >
> > In the "Before" scenario, when zswap does not store mTHP, only allocations
> > count towards the cgroup memory limit. However, in the "After" scenario,
> > with the introduction of zswap_store() mTHP, both, allocations as well as
> > the zswap compressed pool usage from all 70 processes are counted towards
> > the memory limit. As a result, we see higher swapout activity in the
> > "After" data. Hence, more time is spent doing reclaim as the zswap cgroup
> > charge leads to more frequent memory.high breaches.
> >
> > Summary:
> > ========
> > The v7 data presented above comparing zswap-mTHP with a conventional 823G
> > SSD swap demonstrates good performance improvements with zswap-mTHP.
> > Hence, it seems reasonable for zswap_store to support (m)THP, so that
> > further performance improvements can be implemented.
> >
> > Some of the ideas that have shown promise in our experiments are:
> >
> > 1) IAA compress/decompress batching.
> > 2) Distributing compress jobs across all IAA devices on the socket.
> >
> > In the experimental setup used in this patchset, we have enabled
> > IAA compress verification to ensure additional hardware data integrity CRC
> > checks not currently done by the software compressors. The tests run for
> > this patchset are also using only 1 IAA device per core, that avails of 2
> > compress engines on the device. In our experiments with IAA batching, we
> > distribute compress jobs from all cores to the 8 compress engines available
> > per socket. We further compress the pages in each mTHP in parallel in the
> > accelerator. As a result, we improve compress latency and reclaim
> > throughput.
> >
> > The following compares the same usemem workload characteristics
> > between:
> >
> > 1) zstd (v7 experiments)
> > 2) deflate-iaa "Fixed mode" (v7 experiments)
> > 3) deflate-iaa with batching
> > 4) deflate-iaa-canned "Canned mode" [3] with batching
> >
> > vm.page-cluster is set to "2" for all runs.
> >
> > 64K mTHP ZSWAP:
> > ===============
> >
> > -------------------------------------------------------------------------------
> > ZSWAP          zstd  IAA Fixed   IAA Fixed  IAA Canned    IAA     IAA     IAA
> > compressor     (v7)       (v7)  + Batching  + Batching  Batch  Canned  Canned
> >                                                           vs.     vs.   Batch
> > 64K mTHP                                                Seqtl   Fixed     vs.
> >                                                                         ZSTD
> > -------------------------------------------------------------------------------
> > Throughput    153,550    129,609    156,215    166,975    21%     7%     9%
> >  (KB/s)
> > elapsed time    23.90      25.19      22.46      21.38    11%     5%    11%
> >  (sec)
> > sys time       757.70     731.13     715.62     648.83     2%     9%    14%
> >  (sec)
> > memcg_high    148,075    192,744    197,548    181,734
> > memcg_swap_     2,204      2,215      2,293      2,263
> >  fail
> > pswpin              0          0          0          0
> > pswpout             0          0          0          0
> > zswpin            760        902        774        833
> > zswpout    10,010,017 13,193,554 13,193,176 12,125,616
> > thp_swpout          0          0          0          0
> > thp_swpout_         0          0          0          0
> >  fallback
> > 64kB-mthp_      2,204      2,215      2,293      2,263
> >  swpout_
> >  fallback
> > pgmajfault      3,054      3,259      3,545      3,516
> > ZSWPOUT-64kB  623,451    822,268    822,176    755,480
> > SWPOUT-64kB         0          0          0          0
> > swap_ra           146        161        152        159
> > swap_ra_hit        64        121         68         88
> > -------------------------------------------------------------------------------
> >
> >
> > 2M THP ZSWAP:
> > =============
> >
> > -------------------------------------------------------------------------------
> > ZSWAP          zstd  IAA Fixed   IAA Fixed  IAA Canned    IAA     IAA     IAA
> > compressor     (v7)       (v7)  + Batching  + Batching  Batch  Canned  Canned
> >                                                           vs.     vs.   Batch
> > 2M THP                                                  Seqtl   Fixed     vs.
> >                                                                         ZSTD
> > -------------------------------------------------------------------------------
> > Throughput    169,404    141,168    175,089    193,407    24%    10%    14%
> >  (KB/s)
> > elapsed time    23.02      23.37      21.13      19.97    10%     5%    13%
> >  (sec)
> > sys time       613.26     677.83     630.51     533.80     7%    15%    13%
> >  (sec)
> > memcg_high     17,374     23,890     24,349     22,374
> > memcg_swap_       114        144        102         88
> >  fail
> > pswpin              0          0          0          0
> > pswpout             0          0          0          0
> > zswpin            778        922      6,492      6,642
> > zswpout    10,010,893 13,195,600 13,199,907 12,132,265
> > thp_swpout          0          0          0          0
> > thp_swpout_       114        144        102         88
> >  fallback
> > pgmajfault      4,151      5,066      5,032      4,999
> > ZSWPOUT-2MB    19,442     25,615     25,666     23,594
> > SWPOUT-2MB          0          0          0          0
> > swap_ra             3          9      4,383      4,494
> > swap_ra_hit         2          6      4,298      4,412
> > -------------------------------------------------------------------------------
> >
> >
> > With ZSWAP IAA compress/decompress batching, we are able to demonstrate
> > significant performance improvements and memory savings in scalability
> > experiments under memory pressure, as compared to software compressors. We
> > hope to submit this work in subsequent patch series.
>
> Honestly I would remove the detailed results of the followup series
> for batching, it should be enough to mention a single figure for
> further expected improvement from ongoing work that depends on this.

Definitely, will summarize the results of batching in the cover letter
for v8.

Thanks,
Kanchana

> >
> > Thanks,
> > Kanchana
> >
> > [1] https://lore.kernel.org/linux-mm/20231019110543.3284654-1-ryan.roberts@arm.com/T/#u
> > [2] https://lore.kernel.org/linux-mm/20240408183946.2991168-1-ryan.roberts@arm.com/
> > [3] https://patchwork.kernel.org/project/linux-crypto/cover/cover.1710969449.git.andre.glover@linux.intel.com/
> >
> >
> > Kanchana P Sridhar (8):
> >   mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined.
> >   mm: zswap: Modify zswap_compress() to accept a page instead of a
> >     folio.
> >   mm: zswap: Refactor code to store an entry in zswap xarray.
> >   mm: zswap: Refactor code to delete stored offsets in case of errors.
> >   mm: zswap: Compress and store a specific page in a folio.
> >   mm: zswap: Support mTHP swapout in zswap_store().
> >   mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout
> >     stats.
> >   mm: Document the newly added mTHP zswpout stats, clarify swpout
> >     semantics.
> >
> >  Documentation/admin-guide/mm/transhuge.rst |   8 +-
> >  include/linux/huge_mm.h                    |   1 +
> >  include/linux/memcontrol.h                 |   4 +
> >  mm/Kconfig                                 |   8 +
> >  mm/huge_memory.c                           |   3 +
> >  mm/page_io.c                               |   1 +
> >  mm/zswap.c                                 | 248 ++++++++++++++++-----
> >  7 files changed, 210 insertions(+), 63 deletions(-)
> >
> >
> > base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
> > --
> > 2.27.0

^ permalink raw reply	[flat|nested] 79+ messages in thread
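[The store/cleanup flow described by the patch titles above — compress and store each page of the folio, and delete the already-stored offsets if any page fails, so the whole folio falls back to the backing swap device — can be modeled outside the kernel. The following is a hypothetical, simplified sketch: the dictionary stands in for the zswap xarray, and none of the names match the actual kernel helpers.]

```python
# Toy model of the all-or-nothing handling of an mTHP folio in
# zswap_store(): every subpage must compress and be inserted, otherwise
# every offset stored so far is deleted and the folio is rejected
# (falling back to the backing swap device).
def store_folio(xarray, base_offset, pages, compress):
    stored = []
    for i, page in enumerate(pages):
        entry = compress(page)
        if entry is None:                  # compression/store error
            for off in stored:             # delete the stored offsets
                del xarray[off]
            return False                   # reject the whole folio
        xarray[base_offset + i] = entry
        stored.append(base_offset + i)
    return True

xarray = {}
ok = store_folio(xarray, 100, ["a", "b"], lambda p: "z:" + p)
fail = store_folio(xarray, 200, ["c", "?"],
                   lambda p: None if p == "?" else "z:" + p)
print(ok, fail, sorted(xarray))   # True False [100, 101]
```

The second call shows the cleanup path: page "c" is stored at offset 200, the next page fails, and offset 200 is removed again, leaving only the first folio's entries behind.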
* Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios 2024-09-24 1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar ` (8 preceding siblings ...) 2024-09-24 19:34 ` [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Yosry Ahmed @ 2024-09-25 6:35 ` Huang, Ying 2024-09-25 18:39 ` Sridhar, Kanchana P 9 siblings, 1 reply; 79+ messages in thread From: Huang, Ying @ 2024-09-25 6:35 UTC (permalink / raw) To: Kanchana P Sridhar Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, 21cnbao, akpm, nanhai.zou, wajdi.k.feghali, vinodh.gopal Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes: [snip] > > Case 1: Comparing zswap 4K vs. zswap mTHP > ========================================= > > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in > 64K/2M (m)THP to be split into 4K folios that get processed by zswap. > > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that results > in 64K/2M (m)THP to not be split, and processed by zswap. 
> > 64KB mTHP (cgroup memory.high set to 40G): > ========================================== > > ------------------------------------------------------------------------------- > mm-unstable 9-23-2024 zswap-mTHP Change wrt > CONFIG_THP_SWAP=N CONFIG_THP_SWAP=Y Baseline > Baseline > ------------------------------------------------------------------------------- > ZSWAP compressor zstd deflate- zstd deflate- zstd deflate- > iaa iaa iaa > ------------------------------------------------------------------------------- > Throughput (KB/s) 143,323 125,485 153,550 129,609 7% 3% > elapsed time (sec) 24.97 25.42 23.90 25.19 4% 1% > sys time (sec) 822.72 750.96 757.70 731.13 8% 3% > memcg_high 132,743 169,825 148,075 192,744 > memcg_swap_fail 639,067 841,553 2,204 2,215 > pswpin 0 0 0 0 > pswpout 0 0 0 0 > zswpin 795 873 760 902 > zswpout 10,011,266 13,195,137 10,010,017 13,193,554 > thp_swpout 0 0 0 0 > thp_swpout_ 0 0 0 0 > fallback > 64kB-mthp_ 639,065 841,553 2,204 2,215 > swpout_fallback > pgmajfault 2,861 2,924 3,054 3,259 > ZSWPOUT-64kB n/a n/a 623,451 822,268 > SWPOUT-64kB 0 0 0 0 > ------------------------------------------------------------------------------- > IIUC, the throughput is the sum of throughput of all usemem processes? One possible issue of usemem test case is the "imbalance" issue. That is, some usemem processes may swap-out/swap-in less, so the score is very high; while some other processes may swap-out/swap-in more, so the score is very low. Sometimes, the total score decreases, but the scores of usemem processes are more balanced, so that the performance should be considered better. And, in general, we should make usemem score balanced among processes via say longer test time. Can you check this in your test results? [snip] -- Best Regards, Huang, Ying ^ permalink raw reply [flat|nested] 79+ messages in thread
* RE: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios 2024-09-25 6:35 ` Huang, Ying @ 2024-09-25 18:39 ` Sridhar, Kanchana P 2024-09-26 0:44 ` Huang, Ying 0 siblings, 1 reply; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-25 18:39 UTC (permalink / raw) To: Huang, Ying Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P > -----Original Message----- > From: Huang, Ying <ying.huang@intel.com> > Sent: Tuesday, September 24, 2024 11:35 PM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com; > chengming.zhou@linux.dev; usamaarif642@gmail.com; > shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com; > akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, > Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh > <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios > > Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes: > > [snip] > > > > > Case 1: Comparing zswap 4K vs. zswap mTHP > > ========================================= > > > > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in > > 64K/2M (m)THP to be split into 4K folios that get processed by zswap. > > > > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that > results > > in 64K/2M (m)THP to not be split, and processed by zswap. 
> > > > 64KB mTHP (cgroup memory.high set to 40G): > > ========================================== > > > > ------------------------------------------------------------------------------- > > mm-unstable 9-23-2024 zswap-mTHP Change wrt > > CONFIG_THP_SWAP=N CONFIG_THP_SWAP=Y Baseline > > Baseline > > ------------------------------------------------------------------------------- > > ZSWAP compressor zstd deflate- zstd deflate- zstd deflate- > > iaa iaa iaa > > ------------------------------------------------------------------------------- > > Throughput (KB/s) 143,323 125,485 153,550 129,609 7% 3% > > elapsed time (sec) 24.97 25.42 23.90 25.19 4% 1% > > sys time (sec) 822.72 750.96 757.70 731.13 8% 3% > > memcg_high 132,743 169,825 148,075 192,744 > > memcg_swap_fail 639,067 841,553 2,204 2,215 > > pswpin 0 0 0 0 > > pswpout 0 0 0 0 > > zswpin 795 873 760 902 > > zswpout 10,011,266 13,195,137 10,010,017 13,193,554 > > thp_swpout 0 0 0 0 > > thp_swpout_ 0 0 0 0 > > fallback > > 64kB-mthp_ 639,065 841,553 2,204 2,215 > > swpout_fallback > > pgmajfault 2,861 2,924 3,054 3,259 > > ZSWPOUT-64kB n/a n/a 623,451 822,268 > > SWPOUT-64kB 0 0 0 0 > > ------------------------------------------------------------------------------- > > > > IIUC, the throughput is the sum of throughput of all usemem processes? > > One possible issue of usemem test case is the "imbalance" issue. That > is, some usemem processes may swap-out/swap-in less, so the score is > very high; while some other processes may swap-out/swap-in more, so the > score is very low. Sometimes, the total score decreases, but the scores > of usemem processes are more balanced, so that the performance should be > considered better. And, in general, we should make usemem score > balanced among processes via say longer test time. Can you check this > in your test results? Actually, the throughput data listed in the cover-letter is the average of all the usemem processes. 
Your observation about the "imbalance" issue is
right. Some processes see a higher throughput than others. I have noticed
that the throughputs progressively reduce as the individual processes exit
and print their stats.

Listed below are the stats from two runs of usemem70: sleep 10 and sleep 30.
Both are run with a cgroup mem-limit of 40G. Data is with v7, 64K folios are
enabled, zswap uses zstd.


-----------------------------------------------
      sleep 10           sleep 30
 Throughput (KB/s)  Throughput (KB/s)
-----------------------------------------------
       181,540            191,686
       179,651            191,459
       179,068            188,834
       177,244            187,568
       177,215            186,703
       176,565            185,584
       176,546            185,370
       176,470            185,021
       176,214            184,303
       176,128            184,040
       175,279            183,932
       174,745            180,831
       173,935            179,418
       161,546            168,014
       160,332            167,540
       160,122            167,364
       159,613            167,020
       159,546            166,590
       159,021            166,483
       158,845            166,418
       158,426            166,264
       158,396            166,066
       158,371            165,944
       158,298            165,866
       158,250            165,884
       158,057            165,533
       158,011            165,532
       157,899            165,457
       157,894            165,424
       157,839            165,410
       157,731            165,407
       157,629            165,273
       157,626            164,867
       157,581            164,636
       157,471            164,266
       157,430            164,225
       157,287            163,290
       156,289            153,597
       153,970            147,494
       148,244            147,102
       142,907            146,111
       142,811            145,789
       139,171            141,168
       136,314            140,714
       133,616            140,111
       132,881            139,636
       132,729            136,943
       132,680            136,844
       132,248            135,726
       132,027            135,384
       131,929            135,270
       131,766            134,748
       131,667            134,733
       131,576            134,582
       131,396            134,302
       131,351            134,160
       131,135            134,102
       130,885            134,097
       130,854            134,058
       130,767            134,006
       130,666            133,960
       130,647            133,894
       130,152            133,837
       130,006            133,747
       129,921            133,679
       129,856            133,666
       129,377            133,564
       128,366            133,331
       127,988            132,938
       126,903            132,746
-----------------------------------------------
 sum    10,526,916       10,919,561
 average   150,385          155,994
 stddev     17,551           19,633
-----------------------------------------------
 elapsed     24.40            43.66
  time (sec)
 sys time   806.25           766.05
  (sec)
 zswpout 10,008,713       10,008,407
 64K folio  623,463          623,629
  swpout
-----------------------------------------------

As we increase the time for which allocations are maintained,
there seems to be a slight improvement in throughput, but the
variance increases as well. The processes with lower throughput
could be the ones that handle the memcg being over limit by
doing reclaim, possibly before they can allocate.

Interestingly, the longer test time does seem to reduce the amount
of reclaim (hence lower sys time), but more 64K large folios seem to
be reclaimed. Could this mean that with longer test time (sleep 30),
more cold memory residing in large folios is getting reclaimed, as
against memory just relinquished by the exiting processes?

Thanks,
Kanchana

>
> [snip]
>
> --
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 79+ messages in thread
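[One way to quantify the per-process imbalance discussed in this exchange is the spread (max - min) relative to the mean throughput. A back-of-the-envelope check with the extremes and averages from the table above — not something run as part of the original tests:]

```python
# (max, min, mean) per-process throughput in KB/s, taken from the table.
runs = {
    "sleep 10": (181_540, 126_903, 150_385),
    "sleep 30": (191_686, 132_746, 155_994),
}

for name, (hi, lo, mean) in runs.items():
    spread = 100 * (hi - lo) / mean
    print(f"{name}: spread = {spread:.1f}% of mean")
# sleep 10: spread = 36.3% of mean
# sleep 30: spread = 37.8% of mean
```

The spread is essentially unchanged by the longer sleep, which is consistent with the later observation in the thread that a longer sleep time does not help balance much.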
* Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios 2024-09-25 18:39 ` Sridhar, Kanchana P @ 2024-09-26 0:44 ` Huang, Ying 2024-09-26 3:48 ` Sridhar, Kanchana P 0 siblings, 1 reply; 79+ messages in thread From: Huang, Ying @ 2024-09-26 0:44 UTC (permalink / raw) To: Sridhar, Kanchana P Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes: >> -----Original Message----- >> From: Huang, Ying <ying.huang@intel.com> >> Sent: Tuesday, September 24, 2024 11:35 PM >> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> >> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; >> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com; >> chengming.zhou@linux.dev; usamaarif642@gmail.com; >> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com; >> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, >> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh >> <vinodh.gopal@intel.com> >> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios >> >> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes: >> >> [snip] >> >> > >> > Case 1: Comparing zswap 4K vs. zswap mTHP >> > ========================================= >> > >> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that results in >> > 64K/2M (m)THP to be split into 4K folios that get processed by zswap. >> > >> > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that >> results >> > in 64K/2M (m)THP to not be split, and processed by zswap. 
>> > >> > 64KB mTHP (cgroup memory.high set to 40G): >> > ========================================== >> > >> > ------------------------------------------------------------------------------- >> > mm-unstable 9-23-2024 zswap-mTHP Change wrt >> > CONFIG_THP_SWAP=N CONFIG_THP_SWAP=Y Baseline >> > Baseline >> > ------------------------------------------------------------------------------- >> > ZSWAP compressor zstd deflate- zstd deflate- zstd deflate- >> > iaa iaa iaa >> > ------------------------------------------------------------------------------- >> > Throughput (KB/s) 143,323 125,485 153,550 129,609 7% 3% >> > elapsed time (sec) 24.97 25.42 23.90 25.19 4% 1% >> > sys time (sec) 822.72 750.96 757.70 731.13 8% 3% >> > memcg_high 132,743 169,825 148,075 192,744 >> > memcg_swap_fail 639,067 841,553 2,204 2,215 >> > pswpin 0 0 0 0 >> > pswpout 0 0 0 0 >> > zswpin 795 873 760 902 >> > zswpout 10,011,266 13,195,137 10,010,017 13,193,554 >> > thp_swpout 0 0 0 0 >> > thp_swpout_ 0 0 0 0 >> > fallback >> > 64kB-mthp_ 639,065 841,553 2,204 2,215 >> > swpout_fallback >> > pgmajfault 2,861 2,924 3,054 3,259 >> > ZSWPOUT-64kB n/a n/a 623,451 822,268 >> > SWPOUT-64kB 0 0 0 0 >> > ------------------------------------------------------------------------------- >> > >> >> IIUC, the throughput is the sum of throughput of all usemem processes? >> >> One possible issue of usemem test case is the "imbalance" issue. That >> is, some usemem processes may swap-out/swap-in less, so the score is >> very high; while some other processes may swap-out/swap-in more, so the >> score is very low. Sometimes, the total score decreases, but the scores >> of usemem processes are more balanced, so that the performance should be >> considered better. And, in general, we should make usemem score >> balanced among processes via say longer test time. Can you check this >> in your test results? > > Actually, the throughput data listed in the cover-letter is the average of > all the usemem processes. 
Your observation about the "imbalance" issue is > right. Some processes see a higher throughput than others. I have noticed > that the throughputs progressively reduce as the individual processes exit > and print their stats. > > Listed below are the stats from two runs of usemem70: sleep 10 and sleep 30. > Both are run with a cgroup mem-limit of 40G. Data is with v7, 64K folios are > enabled, zswap uses zstd. > > > ----------------------------------------------- > sleep 10 sleep 30 > Throughput (KB/s) Throughput (KB/s) > ----------------------------------------------- > 181,540 191,686 > 179,651 191,459 > 179,068 188,834 > 177,244 187,568 > 177,215 186,703 > 176,565 185,584 > 176,546 185,370 > 176,470 185,021 > 176,214 184,303 > 176,128 184,040 > 175,279 183,932 > 174,745 180,831 > 173,935 179,418 > 161,546 168,014 > 160,332 167,540 > 160,122 167,364 > 159,613 167,020 > 159,546 166,590 > 159,021 166,483 > 158,845 166,418 > 158,426 166,264 > 158,396 166,066 > 158,371 165,944 > 158,298 165,866 > 158,250 165,884 > 158,057 165,533 > 158,011 165,532 > 157,899 165,457 > 157,894 165,424 > 157,839 165,410 > 157,731 165,407 > 157,629 165,273 > 157,626 164,867 > 157,581 164,636 > 157,471 164,266 > 157,430 164,225 > 157,287 163,290 > 156,289 153,597 > 153,970 147,494 > 148,244 147,102 > 142,907 146,111 > 142,811 145,789 > 139,171 141,168 > 136,314 140,714 > 133,616 140,111 > 132,881 139,636 > 132,729 136,943 > 132,680 136,844 > 132,248 135,726 > 132,027 135,384 > 131,929 135,270 > 131,766 134,748 > 131,667 134,733 > 131,576 134,582 > 131,396 134,302 > 131,351 134,160 > 131,135 134,102 > 130,885 134,097 > 130,854 134,058 > 130,767 134,006 > 130,666 133,960 > 130,647 133,894 > 130,152 133,837 > 130,006 133,747 > 129,921 133,679 > 129,856 133,666 > 129,377 133,564 > 128,366 133,331 > 127,988 132,938 > 126,903 132,746 > ----------------------------------------------- > sum 10,526,916 10,919,561 > average 150,385 155,994 > stddev 17,551 19,633 > 
-----------------------------------------------
> elapsed      24.40            43.66
>  time (sec)
> sys time    806.25           766.05
>  (sec)
> zswpout 10,008,713       10,008,407
> 64K folio  623,463          623,629
>  swpout
> -----------------------------------------------

Although there is some imbalance, I don't find it too severe. So, I
think the test result is reasonable. Please pay attention to the
imbalance issue in future tests.

> As we increase the time for which allocations are maintained,
> there seems to be a slight improvement in throughput, but the
> variance increases as well. The processes with lower throughput
> could be the ones that handle the memcg being over limit by
> doing reclaim, possibly before they can allocate.
>
> Interestingly, the longer test time does seem to reduce the amount
> of reclaim (hence lower sys time), but more 64K large folios seem to
> be reclaimed. Could this mean that with longer test time (sleep 30),
> more cold memory residing in large folios is getting reclaimed, as
> against memory just relinquished by the exiting processes?

I don't think a longer sleep time in the test helps much with balance.
Can you try with fewer processes, and a larger memory size per process?
I guess that this will improve the balance.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 79+ messages in thread
* RE: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios 2024-09-26 0:44 ` Huang, Ying @ 2024-09-26 3:48 ` Sridhar, Kanchana P 2024-09-26 6:47 ` Huang, Ying 0 siblings, 1 reply; 79+ messages in thread From: Sridhar, Kanchana P @ 2024-09-26 3:48 UTC (permalink / raw) To: Huang, Ying Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs, chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, 21cnbao, akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P Hi Ying, > -----Original Message----- > From: Huang, Ying <ying.huang@intel.com> > Sent: Wednesday, September 25, 2024 5:45 PM > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com; > chengming.zhou@linux.dev; usamaarif642@gmail.com; > shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com; > akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; Feghali, > Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh > <vinodh.gopal@intel.com> > Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios > > "Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes: > > >> -----Original Message----- > >> From: Huang, Ying <ying.huang@intel.com> > >> Sent: Tuesday, September 24, 2024 11:35 PM > >> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> > >> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org; > >> hannes@cmpxchg.org; yosryahmed@google.com; nphamcs@gmail.com; > >> chengming.zhou@linux.dev; usamaarif642@gmail.com; > >> shakeel.butt@linux.dev; ryan.roberts@arm.com; 21cnbao@gmail.com; > >> akpm@linux-foundation.org; Zou, Nanhai <nanhai.zou@intel.com>; > Feghali, > >> Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh > >> <vinodh.gopal@intel.com> > >> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios > >> > >> Kanchana P Sridhar <kanchana.p.sridhar@intel.com> writes: > >> > >> [snip] > >> > >> > > >> > Case 1: Comparing zswap 4K vs. 
zswap mTHP > >> > ========================================= > >> > > >> > In this scenario, the "before" is CONFIG_THP_SWAP set to off, that > results in > >> > 64K/2M (m)THP to be split into 4K folios that get processed by zswap. > >> > > >> > The "after" is CONFIG_THP_SWAP set to on, and this patch-series, that > >> results > >> > in 64K/2M (m)THP to not be split, and processed by zswap. > >> > > >> > 64KB mTHP (cgroup memory.high set to 40G): > >> > ========================================== > >> > > >> > ------------------------------------------------------------------------------- > >> > mm-unstable 9-23-2024 zswap-mTHP Change wrt > >> > CONFIG_THP_SWAP=N CONFIG_THP_SWAP=Y > Baseline > >> > Baseline > >> > ------------------------------------------------------------------------------- > >> > ZSWAP compressor zstd deflate- zstd deflate- zstd deflate- > >> > iaa iaa iaa > >> > ------------------------------------------------------------------------------- > >> > Throughput (KB/s) 143,323 125,485 153,550 129,609 7% > 3% > >> > elapsed time (sec) 24.97 25.42 23.90 25.19 4% 1% > >> > sys time (sec) 822.72 750.96 757.70 731.13 8% 3% > >> > memcg_high 132,743 169,825 148,075 192,744 > >> > memcg_swap_fail 639,067 841,553 2,204 2,215 > >> > pswpin 0 0 0 0 > >> > pswpout 0 0 0 0 > >> > zswpin 795 873 760 902 > >> > zswpout 10,011,266 13,195,137 10,010,017 13,193,554 > >> > thp_swpout 0 0 0 0 > >> > thp_swpout_ 0 0 0 0 > >> > fallback > >> > 64kB-mthp_ 639,065 841,553 2,204 2,215 > >> > swpout_fallback > >> > pgmajfault 2,861 2,924 3,054 3,259 > >> > ZSWPOUT-64kB n/a n/a 623,451 822,268 > >> > SWPOUT-64kB 0 0 0 0 > >> > ------------------------------------------------------------------------------- > >> > > >> > >> IIUC, the throughput is the sum of throughput of all usemem processes? > >> > >> One possible issue of usemem test case is the "imbalance" issue. 
That > >> is, some usemem processes may swap-out/swap-in less, so the score is > >> very high; while some other processes may swap-out/swap-in more, so the > >> score is very low. Sometimes, the total score decreases, but the scores > >> of usemem processes are more balanced, so that the performance should > be > >> considered better. And, in general, we should make usemem score > >> balanced among processes via say longer test time. Can you check this > >> in your test results? > > > > Actually, the throughput data listed in the cover-letter is the average of > > all the usemem processes. Your observation about the "imbalance" issue is > > right. Some processes see a higher throughput than others. I have noticed > > that the throughputs progressively reduce as the individual processes exit > > and print their stats. > > > > Listed below are the stats from two runs of usemem70: sleep 10 and sleep > 30. > > Both are run with a cgroup mem-limit of 40G. Data is with v7, 64K folios are > > enabled, zswap uses zstd. 
> > > > > > ----------------------------------------------- > > sleep 10 sleep 30 > > Throughput (KB/s) Throughput (KB/s) > > ----------------------------------------------- > > 181,540 191,686 > > 179,651 191,459 > > 179,068 188,834 > > 177,244 187,568 > > 177,215 186,703 > > 176,565 185,584 > > 176,546 185,370 > > 176,470 185,021 > > 176,214 184,303 > > 176,128 184,040 > > 175,279 183,932 > > 174,745 180,831 > > 173,935 179,418 > > 161,546 168,014 > > 160,332 167,540 > > 160,122 167,364 > > 159,613 167,020 > > 159,546 166,590 > > 159,021 166,483 > > 158,845 166,418 > > 158,426 166,264 > > 158,396 166,066 > > 158,371 165,944 > > 158,298 165,866 > > 158,250 165,884 > > 158,057 165,533 > > 158,011 165,532 > > 157,899 165,457 > > 157,894 165,424 > > 157,839 165,410 > > 157,731 165,407 > > 157,629 165,273 > > 157,626 164,867 > > 157,581 164,636 > > 157,471 164,266 > > 157,430 164,225 > > 157,287 163,290 > > 156,289 153,597 > > 153,970 147,494 > > 148,244 147,102 > > 142,907 146,111 > > 142,811 145,789 > > 139,171 141,168 > > 136,314 140,714 > > 133,616 140,111 > > 132,881 139,636 > > 132,729 136,943 > > 132,680 136,844 > > 132,248 135,726 > > 132,027 135,384 > > 131,929 135,270 > > 131,766 134,748 > > 131,667 134,733 > > 131,576 134,582 > > 131,396 134,302 > > 131,351 134,160 > > 131,135 134,102 > > 130,885 134,097 > > 130,854 134,058 > > 130,767 134,006 > > 130,666 133,960 > > 130,647 133,894 > > 130,152 133,837 > > 130,006 133,747 > > 129,921 133,679 > > 129,856 133,666 > > 129,377 133,564 > > 128,366 133,331 > > 127,988 132,938 > > 126,903 132,746 > > ----------------------------------------------- > > sum 10,526,916 10,919,561 > > average 150,385 155,994 > > stddev 17,551 19,633 > > ----------------------------------------------- > > elapsed 24.40 43.66 > > time (sec) > > sys time 806.25 766.05 > > (sec) > > zswpout 10,008,713 10,008,407 > > 64K folio 623,463 623,629 > > swpout > > ----------------------------------------------- > > Although there are some 
imbalance, I don't find it's too much.  So, I
> think the test result is reasonable.  Please pay attention to the
> imbalance issue in the future tests.

Sure, will do so.

>
> > As we increase the time for which allocations are maintained,
> > there seems to be a slight improvement in throughput, but the
> > variance increases as well. The processes with lower throughput
> > could be the ones that handle the memcg being over limit by
> > doing reclaim, possibly before they can allocate.
> >
> > Interestingly, the longer test time does seem to reduce the amount
> > of reclaim (hence lower sys time), but more 64K large folios seem to
> > be reclaimed. Could this mean that with longer test time (sleep 30),
> > more cold memory residing in large folios is getting reclaimed, as
> > against memory just relinquished by the exiting processes?
>
> I don't think longer sleep time in test helps much to balance. Can you
> try with less process, and larger memory size per process? I guess that
> this will improve balance.
I tried this, and the data is listed below: usemem options: --------------- 30 processes allocate 10G each cgroup memory limit = 150G sleep 10 525Gi SSD disk swap partition 64K large folios enabled Throughput (KB/s) of each of the 30 processes: --------------------------------------------------------------- mm-unstable zswap_store of large folios 9-25-2024 v7 zswap compressor: zstd zstd deflate-iaa --------------------------------------------------------------- 38,393 234,485 374,427 37,283 215,528 314,225 37,156 214,942 304,413 37,143 213,073 304,146 36,814 212,904 290,186 36,277 212,304 288,212 36,104 212,207 285,682 36,000 210,173 270,661 35,994 208,487 256,960 35,979 207,788 248,313 35,967 207,714 235,338 35,966 207,703 229,335 35,835 207,690 221,697 35,793 207,418 221,600 35,692 206,160 219,346 35,682 206,128 219,162 35,681 205,817 219,155 35,678 205,546 214,862 35,678 205,523 214,710 35,677 204,951 214,282 35,677 204,283 213,441 35,677 203,348 213,011 35,675 203,028 212,923 35,673 201,922 212,492 35,672 201,660 212,225 35,672 200,724 211,808 35,672 200,324 211,420 35,671 199,686 211,413 35,667 198,858 211,346 35,667 197,590 211,209 --------------------------------------------------------------- sum 1,081,515 6,217,964 7,268,000 average 36,051 207,265 242,267 stddev 655 7,010 42,234 elapsed time (sec) 343.70 107.40 84.34 sys time (sec) 269.30 2,520.13 1,696.20 memcg.high breaches 443,672 475,074 623,333 zswpout 22,605 48,931,249 54,777,100 pswpout 40,004,528 0 0 hugepages-64K zswpout 0 3,057,090 3,421,855 hugepages-64K swpout 2,500,283 0 0 --------------------------------------------------------------- As you can see, this is quite a memory-constrained scenario, where we are giving a 50% of total memory required, as the memory limit for the cgroup in which the 30 processes are run. This causes significantly more reclaim activity than the setup I was using thus far (70 processes, 1G, 40G limit). 
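[Editor's note: the counters in the table above come from /proc/vmstat and the per-order mTHP sysfs stats, sampled before and after the run. A minimal collection sketch follows, assuming a kernel with the per-order zswpout counter added by this series; the helper names are illustrative, the file paths are the standard ones.]

```python
# Sketch (an illustration, not code from this series) of collecting the
# swap-out counters above: diff /proc/vmstat and the per-order mTHP
# sysfs stats before and after the usemem run.

def parse_kv(text):
    """Parse 'name value' lines (the /proc/vmstat format) into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats

def deltas(before, after, keys):
    """Per-counter increase over the run, for the counters of interest."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in keys}

def read_vmstat():
    with open('/proc/vmstat') as f:
        return parse_kv(f.read())

def read_mthp_zswpout(size_kb=64):
    # Per-order counter added by this patch-series:
    #   /sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/stats/zswpout
    path = ('/sys/kernel/mm/transparent_hugepage/'
            f'hugepages-{size_kb}kB/stats/zswpout')
    with open(path) as f:
        return int(f.read())

# Intended usage: before = read_vmstat(); <run usemem>; after = read_vmstat();
# print(deltas(before, after, ('pswpin', 'pswpout', 'zswpin', 'zswpout')))
```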
The variance or "imbalance" reduces somewhat for zstd, but not for IAA. IAA shows really good throughput (17%) and elapsed time (21%) and sys time (33%) improvement wrt zstd with zswap_store of large folios. These are the memory-constrained scenarios in which IAA typically does really well. IAA verify_compress is enabled, so this is an added data integrity checks benefit we get with IAA. I would like to get your and the maintainers' feedback on whether I should switch to this "usemem30-10G" setup for v8? Thanks, Kanchana > > -- > Best Regards, > Huang, Ying ^ permalink raw reply [flat|nested] 79+ messages in thread
* Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
  2024-09-26  3:48               ` Sridhar, Kanchana P
@ 2024-09-26  6:47                 ` Huang, Ying
  2024-09-26 21:44                   ` Sridhar, Kanchana P
  0 siblings, 1 reply; 79+ messages in thread
From: Huang, Ying @ 2024-09-26 6:47 UTC (permalink / raw)
To: Sridhar, Kanchana P
Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
    chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, 21cnbao,
    akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh

"Sridhar, Kanchana P" <kanchana.p.sridhar@intel.com> writes:

[snip]

> I would like to get your and the maintainers' feedback on whether
> I should switch to this "usemem30-10G" setup for v8?

The results looks good to me. I suggest you to use it.

--
Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 79+ messages in thread
* RE: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
  2024-09-26  6:47                 ` Huang, Ying
@ 2024-09-26 21:44                   ` Sridhar, Kanchana P
  0 siblings, 0 replies; 79+ messages in thread
From: Sridhar, Kanchana P @ 2024-09-26 21:44 UTC (permalink / raw)
To: Huang, Ying
Cc: linux-kernel, linux-mm, hannes, yosryahmed, nphamcs,
    chengming.zhou, usamaarif642, shakeel.butt, ryan.roberts, 21cnbao,
    akpm, Zou, Nanhai, Feghali, Wajdi K, Gopal, Vinodh, Sridhar, Kanchana P

> -----Original Message-----
> From: Huang, Ying <ying.huang@intel.com>
> Sent: Wednesday, September 25, 2024 11:48 PM
> Subject: Re: [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios
>
> [snip]
>
> > I would like to get your and the maintainers' feedback on whether
> > I should switch to this "usemem30-10G" setup for v8?
>
> The results looks good to me. I suggest you to use it.

Ok, sure, thanks Ying.

Thanks,
Kanchana

> --
> Best Regards,
> Huang, Ying

^ permalink raw reply	[flat|nested] 79+ messages in thread
end of thread, other threads:[~2024-09-26 21:44 UTC | newest] Thread overview: 79+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2024-09-24 1:17 [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Kanchana P Sridhar 2024-09-24 1:17 ` [PATCH v7 1/8] mm: Define obj_cgroup_get() if CONFIG_MEMCG is not defined Kanchana P Sridhar 2024-09-24 16:45 ` Nhat Pham 2024-09-24 1:17 ` [PATCH v7 2/8] mm: zswap: Modify zswap_compress() to accept a page instead of a folio Kanchana P Sridhar 2024-09-24 16:50 ` Nhat Pham 2024-09-24 1:17 ` [PATCH v7 3/8] mm: zswap: Refactor code to store an entry in zswap xarray Kanchana P Sridhar 2024-09-24 17:16 ` Nhat Pham 2024-09-24 20:40 ` Sridhar, Kanchana P 2024-09-24 19:14 ` Yosry Ahmed 2024-09-24 22:22 ` Sridhar, Kanchana P 2024-09-24 1:17 ` [PATCH v7 4/8] mm: zswap: Refactor code to delete stored offsets in case of errors Kanchana P Sridhar 2024-09-24 17:25 ` Nhat Pham 2024-09-24 20:41 ` Sridhar, Kanchana P 2024-09-24 19:20 ` Yosry Ahmed 2024-09-24 22:32 ` Sridhar, Kanchana P 2024-09-25 0:43 ` Yosry Ahmed 2024-09-25 1:18 ` Sridhar, Kanchana P 2024-09-25 14:11 ` Johannes Weiner 2024-09-25 18:45 ` Sridhar, Kanchana P 2024-09-24 1:17 ` [PATCH v7 5/8] mm: zswap: Compress and store a specific page in a folio Kanchana P Sridhar 2024-09-24 19:28 ` Yosry Ahmed 2024-09-24 22:45 ` Sridhar, Kanchana P 2024-09-25 0:47 ` Yosry Ahmed 2024-09-25 1:49 ` Sridhar, Kanchana P 2024-09-25 13:53 ` Johannes Weiner 2024-09-25 18:45 ` Sridhar, Kanchana P 2024-09-24 1:17 ` [PATCH v7 6/8] mm: zswap: Support mTHP swapout in zswap_store() Kanchana P Sridhar 2024-09-24 17:33 ` Nhat Pham 2024-09-24 20:51 ` Sridhar, Kanchana P 2024-09-24 21:08 ` Nhat Pham 2024-09-24 21:34 ` Yosry Ahmed 2024-09-24 22:16 ` Nhat Pham 2024-09-24 22:18 ` Sridhar, Kanchana P 2024-09-24 22:28 ` Yosry Ahmed 2024-09-24 22:17 ` Sridhar, Kanchana P 2024-09-24 19:38 ` Yosry Ahmed 2024-09-24 20:51 ` Nhat Pham 2024-09-24 21:38 ` Yosry Ahmed 2024-09-24 
23:11 ` Nhat Pham 2024-09-25 0:05 ` Sridhar, Kanchana P 2024-09-25 0:52 ` Yosry Ahmed 2024-09-24 23:21 ` Sridhar, Kanchana P 2024-09-24 23:02 ` Sridhar, Kanchana P 2024-09-25 13:40 ` Johannes Weiner 2024-09-25 18:30 ` Yosry Ahmed 2024-09-25 19:10 ` Sridhar, Kanchana P 2024-09-25 19:49 ` Yosry Ahmed 2024-09-25 20:49 ` Johannes Weiner 2024-09-25 19:20 ` Johannes Weiner 2024-09-25 19:39 ` Yosry Ahmed 2024-09-25 20:13 ` Johannes Weiner 2024-09-25 21:06 ` Yosry Ahmed 2024-09-25 22:29 ` Sridhar, Kanchana P 2024-09-26 3:58 ` Sridhar, Kanchana P 2024-09-26 4:52 ` Yosry Ahmed 2024-09-26 16:40 ` Sridhar, Kanchana P 2024-09-26 17:19 ` Yosry Ahmed 2024-09-26 17:29 ` Sridhar, Kanchana P 2024-09-26 17:34 ` Yosry Ahmed 2024-09-26 19:36 ` Sridhar, Kanchana P 2024-09-26 18:43 ` Johannes Weiner 2024-09-26 18:45 ` Yosry Ahmed 2024-09-26 19:40 ` Sridhar, Kanchana P 2024-09-26 19:39 ` Sridhar, Kanchana P 2024-09-25 14:27 ` Johannes Weiner 2024-09-25 18:17 ` Yosry Ahmed 2024-09-25 18:48 ` Sridhar, Kanchana P 2024-09-24 1:17 ` [PATCH v7 7/8] mm: swap: Count successful mTHP ZSWAP stores in sysfs mTHP zswpout stats Kanchana P Sridhar 2024-09-24 1:17 ` [PATCH v7 8/8] mm: Document the newly added mTHP zswpout stats, clarify swpout semantics Kanchana P Sridhar 2024-09-24 17:36 ` Nhat Pham 2024-09-24 20:52 ` Sridhar, Kanchana P 2024-09-24 19:34 ` [PATCH v7 0/8] mm: ZSWAP swap-out of mTHP folios Yosry Ahmed 2024-09-24 22:50 ` Sridhar, Kanchana P 2024-09-25 6:35 ` Huang, Ying 2024-09-25 18:39 ` Sridhar, Kanchana P 2024-09-26 0:44 ` Huang, Ying 2024-09-26 3:48 ` Sridhar, Kanchana P 2024-09-26 6:47 ` Huang, Ying 2024-09-26 21:44 ` Sridhar, Kanchana P