* [PATCH v8 0/3] mm: enable large folios swap-in support
From: Barry Song @ 2024-09-06 0:10 UTC
To: akpm, linux-mm
Cc: hanchuanhua, baolin.wang, chrisl, david, hannes, hch, hughd,
kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs,
ryan.roberts, ryncsn, senozhatsky, shakeel.butt, shy828301,
surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed
From: Barry Song <v-songbaohua@oppo.com>
Currently, we support mTHP swap-out but not swap-in. This means that once an
mTHP is swapped out, it comes back as small folios when swapped in. This is
particularly detrimental on devices such as Android phones, where more than
half of the memory is in swap.
The lack of mTHP swapin functionality makes mTHP a showstopper in scenarios
that heavily rely on swap. This patchset introduces mTHP swap-in support.
It starts with synchronous devices similar to zRAM, aiming to benefit as
many users as possible with minimal changes.
-v8:
* fix the conflicts with zeromap (this is also a hotfix for zeromap, with a
Fixes tag), reported by Kairui, thanks!
Usama, Yosry, thanks for all your comments during the discussion!
* refine the changelog to add the case Kanchana reported: with Intel IAA,
mTHP swap-in can improve zRAM read latency by 7X. thanks!
* some other code cleanup
-v7:
https://lore.kernel.org/linux-mm/20240821074541.516249-1-hanchuanhua@oppo.com/
* collect Chris's ack tags, thanks!
* adjust the comment and subject, as pointed out by Christoph.
* make alloc_swap_folio() always charge the folio to fix the problem of charge
failure in memcg when the memory limit is reached (reported by Kairui), as
pointed out by Kefeng and Matthew.
-v6:
https://lore.kernel.org/linux-mm/20240802122031.117548-1-21cnbao@gmail.com/
* remove the swapin control added in v5, per Willy, Christoph;
The original reason for adding the swpin_enabled control was primarily
to address concerns about slower devices. Currently, since we only support
fast sync devices, swap-in size is less of a concern.
We’ll gain a clearer understanding of the next steps as more devices
begin to support mTHP swap-in.
* add nr argument in mem_cgroup_swapin_uncharge_swap() instead of adding a
new API, per Willy;
* swapcache_prepare() and swapcache_clear() large-folio support is also
removed as it has been split out into a separate patch per Baolin's request;
that patch is now in mm-unstable.
* provide more data in changelog.
-v5:
https://lore.kernel.org/linux-mm/20240726094618.401593-1-21cnbao@gmail.com/
* Add a swap-in control policy according to Ying's proposal. Right now only
"always" and "never" are supported; later we can extend to "auto";
* Fix the comment regarding zswap_never_enabled() according to Yosry;
* Filter out unaligned swp entries earlier;
* add mem_cgroup_swapin_uncharge_swap_nr() helper
-v4:
https://lore.kernel.org/linux-mm/20240629111010.230484-1-21cnbao@gmail.com/
Many parts of v3 have been merged into the mm tree with review help from
Ryan, David, Ying, Chris and others. Thank you very much!
This is the final part to allocate large folios and map them.
* Use Yosry's zswap_never_enabled(); note there is a bug. I put the bug fix
in this v4 RFC though it should be fixed in Yosry's patch
* lots of code improvements (drop the large stack usage, hold the ptl, etc.)
according to Yosry's and Ryan's feedback
* rebased on top of the latest mm-unstable and utilized some new helpers
introduced recently.
-v3:
https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/
* avoid over-writing err in __swap_duplicate_nr, pointed out by Yosry,
thanks!
* fix the issue that the folio is charged twice in do_swap_page, separating
alloc_anon_folio and alloc_swap_folio as they now differ in:
  * memcg charging
  * whether to clear the allocated folio
-v2:
https://lore.kernel.org/linux-mm/20240229003753.134193-1-21cnbao@gmail.com/
* lots of code cleanup according to Chris's comments, thanks!
* collect Chris's ack tags, thanks!
* address David's comment on moving to use folio_add_new_anon_rmap
for !folio_test_anon in do_swap_page, thanks!
* remove the MADV_PAGEOUT patch from this series as Ryan will
integrate it into the swap-out series
* Apply Kairui's work of "mm/swap: fix race when skipping swapcache"
on large folios swap-in as well
* fixed corrupted data (zero-filled data) in two races (zswap, and the case
where part of the entries are in the swapcache while others are not) by
checking SWAP_HAS_CACHE while swapping in a large folio
-v1:
https://lore.kernel.org/all/20240118111036.72641-1-21cnbao@gmail.com/#t
Barry Song (2):
mm: Fix swap_read_folio_zeromap() for large folios with partial
zeromap
mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to
support large folios
Chuanhua Han (1):
mm: support large folios swap-in for sync io devices
include/linux/memcontrol.h | 5 +-
mm/memcontrol.c | 7 +-
mm/memory.c | 261 +++++++++++++++++++++++++++++++++----
mm/page_io.c | 32 +----
mm/swap.h | 33 +++++
mm/swap_state.c | 2 +-
6 files changed, 282 insertions(+), 58 deletions(-)
--
2.34.1
* [PATCH v8 1/3] mm: Fix swap_read_folio_zeromap() for large folios with partial zeromap
From: Barry Song @ 2024-09-06 0:10 UTC
To: akpm, linux-mm
Cc: hanchuanhua, baolin.wang, chrisl, david, hannes, hch, hughd,
kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs,
ryan.roberts, ryncsn, senozhatsky, shakeel.butt, shy828301,
surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed,
Usama Arif
From: Barry Song <v-songbaohua@oppo.com>
There could be a corner case where the first entry is non-zeromap,
but a subsequent entry is zeromap. In this case, we should not
let swap_read_folio_zeromap() return false since we will still
read corrupted data.
Additionally, the iteration of test_bit() is unnecessary and
can be replaced with bitmap operations, which are more efficient.
We can adopt the style of swap_pte_batch() and folio_pte_batch() to
introduce swap_zeromap_batch() which seems to provide the greatest
flexibility for the caller. This approach allows the caller to either
check if the zeromap status of all entries is consistent or determine
the number of contiguous entries with the same status.
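For illustration only, a rough sketch of the two caller patterns described
above (the surrounding variables are made up and not part of this patch):

	/* Pattern 1: require all nr entries to share one zeromap status. */
	if (swap_zeromap_batch(entry, nr, &is_zeromap) != nr)
		return false;	/* mixed zeromap status within the range */

	/* Pattern 2: walk the range one same-status run at a time. */
	while (nr) {
		int batch = swap_zeromap_batch(entry, nr, &is_zeromap);

		/* handle 'batch' contiguous entries, all zeromap or all not */
		entry.val += batch;
		nr -= batch;
	}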
Since swap_read_folio() can't handle reading a large folio that's
partially zeromap and partially non-zeromap, we've moved the code
to mm/swap.h so that others, like those working on swap-in, can
access it.
Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap")
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
mm/page_io.c | 32 +++++++-------------------------
mm/swap.h | 33 +++++++++++++++++++++++++++++++++
2 files changed, 40 insertions(+), 25 deletions(-)
diff --git a/mm/page_io.c b/mm/page_io.c
index 4bc77d1c6bfa..2dfe2273a1f1 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -226,26 +226,6 @@ static void swap_zeromap_folio_clear(struct folio *folio)
}
}
-/*
- * Return the index of the first subpage which is not zero-filled
- * according to swap_info_struct->zeromap.
- * If all pages are zero-filled according to zeromap, it will return
- * folio_nr_pages(folio).
- */
-static unsigned int swap_zeromap_folio_test(struct folio *folio)
-{
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
- swp_entry_t entry;
- unsigned int i;
-
- for (i = 0; i < folio_nr_pages(folio); i++) {
- entry = page_swap_entry(folio_page(folio, i));
- if (!test_bit(swp_offset(entry), sis->zeromap))
- return i;
- }
- return i;
-}
-
/*
* We may have stale swap cache pages in memory: notice
* them here and get rid of the unnecessary final write.
@@ -524,19 +504,21 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
static bool swap_read_folio_zeromap(struct folio *folio)
{
- unsigned int idx = swap_zeromap_folio_test(folio);
-
- if (idx == 0)
- return false;
+ int nr_pages = folio_nr_pages(folio);
+ bool is_zeromap;
+ int nr_zeromap = swap_zeromap_batch(folio->swap, nr_pages, &is_zeromap);
/*
* Swapping in a large folio that is partially in the zeromap is not
* currently handled. Return true without marking the folio uptodate so
* that an IO error is emitted (e.g. do_swap_page() will sigbus).
*/
- if (WARN_ON_ONCE(idx < folio_nr_pages(folio)))
+ if (WARN_ON_ONCE(nr_zeromap != nr_pages))
return true;
+ if (!is_zeromap)
+ return false;
+
folio_zero_range(folio, 0, folio_size(folio));
folio_mark_uptodate(folio);
return true;
diff --git a/mm/swap.h b/mm/swap.h
index f8711ff82f84..1cc56a02fb5f 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -80,6 +80,32 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
{
return swp_swap_info(folio->swap)->flags;
}
+
+/*
+ * Return the count of contiguous swap entries that share the same
+ * zeromap status as the starting entry. If is_zeromap is not NULL,
+ * it will return the zeromap status of the starting entry.
+ */
+static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
+ bool *is_zeromap)
+{
+ struct swap_info_struct *sis = swp_swap_info(entry);
+ unsigned long start = swp_offset(entry);
+ unsigned long end = start + max_nr;
+ bool start_entry_zeromap;
+
+ start_entry_zeromap = test_bit(start, sis->zeromap);
+ if (is_zeromap)
+ *is_zeromap = start_entry_zeromap;
+
+ if (max_nr <= 1)
+ return max_nr;
+ if (start_entry_zeromap)
+ return find_next_zero_bit(sis->zeromap, end, start) - start;
+ else
+ return find_next_bit(sis->zeromap, end, start) - start;
+}
+
#else /* CONFIG_SWAP */
struct swap_iocb;
static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
@@ -171,6 +197,13 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
{
return 0;
}
+
+static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
+ bool *has_zeromap)
+{
+ return 0;
+}
+
#endif /* CONFIG_SWAP */
#endif /* _MM_SWAP_H */
--
2.34.1
* [PATCH v8 2/3] mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
From: Barry Song @ 2024-09-06 0:10 UTC
To: akpm, linux-mm
Cc: hanchuanhua, baolin.wang, chrisl, david, hannes, hch, hughd,
kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs,
ryan.roberts, ryncsn, senozhatsky, shakeel.butt, shy828301,
surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed
From: Barry Song <v-songbaohua@oppo.com>
With large folios swap-in, we might need to uncharge multiple entries all
together, add nr argument in mem_cgroup_swapin_uncharge_swap().
For the existing two users, just pass nr=1.
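As an illustration (a sketch only, mirroring the caller that patch 3 of this
series adds), a large folio swap-in path is expected to uncharge all the
entries backing the folio in a single call:

	unsigned int nr_pages = folio_nr_pages(folio);

	/* for a large folio, align the first entry to the folio boundary */
	if (folio_test_large(folio))
		entry.val = ALIGN_DOWN(entry.val, nr_pages);
	mem_cgroup_swapin_uncharge_swap(entry, nr_pages);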
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gao Xiang <xiang@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kairui Song <ryncsn@gmail.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
---
include/linux/memcontrol.h | 5 +++--
mm/memcontrol.c | 7 ++++---
mm/memory.c | 2 +-
mm/swap_state.c | 2 +-
4 files changed, 9 insertions(+), 7 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2ef94c74847d..34d2da05f2f1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -699,7 +699,8 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
gfp_t gfp, swp_entry_t entry);
-void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
+
+void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages);
void __mem_cgroup_uncharge(struct folio *folio);
@@ -1206,7 +1207,7 @@ static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
return 0;
}
-static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
+static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr)
{
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bda6f75d22ff..c0d36ca20332 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4559,14 +4559,15 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
/*
* mem_cgroup_swapin_uncharge_swap - uncharge swap slot
- * @entry: swap entry for which the page is charged
+ * @entry: the first swap entry for which the pages are charged
+ * @nr_pages: number of pages which will be uncharged
*
* Call this function after successfully adding the charged page to swapcache.
*
* Note: This function assumes the page for which swap slot is being uncharged
* is order 0 page.
*/
-void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
+void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
{
/*
* Cgroup1's unified memory+swap counter has been charged with the
@@ -4586,7 +4587,7 @@ void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
* let's not wait for it. The page already received a
* memory+swap charge, drop the swap entry duplicate.
*/
- mem_cgroup_uncharge_swap(entry, 1);
+ mem_cgroup_uncharge_swap(entry, nr_pages);
}
}
diff --git a/mm/memory.c b/mm/memory.c
index 42674c0748cb..cdf03b39a92c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4100,7 +4100,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
ret = VM_FAULT_OOM;
goto out_page;
}
- mem_cgroup_swapin_uncharge_swap(entry);
+ mem_cgroup_swapin_uncharge_swap(entry, 1);
shadow = get_shadow_from_swap_cache(entry);
if (shadow)
diff --git a/mm/swap_state.c b/mm/swap_state.c
index a042720554a7..4669f29cf555 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -522,7 +522,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
if (add_to_swap_cache(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
goto fail_unlock;
- mem_cgroup_swapin_uncharge_swap(entry);
+ mem_cgroup_swapin_uncharge_swap(entry, 1);
if (shadow)
workingset_refault(new_folio, shadow);
--
2.34.1
* [PATCH v8 3/3] mm: support large folios swap-in for sync io devices
From: Barry Song @ 2024-09-06 0:10 UTC
To: akpm, linux-mm
Cc: hanchuanhua, baolin.wang, chrisl, david, hannes, hch, hughd,
kaleshsingh, kasong, linux-kernel, mhocko, minchan, nphamcs,
ryan.roberts, ryncsn, senozhatsky, shakeel.butt, shy828301,
surenb, v-songbaohua, willy, xiang, ying.huang, yosryahmed,
Usama Arif, Kanchana P Sridhar
From: Chuanhua Han <hanchuanhua@oppo.com>
Currently, we have mTHP features, but unfortunately, without support for
large folio swap-ins, once these large folios are swapped out, they are
lost because mTHP swap is a one-way process. The lack of mTHP swap-in
functionality prevents mTHP from being used on devices like Android that
heavily rely on swap.
This patch introduces mTHP swap-in support. It starts from sync devices
such as zRAM. This is probably the simplest and most common use case,
benefiting billions of Android phones and similar devices with minimal
implementation cost. In this straightforward scenario, large folios are
always exclusive, eliminating the need to handle complex rmap and
swapcache issues.
It offers several benefits:
1. Enables bidirectional mTHP swapping, allowing retrieval of mTHP after
swap-out and swap-in. Large folios in the buddy system are also
preserved as much as possible, rather than being fragmented due
to swap-in.
2. Eliminates fragmentation in swap slots and supports successful
THP_SWPOUT.
w/o this patch (data collected while running ./thp_swap_allocator_test
without the "-a" option, on top of Chris's and Kairui's latest swap
allocator optimization [1]):
./thp_swap_allocator_test
Iteration 1: swpout inc: 233, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 131, swpout fallback inc: 101, Fallback percentage: 43.53%
Iteration 3: swpout inc: 71, swpout fallback inc: 155, Fallback percentage: 68.58%
Iteration 4: swpout inc: 55, swpout fallback inc: 168, Fallback percentage: 75.34%
Iteration 5: swpout inc: 35, swpout fallback inc: 191, Fallback percentage: 84.51%
Iteration 6: swpout inc: 25, swpout fallback inc: 199, Fallback percentage: 88.84%
Iteration 7: swpout inc: 23, swpout fallback inc: 205, Fallback percentage: 89.91%
Iteration 8: swpout inc: 9, swpout fallback inc: 219, Fallback percentage: 96.05%
Iteration 9: swpout inc: 13, swpout fallback inc: 213, Fallback percentage: 94.25%
Iteration 10: swpout inc: 12, swpout fallback inc: 216, Fallback percentage: 94.74%
Iteration 11: swpout inc: 16, swpout fallback inc: 213, Fallback percentage: 93.01%
Iteration 12: swpout inc: 10, swpout fallback inc: 210, Fallback percentage: 95.45%
Iteration 13: swpout inc: 16, swpout fallback inc: 212, Fallback percentage: 92.98%
Iteration 14: swpout inc: 12, swpout fallback inc: 212, Fallback percentage: 94.64%
Iteration 15: swpout inc: 15, swpout fallback inc: 211, Fallback percentage: 93.36%
Iteration 16: swpout inc: 15, swpout fallback inc: 200, Fallback percentage: 93.02%
Iteration 17: swpout inc: 9, swpout fallback inc: 220, Fallback percentage: 96.07%
w/ this patch (always 0%):
Iteration 1: swpout inc: 948, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 2: swpout inc: 953, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 3: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 4: swpout inc: 952, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 5: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 6: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 7: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 8: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 9: swpout inc: 950, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 10: swpout inc: 945, swpout fallback inc: 0, Fallback percentage: 0.00%
Iteration 11: swpout inc: 947, swpout fallback inc: 0, Fallback percentage: 0.00%
...
3. With both mTHP swap-out and swap-in supported, we offer the option to enable
zsmalloc compression/decompression with larger granularity [2]. The upcoming
optimization in zsmalloc will significantly increase swap speed and improve
compression efficiency. In a test of 100 iterations of swapping 100MiB of
anon memory, the swap speed improved dramatically:
                time consumption of swapin(ms)   time consumption of swapout(ms)
 lz4 4k                     45274                            90540
 lz4 64k                    22942                            55667
 zstdn 4k                   85035                           186585
 zstdn 64k                  46558                           118533
The compression ratio also improved, as evaluated with 1 GiB of data:
 granularity     orig_data_size    compr_data_size
 4KiB-zstd         1048576000          246876055
 64KiB-zstd        1048576000          199763892
Without mTHP swap-in, the potential optimizations in zsmalloc cannot be
realized.
4. Even mTHP swap-in itself can reduce swap-in page faults by a factor
of nr_pages. Swapping in content filled with the same data 0x11, w/o
and w/ the patch for five rounds (since the content is the same,
decompression is very fast; this primarily assesses the impact of
reduced page faults):
 swp in bandwidth(bytes/ms)     w/o         w/
 round1                       624152    1127501
 round2                       631672    1127501
 round3                       620459    1139756
 round4                       606113    1139756
 round5                       624152    1152281
 avg                          621310    1137359    +83%
5. With both mTHP swap-out and swap-in supported, we offer the option to enable
hardware accelerators (Intel IAA) to do parallel decompression, with which
Kanchana reported a 7X improvement in zRAM read latency [3].
[1] https://lore.kernel.org/all/20240730-swap-allocator-v5-0-cb9c148b9297@kernel.org/
[2] https://lore.kernel.org/all/20240327214816.31191-1-21cnbao@gmail.com/
[3] https://lore.kernel.org/all/cover.1714581792.git.andre.glover@linux.intel.com/
Signed-off-by: Chuanhua Han <hanchuanhua@oppo.com>
Co-developed-by: Barry Song <v-songbaohua@oppo.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gao Xiang <xiang@kernel.org>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
mm/memory.c | 261 ++++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 234 insertions(+), 27 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index cdf03b39a92c..d35dd8d99c8a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3985,6 +3985,194 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
return VM_FAULT_SIGBUS;
}
+static struct folio *__alloc_swap_folio(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ struct folio *folio;
+ swp_entry_t entry;
+
+ folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma,
+ vmf->address, false);
+ if (!folio)
+ return NULL;
+
+ entry = pte_to_swp_entry(vmf->orig_pte);
+ if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
+ GFP_KERNEL, entry)) {
+ folio_put(folio);
+ return NULL;
+ }
+
+ return folio;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+static inline int non_swapcache_batch(swp_entry_t entry, int max_nr)
+{
+ struct swap_info_struct *si = swp_swap_info(entry);
+ pgoff_t offset = swp_offset(entry);
+ int i;
+
+ /*
+ * We only take this path, allocating a large folio and doing
+ * swap_read_folio() directly, when the faulting PTE has no swapcache.
+ * We must ensure all the other PTEs have no swapcache either; otherwise,
+ * we might go to the swap device while the content is in the swapcache.
+ */
+ for (i = 0; i < max_nr; i++) {
+ if ((si->swap_map[offset + i] & SWAP_HAS_CACHE))
+ return i;
+ }
+
+ return i;
+}
+
+/*
+ * Check if the PTEs within a range are contiguous swap entries
+ * and have consistent swapcache and zeromap status.
+ */
+static bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
+{
+ unsigned long addr;
+ swp_entry_t entry;
+ int idx;
+ pte_t pte;
+
+ addr = ALIGN_DOWN(vmf->address, nr_pages * PAGE_SIZE);
+ idx = (vmf->address - addr) / PAGE_SIZE;
+ pte = ptep_get(ptep);
+
+ if (!pte_same(pte, pte_move_swp_offset(vmf->orig_pte, -idx)))
+ return false;
+ entry = pte_to_swp_entry(pte);
+ if (swap_pte_batch(ptep, nr_pages, pte) != nr_pages)
+ return false;
+
+ /*
+ * swap_read_folio() can't handle the case where a large folio is backed
+ * by a mix of different backends, and such mixes are likely rare corner
+ * cases. Similar checks might be added once zswap supports large folios.
+ */
+ if (unlikely(swap_zeromap_batch(entry, nr_pages, NULL) != nr_pages))
+ return false;
+ if (unlikely(non_swapcache_batch(entry, nr_pages) != nr_pages))
+ return false;
+
+ return true;
+}
+
+static inline unsigned long thp_swap_suitable_orders(pgoff_t swp_offset,
+ unsigned long addr,
+ unsigned long orders)
+{
+ int order, nr;
+
+ order = highest_order(orders);
+
+ /*
+ * To swap in a THP with nr pages, we require that its first swap_offset
+ * is aligned with that number, as it was when the THP was swapped out.
+ * This helps filter out most invalid entries.
+ */
+ while (orders) {
+ nr = 1 << order;
+ if ((addr >> PAGE_SHIFT) % nr == swp_offset % nr)
+ break;
+ order = next_order(&orders, order);
+ }
+
+ return orders;
+}
+
+static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned long orders;
+ struct folio *folio;
+ unsigned long addr;
+ swp_entry_t entry;
+ spinlock_t *ptl;
+ pte_t *pte;
+ gfp_t gfp;
+ int order;
+
+ /*
+ * If uffd is active for the vma we need per-page fault fidelity to
+ * maintain the uffd semantics.
+ */
+ if (unlikely(userfaultfd_armed(vma)))
+ goto fallback;
+
+ /*
+ * A large swapped out folio could be partially or fully in zswap. We
+ * lack handling for such cases, so fallback to swapping in order-0
+ * folio.
+ */
+ if (!zswap_never_enabled())
+ goto fallback;
+
+ entry = pte_to_swp_entry(vmf->orig_pte);
+ /*
+ * Get a list of all the (large) orders below PMD_ORDER that are enabled
+ * and suitable for swapping THP.
+ */
+ orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+ TVA_IN_PF | TVA_ENFORCE_SYSFS, BIT(PMD_ORDER) - 1);
+ orders = thp_vma_suitable_orders(vma, vmf->address, orders);
+ orders = thp_swap_suitable_orders(swp_offset(entry),
+ vmf->address, orders);
+
+ if (!orders)
+ goto fallback;
+
+ pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+ vmf->address & PMD_MASK, &ptl);
+ if (unlikely(!pte))
+ goto fallback;
+
+ /*
+ * For do_swap_page, find the highest order where the aligned range
+ * consists entirely of swap entries with contiguous swap offsets.
+ */
+ order = highest_order(orders);
+ while (orders) {
+ addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+ if (can_swapin_thp(vmf, pte + pte_index(addr), 1 << order))
+ break;
+ order = next_order(&orders, order);
+ }
+
+ pte_unmap_unlock(pte, ptl);
+
+ /* Try allocating the highest of the remaining orders. */
+ gfp = vma_thp_gfp_mask(vma);
+ while (orders) {
+ addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
+ folio = vma_alloc_folio(gfp, order, vma, addr, true);
+ if (folio) {
+ if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
+ gfp, entry))
+ return folio;
+ folio_put(folio);
+ }
+ order = next_order(&orders, order);
+ }
+
+fallback:
+ return __alloc_swap_folio(vmf);
+}
+#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
+static inline bool can_swapin_thp(struct vm_fault *vmf, pte_t *ptep, int nr_pages)
+{
+ return false;
+}
+
+static struct folio *alloc_swap_folio(struct vm_fault *vmf)
+{
+ return __alloc_swap_folio(vmf);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
/*
* We enter with non-exclusive mmap_lock (to exclude vma changes,
* but allow concurrent faults), and pte mapped but not yet locked.
@@ -4073,34 +4261,34 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
if (!folio) {
if (data_race(si->flags & SWP_SYNCHRONOUS_IO) &&
__swap_count(entry) == 1) {
- /*
- * Prevent parallel swapin from proceeding with
- * the cache flag. Otherwise, another thread may
- * finish swapin first, free the entry, and swapout
- * reusing the same entry. It's undetectable as
- * pte_same() returns true due to entry reuse.
- */
- if (swapcache_prepare(entry, 1)) {
- /* Relax a bit to prevent rapid repeated page faults */
- schedule_timeout_uninterruptible(1);
- goto out;
- }
- need_clear_cache = true;
-
/* skip swapcache */
- folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0,
- vma, vmf->address, false);
+ folio = alloc_swap_folio(vmf);
if (folio) {
__folio_set_locked(folio);
__folio_set_swapbacked(folio);
- if (mem_cgroup_swapin_charge_folio(folio,
- vma->vm_mm, GFP_KERNEL,
- entry)) {
- ret = VM_FAULT_OOM;
+ nr_pages = folio_nr_pages(folio);
+ if (folio_test_large(folio))
+ entry.val = ALIGN_DOWN(entry.val, nr_pages);
+ /*
+ * Prevent parallel swapin from proceeding with
+ * the cache flag. Otherwise, another thread
+ * may finish swapin first, free the entry, and
+ * swapout reusing the same entry. It's
+ * undetectable as pte_same() returns true due
+ * to entry reuse.
+ */
+ if (swapcache_prepare(entry, nr_pages)) {
+ /*
+ * Relax a bit to prevent rapid
+ * repeated page faults.
+ */
+ schedule_timeout_uninterruptible(1);
goto out_page;
}
- mem_cgroup_swapin_uncharge_swap(entry, 1);
+ need_clear_cache = true;
+
+ mem_cgroup_swapin_uncharge_swap(entry, nr_pages);
shadow = get_shadow_from_swap_cache(entry);
if (shadow)
@@ -4206,6 +4394,24 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out_nomap;
}
+ /* allocated large folios for SWP_SYNCHRONOUS_IO */
+ if (folio_test_large(folio) && !folio_test_swapcache(folio)) {
+ unsigned long nr = folio_nr_pages(folio);
+ unsigned long folio_start = ALIGN_DOWN(vmf->address, nr * PAGE_SIZE);
+ unsigned long idx = (vmf->address - folio_start) / PAGE_SIZE;
+ pte_t *folio_ptep = vmf->pte - idx;
+ pte_t folio_pte = ptep_get(folio_ptep);
+
+ if (!pte_same(folio_pte, pte_move_swp_offset(vmf->orig_pte, -idx)) ||
+ swap_pte_batch(folio_ptep, nr, folio_pte) != nr)
+ goto out_nomap;
+
+ page_idx = idx;
+ address = folio_start;
+ ptep = folio_ptep;
+ goto check_folio;
+ }
+
nr_pages = 1;
page_idx = 0;
address = vmf->address;
@@ -4337,11 +4543,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_add_lru_vma(folio, vma);
} else if (!folio_test_anon(folio)) {
/*
- * We currently only expect small !anon folios, which are either
- * fully exclusive or fully shared. If we ever get large folios
- * here, we have to be careful.
+ * We currently only expect small !anon folios which are either
+ * fully exclusive or fully shared, or newly allocated large
+ * folios which are fully exclusive. If we ever get large
+ * folios within swapcache here, we have to be careful.
*/
- VM_WARN_ON_ONCE(folio_test_large(folio));
+ VM_WARN_ON_ONCE(folio_test_large(folio) && folio_test_swapcache(folio));
VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
folio_add_new_anon_rmap(folio, vma, address, rmap_flags);
} else {
@@ -4384,7 +4591,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
out:
/* Clear the swap cache pin for direct swapin after PTL unlock */
if (need_clear_cache)
- swapcache_clear(si, entry, 1);
+ swapcache_clear(si, entry, nr_pages);
if (si)
put_swap_device(si);
return ret;
@@ -4400,7 +4607,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
folio_put(swapcache);
}
if (need_clear_cache)
- swapcache_clear(si, entry, 1);
+ swapcache_clear(si, entry, nr_pages);
if (si)
put_swap_device(si);
return ret;
--
2.34.1
* Re: [PATCH v8 1/3] mm: Fix swap_read_folio_zeromap() for large folios with partial zeromap
From: Yosry Ahmed @ 2024-09-06 18:32 UTC
To: Barry Song
Cc: akpm, linux-mm, hanchuanhua, baolin.wang, chrisl, david, hannes,
hch, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan,
nphamcs, ryan.roberts, ryncsn, senozhatsky, shakeel.butt,
shy828301, surenb, v-songbaohua, willy, xiang, ying.huang,
Usama Arif
On Thu, Sep 5, 2024 at 5:11 PM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> There could be a corner case where the first entry is non-zeromap,
> but a subsequent entry is zeromap. In this case, we should not
> let swap_read_folio_zeromap() return false since we will still
> read corrupted data.
>
> Additionally, the iteration of test_bit() is unnecessary and
> can be replaced with bitmap operations, which are more efficient.
>
> We can adopt the style of swap_pte_batch() and folio_pte_batch() to
> introduce swap_zeromap_batch() which seems to provide the greatest
> flexibility for the caller. This approach allows the caller to either
> check if the zeromap status of all entries is consistent or determine
> the number of contiguous entries with the same status.
>
> Since swap_read_folio() can't handle reading a large folio that's
> partially zeromap and partially non-zeromap, we've moved the code
> to mm/swap.h so that others, like those working on swap-in, can
> access it.
>
> Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap")
> Cc: Usama Arif <usamaarif642@gmail.com>
> Cc: Yosry Ahmed <yosryahmed@google.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
> mm/page_io.c | 32 +++++++-------------------------
> mm/swap.h | 33 +++++++++++++++++++++++++++++++++
> 2 files changed, 40 insertions(+), 25 deletions(-)
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 4bc77d1c6bfa..2dfe2273a1f1 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -226,26 +226,6 @@ static void swap_zeromap_folio_clear(struct folio *folio)
> }
> }
>
> -/*
> - * Return the index of the first subpage which is not zero-filled
> - * according to swap_info_struct->zeromap.
> - * If all pages are zero-filled according to zeromap, it will return
> - * folio_nr_pages(folio).
> - */
> -static unsigned int swap_zeromap_folio_test(struct folio *folio)
> -{
> - struct swap_info_struct *sis = swp_swap_info(folio->swap);
> - swp_entry_t entry;
> - unsigned int i;
> -
> - for (i = 0; i < folio_nr_pages(folio); i++) {
> - entry = page_swap_entry(folio_page(folio, i));
> - if (!test_bit(swp_offset(entry), sis->zeromap))
> - return i;
> - }
> - return i;
> -}
> -
> /*
> * We may have stale swap cache pages in memory: notice
> * them here and get rid of the unnecessary final write.
> @@ -524,19 +504,21 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
>
> static bool swap_read_folio_zeromap(struct folio *folio)
> {
> - unsigned int idx = swap_zeromap_folio_test(folio);
> -
> - if (idx == 0)
> - return false;
> + int nr_pages = folio_nr_pages(folio);
> + bool is_zeromap;
> + int nr_zeromap = swap_zeromap_batch(folio->swap, nr_pages, &is_zeromap);
swap_zeromap_batch() reads to me like the number of entries that are
in the zeromap (i.e. bits are set), not the number of contiguous equal
bits. I can't think of a better name though :/
The local variable is not adding much value here either. It's
reinforcing the misunderstanding I point out above, if anything. You
can just drop that.
>
> /*
> * Swapping in a large folio that is partially in the zeromap is not
> * currently handled. Return true without marking the folio uptodate so
> * that an IO error is emitted (e.g. do_swap_page() will sigbus).
> */
> - if (WARN_ON_ONCE(idx < folio_nr_pages(folio)))
> + if (WARN_ON_ONCE(nr_zeromap != nr_pages))
> return true;
>
> + if (!is_zeromap)
> + return false;
> +
> folio_zero_range(folio, 0, folio_size(folio));
> folio_mark_uptodate(folio);
> return true;
> diff --git a/mm/swap.h b/mm/swap.h
> index f8711ff82f84..1cc56a02fb5f 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -80,6 +80,32 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
> {
> return swp_swap_info(folio->swap)->flags;
> }
> +
> +/*
> + * Return the count of contiguous swap entries that share the same
> + * zeromap status as the starting entry. If is_zeromap is not NULL,
> + * it will return the zeromap status of the starting entry.
> + */
> +static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
> + bool *is_zeromap)
> +{
> + struct swap_info_struct *sis = swp_swap_info(entry);
> + unsigned long start = swp_offset(entry);
> + unsigned long end = start + max_nr;
> + bool start_entry_zeromap;
> +
> + start_entry_zeromap = test_bit(start, sis->zeromap);
first_bit is probably a better name.
> + if (is_zeromap)
> + *is_zeromap = start_entry_zeromap;
> +
> + if (max_nr <= 1)
> + return max_nr;
> + if (start_entry_zeromap)
> + return find_next_zero_bit(sis->zeromap, end, start) - start;
> + else
> + return find_next_bit(sis->zeromap, end, start) - start;
The usage of these functions looks correct to me, although
FIND_NEXT_BIT is not really easy for me to parse :)
* Re: [PATCH v8 2/3] mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios
From: Yosry Ahmed @ 2024-09-06 18:33 UTC
To: Barry Song
Cc: akpm, linux-mm, hanchuanhua, baolin.wang, chrisl, david, hannes,
hch, hughd, kaleshsingh, kasong, linux-kernel, mhocko, minchan,
nphamcs, ryan.roberts, ryncsn, senozhatsky, shakeel.butt,
shy828301, surenb, v-songbaohua, willy, xiang, ying.huang
On Thu, Sep 5, 2024 at 5:11 PM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> With large folios swap-in, we might need to uncharge multiple entries all
> together, add nr argument in mem_cgroup_swapin_uncharge_swap().
>
> For the existing two users, just pass nr=1.
>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
> Acked-by: Chris Li <chrisl@kernel.org>
> Cc: Shakeel Butt <shakeel.butt@linux.dev>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Gao Xiang <xiang@kernel.org>
> Cc: "Huang, Ying" <ying.huang@intel.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Kairui Song <kasong@tencent.com>
> Cc: Kairui Song <ryncsn@gmail.com>
> Cc: Kalesh Singh <kaleshsingh@google.com>
> Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Minchan Kim <minchan@kernel.org>
> Cc: Nhat Pham <nphamcs@gmail.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Yang Shi <shy828301@gmail.com>
> Cc: Yosry Ahmed <yosryahmed@google.com>
> ---
> include/linux/memcontrol.h | 5 +++--
> mm/memcontrol.c | 7 ++++---
> mm/memory.c | 2 +-
> mm/swap_state.c | 2 +-
> 4 files changed, 9 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 2ef94c74847d..34d2da05f2f1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -699,7 +699,8 @@ int mem_cgroup_hugetlb_try_charge(struct mem_cgroup *memcg, gfp_t gfp,
>
> int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
> gfp_t gfp, swp_entry_t entry);
> -void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry);
> +
> +void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages);
>
> void __mem_cgroup_uncharge(struct folio *folio);
>
> @@ -1206,7 +1207,7 @@ static inline int mem_cgroup_swapin_charge_folio(struct folio *folio,
> return 0;
> }
>
> -static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
> +static inline void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr)
> {
> }
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bda6f75d22ff..c0d36ca20332 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4559,14 +4559,15 @@ int mem_cgroup_swapin_charge_folio(struct folio *folio, struct mm_struct *mm,
>
> /*
> * mem_cgroup_swapin_uncharge_swap - uncharge swap slot
> - * @entry: swap entry for which the page is charged
> + * @entry: the first swap entry for which the pages are charged
> + * @nr_pages: number of pages which will be uncharged
> *
> * Call this function after successfully adding the charged page to swapcache.
> *
> * Note: This function assumes the page for which swap slot is being uncharged
> * is order 0 page.
> */
> -void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
> +void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
> {
> /*
> * Cgroup1's unified memory+swap counter has been charged with the
> @@ -4586,7 +4587,7 @@ void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
> * let's not wait for it. The page already received a
> * memory+swap charge, drop the swap entry duplicate.
> */
> - mem_cgroup_uncharge_swap(entry, 1);
> + mem_cgroup_uncharge_swap(entry, nr_pages);
> }
> }
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 42674c0748cb..cdf03b39a92c 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4100,7 +4100,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> ret = VM_FAULT_OOM;
> goto out_page;
> }
> - mem_cgroup_swapin_uncharge_swap(entry);
> + mem_cgroup_swapin_uncharge_swap(entry, 1);
>
> shadow = get_shadow_from_swap_cache(entry);
> if (shadow)
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index a042720554a7..4669f29cf555 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -522,7 +522,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> if (add_to_swap_cache(new_folio, entry, gfp_mask & GFP_RECLAIM_MASK, &shadow))
> goto fail_unlock;
>
> - mem_cgroup_swapin_uncharge_swap(entry);
> + mem_cgroup_swapin_uncharge_swap(entry, 1);
>
> if (shadow)
> workingset_refault(new_folio, shadow);
> --
> 2.34.1
>
* Re: [PATCH v8 1/3] mm: Fix swap_read_folio_zeromap() for large folios with partial zeromap
From: Barry Song @ 2024-09-06 21:58 UTC
To: yosryahmed
Cc: 21cnbao, akpm, baolin.wang, chrisl, david, hanchuanhua, hannes,
hch, hughd, kaleshsingh, kasong, linux-kernel, linux-mm, mhocko,
minchan, nphamcs, ryan.roberts, ryncsn, senozhatsky,
shakeel.butt, shy828301, surenb, usamaarif642, v-songbaohua,
willy, xiang, ying.huang
On Sat, Sep 7, 2024 at 6:32 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Sep 5, 2024 at 5:11 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > There could be a corner case where the first entry is non-zeromap,
> > but a subsequent entry is zeromap. In this case, we should not
> > let swap_read_folio_zeromap() return false since we will still
> > read corrupted data.
> >
> > Additionally, the iteration of test_bit() is unnecessary and
> > can be replaced with bitmap operations, which are more efficient.
> >
> > We can adopt the style of swap_pte_batch() and folio_pte_batch() to
> > introduce swap_zeromap_batch() which seems to provide the greatest
> > flexibility for the caller. This approach allows the caller to either
> > check if the zeromap status of all entries is consistent or determine
> > the number of contiguous entries with the same status.
> >
> > Since swap_read_folio() can't handle reading a large folio that's
> > partially zeromap and partially non-zeromap, we've moved the code
> > to mm/swap.h so that others, like those working on swap-in, can
> > access it.
> >
> > Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap")
> > Cc: Usama Arif <usamaarif642@gmail.com>
> > Cc: Yosry Ahmed <yosryahmed@google.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> > mm/page_io.c | 32 +++++++-------------------------
> > mm/swap.h | 33 +++++++++++++++++++++++++++++++++
> > 2 files changed, 40 insertions(+), 25 deletions(-)
> >
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index 4bc77d1c6bfa..2dfe2273a1f1 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -226,26 +226,6 @@ static void swap_zeromap_folio_clear(struct folio *folio)
> > }
> > }
> >
> > -/*
> > - * Return the index of the first subpage which is not zero-filled
> > - * according to swap_info_struct->zeromap.
> > - * If all pages are zero-filled according to zeromap, it will return
> > - * folio_nr_pages(folio).
> > - */
> > -static unsigned int swap_zeromap_folio_test(struct folio *folio)
> > -{
> > - struct swap_info_struct *sis = swp_swap_info(folio->swap);
> > - swp_entry_t entry;
> > - unsigned int i;
> > -
> > - for (i = 0; i < folio_nr_pages(folio); i++) {
> > - entry = page_swap_entry(folio_page(folio, i));
> > - if (!test_bit(swp_offset(entry), sis->zeromap))
> > - return i;
> > - }
> > - return i;
> > -}
> > -
> > /*
> > * We may have stale swap cache pages in memory: notice
> > * them here and get rid of the unnecessary final write.
> > @@ -524,19 +504,21 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
> >
> > static bool swap_read_folio_zeromap(struct folio *folio)
> > {
> > - unsigned int idx = swap_zeromap_folio_test(folio);
> > -
> > - if (idx == 0)
> > - return false;
> > + int nr_pages = folio_nr_pages(folio);
> > + bool is_zeromap;
> > + int nr_zeromap = swap_zeromap_batch(folio->swap, nr_pages, &is_zeromap);
>
> swap_zeromap_batch() reads to me like the number of entries that are
> in the zeromap (i.e. bits are set), not the number of contiguous equal
> bits. I can't think of a better name though :/
We now have swap_pte_batch() and folio_pte_batch(), both of which return the number
of entries sharing the same attribute as the first entry. These functions are frequently used
in the memory management code with conditions like if (swap_pte_batch() != nr) and
if (folio_pte_batch() != nr). Given this, it seems we could adopt a consistent approach
for handling entries in the same manner as the first one :-)
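For example, a caller that requires all nr entries to be consistent can
simply do something like this (illustrative snippet, matching how patch 3
of this series uses it):

	if (swap_zeromap_batch(entry, nr, NULL) != nr)
		return false;	/* mixed zeromap status within the range */

which is the same convention as the existing swap_pte_batch() and
folio_pte_batch() callers.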
>
> The local variable is not adding much value here either. It's
> reinforcing the misunderstanding I point out above, if anything. You
> can just drop that.
>
well, I feel I can remove this local variable by:
diff --git a/mm/page_io.c b/mm/page_io.c
index 2dfe2273a1f1..bc1183299a7d 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -506,14 +506,14 @@ static bool swap_read_folio_zeromap(struct folio *folio)
{
int nr_pages = folio_nr_pages(folio);
bool is_zeromap;
- int nr_zeromap = swap_zeromap_batch(folio->swap, nr_pages, &is_zeromap);
/*
* Swapping in a large folio that is partially in the zeromap is not
* currently handled. Return true without marking the folio uptodate so
* that an IO error is emitted (e.g. do_swap_page() will sigbus).
*/
- if (WARN_ON_ONCE(nr_zeromap != nr_pages))
+ if (WARN_ON_ONCE(swap_zeromap_batch(folio->swap, nr_pages,
+ &is_zeromap) != nr_pages))
return true;
if (!is_zeromap)
> >
> > /*
> > * Swapping in a large folio that is partially in the zeromap is not
> > * currently handled. Return true without marking the folio uptodate so
> > * that an IO error is emitted (e.g. do_swap_page() will sigbus).
> > */
> > - if (WARN_ON_ONCE(idx < folio_nr_pages(folio)))
> > + if (WARN_ON_ONCE(nr_zeromap != nr_pages))
> > return true;
> >
> > + if (!is_zeromap)
> > + return false;
> > +
> > folio_zero_range(folio, 0, folio_size(folio));
> > folio_mark_uptodate(folio);
> > return true;
> > diff --git a/mm/swap.h b/mm/swap.h
> > index f8711ff82f84..1cc56a02fb5f 100644
> > --- a/mm/swap.h
> > +++ b/mm/swap.h
> > @@ -80,6 +80,32 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
> > {
> > return swp_swap_info(folio->swap)->flags;
> > }
> > +
> > +/*
> > + * Return the count of contiguous swap entries that share the same
> > + * zeromap status as the starting entry. If is_zeromap is not NULL,
> > + * it will return the zeromap status of the starting entry.
> > + */
> > +static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
> > + bool *is_zeromap)
> > +{
> > + struct swap_info_struct *sis = swp_swap_info(entry);
> > + unsigned long start = swp_offset(entry);
> > + unsigned long end = start + max_nr;
> > + bool start_entry_zeromap;
> > +
> > + start_entry_zeromap = test_bit(start, sis->zeromap);
>
> first_bit is probably a better name.
>
yep, might be. I am glad to rename if it makes the code easier to understand:
diff --git a/mm/swap.h b/mm/swap.h
index 1cc56a02fb5f..e0397a197620 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -92,15 +92,15 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
struct swap_info_struct *sis = swp_swap_info(entry);
unsigned long start = swp_offset(entry);
unsigned long end = start + max_nr;
- bool start_entry_zeromap;
+ bool first_bit;
- start_entry_zeromap = test_bit(start, sis->zeromap);
- if (is_zeromap)
- *is_zeromap = start_entry_zeromap;
+ first_bit = test_bit(start, sis->zeromap);
+ if (is_zeromap)
+ *is_zeromap = first_bit;
if (max_nr <= 1)
return max_nr;
- if (start_entry_zeromap)
+ if (first_bit)
return find_next_zero_bit(sis->zeromap, end, start) - start;
else
return find_next_bit(sis->zeromap, end, start) - start;
> > + if (is_zeromap)
> > + *is_zeromap = start_entry_zeromap;
> > +
> > + if (max_nr <= 1)
> > + return max_nr;
> > + if (start_entry_zeromap)
> > + return find_next_zero_bit(sis->zeromap, end, start) - start;
> > + else
> > + return find_next_bit(sis->zeromap, end, start) - start;
>
> The usage of these functions looks correct to me, although
> FIND_NEXT_BIT is not really easy for me to parse :)
Yep :-) with the above two changes, the patch becomes:
From 272c04cb758b8062eaa96a52b855ff79c8afdf6a Mon Sep 17 00:00:00 2001
From: Barry Song <v-songbaohua@oppo.com>
Date: Thu, 5 Sep 2024 11:56:03 +1200
Subject: [PATCH v8 1/3] mm: Fix swap_read_folio_zeromap() for large folios
with partial zeromap
There could be a corner case where the first entry is non-zeromap,
but a subsequent entry is zeromap. In this case, we should not
let swap_read_folio_zeromap() return false since we will still
read corrupted data.
Additionally, the iteration of test_bit() is unnecessary and
can be replaced with bitmap operations, which are more efficient.
We can adopt the style of swap_pte_batch() and folio_pte_batch() to
introduce swap_zeromap_batch() which seems to provide the greatest
flexibility for the caller. This approach allows the caller to either
check if the zeromap status of all entries is consistent or determine
the number of contiguous entries with the same status.
Since swap_read_folio() can't handle reading a large folio that's
partially zeromap and partially non-zeromap, we've moved the code
to mm/swap.h so that others, like those working on swap-in, can
access it.
Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap")
Cc: Usama Arif <usamaarif642@gmail.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
mm/page_io.c | 32 +++++++-------------------------
mm/swap.h | 33 +++++++++++++++++++++++++++++++++
2 files changed, 40 insertions(+), 25 deletions(-)
diff --git a/mm/page_io.c b/mm/page_io.c
index 4bc77d1c6bfa..bc1183299a7d 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -226,26 +226,6 @@ static void swap_zeromap_folio_clear(struct folio *folio)
}
}
-/*
- * Return the index of the first subpage which is not zero-filled
- * according to swap_info_struct->zeromap.
- * If all pages are zero-filled according to zeromap, it will return
- * folio_nr_pages(folio).
- */
-static unsigned int swap_zeromap_folio_test(struct folio *folio)
-{
- struct swap_info_struct *sis = swp_swap_info(folio->swap);
- swp_entry_t entry;
- unsigned int i;
-
- for (i = 0; i < folio_nr_pages(folio); i++) {
- entry = page_swap_entry(folio_page(folio, i));
- if (!test_bit(swp_offset(entry), sis->zeromap))
- return i;
- }
- return i;
-}
-
/*
* We may have stale swap cache pages in memory: notice
* them here and get rid of the unnecessary final write.
@@ -524,19 +504,21 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
static bool swap_read_folio_zeromap(struct folio *folio)
{
- unsigned int idx = swap_zeromap_folio_test(folio);
-
- if (idx == 0)
- return false;
+ int nr_pages = folio_nr_pages(folio);
+ bool is_zeromap;
/*
* Swapping in a large folio that is partially in the zeromap is not
* currently handled. Return true without marking the folio uptodate so
* that an IO error is emitted (e.g. do_swap_page() will sigbus).
*/
- if (WARN_ON_ONCE(idx < folio_nr_pages(folio)))
+ if (WARN_ON_ONCE(swap_zeromap_batch(folio->swap, nr_pages,
+ &is_zeromap) != nr_pages))
return true;
+ if (!is_zeromap)
+ return false;
+
folio_zero_range(folio, 0, folio_size(folio));
folio_mark_uptodate(folio);
return true;
diff --git a/mm/swap.h b/mm/swap.h
index f8711ff82f84..e0397a197620 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -80,6 +80,32 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
{
return swp_swap_info(folio->swap)->flags;
}
+
+/*
+ * Return the count of contiguous swap entries that share the same
+ * zeromap status as the starting entry. If is_zeromap is not NULL,
+ * it will return the zeromap status of the starting entry.
+ */
+static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
+ bool *is_zeromap)
+{
+ struct swap_info_struct *sis = swp_swap_info(entry);
+ unsigned long start = swp_offset(entry);
+ unsigned long end = start + max_nr;
+ bool first_bit;
+
+ first_bit = test_bit(start, sis->zeromap);
+ if (is_zeromap)
+ *is_zeromap = first_bit;
+
+ if (max_nr <= 1)
+ return max_nr;
+ if (first_bit)
+ return find_next_zero_bit(sis->zeromap, end, start) - start;
+ else
+ return find_next_bit(sis->zeromap, end, start) - start;
+}
+
#else /* CONFIG_SWAP */
struct swap_iocb;
static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
@@ -171,6 +197,13 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
{
return 0;
}
+
+static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
+ bool *has_zeromap)
+{
+ return 0;
+}
+
#endif /* CONFIG_SWAP */
#endif /* _MM_SWAP_H */
--
2.34.1
Thanks
Barry
* Re: [PATCH v8 1/3] mm: Fix swap_read_folio_zeromap() for large folios with partial zeromap
From: Yosry Ahmed @ 2024-09-06 22:55 UTC
To: Barry Song
Cc: akpm, baolin.wang, chrisl, david, hanchuanhua, hannes, hch,
hughd, kaleshsingh, kasong, linux-kernel, linux-mm, mhocko,
minchan, nphamcs, ryan.roberts, ryncsn, senozhatsky,
shakeel.butt, shy828301, surenb, usamaarif642, v-songbaohua,
willy, xiang, ying.huang
[..]
> Yep :-) with the above two changes, the patch becomes:
>
> From 272c04cb758b8062eaa96a52b855ff79c8afdf6a Mon Sep 17 00:00:00 2001
> From: Barry Song <v-songbaohua@oppo.com>
> Date: Thu, 5 Sep 2024 11:56:03 +1200
> Subject: [PATCH v8 1/3] mm: Fix swap_read_folio_zeromap() for large folios
> with partial zeromap
>
> There could be a corner case where the first entry is non-zeromap,
> but a subsequent entry is zeromap. In this case, we should not
> let swap_read_folio_zeromap() return false since we will still
> read corrupted data.
>
> Additionally, the iteration of test_bit() is unnecessary and
> can be replaced with bitmap operations, which are more efficient.
>
> We can adopt the style of swap_pte_batch() and folio_pte_batch() to
> introduce swap_zeromap_batch() which seems to provide the greatest
> flexibility for the caller. This approach allows the caller to either
> check if the zeromap status of all entries is consistent or determine
> the number of contiguous entries with the same status.
>
> Since swap_read_folio() can't handle reading a large folio that's
> partially zeromap and partially non-zeromap, we've moved the code
> to mm/swap.h so that others, like those working on swap-in, can
> access it.
>
> Fixes: 0ca0c24e3211 ("mm: store zero pages to be swapped out in a bitmap")
> Cc: Usama Arif <usamaarif642@gmail.com>
> Cc: Yosry Ahmed <yosryahmed@google.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
> ---
> mm/page_io.c | 32 +++++++-------------------------
> mm/swap.h | 33 +++++++++++++++++++++++++++++++++
> 2 files changed, 40 insertions(+), 25 deletions(-)
>
> diff --git a/mm/page_io.c b/mm/page_io.c
> index 4bc77d1c6bfa..bc1183299a7d 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -226,26 +226,6 @@ static void swap_zeromap_folio_clear(struct folio *folio)
> }
> }
>
> -/*
> - * Return the index of the first subpage which is not zero-filled
> - * according to swap_info_struct->zeromap.
> - * If all pages are zero-filled according to zeromap, it will return
> - * folio_nr_pages(folio).
> - */
> -static unsigned int swap_zeromap_folio_test(struct folio *folio)
> -{
> - struct swap_info_struct *sis = swp_swap_info(folio->swap);
> - swp_entry_t entry;
> - unsigned int i;
> -
> - for (i = 0; i < folio_nr_pages(folio); i++) {
> - entry = page_swap_entry(folio_page(folio, i));
> - if (!test_bit(swp_offset(entry), sis->zeromap))
> - return i;
> - }
> - return i;
> -}
> -
> /*
> * We may have stale swap cache pages in memory: notice
> * them here and get rid of the unnecessary final write.
> @@ -524,19 +504,21 @@ static void sio_read_complete(struct kiocb *iocb, long ret)
>
> static bool swap_read_folio_zeromap(struct folio *folio)
> {
> - unsigned int idx = swap_zeromap_folio_test(folio);
> -
> - if (idx == 0)
> - return false;
> + int nr_pages = folio_nr_pages(folio);
> + bool is_zeromap;
>
> /*
> * Swapping in a large folio that is partially in the zeromap is not
> * currently handled. Return true without marking the folio uptodate so
> * that an IO error is emitted (e.g. do_swap_page() will sigbus).
> */
> - if (WARN_ON_ONCE(idx < folio_nr_pages(folio)))
> + if (WARN_ON_ONCE(swap_zeromap_batch(folio->swap, nr_pages,
> + &is_zeromap) != nr_pages))
> return true;
>
> + if (!is_zeromap)
> + return false;
> +
> folio_zero_range(folio, 0, folio_size(folio));
> folio_mark_uptodate(folio);
> return true;
> diff --git a/mm/swap.h b/mm/swap.h
> index f8711ff82f84..e0397a197620 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -80,6 +80,32 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
> {
> return swp_swap_info(folio->swap)->flags;
> }
> +
> +/*
> + * Return the count of contiguous swap entries that share the same
> + * zeromap status as the starting entry. If is_zeromap is not NULL,
> + * it will return the zeromap status of the starting entry.
> + */
> +static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
> + bool *is_zeromap)
> +{
> + struct swap_info_struct *sis = swp_swap_info(entry);
> + unsigned long start = swp_offset(entry);
> + unsigned long end = start + max_nr;
> + bool first_bit;
> +
> + first_bit = test_bit(start, sis->zeromap);
> + if (is_zeromap)
> + *is_zeromap = first_bit;
> +
> + if (max_nr <= 1)
> + return max_nr;
> + if (first_bit)
> + return find_next_zero_bit(sis->zeromap, end, start) - start;
> + else
> + return find_next_bit(sis->zeromap, end, start) - start;
> +}
> +
> #else /* CONFIG_SWAP */
> struct swap_iocb;
> static inline void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
> @@ -171,6 +197,13 @@ static inline unsigned int folio_swap_flags(struct folio *folio)
> {
> return 0;
> }
> +
> +static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr,
> + bool *has_zeromap)
> +{
> + return 0;
> +}
> +
> #endif /* CONFIG_SWAP */
>
> #endif /* _MM_SWAP_H */
> --
> 2.34.1
>
> Thanks
> Barry
>