* [RFC PATCH 1/2] swapfile: add a batched variant for swap_duplicate()
2024-09-23 23:11 [RFC PATCH 0/2] remove SWAP_MAP_SHMEM Nhat Pham
@ 2024-09-23 23:11 ` Nhat Pham
2024-09-23 23:11 ` [RFC PATCH 2/2] swap: shmem: remove SWAP_MAP_SHMEM Nhat Pham
` (3 subsequent siblings)
4 siblings, 0 replies; 29+ messages in thread
From: Nhat Pham @ 2024-09-23 23:11 UTC (permalink / raw)
To: akpm
Cc: hannes, yosryahmed, hughd, shakeel.butt, ryan.roberts,
ying.huang, chrisl, david, kasong, willy, viro, baohua,
chengming.zhou, linux-mm, kernel-team, linux-kernel
Add swap_duplicate_nr(), a batched variant of swap_duplicate() that
operates on multiple contiguous swap entries.
This will be used in the following patch to remove SWAP_MAP_SHMEM.
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/swap.h | 10 ++++++++--
mm/swapfile.c | 13 +++++++++----
2 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ca533b478c21..e6ab234be7be 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -483,7 +483,7 @@ extern swp_entry_t get_swap_page_of_type(int);
extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
extern void swap_shmem_alloc(swp_entry_t, int);
-extern int swap_duplicate(swp_entry_t);
+extern int swap_duplicate_nr(swp_entry_t, int);
extern int swapcache_prepare(swp_entry_t entry, int nr);
extern void swap_free_nr(swp_entry_t entry, int nr_pages);
extern void swapcache_free_entries(swp_entry_t *entries, int n);
@@ -553,7 +553,7 @@ static inline void swap_shmem_alloc(swp_entry_t swp, int nr)
{
}
-static inline int swap_duplicate(swp_entry_t swp)
+static inline int swap_duplicate_nr(swp_entry_t swp, int nr)
{
return 0;
}
@@ -606,6 +606,12 @@ static inline int add_swap_extent(struct swap_info_struct *sis,
}
#endif /* CONFIG_SWAP */
+static inline int swap_duplicate(swp_entry_t entry)
+{
+ return swap_duplicate_nr(entry, 1);
+}
+
+
static inline void free_swap_and_cache(swp_entry_t entry)
{
free_swap_and_cache_nr(entry, 1);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0cded32414a1..47a2cd5f590d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3695,18 +3695,23 @@ void swap_shmem_alloc(swp_entry_t entry, int nr)
__swap_duplicate(entry, SWAP_MAP_SHMEM, nr);
}
-/*
- * Increase reference count of swap entry by 1.
+/**
+ * swap_duplicate_nr() - Increase reference count of nr contiguous swap entries
+ * by 1.
+ *
+ * @entry: first swap entry from which we want to increase the refcount.
+ * @nr: Number of entries in range.
+ *
* Returns 0 for success, or -ENOMEM if a swap_count_continuation is required
* but could not be atomically allocated. Returns 0, just as if it succeeded,
* if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), which
* might occur if a page table entry has got corrupted.
*/
-int swap_duplicate(swp_entry_t entry)
+int swap_duplicate_nr(swp_entry_t entry, int nr)
{
int err = 0;
- while (!err && __swap_duplicate(entry, 1, 1) == -ENOMEM)
+ while (!err && __swap_duplicate(entry, 1, nr) == -ENOMEM)
err = add_swap_count_continuation(entry, GFP_ATOMIC);
return err;
}
--
2.43.5
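For a concrete sense of the API, here is a sketch (illustrative, not code
from the series) of the per-entry loop that swap_duplicate_nr() subsumes;
stepping through contiguous entries via swp_offset() + i follows the
convention used elsewhere in mm/swapfile.c:

	int i, err = 0;

	/* one call per entry in a contiguous range... */
	for (i = 0; i < nr && !err; i++)
		err = swap_duplicate(swp_entry(swp_type(entry),
					       swp_offset(entry) + i));

	/* ...versus a single batched call over the same range */
	err = swap_duplicate_nr(entry, nr);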
* [RFC PATCH 2/2] swap: shmem: remove SWAP_MAP_SHMEM
2024-09-23 23:11 [RFC PATCH 0/2] remove SWAP_MAP_SHMEM Nhat Pham
2024-09-23 23:11 ` [RFC PATCH 1/2] swapfile: add a batched variant for swap_duplicate() Nhat Pham
@ 2024-09-23 23:11 ` Nhat Pham
2024-09-24 0:32 ` Yosry Ahmed
2024-09-24 0:20 ` [RFC PATCH 0/2] " Yosry Ahmed
` (2 subsequent siblings)
4 siblings, 1 reply; 29+ messages in thread
From: Nhat Pham @ 2024-09-23 23:11 UTC (permalink / raw)
To: akpm
Cc: hannes, yosryahmed, hughd, shakeel.butt, ryan.roberts,
ying.huang, chrisl, david, kasong, willy, viro, baohua,
chengming.zhou, linux-mm, kernel-team, linux-kernel
The SWAP_MAP_SHMEM state was introduced in the commit aaa468653b4a
("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a swap entry
belongs to shmem during swapoff.
However, swapoff has since been rewritten in the commit b56a2d8af914
("mm: rid swapoff of quadratic complexity"). Now having swap count ==
SWAP_MAP_SHMEM value is basically the same as having swap count == 1,
and swap_shmem_alloc() behaves analogously to swap_duplicate().
Remove this state and the associated helper to simplify the state
machine (both mentally and in terms of actual code). We will also have
an extra state/special value that can be repurposed (for swap entries
that never get re-duplicated).
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
include/linux/swap.h | 6 ------
mm/shmem.c | 2 +-
mm/swapfile.c | 15 ---------------
3 files changed, 1 insertion(+), 22 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index e6ab234be7be..017f3c03ff7a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -232,7 +232,6 @@ enum {
/* Special value in first swap_map */
#define SWAP_MAP_MAX 0x3e /* Max count */
#define SWAP_MAP_BAD 0x3f /* Note page is bad */
-#define SWAP_MAP_SHMEM 0xbf /* Owned by shmem/tmpfs */
/* Special value in each swap_map continuation */
#define SWAP_CONT_MAX 0x7f /* Max count */
@@ -482,7 +481,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry);
extern swp_entry_t get_swap_page_of_type(int);
extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order);
extern int add_swap_count_continuation(swp_entry_t, gfp_t);
-extern void swap_shmem_alloc(swp_entry_t, int);
extern int swap_duplicate_nr(swp_entry_t, int);
extern int swapcache_prepare(swp_entry_t entry, int nr);
extern void swap_free_nr(swp_entry_t entry, int nr_pages);
@@ -549,10 +547,6 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
return 0;
}
-static inline void swap_shmem_alloc(swp_entry_t swp, int nr)
-{
-}
-
static inline int swap_duplicate_nr(swp_entry_t swp, int nr)
{
return 0;
diff --git a/mm/shmem.c b/mm/shmem.c
index 361affdf3990..1875f2521dc6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1559,7 +1559,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
__GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
NULL) == 0) {
shmem_recalc_inode(inode, 0, nr_pages);
- swap_shmem_alloc(swap, nr_pages);
+ swap_duplicate_nr(swap, nr_pages);
shmem_delete_from_page_cache(folio, swp_to_radix_entry(swap));
mutex_unlock(&shmem_swaplist_mutex);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 47a2cd5f590d..cebc244ee60f 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1381,12 +1381,6 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *si,
if (usage == SWAP_HAS_CACHE) {
VM_BUG_ON(!has_cache);
has_cache = 0;
- } else if (count == SWAP_MAP_SHMEM) {
- /*
- * Or we could insist on shmem.c using a special
- * swap_shmem_free() and free_shmem_swap_and_cache()...
- */
- count = 0;
} else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
if (count == COUNT_CONTINUED) {
if (swap_count_continued(si, offset, count))
@@ -3686,15 +3680,6 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
return err;
}
-/*
- * Help swapoff by noting that swap entry belongs to shmem/tmpfs
- * (in which case its reference count is never incremented).
- */
-void swap_shmem_alloc(swp_entry_t entry, int nr)
-{
- __swap_duplicate(entry, SWAP_MAP_SHMEM, nr);
-}
-
/**
* swap_duplicate_nr() - Increase reference count of nr contiguous swap entries
* by 1.
--
2.43.5
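For orientation, these are the swap_map special values in play after this
change (taken from the include/linux/swap.h hunk above; 0xbf is the value
this patch frees up):

	/* Special value in first swap_map */
	#define SWAP_MAP_MAX	0x3e	/* Max count */
	#define SWAP_MAP_BAD	0x3f	/* Note page is bad */
	/* 0xbf, formerly SWAP_MAP_SHMEM, is now unused and can be repurposed */

	/* Special value in each swap_map continuation */
	#define SWAP_CONT_MAX	0x7f	/* Max count */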
* Re: [RFC PATCH 2/2] swap: shmem: remove SWAP_MAP_SHMEM
2024-09-23 23:11 ` [RFC PATCH 2/2] swap: shmem: remove SWAP_MAP_SHMEM Nhat Pham
@ 2024-09-24 0:32 ` Yosry Ahmed
0 siblings, 0 replies; 29+ messages in thread
From: Yosry Ahmed @ 2024-09-24 0:32 UTC (permalink / raw)
To: Nhat Pham
Cc: akpm, hannes, hughd, shakeel.butt, ryan.roberts, ying.huang,
chrisl, david, kasong, willy, viro, baohua, chengming.zhou,
linux-mm, kernel-team, linux-kernel
On Mon, Sep 23, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> The SWAP_MAP_SHMEM state was introduced in the commit aaa468653b4a
> ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a swap entry
> belongs to shmem during swapoff.
>
> However, swapoff has since been rewritten in the commit b56a2d8af914
> ("mm: rid swapoff of quadratic complexity"). Now having swap count ==
> SWAP_MAP_SHMEM value is basically the same as having swap count == 1,
> and swap_shmem_alloc() behaves analogously to swap_duplicate().
It's probably useful to point out that swap_shmem_alloc() is
equivalent to swap_duplicate() because __swap_duplicate() should never
return -ENOMEM for shmem, as we only ever increment the swap count by
1 (for each entry).
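For reference, the relevant arm of __swap_duplicate(), as quoted in the
hunks later in this thread; a shmem entry's count only ever goes from 0 to
1, so it always takes the count += usage arm and never reaches
swap_count_continued():

	if (usage == SWAP_HAS_CACHE)
		has_cache = SWAP_HAS_CACHE;
	else if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
		count += usage;			/* shmem: 0 + 1, always below SWAP_MAP_MAX */
	else if (swap_count_continued(si, offset + i, count))
		count = COUNT_CONTINUED;	/* can fail and surface -ENOMEM */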
>
> Remove this state and the associated helper to simplify the state
> machine (both mentally and in terms of actual code). We will also have
> an extra state/special value that can be repurposed (for swap entries
> that never get re-duplicated).
>
> Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> ---
> include/linux/swap.h | 6 ------
> mm/shmem.c | 2 +-
> mm/swapfile.c | 15 ---------------
> 3 files changed, 1 insertion(+), 22 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index e6ab234be7be..017f3c03ff7a 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -232,7 +232,6 @@ enum {
> /* Special value in first swap_map */
> #define SWAP_MAP_MAX 0x3e /* Max count */
> #define SWAP_MAP_BAD 0x3f /* Note page is bad */
> -#define SWAP_MAP_SHMEM 0xbf /* Owned by shmem/tmpfs */
>
> /* Special value in each swap_map continuation */
> #define SWAP_CONT_MAX 0x7f /* Max count */
> @@ -482,7 +481,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry);
> extern swp_entry_t get_swap_page_of_type(int);
> extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order);
> extern int add_swap_count_continuation(swp_entry_t, gfp_t);
> -extern void swap_shmem_alloc(swp_entry_t, int);
> extern int swap_duplicate_nr(swp_entry_t, int);
> extern int swapcache_prepare(swp_entry_t entry, int nr);
> extern void swap_free_nr(swp_entry_t entry, int nr_pages);
> @@ -549,10 +547,6 @@ static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
> return 0;
> }
>
> -static inline void swap_shmem_alloc(swp_entry_t swp, int nr)
> -{
> -}
> -
> static inline int swap_duplicate_nr(swp_entry_t swp, int nr)
> {
> return 0;
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 361affdf3990..1875f2521dc6 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1559,7 +1559,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
> __GFP_HIGH | __GFP_NOMEMALLOC | __GFP_NOWARN,
> NULL) == 0) {
> shmem_recalc_inode(inode, 0, nr_pages);
> - swap_shmem_alloc(swap, nr_pages);
> + swap_duplicate_nr(swap, nr_pages);
> shmem_delete_from_page_cache(folio, swp_to_radix_entry(swap));
>
> mutex_unlock(&shmem_swaplist_mutex);
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 47a2cd5f590d..cebc244ee60f 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1381,12 +1381,6 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *si,
> if (usage == SWAP_HAS_CACHE) {
> VM_BUG_ON(!has_cache);
> has_cache = 0;
> - } else if (count == SWAP_MAP_SHMEM) {
> - /*
> - * Or we could insist on shmem.c using a special
> - * swap_shmem_free() and free_shmem_swap_and_cache()...
> - */
> - count = 0;
> } else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) {
> if (count == COUNT_CONTINUED) {
> if (swap_count_continued(si, offset, count))
> @@ -3686,15 +3680,6 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
> return err;
> }
>
> -/*
> - * Help swapoff by noting that swap entry belongs to shmem/tmpfs
> - * (in which case its reference count is never incremented).
> - */
> -void swap_shmem_alloc(swp_entry_t entry, int nr)
> -{
> - __swap_duplicate(entry, SWAP_MAP_SHMEM, nr);
> -}
> -
> /**
> * swap_duplicate_nr() - Increase reference count of nr contiguous swap entries
> * by 1.
> --
> 2.43.5
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-23 23:11 [RFC PATCH 0/2] remove SWAP_MAP_SHMEM Nhat Pham
2024-09-23 23:11 ` [RFC PATCH 1/2] swapfile: add a batched variant for swap_duplicate() Nhat Pham
2024-09-23 23:11 ` [RFC PATCH 2/2] swap: shmem: remove SWAP_MAP_SHMEM Nhat Pham
@ 2024-09-24 0:20 ` Yosry Ahmed
2024-09-24 1:55 ` Baolin Wang
2024-09-24 20:15 ` Chris Li
4 siblings, 0 replies; 29+ messages in thread
From: Yosry Ahmed @ 2024-09-24 0:20 UTC (permalink / raw)
To: Nhat Pham
Cc: akpm, hannes, hughd, shakeel.butt, ryan.roberts, ying.huang,
chrisl, david, kasong, willy, viro, baohua, chengming.zhou,
linux-mm, kernel-team, linux-kernel
On Mon, Sep 23, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> The SWAP_MAP_SHMEM state was originally introduced in the commit
> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> swap entry belongs to shmem during swapoff.
>
> However, swapoff has since been rewritten drastically in the commit
> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
> having swap count == SWAP_MAP_SHMEM value is basically the same as having
> swap count == 1, and swap_shmem_alloc() behaves analogously to
> swap_duplicate()
>
> This RFC proposes the removal of this state and the associated helper to
> simplify the state machine (both mentally and code-wise). We will also
> have an extra state/special value that can be repurposed (for swap entries
> that never gets re-duplicated).
>
> Another motivation (albeit a bit premature at the moment) is the new swap
> abstraction I am currently working on, that would allow for swap/zswap
> decoupling, swapoff optimization, etc. The fewer states and swap API
> functions there are, the simpler the conversion will be.
>
> I am sending this series first as an RFC, just in case I missed something
> or misunderstood this state, or if someone has a swap optimization in mind
> for shmem that would require this special state.
I have the same patch sitting in a tree somewhere from when I tried
working on swap abstraction, except then swap_shmem_alloc() did not
take a 'nr' argument so I did not need swap_duplicate_nr(). I was
going to send it out with other swap code cleanups I had, but I ended
up deciding to do nothing.
So for what it's worth I think this is correct:
Reviewed-by: Yosry Ahmed <yosryahmed@google.com>
>
> Swap experts, let me know if I'm mistaken :) Otherwise if there is no
> objection I will resend this patch series again for merging.
>
> Nhat Pham (2):
> swapfile: add a batched variant for swap_duplicate()
> swap: shmem: remove SWAP_MAP_SHMEM
>
> include/linux/swap.h | 16 ++++++++--------
> mm/shmem.c | 2 +-
> mm/swapfile.c | 28 +++++++++-------------------
> 3 files changed, 18 insertions(+), 28 deletions(-)
>
>
> base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
> --
> 2.43.5
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-23 23:11 [RFC PATCH 0/2] remove SWAP_MAP_SHMEM Nhat Pham
` (2 preceding siblings ...)
2024-09-24 0:20 ` [RFC PATCH 0/2] " Yosry Ahmed
@ 2024-09-24 1:55 ` Baolin Wang
2024-09-24 2:15 ` Yosry Ahmed
2024-09-24 20:15 ` Chris Li
4 siblings, 1 reply; 29+ messages in thread
From: Baolin Wang @ 2024-09-24 1:55 UTC (permalink / raw)
To: Nhat Pham, akpm
Cc: hannes, yosryahmed, hughd, shakeel.butt, ryan.roberts,
ying.huang, chrisl, david, kasong, willy, viro, baohua,
chengming.zhou, linux-mm, kernel-team, linux-kernel
On 2024/9/24 07:11, Nhat Pham wrote:
> The SWAP_MAP_SHMEM state was originally introduced in the commit
> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> swap entry belongs to shmem during swapoff.
>
> However, swapoff has since been rewritten drastically in the commit
> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
> having swap count == SWAP_MAP_SHMEM value is basically the same as having
> swap count == 1, and swap_shmem_alloc() behaves analogously to
> swap_duplicate()
>
> This RFC proposes the removal of this state and the associated helper to
> simplify the state machine (both mentally and code-wise). We will also
> have an extra state/special value that can be repurposed (for swap entries
> that never gets re-duplicated).
>
> Another motivation (albeit a bit premature at the moment) is the new swap
> abstraction I am currently working on, that would allow for swap/zswap
> decoupling, swapoff optimization, etc. The fewer states and swap API
> functions there are, the simpler the conversion will be.
>
> I am sending this series first as an RFC, just in case I missed something
> or misunderstood this state, or if someone has a swap optimization in mind
> for shmem that would require this special state.
The idea makes sense to me. I did a quick test with shmem mTHP, and
encountered the following warning which is triggered by
'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().
[ 81.064967] ------------[ cut here ]------------
[ 81.064968] WARNING: CPU: 4 PID: 6852 at mm/swapfile.c:3623
__swap_duplicate+0x1d0/0x2e0
[ 81.064994] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS
BTYPE=--)
[ 81.064995] pc : __swap_duplicate+0x1d0/0x2e0
[ 81.064997] lr : swap_duplicate_nr+0x30/0x70
[......]
[ 81.065019] Call trace:
[ 81.065019] __swap_duplicate+0x1d0/0x2e0
[ 81.065021] swap_duplicate_nr+0x30/0x70
[ 81.065022] shmem_writepage+0x24c/0x438
[ 81.065024] pageout+0x104/0x2e0
[ 81.065026] shrink_folio_list+0x7f0/0xe60
[ 81.065027] reclaim_folio_list+0x90/0x178
[ 81.065029] reclaim_pages+0x128/0x1a8
[ 81.065030] madvise_cold_or_pageout_pte_range+0x80c/0xd10
[ 81.065031] walk_pmd_range.isra.0+0x1b8/0x3a0
[ 81.065033] walk_pud_range+0x120/0x1b0
[ 81.065035] walk_pgd_range+0x150/0x1a8
[ 81.065036] __walk_page_range+0xa4/0xb8
[ 81.065038] walk_page_range+0x1c8/0x250
[ 81.065039] madvise_pageout+0xf4/0x280
[ 81.065041] madvise_vma_behavior+0x268/0x3f0
[ 81.065042] madvise_walk_vmas.constprop.0+0xb8/0x128
[ 81.065043] do_madvise.part.0+0xe8/0x2a0
[ 81.065044] __arm64_sys_madvise+0x64/0x78
[ 81.065046] invoke_syscall.constprop.0+0x54/0xe8
[ 81.065048] do_el0_svc+0xa4/0xc0
[ 81.065050] el0_svc+0x2c/0xb0
[ 81.065052] el0t_64_sync_handler+0xb8/0xc0
[ 81.065054] el0t_64_sync+0x14c/0x150
> Swap experts, let me know if I'm mistaken :) Otherwise if there is no
> objection I will resend this patch series again for merging.
>
> Nhat Pham (2):
> swapfile: add a batched variant for swap_duplicate()
> swap: shmem: remove SWAP_MAP_SHMEM
>
> include/linux/swap.h | 16 ++++++++--------
> mm/shmem.c | 2 +-
> mm/swapfile.c | 28 +++++++++-------------------
> 3 files changed, 18 insertions(+), 28 deletions(-)
>
>
> base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-24 1:55 ` Baolin Wang
@ 2024-09-24 2:15 ` Yosry Ahmed
2024-09-24 3:25 ` Baolin Wang
0 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2024-09-24 2:15 UTC (permalink / raw)
To: Baolin Wang
Cc: Nhat Pham, akpm, hannes, hughd, shakeel.butt, ryan.roberts,
ying.huang, chrisl, david, kasong, willy, viro, baohua,
chengming.zhou, linux-mm, kernel-team, linux-kernel
On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2024/9/24 07:11, Nhat Pham wrote:
> > The SWAP_MAP_SHMEM state was originally introduced in the commit
> > aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> > swap entry belongs to shmem during swapoff.
> >
> > However, swapoff has since been rewritten drastically in the commit
> > b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
> > having swap count == SWAP_MAP_SHMEM value is basically the same as having
> > swap count == 1, and swap_shmem_alloc() behaves analogously to
> > swap_duplicate()
> >
> > This RFC proposes the removal of this state and the associated helper to
> > simplify the state machine (both mentally and code-wise). We will also
> > have an extra state/special value that can be repurposed (for swap entries
> > that never gets re-duplicated).
> >
> > Another motivation (albeit a bit premature at the moment) is the new swap
> > abstraction I am currently working on, that would allow for swap/zswap
> > decoupling, swapoff optimization, etc. The fewer states and swap API
> > functions there are, the simpler the conversion will be.
> >
> > I am sending this series first as an RFC, just in case I missed something
> > or misunderstood this state, or if someone has a swap optimization in mind
> > for shmem that would require this special state.
>
> The idea makes sense to me. I did a quick test with shmem mTHP, and
> encountered the following warning which is triggered by
> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().
Apparently __swap_duplicate() does not currently handle increasing the
swap count for multiple swap entries by 1 (i.e. usage == 1) because it
does not handle rolling back count increases when
swap_count_continued() fails.
I guess this voids my Reviewed-by until we sort this out. Technically
swap_count_continued() won't ever be called for shmem because we only
ever increment the count by 1, but there is no way to know this in
__swap_duplicate() without SWAP_HAS_SHMEM.
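To illustrate the missing piece (a hypothetical sketch; increment_one() and
decrement_one() are made-up helpers, not kernel APIs): a batched increment
that can fail partway through would have to undo the entries it already
touched, and __swap_duplicate() has no such path:

	/* hypothetical: what a rollback-capable batched increment would need */
	for (i = 0; i < nr; i++) {
		if (!increment_one(si, offset + i)) {	/* may need a continuation page */
			while (i--)
				decrement_one(si, offset + i);	/* undo partial work */
			return -ENOMEM;
		}
	}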
>
> [ 81.064967] ------------[ cut here ]------------
> [ 81.064968] WARNING: CPU: 4 PID: 6852 at mm/swapfile.c:3623
> __swap_duplicate+0x1d0/0x2e0
> [ 81.064994] pstate: 23400005 (nzCv daif +PAN -UAO +TCO +DIT -SSBS
> BTYPE=--)
> [ 81.064995] pc : __swap_duplicate+0x1d0/0x2e0
> [ 81.064997] lr : swap_duplicate_nr+0x30/0x70
> [......]
> [ 81.065019] Call trace:
> [ 81.065019] __swap_duplicate+0x1d0/0x2e0
> [ 81.065021] swap_duplicate_nr+0x30/0x70
> [ 81.065022] shmem_writepage+0x24c/0x438
> [ 81.065024] pageout+0x104/0x2e0
> [ 81.065026] shrink_folio_list+0x7f0/0xe60
> [ 81.065027] reclaim_folio_list+0x90/0x178
> [ 81.065029] reclaim_pages+0x128/0x1a8
> [ 81.065030] madvise_cold_or_pageout_pte_range+0x80c/0xd10
> [ 81.065031] walk_pmd_range.isra.0+0x1b8/0x3a0
> [ 81.065033] walk_pud_range+0x120/0x1b0
> [ 81.065035] walk_pgd_range+0x150/0x1a8
> [ 81.065036] __walk_page_range+0xa4/0xb8
> [ 81.065038] walk_page_range+0x1c8/0x250
> [ 81.065039] madvise_pageout+0xf4/0x280
> [ 81.065041] madvise_vma_behavior+0x268/0x3f0
> [ 81.065042] madvise_walk_vmas.constprop.0+0xb8/0x128
> [ 81.065043] do_madvise.part.0+0xe8/0x2a0
> [ 81.065044] __arm64_sys_madvise+0x64/0x78
> [ 81.065046] invoke_syscall.constprop.0+0x54/0xe8
> [ 81.065048] do_el0_svc+0xa4/0xc0
> [ 81.065050] el0_svc+0x2c/0xb0
> [ 81.065052] el0t_64_sync_handler+0xb8/0xc0
> [ 81.065054] el0t_64_sync+0x14c/0x150
>
> > Swap experts, let me know if I'm mistaken :) Otherwise if there is no
> > objection I will resend this patch series again for merging.
> >
> > Nhat Pham (2):
> > swapfile: add a batched variant for swap_duplicate()
> > swap: shmem: remove SWAP_MAP_SHMEM
> >
> > include/linux/swap.h | 16 ++++++++--------
> > mm/shmem.c | 2 +-
> > mm/swapfile.c | 28 +++++++++-------------------
> > 3 files changed, 18 insertions(+), 28 deletions(-)
> >
> >
> > base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-24 2:15 ` Yosry Ahmed
@ 2024-09-24 3:25 ` Baolin Wang
2024-09-24 14:32 ` Nhat Pham
0 siblings, 1 reply; 29+ messages in thread
From: Baolin Wang @ 2024-09-24 3:25 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Nhat Pham, akpm, hannes, hughd, shakeel.butt, ryan.roberts,
ying.huang, chrisl, david, kasong, willy, viro, baohua,
chengming.zhou, linux-mm, kernel-team, linux-kernel
On 2024/9/24 10:15, Yosry Ahmed wrote:
> On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>>
>>
>> On 2024/9/24 07:11, Nhat Pham wrote:
>>> The SWAP_MAP_SHMEM state was originally introduced in the commit
>>> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
>>> swap entry belongs to shmem during swapoff.
>>>
>>> However, swapoff has since been rewritten drastically in the commit
>>> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
>>> having swap count == SWAP_MAP_SHMEM value is basically the same as having
>>> swap count == 1, and swap_shmem_alloc() behaves analogously to
>>> swap_duplicate()
>>>
>>> This RFC proposes the removal of this state and the associated helper to
>>> simplify the state machine (both mentally and code-wise). We will also
>>> have an extra state/special value that can be repurposed (for swap entries
>>> that never gets re-duplicated).
>>>
>>> Another motivation (albeit a bit premature at the moment) is the new swap
>>> abstraction I am currently working on, that would allow for swap/zswap
>>> decoupling, swapoff optimization, etc. The fewer states and swap API
>>> functions there are, the simpler the conversion will be.
>>>
>>> I am sending this series first as an RFC, just in case I missed something
>>> or misunderstood this state, or if someone has a swap optimization in mind
>>> for shmem that would require this special state.
>>
>> The idea makes sense to me. I did a quick test with shmem mTHP, and
>> encountered the following warning which is triggered by
>> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().
>
> Apparently __swap_duplicate() does not currently handle increasing the
> swap count for multiple swap entries by 1 (i.e. usage == 1) because it
> does not handle rolling back count increases when
> swap_count_continued() fails.
>
> I guess this voids my Reviewed-by until we sort this out. Technically
> swap_count_continued() won't ever be called for shmem because we only
> ever increment the count by 1, but there is no way to know this in
> __swap_duplicate() without SWAP_HAS_SHMEM.
Agreed. An easy solution might be to add a new boolean parameter
indicating that the count being increased belongs to a shmem swap entry?
diff --git a/mm/swapfile.c b/mm/swapfile.c
index cebc244ee60f..21f1eec2c30a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3607,7 +3607,7 @@ void si_swapinfo(struct sysinfo *val)
  * - swap-cache reference is requested but the entry is not used. -> ENOENT
  * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
  */
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
+static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr, bool shmem)
 {
 	struct swap_info_struct *si;
 	struct swap_cluster_info *ci;
@@ -3620,7 +3620,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 	offset = swp_offset(entry);
 	VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER);
-	VM_WARN_ON(usage == 1 && nr > 1);
+	VM_WARN_ON(usage == 1 && nr > 1 && !shmem);
 	ci = lock_cluster_or_swap_info(si, offset);

 	err = 0;
@@ -3661,7 +3661,7 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr)
 		has_cache = SWAP_HAS_CACHE;
 	else if ((count & ~COUNT_CONTINUED) < SWAP_MAP_MAX)
 		count += usage;
-	else if (swap_count_continued(si, offset + i, count))
+	else if (!shmem && swap_count_continued(si, offset + i, count))
 		count = COUNT_CONTINUED;
 	else {
 		/*
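A sketch of the callers on top of that flag (assembled from patch 1 plus
the diff above; keeping swap_shmem_alloc() as a thin wrapper is an
assumption about how the flag would be threaded through):

	int swap_duplicate_nr(swp_entry_t entry, int nr)
	{
		int err = 0;

		while (!err && __swap_duplicate(entry, 1, nr, false) == -ENOMEM)
			err = add_swap_count_continuation(entry, GFP_ATOMIC);
		return err;
	}

	void swap_shmem_alloc(swp_entry_t entry, int nr)
	{
		/* counts go 0 -> 1 per entry, so -ENOMEM cannot happen here */
		__swap_duplicate(entry, 1, nr, true);
	}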
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-24 3:25 ` Baolin Wang
@ 2024-09-24 14:32 ` Nhat Pham
2024-09-24 15:07 ` Yosry Ahmed
0 siblings, 1 reply; 29+ messages in thread
From: Nhat Pham @ 2024-09-24 14:32 UTC (permalink / raw)
To: Baolin Wang
Cc: Yosry Ahmed, akpm, hannes, hughd, shakeel.butt, ryan.roberts,
ying.huang, chrisl, david, kasong, willy, viro, baohua,
chengming.zhou, linux-mm, kernel-team, linux-kernel
On Mon, Sep 23, 2024 at 8:25 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
> On 2024/9/24 10:15, Yosry Ahmed wrote:
> > On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> >>
> >>
> >>
> >> On 2024/9/24 07:11, Nhat Pham wrote:
> >>> The SWAP_MAP_SHMEM state was originally introduced in the commit
> >>> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> >>> swap entry belongs to shmem during swapoff.
> >>>
> >>> However, swapoff has since been rewritten drastically in the commit
> >>> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
> >>> having swap count == SWAP_MAP_SHMEM value is basically the same as having
> >>> swap count == 1, and swap_shmem_alloc() behaves analogously to
> >>> swap_duplicate()
> >>>
> >>> This RFC proposes the removal of this state and the associated helper to
> >>> simplify the state machine (both mentally and code-wise). We will also
> >>> have an extra state/special value that can be repurposed (for swap entries
> >>> that never gets re-duplicated).
> >>>
> >>> Another motivation (albeit a bit premature at the moment) is the new swap
> >>> abstraction I am currently working on, that would allow for swap/zswap
> >>> decoupling, swapoff optimization, etc. The fewer states and swap API
> >>> functions there are, the simpler the conversion will be.
> >>>
> >>> I am sending this series first as an RFC, just in case I missed something
> >>> or misunderstood this state, or if someone has a swap optimization in mind
> >>> for shmem that would require this special state.
> >>
> >> The idea makes sense to me. I did a quick test with shmem mTHP, and
> >> encountered the following warning which is triggered by
> >> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().
> >
> > Apparently __swap_duplicate() does not currently handle increasing the
> > swap count for multiple swap entries by 1 (i.e. usage == 1) because it
> > does not handle rolling back count increases when
> > swap_count_continued() fails.
> >
> > I guess this voids my Reviewed-by until we sort this out. Technically
> > swap_count_continued() won't ever be called for shmem because we only
> > ever increment the count by 1, but there is no way to know this in
> > __swap_duplicate() without SWAP_HAS_SHMEM.
Ah this is my bad. I compiled with CONFIG_THP_SWAP, but forgot to
remove the swapfile check (that's another can of worms, but I need
data before submitting the patch to remove it...)
One thing we can do is instead of warning here, we can handle it in
the for loop check, where we have access to count - that's the point
of having that for-loop check anyway? :)
There's a couple of ways to go about it:
1. VM_WARN_ON(usage == 1 && nr > 1 && count != 0 );
(or more accurately, (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX))
2. Alternatively, instead of warning here, we can simply return
-ENOMEM. Then, at shmem callsite, have a VM_WARN_ON/VM_BUG_ON(), since
this MUST succeed.
Either solution should come with careful documentation to make the
expectations/guarantees of the new API clear.
Yosry, Baolin, how do you two feel about this? Would something like
this work? I need to test it first, but let me know if I'm missing
something.
If this does not work, we can do what Baolin is suggesting, and
perhaps maintain the swap_shmem_alloc() helper. It's less than ideal,
but at least we still lose a state...
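For option 2, a sketch of the call site (based on the shmem_writepage()
hunk in patch 2; the exact WARN placement is an assumption):

	shmem_recalc_inode(inode, 0, nr_pages);
	/* shmem counts go 0 -> 1, so a continuation (-ENOMEM) must never be needed */
	VM_WARN_ON_ONCE(swap_duplicate_nr(swap, nr_pages));
	shmem_delete_from_page_cache(folio, swp_to_radix_entry(swap));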
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-24 14:32 ` Nhat Pham
@ 2024-09-24 15:07 ` Yosry Ahmed
2024-09-24 15:48 ` Nhat Pham
0 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2024-09-24 15:07 UTC (permalink / raw)
To: Nhat Pham
Cc: Baolin Wang, akpm, hannes, hughd, shakeel.butt, ryan.roberts,
ying.huang, chrisl, david, kasong, willy, viro, baohua,
chengming.zhou, linux-mm, kernel-team, linux-kernel
On Tue, Sep 24, 2024 at 7:32 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Sep 23, 2024 at 8:25 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
> >
> >
> > On 2024/9/24 10:15, Yosry Ahmed wrote:
> > > On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang
> > > <baolin.wang@linux.alibaba.com> wrote:
> > >>
> > >>
> > >>
> > >> On 2024/9/24 07:11, Nhat Pham wrote:
> > >>> The SWAP_MAP_SHMEM state was originally introduced in the commit
> > >>> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> > >>> swap entry belongs to shmem during swapoff.
> > >>>
> > >>> However, swapoff has since been rewritten drastically in the commit
> > >>> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
> > >>> having swap count == SWAP_MAP_SHMEM value is basically the same as having
> > >>> swap count == 1, and swap_shmem_alloc() behaves analogously to
> > >>> swap_duplicate()
> > >>>
> > >>> This RFC proposes the removal of this state and the associated helper to
> > >>> simplify the state machine (both mentally and code-wise). We will also
> > >>> have an extra state/special value that can be repurposed (for swap entries
> > >>> that never gets re-duplicated).
> > >>>
> > >>> Another motivation (albeit a bit premature at the moment) is the new swap
> > >>> abstraction I am currently working on, that would allow for swap/zswap
> > >>> decoupling, swapoff optimization, etc. The fewer states and swap API
> > >>> functions there are, the simpler the conversion will be.
> > >>>
> > >>> I am sending this series first as an RFC, just in case I missed something
> > >>> or misunderstood this state, or if someone has a swap optimization in mind
> > >>> for shmem that would require this special state.
> > >>
> > >> The idea makes sense to me. I did a quick test with shmem mTHP, and
> > >> encountered the following warning which is triggered by
> > >> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().
> > >
> > > Apparently __swap_duplicate() does not currently handle increasing the
> > > swap count for multiple swap entries by 1 (i.e. usage == 1) because it
> > > does not handle rolling back count increases when
> > > swap_count_continued() fails.
> > >
> > > I guess this voids my Reviewed-by until we sort this out. Technically
> > > swap_count_continued() won't ever be called for shmem because we only
> > > ever increment the count by 1, but there is no way to know this in
> > > __swap_duplicate() without SWAP_HAS_SHMEM.
>
> Ah this is my bad. I compiled with CONFIG_THP_SWAP, but forgot to
> remove the swapfile check (that's another can of worms, but I need
> data before submitting the patch to remove it...)
>
> One thing we can do is instead of warning here, we can handle it in
> the for loop check, where we have access to count - that's the point
> of having that for-loop check anyway? :)
>
> There's a couple of ways to go about it:
>
> 1. VM_WARN_ON(usage == 1 && nr > 1 && count != 0 );
Hmm that should work, although it's a bit complicated tbh.
> (or more accurately, (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX))
I think this will make the warning very hard to hit if there's a
misuse of __swap_duplicate(). It will only be hit when an entry needs
count continuation, which I am not sure is very common. If there's a
bug, the warning will potentially catch it too late, if ever.
The side effect here is failing to decrement the swap count of some
swap entries which will lead to them never being freed, essentially
leaking swap capacity slowly over time. I am not sure if there are
more detrimental effects.
>
> 2. Alternatively, instead of warning here, we can simply return
> -ENOMEM. Then, at shmem callsite, have a VM_WARN_ON/VM_BUG_ON(), since
> this MUST succeed.
We still fail to rollback incremented counts though when we return
-ENOMEM, right? Maybe I didn't get what you mean.
>
> > Either solution should come with careful documentation to make the
> > expectations/guarantees of the new API clear.
>
> Yosry, Baolin, how do you two feel about this? Would something like
> this work? I need to test it first, but let me know if I'm missing
> something.
>
> If this does not work, we can do what Baolin is suggesting, and
> perhaps maintain the swap_shmem_alloc() helper. It's less than ideal,
> but at least we still lose a state...
Depending on the complexity tbh, right now removing SWAP_MAP_SHMEM is
just a cleanup with small wins, so if it's too complicated to remove
it it may not be worth it. I am assuming with your ongoing work, it
becomes much more valuable, so maybe if it's too complicated we can
defer it until the benefits are realizable?
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-24 15:07 ` Yosry Ahmed
@ 2024-09-24 15:48 ` Nhat Pham
2024-09-24 18:11 ` Yosry Ahmed
` (2 more replies)
0 siblings, 3 replies; 29+ messages in thread
From: Nhat Pham @ 2024-09-24 15:48 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Baolin Wang, akpm, hannes, hughd, shakeel.butt, ryan.roberts,
ying.huang, chrisl, david, kasong, willy, viro, baohua,
chengming.zhou, linux-mm, kernel-team, linux-kernel
On Tue, Sep 24, 2024 at 8:08 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Tue, Sep 24, 2024 at 7:32 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Mon, Sep 23, 2024 at 8:25 PM Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> > >
> > >
> > > On 2024/9/24 10:15, Yosry Ahmed wrote:
> > > > On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang
> > > > <baolin.wang@linux.alibaba.com> wrote:
> > > >>
> > > >>
> > > >>
> > > >> On 2024/9/24 07:11, Nhat Pham wrote:
> > > >>> The SWAP_MAP_SHMEM state was originally introduced in the commit
> > > >>> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> > > >>> swap entry belongs to shmem during swapoff.
> > > >>>
> > > >>> However, swapoff has since been rewritten drastically in the commit
> > > >>> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
> > > >>> having swap count == SWAP_MAP_SHMEM value is basically the same as having
> > > >>> swap count == 1, and swap_shmem_alloc() behaves analogously to
> > > >>> swap_duplicate()
> > > >>>
> > > >>> This RFC proposes the removal of this state and the associated helper to
> > > >>> simplify the state machine (both mentally and code-wise). We will also
> > > >>> have an extra state/special value that can be repurposed (for swap entries
> > > >>> that never gets re-duplicated).
> > > >>>
> > > >>> Another motivation (albeit a bit premature at the moment) is the new swap
> > > >>> abstraction I am currently working on, that would allow for swap/zswap
> > > >>> decoupling, swapoff optimization, etc. The fewer states and swap API
> > > >>> functions there are, the simpler the conversion will be.
> > > >>>
> > > >>> I am sending this series first as an RFC, just in case I missed something
> > > >>> or misunderstood this state, or if someone has a swap optimization in mind
> > > >>> for shmem that would require this special state.
> > > >>
> > > >> The idea makes sense to me. I did a quick test with shmem mTHP, and
> > > >> encountered the following warning which is triggered by
> > > >> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().
> > > >
> > > > Apparently __swap_duplicate() does not currently handle increasing the
> > > > swap count for multiple swap entries by 1 (i.e. usage == 1) because it
> > > > does not handle rolling back count increases when
> > > > swap_count_continued() fails.
> > > >
> > > > I guess this voids my Reviewed-by until we sort this out. Technically
> > > > swap_count_continued() won't ever be called for shmem because we only
> > > > ever increment the count by 1, but there is no way to know this in
> > > > __swap_duplicate() without SWAP_HAS_SHMEM.
> >
> > Ah this is my bad. I compiled with CONFIG_THP_SWAP, but forgot to
> > remove the swapfile check (that's another can of worms, but I need
> > data before submitting the patch to remove it...)
> >
> > One thing we can do is instead of warning here, we can handle it in
> > the for loop check, where we have access to count - that's the point
> > of having that for-loop check anyway? :)
> >
> > There's a couple of ways to go about it:
> >
> > 1. VM_WARN_ON(usage == 1 && nr > 1 && count != 0 );
>
> Hmm that should work, although it's a bit complicated tbh.
>
> > (or more accurately, (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX))
>
> I think this will make the warning very hard to hit if there's a
> misuse of __swap_duplicate(). It will only be hit when an entry needs
> count continuation, which I am not sure is very common. If there's a
> bug, the warning will potentially catch it too late, if ever.
>
> The side effect here is failing to decrement the swap count of some
> swap entries which will lead to them never being freed, essentially
> leaking swap capacity slowly over time. I am not sure if there are
> more detrimental effects.
>
> >
> > 2. Alternatively, instead of warning here, we can simply return
> > -ENOMEM. Then, at shmem callsite, have a VM_WARN_ON/VM_BUG_ON(), since
> > this MUST succeed.
>
> We still fail to rollback incremented counts though when we return
> -ENOMEM, right? Maybe I didn't get what you mean.
My understanding now is that there are two for loops. One for loop
that checks the entry's states, and one for loop that does the actual
incrementing work (or state modification).
We can check in the first for loop, if it is safe to proceed:
if (!count && !has_cache) {
	err = -ENOENT;
} else if (usage == SWAP_HAS_CACHE) {
	if (has_cache)
		err = -EEXIST;
} else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
	err = -EINVAL;
} else if (usage == 1 && nr > 1 && (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX) {
	/* the batched variants currently do not support rollback */
	err = -ENOMEM;
}
At this point, IIUC, we have not done any incrementing, so no rollback
needed? :)
>
> >
> > > Either solution should come with careful documentation to make the
> > > expectations/guarantees of the new API clear.
> >
> > Yosry, Baolin, how do you two feel about this? Would something like
> > this work? I need to test it first, but let me know if I'm missing
> > something.
> >
> > If this does not work, we can do what Baolin is suggesting, and
> > perhaps maintain the swap_shmem_alloc() helper. It's less than ideal,
> > but at least we still lose a state...
>
> Depending on the complexity tbh, right now removing SWAP_MAP_SHMEM is
> just a cleanup with small wins, so if it's too complicated to remove
> it it may not be worth it. I am assuming with your ongoing work, it
> becomes much more valuable, so maybe if it's too complicated we can
> defer it until the benefits are realizable?
I agree :)
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-24 15:48 ` Nhat Pham
@ 2024-09-24 18:11 ` Yosry Ahmed
2024-09-25 6:26 ` Barry Song
2024-09-25 1:53 ` Baolin Wang
2024-09-25 7:19 ` Huang, Ying
2 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2024-09-24 18:11 UTC (permalink / raw)
To: Nhat Pham, Barry Song
Cc: Baolin Wang, akpm, hannes, hughd, shakeel.butt, ryan.roberts,
ying.huang, chrisl, david, kasong, willy, viro, baohua,
chengming.zhou, linux-mm, kernel-team, linux-kernel
[..]
> > > > > Apparently __swap_duplicate() does not currently handle increasing the
> > > > > swap count for multiple swap entries by 1 (i.e. usage == 1) because it
> > > > > does not handle rolling back count increases when
> > > > > swap_count_continued() fails.
> > > > >
> > > > > I guess this voids my Reviewed-by until we sort this out. Technically
> > > > > swap_count_continued() won't ever be called for shmem because we only
> > > > > ever increment the count by 1, but there is no way to know this in
> > > > > __swap_duplicate() without SWAP_HAS_SHMEM.
> > >
> > > Ah this is my bad. I compiled with CONFIG_THP_SWAP, but forgot to
> > > remove the swapfile check (that's another can of worms, but I need
> > > data before submitting the patch to remove it...)
> > >
> > > One thing we can do is instead of warning here, we can handle it in
> > > the for loop check, where we have access to count - that's the point
> > > of having that for-loop check anyway? :)
> > >
> > > There's a couple of ways to go about it:
> > >
> > > 1. VM_WARN_ON(usage == 1 && nr > 1 && count != 0 );
> >
> > Hmm that should work, although it's a bit complicated tbh.
> >
> > > (or more accurately, (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX))
> >
> > I think this will make the warning very hard to hit if there's a
> > misuse of __swap_duplicate(). It will only be hit when an entry needs
> > count continuation, which I am not sure is very common. If there's a
> > bug, the warning will potentially catch it too late, if ever.
> >
> > The side effect here is failing to decrement the swap count of some
> > swap entries which will lead to them never being freed, essentially
> > leaking swap capacity slowly over time. I am not sure if there are
> > more detrimental effects.
> >
> > >
> > > 2. Alternatively, instead of warning here, we can simply return
> > > -ENOMEM. Then, at shmem callsite, have a VM_WARN_ON/VM_BUG_ON(), since
> > > this MUST succeed.
> >
> > We still fail to rollback incremented counts though when we return
> > -ENOMEM, right? Maybe I didn't get what you mean.
>
> My understanding now is that there are two for loops. One for loop
> that checks the entry's states, and one for loop that does the actual
> incrementing work (or state modification).
>
> We can check in the first for loop, if it is safe to proceed:
>
> if (!count && !has_cache) {
> err = -ENOENT;
> } else if (usage == SWAP_HAS_CACHE) {
> if (has_cache)
> err = -EEXIST;
> } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
> err = -EINVAL;
> } else if (usage == 1 && nr > 1 && (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX) {
> /* the batched variants currently do not support rollback */
> err = -ENOMEM;
> }
Hmm yeah I think something like this should work and is arguably
better than just warning, although this needs cleaning up:
- We already know usage != SWAP_HAS_CACHE, so no need to check if usage == 1.
- We already know (count & ~COUNT_CONTINUED) is not larger than
SWAP_MAP_MAX, so we should check if it's equal to SWAP_MAP_MAX.
- We should probably just calculate count & ~COUNT_CONTINUED above the
if conditions at this point.
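Putting those three points together, the check could read roughly like
this (a sketch only; map_count is an illustrative local, not a name from
the thread):

	unsigned char map_count = count & ~COUNT_CONTINUED;

	if (!count && !has_cache) {
		err = -ENOENT;
	} else if (usage == SWAP_HAS_CACHE) {
		if (has_cache)
			err = -EEXIST;
	} else if (map_count > SWAP_MAP_MAX) {
		err = -EINVAL;
	} else if (nr > 1 && map_count == SWAP_MAP_MAX) {
		/* batched increments cannot roll back a failed continuation */
		err = -ENOMEM;
	}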
I would also like to hear what Barry thinks since he added this (and I
just realized he is not CC'd).
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-24 18:11 ` Yosry Ahmed
@ 2024-09-25 6:26 ` Barry Song
2024-09-25 7:24 ` Huang, Ying
0 siblings, 1 reply; 29+ messages in thread
From: Barry Song @ 2024-09-25 6:26 UTC (permalink / raw)
To: Yosry Ahmed
Cc: Nhat Pham, Baolin Wang, akpm, hannes, hughd, shakeel.butt,
ryan.roberts, ying.huang, chrisl, david, kasong, willy, viro,
chengming.zhou, linux-mm, kernel-team, linux-kernel
On Wed, Sep 25, 2024 at 2:12 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> [..]
> > > > > > Apparently __swap_duplicate() does not currently handle increasing the
> > > > > > swap count for multiple swap entries by 1 (i.e. usage == 1) because it
> > > > > > does not handle rolling back count increases when
> > > > > > swap_count_continued() fails.
> > > > > >
> > > > > > I guess this voids my Reviewed-by until we sort this out. Technically
> > > > > > swap_count_continued() won't ever be called for shmem because we only
> > > > > > ever increment the count by 1, but there is no way to know this in
> > > > > > __swap_duplicate() without SWAP_HAS_SHMEM.
> > > >
> > > > Ah this is my bad. I compiled with CONFIG_THP_SWAP, but forgot to
> > > > remove the swapfile check (that's another can of worms, but I need
> > > > data before submitting the patch to remove it...)
> > > >
> > > > One thing we can do is instead of warning here, we can handle it in
> > > > the for loop check, where we have access to count - that's the point
> > > > of having that for-loop check anyway? :)
> > > >
> > > > There's a couple of ways to go about it:
> > > >
> > > > 1. VM_WARN_ON(usage == 1 && nr > 1 && count != 0 );
> > >
> > > Hmm that should work, although it's a bit complicated tbh.
> > >
> > > > (or more accurately, (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX))
> > >
> > > I think this will make the warning very hard to hit if there's a
> > > misuse of __swap_duplicate(). It will only be hit when an entry needs
> > > count continuation, which I am not sure is very common. If there's a
> > > bug, the warning will potentially catch it too late, if ever.
> > >
> > > The side effect here is failing to decrement the swap count of some
> > > swap entries which will lead to them never being freed, essentially
> > > leaking swap capacity slowly over time. I am not sure if there are
> > > more detrimental effects.
> > >
> > > >
> > > > 2. Alternatively, instead of warning here, we can simply return
> > > > -ENOMEM. Then, at shmem callsite, have a VM_WARN_ON/VM_BUG_ON(), since
> > > > this MUST succeed.
> > >
> > > We still fail to rollback incremented counts though when we return
> > > -ENOMEM, right? Maybe I didn't get what you mean.
> >
> > My understanding now is that there are two for loops. One for loop
> > that checks the entry's states, and one for loop that does the actual
> > incrementing work (or state modification).
> >
> > We can check in the first for loop, if it is safe to proceed:
> >
> > if (!count && !has_cache) {
> > err = -ENOENT;
> > } else if (usage == SWAP_HAS_CACHE) {
> > if (has_cache)
> > err = -EEXIST;
> > } else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
> > err = -EINVAL;
> > } else if (usage == 1 && nr > 1 && (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX) {
> > /* the batched variants currently do not support rollback */
> > err = -ENOMEM;
> > }
>
> Hmm yeah I think something like this should work and is arguably
> better than just warning, although this needs cleaning up:
> - We already know usage != SWAP_HAS_CACHE, so no need to check if usage == 1.
> - We already know (count & ~COUNT_CONTINUED) is larger than
> SWAP_MAP_MAX, so we should check if it's equal to SWAP_MAP_MAX.
> - We should probably just calculate count & ~COUNT_CONTINUED above the
> if conditions at this point.
>
> I would also like to hear what Barry thinks since he added this (and I
> just realized he is not CC'd).
I am perfectly fine with the approach. In the first loop, if we find that
none of the entries need CONTINUED, we can run the 2nd loop even for
usage == 1 and nr > 1. This is almost always true for a real product,
where anon folios are unlikely to be fork-shared by so many processes.

But we need to fall back to iterating nr times if this really happens:

int swap_duplicate_nr(swp_entry_t entry, int nr)
{
	....
	/* pseudocode: retry each of the nr entries one by one */
	if (nr > 1 && err == -ENOMEM) {
		for (each of the nr entries) {
			__swap_duplicate(entry, 1, 1);
			entry = next_entry;
		}
	}
}

Seems a bit ugly? Maybe we can keep swap_duplicate(swp_entry_t entry)
there, and then avoid __swap_duplicate(entry, 1, 1)?
Thanks
Barry
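Rendered as compilable C, the fallback Barry sketches might look like this
(an illustrative reading, not code from the thread; it assumes the batched
__swap_duplicate() reports -ENOMEM before modifying any entry, as
discussed above):

	int swap_duplicate_nr(swp_entry_t entry, int nr)
	{
		int err = 0;

		while (!err && __swap_duplicate(entry, 1, nr) == -ENOMEM) {
			if (nr == 1) {
				err = add_swap_count_continuation(entry, GFP_ATOMIC);
				continue;
			}
			/* some entry needs a continuation: degrade to per-entry
			 * calls, where an allocation failure needs no rollback */
			while (nr--) {
				err = swap_duplicate_nr(entry, 1);
				if (err)
					break;
				entry.val++;	/* advance to the next contiguous entry */
			}
			break;
		}
		return err;
	}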
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-25 6:26 ` Barry Song
@ 2024-09-25 7:24 ` Huang, Ying
2024-09-25 7:38 ` Barry Song
0 siblings, 1 reply; 29+ messages in thread
From: Huang, Ying @ 2024-09-25 7:24 UTC (permalink / raw)
To: Barry Song
Cc: Yosry Ahmed, Nhat Pham, Baolin Wang, akpm, hannes, hughd,
shakeel.butt, ryan.roberts, chrisl, david, kasong, willy, viro,
chengming.zhou, linux-mm, kernel-team, linux-kernel
Barry Song <21cnbao@gmail.com> writes:
[snip]
> I am perfectly fine with the approach, in the first loop, if we find all entries
> don't need CONTINUED, we can run the 2nd loop even for usage==1
> and nr > 1. this is almost always true for a real product where anon folios
> are unlikely to be fork-shared by so many processes.
One possible use case is ksm, where the map count could be large.
--
Best Regards,
Huang, Ying
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-25 7:24 ` Huang, Ying
@ 2024-09-25 7:38 ` Barry Song
0 siblings, 0 replies; 29+ messages in thread
From: Barry Song @ 2024-09-25 7:38 UTC (permalink / raw)
To: Huang, Ying
Cc: Yosry Ahmed, Nhat Pham, Baolin Wang, akpm, hannes, hughd,
shakeel.butt, ryan.roberts, chrisl, david, kasong, willy, viro,
chengming.zhou, linux-mm, kernel-team, linux-kernel
On Wed, Sep 25, 2024 at 3:28 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> [snip]
>
> > I am perfectly fine with the approach, in the first loop, if we find all entries
> > don't need CONTINUED, we can run the 2nd loop even for usage==1
> > and nr > 1. this is almost always true for a real product where anon folios
> > are unlikely to be fork-shared by so many processes.
>
> One possible use case is ksm, where the map count could be large.
Sorry, I overlooked the KSM case, but it seems this doesn't significantly
change the overall direction. :-)
Since we can fall back to swap_duplicate() and handle each entry one
by one after the first loop returns -ENOMEM, KSM isn't a concern
either.
>
> --
> Best Regards,
> Huang, Ying
Thanks
Barry
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-24 15:48 ` Nhat Pham
2024-09-24 18:11 ` Yosry Ahmed
@ 2024-09-25 1:53 ` Baolin Wang
2024-09-25 14:37 ` Nhat Pham
2024-09-25 7:19 ` Huang, Ying
2 siblings, 1 reply; 29+ messages in thread
From: Baolin Wang @ 2024-09-25 1:53 UTC (permalink / raw)
To: Nhat Pham, Yosry Ahmed
Cc: akpm, hannes, hughd, shakeel.butt, ryan.roberts, ying.huang,
chrisl, david, kasong, willy, viro, baohua, chengming.zhou,
linux-mm, kernel-team, linux-kernel
On 2024/9/24 23:48, Nhat Pham wrote:
> On Tue, Sep 24, 2024 at 8:08 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>>
>> On Tue, Sep 24, 2024 at 7:32 AM Nhat Pham <nphamcs@gmail.com> wrote:
>>>
>>> On Mon, Sep 23, 2024 at 8:25 PM Baolin Wang
>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>
>>>>
>>>> On 2024/9/24 10:15, Yosry Ahmed wrote:
>>>>> On Mon, Sep 23, 2024 at 6:55 PM Baolin Wang
>>>>> <baolin.wang@linux.alibaba.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 2024/9/24 07:11, Nhat Pham wrote:
>>>>>>> The SWAP_MAP_SHMEM state was originally introduced in the commit
>>>>>>> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
>>>>>>> swap entry belongs to shmem during swapoff.
>>>>>>>
>>>>>>> However, swapoff has since been rewritten drastically in the commit
>>>>>>> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
>>>>>>> having swap count == SWAP_MAP_SHMEM value is basically the same as having
>>>>>>> swap count == 1, and swap_shmem_alloc() behaves analogously to
>>>>>>> swap_duplicate()
>>>>>>>
>>>>>>> This RFC proposes the removal of this state and the associated helper to
>>>>>>> simplify the state machine (both mentally and code-wise). We will also
>>>>>>> have an extra state/special value that can be repurposed (for swap entries
>>>>>>> that never gets re-duplicated).
>>>>>>>
>>>>>>> Another motivation (albeit a bit premature at the moment) is the new swap
>>>>>>> abstraction I am currently working on, that would allow for swap/zswap
>>>>>>> decoupling, swapoff optimization, etc. The fewer states and swap API
>>>>>>> functions there are, the simpler the conversion will be.
>>>>>>>
>>>>>>> I am sending this series first as an RFC, just in case I missed something
>>>>>>> or misunderstood this state, or if someone has a swap optimization in mind
>>>>>>> for shmem that would require this special state.
>>>>>>
>>>>>> The idea makes sense to me. I did a quick test with shmem mTHP, and
>>>>>> encountered the following warning which is triggered by
>>>>>> 'VM_WARN_ON(usage == 1 && nr > 1)' in __swap_duplicate().
>>>>>
>>>>> Apparently __swap_duplicate() does not currently handle increasing the
>>>>> swap count for multiple swap entries by 1 (i.e. usage == 1) because it
>>>>> does not handle rolling back count increases when
>>>>> swap_count_continued() fails.
>>>>>
>>>>> I guess this voids my Reviewed-by until we sort this out. Technically
>>>>> swap_count_continued() won't ever be called for shmem because we only
>>>>> ever increment the count by 1, but there is no way to know this in
>>>>> __swap_duplicate() without SWAP_HAS_SHMEM.
>>>
>>> Ah this is my bad. I compiled with CONFIG_THP_SWAP, but forgot to
>>> remove the swapfile check (that's another can of worms, but I need
>>> data before submitting the patch to remove it...)
>>>
>>> One thing we can do is instead of warning here, we can handle it in
>>> the for loop check, where we have access to count - that's the point
>>> of having that for-loop check anyway? :)
>>>
>>> There's a couple of ways to go about it:
>>>
>>> 1. VM_WARN_ON(usage == 1 && nr > 1 && count != 0 );
>>
>> Hmm that should work, although it's a bit complicated tbh.
>>
>>> (or more accurately, (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX))
>>
>> I think this will make the warning very hard to hit if there's a
>> misuse of __swap_duplicate(). It will only be hit when an entry needs
>> count continuation, which I am not sure is very common. If there's a
>> bug, the warning will potentially catch it too late, if ever.
>>
>> The side effect here is failing to decrement the swap count of some
>> swap entries which will lead to them never being freed, essentially
>> leaking swap capacity slowly over time. I am not sure if there are
>> more detrimental effects.
>>
>>>
>>> 2. Alternatively, instead of warning here, we can simply return
>>> -ENOMEM. Then, at the shmem callsite, have a VM_WARN_ON/VM_BUG_ON(), since
>>> this MUST succeed.
>>
>> We still fail to rollback incremented counts though when we return
>> -ENOMEM, right? Maybe I didn't get what you mean.
>
> My understanding now is that there are two for loops. One for loop
> that checks the entry's states, and one for loop that does the actual
> incrementing work (or state modification).
>
> We can check in the first for loop, if it is safe to proceed:
>
> 	if (!count && !has_cache) {
> 		err = -ENOENT;
> 	} else if (usage == SWAP_HAS_CACHE) {
> 		if (has_cache)
> 			err = -EEXIST;
> 	} else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
> 		err = -EINVAL;
> 	} else if (usage == 1 && nr > 1 &&
> 		   (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX) {
> 		/* the batched variants currently do not support rollback */
> 		err = -ENOMEM;
> 	}
>
> At this point, IIUC, we have not done any incrementing, so no rollback
> needed? :)
Right, looks good (although it needs some of the cleanups pointed out by Yosry).
>>>
>>> Either solution should come with careful documentation to make the
>>> expectations/guarantees of the new API clear.
>>>
>>> Yosry, Baolin, how do you two feel about this? Would something like
>>> this work? I need to test it first, but let me know if I'm missing
>>> something.
>>>
>>> If this does not work, we can do what Baolin is suggesting, and
>>> perhaps maintain the swap_shmem_alloc() helper. It's less than ideal,
>>> but at least we still lose a state...
>>
>> Depending on the complexity tbh, right now removing SWAP_MAP_SHMEM is
>> just a cleanup with small wins, so if it's too complicated to remove
>> it, it may not be worth it. I am assuming with your ongoing work, it
>> becomes much more valuable, so maybe if it's too complicated we can
>> defer it until the benefits are realizable?
>
> I agree :)
One benefit I can mention is that removing 'SWAP_MAP_SHMEM' can help to
batch-free shmem swap entries in __swap_entries_free(), similar to what
commit bea67dcc5eea ("mm: attempt to batch free swap entries for
zap_pte_range()") did, which can improve the performance of shmem mTHP
munmap() based on my testing.

Without this patch set, I need to make the following changes to
batch-free shmem swap entries:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 0cded32414a1..94e28cd60c52 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -163,7 +163,7 @@ static bool swap_is_last_map(struct swap_info_struct *si,
 	unsigned char *map_end = map + nr_pages;
 	unsigned char count = *map;

-	if (swap_count(count) != 1)
+	if (swap_count(count) != 1 && swap_count(count) != SWAP_MAP_SHMEM)
 		return false;

 	while (++map < map_end) {
@@ -1503,10 +1503,10 @@ static bool __swap_entries_free(struct swap_info_struct *si,
 	unsigned int type = swp_type(entry);
 	struct swap_cluster_info *ci;
 	bool has_cache = false;
-	unsigned char count;
+	unsigned char count = swap_count(data_race(si->swap_map[offset]));
 	int i;

-	if (nr <= 1 || swap_count(data_race(si->swap_map[offset])) != 1)
+	if (nr <= 1 || (count != 1 && count != SWAP_MAP_SHMEM))
 		goto fallback;

 	/* cross into another cluster */
 	if (nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER)
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-25 1:53 ` Baolin Wang
@ 2024-09-25 14:37 ` Nhat Pham
2024-09-26 1:59 ` Huang, Ying
` (2 more replies)
0 siblings, 3 replies; 29+ messages in thread
From: Nhat Pham @ 2024-09-25 14:37 UTC (permalink / raw)
To: Baolin Wang
Cc: Yosry Ahmed, akpm, hannes, hughd, shakeel.butt, ryan.roberts,
ying.huang, chrisl, david, kasong, willy, viro, baohua,
chengming.zhou, linux-mm, kernel-team, linux-kernel
On Tue, Sep 24, 2024 at 6:53 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
> One benefit I can mention is that removing 'SWAP_MAP_SHMEM' can help to
> batch-free shmem swap entries in __swap_entries_free(), similar to what
> commit bea67dcc5eea ("mm: attempt to batch free swap entries for
> zap_pte_range()") did, which can improve the performance of shmem mTHP
> munmap() based on my testing.
Yeah, the problem with having an extraneous state is you have to
constantly check for it in code, and/or keep it in mind when you
develop things. I've been constantly having to check for this state
when I develop code around this area, and it gets old fast.
If we can use it to optimize something, I can understand keeping it.
But it just seems like dead code to me :)
My preference is to do this as simply as possible - add another case
(usage == 1, nr > 1, and we need to add swap continuations) in the
check in __swap_duplicate()'s first loop, and just WARN right there.
That case CANNOT happen UNLESS we introduce a bug, or have a new use
case. When we actually have a use case, we can always introduce
handling/fallback logic for that case.
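For concreteness, a minimal sketch of that extra case, appended to the
checking loop's existing if/else chain (the exact condition and warning
text here are illustrative, not a tested patch):

	} else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
		err = -EINVAL;
	} else if (usage == 1 && nr > 1 &&
		   (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX) {
		/*
		 * The batched path cannot roll back counts that were
		 * already incremented, so refuse the whole range before
		 * the second loop modifies anything.
		 */
		VM_WARN_ONCE(1, "batched swap_duplicate needs continuation");
		err = -ENOMEM;
	}

Because this runs in the checking loop, no count has been touched yet
when we bail out.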
Barry, Yosry, Baolin, Ying, how do you feel about this?
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-25 14:37 ` Nhat Pham
@ 2024-09-26 1:59 ` Huang, Ying
2024-09-26 3:30 ` Baolin Wang
2024-09-26 3:59 ` Barry Song
2024-09-26 4:00 ` Barry Song
2 siblings, 1 reply; 29+ messages in thread
From: Huang, Ying @ 2024-09-26 1:59 UTC (permalink / raw)
To: Nhat Pham
Cc: Baolin Wang, Yosry Ahmed, akpm, hannes, hughd, shakeel.butt,
ryan.roberts, chrisl, david, kasong, willy, viro, baohua,
chengming.zhou, linux-mm, kernel-team, linux-kernel
Nhat Pham <nphamcs@gmail.com> writes:
> On Tue, Sep 24, 2024 at 6:53 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>>
>>
>> One benefit I can mention is that removing 'SWAP_MAP_SHMEM' can help to
>> batch-free shmem swap entries in __swap_entries_free(), similar to what
>> commit bea67dcc5eea ("mm: attempt to batch free swap entries for
>> zap_pte_range()") did, which can improve the performance of shmem mTHP
>> munmap() based on my testing.
>
> Yeah, the problem with having an extraneous state is you have to
> constantly check for it in code, and/or keep it in mind when you
> develop things. I've been constantly having to check for this state
> when I develop code around this area, and it gets old fast.
>
> If we can use it to optimize something, I can understand keeping it.
> But it just seems like dead code to me :)
>
> My preference is to do this as simply as possible - add another case
> (usage == 1, nr > 1, and we need to add swap continuations) in the
> check in __swap_duplicate()'s first loop, and just WARN right there.
>
> That case CANNOT happen UNLESS we introduce a bug, or have a new use
> case. When we actually have a use case, we can always introduce
> handling/fallback logic for that case.
>
> Barry, Yosry, Baolin, Ying, how do you feel about this?
Sounds good to me to just WARN now. We can always improve when it's
necessary.
--
Best Regards,
Huang, Ying
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-26 1:59 ` Huang, Ying
@ 2024-09-26 3:30 ` Baolin Wang
0 siblings, 0 replies; 29+ messages in thread
From: Baolin Wang @ 2024-09-26 3:30 UTC (permalink / raw)
To: Huang, Ying, Nhat Pham
Cc: Yosry Ahmed, akpm, hannes, hughd, shakeel.butt, ryan.roberts,
chrisl, david, kasong, willy, viro, baohua, chengming.zhou,
linux-mm, kernel-team, linux-kernel
On 2024/9/26 09:59, Huang, Ying wrote:
> Nhat Pham <nphamcs@gmail.com> writes:
>
>> On Tue, Sep 24, 2024 at 6:53 PM Baolin Wang
>> <baolin.wang@linux.alibaba.com> wrote:
>>>
>>>
>>> One benefit I can mention is that removing 'SWAP_MAP_SHMEM' can help to
>>> batch-free shmem swap entries in __swap_entries_free(), similar to what
>>> commit bea67dcc5eea ("mm: attempt to batch free swap entries for
>>> zap_pte_range()") did, which can improve the performance of shmem mTHP
>>> munmap() based on my testing.
>>
>> Yeah, the problem with having an extraneous state is you have to
>> constantly check for it in code, and/or keep it in mind when you
>> develop things. I've been constantly having to check for this state
>> when I develop code around this area, and it gets old fast.
>>
>> If we can use it to optimize something, I can understand keeping it.
>> But it just seems like dead code to me :)
>>
>> My preference is to do this as simply as possible - add another case
>> (usage == 1, nr > 1, and we need to add swap continuations) in the
>> check in __swap_duplicate()'s first loop, and just WARN right there.
>>
>> That case CANNOT happen UNLESS we introduce a bug, or have a new use
>> case. When we actually have a use case, we can always introduce
>> handling/fallback logic for that case.
>>
>> Barry, Yosry, Baolin, Ying, how do you feel about this?
>
> Sounds good to me to just WARN now. We can always improve when it's
> necessary.
+1. Agreed.
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-25 14:37 ` Nhat Pham
2024-09-26 1:59 ` Huang, Ying
@ 2024-09-26 3:59 ` Barry Song
2024-09-26 22:50 ` Nhat Pham
2024-09-26 4:00 ` Barry Song
2 siblings, 1 reply; 29+ messages in thread
From: Barry Song @ 2024-09-26 3:59 UTC (permalink / raw)
To: Nhat Pham
Cc: Baolin Wang, Yosry Ahmed, akpm, hannes, hughd, shakeel.butt,
ryan.roberts, ying.huang, chrisl, david, kasong, willy, viro,
chengming.zhou, linux-mm, kernel-team, linux-kernel
On Thu, Sep 26, 2024 at 2:37 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Sep 24, 2024 at 6:53 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
> >
> >
> > One benefit I can mention is that removing 'SWAP_MAP_SHMEM' can help to
> > batch-free shmem swap entries in __swap_entries_free(), similar to what
> > commit bea67dcc5eea ("mm: attempt to batch free swap entries for
> > zap_pte_range()") did, which can improve the performance of shmem mTHP
> > munmap() based on my testing.
>
> Yeah, the problem with having an extraneous state is you have to
> constantly check for it in code, and/or keep it in mind when you
> develop things. I've been constantly having to check for this state
> when I develop code around this area, and it gets old fast.
>
> If we can use it to optimize something, I can understand keeping it.
> But it just seems like dead code to me :)
>
> My preference is to do this as simply as possible - add another case
> (usage == 1, nr > 1, and we need to add swap continuations) in the
> check in __swap_duplicate()'s first loop, and just WARN right there.
>
> That case CANNOT happen UNLESS we introduce a bug, or have a new use
> case. When we actually have a use case, we can always introduce
> handling/fallback logic for that case.
>
> Barry, Yosry, Baolin, Ying, how do you feel about this?
I’m not entirely clear on your point. If your proposal is to support the
case where usage == 1 and nr > 1 only when we don’t require
CONTINUED, and to issue a warning once we determine that
CONTINUED is needed, then I’m completely on board with that
approach.
It seems that your intention is to simply relocate the existing warning
to the scenario where CONTINUED is actually required, rather than
maintaining a warning for the case where usage == 1 and nr > 1 at
all times?
I wasn't actually suggesting a rollback as you posted:
	err = __swap_duplicate(entry, 1, nr);
	if (err == -ENOMEM) {
		/* fallback to non-batched version */
		for (i = 0; i < nr; i++) {
			cur_entry = (swp_entry_t){entry.val + i};
			if (swap_duplicate(cur_entry)) {
				/* rollback */
				while (--i >= 0) {
					cur_entry = (swp_entry_t){entry.val + i};
					swap_free(cur_entry);
				}
I suggested checking for all errors except -ENOMEM in the first loop. If all
conditions indicate that only CONTINUED is necessary (with no other
issues like ENOENT, EEXIST, etc., for any entry), we can proceed
with a for loop for swap_duplicate(), eliminating the need for a rollback.
However, I agree that we can proceed with that only when there is actually
a user involved. (There's no need to address it right now.)
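To make that concrete, a rough sketch of the check-then-duplicate
shape (the validation helper is hypothetical and would have to walk
the range without modifying any count; locals as in the snippets
above):

	/* validate every entry up front; no counts are touched yet */
	err = __swap_duplicate_check(entry, nr);	/* made-up helper */
	if (err)
		return err;

	/* only continuation allocation can fail from here on */
	for (i = 0; i < nr; i++) {
		cur_entry = (swp_entry_t){entry.val + i};
		err = swap_duplicate(cur_entry);
		if (err)
			return err;	/* -ENOMEM under memory pressure */
	}
	return 0;

Whether a mid-range -ENOMEM would still need the earlier increments
unwound is a separate question; the sketch only mirrors the suggestion
above.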
Thanks
Barry
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-26 3:59 ` Barry Song
@ 2024-09-26 22:50 ` Nhat Pham
0 siblings, 0 replies; 29+ messages in thread
From: Nhat Pham @ 2024-09-26 22:50 UTC (permalink / raw)
To: Barry Song
Cc: Baolin Wang, Yosry Ahmed, akpm, hannes, hughd, shakeel.butt,
ryan.roberts, ying.huang, chrisl, david, kasong, willy, viro,
chengming.zhou, linux-mm, kernel-team, linux-kernel
On Wed, Sep 25, 2024 at 8:59 PM Barry Song <baohua@kernel.org> wrote:
>
> On Thu, Sep 26, 2024 at 2:37 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> I’m not entirely clear on your point. If your proposal is to support the
> case where usage == 1 and nr > 1 only when we don’t require
> CONTINUED, and to issue a warning once we determine that
> CONTINUED is needed, then I’m completely on board with that
> approach.
>
> It seems that your intention is to simply relocate the existing warning
> to the scenario where CONTINUED is actually required, rather than
> maintaining a warning for the case where usage == 1 and nr > 1 at
> all times?
Ohhh yeah we definitely agreed on intentions, but I think I
misunderstood your request :) The code below was an attempt to satisfy
that request...
Please ignore it. I'll submit an actual patch taking into account our
discussions :) Hopefully I won't forget to actually test with thp
swaps this time...
>
> I wasn't actually suggesting a rollback as you posted:
> 	err = __swap_duplicate(entry, 1, nr);
> 	if (err == -ENOMEM) {
> 		/* fallback to non-batched version */
> 		for (i = 0; i < nr; i++) {
> 			cur_entry = (swp_entry_t){entry.val + i};
> 			if (swap_duplicate(cur_entry)) {
> 				/* rollback */
> 				while (--i >= 0) {
> 					cur_entry = (swp_entry_t){entry.val + i};
> 					swap_free(cur_entry);
> 				}
>
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-24 15:48 ` Nhat Pham
2024-09-24 18:11 ` Yosry Ahmed
2024-09-25 1:53 ` Baolin Wang
@ 2024-09-25 7:19 ` Huang, Ying
2024-09-25 7:32 ` Barry Song
2 siblings, 1 reply; 29+ messages in thread
From: Huang, Ying @ 2024-09-25 7:19 UTC (permalink / raw)
To: Nhat Pham
Cc: Yosry Ahmed, Baolin Wang, akpm, hannes, hughd, shakeel.butt,
ryan.roberts, chrisl, david, kasong, willy, viro, baohua,
chengming.zhou, linux-mm, kernel-team, linux-kernel
Nhat Pham <nphamcs@gmail.com> writes:
[snip]
>
> My understanding now is that there are two for loops. One for loop
> that checks the entry's states, and one for loop that does the actual
> incrementing work (or state modification).
>
> We can check in the first for loop, if it is safe to proceed:
>
> 	if (!count && !has_cache) {
> 		err = -ENOENT;
> 	} else if (usage == SWAP_HAS_CACHE) {
> 		if (has_cache)
> 			err = -EEXIST;
> 	} else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
> 		err = -EINVAL;
> 	} else if (usage == 1 && nr > 1 &&
> 		   (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX) {
> 		/* the batched variants currently do not support rollback */
> 		err = -ENOMEM;
> 	}
>
> At this point, IIUC, we have not done any incrementing, so no rollback
> needed? :)
I think it's better to add a VM_WARN_ONCE() here. If someone enables
'nr > 1' for __swap_duplicate(), the issue will be more explicit.
[snip]
--
Best Regards,
Huang, Ying
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-25 7:19 ` Huang, Ying
@ 2024-09-25 7:32 ` Barry Song
2024-09-25 14:21 ` Nhat Pham
0 siblings, 1 reply; 29+ messages in thread
From: Barry Song @ 2024-09-25 7:32 UTC (permalink / raw)
To: Huang, Ying
Cc: Nhat Pham, Yosry Ahmed, Baolin Wang, akpm, hannes, hughd,
shakeel.butt, ryan.roberts, chrisl, david, kasong, willy, viro,
chengming.zhou, linux-mm, kernel-team, linux-kernel
On Wed, Sep 25, 2024 at 3:23 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Nhat Pham <nphamcs@gmail.com> writes:
>
> [snip]
>
> >
> > My understanding now is that there are two for loops. One for loop
> > that checks the entry's states, and one for loop that does the actual
> > incrementing work (or state modification).
> >
> > We can check in the first for loop, if it is safe to proceed:
> >
> > 	if (!count && !has_cache) {
> > 		err = -ENOENT;
> > 	} else if (usage == SWAP_HAS_CACHE) {
> > 		if (has_cache)
> > 			err = -EEXIST;
> > 	} else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
> > 		err = -EINVAL;
> > 	} else if (usage == 1 && nr > 1 &&
> > 		   (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX) {
> > 		/* the batched variants currently do not support rollback */
> > 		err = -ENOMEM;
> > 	}
> >
> > At this point, IIUC, we have not done any incrementing, so no rollback
> > needed? :)
>
> I think it's better to add a VM_WARN_ONCE() here. If someone enables
> 'nr > 1' for __swap_duplicate(), the issue will be more explicit.
Ying, I guess you missed that this is the exact case where Nhat is
enabling 'nr > 1' for __swap_duplicate(), and this warning is
functioning. He is trying to support the nr > 1 case.
https://lore.kernel.org/linux-mm/20240923231142.4155415-2-nphamcs@gmail.com/
>
> [snip]
>
> --
> Best Regards,
> Huang, Ying
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-25 7:32 ` Barry Song
@ 2024-09-25 14:21 ` Nhat Pham
2024-09-25 14:24 ` Nhat Pham
2024-09-25 14:28 ` Nhat Pham
0 siblings, 2 replies; 29+ messages in thread
From: Nhat Pham @ 2024-09-25 14:21 UTC (permalink / raw)
To: Barry Song
Cc: Huang, Ying, Yosry Ahmed, Baolin Wang, akpm, hannes, hughd,
shakeel.butt, ryan.roberts, chrisl, david, kasong, willy, viro,
chengming.zhou, linux-mm, kernel-team, linux-kernel
On Wed, Sep 25, 2024 at 12:33 AM Barry Song <baohua@kernel.org> wrote:
>
> On Wed, Sep 25, 2024 at 3:23 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Nhat Pham <nphamcs@gmail.com> writes:
> >
> > [snip]
> >
> > >
> > > My understanding now is that there are two for loops. One for loop
> > > that checks the entry's states, and one for loop that does the actual
> > > incrementing work (or state modification).
> > >
> > > We can check in the first for loop, if it is safe to proceed:
> > >
> > > 	if (!count && !has_cache) {
> > > 		err = -ENOENT;
> > > 	} else if (usage == SWAP_HAS_CACHE) {
> > > 		if (has_cache)
> > > 			err = -EEXIST;
> > > 	} else if ((count & ~COUNT_CONTINUED) > SWAP_MAP_MAX) {
> > > 		err = -EINVAL;
> > > 	} else if (usage == 1 && nr > 1 &&
> > > 		   (count & ~COUNT_CONTINUED) >= SWAP_MAP_MAX) {
> > > 		/* the batched variants currently do not support rollback */
> > > 		err = -ENOMEM;
> > > 	}
> > >
> > > At this point, IIUC, we have not done any incrementing, so no rollback
> > > needed? :)
> >
> > I think it's better to add a VM_WARN_ONCE() here. If someone enables
> > 'nr > 1' for __swap_duplicate(), the issue will be more explicit.
>
> Ying, I guess you missed that this is the exact case where Nhat is
> enabling 'nr > 1' for __swap_duplicate(), and this warning is
> functioning. He is trying to support the nr > 1 case.
>
> https://lore.kernel.org/linux-mm/20240923231142.4155415-2-nphamcs@gmail.com/
I'm only supporting the case nr > 1 when there is no need to add swap
continuations :) That's the only use case right now (shmem) :)
1. Keep the non-batched variant:

int swap_duplicate(swp_entry_t entry)
{
	int err = 0;

	while (!err && __swap_duplicate(entry, 1, 1) == -ENOMEM)
		err = add_swap_count_continuation(entry, GFP_ATOMIC);
	return err;
}

2. Implement the batched variant:

int swap_duplicate_nr(swp_entry_t entry, int nr)
{
	swp_entry_t cur_entry;
	int i, err;

	if (nr == 1)
		return swap_duplicate(entry);

	err = __swap_duplicate(entry, 1, nr);
	if (err == -ENOMEM) {
		/* fallback to non-batched version */
		for (i = 0; i < nr; i++) {
			cur_entry = (swp_entry_t){entry.val + i};
			if (swap_duplicate(cur_entry)) {
				/* rollback */
				while (--i >= 0) {
					cur_entry = (swp_entry_t){entry.val + i};
					swap_free(cur_entry);
				}
			}
		}
	}
	return err;
}
How does this look? My concern is that there is not really a use for
the fallback logic. Basically dead code.
I can keep it in if you guys have a use for it soon, but otherwise I
lean towards just adding a WARN etc. there, or return -ENOMEM, and
WARN at shmem's callsite (because it cannot get -ENOMEM).
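For reference, the shmem callsite version might look roughly like this
(a sketch only; the surrounding shmem_writepage() context is elided
and the exact call is an assumption):

	int err = swap_duplicate_nr(swap, nr);

	/*
	 * shmem duplicates each entry at most once, so a continuation
	 * failure (-ENOMEM) here would indicate a bug rather than a
	 * recoverable condition.
	 */
	VM_WARN_ON_ONCE(err);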
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-25 14:21 ` Nhat Pham
@ 2024-09-25 14:24 ` Nhat Pham
2024-09-25 14:28 ` Nhat Pham
1 sibling, 0 replies; 29+ messages in thread
From: Nhat Pham @ 2024-09-25 14:24 UTC (permalink / raw)
To: Barry Song
Cc: Huang, Ying, Yosry Ahmed, Baolin Wang, akpm, hannes, hughd,
shakeel.butt, ryan.roberts, chrisl, david, kasong, willy, viro,
chengming.zhou, linux-mm, kernel-team, linux-kernel
On Wed, Sep 25, 2024 at 7:21 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
>
> I'm only supporting the case nr > 1 when there is no need to add swap
> continuations :) That's the only use case right now (shmem) :)
Sorry, I forgot to say - but to fully support a batched variant, we
can do something like this:
>
> 1. Keep the non-batched variant:
>
> int swap_duplicate(swp_entry_t entry)
> {
> 	int err = 0;
>
> 	while (!err && __swap_duplicate(entry, 1, 1) == -ENOMEM)
> 		err = add_swap_count_continuation(entry, GFP_ATOMIC);
> 	return err;
> }
>
> 2. Implement the batched variant:
>
> int swap_duplicate_nr(swp_entry_t entry, int nr)
> {
> 	swp_entry_t cur_entry;
> 	int i, err;
>
> 	if (nr == 1)
> 		return swap_duplicate(entry);
>
> 	err = __swap_duplicate(entry, 1, nr);
> 	if (err == -ENOMEM) {
> 		/* fallback to non-batched version */
> 		for (i = 0; i < nr; i++) {
> 			cur_entry = (swp_entry_t){entry.val + i};
> 			if (swap_duplicate(cur_entry)) {
> 				/* rollback */
> 				while (--i >= 0) {
> 					cur_entry = (swp_entry_t){entry.val + i};
> 					swap_free(cur_entry);
> 				}

missing a "return err;" here. Not my best idea to write (pseudo) code
before caffeine in the morning :)

> 			}
> 		}
> 	}
> 	return err;
> }
>
> How does this look? My concern is that there is not really a use for
> the fallback logic. Basically dead code.
>
> I can keep it in if you guys have a use for it soon, but otherwise I
> lean towards just adding a WARN etc. there, or return -ENOMEM, and
> WARN at shmem's callsite (because it cannot get -ENOMEM).
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-25 14:21 ` Nhat Pham
2024-09-25 14:24 ` Nhat Pham
@ 2024-09-25 14:28 ` Nhat Pham
1 sibling, 0 replies; 29+ messages in thread
From: Nhat Pham @ 2024-09-25 14:28 UTC (permalink / raw)
To: Barry Song
Cc: Huang, Ying, Yosry Ahmed, Baolin Wang, akpm, hannes, hughd,
shakeel.butt, ryan.roberts, chrisl, david, kasong, willy, viro,
chengming.zhou, linux-mm, kernel-team, linux-kernel
On Wed, Sep 25, 2024 at 7:21 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Wed, Sep 25, 2024 at 12:33 AM Barry Song <baohua@kernel.org> wrote:
> How does this look? My concern is that there is not really a use for
> the fallback logic. Basically dead code.
>
> I can keep it in if you guys have a use for it soon, but otherwise I
> lean towards just adding a WARN etc. there, or return -ENOMEM, and
> WARN at shmem's callsite (because it cannot get -ENOMEM).
Oh and another point - I plan to rewrite this swap entry lifetime
logic in the new abstraction layer. The swap continuation part will go
away with it - I don't quite like the way we're currently doing
things. So this added complexity might be unnecessary.
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-23 23:11 [RFC PATCH 0/2] remove SWAP_MAP_SHMEM Nhat Pham
` (3 preceding siblings ...)
2024-09-24 1:55 ` Baolin Wang
@ 2024-09-24 20:15 ` Chris Li
2024-09-24 21:30 ` Yosry Ahmed
4 siblings, 1 reply; 29+ messages in thread
From: Chris Li @ 2024-09-24 20:15 UTC (permalink / raw)
To: Nhat Pham
Cc: akpm, hannes, yosryahmed, hughd, shakeel.butt, ryan.roberts,
ying.huang, david, kasong, willy, viro, baohua, chengming.zhou,
linux-mm, kernel-team, linux-kernel
Hi Nhat,
On Mon, Sep 23, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> The SWAP_MAP_SHMEM state was originally introduced in the commit
> aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> swap entry belongs to shmem during swapoff.
>
> However, swapoff has since been rewritten drastically in the commit
> b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
> having swap count == SWAP_MAP_SHMEM value is basically the same as having
> swap count == 1, and swap_shmem_alloc() behaves analogously to
> swap_duplicate().
>
> This RFC proposes the removal of this state and the associated helper to
> simplify the state machine (both mentally and code-wise). We will also
> have an extra state/special value that can be repurposed (for swap entries
> that never get re-duplicated).
Please help me understand. After this patch, the shmem swap entry
will also use swap continuation as the only way to count shmem swap
entries, am I missing something?
That seems to have some performance hit for shmem, because unlike
anonymous memory, shmem can easily have a very high swap count. I
would expect some performance hit for the heavily shared swap count
usage case. Do you have any test numbers for high shared swap count
usage showing that the removal is negligible?
Also, if you remove SWAP_MAP_SHMEM, wouldn't you need to remove the
counter in shmem as well? Because the shmem counter would no longer be
connected to the swap count.
Huge, do you have any feedback on this change?
Chris
> Another motivation (albeit a bit premature at the moment) is the new swap
> abstraction I am currently working on, that would allow for swap/zswap
> decoupling, swapoff optimization, etc. The fewer states and swap API
> functions there are, the simpler the conversion will be.
>
> I am sending this series first as an RFC, just in case I missed something
> or misunderstood this state, or if someone has a swap optimization in mind
> for shmem that would require this special state.
>
> Swap experts, let me know if I'm mistaken :) Otherwise if there is no
> objection I will resend this patch series again for merging.
>
> Nhat Pham (2):
> swapfile: add a batched variant for swap_duplicate()
> swap: shmem: remove SWAP_MAP_SHMEM
>
> include/linux/swap.h | 16 ++++++++--------
> mm/shmem.c | 2 +-
> mm/swapfile.c | 28 +++++++++-------------------
> 3 files changed, 18 insertions(+), 28 deletions(-)
>
>
> base-commit: acfabf7e197f7a5bedf4749dac1f39551417b049
> --
> 2.43.5
* Re: [RFC PATCH 0/2] remove SWAP_MAP_SHMEM
2024-09-24 20:15 ` Chris Li
@ 2024-09-24 21:30 ` Yosry Ahmed
0 siblings, 0 replies; 29+ messages in thread
From: Yosry Ahmed @ 2024-09-24 21:30 UTC (permalink / raw)
To: Chris Li
Cc: Nhat Pham, akpm, hannes, hughd, shakeel.butt, ryan.roberts,
ying.huang, david, kasong, willy, viro, baohua, chengming.zhou,
linux-mm, kernel-team, linux-kernel
On Tue, Sep 24, 2024 at 1:16 PM Chris Li <chrisl@kernel.org> wrote:
>
> Hi Nhat,
>
> On Mon, Sep 23, 2024 at 4:11 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > The SWAP_MAP_SHMEM state was originally introduced in the commit
> > aaa468653b4a ("swap_info: note SWAP_MAP_SHMEM"), to quickly determine if a
> > swap entry belongs to shmem during swapoff.
> >
> > However, swapoff has since been rewritten drastically in the commit
> > b56a2d8af914 ("mm: rid swapoff of quadratic complexity"). Now
> > having swap count == SWAP_MAP_SHMEM value is basically the same as having
> > swap count == 1, and swap_shmem_alloc() behaves analogously to
> > swap_duplicate().
> >
> > This RFC proposes the removal of this state and the associated helper to
> > simplify the state machine (both mentally and code-wise). We will also
> > have an extra state/special value that can be repurposed (for swap entries
> > that never get re-duplicated).
>
> Please help me understand. After this patch, the shmem swap entry
> will also use swap continuation as the only way to count shmem swap
> entries, am I missing something?
>
> That seems to have some performance hit for shmem, because unlike
> anonymous memory, shmem can easily have a very high swap count. I
> would expect some performance hit for the heavily shared swap count
> usage case. Do you have any test numbers for high shared swap count
> usage showing that the removal is negligible?
Shmem only calls swap_duplicate() once in shmem_writepage() when we
add the swap entry to the page cache. We do not increment the swap
count once for each user, as we do for anon memory. IOW, the page cache
is the only user of the shmem swap entry, so when we remove SWAP_MAP_SHMEM
the swap count of shmem pages will either be 0 or 1 (disregarding
SWAP_HAS_CACHE). So the swap continuation code is not used here.
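To spell that lifecycle out, a condensed sketch (call sites
approximated from the description above, not the literal kernel code):

	/*
	 * Simplified shmem swap-entry lifetime:
	 *
	 *   shmem_writepage():  swap_duplicate(swap);  count 0 -> 1
	 *                       (the page cache holds the only reference)
	 *   swapin / truncate:  swap_free(swap);       count 1 -> 0
	 *
	 * The count never exceeds 1, so it stays far below SWAP_MAP_MAX
	 * and swap continuation can never be needed for shmem.
	 */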
>
> Also, if you remove SWAP_MAP_SHMEM, wouldn't you need to remove the
> counter in shmem as well? Because the shmem counter would no longer be
> connected to the swap count.
Not sure what you mean here.
>
> Huge, do you have any feedback on this change?
Hugh*
>
> Chris