linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Revert "mm, swap: avoid redundant swap device pinning"
@ 2025-11-09 18:06 Kairui Song via B4 Relay
  2025-11-10  1:00 ` Greg KH
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Kairui Song via B4 Relay @ 2025-11-09 18:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Chris Li, Johannes Weiner, Yosry Ahmed, Chengming Zhou,
	Youngjun Park, Kairui Song, linux-kernel, stable, Kairui Song

From: Kairui Song <kasong@tencent.com>

This reverts commit 78524b05f1a3e16a5d00cc9c6259c41a9d6003ce.

While reviewing recent leaf entry changes, I noticed that commit
78524b05f1a3 ("mm, swap: avoid redundant swap device pinning") isn't
correct. It's true that most all callers of __read_swap_cache_async are
already holding a swap entry reference, so the repeated swap device
pinning isn't needed on the same swap device, but it is possible that
VMA readahead (swap_vma_readahead()) may encounter swap entries from a
different swap device when there are multiple swap devices, and call
__read_swap_cache_async without holding a reference to that swap device.

So it is possible to cause a UAF if swapoff of device A raced with
swapin on device B, and VMA readahead tries to read swap entries from
device A. It's not easy to trigger but in theory possible to cause real
issues. And besides, that commit made swap more vulnerable to issues
like corrupted page tables.

Just revert it. __read_swap_cache_async isn't that sensitive to
performance after all, as it's mostly used for SSD/HDD swap devices with
readahead. SYNCHRONOUS_IO devices may fallback onto it for swap count >
1 entries, but very soon we will have a new helper and routine for
such devices, so they will never touch this helper or have redundant
swap device reference overhead.

Fixes: 78524b05f1a3 ("mm, swap: avoid redundant swap device pinning")
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swap_state.c | 14 ++++++--------
 mm/zswap.c      |  8 +-------
 2 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3f85a1c4cfd9..0c25675de977 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -406,13 +406,17 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
 		bool skip_if_exists)
 {
-	struct swap_info_struct *si = __swap_entry_to_info(entry);
+	struct swap_info_struct *si;
 	struct folio *folio;
 	struct folio *new_folio = NULL;
 	struct folio *result = NULL;
 	void *shadow = NULL;
 
 	*new_page_allocated = false;
+	si = get_swap_device(entry);
+	if (!si)
+		return NULL;
+
 	for (;;) {
 		int err;
 
@@ -499,6 +503,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 	put_swap_folio(new_folio, entry);
 	folio_unlock(new_folio);
 put_and_return:
+	put_swap_device(si);
 	if (!(*new_page_allocated) && new_folio)
 		folio_put(new_folio);
 	return result;
@@ -518,16 +523,11 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		struct vm_area_struct *vma, unsigned long addr,
 		struct swap_iocb **plug)
 {
-	struct swap_info_struct *si;
 	bool page_allocated;
 	struct mempolicy *mpol;
 	pgoff_t ilx;
 	struct folio *folio;
 
-	si = get_swap_device(entry);
-	if (!si)
-		return NULL;
-
 	mpol = get_vma_policy(vma, addr, 0, &ilx);
 	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
 					&page_allocated, false);
@@ -535,8 +535,6 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 
 	if (page_allocated)
 		swap_read_folio(folio, plug);
-
-	put_swap_device(si);
 	return folio;
 }
 
diff --git a/mm/zswap.c b/mm/zswap.c
index 5d0f8b13a958..aefe71fd160c 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1005,18 +1005,12 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	struct folio *folio;
 	struct mempolicy *mpol;
 	bool folio_was_allocated;
-	struct swap_info_struct *si;
 	int ret = 0;
 
 	/* try to allocate swap cache folio */
-	si = get_swap_device(swpentry);
-	if (!si)
-		return -EEXIST;
-
 	mpol = get_task_policy(current);
 	folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
-			NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
-	put_swap_device(si);
+				NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
 	if (!folio)
 		return -ENOMEM;
 

---
base-commit: 02dafa01ec9a00c3758c1c6478d82fe601f5f1ba
change-id: 20251109-revert-78524b05f1a3-04a1295bef8a

Best regards,
-- 
Kairui Song <kasong@tencent.com>




^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Revert "mm, swap: avoid redundant swap device pinning"
  2025-11-09 18:06 [PATCH] Revert "mm, swap: avoid redundant swap device pinning" Kairui Song via B4 Relay
@ 2025-11-10  1:00 ` Greg KH
  2025-11-10  5:33   ` Kairui Song
  2025-11-10  1:56 ` Huang, Ying
  2025-11-14 15:18 ` Kairui Song
  2 siblings, 1 reply; 10+ messages in thread
From: Greg KH @ 2025-11-10  1:00 UTC (permalink / raw)
  To: kasong
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Chris Li, Johannes Weiner, Yosry Ahmed,
	Chengming Zhou, Youngjun Park, Kairui Song, linux-kernel, stable

On Mon, Nov 10, 2025 at 02:06:03AM +0800, Kairui Song via B4 Relay wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> This reverts commit 78524b05f1a3e16a5d00cc9c6259c41a9d6003ce.
> 
> While reviewing recent leaf entry changes, I noticed that commit
> 78524b05f1a3 ("mm, swap: avoid redundant swap device pinning") isn't
> correct. It's true that most all callers of __read_swap_cache_async are
> already holding a swap entry reference, so the repeated swap device
> pinning isn't needed on the same swap device, but it is possible that
> VMA readahead (swap_vma_readahead()) may encounter swap entries from a
> different swap device when there are multiple swap devices, and call
> __read_swap_cache_async without holding a reference to that swap device.
> 
> So it is possible to cause a UAF if swapoff of device A raced with
> swapin on device B, and VMA readahead tries to read swap entries from
> device A. It's not easy to trigger but in theory possible to cause real
> issues. And besides, that commit made swap more vulnerable to issues
> like corrupted page tables.
> 
> Just revert it. __read_swap_cache_async isn't that sensitive to
> performance after all, as it's mostly used for SSD/HDD swap devices with
> readahead. SYNCHRONOUS_IO devices may fallback onto it for swap count >
> 1 entries, but very soon we will have a new helper and routine for
> such devices, so they will never touch this helper or have redundant
> swap device reference overhead.
> 
> Fixes: 78524b05f1a3 ("mm, swap: avoid redundant swap device pinning")
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>  mm/swap_state.c | 14 ++++++--------
>  mm/zswap.c      |  8 +-------
>  2 files changed, 7 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3f85a1c4cfd9..0c25675de977 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -406,13 +406,17 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  		struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
>  		bool skip_if_exists)
>  {
> -	struct swap_info_struct *si = __swap_entry_to_info(entry);
> +	struct swap_info_struct *si;
>  	struct folio *folio;
>  	struct folio *new_folio = NULL;
>  	struct folio *result = NULL;
>  	void *shadow = NULL;
>  
>  	*new_page_allocated = false;
> +	si = get_swap_device(entry);
> +	if (!si)
> +		return NULL;
> +
>  	for (;;) {
>  		int err;
>  
> @@ -499,6 +503,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  	put_swap_folio(new_folio, entry);
>  	folio_unlock(new_folio);
>  put_and_return:
> +	put_swap_device(si);
>  	if (!(*new_page_allocated) && new_folio)
>  		folio_put(new_folio);
>  	return result;
> @@ -518,16 +523,11 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  		struct vm_area_struct *vma, unsigned long addr,
>  		struct swap_iocb **plug)
>  {
> -	struct swap_info_struct *si;
>  	bool page_allocated;
>  	struct mempolicy *mpol;
>  	pgoff_t ilx;
>  	struct folio *folio;
>  
> -	si = get_swap_device(entry);
> -	if (!si)
> -		return NULL;
> -
>  	mpol = get_vma_policy(vma, addr, 0, &ilx);
>  	folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
>  					&page_allocated, false);
> @@ -535,8 +535,6 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  
>  	if (page_allocated)
>  		swap_read_folio(folio, plug);
> -
> -	put_swap_device(si);
>  	return folio;
>  }
>  
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 5d0f8b13a958..aefe71fd160c 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1005,18 +1005,12 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
>  	struct folio *folio;
>  	struct mempolicy *mpol;
>  	bool folio_was_allocated;
> -	struct swap_info_struct *si;
>  	int ret = 0;
>  
>  	/* try to allocate swap cache folio */
> -	si = get_swap_device(swpentry);
> -	if (!si)
> -		return -EEXIST;
> -
>  	mpol = get_task_policy(current);
>  	folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
> -			NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
> -	put_swap_device(si);
> +				NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
>  	if (!folio)
>  		return -ENOMEM;
>  
> 
> ---
> base-commit: 02dafa01ec9a00c3758c1c6478d82fe601f5f1ba
> change-id: 20251109-revert-78524b05f1a3-04a1295bef8a
> 
> Best regards,
> -- 
> Kairui Song <kasong@tencent.com>
> 
> 
> 

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Revert "mm, swap: avoid redundant swap device pinning"
  2025-11-09 18:06 [PATCH] Revert "mm, swap: avoid redundant swap device pinning" Kairui Song via B4 Relay
  2025-11-10  1:00 ` Greg KH
@ 2025-11-10  1:56 ` Huang, Ying
  2025-11-10  5:32   ` Kairui Song
  2025-11-14 15:18 ` Kairui Song
  2 siblings, 1 reply; 10+ messages in thread
From: Huang, Ying @ 2025-11-10  1:56 UTC (permalink / raw)
  To: Kairui Song via B4 Relay
  Cc: linux-mm, kasong, Andrew Morton, Kemeng Shi, Nhat Pham,
	Baoquan He, Barry Song, Chris Li, Johannes Weiner, Yosry Ahmed,
	Chengming Zhou, Youngjun Park, Kairui Song, linux-kernel, stable,
	Lorenzo Stoakes

Hi, Kairui,

Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> writes:

> From: Kairui Song <kasong@tencent.com>
>
> This reverts commit 78524b05f1a3e16a5d00cc9c6259c41a9d6003ce.
>
> While reviewing recent leaf entry changes, I noticed that commit
> 78524b05f1a3 ("mm, swap: avoid redundant swap device pinning") isn't
> correct. It's true that most all callers of __read_swap_cache_async are
> already holding a swap entry reference, so the repeated swap device
> pinning isn't needed on the same swap device, but it is possible that
> VMA readahead (swap_vma_readahead()) may encounter swap entries from a
> different swap device when there are multiple swap devices, and call
> __read_swap_cache_async without holding a reference to that swap device.
>
> So it is possible to cause a UAF if swapoff of device A raced with
> swapin on device B, and VMA readahead tries to read swap entries from
> device A. It's not easy to trigger but in theory possible to cause real
> issues. And besides, that commit made swap more vulnerable to issues
> like corrupted page tables.
>
> Just revert it. __read_swap_cache_async isn't that sensitive to
> performance after all, as it's mostly used for SSD/HDD swap devices with
> readahead. SYNCHRONOUS_IO devices may fallback onto it for swap count >
> 1 entries, but very soon we will have a new helper and routine for
> such devices, so they will never touch this helper or have redundant
> swap device reference overhead.

Is it better to add get_swap_device() in swap_vma_readahead()?  Whenever
we get a swap entry, the first thing we need to do is call
get_swap_device() to check the validity of the swap entry and prevent
the backing swap device from going under us.  This helps us to avoid
checking the validity of the swap entry in every swap function.  Does
this sound reasonable?

> Fixes: 78524b05f1a3 ("mm, swap: avoid redundant swap device pinning")
> Signed-off-by: Kairui Song <kasong@tencent.com>

[snip]

---
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Revert "mm, swap: avoid redundant swap device pinning"
  2025-11-10  1:56 ` Huang, Ying
@ 2025-11-10  5:32   ` Kairui Song
  2025-11-10 10:50     ` Huang, Ying
  0 siblings, 1 reply; 10+ messages in thread
From: Kairui Song @ 2025-11-10  5:32 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Kairui Song via B4 Relay, linux-mm, Andrew Morton, Kemeng Shi,
	Nhat Pham, Baoquan He, Barry Song, Chris Li, Johannes Weiner,
	Yosry Ahmed, Chengming Zhou, Youngjun Park, linux-kernel, stable,
	Lorenzo Stoakes

On Mon, Nov 10, 2025 at 9:56 AM Huang, Ying
<ying.huang@linux.alibaba.com> wrote:
>
> Hi, Kairui,
>
> Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> writes:
>
> > From: Kairui Song <kasong@tencent.com>
> >
> > This reverts commit 78524b05f1a3e16a5d00cc9c6259c41a9d6003ce.
> >
> > While reviewing recent leaf entry changes, I noticed that commit
> > 78524b05f1a3 ("mm, swap: avoid redundant swap device pinning") isn't
> > correct. It's true that most all callers of __read_swap_cache_async are
> > already holding a swap entry reference, so the repeated swap device
> > pinning isn't needed on the same swap device, but it is possible that
> > VMA readahead (swap_vma_readahead()) may encounter swap entries from a
> > different swap device when there are multiple swap devices, and call
> > __read_swap_cache_async without holding a reference to that swap device.
> >
> > So it is possible to cause a UAF if swapoff of device A raced with
> > swapin on device B, and VMA readahead tries to read swap entries from
> > device A. It's not easy to trigger but in theory possible to cause real
> > issues. And besides, that commit made swap more vulnerable to issues
> > like corrupted page tables.
> >
> > Just revert it. __read_swap_cache_async isn't that sensitive to
> > performance after all, as it's mostly used for SSD/HDD swap devices with
> > readahead. SYNCHRONOUS_IO devices may fallback onto it for swap count >
> > 1 entries, but very soon we will have a new helper and routine for
> > such devices, so they will never touch this helper or have redundant
> > swap device reference overhead.
>
> Is it better to add get_swap_device() in swap_vma_readahead()?  Whenever
> we get a swap entry, the first thing we need to do is call
> get_swap_device() to check the validity of the swap entry and prevent
> the backing swap device from going under us.  This helps us to avoid
> checking the validity of the swap entry in every swap function.  Does
> this sound reasonable?

Hi Ying, thanks for the suggestion!

Yes, that's also a feasible approach.

What I was thinking is that, currently except the readahead path, all
swapin entry goes through the get_swap_device() helper, that helper
also helps to mitigate swap entry corruption that may causes OOB or
NULL deref. Although I think it's really not that helpful at all to
mitigate page table corruption from the kernel side, but seems not a
really bad idea to have.

And the code is simpler this way, and seems more suitable for a stable
& mainline fix. If we want  to add get_swap_device() in
swap_vma_readahead(), we need to do that for every entry that doesn't
match the target entry's swap device. The reference overhead is
trivial compared to readhead and bio layer, and only non
SYNCHRONOUS_IO devices use this helper (madvise is a special case, we
may optimize that later). ZRAM may fallback to the readahead path but
this fallback will be eliminated very soon in swap table p2.

Another approach I thought about is that we might want readahead to
stop when it sees entries from a different swap device. That swap
device might be ZRAM where VMA readahead is not helpful.

How do you think?


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Revert "mm, swap: avoid redundant swap device pinning"
  2025-11-10  1:00 ` Greg KH
@ 2025-11-10  5:33   ` Kairui Song
  0 siblings, 0 replies; 10+ messages in thread
From: Kairui Song @ 2025-11-10  5:33 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-mm, Andrew Morton, Kemeng Shi, Nhat Pham, Baoquan He,
	Barry Song, Chris Li, Johannes Weiner, Yosry Ahmed,
	Chengming Zhou, Youngjun Park, LKML, stable

Greg KH <gregkh@linuxfoundation.org> 于 2025年11月10日周一 09:01写道:
>
> On Mon, Nov 10, 2025 at 02:06:03AM +0800, Kairui Song via B4 Relay wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > This reverts commit 78524b05f1a3e16a5d00cc9c6259c41a9d6003ce.
> >
> > While reviewing recent leaf entry changes, I noticed that commit
> > 78524b05f1a3 ("mm, swap: avoid redundant swap device pinning") isn't
> > correct. It's true that most all callers of __read_swap_cache_async are
> > already holding a swap entry reference, so the repeated swap device
> > pinning isn't needed on the same swap device, but it is possible that
> > VMA readahead (swap_vma_readahead()) may encounter swap entries from a
> > different swap device when there are multiple swap devices, and call
> > __read_swap_cache_async without holding a reference to that swap device.
> >
> > So it is possible to cause a UAF if swapoff of device A raced with
> > swapin on device B, and VMA readahead tries to read swap entries from
> > device A. It's not easy to trigger but in theory possible to cause real
> > issues. And besides, that commit made swap more vulnerable to issues
> > like corrupted page tables.
> >
> > Just revert it. __read_swap_cache_async isn't that sensitive to
> > performance after all, as it's mostly used for SSD/HDD swap devices with
> > readahead. SYNCHRONOUS_IO devices may fallback onto it for swap count >
> > 1 entries, but very soon we will have a new helper and routine for
> > such devices, so they will never touch this helper or have redundant
> > swap device reference overhead.
> >
> > Fixes: 78524b05f1a3 ("mm, swap: avoid redundant swap device pinning")
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >  mm/swap_state.c | 14 ++++++--------
> >  mm/zswap.c      |  8 +-------
> >  2 files changed, 7 insertions(+), 15 deletions(-)
> >
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 3f85a1c4cfd9..0c25675de977 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -406,13 +406,17 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> >               struct mempolicy *mpol, pgoff_t ilx, bool *new_page_allocated,
> >               bool skip_if_exists)
> >  {
> > -     struct swap_info_struct *si = __swap_entry_to_info(entry);
> > +     struct swap_info_struct *si;
> >       struct folio *folio;
> >       struct folio *new_folio = NULL;
> >       struct folio *result = NULL;
> >       void *shadow = NULL;
> >
> >       *new_page_allocated = false;
> > +     si = get_swap_device(entry);
> > +     if (!si)
> > +             return NULL;
> > +
> >       for (;;) {
> >               int err;
> >
> > @@ -499,6 +503,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> >       put_swap_folio(new_folio, entry);
> >       folio_unlock(new_folio);
> >  put_and_return:
> > +     put_swap_device(si);
> >       if (!(*new_page_allocated) && new_folio)
> >               folio_put(new_folio);
> >       return result;
> > @@ -518,16 +523,11 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> >               struct vm_area_struct *vma, unsigned long addr,
> >               struct swap_iocb **plug)
> >  {
> > -     struct swap_info_struct *si;
> >       bool page_allocated;
> >       struct mempolicy *mpol;
> >       pgoff_t ilx;
> >       struct folio *folio;
> >
> > -     si = get_swap_device(entry);
> > -     if (!si)
> > -             return NULL;
> > -
> >       mpol = get_vma_policy(vma, addr, 0, &ilx);
> >       folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
> >                                       &page_allocated, false);
> > @@ -535,8 +535,6 @@ struct folio *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> >
> >       if (page_allocated)
> >               swap_read_folio(folio, plug);
> > -
> > -     put_swap_device(si);
> >       return folio;
> >  }
> >
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 5d0f8b13a958..aefe71fd160c 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -1005,18 +1005,12 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
> >       struct folio *folio;
> >       struct mempolicy *mpol;
> >       bool folio_was_allocated;
> > -     struct swap_info_struct *si;
> >       int ret = 0;
> >
> >       /* try to allocate swap cache folio */
> > -     si = get_swap_device(swpentry);
> > -     if (!si)
> > -             return -EEXIST;
> > -
> >       mpol = get_task_policy(current);
> >       folio = __read_swap_cache_async(swpentry, GFP_KERNEL, mpol,
> > -                     NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
> > -     put_swap_device(si);
> > +                             NO_INTERLEAVE_INDEX, &folio_was_allocated, true);
> >       if (!folio)
> >               return -ENOMEM;
> >
> >
> > ---
> > base-commit: 02dafa01ec9a00c3758c1c6478d82fe601f5f1ba
> > change-id: 20251109-revert-78524b05f1a3-04a1295bef8a
> >
> > Best regards,
> > --
> > Kairui Song <kasong@tencent.com>
> >
> >
> >
>
> <formletter>
>
> This is not the correct way to submit patches for inclusion in the
> stable kernel tree.  Please read:
>     https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
> for how to do this properly.
>
> </formletter>

Thanks for the info, my bad, I was trying new tools to send patches so
the Cc tags were missing, will fix it. This patch is meant to be
merged into the mainline first.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Revert "mm, swap: avoid redundant swap device pinning"
  2025-11-10  5:32   ` Kairui Song
@ 2025-11-10 10:50     ` Huang, Ying
  2025-11-10 11:37       ` Kairui Song
  0 siblings, 1 reply; 10+ messages in thread
From: Huang, Ying @ 2025-11-10 10:50 UTC (permalink / raw)
  To: Kairui Song
  Cc: Kairui Song via B4 Relay, linux-mm, Andrew Morton, Kemeng Shi,
	Nhat Pham, Baoquan He, Barry Song, Chris Li, Johannes Weiner,
	Yosry Ahmed, Chengming Zhou, Youngjun Park, linux-kernel, stable,
	Lorenzo Stoakes

Kairui Song <ryncsn@gmail.com> writes:

> On Mon, Nov 10, 2025 at 9:56 AM Huang, Ying
> <ying.huang@linux.alibaba.com> wrote:
>>
>> Hi, Kairui,
>>
>> Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> writes:
>>
>> > From: Kairui Song <kasong@tencent.com>
>> >
>> > This reverts commit 78524b05f1a3e16a5d00cc9c6259c41a9d6003ce.
>> >
>> > While reviewing recent leaf entry changes, I noticed that commit
>> > 78524b05f1a3 ("mm, swap: avoid redundant swap device pinning") isn't
>> > correct. It's true that most all callers of __read_swap_cache_async are
>> > already holding a swap entry reference, so the repeated swap device
>> > pinning isn't needed on the same swap device, but it is possible that
>> > VMA readahead (swap_vma_readahead()) may encounter swap entries from a
>> > different swap device when there are multiple swap devices, and call
>> > __read_swap_cache_async without holding a reference to that swap device.
>> >
>> > So it is possible to cause a UAF if swapoff of device A raced with
>> > swapin on device B, and VMA readahead tries to read swap entries from
>> > device A. It's not easy to trigger but in theory possible to cause real
>> > issues. And besides, that commit made swap more vulnerable to issues
>> > like corrupted page tables.
>> >
>> > Just revert it. __read_swap_cache_async isn't that sensitive to
>> > performance after all, as it's mostly used for SSD/HDD swap devices with
>> > readahead. SYNCHRONOUS_IO devices may fallback onto it for swap count >
>> > 1 entries, but very soon we will have a new helper and routine for
>> > such devices, so they will never touch this helper or have redundant
>> > swap device reference overhead.
>>
>> Is it better to add get_swap_device() in swap_vma_readahead()?  Whenever
>> we get a swap entry, the first thing we need to do is call
>> get_swap_device() to check the validity of the swap entry and prevent
>> the backing swap device from going under us.  This helps us to avoid
>> checking the validity of the swap entry in every swap function.  Does
>> this sound reasonable?
>
> Hi Ying, thanks for the suggestion!
>
> Yes, that's also a feasible approach.
>
> What I was thinking is that, currently except the readahead path, all
> swapin entry goes through the get_swap_device() helper, that helper
> also helps to mitigate swap entry corruption that may causes OOB or
> NULL deref. Although I think it's really not that helpful at all to
> mitigate page table corruption from the kernel side, but seems not a
> really bad idea to have.
>
> And the code is simpler this way, and seems more suitable for a stable
> & mainline fix. If we want  to add get_swap_device() in
> swap_vma_readahead(), we need to do that for every entry that doesn't
> match the target entry's swap device. The reference overhead is
> trivial compared to readhead and bio layer, and only non
> SYNCHRONOUS_IO devices use this helper (madvise is a special case, we
> may optimize that later). ZRAM may fallback to the readahead path but
> this fallback will be eliminated very soon in swap table p2.

We have 2 choices in general.

1. Add get/put_swap_device() in every swap function.

2. Add get/put_swap_device() in every caller of the swap functions.

Personally, I prefer 2.  It works better in situations like calling
multiple swap functions.  It can reduce duplicated references.  It helps
improve code reasoning and readability.

> Another approach I thought about is that we might want readahead to
> stop when it sees entries from a different swap device. That swap
> device might be ZRAM where VMA readahead is not helpful.
>
> How do you think?

One possible solution is to skip or stop for a swap entry from the
SYNCHRONOUS_IO swap device.

---
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Revert "mm, swap: avoid redundant swap device pinning"
  2025-11-10 10:50     ` Huang, Ying
@ 2025-11-10 11:37       ` Kairui Song
  2025-11-10 12:33         ` Kairui Song
  2025-11-11  6:48         ` Huang, Ying
  0 siblings, 2 replies; 10+ messages in thread
From: Kairui Song @ 2025-11-10 11:37 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Kairui Song via B4 Relay, linux-mm, Andrew Morton, Kemeng Shi,
	Nhat Pham, Baoquan He, Barry Song, Chris Li, Johannes Weiner,
	Yosry Ahmed, Chengming Zhou, Youngjun Park, linux-kernel, stable,
	Lorenzo Stoakes

On Mon, Nov 10, 2025 at 6:50 PM Huang, Ying
<ying.huang@linux.alibaba.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > On Mon, Nov 10, 2025 at 9:56 AM Huang, Ying
> > <ying.huang@linux.alibaba.com> wrote:
> >>
> >> Hi, Kairui,
> >>
> >> Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> writes:
> >>
> >> > From: Kairui Song <kasong@tencent.com>
> >> >
> >> > This reverts commit 78524b05f1a3e16a5d00cc9c6259c41a9d6003ce.
> >> >
> >> > While reviewing recent leaf entry changes, I noticed that commit
> >> > 78524b05f1a3 ("mm, swap: avoid redundant swap device pinning") isn't
> >> > correct. It's true that most all callers of __read_swap_cache_async are
> >> > already holding a swap entry reference, so the repeated swap device
> >> > pinning isn't needed on the same swap device, but it is possible that
> >> > VMA readahead (swap_vma_readahead()) may encounter swap entries from a
> >> > different swap device when there are multiple swap devices, and call
> >> > __read_swap_cache_async without holding a reference to that swap device.
> >> >
> >> > So it is possible to cause a UAF if swapoff of device A raced with
> >> > swapin on device B, and VMA readahead tries to read swap entries from
> >> > device A. It's not easy to trigger but in theory possible to cause real
> >> > issues. And besides, that commit made swap more vulnerable to issues
> >> > like corrupted page tables.
> >> >
> >> > Just revert it. __read_swap_cache_async isn't that sensitive to
> >> > performance after all, as it's mostly used for SSD/HDD swap devices with
> >> > readahead. SYNCHRONOUS_IO devices may fallback onto it for swap count >
> >> > 1 entries, but very soon we will have a new helper and routine for
> >> > such devices, so they will never touch this helper or have redundant
> >> > swap device reference overhead.
> >>
> >> Is it better to add get_swap_device() in swap_vma_readahead()?  Whenever
> >> we get a swap entry, the first thing we need to do is call
> >> get_swap_device() to check the validity of the swap entry and prevent
> >> the backing swap device from going under us.  This helps us to avoid
> >> checking the validity of the swap entry in every swap function.  Does
> >> this sound reasonable?
> >
> > Hi Ying, thanks for the suggestion!
> >
> > Yes, that's also a feasible approach.
> >
> > What I was thinking is that, currently except the readahead path, all
> > swapin entry goes through the get_swap_device() helper, that helper
> > also helps to mitigate swap entry corruption that may causes OOB or
> > NULL deref. Although I think it's really not that helpful at all to
> > mitigate page table corruption from the kernel side, but seems not a
> > really bad idea to have.
> >
> > And the code is simpler this way, and seems more suitable for a stable
> > & mainline fix. If we want  to add get_swap_device() in
> > swap_vma_readahead(), we need to do that for every entry that doesn't
> > match the target entry's swap device. The reference overhead is
> > trivial compared to readhead and bio layer, and only non
> > SYNCHRONOUS_IO devices use this helper (madvise is a special case, we
> > may optimize that later). ZRAM may fallback to the readahead path but
> > this fallback will be eliminated very soon in swap table p2.
>
> We have 2 choices in general.
>
> 1. Add get/put_swap_device() in every swap function.
>
> 2. Add get/put_swap_device() in every caller of the swap functions.
>
> Personally, I prefer 2.  It works better in situations like calling
> multiple swap functions.  It can reduce duplicated references.  It helps
> improve code reasoning and readability.

Totally agree, that's exactly what the recently added kerneldoc is
suggesting, caller of the swap function will need to use refcount or
lock to protect the swap device.

I'm not suggesting to add get/put in every function, just thinking
that maybe reverting it can have some nice side effects.

But anyway, this fix should also be good:

diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3f85a1c4cfd9..4cca4865627f 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -747,6 +747,7 @@ static struct folio
*swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,

        blk_start_plug(&plug);
        for (addr = start; addr < end; ilx++, addr += PAGE_SIZE) {
+               struct swap_info_struct *si = NULL;
                leaf_entry_t entry;

                if (!pte++) {
@@ -761,8 +762,12 @@ static struct folio
*swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
                        continue;
                pte_unmap(pte);
                pte = NULL;
+               if (swp_type(entry) != swp_type(targ_entry))
+                       si = get_swap_device(entry);
                folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
                                                &page_allocated, false);
+               if (si)
+                       put_swap_device(si);
                if (!folio)
                        continue;
                if (page_allocated) {

I'll post a patch if it looks ok.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Revert "mm, swap: avoid redundant swap device pinning"
  2025-11-10 11:37       ` Kairui Song
@ 2025-11-10 12:33         ` Kairui Song
  2025-11-11  6:48         ` Huang, Ying
  1 sibling, 0 replies; 10+ messages in thread
From: Kairui Song @ 2025-11-10 12:33 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Kairui Song via B4 Relay, linux-mm, Andrew Morton, Kemeng Shi,
	Nhat Pham, Baoquan He, Barry Song, Chris Li, Johannes Weiner,
	Yosry Ahmed, Chengming Zhou, Youngjun Park, linux-kernel, stable,
	Lorenzo Stoakes

On Mon, Nov 10, 2025 at 7:37 PM Kairui Song <ryncsn@gmail.com> wrote:
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3f85a1c4cfd9..4cca4865627f 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -747,6 +747,7 @@ static struct folio
> *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>
>         blk_start_plug(&plug);
>         for (addr = start; addr < end; ilx++, addr += PAGE_SIZE) {
> +               struct swap_info_struct *si = NULL;
>                 leaf_entry_t entry;
>
>                 if (!pte++) {
> @@ -761,8 +762,12 @@ static struct folio
> *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>                         continue;
>                 pte_unmap(pte);
>                 pte = NULL;
> +               if (swp_type(entry) != swp_type(targ_entry))
> +                       si = get_swap_device(entry);

Oops I missed a NULL check here.


>                 folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
>                                                 &page_allocated, false);
> +               if (si)
> +                       put_swap_device(si);
>                 if (!folio)
>                         continue;
>                 if (page_allocated) {
>
> I'll post a patch if it looks ok.


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Revert "mm, swap: avoid redundant swap device pinning"
  2025-11-10 11:37       ` Kairui Song
  2025-11-10 12:33         ` Kairui Song
@ 2025-11-11  6:48         ` Huang, Ying
  1 sibling, 0 replies; 10+ messages in thread
From: Huang, Ying @ 2025-11-11  6:48 UTC (permalink / raw)
  To: Kairui Song
  Cc: Kairui Song via B4 Relay, linux-mm, Andrew Morton, Kemeng Shi,
	Nhat Pham, Baoquan He, Barry Song, Chris Li, Johannes Weiner,
	Yosry Ahmed, Chengming Zhou, Youngjun Park, linux-kernel, stable,
	Lorenzo Stoakes

Kairui Song <ryncsn@gmail.com> writes:

> On Mon, Nov 10, 2025 at 6:50 PM Huang, Ying
> <ying.huang@linux.alibaba.com> wrote:
>>
>> Kairui Song <ryncsn@gmail.com> writes:
>>
>> > On Mon, Nov 10, 2025 at 9:56 AM Huang, Ying
>> > <ying.huang@linux.alibaba.com> wrote:
>> >>
>> >> Hi, Kairui,
>> >>
>> >> Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> writes:
>> >>
>> >> > From: Kairui Song <kasong@tencent.com>
>> >> >
>> >> > This reverts commit 78524b05f1a3e16a5d00cc9c6259c41a9d6003ce.
>> >> >
>> >> > While reviewing recent leaf entry changes, I noticed that commit
>> >> > 78524b05f1a3 ("mm, swap: avoid redundant swap device pinning") isn't
>> >> > correct. It's true that most all callers of __read_swap_cache_async are
>> >> > already holding a swap entry reference, so the repeated swap device
>> >> > pinning isn't needed on the same swap device, but it is possible that
>> >> > VMA readahead (swap_vma_readahead()) may encounter swap entries from a
>> >> > different swap device when there are multiple swap devices, and call
>> >> > __read_swap_cache_async without holding a reference to that swap device.
>> >> >
>> >> > So it is possible to cause a UAF if swapoff of device A raced with
>> >> > swapin on device B, and VMA readahead tries to read swap entries from
>> >> > device A. It's not easy to trigger but in theory possible to cause real
>> >> > issues. And besides, that commit made swap more vulnerable to issues
>> >> > like corrupted page tables.
>> >> >
>> >> > Just revert it. __read_swap_cache_async isn't that sensitive to
>> >> > performance after all, as it's mostly used for SSD/HDD swap devices with
>> >> > readahead. SYNCHRONOUS_IO devices may fallback onto it for swap count >
>> >> > 1 entries, but very soon we will have a new helper and routine for
>> >> > such devices, so they will never touch this helper or have redundant
>> >> > swap device reference overhead.
>> >>
>> >> Is it better to add get_swap_device() in swap_vma_readahead()?  Whenever
>> >> we get a swap entry, the first thing we need to do is call
>> >> get_swap_device() to check the validity of the swap entry and prevent
>> >> the backing swap device from going under us.  This helps us to avoid
>> >> checking the validity of the swap entry in every swap function.  Does
>> >> this sound reasonable?
>> >
>> > Hi Ying, thanks for the suggestion!
>> >
>> > Yes, that's also a feasible approach.
>> >
>> > What I was thinking is that, currently except the readahead path, all
>> > swapin entry goes through the get_swap_device() helper, that helper
>> > also helps to mitigate swap entry corruption that may causes OOB or
>> > NULL deref. Although I think it's really not that helpful at all to
>> > mitigate page table corruption from the kernel side, but seems not a
>> > really bad idea to have.
>> >
>> > And the code is simpler this way, and seems more suitable for a stable
>> > & mainline fix. If we want  to add get_swap_device() in
>> > swap_vma_readahead(), we need to do that for every entry that doesn't
>> > match the target entry's swap device. The reference overhead is
>> > trivial compared to readhead and bio layer, and only non
>> > SYNCHRONOUS_IO devices use this helper (madvise is a special case, we
>> > may optimize that later). ZRAM may fallback to the readahead path but
>> > this fallback will be eliminated very soon in swap table p2.
>>
>> We have 2 choices in general.
>>
>> 1. Add get/put_swap_device() in every swap function.
>>
>> 2. Add get/put_swap_device() in every caller of the swap functions.
>>
>> Personally, I prefer 2.  It works better in situations like calling
>> multiple swap functions.  It can reduce duplicated references.  It helps
>> improve code reasoning and readability.
>
> Totally agree, that's exactly what the recently added kerneldoc is
> suggesting, caller of the swap function will need to use refcount or
> lock to protect the swap device.
>
> I'm not suggesting to add get/put in every function, just thinking
> that maybe reverting it can have some nice side effects.
>
> But anyway, this fix should also be good:
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 3f85a1c4cfd9..4cca4865627f 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -747,6 +747,7 @@ static struct folio
> *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>
>         blk_start_plug(&plug);
>         for (addr = start; addr < end; ilx++, addr += PAGE_SIZE) {
> +               struct swap_info_struct *si = NULL;
>                 leaf_entry_t entry;
>
>                 if (!pte++) {
> @@ -761,8 +762,12 @@ static struct folio
> *swap_vma_readahead(swp_entry_t targ_entry, gfp_t gfp_mask,
>                         continue;
>                 pte_unmap(pte);
>                 pte = NULL;
> +               if (swp_type(entry) != swp_type(targ_entry))
> +                       si = get_swap_device(entry);
>                 folio = __read_swap_cache_async(entry, gfp_mask, mpol, ilx,
>                                                 &page_allocated, false);
> +               if (si)
> +                       put_swap_device(si);
>                 if (!folio)
>                         continue;
>                 if (page_allocated) {
>
> I'll post a patch if it looks ok.

LGTM with the NULL check as you said in another email.

---
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] Revert "mm, swap: avoid redundant swap device pinning"
  2025-11-09 18:06 [PATCH] Revert "mm, swap: avoid redundant swap device pinning" Kairui Song via B4 Relay
  2025-11-10  1:00 ` Greg KH
  2025-11-10  1:56 ` Huang, Ying
@ 2025-11-14 15:18 ` Kairui Song
  2 siblings, 0 replies; 10+ messages in thread
From: Kairui Song @ 2025-11-14 15:18 UTC (permalink / raw)
  To: Andrew Morton, kasong
  Cc: linux-mm, Nhat Pham, Baoquan He, Barry Song, Chris Li,
	Yosry Ahmed, Youngjun Park

On Mon, Nov 10, 2025 at 2:06 AM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> From: Kairui Song <kasong@tencent.com>
>
> This reverts commit 78524b05f1a3e16a5d00cc9c6259c41a9d6003ce.
>
> While reviewing recent leaf entry changes, I noticed that commit
> 78524b05f1a3 ("mm, swap: avoid redundant swap device pinning") isn't
> correct. It's true that most all callers of __read_swap_cache_async are
> already holding a swap entry reference, so the repeated swap device
> pinning isn't needed on the same swap device, but it is possible that
> VMA readahead (swap_vma_readahead()) may encounter swap entries from a
> different swap device when there are multiple swap devices, and call
> __read_swap_cache_async without holding a reference to that swap device.
>
> So it is possible to cause a UAF if swapoff of device A raced with
> swapin on device B, and VMA readahead tries to read swap entries from
> device A. It's not easy to trigger but in theory possible to cause real
> issues. And besides, that commit made swap more vulnerable to issues
> like corrupted page tables.
>
> Just revert it. __read_swap_cache_async isn't that sensitive to
> performance after all, as it's mostly used for SSD/HDD swap devices with
> readahead. SYNCHRONOUS_IO devices may fallback onto it for swap count >
> 1 entries, but very soon we will have a new helper and routine for
> such devices, so they will never touch this helper or have redundant
> swap device reference overhead.
>

Hi Andrew, I saw you have merged a later UAF fix I posted:
https://lore.kernel.org/linux-mm/20251111-swap-fix-vma-uaf-v1-1-41c660e58562@tencent.com/

So this patch is no longer needed, and should be dropped from
mm-unstable and mm-new.

Can you help drop this? I'm doing a rebase for the swap table and
noticed this patch is still there.


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-11-14 15:19 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-09 18:06 [PATCH] Revert "mm, swap: avoid redundant swap device pinning" Kairui Song via B4 Relay
2025-11-10  1:00 ` Greg KH
2025-11-10  5:33   ` Kairui Song
2025-11-10  1:56 ` Huang, Ying
2025-11-10  5:32   ` Kairui Song
2025-11-10 10:50     ` Huang, Ying
2025-11-10 11:37       ` Kairui Song
2025-11-10 12:33         ` Kairui Song
2025-11-11  6:48         ` Huang, Ying
2025-11-14 15:18 ` Kairui Song

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox