linux-mm.kvack.org archive mirror
* [PATCH] mm/shmem, swap: fix race of truncate and swap entry split
@ 2026-01-11 17:53 Kairui Song
  2026-01-12  4:00 ` Baolin Wang
  0 siblings, 1 reply; 5+ messages in thread
From: Kairui Song @ 2026-01-11 17:53 UTC (permalink / raw)
  To: linux-mm
  Cc: Hugh Dickins, Baolin Wang, Andrew Morton, Kemeng Shi, Nhat Pham,
	Chris Li, Baoquan He, Barry Song, linux-kernel, Kairui Song, stable

From: Kairui Song <kasong@tencent.com>

The helper for shmem swap freeing is not handling the order of swap
entries correctly. It uses xa_cmpxchg_irq to erase the swap entry,
but it gets the entry order before that using xa_get_order
without lock protection. As a result the order could be a stale value
if the entry is split after the xa_get_order and before the
xa_cmpxchg_irq. In fact there are more ways for other races to occur
during the time window.
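
To illustrate, one possible interleaving (only a sketch, assuming a
concurrent swapin splits the large entry):

CPU 1 (shmem_free_swap)              CPU 2 (swapin)
xa_get_order() returns order N
                                     large entry split into smaller entries
xa_cmpxchg_irq() still matches and
erases only the smaller entry at index
swap_put_entries_direct() frees
1 << N entries based on the stale order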

To fix that, open code the XArray cmpxchg and put the order retrieval and
value checking in the same critical section. Also ensure the order won't
exceed the truncate boundary.

I observed random swapoff hangs and swap entry leaks when stress
testing ZSWAP with shmem. After applying this patch, the problem is resolved.

Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
Cc: stable@vger.kernel.org
Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/shmem.c | 35 +++++++++++++++++++++++------------
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 0b4c8c70d017..e160da0cd30f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -961,18 +961,28 @@ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap)
  * the number of pages being freed. 0 means entry not found in XArray (0 pages
  * being freed).
  */
-static long shmem_free_swap(struct address_space *mapping,
-			    pgoff_t index, void *radswap)
+static long shmem_free_swap(struct address_space *mapping, pgoff_t index,
+			    unsigned int max_nr, void *radswap)
 {
-	int order = xa_get_order(&mapping->i_pages, index);
-	void *old;
+	XA_STATE(xas, &mapping->i_pages, index);
+	unsigned int nr_pages = 0;
+	void *entry;
 
-	old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
-	if (old != radswap)
-		return 0;
-	swap_put_entries_direct(radix_to_swp_entry(radswap), 1 << order);
+	xas_lock_irq(&xas);
+	entry = xas_load(&xas);
+	if (entry == radswap) {
+		nr_pages = 1 << xas_get_order(&xas);
+		if (index == round_down(xas.xa_index, nr_pages) && nr_pages < max_nr)
+			xas_store(&xas, NULL);
+		else
+			nr_pages = 0;
+	}
+	xas_unlock_irq(&xas);
+
+	if (nr_pages)
+		swap_put_entries_direct(radix_to_swp_entry(radswap), nr_pages);
 
-	return 1 << order;
+	return nr_pages;
 }
 
 /*
@@ -1124,8 +1134,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			if (xa_is_value(folio)) {
 				if (unfalloc)
 					continue;
-				nr_swaps_freed += shmem_free_swap(mapping,
-							indices[i], folio);
+				nr_swaps_freed += shmem_free_swap(mapping, indices[i],
+								  end - indices[i], folio);
 				continue;
 			}
 
@@ -1195,7 +1205,8 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 
 				if (unfalloc)
 					continue;
-				swaps_freed = shmem_free_swap(mapping, indices[i], folio);
+				swaps_freed = shmem_free_swap(mapping, indices[i],
+							      end - indices[i], folio);
 				if (!swaps_freed) {
 					/* Swap was replaced by page: retry */
 					index = indices[i];

---
base-commit: ab3d40bdac831c67e130fda12f3011505556500f
change-id: 20260111-shmem-swap-fix-8d0e20a14b5d

Best regards,
-- 
Kairui Song <kasong@tencent.com>




* Re: [PATCH] mm/shmem, swap: fix race of truncate and swap entry split
  2026-01-11 17:53 [PATCH] mm/shmem, swap: fix race of truncate and swap entry split Kairui Song
@ 2026-01-12  4:00 ` Baolin Wang
  2026-01-12  5:56   ` Kairui Song
  0 siblings, 1 reply; 5+ messages in thread
From: Baolin Wang @ 2026-01-12  4:00 UTC (permalink / raw)
  To: Kairui Song, linux-mm
  Cc: Hugh Dickins, Andrew Morton, Kemeng Shi, Nhat Pham, Chris Li,
	Baoquan He, Barry Song, linux-kernel, Kairui Song, stable


On 1/12/26 1:53 AM, Kairui Song wrote:
> From: Kairui Song <kasong@tencent.com>
> 
> The helper for shmem swap freeing is not handling the order of swap
> entries correctly. It uses xa_cmpxchg_irq to erase the swap entry,
> but it gets the entry order before that using xa_get_order
> without lock protection. As a result the order could be a stale value
> if the entry is split after the xa_get_order and before the
> xa_cmpxchg_irq. In fact there are more ways for other races to occur
> during the time window.
> 
> To fix that, open code the XArray cmpxchg and put the order retrieval and
> value checking in the same critical section. Also ensure the order won't
> exceed the truncate boundary.
> 
> I observed random swapoff hangs and swap entry leaks when stress
> testing ZSWAP with shmem. After applying this patch, the problem is resolved.
> 
> Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
> Cc: stable@vger.kernel.org
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
>   mm/shmem.c | 35 +++++++++++++++++++++++------------
>   1 file changed, 23 insertions(+), 12 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 0b4c8c70d017..e160da0cd30f 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -961,18 +961,28 @@ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap)
>    * the number of pages being freed. 0 means entry not found in XArray (0 pages
>    * being freed).
>    */
> -static long shmem_free_swap(struct address_space *mapping,
> -			    pgoff_t index, void *radswap)
> +static long shmem_free_swap(struct address_space *mapping, pgoff_t index,
> +			    unsigned int max_nr, void *radswap)
>   {
> -	int order = xa_get_order(&mapping->i_pages, index);
> -	void *old;
> +	XA_STATE(xas, &mapping->i_pages, index);
> +	unsigned int nr_pages = 0;
> +	void *entry;
>   
> -	old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
> -	if (old != radswap)
> -		return 0;
> -	swap_put_entries_direct(radix_to_swp_entry(radswap), 1 << order);
> +	xas_lock_irq(&xas);
> +	entry = xas_load(&xas);
> +	if (entry == radswap) {
> +		nr_pages = 1 << xas_get_order(&xas);
> +		if (index == round_down(xas.xa_index, nr_pages) && nr_pages < max_nr)
> +			xas_store(&xas, NULL);
> +		else
> +			nr_pages = 0;
> +	}
> +	xas_unlock_irq(&xas);
> +
> +	if (nr_pages)
> +		swap_put_entries_direct(radix_to_swp_entry(radswap), nr_pages);
>   
> -	return 1 << order;
> +	return nr_pages;
>   }

Thanks for the analysis, and it makes sense to me. Would the following 
implementation be simpler and also address your issue (we will not 
release the lock in __xa_cmpxchg() since gfp = 0)?

static long shmem_free_swap(struct address_space *mapping,
                             pgoff_t index, void *radswap)
{
         XA_STATE(xas, &mapping->i_pages, index);
         int order;
         void *old;

         xas_lock_irq(&xas);
         order = xas_get_order(&xas);
         old = __xa_cmpxchg(xas.xa, index, radswap, NULL, 0);
         if (old != radswap) {
                 xas_unlock_irq(&xas);
                 return 0;
         }
        xas_unlock_irq(&xas);

         swap_put_entries_direct(radix_to_swp_entry(radswap), 1 << order);

         return 1 << order;
}



* Re: [PATCH] mm/shmem, swap: fix race of truncate and swap entry split
  2026-01-12  4:00 ` Baolin Wang
@ 2026-01-12  5:56   ` Kairui Song
  2026-01-12  8:22     ` Baolin Wang
  0 siblings, 1 reply; 5+ messages in thread
From: Kairui Song @ 2026-01-12  5:56 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, Hugh Dickins, Andrew Morton, Kemeng Shi, Nhat Pham,
	Chris Li, Baoquan He, Barry Song, linux-kernel, stable

On Mon, Jan 12, 2026 at 12:00 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
> On 1/12/26 1:53 AM, Kairui Song wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > The helper for shmem swap freeing is not handling the order of swap
> > entries correctly. It uses xa_cmpxchg_irq to erase the swap entry,
> > but it gets the entry order before that using xa_get_order
> > without lock protection. As a result the order could be a stale value
> > if the entry is split after the xa_get_order and before the
> > xa_cmpxchg_irq. In fact there are more ways for other races to occur
> > during the time window.
> >
> > To fix that, open code the XArray cmpxchg and put the order retrieval and
> > value checking in the same critical section. Also ensure the order won't
> > exceed the truncate boundary.
> >
> > I observed random swapoff hangs and swap entry leaks when stress
> > testing ZSWAP with shmem. After applying this patch, the problem is resolved.
> >
> > Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Kairui Song <kasong@tencent.com>
> > ---
> >   mm/shmem.c | 35 +++++++++++++++++++++++------------
> >   1 file changed, 23 insertions(+), 12 deletions(-)
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 0b4c8c70d017..e160da0cd30f 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -961,18 +961,28 @@ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap)
> >    * the number of pages being freed. 0 means entry not found in XArray (0 pages
> >    * being freed).
> >    */
> > -static long shmem_free_swap(struct address_space *mapping,
> > -                         pgoff_t index, void *radswap)
> > +static long shmem_free_swap(struct address_space *mapping, pgoff_t index,
> > +                         unsigned int max_nr, void *radswap)
> >   {
> > -     int order = xa_get_order(&mapping->i_pages, index);
> > -     void *old;
> > +     XA_STATE(xas, &mapping->i_pages, index);
> > +     unsigned int nr_pages = 0;
> > +     void *entry;
> >
> > -     old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
> > -     if (old != radswap)
> > -             return 0;
> > -     swap_put_entries_direct(radix_to_swp_entry(radswap), 1 << order);
> > +     xas_lock_irq(&xas);
> > +     entry = xas_load(&xas);
> > +     if (entry == radswap) {
> > +             nr_pages = 1 << xas_get_order(&xas);
> > +             if (index == round_down(xas.xa_index, nr_pages) && nr_pages < max_nr)
> > +                     xas_store(&xas, NULL);
> > +             else
> > +                     nr_pages = 0;
> > +     }
> > +     xas_unlock_irq(&xas);
> > +
> > +     if (nr_pages)
> > +             swap_put_entries_direct(radix_to_swp_entry(radswap), nr_pages);
> >
> > -     return 1 << order;
> > +     return nr_pages;
> >   }
>
> Thanks for the analysis, and it makes sense to me. Would the following
> implementation be simpler and also address your issue (we will not
> release the lock in __xa_cmpxchg() since gfp = 0)?

Hi Baolin,

>
> static long shmem_free_swap(struct address_space *mapping,
>                              pgoff_t index, void *radswap)
> {
>          XA_STATE(xas, &mapping->i_pages, index);
>          int order;
>          void *old;
>
>          xas_lock_irq(&xas);
>          order = xas_get_order(&xas);

Thanks for the suggestion. I did consider implementing it this way,
but I was worried that the order could grow upwards. For example
shmem_undo_range is trying to free 0-95 and there is an entry at 64
with order 5 (64 - 95). Before shmem_free_swap is called, the entry
was swapped in, then the folio was freed, then an order 6 folio was
allocated there and swapped out again using the same entry.

Then here it will free the whole order 6 entry (64 - 127), while
shmem_undo_range is only supposed to erase (0-96).

That's why I added a max_nr argument to the helper. The GFP == 0 below
looks not very clean either, that's trivial though.

>          old = __xa_cmpxchg(xas.xa, index, radswap, NULL, 0);

Am I overthinking it?



* Re: [PATCH] mm/shmem, swap: fix race of truncate and swap entry split
  2026-01-12  5:56   ` Kairui Song
@ 2026-01-12  8:22     ` Baolin Wang
  2026-01-12  9:55       ` Kairui Song
  0 siblings, 1 reply; 5+ messages in thread
From: Baolin Wang @ 2026-01-12  8:22 UTC (permalink / raw)
  To: Kairui Song
  Cc: linux-mm, Hugh Dickins, Andrew Morton, Kemeng Shi, Nhat Pham,
	Chris Li, Baoquan He, Barry Song, linux-kernel, stable



On 1/12/26 1:56 PM, Kairui Song wrote:
> On Mon, Jan 12, 2026 at 12:00 PM Baolin Wang
> <baolin.wang@linux.alibaba.com> wrote:
>> On 1/12/26 1:53 AM, Kairui Song wrote:
>>> From: Kairui Song <kasong@tencent.com>
>>>
>>> The helper for shmem swap freeing is not handling the order of swap
>>> entries correctly. It uses xa_cmpxchg_irq to erase the swap entry,
>>> but it gets the entry order before that using xa_get_order
>>> without lock protection. As a result the order could be a stale value
>>> if the entry is split after the xa_get_order and before the
>>> xa_cmpxchg_irq. In fact there are more ways for other races to occur
>>> during the time window.
>>>
>>> To fix that, open code the XArray cmpxchg and put the order retrieval and
>>> value checking in the same critical section. Also ensure the order won't
>>> exceed the truncate boundary.
>>>
>>> I observed random swapoff hangs and swap entry leaks when stress
>>> testing ZSWAP with shmem. After applying this patch, the problem is resolved.
>>>
>>> Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Kairui Song <kasong@tencent.com>
>>> ---
>>>    mm/shmem.c | 35 +++++++++++++++++++++++------------
>>>    1 file changed, 23 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 0b4c8c70d017..e160da0cd30f 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -961,18 +961,28 @@ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap)
>>>     * the number of pages being freed. 0 means entry not found in XArray (0 pages
>>>     * being freed).
>>>     */
>>> -static long shmem_free_swap(struct address_space *mapping,
>>> -                         pgoff_t index, void *radswap)
>>> +static long shmem_free_swap(struct address_space *mapping, pgoff_t index,
>>> +                         unsigned int max_nr, void *radswap)
>>>    {
>>> -     int order = xa_get_order(&mapping->i_pages, index);
>>> -     void *old;
>>> +     XA_STATE(xas, &mapping->i_pages, index);
>>> +     unsigned int nr_pages = 0;
>>> +     void *entry;
>>>
>>> -     old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
>>> -     if (old != radswap)
>>> -             return 0;
>>> -     swap_put_entries_direct(radix_to_swp_entry(radswap), 1 << order);
>>> +     xas_lock_irq(&xas);
>>> +     entry = xas_load(&xas);
>>> +     if (entry == radswap) {
>>> +             nr_pages = 1 << xas_get_order(&xas);
>>> +             if (index == round_down(xas.xa_index, nr_pages) && nr_pages < max_nr)
>>> +                     xas_store(&xas, NULL);
>>> +             else
>>> +                     nr_pages = 0;
>>> +     }
>>> +     xas_unlock_irq(&xas);
>>> +
>>> +     if (nr_pages)
>>> +             swap_put_entries_direct(radix_to_swp_entry(radswap), nr_pages);
>>>
>>> -     return 1 << order;
>>> +     return nr_pages;
>>>    }
>>
>> Thanks for the analysis, and it makes sense to me. Would the following
>> implementation be simpler and also address your issue (we will not
>> release the lock in __xa_cmpxchg() since gfp = 0)?
> 
> Hi Baolin,
> 
>>
>> static long shmem_free_swap(struct address_space *mapping,
>>                               pgoff_t index, void *radswap)
>> {
>>           XA_STATE(xas, &mapping->i_pages, index);
>>           int order;
>>           void *old;
>>
>>           xas_lock_irq(&xas);
>>           order = xas_get_order(&xas);
> 
> Thanks for the suggestion. I did consider implementing it this way,
> but I was worried that the order could grow upwards. For example
> shmem_undo_range is trying to free 0-95 and there is an entry at 64
> with order 5 (64 - 95). Before shmem_free_swap is called, the entry
> was swapped in, then the folio was freed, then an order 6 folio was
> allocated there and swapped out again using the same entry.
> 
> Then here it will free the whole order 6 entry (64 - 127), while
> shmem_undo_range is only supposed to erase 0 - 95.

Good point. However, this cannot happen during swapoff, because the 
'end' is set to -1 in shmem_evict_inode().

Actually, the real question is how to handle the case where a large swap 
entry happens to cross the 'end' when calling shmem_truncate_range(). If 
the shmem mapping stores a folio, we would split that large folio by 
truncate_inode_partial_folio(). If the shmem mapping stores a large swap 
entry, then as you noted, the truncation range can indeed exceed the 'end'.

But with your change, that large swap entry would not be truncated, and 
I’m not sure whether that might cause other issues. Perhaps the best 
approach is to first split the large swap entry and only truncate the 
swap entries within the 'end' boundary, like 
truncate_inode_partial_folio() does.

Alternatively, this patch could only focus on the race on the order, 
which seems uncontested. As for handling large swap entries that go 
beyond the 'end', should we address that in a follow-up, for example by 
splitting? What do you think?

> That's why I added a max_nr argument to the helper. The GFP == 0 below
> looks not very clean either, that's trivial though.
> 
>>           old = __xa_cmpxchg(xas.xa, index, radswap, NULL, 0);
> 
> Am I overthinking it?




* Re: [PATCH] mm/shmem, swap: fix race of truncate and swap entry split
  2026-01-12  8:22     ` Baolin Wang
@ 2026-01-12  9:55       ` Kairui Song
  0 siblings, 0 replies; 5+ messages in thread
From: Kairui Song @ 2026-01-12  9:55 UTC (permalink / raw)
  To: Baolin Wang
  Cc: linux-mm, Hugh Dickins, Andrew Morton, Kemeng Shi, Nhat Pham,
	Chris Li, Baoquan He, Barry Song, linux-kernel, stable

On Mon, Jan 12, 2026 at 4:22 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
> On 1/12/26 1:56 PM, Kairui Song wrote:
> > On Mon, Jan 12, 2026 at 12:00 PM Baolin Wang
> > <baolin.wang@linux.alibaba.com> wrote:
> >> On 1/12/26 1:53 AM, Kairui Song wrote:
> >>> From: Kairui Song <kasong@tencent.com>
> >>>
> >>> The helper for shmem swap freeing is not handling the order of swap
> >>> entries correctly. It uses xa_cmpxchg_irq to erase the swap entry,
> >>> but it gets the entry order before that using xa_get_order
> >>> without lock protection. As a result the order could be a stale value
> >>> if the entry is split after the xa_get_order and before the
> >>> xa_cmpxchg_irq. In fact there are more ways for other races to occur
> >>> during the time window.
> >>>
> >>> To fix that, open code the XArray cmpxchg and put the order retrieval and
> >>> value checking in the same critical section. Also ensure the order won't
> >>> exceed the truncate boundary.
> >>>
> >>> I observed random swapoff hangs and swap entry leaks when stress
> >>> testing ZSWAP with shmem. After applying this patch, the problem is resolved.
> >>>
> >>> Fixes: 809bc86517cc ("mm: shmem: support large folio swap out")
> >>> Cc: stable@vger.kernel.org
> >>> Signed-off-by: Kairui Song <kasong@tencent.com>
> >>> ---
> >>>    mm/shmem.c | 35 +++++++++++++++++++++++------------
> >>>    1 file changed, 23 insertions(+), 12 deletions(-)
> >>>
> >>> diff --git a/mm/shmem.c b/mm/shmem.c
> >>> index 0b4c8c70d017..e160da0cd30f 100644
> >>> --- a/mm/shmem.c
> >>> +++ b/mm/shmem.c
> >>> @@ -961,18 +961,28 @@ static void shmem_delete_from_page_cache(struct folio *folio, void *radswap)
> >>>     * the number of pages being freed. 0 means entry not found in XArray (0 pages
> >>>     * being freed).
> >>>     */
> >>> -static long shmem_free_swap(struct address_space *mapping,
> >>> -                         pgoff_t index, void *radswap)
> >>> +static long shmem_free_swap(struct address_space *mapping, pgoff_t index,
> >>> +                         unsigned int max_nr, void *radswap)
> >>>    {
> >>> -     int order = xa_get_order(&mapping->i_pages, index);
> >>> -     void *old;
> >>> +     XA_STATE(xas, &mapping->i_pages, index);
> >>> +     unsigned int nr_pages = 0;
> >>> +     void *entry;
> >>>
> >>> -     old = xa_cmpxchg_irq(&mapping->i_pages, index, radswap, NULL, 0);
> >>> -     if (old != radswap)
> >>> -             return 0;
> >>> -     swap_put_entries_direct(radix_to_swp_entry(radswap), 1 << order);
> >>> +     xas_lock_irq(&xas);
> >>> +     entry = xas_load(&xas);
> >>> +     if (entry == radswap) {
> >>> +             nr_pages = 1 << xas_get_order(&xas);
> >>> +             if (index == round_down(xas.xa_index, nr_pages) && nr_pages < max_nr)
> >>> +                     xas_store(&xas, NULL);
> >>> +             else
> >>> +                     nr_pages = 0;
> >>> +     }
> >>> +     xas_unlock_irq(&xas);
> >>> +
> >>> +     if (nr_pages)
> >>> +             swap_put_entries_direct(radix_to_swp_entry(radswap), nr_pages);
> >>>
> >>> -     return 1 << order;
> >>> +     return nr_pages;
> >>>    }
> >>
> >> Thanks for the analysis, and it makes sense to me. Would the following
> >> implementation be simpler and also address your issue (we will not
> >> release the lock in __xa_cmpxchg() since gfp = 0)?
> >
> > Hi Baolin,
> >
> >>
> >> static long shmem_free_swap(struct address_space *mapping,
> >>                               pgoff_t index, void *radswap)
> >> {
> >>           XA_STATE(xas, &mapping->i_pages, index);
> >>           int order;
> >>           void *old;
> >>
> >>           xas_lock_irq(&xas);
> >>           order = xas_get_order(&xas);
> >
> > Thanks for the suggestion. I did consider implementing it this way,
> > but I was worried that the order could grow upwards. For example
> > shmem_undo_range is trying to free 0-95 and there is an entry at 64
> > with order 5 (64 - 95). Before shmem_free_swap is called, the entry
> > was swapped in, then the folio was freed, then an order 6 folio was
> > allocated there and swapped out again using the same entry.
> >
> > Then here it will free the whole order 6 entry (64 - 127), while
> > shmem_undo_range is only supposed to erase 0 - 95.
>
> Good point. However, this cannot happen during swapoff, because the
> 'end' is set to -1 in shmem_evict_inode().

That's not only for swapoff; shmem_truncate_range / falloc can also use it, right?

>
> Actually, the real question is how to handle the case where a large swap
> entry happens to cross the 'end' when calling shmem_truncate_range(). If
> the shmem mapping stores a folio, we would split that large folio by
> truncate_inode_partial_folio(). If the shmem mapping stores a large swap
> entry, then as you noted, the truncation range can indeed exceed the 'end'.
>
> But with your change, that large swap entry would not be truncated, and
> I’m not sure whether that might cause other issues. Perhaps the best
> approach is to first split the large swap entry and only truncate the
> swap entries within the 'end' boundary, like
> truncate_inode_partial_folio() does.

Right... I was thinking that shmem_undo_range iterates the undo range
twice, IIUC, and on the second pass it will retry if shmem_free_swap
returns 0:

swaps_freed = shmem_free_swap(mapping, indices[i], end - indices[i], folio);
if (!swaps_freed) {
    /* Swap was replaced by page: retry */
    index = indices[i];
    break;
}

So I thought having shmem_free_swap return 0 was good enough. It is not:
since nothing ever splits the entry, it may cause the second loop to
retry forever.

>
> Alternatively, this patch could only focus on the race on the order,
> which seems uncontested. As for handling large swap entries that go
> beyond the 'end', should we address that in a follow-up, for example by
> splitting? What do you think?
>

I think a partial fix is still wrong. How about we just handle the
split here, like this?

static int shmem_free_swap(struct address_space *mapping, pgoff_t index,
                unsigned int max_nr, void *radswap)
{
    XA_STATE(xas, &mapping->i_pages, index);
    int nr_pages;
    void *entry;

retry:
    xas_lock_irq(&xas);
    entry = xas_load(&xas);
    if (entry == radswap) {
        nr_pages = 1 << xas_get_order(&xas);
        /*
         * Check if the order grew upwards and a larger entry is
         * now covering the target entry. In this case the caller
         * may need to restart the iteration.
         */
        if (index != round_down(xas.xa_index, nr_pages)) {
            xas_unlock_irq(&xas);
            return 0;
        }

        /* Check if we are freeing part of a large entry. */
        if (nr_pages > max_nr) {
            xas_unlock_irq(&xas);
            /*
             * Let the caller decide what to do by returning 0
             * if the split failed.
             */
            if (shmem_split_large_entry(mapping, index + max_nr,
                                        radswap, mapping_gfp_mask(mapping)))
                return 0;
            /* The xas state may be stale after dropping the lock. */
            xas_reset(&xas);
            goto retry;
        }

        xas_store(&xas, NULL);
        xas_unlock_irq(&xas);

        swap_put_entries_direct(radix_to_swp_entry(radswap), nr_pages);
        return nr_pages;
    }

    xas_unlock_irq(&xas);
    return 0;
}



end of thread

Thread overview: 5+ messages
2026-01-11 17:53 [PATCH] mm/shmem, swap: fix race of truncate and swap entry split Kairui Song
2026-01-12  4:00 ` Baolin Wang
2026-01-12  5:56   ` Kairui Song
2026-01-12  8:22     ` Baolin Wang
2026-01-12  9:55       ` Kairui Song
