* [PATCH] mm/swap: fix wrong plist empty check in swap_alloc_slow()
@ 2025-11-19 11:41 Youngjun Park
2025-11-19 13:02 ` Kairui Song
2025-11-20 2:47 ` Baoquan He
0 siblings, 2 replies; 9+ messages in thread
From: Youngjun Park @ 2025-11-19 11:41 UTC (permalink / raw)
To: akpm
Cc: chrisl, kasong, shikemeng, nphamcs, bhe, baohua, linux-mm, youngjun.park
swap_alloc_slow() was checking `si->avail_list` instead of `next->avail_list`
when verifying if the next swap device is still in the list, which could cause
unnecessary restarts during allocation.
Fixes: 8e689f8ea45ff ("mm/swap: do not choose swap device according to numa node")
Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
mm/swapfile.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 94e0f0c54168..cf780fefaf7d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1374,7 +1374,7 @@ static bool swap_alloc_slow(swp_entry_t *entry,
* still in the swap_avail_head list then try it, otherwise
* start over if we have not gotten any slots.
*/
- if (plist_node_empty(&si->avail_list))
+ if (plist_node_empty(&next->avail_list))
goto start_over;
}
spin_unlock(&swap_avail_lock);
--
2.34.1
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm/swap: fix wrong plist empty check in swap_alloc_slow()
2025-11-19 11:41 [PATCH] mm/swap: fix wrong plist empty check in swap_alloc_slow() Youngjun Park
@ 2025-11-19 13:02 ` Kairui Song
2025-11-19 16:37 ` YoungJun Park
2025-11-20 2:06 ` YoungJun Park
2025-11-20 2:47 ` Baoquan He
1 sibling, 2 replies; 9+ messages in thread
From: Kairui Song @ 2025-11-19 13:02 UTC (permalink / raw)
To: Youngjun Park; +Cc: akpm, chrisl, shikemeng, nphamcs, bhe, baohua, linux-mm
On Wed, Nov 19, 2025 at 7:56 PM Youngjun Park <youngjun.park@lge.com> wrote:
>
> swap_alloc_slow() was checking `si->avail_list` instead of `next->avail_list`
> when verifying if the next swap device is still in the list, which could cause
> unnecessary restarts during allocation.
>
> Fixes: 8e689f8ea45ff ("mm/swap: do not choose swap device according to numa node")
> Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> ---
> mm/swapfile.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 94e0f0c54168..cf780fefaf7d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1374,7 +1374,7 @@ static bool swap_alloc_slow(swp_entry_t *entry,
> * still in the swap_avail_head list then try it, otherwise
> * start over if we have not gotten any slots.
> */
> - if (plist_node_empty(&si->avail_list))
> + if (plist_node_empty(&next->avail_list))
> goto start_over;
> }
> spin_unlock(&swap_avail_lock);
> --
> 2.34.1
Thanks!
Acked-by: Kairui Song <kasong@tencent.com>
BTW I noticed the comment here needs update too. And the plist usage
here seems can also be simplified? We never use a lower priority
device when there is any higher priority device in the list, as
devices are removed from the list when they are full. And the first
thing we do here is plist_requeue. So we can just keep trying the
head, until the whole list is empty, seems good enough?
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm/swap: fix wrong plist empty check in swap_alloc_slow()
2025-11-19 13:02 ` Kairui Song
@ 2025-11-19 16:37 ` YoungJun Park
2025-11-20 2:08 ` Kairui Song
2025-11-20 2:06 ` YoungJun Park
1 sibling, 1 reply; 9+ messages in thread
From: YoungJun Park @ 2025-11-19 16:37 UTC (permalink / raw)
To: Kairui Song; +Cc: akpm, chrisl, shikemeng, nphamcs, bhe, baohua, linux-mm
On Wed, Nov 19, 2025 at 09:02:14PM +0800, Kairui Song wrote:
> On Wed, Nov 19, 2025 at 7:56 PM Youngjun Park <youngjun.park@lge.com> wrote:
> >
> > swap_alloc_slow() was checking `si->avail_list` instead of `next->avail_list`
> > when verifying if the next swap device is still in the list, which could cause
> > unnecessary restarts during allocation.
> >
> > Fixes: 8e689f8ea45ff ("mm/swap: do not choose swap device according to numa node")
> > Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> > ---
> > mm/swapfile.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index 94e0f0c54168..cf780fefaf7d 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1374,7 +1374,7 @@ static bool swap_alloc_slow(swp_entry_t *entry,
> > * still in the swap_avail_head list then try it, otherwise
> > * start over if we have not gotten any slots.
> > */
> > - if (plist_node_empty(&si->avail_list))
> > + if (plist_node_empty(&next->avail_list))
> > goto start_over;
> > }
> > spin_unlock(&swap_avail_lock);
> > --
> > 2.34.1
>
> Thanks!
>
> Acked-by: Kairui Song <kasong@tencent.com>
>
> BTW I noticed the comment here needs update too. And the plist usage
As I think, the first part of the comment is out of date and provides
wrong information. I suggest we simplify it to keep only the last part:
/*
* swap_avail_head list may have been modified; so if next is
* no longer in the swap_avail_head list, start over if we have
* not gotten any slots.
*/
> here seems can also be simplified? We never use a lower priority
> device when there is any higher priority device in the list, as
> devices are removed from the list when they are full. And the first
> thing we do here is plist_requeue. So we can just keep trying the
> head, until the whole list is empty, seems good enough?
>
if we can guarantee that cluster_alloc_swap_entry() failure is always due to
the device being full, then we could indeed improve this by removing
the "try each device in order" loop and only trying the highest
priority device, as you suggested.
A simple implementation might look like this (though it needs more
thought and testing):
spin_lock(&swap_avail_lock);
while (!plist_head_empty(&swap_avail_head)) {
si = plist_first_entry(&swap_avail_head, ...);
plist_requeue(&si->avail_list[nid], &swap_avail_head);
spin_unlock(&swap_avail_lock);
if (cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE))
return;
spin_lock(&swap_avail_lock);
}
spin_unlock(&swap_avail_lock);
For now, would it be okay if I just address the comment update in this
patch, and submit your simplification suggestion as a separate patch
after further consideration?
Thanks,
Youngjun Park
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm/swap: fix wrong plist empty check in swap_alloc_slow()
2025-11-19 13:02 ` Kairui Song
2025-11-19 16:37 ` YoungJun Park
@ 2025-11-20 2:06 ` YoungJun Park
2025-11-20 2:18 ` Kairui Song
1 sibling, 1 reply; 9+ messages in thread
From: YoungJun Park @ 2025-11-20 2:06 UTC (permalink / raw)
To: Kairui Song; +Cc: akpm, chrisl, shikemeng, nphamcs, bhe, baohua, linux-mm
I've investigated this further and noticed something about swap_sync_discard.
During iteration, it doesn't perform an empty check on swap devices. As a
result, if the iteration exits because the next device is full
(deleted by plist at this time), the
operation(swap I/O) fails even when other available swap devices remain.
Should this be addressed?
Best Regards,
Youngjun Park
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm/swap: fix wrong plist empty check in swap_alloc_slow()
2025-11-19 16:37 ` YoungJun Park
@ 2025-11-20 2:08 ` Kairui Song
0 siblings, 0 replies; 9+ messages in thread
From: Kairui Song @ 2025-11-20 2:08 UTC (permalink / raw)
To: YoungJun Park; +Cc: akpm, chrisl, shikemeng, nphamcs, bhe, baohua, linux-mm
On Thu, Nov 20, 2025 at 12:37 AM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Wed, Nov 19, 2025 at 09:02:14PM +0800, Kairui Song wrote:
> > On Wed, Nov 19, 2025 at 7:56 PM Youngjun Park <youngjun.park@lge.com> wrote:
> > >
> > > swap_alloc_slow() was checking `si->avail_list` instead of `next->avail_list`
> > > when verifying if the next swap device is still in the list, which could cause
> > > unnecessary restarts during allocation.
> > >
> > > Fixes: 8e689f8ea45ff ("mm/swap: do not choose swap device according to numa node")
> > > Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> > > ---
> > > mm/swapfile.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 94e0f0c54168..cf780fefaf7d 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
> > > @@ -1374,7 +1374,7 @@ static bool swap_alloc_slow(swp_entry_t *entry,
> > > * still in the swap_avail_head list then try it, otherwise
> > > * start over if we have not gotten any slots.
> > > */
> > > - if (plist_node_empty(&si->avail_list))
> > > + if (plist_node_empty(&next->avail_list))
> > > goto start_over;
> > > }
> > > spin_unlock(&swap_avail_lock);
> > > --
> > > 2.34.1
> >
> > Thanks!
> >
> > Acked-by: Kairui Song <kasong@tencent.com>
> >
> > BTW I noticed the comment here needs update too. And the plist usage
>
> As I think, the first part of the comment is out of date and provides
> wrong information. I suggest we simplify it to keep only the last part:
>
> /*
> * swap_avail_head list may have been modified; so if next is
> * no longer in the swap_avail_head list, start over if we have
> * not gotten any slots.
> */
>
> > here seems can also be simplified? We never use a lower priority
> > device when there is any higher priority device in the list, as
> > devices are removed from the list when they are full. And the first
> > thing we do here is plist_requeue. So we can just keep trying the
> > head, until the whole list is empty, seems good enough?
> >
>
> if we can guarantee that cluster_alloc_swap_entry() failure is always due to
> the device being full, then we could indeed improve this by removing
> the "try each device in order" loop and only trying the highest
> priority device, as you suggested.
>
> A simple implementation might look like this (though it needs more
> thought and testing):
>
> spin_lock(&swap_avail_lock);
> while (!plist_head_empty(&swap_avail_head)) {
> si = plist_first_entry(&swap_avail_head, ...);
> plist_requeue(&si->avail_list[nid], &swap_avail_head);
> spin_unlock(&swap_avail_lock);
> if (cluster_alloc_swap_entry(si, order, SWAP_HAS_CACHE))
> return;
> spin_lock(&swap_avail_lock);
> }
> spin_unlock(&swap_avail_lock);
>
> For now, would it be okay if I just address the comment update in this
> patch, and submit your simplification suggestion as a separate patch
> after further consideration?
Sure, no need to do too much refractor here. I'm fine with the patch as it is.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm/swap: fix wrong plist empty check in swap_alloc_slow()
2025-11-20 2:06 ` YoungJun Park
@ 2025-11-20 2:18 ` Kairui Song
2025-11-21 6:54 ` YoungJun Park
0 siblings, 1 reply; 9+ messages in thread
From: Kairui Song @ 2025-11-20 2:18 UTC (permalink / raw)
To: YoungJun Park; +Cc: akpm, chrisl, shikemeng, nphamcs, bhe, baohua, linux-mm
On Thu, Nov 20, 2025 at 10:06 AM YoungJun Park <youngjun.park@lge.com> wrote:
>
> I've investigated this further and noticed something about swap_sync_discard.
> During iteration, it doesn't perform an empty check on swap devices. As a
> result, if the iteration exits because the next device is full
> (deleted by plist at this time), the
> operation(swap I/O) fails even when other available swap devices remain.
>
> Should this be addressed?
>
> Best Regards,
> Youngjun Park
Actually after thinking about it again, swap_sync_discard should be
looking at swap_active_head, not swap_avail_head. Changing to
swap_active_head and also checking SWP_WRITEOK is a right fix I think,
and should be good enough.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm/swap: fix wrong plist empty check in swap_alloc_slow()
2025-11-19 11:41 [PATCH] mm/swap: fix wrong plist empty check in swap_alloc_slow() Youngjun Park
2025-11-19 13:02 ` Kairui Song
@ 2025-11-20 2:47 ` Baoquan He
1 sibling, 0 replies; 9+ messages in thread
From: Baoquan He @ 2025-11-20 2:47 UTC (permalink / raw)
To: Youngjun Park; +Cc: akpm, chrisl, kasong, shikemeng, nphamcs, baohua, linux-mm
On 11/19/25 at 08:41pm, Youngjun Park wrote:
> swap_alloc_slow() was checking `si->avail_list` instead of `next->avail_list`
> when verifying if the next swap device is still in the list, which could cause
> unnecessary restarts during allocation.
>
> Fixes: 8e689f8ea45ff ("mm/swap: do not choose swap device according to numa node")
> Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> ---
> mm/swapfile.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
Thanks for the fix.
Reviewed-by: Baoquan He <bhe@redhat.com>
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 94e0f0c54168..cf780fefaf7d 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1374,7 +1374,7 @@ static bool swap_alloc_slow(swp_entry_t *entry,
> * still in the swap_avail_head list then try it, otherwise
> * start over if we have not gotten any slots.
> */
> - if (plist_node_empty(&si->avail_list))
> + if (plist_node_empty(&next->avail_list))
> goto start_over;
> }
> spin_unlock(&swap_avail_lock);
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm/swap: fix wrong plist empty check in swap_alloc_slow()
2025-11-20 2:18 ` Kairui Song
@ 2025-11-21 6:54 ` YoungJun Park
2025-11-21 16:56 ` Kairui Song
0 siblings, 1 reply; 9+ messages in thread
From: YoungJun Park @ 2025-11-21 6:54 UTC (permalink / raw)
To: Kairui Song; +Cc: akpm, chrisl, shikemeng, nphamcs, bhe, baohua, linux-mm
On Thu, Nov 20, 2025 at 10:18:34AM +0800, Kairui Song wrote:
> On Thu, Nov 20, 2025 at 10:06 AM YoungJun Park <youngjun.park@lge.com> wrote:
> >
> > I've investigated this further and noticed something about swap_sync_discard.
> > During iteration, it doesn't perform an empty check on swap devices. As a
> > result, if the iteration exits because the next device is full
> > (deleted by plist at this time), the
> > operation(swap I/O) fails even when other available swap devices remain.
> >
> > Should this be addressed?
> >
> > Best Regards,
> > Youngjun Park
>
> Actually after thinking about it again, swap_sync_discard should be
> looking at swap_active_head, not swap_avail_head. Changing to
> swap_active_head and also checking SWP_WRITEOK is a right fix I think,
> and should be good enough.
>
Hi Kairui,
Let me confirm if I understand your intention correctly.
It seems the following changes would be made:
- Change from swap_avail_lock to swap_lock
- After sync_discard, reacquire the lock and check if the list was
broken using SWP_WRITEOK instead. If broken, since we can't know
the next si, restart the list traversal from the beginning.
The advantage would be that it's only affected by swapoff changes,
so exceptional condition checks in this logic would occur less
frequently. Also, using swap_lock would be better in terms of
contention compared to swap I/O contention.
I'd like to confirm if this is what you intended, and if we agree
on this approach, I'll submit it as a separate patch.
Thank you.
Youngjun Park
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm/swap: fix wrong plist empty check in swap_alloc_slow()
2025-11-21 6:54 ` YoungJun Park
@ 2025-11-21 16:56 ` Kairui Song
0 siblings, 0 replies; 9+ messages in thread
From: Kairui Song @ 2025-11-21 16:56 UTC (permalink / raw)
To: YoungJun Park; +Cc: akpm, chrisl, shikemeng, nphamcs, bhe, baohua, linux-mm
On Fri, Nov 21, 2025 at 2:54 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Thu, Nov 20, 2025 at 10:18:34AM +0800, Kairui Song wrote:
> > On Thu, Nov 20, 2025 at 10:06 AM YoungJun Park <youngjun.park@lge.com> wrote:
> > >
> > > I've investigated this further and noticed something about swap_sync_discard.
> > > During iteration, it doesn't perform an empty check on swap devices. As a
> > > result, if the iteration exits because the next device is full
> > > (deleted by plist at this time), the
> > > operation(swap I/O) fails even when other available swap devices remain.
> > >
> > > Should this be addressed?
> > >
> > > Best Regards,
> > > Youngjun Park
> >
> > Actually after thinking about it again, swap_sync_discard should be
> > looking at swap_active_head, not swap_avail_head. Changing to
> > swap_active_head and also checking SWP_WRITEOK is a right fix I think,
> > and should be good enough.
> >
>
> Hi Kairui,
>
> Let me confirm if I understand your intention correctly.
>
> It seems the following changes would be made:
> - Change from swap_avail_lock to swap_lock
> - After sync_discard, reacquire the lock and check if the list was
> broken using SWP_WRITEOK instead. If broken, since we can't know
> the next si, restart the list traversal from the beginning.
>
> The advantage would be that it's only affected by swapoff changes,
> so exceptional condition checks in this logic would occur less
> frequently. Also, using swap_lock would be better in terms of
> contention compared to swap I/O contention.
Yes, swap_lock is better here, I should use that but somehow forgot
about it while working on something else, thanks for helping out!
And maybe you don't need the `SWP_WRITEOK` change, I was thinking that
swapoff sets ~SWP_WRITEOK and then takes the device off
swap_active_head, so to follow that convention you better check the
flag. But actually no, that's not related, the existing
get_swap_device_info is good enough.
Checking `plist_node_empty(&next->list)` just like what you are doing
here is the right way I think.
Thanks!
^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2025-11-21 16:57 UTC | newest]
Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-19 11:41 [PATCH] mm/swap: fix wrong plist empty check in swap_alloc_slow() Youngjun Park
2025-11-19 13:02 ` Kairui Song
2025-11-19 16:37 ` YoungJun Park
2025-11-20 2:08 ` Kairui Song
2025-11-20 2:06 ` YoungJun Park
2025-11-20 2:18 ` Kairui Song
2025-11-21 6:54 ` YoungJun Park
2025-11-21 16:56 ` Kairui Song
2025-11-20 2:47 ` Baoquan He
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox