linux-mm.kvack.org archive mirror
* cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly
@ 2024-10-24  3:25 Huang, Ying
  2024-10-24  3:50 ` Kairui Song
  0 siblings, 1 reply; 8+ messages in thread
From: Huang, Ying @ 2024-10-24  3:25 UTC (permalink / raw)
  To: Kairui Song; +Cc: linux-mm

Hi, Kairui,

While reading the new swap cluster allocation code (sorry, I didn't
find time to review your patch in the first place), I found that the
following code path is possible in cluster_alloc_swap_entry():

  cluster = this_cpu_ptr(si->percpu_cluster);
  offset = alloc_swap_scan_cluster();
    ...
      spin_unlock(&ci->lock);
      spin_unlock(&si->lock);
      /* migrate to another cpu */
      spin_lock(&ci->lock);
      spin_lock(&si->lock);
  cluster->next[order] = offset;

That is, the per-CPU cluster of one CPU may be changed on another
CPU.  I guess that this will not cause a functionality issue.
However, it makes the code harder to reason about.  Is it possible to
avoid unlocking before changing the per-CPU cluster?  Or to give up
updating the per-CPU cluster if we need to unlock?
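
For illustration, the race can be modeled outside the kernel with a
toy per-CPU array (all names here are hypothetical, not the kernel's;
migration is simulated by passing the new CPU id):

```c
#include <assert.h>

/*
 * Toy model of the race: "per-CPU" state is an array indexed by CPU
 * id, and the task migrates between dropping and re-taking the locks.
 * The stale pointer taken before the unlock still targets the old
 * CPU's slot.
 */
#define NR_CPUS 2

struct toy_cluster {
	unsigned long next;	/* stands in for cluster->next[order] */
};

static struct toy_cluster percpu_cluster[NR_CPUS];

/* Start on @cpu; @migrated_to is where the task runs after relocking. */
static void toy_alloc(int cpu, int migrated_to, unsigned long offset)
{
	struct toy_cluster *cluster = &percpu_cluster[cpu]; /* this_cpu_ptr() */

	/* ... ci->lock and si->lock dropped here, task migrates ... */
	(void)migrated_to;	/* the stale pointer is still used below */

	cluster->next = offset;	/* updates CPU @cpu, not @migrated_to */
}
```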

--
Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly
  2024-10-24  3:25 cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly Huang, Ying
@ 2024-10-24  3:50 ` Kairui Song
  2024-10-24  3:54   ` Kairui Song
  2024-10-25  5:52   ` Huang, Ying
  0 siblings, 2 replies; 8+ messages in thread
From: Kairui Song @ 2024-10-24  3:50 UTC (permalink / raw)
  To: Huang, Ying; +Cc: linux-mm

On Thu, Oct 24, 2024 at 11:29 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Kairui,
>
> When reading new swap cluster allocation code.  Sorry, I didn't find
> time to review your patch at the first place.  I found that in
> cluster_alloc_swap_entry(), the following code path is possible,
>
>   cluster = this_cpu_ptr(si->percpu_cluster);
>   offset = alloc_swap_scan_cluster();
>     ...
>       spin_unlock(&ci->lock);
>       spin_unlock(&si->lock);
>       /* migrate to another cpu */
>       spin_lock(&ci->lock);
>       spin_lock(&si->lock);
>   cluster->next[order] = offset;
>
> That is, the per cpu cluster of a CPU may be changed on another CPU.  I
> guess that this will not cause some functionality issue.  However, this
> makes code harder to be reasoned.  Is it possible to avoid unlock before
> changing per cpu cluster?  Or, give up updating per cpu cluster if we
> need to unlock.

Hi Ying

Yes, I noticed this when working on the new si->lock series. It may
cause fragmentation, but only if the CPU that the current allocation
migrated from requests another swap allocation; in that case it may
waste one per-CPU cluster.

There is no functionality or leak issue, but it does make things
harder to reason about, as you mentioned.

In the new series I use a local lock to prevent migration, so the CPU
is stable during the whole process. However, that is not doable here
unless si->lock is removed from the allocation path; otherwise there
are lock dependency issues.

Do we need to fix this separately in the current tree? The
fragmentation is overall better with these patches, so I think that is
causing any issues. Before these patches, a per-CPU cluster could
also be stolen by other CPUs, so it was not guaranteed to be accurate
either.

One easy fix would be to set cluster->next[order] to 0 before
scanning for a new cluster to use, and check whether it is still 0
before setting next[order] again.
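
That idea could look roughly like this (hypothetical helper and
structure names, not actual kernel code):

```c
#include <assert.h>

/*
 * Sketch of the suggested guard: next[order] is cleared before the
 * locks are dropped to scan, and the new offset is only published if
 * nobody re-populated the slot in the meantime.  Single order only,
 * for brevity.
 */
struct toy_percpu_cluster {
	unsigned long next[1];		/* stands in for next[order] */
};

static void toy_publish(struct toy_percpu_cluster *cluster,
			int order, unsigned long offset)
{
	if (cluster->next[order] == 0)	/* still ours: nobody raced with us */
		cluster->next[order] = offset;
}
```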

That may still cause fragmentation, as migration will cause extra
clusters to be scanned and used, which is hard to avoid without
changing too much code (like what is being done in the new si->lock
series).



* Re: cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly
  2024-10-24  3:50 ` Kairui Song
@ 2024-10-24  3:54   ` Kairui Song
  2024-10-25  5:52   ` Huang, Ying
  1 sibling, 0 replies; 8+ messages in thread
From: Kairui Song @ 2024-10-24  3:54 UTC (permalink / raw)
  To: Huang, Ying; +Cc: linux-mm

On Thu, Oct 24, 2024 at 11:50 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, Oct 24, 2024 at 11:29 AM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Hi, Kairui,
> >
> > When reading new swap cluster allocation code.  Sorry, I didn't find
> > time to review your patch at the first place.  I found that in
> > cluster_alloc_swap_entry(), the following code path is possible,
> >
> >   cluster = this_cpu_ptr(si->percpu_cluster);
> >   offset = alloc_swap_scan_cluster();
> >     ...
> >       spin_unlock(&ci->lock);
> >       spin_unlock(&si->lock);
> >       /* migrate to another cpu */
> >       spin_lock(&ci->lock);
> >       spin_lock(&si->lock);
> >   cluster->next[order] = offset;
> >
> > That is, the per cpu cluster of a CPU may be changed on another CPU.  I
> > guess that this will not cause some functionality issue.  However, this
> > makes code harder to be reasoned.  Is it possible to avoid unlock before
> > changing per cpu cluster?  Or, give up updating per cpu cluster if we
> > need to unlock.
>
> Hi Ying
>
> Yes, I've noticed this when working on the new si->lock series. It may
> cause fragmentation, only if the CPU the current allocation migrates
> from requested another swap allocation, in such case it may waste one
> per cpu cluster.
>
> No functionality or leak issue, but make things harder to reason
> indeed, as you mentioned.
>
> In the new series I used a local lock to prevent migration, so the cpu
> is stable during the whole process. However that is not doable unless
> si->lock is removed from allocation path or there are lock dependency
> issue.
>
> Do we need to fix this separately in the current tree? The
> fragmentation is overall better with these patches, so I think that is

Typo: I mean I think these patches are not causing new issues.

> causing any issues. Before these patches per CPU cluster can also be
> stolen by other CPUs so these were not guaranteed to be accurate.
>
> One easy way we can do is set the cluster->next[order] to 0 before
> scanning for a new cluster to use, and check if it is 0 before setting
> next[order] again.
>
> That may still cause fragmentation as migration will cause extra
> clusters to be scanned and used, which is hard to avoid without
> changing too much code (like the things being done in the new si->lock
> series).



* Re: cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly
  2024-10-24  3:50 ` Kairui Song
  2024-10-24  3:54   ` Kairui Song
@ 2024-10-25  5:52   ` Huang, Ying
  2024-11-03 17:11     ` Kairui Song
  1 sibling, 1 reply; 8+ messages in thread
From: Huang, Ying @ 2024-10-25  5:52 UTC (permalink / raw)
  To: Kairui Song; +Cc: linux-mm

Kairui Song <ryncsn@gmail.com> writes:

> On Thu, Oct 24, 2024 at 11:29 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Hi, Kairui,
>>
>> When reading new swap cluster allocation code.  Sorry, I didn't find
>> time to review your patch at the first place.  I found that in
>> cluster_alloc_swap_entry(), the following code path is possible,
>>
>>   cluster = this_cpu_ptr(si->percpu_cluster);
>>   offset = alloc_swap_scan_cluster();
>>     ...
>>       spin_unlock(&ci->lock);
>>       spin_unlock(&si->lock);
>>       /* migrate to another cpu */
>>       spin_lock(&ci->lock);
>>       spin_lock(&si->lock);
>>   cluster->next[order] = offset;
>>
>> That is, the per cpu cluster of a CPU may be changed on another CPU.  I
>> guess that this will not cause some functionality issue.  However, this
>> makes code harder to be reasoned.  Is it possible to avoid unlock before
>> changing per cpu cluster?  Or, give up updating per cpu cluster if we
>> need to unlock.
>
> Hi Ying
>
> Yes, I've noticed this when working on the new si->lock series. It may
> cause fragmentation, only if the CPU the current allocation migrates
> from requested another swap allocation, in such case it may waste one
> per cpu cluster.
>
> No functionality or leak issue, but make things harder to reason
> indeed, as you mentioned.
>
> In the new series I used a local lock to prevent migration, so the cpu
> is stable during the whole process. However that is not doable unless
> si->lock is removed from allocation path or there are lock dependency
> issue.
>
> Do we need to fix this separately in the current tree? The
> fragmentation is overall better with these patches, so I think that is
> causing any issues. Before these patches per CPU cluster can also be
> stolen by other CPUs so these were not guaranteed to be accurate.
>
> One easy way we can do is set the cluster->next[order] to 0 before
> scanning for a new cluster to use, and check if it is 0 before setting
> next[order] again.
>
> That may still cause fragmentation as migration will cause extra
> clusters to be scanned and used, which is hard to avoid without
> changing too much code (like the things being done in the new si->lock
> series).

I found another, possibly more serious, problem.  After an unlock/lock
cycle, the state of the swap device may have changed (for example, it
may have been swapped off).  So, in the previous code,

	si->flags += SWP_SCANNING;

	if (!(si->flags & SWP_WRITEOK))

were used for that.  I don't see them being used after the unlock/lock
cycle.  Can you check it?

--
Best Regards,
Huang, Ying



* Re: cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly
  2024-10-25  5:52   ` Huang, Ying
@ 2024-11-03 17:11     ` Kairui Song
  2024-11-04  3:02       ` Huang, Ying
  0 siblings, 1 reply; 8+ messages in thread
From: Kairui Song @ 2024-11-03 17:11 UTC (permalink / raw)
  To: Huang, Ying; +Cc: linux-mm, Chris Li

Hi Ying,

On Fri, Oct 25, 2024 at 1:56 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > On Thu, Oct 24, 2024 at 11:29 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Hi, Kairui,
> >>
> >> When reading new swap cluster allocation code.  Sorry, I didn't find
> >> time to review your patch at the first place.  I found that in
> >> cluster_alloc_swap_entry(), the following code path is possible,
> >>
> >>   cluster = this_cpu_ptr(si->percpu_cluster);
> >>   offset = alloc_swap_scan_cluster();
> >>     ...
> >>       spin_unlock(&ci->lock);
> >>       spin_unlock(&si->lock);
> >>       /* migrate to another cpu */
> >>       spin_lock(&ci->lock);
> >>       spin_lock(&si->lock);
> >>   cluster->next[order] = offset;
> >>
> >> That is, the per cpu cluster of a CPU may be changed on another CPU.  I
> >> guess that this will not cause some functionality issue.  However, this
> >> makes code harder to be reasoned.  Is it possible to avoid unlock before
> >> changing per cpu cluster?  Or, give up updating per cpu cluster if we
> >> need to unlock.
> >
> > Hi Ying
> >
> > Yes, I've noticed this when working on the new si->lock series. It may
> > cause fragmentation, only if the CPU the current allocation migrates
> > from requested another swap allocation, in such case it may waste one
> > per cpu cluster.
> >
> > No functionality or leak issue, but make things harder to reason
> > indeed, as you mentioned.
> >
> > In the new series I used a local lock to prevent migration, so the cpu
> > is stable during the whole process. However that is not doable unless
> > si->lock is removed from allocation path or there are lock dependency
> > issue.
> >
> > Do we need to fix this separately in the current tree? The
> > fragmentation is overall better with these patches, so I think that is
> > causing any issues. Before these patches per CPU cluster can also be
> > stolen by other CPUs so these were not guaranteed to be accurate.
> >
> > One easy way we can do is set the cluster->next[order] to 0 before
> > scanning for a new cluster to use, and check if it is 0 before setting
> > next[order] again.
> >
> > That may still cause fragmentation as migration will cause extra
> > clusters to be scanned and used, which is hard to avoid without
> > changing too much code (like the things being done in the new si->lock
> > series).
>
> Find another possible more serious problem.  After unlock/lock cycle,
> the state of swap device may be changed (for example, swapoffed).  So in
> the previous code,
>
>         si->flags += SWP_SCANNING;
>
>         if (!(si->flags & SWP_WRITEOK))
>
> is used for that.  I don't find they are used after unlock/lock cycle.
> Can you check it?

Thanks for your review.

Yes, I think you are right. I did notice the flags were not updated
properly in the latest mm branch, but did not realize it was a newly
introduced issue.

So I fixed the issue in
https://lore.kernel.org/linux-mm/20241022192451.38138-8-ryncsn@gmail.com/

But a quick fix can be posted before the series. The following patch
should be enough; what do you think?

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 46bd4b1a3c07..6b7d4ac1d4a3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1041,7 +1041,9 @@ static int cluster_alloc_swap(struct swap_info_struct *si,

        VM_BUG_ON(!si->cluster_info);

-       while (n_ret < nr) {
+       si->flags += SWP_SCANNING;
+       while (n_ret < nr && si->flags & SWP_WRITEOK) {
                unsigned long offset = cluster_alloc_swap_entry(si, order, usage);

                if (!offset)
@@ -1049,6 +1051,8 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
                slots[n_ret++] = swp_entry(si->type, offset);
        }

+       si->flags -= SWP_SCANNING;
+
        return n_ret;
 }

The "+= SWP_SCANNING" looks ugly, but it has to be done this way,
just like in the original allocation code. Multiple scanners can
update the flag concurrently, and since SWP_SCANNING is the highest
defined bit in the flags field, it acts as a counter here rather than
a strict bit flag. This will be cleaned up properly in the patch I
just mentioned.
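
The counter behavior can be illustrated with a standalone toy (the
bit values below are made up for the demo and are not the kernel's
actual flag layout; only the ordering property matters):

```c
#include <assert.h>

/*
 * Why "flags += SWP_SCANNING" works as a counter: the scanning bit is
 * the highest defined flag bit, so additions carry into otherwise
 * unused high bits, and flags >= TOY_SCANNING holds exactly while at
 * least one scanner is active.
 */
enum {
	TOY_WRITEOK  = 1 << 1,
	TOY_SCANNING = 1 << 14,		/* highest defined bit */
};

static unsigned long toy_flags = TOY_WRITEOK;

static void toy_scan_start(void) { toy_flags += TOY_SCANNING; }
static void toy_scan_stop(void)  { toy_flags -= TOY_SCANNING; }
```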


>
> --
> Best Regards,
> Huang, Ying



* Re: cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly
  2024-11-03 17:11     ` Kairui Song
@ 2024-11-04  3:02       ` Huang, Ying
  2024-11-05 18:04         ` Kairui Song
  0 siblings, 1 reply; 8+ messages in thread
From: Huang, Ying @ 2024-11-04  3:02 UTC (permalink / raw)
  To: Kairui Song; +Cc: linux-mm, Chris Li

Kairui Song <ryncsn@gmail.com> writes:

> Hi Ying,
>
> On Fri, Oct 25, 2024 at 1:56 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Kairui Song <ryncsn@gmail.com> writes:
>>
>> > On Thu, Oct 24, 2024 at 11:29 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Hi, Kairui,
>> >>
>> >> When reading new swap cluster allocation code.  Sorry, I didn't find
>> >> time to review your patch at the first place.  I found that in
>> >> cluster_alloc_swap_entry(), the following code path is possible,
>> >>
>> >>   cluster = this_cpu_ptr(si->percpu_cluster);
>> >>   offset = alloc_swap_scan_cluster();
>> >>     ...
>> >>       spin_unlock(&ci->lock);
>> >>       spin_unlock(&si->lock);
>> >>       /* migrate to another cpu */
>> >>       spin_lock(&ci->lock);
>> >>       spin_lock(&si->lock);
>> >>   cluster->next[order] = offset;
>> >>
>> >> That is, the per cpu cluster of a CPU may be changed on another CPU.  I
>> >> guess that this will not cause some functionality issue.  However, this
>> >> makes code harder to be reasoned.  Is it possible to avoid unlock before
>> >> changing per cpu cluster?  Or, give up updating per cpu cluster if we
>> >> need to unlock.
>> >
>> > Hi Ying
>> >
>> > Yes, I've noticed this when working on the new si->lock series. It may
>> > cause fragmentation, only if the CPU the current allocation migrates
>> > from requested another swap allocation, in such case it may waste one
>> > per cpu cluster.
>> >
>> > No functionality or leak issue, but make things harder to reason
>> > indeed, as you mentioned.
>> >
>> > In the new series I used a local lock to prevent migration, so the cpu
>> > is stable during the whole process. However that is not doable unless
>> > si->lock is removed from allocation path or there are lock dependency
>> > issue.
>> >
>> > Do we need to fix this separately in the current tree? The
>> > fragmentation is overall better with these patches, so I think that is
>> > causing any issues. Before these patches per CPU cluster can also be
>> > stolen by other CPUs so these were not guaranteed to be accurate.
>> >
>> > One easy way we can do is set the cluster->next[order] to 0 before
>> > scanning for a new cluster to use, and check if it is 0 before setting
>> > next[order] again.
>> >
>> > That may still cause fragmentation as migration will cause extra
>> > clusters to be scanned and used, which is hard to avoid without
>> > changing too much code (like the things being done in the new si->lock
>> > series).
>>
>> Find another possible more serious problem.  After unlock/lock cycle,
>> the state of swap device may be changed (for example, swapoffed).  So in
>> the previous code,
>>
>>         si->flags += SWP_SCANNING;
>>
>>         if (!(si->flags & SWP_WRITEOK))
>>
>> is used for that.  I don't find they are used after unlock/lock cycle.
>> Can you check it?
>
> Thanks for your review.
>
> Yes, I think you are right. I did notice the flags are not updated
> properly in the latest mm branch, but did not notice it was a newly
> introduced issue.
>
> So I fixed the issue in
> https://lore.kernel.org/linux-mm/20241022192451.38138-8-ryncsn@gmail.com/
>
> But a quick fix can be posted before the series. Following patch
> should be enough, how do you think?
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 46bd4b1a3c07..6b7d4ac1d4a3 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1041,7 +1041,9 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
>
>         VM_BUG_ON(!si->cluster_info);
>
> -       while (n_ret < nr) {
> +       si->flags += SWP_SCANNING;
> +       while (n_ret < nr && si->flags & SWP_WRITEOK) {
>                 unsigned long offset = cluster_alloc_swap_entry(si,
> order, usage);
>
>                 if (!offset)
> @@ -1049,6 +1051,8 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
>                 slots[n_ret++] = swp_entry(si->type, offset);
>         }
>
> +       si->flags -= SWP_SCANNING;
> +
>         return n_ret;
>  }
>
> The "+= SWP_SCANNING" looks ugly, but has to be done in such a way,
> just like in the original allocation code. Multiple scanners can
> update the flag, and SCANNING is the highest defined bit in the flag
> field, it's not a strictly bit flag in such cases. This will be
> cleaned up in a proper way in the patch I just mentioned.

I think that the logic of SWP_SCANNING is correct.

As for SWP_WRITEOK, we should stop allocating swap entries as soon as
another task clears it.  So, the above check may be too late, because
we will have already allocated the swap entry.  I think that we can
check it in cluster_reclaim_range() and refuse to allocate the swap
entry there.

Combined with the per-CPU cluster (cluster->next[]) changing issue we
discussed in another email in this thread, I think the long-term
solution could be to separate swap entry allocation and reclaim a bit
more.  For example, we could first try to allocate some swap entries
with si->lock held, record possible reclaim offsets if any, change
cluster->next[], then unlock, reclaim, and lock again.
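
That split could be sketched roughly as follows (purely hypothetical
toy code, not a real patch; the "lock" is a plain flag because this
toy is single-threaded):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Phase 1: allocate and publish next[] with the lock held, recording
 * any offset that still needs reclaim.  Phase 2: reclaim with the
 * lock dropped, so the unlock/lock cycle no longer straddles the
 * per-CPU cluster update.
 */
struct toy_si {
	bool locked;
	unsigned long next;		/* stands in for cluster->next[order] */
	unsigned long reclaim_offset;	/* 0 means nothing to reclaim */
	int reclaimed;			/* count of reclaim passes done */
};

static unsigned long toy_alloc_entry(struct toy_si *si, unsigned long offset,
				     bool needs_reclaim)
{
	si->locked = true;			/* spin_lock(&si->lock) */
	si->next = offset + 1;			/* publish next[] while locked */
	si->reclaim_offset = needs_reclaim ? offset : 0;
	si->locked = false;			/* spin_unlock(&si->lock) */

	if (si->reclaim_offset)			/* reclaim without the lock */
		si->reclaimed++;

	return offset;
}
```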

--
Best Regards,
Huang, Ying



* Re: cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly
  2024-11-04  3:02       ` Huang, Ying
@ 2024-11-05 18:04         ` Kairui Song
  2024-11-06  5:24           ` Huang, Ying
  0 siblings, 1 reply; 8+ messages in thread
From: Kairui Song @ 2024-11-05 18:04 UTC (permalink / raw)
  To: Huang, Ying; +Cc: linux-mm, Chris Li

On Mon, Nov 4, 2024 at 11:06 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> I think that the logic of SWP_SCANNING is correct.
>
> As for SWP_WRITEOK, we should avoid to allocate swap entries when other
> task clears it ASAP.  So, the above checking may be too late, because we
> have already allocated the swap entry.  I think that we can check it in
> cluster_reclaim_range() and refuse to allocate swap entry there.

Thanks for the feedback.

Not sure if this is a big issue. With the above patch, SWP_WRITEOK
will prevent any new allocations from using the device; already
ongoing allocations may indeed still use it, but in the end swapoff
will still succeed, perhaps very slightly slower, as swapoff may have
to unuse a few more entries. I can add another SWP_WRITEOK check to
make it better, and will post a patch.

> Combined with the per cpu cluster (cluster->next[]) changing issue we
> discussed in another email of the thread.  I think the long term
> solution could be separate the swap entry allocation and reclaiming a
> bit more.  For example, we could try to allocate some swap entries with
> si->lock held firstly, record possible reclaim offset if any, change
> cluster->next[], then unlock, reclaim, lock.
>
> --
> Best Regards,
> Huang, Ying



* Re: cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly
  2024-11-05 18:04         ` Kairui Song
@ 2024-11-06  5:24           ` Huang, Ying
  0 siblings, 0 replies; 8+ messages in thread
From: Huang, Ying @ 2024-11-06  5:24 UTC (permalink / raw)
  To: Kairui Song; +Cc: linux-mm, Chris Li

Kairui Song <ryncsn@gmail.com> writes:

> On Mon, Nov 4, 2024 at 11:06 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Kairui Song <ryncsn@gmail.com> writes:
>>
>> I think that the logic of SWP_SCANNING is correct.
>>
>> As for SWP_WRITEOK, we should avoid to allocate swap entries when other
>> task clears it ASAP.  So, the above checking may be too late, because we
>> have already allocated the swap entry.  I think that we can check it in
>> cluster_reclaim_range() and refuse to allocate swap entry there.
>
> Thanks for the feedback.
>
> Not sure if this is a big issue. With the above patch SWP_WRITEOK will
> prevents any new allocations from using the device, already ongoing
> allocations may still use this device indeed, but in the end swapoff
> will still succeed, maybe very slightly slower as swapoff may have to
> unuse a few more entries. I can add another SWP_WRITEOK check to make
> it better, will post a patch.

IIUC, this is more serious than just delaying swapoff.  try_to_unuse()
checks si->inuse_pages; if it becomes 0, swapoff() will go ahead
without checking it again.  However, the current kernel may allocate a
swap entry even after si->inuse_pages reaches 0 and SWP_WRITEOK is
cleared.

>> Combined with the per cpu cluster (cluster->next[]) changing issue we
>> discussed in another email of the thread.  I think the long term
>> solution could be separate the swap entry allocation and reclaiming a
>> bit more.  For example, we could try to allocate some swap entries with
>> si->lock held firstly, record possible reclaim offset if any, change
>> cluster->next[], then unlock, reclaim, lock.

--
Best Regards,
Huang, Ying


