From: Kairui Song <ryncsn@gmail.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: linux-mm <linux-mm@kvack.org>, Chris Li <chrisl@kernel.org>
Subject: Re: cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly
Date: Mon, 4 Nov 2024 01:11:30 +0800 [thread overview]
Message-ID: <CAMgjq7CFKeVN7kQ9Ta--hcCs1Ng4KpMTa-xjs58_1EsbjSO6UQ@mail.gmail.com> (raw)
In-Reply-To: <87a5es3f1f.fsf@yhuang6-desk2.ccr.corp.intel.com>
Hi Ying,
On Fri, Oct 25, 2024 at 1:56 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > On Thu, Oct 24, 2024 at 11:29 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Hi, Kairui,
> >>
> >> While reading the new swap cluster allocation code (sorry, I didn't
> >> find time to review your patch in the first place), I found that in
> >> cluster_alloc_swap_entry(), the following code path is possible,
> >>
> >> cluster = this_cpu_ptr(si->percpu_cluster);
> >> offset = alloc_swap_scan_cluster();
> >> ...
> >> spin_unlock(&ci->lock);
> >> spin_unlock(&si->lock);
> >> /* migrate to another cpu */
> >> spin_lock(&ci->lock);
> >> spin_lock(&si->lock);
> >> cluster->next[order] = offset;
> >>
> >> That is, the per-CPU cluster of a CPU may be changed on another CPU. I
> >> guess that this will not cause a functionality issue. However, it
> >> makes the code harder to reason about. Is it possible to avoid
> >> unlocking before changing the per-CPU cluster? Or, give up updating
> >> the per-CPU cluster if we need to unlock.
> >
> > Hi Ying
> >
> > Yes, I've noticed this when working on the new si->lock series. It may
> > cause fragmentation, but only if the CPU that the current allocation
> > migrated from requests another swap allocation; in that case it may
> > waste one per-CPU cluster.
> >
> > No functionality or leak issue, but it does make things harder to
> > reason about, as you mentioned.
> >
> > In the new series I used a local lock to prevent migration, so the CPU
> > is stable during the whole process. However, that is not doable unless
> > si->lock is removed from the allocation path, or there will be lock
> > dependency issues.
> >
> > Do we need to fix this separately in the current tree? The
> > fragmentation is overall better with these patches, so I don't think
> > this is causing any issues. Before these patches, a per-CPU cluster
> > could also be stolen by other CPUs, so it was never guaranteed to be
> > accurate.
> >
> > One easy fix is to set cluster->next[order] to 0 before scanning for a
> > new cluster to use, and check whether it is still 0 before setting
> > next[order] again.
> >
> > That may still cause fragmentation as migration will cause extra
> > clusters to be scanned and used, which is hard to avoid without
> > changing too much code (like the things being done in the new si->lock
> > series).
>
> I found another, possibly more serious, problem. After an unlock/lock
> cycle, the state of the swap device may have changed (for example, it
> may have been swapped off). So in the previous code,
>
> si->flags += SWP_SCANNING;
>
> if (!(si->flags & SWP_WRITEOK))
>
> are used for that. I don't see them being used after the unlock/lock
> cycle. Can you check it?
Thanks for your review.
Yes, I think you are right. I did notice the flags are not updated
properly in the latest mm branch, but didn't realize it was a newly
introduced issue.
So I fixed the issue in
https://lore.kernel.org/linux-mm/20241022192451.38138-8-ryncsn@gmail.com/
But a quick fix can be posted before the series. The following patch
should be enough; what do you think?
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 46bd4b1a3c07..6b7d4ac1d4a3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1041,7 +1041,9 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
 	VM_BUG_ON(!si->cluster_info);
 
-	while (n_ret < nr) {
+	si->flags += SWP_SCANNING;
+	while (n_ret < nr && si->flags & SWP_WRITEOK) {
 		unsigned long offset = cluster_alloc_swap_entry(si,
								order, usage);
 		if (!offset)
@@ -1049,6 +1051,8 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
 		slots[n_ret++] = swp_entry(si->type, offset);
 	}
 
+	si->flags -= SWP_SCANNING;
+
 	return n_ret;
 }
The "+= SWP_SCANNING" looks ugly, but it has to be done this way, just
like in the original allocation code: multiple scanners can update the
flag at the same time, and since SWP_SCANNING is the highest defined bit
in the flags field, it effectively acts as a counter rather than a
strict bit flag. This will be cleaned up properly in the patch I just
mentioned.
>
> --
> Best Regards,
> Huang, Ying