From: Kairui Song <ryncsn@gmail.com>
To: "Huang, Ying" <ying.huang@intel.com>
Cc: linux-mm <linux-mm@kvack.org>, Chris Li <chrisl@kernel.org>
Subject: Re: cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly
Date: Mon, 4 Nov 2024 01:11:30 +0800 [thread overview]
Message-ID: <CAMgjq7CFKeVN7kQ9Ta--hcCs1Ng4KpMTa-xjs58_1EsbjSO6UQ@mail.gmail.com> (raw)
In-Reply-To: <87a5es3f1f.fsf@yhuang6-desk2.ccr.corp.intel.com>
Hi Ying,
On Fri, Oct 25, 2024 at 1:56 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > On Thu, Oct 24, 2024 at 11:29 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Hi, Kairui,
> >>
> >> While reading the new swap cluster allocation code (sorry, I didn't
> >> find time to review your patch in the first place), I found that in
> >> cluster_alloc_swap_entry(), the following code path is possible,
> >>
> >> cluster = this_cpu_ptr(si->percpu_cluster);
> >> offset = alloc_swap_scan_cluster();
> >> ...
> >> spin_unlock(&ci->lock);
> >> spin_unlock(&si->lock);
> >> /* migrate to another cpu */
> >> spin_lock(&ci->lock);
> >> spin_lock(&si->lock);
> >> cluster->next[order] = offset;
> >>
> >> That is, the per-CPU cluster of a CPU may be changed on another CPU. I
> >> guess that this will not cause a functionality issue. However, it
> >> makes the code harder to reason about. Is it possible to avoid
> >> unlocking before changing the per-CPU cluster? Or, give up updating
> >> the per-CPU cluster if we need to unlock.
> >
> > Hi Ying
> >
> > Yes, I've noticed this when working on the new si->lock series. It may
> > cause fragmentation, but only if the CPU that the current allocation
> > migrated from requests another swap allocation; in that case it may
> > waste one per-CPU cluster.
> >
> > No functionality or leak issue, but it does make things harder to
> > reason about, as you mentioned.
> >
> > In the new series I used a local lock to prevent migration, so the CPU
> > is stable during the whole process. However, that is not doable unless
> > si->lock is removed from the allocation path, or there will be lock
> > dependency issues.
> >
> > Do we need to fix this separately in the current tree? The
> > fragmentation is overall better with these patches, so I don't think
> > this is causing any issues. Before these patches, a per-CPU cluster
> > could also be stolen by other CPUs, so it was never guaranteed to be
> > accurate.
> >
> > One easy fix is to set cluster->next[order] to 0 before scanning for a
> > new cluster to use, and check whether it is still 0 before setting
> > next[order] again.
> >
> > That may still cause fragmentation as migration will cause extra
> > clusters to be scanned and used, which is hard to avoid without
> > changing too much code (like the things being done in the new si->lock
> > series).
>
> I found another, possibly more serious, problem. After an unlock/lock
> cycle, the state of the swap device may have changed (for example, it
> may have been swapped off). So in the previous code,
>
> si->flags += SWP_SCANNING;
>
> if (!(si->flags & SWP_WRITEOK))
>
> are used for that. I don't see them being used after the unlock/lock
> cycle. Can you check it?
Thanks for your review.
Yes, I think you are right. I did notice the flags are not updated
properly in the latest mm branch, but didn't realize it was a newly
introduced issue.
So I fixed the issue in
https://lore.kernel.org/linux-mm/20241022192451.38138-8-ryncsn@gmail.com/
But a quick fix can be posted before the series. The following patch
should be enough; what do you think?
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 46bd4b1a3c07..6b7d4ac1d4a3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1041,7 +1041,9 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
 	VM_BUG_ON(!si->cluster_info);
 
-	while (n_ret < nr) {
+	si->flags += SWP_SCANNING;
+	while (n_ret < nr && si->flags & SWP_WRITEOK) {
 		unsigned long offset = cluster_alloc_swap_entry(si,
								order, usage);
 		if (!offset)
@@ -1049,6 +1051,8 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
 		slots[n_ret++] = swp_entry(si->type, offset);
 	}
 
+	si->flags -= SWP_SCANNING;
+
 	return n_ret;
 }
The "+= SWP_SCANNING" looks ugly, but it has to be done this way, just
like in the original allocation code: multiple scanners can update the
flag at the same time, and since SWP_SCANNING is the highest defined bit
in the flags field, it effectively acts as a counter rather than a
strict bit flag. This will be cleaned up properly in the patch I just
mentioned.
>
> --
> Best Regards,
> Huang, Ying