* cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly
@ 2024-10-24 3:25 Huang, Ying
2024-10-24 3:50 ` Kairui Song
0 siblings, 1 reply; 8+ messages in thread
From: Huang, Ying @ 2024-10-24 3:25 UTC (permalink / raw)
To: Kairui Song; +Cc: linux-mm
Hi, Kairui,
While reading the new swap cluster allocation code (sorry, I didn't
find time to review your patch in the first place), I found that the
following code path is possible in cluster_alloc_swap_entry():
cluster = this_cpu_ptr(si->percpu_cluster);
offset = alloc_swap_scan_cluster();
...
spin_unlock(&ci->lock);
spin_unlock(&si->lock);
/* migrate to another cpu */
spin_lock(&ci->lock);
spin_lock(&si->lock);
cluster->next[order] = offset;
That is, the per-CPU cluster of one CPU may be changed on another CPU.
I guess this will not cause a functional issue, but it makes the code
harder to reason about. Is it possible to avoid unlocking before
changing the per-CPU cluster? Or to give up updating the per-CPU
cluster if we need to unlock?
--
Best Regards,
Huang, Ying
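
As an illustration of the "give up updating the per-CPU cluster if we need to
unlock" idea raised above, here is a minimal editorial sketch. It reuses the
names from the snippet in the message (si, order, alloc_swap_scan_cluster())
but is simplified and untested; it is not the actual mm/swapfile.c code.

  struct percpu_cluster *cluster, *cur;
  unsigned long offset;

  cluster = this_cpu_ptr(si->percpu_cluster);
  offset = alloc_swap_scan_cluster(...);   /* may drop and retake si->lock */

  /*
   * Re-read the per-CPU pointer once the locks are held again.  If the
   * task migrated while the locks were dropped, the pointers differ and
   * we simply skip the hint update instead of writing to another CPU's
   * cluster->next[].
   */
  cur = this_cpu_ptr(si->percpu_cluster);
  if (cur == cluster)
          cluster->next[order] = offset;
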
* Re: cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly
2024-10-24 3:25 cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly Huang, Ying
@ 2024-10-24 3:50 ` Kairui Song
2024-10-24 3:54 ` Kairui Song
2024-10-25 5:52 ` Huang, Ying
0 siblings, 2 replies; 8+ messages in thread
From: Kairui Song @ 2024-10-24 3:50 UTC (permalink / raw)
To: Huang, Ying; +Cc: linux-mm

On Thu, Oct 24, 2024 at 11:29 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hi, Kairui,
>
> When reading new swap cluster allocation code. Sorry, I didn't find
> time to review your patch at the first place. I found that in
> cluster_alloc_swap_entry(), the following code path is possible,
>
> cluster = this_cpu_ptr(si->percpu_cluster);
> offset = alloc_swap_scan_cluster();
> ...
> spin_unlock(&ci->lock);
> spin_unlock(&si->lock);
> /* migrate to another cpu */
> spin_lock(&ci->lock);
> spin_lock(&si->lock);
> cluster->next[order] = offset;
>
> That is, the per cpu cluster of a CPU may be changed on another CPU. I
> guess that this will not cause some functionality issue. However, this
> makes code harder to be reasoned. Is it possible to avoid unlock before
> changing per cpu cluster? Or, give up updating per cpu cluster if we
> need to unlock.

Hi Ying

Yes, I've noticed this when working on the new si->lock series. It may
cause fragmentation, only if the CPU the current allocation migrates
from requested another swap allocation, in such case it may waste one
per cpu cluster.

No functionality or leak issue, but make things harder to reason
indeed, as you mentioned.

In the new series I used a local lock to prevent migration, so the cpu
is stable during the whole process. However that is not doable unless
si->lock is removed from allocation path or there are lock dependency
issue.

Do we need to fix this separately in the current tree? The
fragmentation is overall better with these patches, so I think that is
causing any issues. Before these patches per CPU cluster can also be
stolen by other CPUs so these were not guaranteed to be accurate.

One easy way we can do is set the cluster->next[order] to 0 before
scanning for a new cluster to use, and check if it is 0 before setting
next[order] again.

That may still cause fragmentation as migration will cause extra
clusters to be scanned and used, which is hard to avoid without
changing too much code (like the things being done in the new si->lock
series).
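
A minimal editorial sketch of the cluster->next[order] idea described in the
message above, reusing the names from the quoted snippet; this only shows the
intent and is not a tested patch against mm/swapfile.c.

  cluster = this_cpu_ptr(si->percpu_cluster);
  cluster->next[order] = 0;               /* invalidate the hint before scanning */

  offset = alloc_swap_scan_cluster(...);  /* may drop and retake si->lock; task may migrate */

  /*
   * Publish the new hint only if nothing else has refreshed this slot
   * while the locks were dropped; otherwise keep the newer value.
   */
  if (!cluster->next[order])
          cluster->next[order] = offset;
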
* Re: cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly
2024-10-24 3:50 ` Kairui Song
@ 2024-10-24 3:54 ` Kairui Song
2024-10-25 5:52 ` Huang, Ying
1 sibling, 0 replies; 8+ messages in thread
From: Kairui Song @ 2024-10-24 3:54 UTC (permalink / raw)
To: Huang, Ying; +Cc: linux-mm

On Thu, Oct 24, 2024 at 11:50 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, Oct 24, 2024 at 11:29 AM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Hi, Kairui,
> >
> > When reading new swap cluster allocation code. Sorry, I didn't find
> > time to review your patch at the first place. I found that in
> > cluster_alloc_swap_entry(), the following code path is possible,
> >
> > cluster = this_cpu_ptr(si->percpu_cluster);
> > offset = alloc_swap_scan_cluster();
> > ...
> > spin_unlock(&ci->lock);
> > spin_unlock(&si->lock);
> > /* migrate to another cpu */
> > spin_lock(&ci->lock);
> > spin_lock(&si->lock);
> > cluster->next[order] = offset;
> >
> > That is, the per cpu cluster of a CPU may be changed on another CPU. I
> > guess that this will not cause some functionality issue. However, this
> > makes code harder to be reasoned. Is it possible to avoid unlock before
> > changing per cpu cluster? Or, give up updating per cpu cluster if we
> > need to unlock.
>
> Hi Ying
>
> Yes, I've noticed this when working on the new si->lock series. It may
> cause fragmentation, only if the CPU the current allocation migrates
> from requested another swap allocation, in such case it may waste one
> per cpu cluster.
>
> No functionality or leak issue, but make things harder to reason
> indeed, as you mentioned.
>
> In the new series I used a local lock to prevent migration, so the cpu
> is stable during the whole process. However that is not doable unless
> si->lock is removed from allocation path or there are lock dependency
> issue.
>
> Do we need to fix this separately in the current tree? The
> fragmentation is overall better with these patches, so I think that is

Typo: I mean I think these patches are not causing new issues.

> causing any issues. Before these patches per CPU cluster can also be
> stolen by other CPUs so these were not guaranteed to be accurate.
>
> One easy way we can do is set the cluster->next[order] to 0 before
> scanning for a new cluster to use, and check if it is 0 before setting
> next[order] again.
>
> That may still cause fragmentation as migration will cause extra
> clusters to be scanned and used, which is hard to avoid without
> changing too much code (like the things being done in the new si->lock
> series).
* Re: cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly
2024-10-24 3:50 ` Kairui Song
2024-10-24 3:54 ` Kairui Song
@ 2024-10-25 5:52 ` Huang, Ying
2024-11-03 17:11 ` Kairui Song
1 sibling, 1 reply; 8+ messages in thread
From: Huang, Ying @ 2024-10-25 5:52 UTC (permalink / raw)
To: Kairui Song; +Cc: linux-mm

Kairui Song <ryncsn@gmail.com> writes:

> On Thu, Oct 24, 2024 at 11:29 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Hi, Kairui,
>>
>> When reading new swap cluster allocation code. Sorry, I didn't find
>> time to review your patch at the first place. I found that in
>> cluster_alloc_swap_entry(), the following code path is possible,
>>
>> cluster = this_cpu_ptr(si->percpu_cluster);
>> offset = alloc_swap_scan_cluster();
>> ...
>> spin_unlock(&ci->lock);
>> spin_unlock(&si->lock);
>> /* migrate to another cpu */
>> spin_lock(&ci->lock);
>> spin_lock(&si->lock);
>> cluster->next[order] = offset;
>>
>> That is, the per cpu cluster of a CPU may be changed on another CPU. I
>> guess that this will not cause some functionality issue. However, this
>> makes code harder to be reasoned. Is it possible to avoid unlock before
>> changing per cpu cluster? Or, give up updating per cpu cluster if we
>> need to unlock.
>
> Hi Ying
>
> Yes, I've noticed this when working on the new si->lock series. It may
> cause fragmentation, only if the CPU the current allocation migrates
> from requested another swap allocation, in such case it may waste one
> per cpu cluster.
>
> No functionality or leak issue, but make things harder to reason
> indeed, as you mentioned.
>
> In the new series I used a local lock to prevent migration, so the cpu
> is stable during the whole process. However that is not doable unless
> si->lock is removed from allocation path or there are lock dependency
> issue.
>
> Do we need to fix this separately in the current tree? The
> fragmentation is overall better with these patches, so I think that is
> causing any issues. Before these patches per CPU cluster can also be
> stolen by other CPUs so these were not guaranteed to be accurate.
>
> One easy way we can do is set the cluster->next[order] to 0 before
> scanning for a new cluster to use, and check if it is 0 before setting
> next[order] again.
>
> That may still cause fragmentation as migration will cause extra
> clusters to be scanned and used, which is hard to avoid without
> changing too much code (like the things being done in the new si->lock
> series).

I found another, possibly more serious, problem. After the unlock/lock
cycle, the state of the swap device may have changed (for example, the
device may have been swapped off). So in the previous code,

si->flags += SWP_SCANNING;

if (!(si->flags & SWP_WRITEOK))

were used for that. I don't see them used after the unlock/lock cycle.
Can you check it?

--
Best Regards,
Huang, Ying
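
For context, a rough editorial sketch of how the older slot-scanning
allocator used these two flags around a scan that may drop si->lock. It is
simplified from memory, not verbatim kernel code, and more_entries_wanted is
a placeholder condition.

  si->flags += SWP_SCANNING;          /* tell swapoff a scanner is inside */
  while (more_entries_wanted) {
          if (!(si->flags & SWP_WRITEOK))
                  break;              /* swapoff started: stop allocating here */
          /* ... scan, possibly unlock/relock si->lock, claim a slot ... */
  }
  si->flags -= SWP_SCANNING;          /* scanner done, swapoff may proceed */
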
* Re: cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly
2024-10-25 5:52 ` Huang, Ying
@ 2024-11-03 17:11 ` Kairui Song
2024-11-04 3:02 ` Huang, Ying
0 siblings, 1 reply; 8+ messages in thread
From: Kairui Song @ 2024-11-03 17:11 UTC (permalink / raw)
To: Huang, Ying; +Cc: linux-mm, Chris Li

Hi Ying,

On Fri, Oct 25, 2024 at 1:56 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> > On Thu, Oct 24, 2024 at 11:29 AM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Hi, Kairui,
> >>
> >> When reading new swap cluster allocation code. Sorry, I didn't find
> >> time to review your patch at the first place. I found that in
> >> cluster_alloc_swap_entry(), the following code path is possible,
> >>
> >> cluster = this_cpu_ptr(si->percpu_cluster);
> >> offset = alloc_swap_scan_cluster();
> >> ...
> >> spin_unlock(&ci->lock);
> >> spin_unlock(&si->lock);
> >> /* migrate to another cpu */
> >> spin_lock(&ci->lock);
> >> spin_lock(&si->lock);
> >> cluster->next[order] = offset;
> >>
> >> That is, the per cpu cluster of a CPU may be changed on another CPU. I
> >> guess that this will not cause some functionality issue. However, this
> >> makes code harder to be reasoned. Is it possible to avoid unlock before
> >> changing per cpu cluster? Or, give up updating per cpu cluster if we
> >> need to unlock.
> >
> > Hi Ying
> >
> > Yes, I've noticed this when working on the new si->lock series. It may
> > cause fragmentation, only if the CPU the current allocation migrates
> > from requested another swap allocation, in such case it may waste one
> > per cpu cluster.
> >
> > No functionality or leak issue, but make things harder to reason
> > indeed, as you mentioned.
> >
> > In the new series I used a local lock to prevent migration, so the cpu
> > is stable during the whole process. However that is not doable unless
> > si->lock is removed from allocation path or there are lock dependency
> > issue.
> >
> > Do we need to fix this separately in the current tree? The
> > fragmentation is overall better with these patches, so I think that is
> > causing any issues. Before these patches per CPU cluster can also be
> > stolen by other CPUs so these were not guaranteed to be accurate.
> >
> > One easy way we can do is set the cluster->next[order] to 0 before
> > scanning for a new cluster to use, and check if it is 0 before setting
> > next[order] again.
> >
> > That may still cause fragmentation as migration will cause extra
> > clusters to be scanned and used, which is hard to avoid without
> > changing too much code (like the things being done in the new si->lock
> > series).
>
> Find another possible more serious problem. After unlock/lock cycle,
> the state of swap device may be changed (for example, swapoffed). So in
> the previous code,
>
> si->flags += SWP_SCANNING;
>
> if (!(si->flags & SWP_WRITEOK))
>
> is used for that. I don't find they are used after unlock/lock cycle.
> Can you check it?

Thanks for your review.

Yes, I think you are right. I did notice the flags are not updated
properly in the latest mm branch, but did not notice it was a newly
introduced issue.

So I fixed the issue in
https://lore.kernel.org/linux-mm/20241022192451.38138-8-ryncsn@gmail.com/

But a quick fix can be posted before the series. The following patch
should be enough, what do you think?
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 46bd4b1a3c07..6b7d4ac1d4a3 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1041,7 +1041,9 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
 
 	VM_BUG_ON(!si->cluster_info);
 
-	while (n_ret < nr) {
+	si->flags += SWP_SCANNING;
+	while (n_ret < nr && si->flags & SWP_WRITEOK) {
 		unsigned long offset = cluster_alloc_swap_entry(si,
 							order, usage);
 
 		if (!offset)
@@ -1049,6 +1051,8 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
 		slots[n_ret++] = swp_entry(si->type, offset);
 	}
 
+	si->flags -= SWP_SCANNING;
+
 	return n_ret;
 }

The "+= SWP_SCANNING" looks ugly, but has to be done in such a way,
just like in the original allocation code. Multiple scanners can
update the flag, and SCANNING is the highest defined bit in the flag
field, it's not a strictly bit flag in such cases. This will be
cleaned up in a proper way in the patch I just mentioned.

>
> --
> Best Regards,
> Huang, Ying
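
An editorial note, in code form, on why the "+=" works even though
SWP_SCANNING looks like a flag: it is the highest defined flag bit, so
repeated additions accumulate as a counter above the ordinary flag bits.
The swapoff test shown in the comment is paraphrased, not quoted.

  si->flags += SWP_SCANNING;   /* scanner A enters; si->flags >= SWP_SCANNING */
  si->flags += SWP_SCANNING;   /* scanner B enters; the implicit counter is now 2 */

  /* swapoff waits with a test along the lines of
   *     while (si->flags >= SWP_SCANNING) ...
   * which stays true until every scanner has subtracted its share.
   */
  si->flags -= SWP_SCANNING;   /* scanner A leaves */
  si->flags -= SWP_SCANNING;   /* scanner B leaves; si->flags < SWP_SCANNING again */
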
* Re: cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly
2024-11-03 17:11 ` Kairui Song
@ 2024-11-04 3:02 ` Huang, Ying
2024-11-05 18:04 ` Kairui Song
0 siblings, 1 reply; 8+ messages in thread
From: Huang, Ying @ 2024-11-04 3:02 UTC (permalink / raw)
To: Kairui Song; +Cc: linux-mm, Chris Li

Kairui Song <ryncsn@gmail.com> writes:

> Hi Ying,
>
> On Fri, Oct 25, 2024 at 1:56 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Kairui Song <ryncsn@gmail.com> writes:
>>
>> > On Thu, Oct 24, 2024 at 11:29 AM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Hi, Kairui,
>> >>
>> >> When reading new swap cluster allocation code. Sorry, I didn't find
>> >> time to review your patch at the first place. I found that in
>> >> cluster_alloc_swap_entry(), the following code path is possible,
>> >>
>> >> cluster = this_cpu_ptr(si->percpu_cluster);
>> >> offset = alloc_swap_scan_cluster();
>> >> ...
>> >> spin_unlock(&ci->lock);
>> >> spin_unlock(&si->lock);
>> >> /* migrate to another cpu */
>> >> spin_lock(&ci->lock);
>> >> spin_lock(&si->lock);
>> >> cluster->next[order] = offset;
>> >>
>> >> That is, the per cpu cluster of a CPU may be changed on another CPU. I
>> >> guess that this will not cause some functionality issue. However, this
>> >> makes code harder to be reasoned. Is it possible to avoid unlock before
>> >> changing per cpu cluster? Or, give up updating per cpu cluster if we
>> >> need to unlock.
>> >
>> > Hi Ying
>> >
>> > Yes, I've noticed this when working on the new si->lock series. It may
>> > cause fragmentation, only if the CPU the current allocation migrates
>> > from requested another swap allocation, in such case it may waste one
>> > per cpu cluster.
>> >
>> > No functionality or leak issue, but make things harder to reason
>> > indeed, as you mentioned.
>> >
>> > In the new series I used a local lock to prevent migration, so the cpu
>> > is stable during the whole process. However that is not doable unless
>> > si->lock is removed from allocation path or there are lock dependency
>> > issue.
>> >
>> > Do we need to fix this separately in the current tree? The
>> > fragmentation is overall better with these patches, so I think that is
>> > causing any issues. Before these patches per CPU cluster can also be
>> > stolen by other CPUs so these were not guaranteed to be accurate.
>> >
>> > One easy way we can do is set the cluster->next[order] to 0 before
>> > scanning for a new cluster to use, and check if it is 0 before setting
>> > next[order] again.
>> >
>> > That may still cause fragmentation as migration will cause extra
>> > clusters to be scanned and used, which is hard to avoid without
>> > changing too much code (like the things being done in the new si->lock
>> > series).
>>
>> Find another possible more serious problem. After unlock/lock cycle,
>> the state of swap device may be changed (for example, swapoffed). So in
>> the previous code,
>>
>> si->flags += SWP_SCANNING;
>>
>> if (!(si->flags & SWP_WRITEOK))
>>
>> is used for that. I don't find they are used after unlock/lock cycle.
>> Can you check it?
>
> Thanks for your review.
>
> Yes, I think you are right. I did notice the flags are not updated
> properly in the latest mm branch, but did not notice it was a newly
> introduced issue.
>
> So I fixed the issue in
> https://lore.kernel.org/linux-mm/20241022192451.38138-8-ryncsn@gmail.com/
>
> But a quick fix can be posted before the series. Following patch
> should be enough, how do you think?
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 46bd4b1a3c07..6b7d4ac1d4a3 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1041,7 +1041,9 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
>
>  	VM_BUG_ON(!si->cluster_info);
>
> -	while (n_ret < nr) {
> +	si->flags += SWP_SCANNING;
> +	while (n_ret < nr && si->flags & SWP_WRITEOK) {
>  		unsigned long offset = cluster_alloc_swap_entry(si,
>  							order, usage);
>
>  		if (!offset)
> @@ -1049,6 +1051,8 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
>  		slots[n_ret++] = swp_entry(si->type, offset);
>  	}
>
> +	si->flags -= SWP_SCANNING;
> +
>  	return n_ret;
>  }
>
> The "+= SWP_SCANNING" looks ugly, but has to be done in such a way,
> just like in the original allocation code. Multiple scanners can
> update the flag, and SCANNING is the highest defined bit in the flag
> field, it's not a strictly bit flag in such cases. This will be
> cleaned up in a proper way in the patch I just mentioned.

I think that the logic of SWP_SCANNING is correct.

As for SWP_WRITEOK, we should avoid allocating swap entries as soon as
another task clears it. So, the above check may be too late, because we
have already allocated the swap entry. I think that we can check it in
cluster_reclaim_range() and refuse to allocate a swap entry there.

Combined with the per cpu cluster (cluster->next[]) changing issue we
discussed in another email of the thread, I think the long term
solution could be to separate the swap entry allocation and reclaiming
a bit more. For example, we could try to allocate some swap entries
with si->lock held first, record a possible reclaim offset if any,
change cluster->next[], then unlock, reclaim, and lock again.

--
Best Regards,
Huang, Ying
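
A hedged editorial sketch of the first suggestion above (refusing to allocate
once SWP_WRITEOK is gone, checked at the point where the allocator is about
to drop si->lock for reclaim). The real cluster_reclaim_range() parameters
and return convention may differ; this only shows where the check would sit.

  static bool cluster_reclaim_range(struct swap_info_struct *si, ...)
  {
          /*
           * Bail out before dropping si->lock: once SWP_WRITEOK is
           * cleared, swapoff is in progress and no new entry should be
           * handed out from this device.
           */
          if (!(si->flags & SWP_WRITEOK))
                  return false;

          /* ... drop si->lock, try to reclaim the range, retake si->lock ... */
          return true;
  }
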
* Re: cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly
2024-11-04 3:02 ` Huang, Ying
@ 2024-11-05 18:04 ` Kairui Song
2024-11-06 5:24 ` Huang, Ying
0 siblings, 1 reply; 8+ messages in thread
From: Kairui Song @ 2024-11-05 18:04 UTC (permalink / raw)
To: Huang, Ying; +Cc: linux-mm, Chris Li

On Mon, Nov 4, 2024 at 11:06 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> Kairui Song <ryncsn@gmail.com> writes:
>
> I think that the logic of SWP_SCANNING is correct.
>
> As for SWP_WRITEOK, we should avoid to allocate swap entries when other
> task clears it ASAP. So, the above checking may be too late, because we
> have already allocated the swap entry. I think that we can check it in
> cluster_reclaim_range() and refuse to allocate swap entry there.

Thanks for the feedback.

Not sure if this is a big issue. With the above patch SWP_WRITEOK will
prevent any new allocations from using the device, already ongoing
allocations may still use this device indeed, but in the end swapoff
will still succeed, maybe very slightly slower as swapoff may have to
unuse a few more entries. I can add another SWP_WRITEOK check to make
it better, will post a patch.

> Combined with the per cpu cluster (cluster->next[]) changing issue we
> discussed in another email of the thread. I think the long term
> solution could be separate the swap entry allocation and reclaiming a
> bit more. For example, we could try to allocate some swap entries with
> si->lock held firstly, record possible reclaim offset if any, change
> cluster->next[], then unlock, reclaim, lock.
>
> --
> Best Regards,
> Huang, Ying
* Re: cluster_alloc_swap_entry() may change per-cpu cluster of another CPU randomly
2024-11-05 18:04 ` Kairui Song
@ 2024-11-06 5:24 ` Huang, Ying
0 siblings, 0 replies; 8+ messages in thread
From: Huang, Ying @ 2024-11-06 5:24 UTC (permalink / raw)
To: Kairui Song; +Cc: linux-mm, Chris Li

Kairui Song <ryncsn@gmail.com> writes:

> On Mon, Nov 4, 2024 at 11:06 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Kairui Song <ryncsn@gmail.com> writes:
>>
>> I think that the logic of SWP_SCANNING is correct.
>>
>> As for SWP_WRITEOK, we should avoid to allocate swap entries when other
>> task clears it ASAP. So, the above checking may be too late, because we
>> have already allocated the swap entry. I think that we can check it in
>> cluster_reclaim_range() and refuse to allocate swap entry there.
>
> Thanks for the feedback.
>
> Not sure if this is a big issue. With the above patch SWP_WRITEOK will
> prevents any new allocations from using the device, already ongoing
> allocations may still use this device indeed, but in the end swapoff
> will still succeed, maybe very slightly slower as swapoff may have to
> unuse a few more entries. I can add another SWP_WRITEOK check to make
> it better, will post a patch.

IIUC, this is more serious than delaying swapoff. try_to_unuse() checks
si->inuse_pages; if it becomes 0, then swapoff() will go ahead without
checking it again. However, the current kernel may allocate a swap
entry even if si->inuse_pages is 0 and SWP_WRITEOK is cleared.

>> Combined with the per cpu cluster (cluster->next[]) changing issue we
>> discussed in another email of the thread. I think the long term
>> solution could be separate the swap entry allocation and reclaiming a
>> bit more. For example, we could try to allocate some swap entries with
>> si->lock held firstly, record possible reclaim offset if any, change
>> cluster->next[], then unlock, reclaim, lock.

--
Best Regards,
Huang, Ying
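
To make the race described above concrete, an editorial timeline of the
interleaving (comments only, not code taken from the kernel):

  /*
   *   swapoff side                          allocator side
   *   ------------                          --------------
   *   clear SWP_WRITEOK
   *   try_to_unuse() sees
   *   si->inuse_pages == 0
   *   proceed with teardown,                allocate a new entry from si
   *   no further re-check                   (no SWP_WRITEOK check after the
   *                                          unlock/lock cycle)
   *
   * The new allocation appears after swapoff has already concluded the
   * device is empty, so the SWP_WRITEOK check has to happen before an
   * entry is handed out, not after.
   */
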