From: Baoquan He <bhe@redhat.com>
To: Kairui Song <ryncsn@gmail.com>
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
Chris Li <chrisl@kernel.org>, Barry Song <v-songbaohua@oppo.com>,
Ryan Roberts <ryan.roberts@arm.com>,
Hugh Dickins <hughd@google.com>,
Yosry Ahmed <yosryahmed@google.com>,
"Huang, Ying" <ying.huang@linux.alibaba.com>,
Nhat Pham <nphamcs@gmail.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Kalesh Singh <kaleshsingh@google.com>,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 09/13] mm, swap: reduce contention on device lock
Date: Fri, 10 Jan 2025 19:23:47 +0800
Message-ID: <Z4EDQx2TwLokw1hJ@MiWiFi-R3L-srv>
In-Reply-To: <CAMgjq7Ad3v+oyBH8598-0emcqnLNK9izS-azgn89J5q=a=6N1w@mail.gmail.com>
On 01/09/25 at 10:15am, Kairui Song wrote:
> On Wed, Jan 8, 2025 at 7:10 PM Baoquan He <bhe@redhat.com> wrote:
> >
>
> Thanks for the very detailed review!
>
> > On 12/31/24 at 01:46am, Kairui Song wrote:
> > ......snip.....
> > > ---
> > > include/linux/swap.h | 3 +-
> > > mm/swapfile.c | 435 ++++++++++++++++++++++++-------------------
> > > 2 files changed, 246 insertions(+), 192 deletions(-)
> > >
> > > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > > index 339d7f0192ff..c4ff31cb6bde 100644
> > > --- a/include/linux/swap.h
> > > +++ b/include/linux/swap.h
> > > @@ -291,6 +291,7 @@ enum swap_cluster_flags {
> > > * throughput.
> > > */
> > > struct percpu_cluster {
> > > + local_lock_t lock; /* Protect the percpu_cluster above */
> > > unsigned int next[SWAP_NR_ORDERS]; /* Likely next allocation offset */
> > > };
> > >
> > > @@ -313,7 +314,7 @@ struct swap_info_struct {
> > > /* list of cluster that contains at least one free slot */
> > > struct list_head frag_clusters[SWAP_NR_ORDERS];
> > > /* list of cluster that are fragmented or contented */
> > > - unsigned int frag_cluster_nr[SWAP_NR_ORDERS];
> > > + atomic_long_t frag_cluster_nr[SWAP_NR_ORDERS];
> > > unsigned int pages; /* total of usable pages of swap */
> > > atomic_long_t inuse_pages; /* number of those currently in use */
> > > struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
> > > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > > index 7795a3d27273..dadd4fead689 100644
> > > --- a/mm/swapfile.c
> > > +++ b/mm/swapfile.c
...snip...
> > > @@ -475,39 +488,90 @@ static void swap_cluster_schedule_discard(struct swap_info_struct *si,
> > >
> > > static void __free_cluster(struct swap_info_struct *si, struct swap_cluster_info *ci)
> > > {
> > > - lockdep_assert_held(&si->lock);
> > > lockdep_assert_held(&ci->lock);
> > > cluster_move(si, ci, &si->free_clusters, CLUSTER_FLAG_FREE);
> > > ci->order = 0;
> > > }
> > >
> > > +/*
> > > + * Isolate and lock the first cluster that is not contended on a list,
> > > + * clearing its flag before it is taken off-list. Cluster flags must
> > > + * be in sync with list status, so cluster updaters can always know
> > > + * the cluster list status without touching si lock.
> > > + *
> > > + * Note it's possible that all clusters on a list are contended so
> > > + * this returns NULL for a non-empty list.
> > > + */
> > > +static struct swap_cluster_info *cluster_isolate_lock(
> > > + struct swap_info_struct *si, struct list_head *list)
> > > +{
> > > + struct swap_cluster_info *ci, *ret = NULL;
> > > +
> > > + spin_lock(&si->lock);
> > > +
> > > + if (unlikely(!(si->flags & SWP_WRITEOK)))
> > > + goto out;
> > > +
> > > + list_for_each_entry(ci, list, list) {
> > > + if (!spin_trylock(&ci->lock))
> > > + continue;
> > > +
> > > + /* We may only isolate and clear flags of the following lists */
> > > + VM_BUG_ON(!ci->flags);
> > > + VM_BUG_ON(ci->flags > CLUSTER_FLAG_USABLE &&
> > > + ci->flags != CLUSTER_FLAG_FULL);
> > > +
> > > + list_del(&ci->list);
> > > + ci->flags = CLUSTER_FLAG_NONE;
> > > + ret = ci;
> > > + break;
> > > + }
> > > +out:
> > > + spin_unlock(&si->lock);
> > > +
> > > + return ret;
> > > +}
> > > +
> > > /*
> > > * Doing discard actually. After a cluster discard is finished, the cluster
> > > - * will be added to free cluster list. caller should hold si->lock.
> > > -*/
> > > -static void swap_do_scheduled_discard(struct swap_info_struct *si)
> > > + * will be added to the free cluster list. Discard clusters are a
> > > + * bit special as they don't participate in allocation or reclaim,
> > > + * so clusters marked CLUSTER_FLAG_DISCARD must remain off-list or
> > > + * on the discard list.
> > > + */
> > > +static bool swap_do_scheduled_discard(struct swap_info_struct *si)
> > > {
> > > struct swap_cluster_info *ci;
> > > + bool ret = false;
> > > unsigned int idx;
> > >
> > > + spin_lock(&si->lock);
> > > while (!list_empty(&si->discard_clusters)) {
> > > ci = list_first_entry(&si->discard_clusters, struct swap_cluster_info, list);
> > > + /*
> > > + * Delete the cluster from the list, but don't clear its flags until
> > > + * the discard is done, so isolation and relocation will skip it.
> > > + */
> > > list_del(&ci->list);
> >
> > I don't understand the above comment. ci has been taken off the list,
> > while allocation needs to isolate from a usable list. Even though we
> > clear ci->flags now, how come isolation and relocation will touch it?
> > I may be missing something here.
>
> There are many cases; one possible and common situation is that the
> percpu cluster (si->percpu_cluster of another CPU) is still pointing
> to it.
>
> Also, this commit removed the protection of the si lock on allocation,
> and the allocation path may also drop the ci lock to call reclaim,
> which means a cluster could be used or freed by anyone before the
> allocator reacquires the ci lock. In that case, the allocator could
> see a discard cluster.
>
> So we don't clear the discard flag, in case anyone misuses it.
>
> I can add more inline comments on this; there are already some related
> comments above the function relocate_cluster, and I could add some
> more referencing that.
Thanks for your great explanation. I understand that si->percpu_cluster
could point to a discarded ci, and that a ci could be taken from the
nonfull or frag lists but later become discarded if it is freed on
another CPU during the cluster_reclaim_range() invocation. What I still
haven't got is how isolation could see a discarded ci in
cluster_isolate_lock(). Could you give an example of how that happens?
Surely, I understand that keeping the discard flag is necessary so that
checks like cluster_is_usable() will return the expected value.
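To spell out the case I do follow, here is my own rough sketch of the
fast path (paraphrased from the hunk above, not the actual patch code;
lock_cluster()/unlock_cluster()/cluster_is_usable() as in the patch):

	/* CPU A, fast path: offset comes from a stale percpu next[] */
	ci = lock_cluster(si, offset);
	if (!cluster_is_usable(ci, order)) {
		/*
		 * CPU B may have freed this cluster and queued it for
		 * discard in the meantime; the still-set
		 * CLUSTER_FLAG_DISCARD makes it look unusable, so we
		 * bail out instead of allocating half-discarded slots.
		 */
		unlock_cluster(ci);
	}

If the flag were cleared when the cluster is taken off the discard
list, this check could not tell such a cluster from a usable one.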
And by the way, I haven't got when the 'if (!ci->count)' case could
happen in relocate_cluster(), since we have already filtered away
discarded ci with the 'if (cluster_is_discard(ci))' check. I asked in
another thread; could you help explain it?
static void relocate_cluster(struct swap_info_struct *si,
			     struct swap_cluster_info *ci)
{
	lockdep_assert_held(&ci->lock);

	/* Discard cluster must remain off-list or on discard list */
	if (cluster_is_discard(ci))
		return;

	if (!ci->count) {
		free_cluster(si, ci);
		...
	}
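For context, the allocation-side flow I'm reading from this patch is
roughly the following (my own paraphrase, not the actual patch code):

	/* takes ci off-list and clears its flags */
	ci = cluster_isolate_lock(si, &si->nonfull_clusters[order]);
	offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
					 &found, order, usage);
	/*
	 * The scan may drop ci->lock to do reclaim, so other CPUs can
	 * free slots of this cluster meanwhile, before the cluster is
	 * relocated back onto a list matching its final state.
	 */

Is that window, where all remaining entries get freed by other CPUs,
how ci->count can reach zero by the time relocate_cluster() runs?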
>
> >
> > > - /* Must clear flag when taking a cluster off-list */
> > > - ci->flags = CLUSTER_FLAG_NONE;
> > > idx = cluster_index(si, ci);
> > > spin_unlock(&si->lock);
> > > -
> > > discard_swap_cluster(si, idx * SWAPFILE_CLUSTER,
> > > SWAPFILE_CLUSTER);
> > >
> > > - spin_lock(&si->lock);
> > > spin_lock(&ci->lock);
> > > - __free_cluster(si, ci);
> > > + /*
> > > + * Discard is done, clear its flags as it's now off-list,
> > > + * then return the cluster to the allocation list.
> > > + */
> > > + ci->flags = CLUSTER_FLAG_NONE;
> > > memset(si->swap_map + idx * SWAPFILE_CLUSTER,
> > > 0, SWAPFILE_CLUSTER);
> > > + __free_cluster(si, ci);
> > > spin_unlock(&ci->lock);
> > > + ret = true;
> > > + spin_lock(&si->lock);
> > > }
> > > + spin_unlock(&si->lock);
> > > + return ret;
> > > }
> > >
> > > static void swap_discard_work(struct work_struct *work)
> > ......snip....
> > > @@ -791,29 +873,34 @@ static void swap_reclaim_work(struct work_struct *work)
> > > static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order,
> > > unsigned char usage)
> > > {
> > > - struct percpu_cluster *cluster;
> > > struct swap_cluster_info *ci;
> > > unsigned int offset, found = 0;
> > >
> > > -new_cluster:
> > > - lockdep_assert_held(&si->lock);
> > > - cluster = this_cpu_ptr(si->percpu_cluster);
> > > - offset = cluster->next[order];
> > > + /* Fast path using per CPU cluster */
> > > + local_lock(&si->percpu_cluster->lock);
> > > + offset = __this_cpu_read(si->percpu_cluster->next[order]);
> > > if (offset) {
> > > - offset = alloc_swap_scan_cluster(si, offset, &found, order, usage);
> > > + ci = lock_cluster(si, offset);
> > > + /* Cluster could have been used by another order */
> > > + if (cluster_is_usable(ci, order)) {
> > > + if (cluster_is_free(ci))
> > > + offset = cluster_offset(si, ci);
> > > + offset = alloc_swap_scan_cluster(si, offset, &found,
> > > + order, usage);
> > > + } else {
> > > + unlock_cluster(ci);
> > > + }
> > > if (found)
> > > goto done;
> > > }
> > >
> > > - if (!list_empty(&si->free_clusters)) {
> > > - ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list);
> > > - offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage);
> > > - /*
> > > - * Either we didn't touch the cluster due to swapoff,
> > > - * or the allocation must success.
> > > - */
> > > - VM_BUG_ON((si->flags & SWP_WRITEOK) && !found);
> > > - goto done;
> > > +new_cluster:
> > > + ci = cluster_isolate_lock(si, &si->free_clusters);
> > > + if (ci) {
> > > + offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> > > + &found, order, usage);
> > > + if (found)
> > > + goto done;
> > > }
> > >
> > > /* Try reclaim from full clusters if free clusters list is drained */
> > > @@ -821,49 +908,45 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o
> > > swap_reclaim_full_clusters(si, false);
> > >
> > > if (order < PMD_ORDER) {
> > > - unsigned int frags = 0;
> > > + unsigned int frags = 0, frags_existing;
> > >
> > > - while (!list_empty(&si->nonfull_clusters[order])) {
> > > - ci = list_first_entry(&si->nonfull_clusters[order],
> > > - struct swap_cluster_info, list);
> > > - cluster_move(si, ci, &si->frag_clusters[order], CLUSTER_FLAG_FRAG);
> > > + while ((ci = cluster_isolate_lock(si, &si->nonfull_clusters[order]))) {
> > > offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> > > &found, order, usage);
> > > - frags++;
> > > + /*
> > > + * With `fragmenting` set to true, it will surely take
> > ~~~~~~~~~~~
> > wondering what 'fragmenting' means here.
>
> This comment is a bit out of context indeed; it's actually trying to
> say that the alloc_swap_scan_cluster call above should move the
> cluster to the tail. I'll update the comment.
>
>
>
> >
> > > + * the cluster off nonfull list
> > > + */
> > > if (found)
> > > goto done;
> > > + frags++;
> > > }
> > >
> > > - /*
> > > - * Nonfull clusters are moved to frag tail if we reached
> > > - * here, count them too, don't over scan the frag list.
> > > - */
> > > - while (frags < si->frag_cluster_nr[order]) {
> > > - ci = list_first_entry(&si->frag_clusters[order],
> > > - struct swap_cluster_info, list);
> > > + frags_existing = atomic_long_read(&si->frag_cluster_nr[order]);
> > > + while (frags < frags_existing &&
> > > + (ci = cluster_isolate_lock(si, &si->frag_clusters[order]))) {
> > > + atomic_long_dec(&si->frag_cluster_nr[order]);
> > > /*
> > > - * Rotate the frag list to iterate, they were all failing
> > > - * high order allocation or moved here due to per-CPU usage,
> > > - * this help keeping usable cluster ahead.
> > > + * Rotate the frag list to iterate, they were all
> > > + * failing high order allocation or moved here due to
> > > + * per-CPU usage, but they could contain newly released
> > > + * reclaimable (e.g. lazy-freed swap cache) slots.
> > > */
> > > - list_move_tail(&ci->list, &si->frag_clusters[order]);
> > > offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci),
> > > &found, order, usage);
> > > - frags++;
> > > if (found)
> > > goto done;
> > > + frags++;
> > > }
> > > }
> > >
> > > - if (!list_empty(&si->discard_clusters)) {
> > > - /*
> > > - * we don't have free cluster but have some clusters in
> > > - * discarding, do discard now and reclaim them, then
> > > - * reread cluster_next_cpu since we dropped si->lock
> > > - */
> > > - swap_do_scheduled_discard(si);
> > > + /*
> > > + * We don't have a free cluster, but some clusters are being
> > > + * discarded; do the discard now and reclaim them, then
> > > + * reread cluster_next_cpu since we dropped si->lock
> > > + */
> > > + if ((si->flags & SWP_PAGE_DISCARD) && swap_do_scheduled_discard(si))
> > > goto new_cluster;
> > > - }
> > >
> > > if (order)
> > > goto done;
> > .....
> >
> >
>