From: Kairui Song <ryncsn@gmail.com>
To: YoungJun Park <youngjun.park@lge.com>
Cc: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
Chris Li <chrisl@kernel.org>,
Kemeng Shi <shikemeng@huaweicloud.com>,
Nhat Pham <nphamcs@gmail.com>, Baoquan He <bhe@redhat.com>,
Barry Song <baohua@kernel.org>,
Carsten Grohmann <mail@carstengrohmann.de>,
"Rafael J. Wysocki" <rafael@kernel.org>,
linux-kernel@vger.kernel.org,
"open list:SUSPEND TO RAM" <linux-pm@vger.kernel.org>,
taejoon.song@lge.com,
hyungjun.cho@lge.com, Carsten Grohmann <carstengrohmann@gmx.de>,
stable@vger.kernel.org
Subject: Re: [PATCH v4 1/3] mm, swap: speed up hibernation allocation and writeout
Date: Tue, 24 Feb 2026 16:04:07 +0800 [thread overview]
Message-ID: <CAMgjq7BeA4cr5DSjpdaTVRRmcb_Pq+68yGZiiDg21fNPfGUQNg@mail.gmail.com> (raw)
In-Reply-To: <aZ1X1OwbAUq1k+C6@yjaykim-PowerEdge-T330>
On Tue, Feb 24, 2026 at 3:50 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Mon, Feb 16, 2026 at 10:58:02PM +0800, Kairui Song via B4 Relay wrote:
> > From: Kairui Song <kasong@tencent.com>
> >
> > Since commit 0ff67f990bd4 ("mm, swap: remove swap slot cache"),
> > hibernation has been using the swap slot slow allocation path for
> > simplicity, which turns out to cause a regression on some devices:
> > the allocator now rotates clusters too often, leading to slower
> > allocation and a more random distribution of data.
> ...
> > diff --git a/mm/swapfile.c b/mm/swapfile.c
> > index c6863ff7152c..32e0e7545ab8 100644
> > --- a/mm/swapfile.c
> > +++ b/mm/swapfile.c
> > @@ -1926,8 +1926,9 @@ void swap_put_entries_direct(swp_entry_t entry, int nr)
> > /* Allocate a slot for hibernation */
> > swp_entry_t swap_alloc_hibernation_slot(int type)
> > {
> > - struct swap_info_struct *si = swap_type_to_info(type);
> > - unsigned long offset;
> > + struct swap_info_struct *pcp_si, *si = swap_type_to_info(type);
> > + unsigned long pcp_offset, offset = SWAP_ENTRY_INVALID;
> > + struct swap_cluster_info *ci;
> > swp_entry_t entry = {0};
> >
> > if (!si)
> > @@ -1937,11 +1938,21 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
> > if (get_swap_device_info(si)) {
>
> Hi Kairui :)
>
> Reading through the patch, I have some thoughts and review comments on
> the hibernation slot allocation logic, and I'd like to discuss potential
> improvements. (Somewhat long... a lot of thoughts came to mind.)
>
> First, regarding the race with swapoff and refcounting.
>
> The code identifies the swap type before allocation, so a swapoff could
> occur in between. It seems safer to acquire the reference when identifying
> the type (e.g., find_first_swap). Also, instead of repeating get/put for
> every slot (allocation and free), could we hold the reference once during
> the initial lookup and release it after the image load? This avoids
> overhead since swapoff is effectively blocked once hibernation slots are
> allocated.
Hi Youngjun,

Yes, that's definitely doable, but it requires the hibernation side to
change how it uses the API, which could be a long-term work item.
>
> > if (si->flags & SWP_WRITEOK) {
> > /*
> > - * Grab the local lock to be compliant
> > - * with swap table allocation.
> > + * Try the local cluster first if it matches the device. If
> > + * not, try grab a new cluster and override local cluster.
> > */
> > local_lock(&percpu_swap_cluster.lock);
>
> Second, regarding local_lock:
>
> It seems mandatory now because distinguishing the lock context during swap
> table allocation is tricky (e.g., GFP_KERNEL allocation assumes a local
> locked context). Have you considered modifying the swap table allocation
> logic to handle this specifically? This might allow us to avoid holding the
> local_lock, especially if the device is not SWP_SOLIDSTATE.
I think you got this part slightly wrong. We need the lock because the
code calls this_cpu_* operations later. And GFP_KERNEL doesn't assume a
local-locked context; instead, the allocator has to release the lock to
do a sleeping allocation if the atomic allocation fails, and that can
happen here.

But I agree we can definitely simplify this with some abstraction or a wrapper.
>
> > - offset = cluster_alloc_swap_entry(si, NULL);
> > + pcp_si = this_cpu_read(percpu_swap_cluster.si[0]);
> > + pcp_offset = this_cpu_read(percpu_swap_cluster.offset[0]);
> > + if (pcp_si == si && pcp_offset) {
> > + ci = swap_cluster_lock(si, pcp_offset);
> > + if (cluster_is_usable(ci, 0))
> > + offset = alloc_swap_scan_cluster(si, ci, NULL, pcp_offset);
> > + else
> > + swap_cluster_unlock(ci);
> > + }
> > + if (!offset)
> > + offset = cluster_alloc_swap_entry(si, NULL);
> > local_unlock(&percpu_swap_cluster.lock);
> > if (offset)
> > entry = swp_entry(si->type, offset);
>
> Third, regarding cluster allocation:
>
> 1. If hibernation targets a lower-priority device, the per-cpu cluster
> usage might cause priority inversion (though minimal).
Right, the problem goes away if we move the pcp cluster back to the
device level. It's a trivial problem, so I don't think we need to worry
about it now.
>
> 2. Have you considered treating clusters as a global resource for this
> case? For instance, caching next_offset in si(using union on global_cluster or new field) or allowing the
> allocator to calculate the next value directly, rather than splitting
> clusters per CPU.
I'm not sure how much code change that would involve, or whether it's
worth it. Hibernation is supposed to stop every process, so I don't
think concurrent memory pressure is something we expect here. Even if
it happens, we are still fine.
>
> Finally, regarding readahead and freeing:
>
> Hibernation slots might be read during cluster-based readahead. Can we
> avoid this (e.g., by checking for a NULL fake shadow entry or adding a specific
> check for hibernation slots)? If so, we could also avoid triggering
> try_to_reclaim when freeing these slots.
Definitely! I have a patch that introduces a hibernation / exclusive
type in the swap table. Remember the is_countable macro you commented
on previously? That's reserved for this. Entries of the hibernation
type are not countable (exclusive to hibernation; maybe I need a better
name, though), so readahead or any accidental IO will always skip them.
By then this ugly try_to_reclaim will be gone.
> Thanks for your work!
And thanks for your review :)
Thread overview: 11+ messages [~2026-02-24 8:04 UTC]
2026-02-16 14:58 [PATCH v4 0/3] mm/swap: hibernate: improve hibernate performance with new allocator Kairui Song via B4 Relay
2026-02-16 14:58 ` [PATCH v4 1/3] mm, swap: speed up hibernation allocation and writeout Kairui Song via B4 Relay
2026-02-16 21:42 ` Andrew Morton
2026-02-17 18:37 ` Kairui Song
2026-02-24 7:48 ` YoungJun Park
2026-02-24 8:04 ` Kairui Song [this message]
2026-02-24 11:42 ` YoungJun Park
2026-02-16 14:58 ` [PATCH v4 2/3] mm, swap: reduce indention for hibernate allocation helper Kairui Song via B4 Relay
2026-02-18 8:21 ` Barry Song
2026-02-18 8:58 ` Kairui Song
2026-02-16 14:58 ` [PATCH v4 3/3] mm, swap: merge common convention and simplify " Kairui Song via B4 Relay