* [PATCH v4 0/3] mm/swap: hibernate: improve hibernate performance with new allocator
@ 2026-02-16 14:58 Kairui Song via B4 Relay
2026-02-16 14:58 ` [PATCH v4 1/3] mm, swap: speed up hibernation allocation and writeout Kairui Song via B4 Relay
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-16 14:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He,
Barry Song, Carsten Grohmann, Rafael J. Wysocki, linux-kernel,
open list:SUSPEND TO RAM, Carsten Grohmann, Kairui Song, stable
The new swap allocator didn't provide a high-performance allocation
method for hibernation; it simply left it on the easy slow path. As a
result, hibernation performance is quite bad on some devices.
Fix it by implementing hibernate support for the fast allocation path.
This regression seems to only happen on SSD devices with poor 4k
performance. I've tested several different NVMe and SSD setups; the
performance difference is tiny on most of them, but a Samsung SSD 830
Series (SATA II, 3.0 Gbps) showed a big difference.
Test result with Samsung SSD 830 Series (SATA II, 3.0 Gbps), thanks
to Carsten Grohmann [1]:
6.19: 324 seconds
After this series: 35 seconds
Test result with SAMSUNG MZ7LH480HAHQ-00005 (SATA 3.2, 6.0 Gb/s):
Before 0ff67f990bd4: Wrote 2230700 kbytes in 4.47 seconds (499.03 MB/s)
After 0ff67f990bd4: Wrote 2215472 kbytes in 4.44 seconds (498.98 MB/s)
After this series: Wrote 2038748 kbytes in 4.04 seconds (504.64 MB/s)
Test result with Memblaze P5910DT0384M00:
Before 0ff67f990bd4: Wrote 2222772 kbytes in 0.84 seconds (2646.15 MB/s)
After 0ff67f990bd4: Wrote 2224184 kbytes in 0.90 seconds (2471.31 MB/s)
After this series: Wrote 1559088 kbytes in 0.55 seconds (2834.70 MB/s)
The performance is almost the same for blazing fast SSDs, but for some
SSDs, the performance is several times better.
Patch 1 improves hibernation performance by using the fast path, and
patches 2 and 3 clean up the code a bit since there are now multiple
fast path users sharing similar conventions.
Signed-off-by: Kairui Song <kasong@tencent.com>
Tested-by: Carsten Grohmann <mail@carstengrohmann.de>
Link: https://lore.kernel.org/linux-mm/8b4bdcfa-ce3f-4e23-839f-31367df7c18f@gmx.de/ [1]
---
Changes in v4:
- Reduce indentation and improve code comments, as suggested by Barry Song.
- Link to v3: https://lore.kernel.org/r/20260216-hibernate-perf-v3-0-74e025091145@tencent.com
Changes in v3:
- Split the indentation change into a standalone patch.
- Update mail address and add Cc stable.
- Link to v2: https://lore.kernel.org/r/20260215-hibernate-perf-v2-0-cf28c75b04b7@tencent.com
Changes in v2:
- Based on mm-unstable, resend using b4's relay to fix mismatched patch content.
- Link to v1: https://lore.kernel.org/r/20260215-hibernate-perf-v1-0-f55ee9ee67db@tencent.com
---
Kairui Song (3):
mm, swap: speed up hibernation allocation and writeout
mm, swap: reduce indention for hibernate allocation helper
mm, swap: merge common convention and simplify allocation helper
mm/swapfile.c | 92 ++++++++++++++++++++++++++++++++++-------------------------
1 file changed, 53 insertions(+), 39 deletions(-)
---
base-commit: 53f061047924205138ad9bc315885255f7cc4944
change-id: 20260212-hibernate-perf-fb7783b2b252
Best regards,
--
Kairui Song <kasong@tencent.com>
^ permalink raw reply [flat|nested] 11+ messages in thread* [PATCH v4 1/3] mm, swap: speed up hibernation allocation and writeout 2026-02-16 14:58 [PATCH v4 0/3] mm/swap: hibernate: improve hibernate performance with new allocator Kairui Song via B4 Relay @ 2026-02-16 14:58 ` Kairui Song via B4 Relay 2026-02-16 21:42 ` Andrew Morton 2026-02-24 7:48 ` YoungJun Park 2026-02-16 14:58 ` [PATCH v4 2/3] mm, swap: reduce indention for hibernate allocation helper Kairui Song via B4 Relay 2026-02-16 14:58 ` [PATCH v4 3/3] mm, swap: merge common convention and simplify " Kairui Song via B4 Relay 2 siblings, 2 replies; 11+ messages in thread From: Kairui Song via B4 Relay @ 2026-02-16 14:58 UTC (permalink / raw) To: linux-mm Cc: Andrew Morton, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Carsten Grohmann, Rafael J. Wysocki, linux-kernel, open list:SUSPEND TO RAM, Carsten Grohmann, Kairui Song, stable From: Kairui Song <kasong@tencent.com> Since commit 0ff67f990bd4 ("mm, swap: remove swap slot cache"), hibernation has been using the swap slot slow allocation path for simplification, which turns out might cause regression for some devices because the allocator now rotates clusters too often, leading to slower allocation and more random distribution of data. Fast allocation is not complex, so implement hibernation support as well. 
Test result with Samsung SSD 830 Series (SATA II, 3.0 Gbps) shows the performance is several times better [1]: 6.19: 324 seconds After this series: 35 seconds Fixes: 0ff67f990bd4 ("mm, swap: remove swap slot cache") Reported-by: Carsten Grohmann <mail@carstengrohmann.de> Closes: https://lore.kernel.org/linux-mm/20260206121151.dea3633d1f0ded7bbf49c22e@linux-foundation.org/ Link: https://lore.kernel.org/linux-mm/8b4bdcfa-ce3f-4e23-839f-31367df7c18f@gmx.de/ [1] Cc: stable@vger.kernel.org Signed-off-by: Kairui Song <kasong@tencent.com> --- mm/swapfile.c | 21 ++++++++++++++++----- 1 file changed, 16 insertions(+), 5 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index c6863ff7152c..32e0e7545ab8 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1926,8 +1926,9 @@ void swap_put_entries_direct(swp_entry_t entry, int nr) /* Allocate a slot for hibernation */ swp_entry_t swap_alloc_hibernation_slot(int type) { - struct swap_info_struct *si = swap_type_to_info(type); - unsigned long offset; + struct swap_info_struct *pcp_si, *si = swap_type_to_info(type); + unsigned long pcp_offset, offset = SWAP_ENTRY_INVALID; + struct swap_cluster_info *ci; swp_entry_t entry = {0}; if (!si) @@ -1937,11 +1938,21 @@ swp_entry_t swap_alloc_hibernation_slot(int type) if (get_swap_device_info(si)) { if (si->flags & SWP_WRITEOK) { /* - * Grab the local lock to be compliant - * with swap table allocation. + * Try the local cluster first if it matches the device. If + * not, try grab a new cluster and override local cluster. 
*/ local_lock(&percpu_swap_cluster.lock); - offset = cluster_alloc_swap_entry(si, NULL); + pcp_si = this_cpu_read(percpu_swap_cluster.si[0]); + pcp_offset = this_cpu_read(percpu_swap_cluster.offset[0]); + if (pcp_si == si && pcp_offset) { + ci = swap_cluster_lock(si, pcp_offset); + if (cluster_is_usable(ci, 0)) + offset = alloc_swap_scan_cluster(si, ci, NULL, pcp_offset); + else + swap_cluster_unlock(ci); + } + if (!offset) + offset = cluster_alloc_swap_entry(si, NULL); local_unlock(&percpu_swap_cluster.lock); if (offset) entry = swp_entry(si->type, offset); -- 2.52.0 ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v4 1/3] mm, swap: speed up hibernation allocation and writeout 2026-02-16 14:58 ` [PATCH v4 1/3] mm, swap: speed up hibernation allocation and writeout Kairui Song via B4 Relay @ 2026-02-16 21:42 ` Andrew Morton 2026-02-17 18:37 ` Kairui Song 2026-02-24 7:48 ` YoungJun Park 1 sibling, 1 reply; 11+ messages in thread From: Andrew Morton @ 2026-02-16 21:42 UTC (permalink / raw) To: kasong Cc: Kairui Song via B4 Relay, linux-mm, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Carsten Grohmann, Rafael J. Wysocki, linux-kernel, open list:SUSPEND TO RAM, Carsten Grohmann, stable On Mon, 16 Feb 2026 22:58:02 +0800 Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote: > From: Kairui Song <kasong@tencent.com> > > Since commit 0ff67f990bd4 ("mm, swap: remove swap slot cache"), > hibernation has been using the swap slot slow allocation path for > simplification, which turns out might cause regression for some > devices because the allocator now rotates clusters too often, leading to > slower allocation and more random distribution of data. > > Fast allocation is not complex, so implement hibernation support as > well. > > Test result with Samsung SSD 830 Series (SATA II, 3.0 Gbps) shows the > performance is several times better [1]: > 6.19: 324 seconds > After this series: 35 seconds Thanks. I'll merge only [1/3] at this time, into mm-unstable at this time (I'll move it to mm-unstable after resyncing mm.git with upstream). We don't want the other two patches present during testing of this backportable fix because doing so partially invalidates that testing - [2/3] and[3/3] might accidentally fix issues which [1/3] added. It happens, occasionally. 
> --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -1926,8 +1926,9 @@ void swap_put_entries_direct(swp_entry_t entry, int nr) > /* Allocate a slot for hibernation */ > swp_entry_t swap_alloc_hibernation_slot(int type) > { > - struct swap_info_struct *si = swap_type_to_info(type); > - unsigned long offset; > + struct swap_info_struct *pcp_si, *si = swap_type_to_info(type); > + unsigned long pcp_offset, offset = SWAP_ENTRY_INVALID; > + struct swap_cluster_info *ci; > swp_entry_t entry = {0}; > > if (!si) > @@ -1937,11 +1938,21 @@ swp_entry_t swap_alloc_hibernation_slot(int type) > if (get_swap_device_info(si)) { > if (si->flags & SWP_WRITEOK) { > /* > - * Grab the local lock to be compliant > - * with swap table allocation. > + * Try the local cluster first if it matches the device. If > + * not, try grab a new cluster and override local cluster. > */ nanonit, worrying about 80-cols is rather old fashioned but there's no reason to overflow 80 in a block comment! > local_lock(&percpu_swap_cluster.lock); > - offset = cluster_alloc_swap_entry(si, NULL); > + pcp_si = this_cpu_read(percpu_swap_cluster.si[0]); > + pcp_offset = this_cpu_read(percpu_swap_cluster.offset[0]); > + if (pcp_si == si && pcp_offset) { > + ci = swap_cluster_lock(si, pcp_offset); > + if (cluster_is_usable(ci, 0)) > + offset = alloc_swap_scan_cluster(si, ci, NULL, pcp_offset); > + else > + swap_cluster_unlock(ci); > + } > + if (!offset) > + offset = cluster_alloc_swap_entry(si, NULL); > local_unlock(&percpu_swap_cluster.lock); > if (offset) > entry = swp_entry(si->type, offset); ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v4 1/3] mm, swap: speed up hibernation allocation and writeout 2026-02-16 21:42 ` Andrew Morton @ 2026-02-17 18:37 ` Kairui Song 0 siblings, 0 replies; 11+ messages in thread From: Kairui Song @ 2026-02-17 18:37 UTC (permalink / raw) To: Andrew Morton Cc: Kairui Song via B4 Relay, linux-mm, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Carsten Grohmann, Rafael J. Wysocki, linux-kernel, open list:SUSPEND TO RAM, Carsten Grohmann, stable On Tue, Feb 17, 2026 at 5:42 AM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Mon, 16 Feb 2026 22:58:02 +0800 Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote: > > > From: Kairui Song <kasong@tencent.com> > > > > Since commit 0ff67f990bd4 ("mm, swap: remove swap slot cache"), > > hibernation has been using the swap slot slow allocation path for > > simplification, which turns out might cause regression for some > > devices because the allocator now rotates clusters too often, leading to > > slower allocation and more random distribution of data. > > > > Fast allocation is not complex, so implement hibernation support as > > well. > > > > Test result with Samsung SSD 830 Series (SATA II, 3.0 Gbps) shows the > > performance is several times better [1]: > > 6.19: 324 seconds > > After this series: 35 seconds > > Thanks. > > I'll merge only [1/3] at this time, into mm-unstable at this time (I'll > move it to mm-unstable after resyncing mm.git with upstream). > > We don't want the other two patches present during testing of this > backportable fix because doing so partially invalidates that testing - > [2/3] and[3/3] might accidentally fix issues which [1/3] added. It happens, > occasionally. Sounds good to me. I'll send the cleanup separately sometime later again. 
> > > --- a/mm/swapfile.c > > +++ b/mm/swapfile.c > > @@ -1926,8 +1926,9 @@ void swap_put_entries_direct(swp_entry_t entry, int nr) > > /* Allocate a slot for hibernation */ > > swp_entry_t swap_alloc_hibernation_slot(int type) > > { > > - struct swap_info_struct *si = swap_type_to_info(type); > > - unsigned long offset; > > + struct swap_info_struct *pcp_si, *si = swap_type_to_info(type); > > + unsigned long pcp_offset, offset = SWAP_ENTRY_INVALID; > > + struct swap_cluster_info *ci; > > swp_entry_t entry = {0}; > > > > if (!si) > > @@ -1937,11 +1938,21 @@ swp_entry_t swap_alloc_hibernation_slot(int type) > > if (get_swap_device_info(si)) { > > if (si->flags & SWP_WRITEOK) { > > /* > > - * Grab the local lock to be compliant > > - * with swap table allocation. > > + * Try the local cluster first if it matches the device. If > > + * not, try grab a new cluster and override local cluster. > > */ > > nanonit, worrying about 80-cols is rather old fashioned but there's no > reason to overflow 80 in a block comment! I'll be more careful about this, thanks. ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v4 1/3] mm, swap: speed up hibernation allocation and writeout 2026-02-16 14:58 ` [PATCH v4 1/3] mm, swap: speed up hibernation allocation and writeout Kairui Song via B4 Relay 2026-02-16 21:42 ` Andrew Morton @ 2026-02-24 7:48 ` YoungJun Park 2026-02-24 8:04 ` Kairui Song 1 sibling, 1 reply; 11+ messages in thread From: YoungJun Park @ 2026-02-24 7:48 UTC (permalink / raw) To: kasong Cc: linux-mm, Andrew Morton, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Carsten Grohmann, Rafael J. Wysocki, linux-kernel, open list:SUSPEND TO RAM, taejoon.song, hyungjun.cho@lge.com Carsten Grohmann, stable On Mon, Feb 16, 2026 at 10:58:02PM +0800, Kairui Song via B4 Relay wrote: > From: Kairui Song <kasong@tencent.com> > > Since commit 0ff67f990bd4 ("mm, swap: remove swap slot cache"), > hibernation has been using the swap slot slow allocation path for > simplification, which turns out might cause regression for some > devices because the allocator now rotates clusters too often, leading to > slower allocation and more random distribution of data. ... > diff --git a/mm/swapfile.c b/mm/swapfile.c > index c6863ff7152c..32e0e7545ab8 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -1926,8 +1926,9 @@ void swap_put_entries_direct(swp_entry_t entry, int nr) > /* Allocate a slot for hibernation */ > swp_entry_t swap_alloc_hibernation_slot(int type) > { > - struct swap_info_struct *si = swap_type_to_info(type); > - unsigned long offset; > + struct swap_info_struct *pcp_si, *si = swap_type_to_info(type); > + unsigned long pcp_offset, offset = SWAP_ENTRY_INVALID; > + struct swap_cluster_info *ci; > swp_entry_t entry = {0}; > > if (!si) > @@ -1937,11 +1938,21 @@ swp_entry_t swap_alloc_hibernation_slot(int type) > if (get_swap_device_info(si)) { Hi Kairui :) Reading through the patch, I have some thoughts and review comments regarding the hibernation slot allocation logic. I'd like to discuss potential improvements. (Somewhat long... 
lot of thoughts come up on my mind) First, regarding the race with swapoff and refcounting. The code identifies the swap type before allocation, so a swapoff could occur in between. It seems safer to acquire the reference when identifying the type (e.g., find_first_swap). Also, instead of repeating get/put for every slot (allocation and free), could we hold the reference once during the initial lookup and release it after the image load? This avoids overhead since swapoff is effectively blocked once hibernation slots are allocated. > if (si->flags & SWP_WRITEOK) { > /* > - * Grab the local lock to be compliant > - * with swap table allocation. > + * Try the local cluster first if it matches the device. If > + * not, try grab a new cluster and override local cluster. > */ > local_lock(&percpu_swap_cluster.lock); Second, regarding local_lock: It seems mandatory now because distinguishing the lock context during swap table allocation is tricky (e.g., GFP_KERNEL allocation assumes a local locked context). Have you considered modifying the swap table allocation logic to handle this specifically? This might allow us to avoid holding the local_lock, especially if the device is not SWP_SOLIDSTATE. > - offset = cluster_alloc_swap_entry(si, NULL); > + pcp_si = this_cpu_read(percpu_swap_cluster.si[0]); > + pcp_offset = this_cpu_read(percpu_swap_cluster.offset[0]); > + if (pcp_si == si && pcp_offset) { > + ci = swap_cluster_lock(si, pcp_offset); > + if (cluster_is_usable(ci, 0)) > + offset = alloc_swap_scan_cluster(si, ci, NULL, pcp_offset); > + else > + swap_cluster_unlock(ci); > + } > + if (!offset) > + offset = cluster_alloc_swap_entry(si, NULL); > local_unlock(&percpu_swap_cluster.lock); > if (offset) > entry = swp_entry(si->type, offset); Third, regarding cluster allocation: 1. If hibernation targets a lower-priority device, the per-cpu cluster usage might cause priority inversion (though minimal). 2. 
Have you considered treating clusters as a global resource for this case? For instance, caching next_offset in si(using union on global_cluster or new field) or allowing the allocator to calculate the next value directly, rather than splitting clusters per CPU. Finally, regarding readahead and freeing: Hibernation slots might be read during cluster-based readahead. Can we avoid this (e.g., by checking for a NULL fake shadow entry or adding a specific check for hibernation slots)? If so, we could also avoid triggering try_to_reclaim when freeing these slots. Thanks for your work! Youngjun Park ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v4 1/3] mm, swap: speed up hibernation allocation and writeout 2026-02-24 7:48 ` YoungJun Park @ 2026-02-24 8:04 ` Kairui Song 2026-02-24 11:42 ` YoungJun Park 0 siblings, 1 reply; 11+ messages in thread From: Kairui Song @ 2026-02-24 8:04 UTC (permalink / raw) To: YoungJun Park Cc: linux-mm, Andrew Morton, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Carsten Grohmann, Rafael J. Wysocki, linux-kernel, open list:SUSPEND TO RAM, taejoon.song, hyungjun.cho@lge.com Carsten Grohmann, stable On Tue, Feb 24, 2026 at 3:50 PM YoungJun Park <youngjun.park@lge.com> wrote: > > On Mon, Feb 16, 2026 at 10:58:02PM +0800, Kairui Song via B4 Relay wrote: > > From: Kairui Song <kasong@tencent.com> > > > > Since commit 0ff67f990bd4 ("mm, swap: remove swap slot cache"), > > hibernation has been using the swap slot slow allocation path for > > simplification, which turns out might cause regression for some > > devices because the allocator now rotates clusters too often, leading to > > slower allocation and more random distribution of data. > ... > > diff --git a/mm/swapfile.c b/mm/swapfile.c > > index c6863ff7152c..32e0e7545ab8 100644 > > --- a/mm/swapfile.c > > +++ b/mm/swapfile.c > > @@ -1926,8 +1926,9 @@ void swap_put_entries_direct(swp_entry_t entry, int nr) > > /* Allocate a slot for hibernation */ > > swp_entry_t swap_alloc_hibernation_slot(int type) > > { > > - struct swap_info_struct *si = swap_type_to_info(type); > > - unsigned long offset; > > + struct swap_info_struct *pcp_si, *si = swap_type_to_info(type); > > + unsigned long pcp_offset, offset = SWAP_ENTRY_INVALID; > > + struct swap_cluster_info *ci; > > swp_entry_t entry = {0}; > > > > if (!si) > > @@ -1937,11 +1938,21 @@ swp_entry_t swap_alloc_hibernation_slot(int type) > > if (get_swap_device_info(si)) { > > Hi Kairui :) > > Reading through the patch, I have some thoughts and review comments regarding > the hibernation slot allocation logic. I'd like to discuss potential > improvements. 
(Somewhat long... lot of thoughts come up on my mind) > > First, regarding the race with swapoff and refcounting. > > The code identifies the swap type before allocation, so a swapoff could > occur in between. It seems safer to acquire the reference when identifying > the type (e.g., find_first_swap). Also, instead of repeating get/put for > every slot (allocation and free), could we hold the reference once during > the initial lookup and release it after the image load? This avoids > overhead since swapoff is effectively blocked once hibernation slots are > allocated. Hi Youngjun, Yes, that's definitely doable, but requires the hibernation side to change how it uses the API, which could be a long term workitem. > > > if (si->flags & SWP_WRITEOK) { > > /* > > - * Grab the local lock to be compliant > > - * with swap table allocation. > > + * Try the local cluster first if it matches the device. If > > + * not, try grab a new cluster and override local cluster. > > */ > > local_lock(&percpu_swap_cluster.lock); > > Second, regarding local_lock: > > It seems mandatory now because distinguishing the lock context during swap > table allocation is tricky (e.g., GFP_KERNEL allocation assumes a local > locked context). Have you considered modifying the swap table allocation > logic to handle this specifically? This might allow us to avoid holding the > local_lock, especially if the device is not SWP_SOLIDSTATE. I think you got this part wrong here. We need the lock because it will call this_cpu_xxx operations later. And GFP_KERNEL doesn't assume a lock locked context. Instead it needs to release the lock for a sleep alloc if the ATOMIC alloc fails, and that could happen here. But I agree we can definitely simplify this with some abstraction or wrapper. 
> > > - offset = cluster_alloc_swap_entry(si, NULL); > > + pcp_si = this_cpu_read(percpu_swap_cluster.si[0]); > > + pcp_offset = this_cpu_read(percpu_swap_cluster.offset[0]); > > + if (pcp_si == si && pcp_offset) { > > + ci = swap_cluster_lock(si, pcp_offset); > > + if (cluster_is_usable(ci, 0)) > > + offset = alloc_swap_scan_cluster(si, ci, NULL, pcp_offset); > > + else > > + swap_cluster_unlock(ci); > > + } > > + if (!offset) > > + offset = cluster_alloc_swap_entry(si, NULL); > > local_unlock(&percpu_swap_cluster.lock); > > if (offset) > > entry = swp_entry(si->type, offset); > > Third, regarding cluster allocation: > > 1. If hibernation targets a lower-priority device, the per-cpu cluster > usage might cause priority inversion (though minimal). Right, the problem will be gone if we move the pcp cluster back to device level. It's a trivial problem so I think we don't need to worry about it now. > > 2. Have you considered treating clusters as a global resource for this > case? For instance, caching next_offset in si(using union on global_cluster or new field) or allowing the > allocator to calculate the next value directly, rather than splitting > clusters per CPU. I'm not sure how much code change it will involve and is it worth it. Hibernation is supposed to stop every process, so concurrent memory pressure is not something we are expecting here I think? Even if that happens we are still fine. > > Finally, regarding readahead and freeing: > > Hibernation slots might be read during cluster-based readahead. Can we > avoid this (e.g., by checking for a NULL fake shadow entry or adding a specific > check for hibernation slots)? If so, we could also avoid triggering > try_to_reclaim when freeing these slots. Definitely! I have a patch that introduced a hibernation / exclusive type in the swap table. Remember the is_coutnable macro you commented about previously? That's reserved for this. 
For hibernation type, it's not countable (exclusive to hibernation, maybe I need a better name though) so readahead or any accidental IO will always skip it. By then this ugly try_to_reclaim will be gone. > Thanks for your work! And thanks for your review :) ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v4 1/3] mm, swap: speed up hibernation allocation and writeout 2026-02-24 8:04 ` Kairui Song @ 2026-02-24 11:42 ` YoungJun Park 0 siblings, 0 replies; 11+ messages in thread From: YoungJun Park @ 2026-02-24 11:42 UTC (permalink / raw) To: Kairui Song Cc: linux-mm, Andrew Morton, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Carsten Grohmann, Rafael J. Wysocki, linux-kernel, open list:SUSPEND TO RAM, taejoon.song, hyungjun.cho@lge.com Carsten Grohmann, stable Thanks for the quick feedback :) > Yes, that's definitely doable, but requires the hibernation side > to change how it uses the API, which could be a long term > workitem. I can't claim to know the hibernation code inside out either, but I think the picture would come together if we grab the reference at find_first_swap / swap_type_of and just set the put at the right place. Let me look into this a bit more and bring it up if it turns out to be worthwhile. > I think you got this part wrong here. We need the lock because > it will call this_cpu_xxx operations later. And GFP_KERNEL > doesn't assume a lock locked context. Instead it needs to > release the lock for a sleep alloc if the ATOMIC alloc fails, > and that could happen here. Ah right, sorry for the confusing wording. What I meant was exactly what you described — the locks need to be released for the GFP_KERNEL allocation, and the current code assumes the local lock is always held at that point. > But I agree we can definitely simplify this with some > abstraction or wrapper. What comes to mind right away is hoisting the alloc table routine and distinguishing the path via the folio param. I'll think about how to make it a clean design and propose a patch if it makes sense :) > I'm not sure how much code change it will involve and is it > worth it. 
> > Hibernation is supposed to stop every process, so concurrent > memory I was thinking it might be possible with the ioctl-based uswsusp path, but as you said, it probably wouldn't give us a meaningful benefit. > Definitely! I have a patch that introduced a hibernation / > exclusive type in the swap table. Remember the is_countable > macro you commented about previously? That's reserved for this. > For hibernation type, it's not countable (exclusive to > hibernation, maybe I need a better name though) so readahead or > any accidental IO will always skip it. By then this ugly > try_to_reclaim will be gone. Nice! > > Thanks for your work! > > And thanks for your review :) Thanks Youngjun Park ^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v4 2/3] mm, swap: reduce indention for hibernate allocation helper 2026-02-16 14:58 [PATCH v4 0/3] mm/swap: hibernate: improve hibernate performance with new allocator Kairui Song via B4 Relay 2026-02-16 14:58 ` [PATCH v4 1/3] mm, swap: speed up hibernation allocation and writeout Kairui Song via B4 Relay @ 2026-02-16 14:58 ` Kairui Song via B4 Relay 2026-02-18 8:21 ` Barry Song 2026-02-16 14:58 ` [PATCH v4 3/3] mm, swap: merge common convention and simplify " Kairui Song via B4 Relay 2 siblings, 1 reply; 11+ messages in thread From: Kairui Song via B4 Relay @ 2026-02-16 14:58 UTC (permalink / raw) To: linux-mm Cc: Andrew Morton, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Carsten Grohmann, Rafael J. Wysocki, linux-kernel, open list:SUSPEND TO RAM, Carsten Grohmann, Kairui Song From: Kairui Song <kasong@tencent.com> It doesn't have to check the device flag, as the allocator will also check the device flag and refuse to allocate if the device is not writable. This might cause a trivial waste of CPU cycles of hibernate allocation raced with swapoff, but that is very unlikely to happen. Removing the check on the common path should be more helpful. Signed-off-by: Kairui Song <kasong@tencent.com> --- mm/swapfile.c | 51 +++++++++++++++++++++++---------------------------- 1 file changed, 23 insertions(+), 28 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index 32e0e7545ab8..ea63885f344a 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1931,35 +1931,30 @@ swp_entry_t swap_alloc_hibernation_slot(int type) struct swap_cluster_info *ci; swp_entry_t entry = {0}; - if (!si) - goto fail; - - /* This is called for allocating swap entry, not cache */ - if (get_swap_device_info(si)) { - if (si->flags & SWP_WRITEOK) { - /* - * Try the local cluster first if it matches the device. If - * not, try grab a new cluster and override local cluster. 
- */ - local_lock(&percpu_swap_cluster.lock); - pcp_si = this_cpu_read(percpu_swap_cluster.si[0]); - pcp_offset = this_cpu_read(percpu_swap_cluster.offset[0]); - if (pcp_si == si && pcp_offset) { - ci = swap_cluster_lock(si, pcp_offset); - if (cluster_is_usable(ci, 0)) - offset = alloc_swap_scan_cluster(si, ci, NULL, pcp_offset); - else - swap_cluster_unlock(ci); - } - if (!offset) - offset = cluster_alloc_swap_entry(si, NULL); - local_unlock(&percpu_swap_cluster.lock); - if (offset) - entry = swp_entry(si->type, offset); - } - put_swap_device(si); + /* Return empty entry if device is not usable (swapoff or full) */ + if (!si || !get_swap_device_info(si)) + return entry; + /* + * Try the local cluster first if it matches the device. If + * not, try grab a new cluster and override local cluster. + */ + local_lock(&percpu_swap_cluster.lock); + pcp_si = this_cpu_read(percpu_swap_cluster.si[0]); + pcp_offset = this_cpu_read(percpu_swap_cluster.offset[0]); + if (pcp_si == si && pcp_offset) { + ci = swap_cluster_lock(si, pcp_offset); + if (cluster_is_usable(ci, 0)) + offset = alloc_swap_scan_cluster(si, ci, NULL, pcp_offset); + else + swap_cluster_unlock(ci); } -fail: + if (offset == SWAP_ENTRY_INVALID) + offset = cluster_alloc_swap_entry(si, NULL); + local_unlock(&percpu_swap_cluster.lock); + if (offset) + entry = swp_entry(si->type, offset); + put_swap_device(si); + return entry; } -- 2.52.0 ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v4 2/3] mm, swap: reduce indention for hibernate allocation helper 2026-02-16 14:58 ` [PATCH v4 2/3] mm, swap: reduce indention for hibernate allocation helper Kairui Song via B4 Relay @ 2026-02-18 8:21 ` Barry Song 2026-02-18 8:58 ` Kairui Song 0 siblings, 1 reply; 11+ messages in thread From: Barry Song @ 2026-02-18 8:21 UTC (permalink / raw) To: kasong Cc: linux-mm, Andrew Morton, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He, Carsten Grohmann, Rafael J. Wysocki, linux-kernel, open list:SUSPEND TO RAM, Carsten Grohmann On Mon, Feb 16, 2026 at 10:58 PM Kairui Song via B4 Relay <devnull+kasong.tencent.com@kernel.org> wrote: [...] > + /* Return empty entry if device is not usable (swapoff or full) */ I feel like swap full only affects the swap_avail_head / swap_available list. Does it also make get_swap_device_info() return false? > + if (!si || !get_swap_device_info(si)) > + return entry; Best Regards Barry ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v4 2/3] mm, swap: reduce indention for hibernate allocation helper 2026-02-18 8:21 ` Barry Song @ 2026-02-18 8:58 ` Kairui Song 0 siblings, 0 replies; 11+ messages in thread From: Kairui Song @ 2026-02-18 8:58 UTC (permalink / raw) To: Barry Song Cc: linux-mm, Andrew Morton, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He, Carsten Grohmann, Rafael J. Wysocki, linux-kernel, open list:SUSPEND TO RAM, Carsten Grohmann On Wed, Feb 18, 2026 at 4:23 PM Barry Song <21cnbao@gmail.com> wrote: > > On Mon, Feb 16, 2026 at 10:58 PM Kairui Song via B4 Relay > <devnull+kasong.tencent.com@kernel.org> wrote: > [...] > > + /* Return empty entry if device is not usable (swapoff or full) */ > > I feel like swap full only affects the swap_avail_head / > swap_available list. Does it also make > get_swap_device_info() return false? Yeah you are right, full swap device doesn't affect get_swap_device_info here so the comment isn't that accurate. I'll update the comment later after 1/3 is moved to stable. Thanks! ^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH v4 3/3] mm, swap: merge common convention and simplify allocation helper
  2026-02-16 14:58 [PATCH v4 0/3] mm/swap: hibernate: improve hibernate performance with new allocator Kairui Song via B4 Relay
  2026-02-16 14:58 ` [PATCH v4 1/3] mm, swap: speed up hibernation allocation and writeout Kairui Song via B4 Relay
  2026-02-16 14:58 ` [PATCH v4 2/3] mm, swap: reduce indention for hibernate allocation helper Kairui Song via B4 Relay
@ 2026-02-16 14:58 ` Kairui Song via B4 Relay
  2 siblings, 0 replies; 11+ messages in thread
From: Kairui Song via B4 Relay @ 2026-02-16 14:58 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, Chris Li, Kemeng Shi, Nhat Pham, Baoquan He,
    Barry Song, Carsten Grohmann, Rafael J. Wysocki, linux-kernel,
    open list:SUSPEND TO RAM, Carsten Grohmann, Kairui Song

From: Kairui Song <kasong@tencent.com>

Almost all callers of the cluster scan helper follow the same routine:
lock -> usefulness/emptiness check -> allocate -> unlock. So merge that
routine into the helper itself to simplify the code. While at it, add
some kerneldoc too.

Signed-off-by: Kairui Song <kasong@tencent.com>
---
 mm/swapfile.c | 54 +++++++++++++++++++++++++++++++-----------------------
 1 file changed, 31 insertions(+), 23 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index ea63885f344a..a6276c5ead8e 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -910,7 +910,21 @@ static bool cluster_alloc_range(struct swap_info_struct *si,
 	return true;
 }
 
-/* Try use a new cluster for current CPU and allocate from it. */
+/*
+ * alloc_swap_scan_cluster - Scan and allocate swap entries from one cluster.
+ * @si: the swap device of the cluster.
+ * @ci: the cluster, must be locked.
+ * @folio: the folio to allocate for, could be NULL.
+ * @offset: scan start offset, must be a swap device offset pointing inside @ci.
+ *
+ * Scan the swap slots inside @ci, starting from @offset, and allocate
+ * contiguous entries that point to these slots. If @folio is not NULL, folio
+ * size number of entries are allocated, and the starting entry is stored to
+ * folio->swap. If @folio is NULL, one entry will be allocated and passed to
+ * the caller as the return value. In both cases, the offset is returned.
+ *
+ * This helper also updates the percpu cached cluster.
+ */
 static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 					    struct swap_cluster_info *ci,
 					    struct folio *folio, unsigned long offset)
@@ -923,11 +937,14 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 	bool need_reclaim, ret, usable;
 
 	lockdep_assert_held(&ci->lock);
-	VM_WARN_ON(!cluster_is_usable(ci, order));
 
-	if (end < nr_pages || ci->count + nr_pages > SWAPFILE_CLUSTER)
+	if (!cluster_is_usable(ci, order) || end < nr_pages ||
+	    ci->count + nr_pages > SWAPFILE_CLUSTER)
 		goto out;
 
+	if (cluster_is_empty(ci))
+		offset = cluster_offset(si, ci);
+
 	for (end -= nr_pages; offset <= end; offset += nr_pages) {
 		need_reclaim = false;
 		if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim))
@@ -951,6 +968,14 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 		break;
 	}
 out:
+	/*
+	 * Whether the allocation succeeded or failed, relocate the cluster
+	 * and update percpu offset cache. On success this is necessary to
+	 * mark the cluster as cached fast path. On failure, this invalidates
+	 * the percpu cache to indicate an allocation failure and next scan
+	 * should use a new cluster, and move the failed cluster to where it
+	 * should be.
+	 */
 	relocate_cluster(si, ci);
 	swap_cluster_unlock(ci);
 	if (si->flags & SWP_SOLIDSTATE) {
@@ -1060,14 +1085,7 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
 			goto new_cluster;
 
 		ci = swap_cluster_lock(si, offset);
-		/* Cluster could have been used by another order */
-		if (cluster_is_usable(ci, order)) {
-			if (cluster_is_empty(ci))
-				offset = cluster_offset(si, ci);
-			found = alloc_swap_scan_cluster(si, ci, folio, offset);
-		} else {
-			swap_cluster_unlock(ci);
-		}
+		found = alloc_swap_scan_cluster(si, ci, folio, offset);
 		if (found)
 			goto done;
 	}
@@ -1332,14 +1350,7 @@ static bool swap_alloc_fast(struct folio *folio)
 		return false;
 
 	ci = swap_cluster_lock(si, offset);
-	if (cluster_is_usable(ci, order)) {
-		if (cluster_is_empty(ci))
-			offset = cluster_offset(si, ci);
-		alloc_swap_scan_cluster(si, ci, folio, offset);
-	} else {
-		swap_cluster_unlock(ci);
-	}
-
+	alloc_swap_scan_cluster(si, ci, folio, offset);
 	put_swap_device(si);
 	return folio_test_swapcache(folio);
 }
@@ -1943,10 +1954,7 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
 		pcp_offset = this_cpu_read(percpu_swap_cluster.offset[0]);
 		if (pcp_si == si && pcp_offset) {
 			ci = swap_cluster_lock(si, pcp_offset);
-			if (cluster_is_usable(ci, 0))
-				offset = alloc_swap_scan_cluster(si, ci, NULL, pcp_offset);
-			else
-				swap_cluster_unlock(ci);
+			offset = alloc_swap_scan_cluster(si, ci, NULL, pcp_offset);
 		}
 		if (offset == SWAP_ENTRY_INVALID)
 			offset = cluster_alloc_swap_entry(si, NULL);
-- 
2.52.0
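[Editor's illustration] The patch above folds the lock -> check -> allocate -> unlock sequence that every call site repeated into the helper itself: the helper takes the cluster already locked, re-checks usability internally, and always unlocks before returning, so callers shrink to a lock-and-call. The same consolidation can be sketched in plain userspace C; `struct bucket`, `bucket_alloc`, and `bucket_alloc_locked` below are invented names for illustration only, not kernel APIs:

```c
#include <pthread.h>
#include <stdbool.h>

/* A miniature lock-guarded resource, standing in for a swap cluster. */
struct bucket {
	pthread_mutex_t lock;
	int free_slots;
	bool usable;
};

/*
 * Before the refactor, every caller repeated:
 *     lock -> check usable -> allocate -> unlock
 * After it, the helper owns the whole sequence: it expects @b locked,
 * performs the usability check itself, and always unlocks on exit,
 * much like alloc_swap_scan_cluster absorbing cluster_is_usable()
 * and swap_cluster_unlock() in the diff above.
 * Returns the allocated slot index, or -1 on failure.
 */
static int bucket_alloc_locked(struct bucket *b)
{
	int slot = -1;

	/* Validation moved inside the helper: callers no longer branch. */
	if (b->usable && b->free_slots > 0)
		slot = --b->free_slots;

	pthread_mutex_unlock(&b->lock);	/* helper always unlocks */
	return slot;
}

static int bucket_alloc(struct bucket *b)
{
	pthread_mutex_lock(&b->lock);
	return bucket_alloc_locked(b);	/* call site is now one line */
}
```

With this shape, the unusable and empty cases are handled in exactly one place, and no call site can forget the unlock on the failure path.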
end of thread, other threads:[~2026-02-24 11:42 UTC | newest]

Thread overview: 11+ messages
2026-02-16 14:58 [PATCH v4 0/3] mm/swap: hibernate: improve hibernate performance with new allocator Kairui Song via B4 Relay
2026-02-16 14:58 ` [PATCH v4 1/3] mm, swap: speed up hibernation allocation and writeout Kairui Song via B4 Relay
2026-02-16 21:42   ` Andrew Morton
2026-02-17 18:37     ` Kairui Song
2026-02-24  7:48   ` YoungJun Park
2026-02-24  8:04     ` Kairui Song
2026-02-24 11:42       ` YoungJun Park
2026-02-16 14:58 ` [PATCH v4 2/3] mm, swap: reduce indention for hibernate allocation helper Kairui Song via B4 Relay
2026-02-18  8:21   ` Barry Song
2026-02-18  8:58     ` Kairui Song
2026-02-16 14:58 ` [PATCH v4 3/3] mm, swap: merge common convention and simplify allocation helper Kairui Song via B4 Relay