linux-mm.kvack.org archive mirror
* [RFC PATCH v2] mm/swap, PM: hibernate: hold swap device reference across swap operation
@ 2026-03-06  2:46 Youngjun Park
  2026-03-06  6:55 ` Chris Li
  0 siblings, 1 reply; 6+ messages in thread
From: Youngjun Park @ 2026-03-06  2:46 UTC (permalink / raw)
  To: rafael, akpm
  Cc: chrisl, kasong, pavel, shikemeng, nphamcs, bhe, baohua,
	youngjun.park, usama.arif, linux-pm, linux-mm

Currently, in the uswsusp path, only the swap type value is retrieved at
lookup time without holding a reference. If swapoff races after the type
is acquired, subsequent slot allocations operate on a stale swap device.

Additionally, grabbing and releasing the swap device reference on every
slot allocation is inefficient across the entire hibernation swap path.

Address these issues by holding the swap device reference from the point
the swap device is looked up, and releasing it once at each exit path.
This ensures the device remains valid throughout the operation and
removes the overhead of per-slot reference counting.

Signed-off-by: Youngjun Park <youngjun.park@lge.com>
---
Hi,
    
This is a simple RFC-quality patch to verify whether this approach is suitable.
Per Usama Arif's feedback regarding git bisectability,
I have squashed the previous commits into this single patch.

base-commit: ec96cb7e4c12ff5b474cf9ab66f2e9767953e448 (mm-new)

RFC v1: https://lore.kernel.org/linux-mm/20260305202413.1888499-1-usama.arif@linux.dev/T/#m3693d45180f14f441b6951984f4b4bfd90ec0c9d

 include/linux/swap.h |  1 +
 kernel/power/swap.c  | 12 +++++++---
 kernel/power/user.c  |  9 +++++++-
 mm/swapfile.c        | 55 ++++++++++++++++++++++----------------------
 4 files changed, 45 insertions(+), 32 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7a09df6977a5..37bf7cf21594 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -442,6 +442,7 @@ extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
 extern int swp_swapcount(swp_entry_t entry);
 struct backing_dev_info;
 extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
+extern void put_swap_device_by_type(int type);
 sector_t swap_folio_sector(struct folio *folio);
 
 /*
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index 2e64869bb5a0..c230b0fa5a5f 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -350,9 +350,10 @@ static int swsusp_swap_check(void)
 
 	hib_resume_bdev_file = bdev_file_open_by_dev(swsusp_resume_device,
 			BLK_OPEN_WRITE, NULL, NULL);
-	if (IS_ERR(hib_resume_bdev_file))
+	if (IS_ERR(hib_resume_bdev_file)) {
+		put_swap_device_by_type(root_swap);
 		return PTR_ERR(hib_resume_bdev_file);
-
+	}
 	return 0;
 }
 
@@ -418,6 +419,7 @@ static int get_swap_writer(struct swap_map_handle *handle)
 err_rel:
 	release_swap_writer(handle);
 err_close:
+	put_swap_device_by_type(root_swap);
 	swsusp_close();
 	return ret;
 }
@@ -480,8 +482,11 @@ static int swap_writer_finish(struct swap_map_handle *handle,
 		flush_swap_writer(handle);
 	}
 
-	if (error)
+	if (error) {
 		free_all_swap_pages(root_swap);
+		put_swap_device_by_type(root_swap);
+	}
+
 	release_swap_writer(handle);
 	swsusp_close();
 
@@ -1647,6 +1652,7 @@ int swsusp_unmark(void)
 	 * We just returned from suspend, we don't need the image any more.
 	 */
 	free_all_swap_pages(root_swap);
+	put_swap_device_by_type(root_swap);
 
 	return error;
 }
diff --git a/kernel/power/user.c b/kernel/power/user.c
index 4401cfe26e5c..9cb6c24d49ea 100644
--- a/kernel/power/user.c
+++ b/kernel/power/user.c
@@ -90,8 +90,11 @@ static int snapshot_open(struct inode *inode, struct file *filp)
 			data->free_bitmaps = !error;
 		}
 	}
-	if (error)
+	if (error) {
 		hibernate_release();
+		if (data->swap >= 0)
+			put_swap_device_by_type(data->swap);
+	}
 
 	data->frozen = false;
 	data->ready = false;
@@ -115,6 +118,8 @@ static int snapshot_release(struct inode *inode, struct file *filp)
 	data = filp->private_data;
 	data->dev = 0;
 	free_all_swap_pages(data->swap);
+	if (data->swap >= 0)
+		put_swap_device_by_type(data->swap);
 	if (data->frozen) {
 		pm_restore_gfp_mask();
 		free_basic_memory_bitmaps();
@@ -235,6 +240,8 @@ static int snapshot_set_swap_area(struct snapshot_data *data,
 		offset = swap_area.offset;
 	}
 
+	if (data->swap >= 0)
+		put_swap_device_by_type(data->swap);
 	/*
 	 * User space encodes device types as two-byte values,
 	 * so we need to recode them
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 915bc93964db..f505dd1f7571 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1860,6 +1860,10 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 	return NULL;
 }
 
+void put_swap_device_by_type(int type)
+{
+	percpu_ref_put(&swap_info[type]->users);
+}
 /*
  * Free a set of swap slots after their swap count dropped to zero, or will be
  * zero after putting the last ref (saves one __swap_cluster_put_entry call).
@@ -2085,30 +2089,28 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
 		goto fail;
 
 	/* This is called for allocating swap entry, not cache */
-	if (get_swap_device_info(si)) {
-		if (si->flags & SWP_WRITEOK) {
-			/*
-			 * Try the local cluster first if it matches the device. If
-			 * not, try grab a new cluster and override local cluster.
-			 */
-			local_lock(&percpu_swap_cluster.lock);
-			pcp_si = this_cpu_read(percpu_swap_cluster.si[0]);
-			pcp_offset = this_cpu_read(percpu_swap_cluster.offset[0]);
-			if (pcp_si == si && pcp_offset) {
-				ci = swap_cluster_lock(si, pcp_offset);
-				if (cluster_is_usable(ci, 0))
-					offset = alloc_swap_scan_cluster(si, ci, NULL, pcp_offset);
-				else
-					swap_cluster_unlock(ci);
-			}
-			if (!offset)
-				offset = cluster_alloc_swap_entry(si, NULL);
-			local_unlock(&percpu_swap_cluster.lock);
-			if (offset)
-				entry = swp_entry(si->type, offset);
+	if (si->flags & SWP_WRITEOK) {
+		/*
+		 * Try the local cluster first if it matches the device. If
+		 * not, try grab a new cluster and override local cluster.
+		 */
+		local_lock(&percpu_swap_cluster.lock);
+		pcp_si = this_cpu_read(percpu_swap_cluster.si[0]);
+		pcp_offset = this_cpu_read(percpu_swap_cluster.offset[0]);
+		if (pcp_si == si && pcp_offset) {
+			ci = swap_cluster_lock(si, pcp_offset);
+			if (cluster_is_usable(ci, 0))
+				offset = alloc_swap_scan_cluster(si, ci, NULL, pcp_offset);
+			else
+				swap_cluster_unlock(ci);
 		}
-		put_swap_device(si);
+		if (!offset)
+			offset = cluster_alloc_swap_entry(si, NULL);
+		local_unlock(&percpu_swap_cluster.lock);
+		if (offset)
+			entry = swp_entry(si->type, offset);
 	}
+
 fail:
 	return entry;
 }
@@ -2116,14 +2118,10 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
 /* Free a slot allocated by swap_alloc_hibernation_slot */
 void swap_free_hibernation_slot(swp_entry_t entry)
 {
-	struct swap_info_struct *si;
+	struct swap_info_struct *si = __swap_entry_to_info(entry);
 	struct swap_cluster_info *ci;
 	pgoff_t offset = swp_offset(entry);
 
-	si = get_swap_device(entry);
-	if (WARN_ON(!si))
-		return;
-
 	ci = swap_cluster_lock(si, offset);
 	__swap_cluster_put_entry(ci, offset % SWAPFILE_CLUSTER);
 	__swap_cluster_free_entries(si, ci, offset % SWAPFILE_CLUSTER, 1);
@@ -2131,7 +2129,6 @@ void swap_free_hibernation_slot(swp_entry_t entry)
 
 	/* In theory readahead might add it to the swap cache by accident */
 	__try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
-	put_swap_device(si);
 }
 
 /*
@@ -2160,6 +2157,7 @@ int swap_type_of(dev_t device, sector_t offset)
 			struct swap_extent *se = first_se(sis);
 
 			if (se->start_block == offset) {
+				get_swap_device_info(sis);
 				spin_unlock(&swap_lock);
 				return type;
 			}
@@ -2180,6 +2178,7 @@ int find_first_swap(dev_t *device)
 		if (!(sis->flags & SWP_WRITEOK))
 			continue;
 		*device = sis->bdev->bd_dev;
+		get_swap_device_info(sis);
 		spin_unlock(&swap_lock);
 		return type;
 	}
-- 
2.34.1



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH v2] mm/swap, PM: hibernate: hold swap device reference across swap operation
  2026-03-06  2:46 [RFC PATCH v2] mm/swap, PM: hibernate: hold swap device reference across swap operation Youngjun Park
@ 2026-03-06  6:55 ` Chris Li
  2026-03-06  8:02   ` YoungJun Park
  0 siblings, 1 reply; 6+ messages in thread
From: Chris Li @ 2026-03-06  6:55 UTC (permalink / raw)
  To: Youngjun Park
  Cc: rafael, akpm, kasong, pavel, shikemeng, nphamcs, bhe, baohua,
	usama.arif, linux-pm, linux-mm

On Thu, Mar 5, 2026 at 6:46 PM Youngjun Park <youngjun.park@lge.com> wrote:
>
> Currently, in the uswsusp path, only the swap type value is retrieved at
> lookup time without holding a reference. If swapoff races after the type
> is acquired, subsequent slot allocations operate on a stale swap device.

Just from your description above, I am not sure how the bug is actually
triggered yet. It sounds possible, but I want more detail.

Can you show me which code path triggered this bug?
e.g. Thread A wants to suspend, with this back trace call graph.
Then in this function foo() A grabs the swap device without holding a reference.
Meanwhile, thread B is performing a swap off while A is at function foo().

> Additionally, grabbing and releasing the swap device reference on every
> slot allocation is inefficient across the entire hibernation swap path.

If the swap entry is already allocated by the suspend code on that
swap device, the follow up allocation does not need to grab the
reference again because the swap device's swapped count will not drop
to zero until resume.

> Address these issues by holding the swap device reference from the point
> the swap device is looked up, and releasing it once at each exit path.
> This ensures the device remains valid throughout the operation and
> removes the overhead of per-slot reference counting.

I want to understand how to trigger the buggy code path first. It
might be obvious to you. It is not obvious to me yet.

> Signed-off-by: Youngjun Park <youngjun.park@lge.com>
> ---
> Hi,
>
> This is a simple RFC quality patch to verify if this approach is suitable.
> Per Usama Arif's feedback regarding git bisectability,
> I have squashed the previous commits into this single patch.
>
> base-commit: ec96cb7e4c12ff5b474cf9ab66f2e9767953e448 (mm-new)
>
> RFC v1: https://lore.kernel.org/linux-mm/20260305202413.1888499-1-usama.arif@linux.dev/T/#m3693d45180f14f441b6951984f4b4bfd90ec0c9d
>
>  include/linux/swap.h |  1 +
>  kernel/power/swap.c  | 12 +++++++---
>  kernel/power/user.c  |  9 +++++++-
>  mm/swapfile.c        | 55 ++++++++++++++++++++++----------------------
>  4 files changed, 45 insertions(+), 32 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 7a09df6977a5..37bf7cf21594 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -442,6 +442,7 @@ extern bool swap_entry_swapped(struct swap_info_struct *si, swp_entry_t entry);
>  extern int swp_swapcount(swp_entry_t entry);
>  struct backing_dev_info;
>  extern struct swap_info_struct *get_swap_device(swp_entry_t entry);
> +extern void put_swap_device_by_type(int type);
>  sector_t swap_folio_sector(struct folio *folio);
>
>  /*
> diff --git a/kernel/power/swap.c b/kernel/power/swap.c
> index 2e64869bb5a0..c230b0fa5a5f 100644
> --- a/kernel/power/swap.c
> +++ b/kernel/power/swap.c
> @@ -350,9 +350,10 @@ static int swsusp_swap_check(void)
>
>         hib_resume_bdev_file = bdev_file_open_by_dev(swsusp_resume_device,
>                         BLK_OPEN_WRITE, NULL, NULL);
> -       if (IS_ERR(hib_resume_bdev_file))
> +       if (IS_ERR(hib_resume_bdev_file)) {
> +               put_swap_device_by_type(root_swap);
>                 return PTR_ERR(hib_resume_bdev_file);
> -
> +       }
>         return 0;
>  }
>
> @@ -418,6 +419,7 @@ static int get_swap_writer(struct swap_map_handle *handle)
>  err_rel:
>         release_swap_writer(handle);
>  err_close:
> +       put_swap_device_by_type(root_swap);
>         swsusp_close();
>         return ret;
>  }
> @@ -480,8 +482,11 @@ static int swap_writer_finish(struct swap_map_handle *handle,
>                 flush_swap_writer(handle);
>         }
>
> -       if (error)
> +       if (error) {
>                 free_all_swap_pages(root_swap);
> +               put_swap_device_by_type(root_swap);
> +       }
> +
>         release_swap_writer(handle);
>         swsusp_close();
>
> @@ -1647,6 +1652,7 @@ int swsusp_unmark(void)
>          * We just returned from suspend, we don't need the image any more.
>          */
>         free_all_swap_pages(root_swap);
> +       put_swap_device_by_type(root_swap);
>
>         return error;
>  }
> diff --git a/kernel/power/user.c b/kernel/power/user.c
> index 4401cfe26e5c..9cb6c24d49ea 100644
> --- a/kernel/power/user.c
> +++ b/kernel/power/user.c
> @@ -90,8 +90,11 @@ static int snapshot_open(struct inode *inode, struct file *filp)
>                         data->free_bitmaps = !error;
>                 }
>         }
> -       if (error)
> +       if (error) {
>                 hibernate_release();
> +               if (data->swap >= 0)
> +                       put_swap_device_by_type(data->swap);
> +       }
>
>         data->frozen = false;
>         data->ready = false;
> @@ -115,6 +118,8 @@ static int snapshot_release(struct inode *inode, struct file *filp)
>         data = filp->private_data;
>         data->dev = 0;
>         free_all_swap_pages(data->swap);
> +       if (data->swap >= 0)
> +               put_swap_device_by_type(data->swap);
>         if (data->frozen) {
>                 pm_restore_gfp_mask();
>                 free_basic_memory_bitmaps();
> @@ -235,6 +240,8 @@ static int snapshot_set_swap_area(struct snapshot_data *data,
>                 offset = swap_area.offset;
>         }
>
> +       if (data->swap >= 0)
> +               put_swap_device_by_type(data->swap);
>         /*
>          * User space encodes device types as two-byte values,
>          * so we need to recode them
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 915bc93964db..f505dd1f7571 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1860,6 +1860,10 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>         return NULL;
>  }
>
> +void put_swap_device_by_type(int type)
> +{
> +       percpu_ref_put(&swap_info[type]->users);
> +}
>  /*
>   * Free a set of swap slots after their swap count dropped to zero, or will be
>   * zero after putting the last ref (saves one __swap_cluster_put_entry call).
> @@ -2085,30 +2089,28 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
>                 goto fail;
>
>         /* This is called for allocating swap entry, not cache */
> -       if (get_swap_device_info(si)) {
> -               if (si->flags & SWP_WRITEOK) {
> -                       /*
> -                        * Try the local cluster first if it matches the device. If
> -                        * not, try grab a new cluster and override local cluster.
> -                        */
> -                       local_lock(&percpu_swap_cluster.lock);
> -                       pcp_si = this_cpu_read(percpu_swap_cluster.si[0]);
> -                       pcp_offset = this_cpu_read(percpu_swap_cluster.offset[0]);
> -                       if (pcp_si == si && pcp_offset) {
> -                               ci = swap_cluster_lock(si, pcp_offset);
> -                               if (cluster_is_usable(ci, 0))
> -                                       offset = alloc_swap_scan_cluster(si, ci, NULL, pcp_offset);
> -                               else
> -                                       swap_cluster_unlock(ci);
> -                       }
> -                       if (!offset)
> -                               offset = cluster_alloc_swap_entry(si, NULL);
> -                       local_unlock(&percpu_swap_cluster.lock);
> -                       if (offset)
> -                               entry = swp_entry(si->type, offset);
> +       if (si->flags & SWP_WRITEOK) {
> +               /*
> +                * Try the local cluster first if it matches the device. If
> +                * not, try grab a new cluster and override local cluster.
> +                */
> +               local_lock(&percpu_swap_cluster.lock);
> +               pcp_si = this_cpu_read(percpu_swap_cluster.si[0]);
> +               pcp_offset = this_cpu_read(percpu_swap_cluster.offset[0]);
> +               if (pcp_si == si && pcp_offset) {
> +                       ci = swap_cluster_lock(si, pcp_offset);
> +                       if (cluster_is_usable(ci, 0))
> +                               offset = alloc_swap_scan_cluster(si, ci, NULL, pcp_offset);
> +                       else
> +                               swap_cluster_unlock(ci);
>                 }
> -               put_swap_device(si);
> +               if (!offset)
> +                       offset = cluster_alloc_swap_entry(si, NULL);
> +               local_unlock(&percpu_swap_cluster.lock);
> +               if (offset)
> +                       entry = swp_entry(si->type, offset);
>         }
> +
>  fail:
>         return entry;
>  }
> @@ -2116,14 +2118,10 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
>  /* Free a slot allocated by swap_alloc_hibernation_slot */
>  void swap_free_hibernation_slot(swp_entry_t entry)
>  {
> -       struct swap_info_struct *si;
> +       struct swap_info_struct *si = __swap_entry_to_info(entry);
>         struct swap_cluster_info *ci;
>         pgoff_t offset = swp_offset(entry);
>
> -       si = get_swap_device(entry);
> -       if (WARN_ON(!si))
> -               return;
> -
>         ci = swap_cluster_lock(si, offset);
>         __swap_cluster_put_entry(ci, offset % SWAPFILE_CLUSTER);
>         __swap_cluster_free_entries(si, ci, offset % SWAPFILE_CLUSTER, 1);
> @@ -2131,7 +2129,6 @@ void swap_free_hibernation_slot(swp_entry_t entry)
>
>         /* In theory readahead might add it to the swap cache by accident */
>         __try_to_reclaim_swap(si, offset, TTRS_ANYWAY);
> -       put_swap_device(si);
>  }
>
>  /*
> @@ -2160,6 +2157,7 @@ int swap_type_of(dev_t device, sector_t offset)
>                         struct swap_extent *se = first_se(sis);
>
>                         if (se->start_block == offset) {
> +                               get_swap_device_info(sis);

The function name swap_type_of() does not suggest that the function
should take a reference.  This is just about function naming. I am not
commenting on the function logic yet.

>                                 spin_unlock(&swap_lock);
>                                 return type;
>                         }
> @@ -2180,6 +2178,7 @@ int find_first_swap(dev_t *device)
>                 if (!(sis->flags & SWP_WRITEOK))
>                         continue;
>                 *device = sis->bdev->bd_dev;
> +               get_swap_device_info(sis);
You might consider moving this one line up. The typical usage pattern
is to get the reference and then operate on the data protected by the
reference count. Here the order does not really matter due to the
swap_lock protection.

Waiting for details on how to trigger the bug.

Chris



* Re: [RFC PATCH v2] mm/swap, PM: hibernate: hold swap device reference across swap operation
  2026-03-06  6:55 ` Chris Li
@ 2026-03-06  8:02   ` YoungJun Park
  2026-03-09  6:43     ` Chris Li
  0 siblings, 1 reply; 6+ messages in thread
From: YoungJun Park @ 2026-03-06  8:02 UTC (permalink / raw)
  To: Chris Li
  Cc: rafael, akpm, kasong, pavel, shikemeng, nphamcs, bhe, baohua,
	usama.arif, linux-pm, linux-mm

On Thu, Mar 05, 2026 at 10:55:15PM -0800, Chris Li wrote:
> On Thu, Mar 5, 2026 at 6:46 PM Youngjun Park <youngjun.park@lge.com> wrote:
> >
> > Currently, in the uswsusp path, only the swap type value is retrieved at
> > lookup time without holding a reference. If swapoff races after the type
> > is acquired, subsequent slot allocations operate on a stale swap device.
> 
> Just from your description above, I am not sure how the bug is actually
> triggered yet. It sounds possible, but I want more detail.

To be honest, I am not deeply familiar with the snapshot code, which is why
I submitted this as an RFC. However, I believe the race is theoretically
possible and I was able to trigger it with a simple PoC user program.

(This affects only uswsusp, not in-kernel swsusp, I think, because every
user thread is frozen before the snapshot is created.)

The race occurs in `power/user.c`

1. snapshot_open() calls swap_type_of() to find the swap device.
2. We get the swap type, but hold no reference at this point.
3. [Race Window]: Another thread triggers swapoff() and swapon()
4. snapshot_ioctl(SNAPSHOT_ALLOC_SWAP_PAGE) is called.
   -> The swap device is gone, or its type ID has been reused by another
      device.

> Can you show me which code path triggered this bug?
> e.g. Thread A wants to suspend, with this back trace call graph.
> Then in this function foo() A grabs the swap device without holding a reference.
> Meanwhile, thread B is performing a swap off while A is at function foo().
> 
> > Additionally, grabbing and releasing the swap device reference on every
> > slot allocation is inefficient across the entire hibernation swap path.
> 
> If the swap entry is already allocated by the suspend code on that
> swap device, the follow up allocation does not need to grab the
> reference again because the swap device's swapped count will not drop
> to zero until resume.

You are right. Since the swap device is pinned once a swap entry is
allocated, we could indeed rely on that pinning mechanism to ensure safety
for subsequent allocations (instead of doing get/put every time).

However, relying on that pinning alone does not protect the window between
the initial lookup (step 1) and the *first* allocation.

My proposal is to grab the reference at the lookup point to close this
initial race. If we do that, I believe we can remove the per-slot
get/put calls entirely, as the initial reference is sufficient to keep the
device alive until the operation completes.

Regarding the reference release strategy in this patch:

1. uswsusp: the reference is released when the snapshot device file
   is closed (snapshot_release) and in the error paths.
2. non-uswsusp: I only added the reference release in the error paths.

About 2: I conclude that on a successful resume, the system state reverts to
the snapshot point, making an explicit release unnecessary. However,
I am not 100% certain if this holds true for the swap reference
context.

This part is the primary reason I submitted this as an RFC. I
would appreciate it if you could review this part specifically to
confirm whether my understanding is correct.

> > Address these issues by holding the swap device reference from the point
> > the swap device is looked up, and releasing it once at each exit path.
> > This ensures the device remains valid throughout the operation and
> > removes the overhead of per-slot reference counting.
> 
> I want to understand how to trigger the buggy code path first. It
> might be obvious to you. It is not obvious to me yet.

I hope the explanation above clarifies the trace. Please let me know if
there are still parts that are not obvious, and I will explain further or
investigate more.

Thank you for the review
Youngjun Park



* Re: [RFC PATCH v2] mm/swap, PM: hibernate: hold swap device reference across swap operation
  2026-03-06  8:02   ` YoungJun Park
@ 2026-03-09  6:43     ` Chris Li
  2026-03-09  7:42       ` YoungJun Park
  0 siblings, 1 reply; 6+ messages in thread
From: Chris Li @ 2026-03-09  6:43 UTC (permalink / raw)
  To: YoungJun Park
  Cc: rafael, akpm, kasong, pavel, shikemeng, nphamcs, bhe, baohua,
	usama.arif, linux-pm, linux-mm

On Fri, Mar 6, 2026 at 12:02 AM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Thu, Mar 05, 2026 at 10:55:15PM -0800, Chris Li wrote:
> > On Thu, Mar 5, 2026 at 6:46 PM Youngjun Park <youngjun.park@lge.com> wrote:
> > >
> > > Currently, in the uswsusp path, only the swap type value is retrieved at
> > > lookup time without holding a reference. If swapoff races after the type
> > > is acquired, subsequent slot allocations operate on a stale swap device.
> >
> > Just from your description above, I am not sure how the bug is actually
> > triggered yet. It sounds possible, but I want more detail.
>
> To be honest, I am not deeply familiar with the snapshot code, which is why
> I submitted this as an RFC. However, I believe the race is theoretically
> possible and I was able to trigger it with a simple PoC user program.
>
> (This affects only uswsusp, not in-kernel swsusp, I think, because every
> user thread is frozen before the snapshot is created.)
>
> The race occurs in `power/user.c`
>
> 1. snapshot_open() calls swap_type_of() to find the swap device.
> 2. We get the swap type, but hold no reference at this point.
> 3. [Race Window]: Another thread triggers swapoff() and swapon()
> 4. snapshot_ioctl(SNAPSHOT_ALLOC_SWAP_PAGE) is called.
>    -> The swap device is gone, or its type ID has been reused by another
>       device.

Ah, I see. Thanks for the explanation.

> > Can you show me which code path triggered this bug?
> > e.g. Thread A wants to suspend, with this back trace call graph.
> > Then in this function foo() A grabs the swap device without holding a reference.
> > Meanwhile, thread B is performing a swap off while A is at function foo().
> >
> > > Additionally, grabbing and releasing the swap device reference on every
> > > slot allocation is inefficient across the entire hibernation swap path.
> >
> > If the swap entry is already allocated by the suspend code on that
> > swap device, the follow up allocation does not need to grab the
> > reference again because the swap device's swapped count will not drop
> > to zero until resume.
>
> You are right. Since the swap device is pinned once a swap entry is
> allocated, we could indeed rely on that pinning mechanism to ensure safety
> for subsequent allocations (instead of doing get/put every time).
>
> However, relying on that pinning alone does not protect the window between
> the initial lookup (step 1) and the *first* allocation.

Agree. That place needs fixing. We will make two patches.

Patch 1: fix the swapoff race between lookup and first allocation
on suspend.
swap_type_of() is very tricky for the device swap because of the
conditional lookup of si->start_block matching the offset or not.
That makes this patch very complex.

One idea to brainstorm:

So we can get the reference count on during snapshot_open(), after
checking "root_swap" still points to valid swsusp_resume_device.
Then we release the reference count on "root_swap" during snapshot_release().

That might sidestep the complexity of swap_type_of() doing the
si->start_block checking.

It should fix the bug you described here more simply.

> My proposal is to grab the reference at the lookup point to close this
> initial race.

That is my suggested patch 1.

> If we do that, I believe we can remove the per-slot
> get/put calls entirely, as the initial reference is sufficient to keep the

I suggest that as patch 2. It is an optimization to eliminate the
get/put pairs. It is optional; correctness is fine without it. It
might not be worth the trouble.

> device alive until the operation completes.
>
> Regarding the reference release strategy in this patch:
>
> 1. uswsusp: the reference is released when the snapshot device file
>    is closed (snapshot_release) and in the error paths.
> 2. non-uswsusp: I only added the reference release in the error paths.

That part makes this patch complex and harder to review. Need to
carefully check whether we take the reference count or not.

>
> > About 2: I conclude that on a successful resume, the system state reverts to
> the snapshot point, making an explicit release unnecessary. However,
> I am not 100% certain if this holds true for the swap reference
> context.

That is the part I try to avoid: the very fragmented error condition
for reference counting.
Hopefully, with patch 1 idea we don't need that complexity.

>
> This part is the primary reason I submitted this as an RFC. I
> would appreciate it if you could review this part specifically to
> confirm whether my understanding is correct.

BTW, I can review the swap part, but we also need to get the
suspend/resume maintainer (Rafael?) to review the suspend aspect of
this change as well.

>
> > > Address these issues by holding the swap device reference from the point
> > > the swap device is looked up, and releasing it once at each exit path.
> > > This ensures the device remains valid throughout the operation and
> > > removes the overhead of per-slot reference counting.
> >
> > I want to understand how to trigger the buggy code path first. It
> > might be obvious to you. It is not obvious to me yet.
>
> I hope the explanation above clarifies the trace. Please let me know if
> there are still parts that are not obvious, and I will explain further or
> investigate more.

Yes you did. Thank you.

Chris



* Re: [RFC PATCH v2] mm/swap, PM: hibernate: hold swap device reference across swap operation
  2026-03-09  6:43     ` Chris Li
@ 2026-03-09  7:42       ` YoungJun Park
  2026-03-11  7:31         ` Chris Li
  0 siblings, 1 reply; 6+ messages in thread
From: YoungJun Park @ 2026-03-09  7:42 UTC (permalink / raw)
  To: Chris Li
  Cc: rafael, akpm, kasong, pavel, shikemeng, nphamcs, bhe, baohua,
	usama.arif, linux-pm, linux-mm, hyungjun.cho, youngjun.park

On Sun, Mar 08, 2026 at 11:43:20PM -0700, Chris Li wrote:

> Agree. That place needs fixing. We will make two patches.
> 
> Patch 1: fix the swapoff race between lookup and first allocation
> on suspend.
> swap_type_of() is very tricky for the device swap because of the
> conditional lookup of si->start_block matching the offset or not.
> That makes this patch very complex.
> 
> One idea to brainstorm:
> 
> So we can get the reference count on during snapshot_open(), after
> checking "root_swap" still points to valid swsusp_resume_device.
> Then we release the reference count on "root_swap" during snapshot_release().
> 
> > That might sidestep the complexity of swap_type_of() doing the
> > si->start_block checking.
> 
> It should fix the bug you described here more simply.

While that approach would be great as a minimal fix, I think we still
cannot avoid the following situation.

Until the first swap offset is allocated, we cannot guarantee that swapoff
won't happen. To be safe, I think it is difficult to prevent swapoff
without holding the swap_lock.

So, to stick to the minimal fix principle and only address the currently
possible bug in uswsusp, we could consider:

1) Creating a separate function to grab the reference for uswsusp, and
   put it in snapshot_close().
2) Adding a parameter to swap_type_of() to decide whether to acquire the
   reference or not, and put it in swsusp_close() 

With either strategy, we do not grab the reference when taking an
in-kernel snapshot, and we do not add get/put calls in the alloc/free
paths.

> > My proposal is to grab the reference at the lookup point to close this
> > initial race.
> 
> That is my suggested patch 1.
> 
> > If we do that, I believe we can remove the per-slot
> > get/put calls entirely, as the initial reference is sufficient to keep the
> 
> I suggest that as patch 2. It is an optimization to eliminate the
> get/put pairs. It is optional; correctness is fine without it. It
> might not be worth the trouble.

Yes, I agree. I will split the patch into two as you suggested and think
about it further.

> > device alive until the operation completes.
> >
> > Regarding the reference release strategy in this patch:
> >
> > 1. uswsusp: the reference is released when the snapshot device file
> >    is closed (snapshot_release) and in the error paths.
> > 2. non-uswsusp: I only added the reference release in the error paths.
> 
> That part makes this patch complex and harder to review. Need to
> carefully check whether we take the reference count or not.
> 
> >
> > About 2: I conclude that on a successful resume, the system state reverts to

> > the snapshot point, making an explicit release unnecessary. However,
> > I am not 100% certain if this holds true for the swap reference
> > context.
> 
> That is the part I try to avoid: the very fragmented error condition
> for reference counting.
> Hopefully, with patch 1 idea we don't need that complexity.

I agree with you.
But, I believe it can be a safe modification that can be sufficiently
verified through review.

I would love to hear the thoughts of the hibernation maintainers and other
reviewers on this. Although there are some complex parts, I think this
modification has clear benefits.

Thanks

Best regards,
Youngjun Park



* Re: [RFC PATCH v2] mm/swap, PM: hibernate: hold swap device reference across swap operation
  2026-03-09  7:42       ` YoungJun Park
@ 2026-03-11  7:31         ` Chris Li
  0 siblings, 0 replies; 6+ messages in thread
From: Chris Li @ 2026-03-11  7:31 UTC (permalink / raw)
  To: YoungJun Park
  Cc: rafael, akpm, kasong, pavel, shikemeng, nphamcs, bhe, baohua,
	usama.arif, linux-pm, linux-mm, hyungjun.cho

On Mon, Mar 9, 2026 at 12:42 AM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Sun, Mar 08, 2026 at 11:43:20PM -0700, Chris Li wrote:
>
> > Agree. That place needs fixing. We will make two patches.
> >
> > Patch 1. Fix the swapoff racing between lookup and first allocation
> > on suspend.
> > swap_type_of() is very tricky for the device swap because of the
> > conditional lookup of the si->start_block matching the offset or not.
> > That makes this patch very complex.
> >
> > One idea to brainstorm:
> >
> > So we can take the reference count during snapshot_open(), after
> > checking that "root_swap" still points to a valid swsusp_resume_device.
> > Then we release the reference count on "root_swap" during snapshot_release().
> >
> > That might sidestep the complexity of swap_type_of() doing the
> > si->start_block checking.
> >
> > It should fix the bug you described here more simply.
>
> While that approach would be great as a minimal fix, I think we still
> cannot avoid the following situation.
>
> Until the first swap offset is allocated, we cannot guarantee that swapoff
> won't happen. To be safe, I think it is difficult to prevent swapoff
> without holding the swap_lock.

Grab the swap device reference at the beginning of snapshot_open(),
before any swap offset allocation, and hold it until snapshot_close().
That should prevent the swapoff: swapoff must wait until the reference
is dropped at snapshot_close().
I assume the swap entry allocation happens between snapshot_open() and
snapshot_close().

> So, to stick to the minimal fix principle and only address the currently
> possible bug in uswsusp, we could consider:
>
> 1) Creating a separate function to grab the reference for uswsusp, and
>    put it in snapshot_close().
Ack.

> 2) Adding a parameter to swap_type_of() to decide whether to acquire the
>    reference or not, and put it in swsusp_close()

In my mind, shouldn't point 1) alone be enough? I am not sure 2) is needed.

Chris

>
> With either approach, we do not grab the reference when taking an
> in-kernel snapshot, and we do not add get/put calls around each slot
> alloc/free.
>
> > > My proposal is to grab the reference at the lookup point to close this
> > > initial race.
> >
> > That is my suggested patch 1.
> >
> > > If we do that, I believe we can remove the per-slot
> > > get/put calls entirely, as the initial reference is sufficient to keep the
> >
> > I suggest that as patch 2. It is an optimization to eliminate the
> > get/put pairs. It is optional; without it, the code is still fine in
> > terms of correctness. It might not be worth the trouble for patch 2.
>
> Yes, I agree. I will split the patch into two as you suggested and think
> about it further.
>
> > > device alive until the operation completes.
> > >
> > > Regarding the reference release strategy in this patch:
> > >
> > > 1. uswsusp: The reference is released when the snapshot device file
> > >    is closed (snapshot_release()) and on error paths.
> > > 2. non-uswsusp: I only added the reference release in the error paths.
> >
> > That part makes this patch complex and harder to review. Need to
> > carefully check whether we take the reference count or not.
> >
> > >
> > > About 2: I conclude that on a successful resume, the system state reverts to
>
> > > the snapshot point, making an explicit release unnecessary. However,
> > > I am not 100% certain if this holds true for the swap reference
> > > context.
> >
> > That is the part I try to avoid: the very fragmented error condition
> > for reference counting.
> > Hopefully, with patch 1 idea we don't need that complexity.
>
> I agree with you.
> But, I believe it can be a safe modification that can be sufficiently
> verified through review.
>
> I would love to hear the thoughts of the hibernation maintainers and other
> reviewers on this. Although there are some complex parts, I think this
> modification has clear benefits.
>
> Thanks
>
> Best regards,
> Youngjun Park
>



end of thread, other threads:[~2026-03-11  7:31 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-06  2:46 [RFC PATCH v2] mm/swap, PM: hibernate: hold swap device reference across swap operation Youngjun Park
2026-03-06  6:55 ` Chris Li
2026-03-06  8:02   ` YoungJun Park
2026-03-09  6:43     ` Chris Li
2026-03-09  7:42       ` YoungJun Park
2026-03-11  7:31         ` Chris Li
