linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Chris Li <chrisl@kernel.org>
To: YoungJun Park <youngjun.park@lge.com>
Cc: rafael@kernel.org, akpm@linux-foundation.org, kasong@tencent.com,
	 pavel@kernel.org, shikemeng@huaweicloud.com, nphamcs@gmail.com,
	 bhe@redhat.com, baohua@kernel.org, usama.arif@linux.dev,
	 linux-pm@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH v2] mm/swap, PM: hibernate: hold swap device reference across swap operation
Date: Sun, 8 Mar 2026 23:43:20 -0700	[thread overview]
Message-ID: <CACePvbVp=9PM=LUdL=aq8G2Svy+v04pBnf3ygRY+xW3WEHNm9A@mail.gmail.com> (raw)
In-Reply-To: <aaqKJQeO8wLQL7Zn@yjaykim-PowerEdge-T330>

On Fri, Mar 6, 2026 at 12:02 AM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Thu, Mar 05, 2026 at 10:55:15PM -0800, Chris Li wrote:
> > On Thu, Mar 5, 2026 at 6:46 PM Youngjun Park <youngjun.park@lge.com> wrote:
> > >
> > > Currently, in the uswsusp path, only the swap type value is retrieved at
> > > lookup time without holding a reference. If swapoff races after the type
> > > is acquired, subsequent slot allocations operate on a stale swap device.
> >
> > Just from you above description, I am not sure how the bug is actually
> > triggered yet. That sounds possible. I want more detail.
>
> To be honest, I am not deeply familiar with the snapshot code, which is why
> I submitted this as an RFC. However, I believe the race is theoretically
> possible and I was able to trigger it with a simple PoC user program.
>
> (not in-kernel swsusp as I think, cuz every user thread freezed
> before creating snapshot, only on uswsusp)
>
> The race occurs in `power/user.c`
>
> 1. snapshot_open() calls swap_type_of() to find the swap device.
> 2. We get the swap type, but hold no reference at this point.
> 3. [Race Window]: Another thread triggers swapoff() and swapon()
> 4. snapshot_ioctl(SNAPSHOT_ALLOC_SWAP_PAGE) is called.
>    -> The swap device is gone or the type ID is reused by another device or
>       swap device is missing.

Ah, I see. Thanks for the explanation.

> > Can you show me which code path triggered this bug?
> > e.g. Thread A wants to suspend, with this back trace call graph.
> > Then in this function foo() A grabs the swap device without holding a reference.
> > Meanwhile, thread B is performing a swap off while A is at function foo().
> >
> > > Additionally, grabbing and releasing the swap device reference on every
> > > slot allocation is inefficient across the entire hibernation swap path.
> >
> > If the swap entry is already allocated by the suspend code on that
> > swap device, the follow up allocation does not need to grab the
> > reference again because the swap device's swapped count will not drop
> > to zero until resume.
>
> You are right. Since the swap device is pinned once a swap entry is
> allocated, we could indeed rely on that pinning mechanism to ensure safety
> for subsequent allocations (instead of doing get/put every time).
>
> However, relying on that pinning alone does not protect the window between
> the initial lookup (step 1) and the *first* allocation.

Agree. That place needs fixing. We will make two patches.

Patch 1. Fix the swap off  racing between lookup and first allocation
on suspend.
swap_type_of() is very tricky for the device swap because of the
conditional lookup of the si->start_block matching the offset or not.
That make this patch very complex.

One idea to brainstorm:

So we can get the reference count on during snapshot_open(), after
checking "root_swap" still points to valid swsusp_resume_device.
Then we release the reference count on "root_swap" during snapshot_release().

That might side step the complexity of  swap_type_of() doing the
si->start_block checking.

It should fix the bug you described here more simply.

> My proposal is to grab the reference at the lookup point to close this
> initial race.

That is my suggested patch 1.

> If we do that, I believe we can remove the per-slot
> get/put calls entirely, as the initial reference is sufficient to keep the

I suggest that as the patch 2. It is an optimization to eliminate the
get/put pairs. It is optional. without it is fine in terms of
correctness. Might not worth the trouble for patch 2.

> device alive until the operation completes.
>
> Regarding the reference release strategy in this patch:
>
> 1. uswsusp: The reference is released when the snapshot device file
>    is closed(snapshot_release) and error paths.
> 2. not uswsusp`: I only added reference release in the error paths.

That part makes this patch complex and harder to review. Need to
carefully check whether we take the reference count or not.

>
> About 2.. I conclude that on a successful resume, the system state reverts to
> the snapshot point, making an explicit release unnecessary. However,
> I am not 100% certain if this holds true for the swap reference
> context.

That is the part I try to avoid: the very fragmented error condition
for reference counting.
Hopefully, with patch 1 idea we don't need that complexity.

>
> This part is the primary reason I submitted this as an RFC. I
> would appreciate it if you could review this part specifically to
> confirm whether my understanding is correct.

BTW, I can review the swap part,  we also need to get the
suspend/resume maintainer (Rafael?) to review the suspend aspect of
this change as well.

>
> > > Address these issues by holding the swap device reference from the point
> > > the swap device is looked up, and releasing it once at each exit path.
> > > This ensures the device remains valid throughout the operation and
> > > removes the overhead of per-slot reference counting.
> >
> > I want to understand how to trigger the buggy code path first. It
> > might be obvious to you. It is not obvious to me yet.
>
> I hope the explanation above clarifies the trace. Please let me know if
> there are still parts that are not obvious, and I will explain further or
> investigate more.

Yes you did. Thank you.

Chris


  reply	other threads:[~2026-03-09  6:43 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-06  2:46 Youngjun Park
2026-03-06  6:55 ` Chris Li
2026-03-06  8:02   ` YoungJun Park
2026-03-09  6:43     ` Chris Li [this message]
2026-03-09  7:42       ` YoungJun Park
2026-03-11  7:31         ` Chris Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CACePvbVp=9PM=LUdL=aq8G2Svy+v04pBnf3ygRY+xW3WEHNm9A@mail.gmail.com' \
    --to=chrisl@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=bhe@redhat.com \
    --cc=kasong@tencent.com \
    --cc=linux-mm@kvack.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=nphamcs@gmail.com \
    --cc=pavel@kernel.org \
    --cc=rafael@kernel.org \
    --cc=shikemeng@huaweicloud.com \
    --cc=usama.arif@linux.dev \
    --cc=youngjun.park@lge.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox