From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) (using TLSv1 with cipher DHE-RSA-AES256-SHA (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 223611093188 for ; Fri, 20 Mar 2026 07:59:51 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 732686B0378; Fri, 20 Mar 2026 03:59:50 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 6E2D46B037A; Fri, 20 Mar 2026 03:59:50 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 5F84B6B037D; Fri, 20 Mar 2026 03:59:50 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id 4DF6D6B0378 for ; Fri, 20 Mar 2026 03:59:50 -0400 (EDT) Received: from smtpin09.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id D6F5313BC4E for ; Fri, 20 Mar 2026 07:59:49 +0000 (UTC) X-FDA: 84565692498.09.FE3BB85 Received: from lgeamrelo03.lge.com (lgeamrelo03.lge.com [156.147.51.102]) by imf08.hostedemail.com (Postfix) with ESMTP id 63907160005 for ; Fri, 20 Mar 2026 07:59:47 +0000 (UTC) Authentication-Results: imf08.hostedemail.com; dkim=none; spf=pass (imf08.hostedemail.com: domain of youngjun.park@lge.com designates 156.147.51.102 as permitted sender) smtp.mailfrom=youngjun.park@lge.com; dmarc=pass (policy=none) header.from=lge.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1773993588; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=UXdorD2jVazQ36XV9+Ae1eLRLoWmaW5vTWLzOrWXTms=; b=sZkwuyALKrBwTgeeYBNwziO0kX0M0CkZvB69OTCZY+L2kYtD9BVYtitBMelz4AUYdlN801 VGP/Vu1BtbA8CsqOuesgeYLmTYk104FwCgZaz45XmJuqGOlun53EBYbY1WW4sJ58FYkBrL 6SXFGv1SY2en6+Dm0lqmje39lFcdb4o= ARC-Authentication-Results: i=1; imf08.hostedemail.com; dkim=none; spf=pass (imf08.hostedemail.com: domain of youngjun.park@lge.com designates 156.147.51.102 as permitted sender) smtp.mailfrom=youngjun.park@lge.com; dmarc=pass (policy=none) header.from=lge.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1773993588; a=rsa-sha256; cv=none; b=ipK/6rjPIA3N6q1+h5kvCXgxWQdGD8EjodXPvY8l7IDU6aV5fgaaMKQKAe2w/6pdBJsRns XThgZcDpStKyxkZ6lHc7zsCqoMUeaEZbtCKk6tUiT4GxFjJ5EEOq+94lIhVWJLqqtrfrx6 8pQHwro2vfFWltPkY5fHyhbQ+G8OJiQ= Received: from unknown (HELO yjaykim-PowerEdge-T330) (10.177.112.156) by 156.147.51.102 with ESMTP; 20 Mar 2026 16:59:43 +0900 X-Original-SENDERIP: 10.177.112.156 X-Original-MAILFROM: youngjun.park@lge.com Date: Fri, 20 Mar 2026 16:59:43 +0900 From: YoungJun Park To: Kairui Song Cc: rafael@kernel.org, akpm@linux-foundation.org, chrisl@kernel.org, kasong@tencent.com, pavel@kernel.org, shikemeng@huaweicloud.com, nphamcs@gmail.com, bhe@redhat.com, baohua@kernel.org, usama.arif@linux.dev, linux-pm@vger.kernel.org, linux-mm@kvack.org Subject: Re: [PATCH v4 1/3] mm/swap, PM: hibernate: fix swapoff race in uswsusp by getting swap reference Message-ID: References: <20260317181318.2517015-1-youngjun.park@lge.com> <20260317181318.2517015-2-youngjun.park@lge.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: X-Rspamd-Queue-Id: 63907160005 X-Stat-Signature: uhuo7a4ene7rhke98bcs8cw4dqrpa3uw X-Rspam-User: X-Rspamd-Server: rspam05 X-HE-Tag: 1773993587-386597 X-HE-Meta: U2FsdGVkX18Pl6W83LPIOK41CuQxyL3uH7sBCfWdQSEwaofKbv/F5GiZagm9XMTrDALq6YaA1dLuDJPcQvh/0+yHaHPyUbdm1bTK1p779/i8rQz7NV2/ZCPcfLPSCI7u3CTn7F8XDWyuZZvEcyLOlyCpviANYGLPD6otk6Kxcxut7atQDTr5jITC2rJruzA4FE0CqcSgUTaKoFLvCGlWrriTloOO5V5kebzPl3D4N5V+f530SMlpf/rkG3dW/Rn5+HJ6sW+spVrVX5wDfEDhfLZW6EPX2Nw4xdR/G/qTCHsheBoVOR4HfUQrbj+7gtdgHrQWqTRm6Yj3n8uj+W541V4rfQKzGlD0xOUz9rWokP3Q16PAFIhp78C3hjsGd9yU+2xZTNaO+Mrn96f6AZJF/sl+erhofi4FICOPUGi6Qli3JK4NLzcIyyzfYAfDI6wwjOEWf9qvboBMUJJ88n+39Xpf47QOHPyo+5Q7yFfukwxUbhui1TXGfM5Va5PLlBGTtUxRILzUvpGnq6prjxo+hfnSf/aCFJqKySM5UJhmZ0eRz9atODry9PnBbRve9bo8z1GHLyt6VCt1NgeoGHzh2z7eblpuuc7l20DEU9h5yToCNpyB4X2+oEWFCK3FGQjReMyhpTPX9bdu26dotfp3/ocFG1n94kdudFBHqve50t+blomvl3mpEcJEtDM7HynLbxpqyqah/cg53wqXGTqIM3thlJLXK/deLdOt3Zjvo+FmeFmW8dRtf0cz8i3lvRflKsUR/jgQ/lxpohpf0q9as1qcRjFPA6kItdGxivHxRiDaxwFFAv6ki0yHSAhJBqu2dRcNQxPYbiY45X0FPX2bTzSbqg60nessPtFUPjLtnZX8av2Tez1bWg6H2EUV9TS1FJq4wu3W03uxtPeYiqPCTKEjQnUhJdDCpKECJqHqVc2xFUEfd6NaZjHwmN+iL6LbQ/O93SozgB2dq6aPBCK 2XWBROQe c2qIbVNgf3bWI/ZqyLsa0Fx2SOgj3iVxr3m1VN6RMh2wbxDzqtQTsOSNZiatazp6agiHtLyOfbvBkKaFIWA3P5rJMfMtq+voFMK8/076vNGIEy0mY8C/aEqkotlyWejHOgSVFkHkA9SbMu9WNkveChY7rIxvc4+666MJEwLR/mEkhLV5VUeOu7i6idkA1RMSxscxea19I1PfKbzFjHKpAO+TgQ6I6mhc0lKnUPynGJx+6Bxd2UP/uKZ/SEg== Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Fri, Mar 20, 2026 at 12:34:02AM +0800, Kairui Song wrote: > On Wed, Mar 18, 2026 at 03:13:16AM +0800, Youngjun Park wrote: > Hi Youngjun, > > Thanks! This set of API looks better that before. Hello Kairui! :) > > > > -/* > > - * Find the swap type that corresponds to given device (if any). > > - * > > - * @offset - number of the PAGE_SIZE-sized block of the device, starting > > - * from 0, in which the swap header is expected to be located. > > - * > > - * This is needed for the suspend to disk (aka swsusp). > > - */ > > -int swap_type_of(dev_t device, sector_t offset) > > +static int swap_type_of(dev_t device, sector_t offset) > > Maybe rename it while at it, it's a internal helper now and > no one is using it. I will change! > > +/* > > + * Finds the swap type and safely acquires a reference to the swap device > > + * to prevent race conditions with swapoff. > > + * > > + * This should be used in environments like uswsusp where a race condition > > + * exists between configuring the resume device and allocating a swap slot. > > + * For sysfs hibernation where user-space is frozen (making swapoff > > + * impossible), use find_hibernation_swap_type() instead. > > + * > > + * The caller must drop the reference using put_hibernation_swap_type(). > > + */ > > You can follow the kernel doc format. I will convert this to proper kernel-doc format as well. Thanks. > > +int get_hibernation_swap_type(dev_t device, sector_t offset) > > +{ > > + int type; > > + struct swap_info_struct *sis; > > + > > + spin_lock(&swap_lock); > > + type = swap_type_of(device, offset); > > + sis = swap_type_to_info(type); > > + if (!sis || !get_swap_device_info(sis)) > > + type = -1; > > + > > + spin_unlock(&swap_lock); > > + return type; > > +} > > + > > +/* > > + * Drops the reference to the swap device previously acquired by > > + * get_hibernation_swap_type(). > > + */ > > +void put_hibernation_swap_type(int type) > > +{ > > + struct swap_info_struct *sis; > > + > > + sis = swap_type_to_info(type); > > + if (!sis) > > + return; > > I think it should never see sis == NULL here? It can happen in the following cases: 1. The snapshot device is not configured at open time (i.e. no resume= parameter and userspace intends to configure it later via ioctl). 2. The process exits or snapshot_release() is called with type = -1. Although callers would normally check `type >= 0`, this helper already handles invalid types defensively, so an extra check at the call site is not strictly necessary. > This looks really bad... It means swapoff will hung unless user send a > signal to stop it? How does that happen? > > If hibernate may hold a reference without allocating any slot for a > long time, then maybe we better introduce a new si->flags bit to > indicate a device is used by hibernate, so swapoff can abort early > with a -EBUSY. > > If hibernate only holds the swap device reference after it allocated > any slot, then we can avoid use a flag but use a hibernate type in > swap table instead: > > diff --git a/mm/swap_table.h b/mm/swap_table.h > index 8415ffbe2b9c..025e004fc2e7 100644 > --- a/mm/swap_table.h > +++ b/mm/swap_table.h > @@ -25,6 +25,7 @@ struct swap_table { > * PFN: | SWAP_COUNT |------ PFN -------|10| - Cached slot > * Pointer: |----------- Pointer ----------|100| - (Unused) > * Bad: |------------- 1 -------------|1000| - Bad slot > + * Exclusive:|---- 0 ----| ------ 1 -----|10000| - Exclusive usage > * > * SWAP_COUNT is `SWP_TB_COUNT_BITS` long, each entry is an atomic long. > * > @@ -46,6 +47,8 @@ struct swap_table { > * because only the lower three bits can be used as a marker for 8 bytes > * aligned pointers. > * > + * - Exclusive usage: e.g. for hibernation. > + * > * - Bad: Swap slot is reserved, protects swap header or holes on swap devices. > */ > > Both readahead and swapoff (unuse scan) will abort upon seeing such slot, > we solve both readahead and swapoff issue together. > > How do you think? We can use the flag idea first. That makes sense. I also considered the pinning flag approach (and mentioned my previously rejected strategy in response to Andrew’s question). Based on your feedback, I’m now convinced that introducing a dedicated SWP_HIBERNATION pinning flag in `si->flags` is the right direction. It makes the interaction with swapoff explicit and avoids long waits or implicit dependency on slot allocation timing. In the future, this pinning flag could potentially be reused in other swap paths if we ever need to distinguish hibernation-specific allocations. (this part is just imagination) Regarding the swap table “exclusive bit” idea: I don’t think it fits well here because there is a window between selecting the swap device type and actually allocating a swap slot. During that gap, the user context may change, so encoding exclusivity at the slot level would not fully cover the race. So I will proceed with the pinning flag approach ! > > */ > > percpu_ref_kill(&p->users); > > synchronize_rcu(); > > - wait_for_completion(&p->comp); > > + err = wait_for_completion_interruptible(&p->comp); > > + if (err) { > > + percpu_ref_resurrect(&p->users); > > + synchronize_rcu(); > > Do we need a synchronize_rcu here? It's required above because we are > releasing the resources so all RCU readers must exit from grace > period. But here we are aborting and not releasing anything so > there is no such need? This strategy has been dropped with the introduction of the SWP_HIBERNATION pinning flag. For clarification: the `synchronize_rcu()` here was intended to wait for the final percpu_ref callback, since that callback runs in RCU context. However, as you pointed out, since we are aborting rather than releasing resources, that extra grace period is likely unnecessary. I will update the patch accordingly and include additional testing results. Thanks again for the valuable review. Best regards, Youngjun