Date: Sat, 4 Jan 2025 13:46:27 +0800
From: Baoquan He <bhe@redhat.com>
To: Kairui Song
Cc: linux-mm@kvack.org, Andrew Morton, Chris Li, Barry Song, Ryan Roberts,
	Hugh Dickins, Yosry Ahmed, "Huang, Ying", Nhat Pham, Johannes Weiner,
	Kalesh Singh, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3 07/13] mm, swap: hold a reference during scan and cleanup flag usage
In-Reply-To: <20241230174621.61185-8-ryncsn@gmail.com>
References: <20241230174621.61185-1-ryncsn@gmail.com> <20241230174621.61185-8-ryncsn@gmail.com>

On 12/31/24 at 01:46am, Kairui Song wrote:
> From: Kairui Song
>
> The flag SWP_SCANNING was used as an indicator of whether a device
> is being scanned for allocation, and prevents swapoff. Combined with
> SWP_WRITEOK, they work as a set of barriers for a clean swapoff:
>
> 1. Swapoff clears SWP_WRITEOK, allocation requests will see
>    ~SWP_WRITEOK and abort as it's serialized by si->lock.
> 2. Swapoff unuses all allocated entries.
> 3. Swapoff waits for the SWP_SCANNING flag to be cleared, so ongoing
>    allocations will stop, preventing UAF.
> 4. Now swapoff can free everything safely.
>
> This makes the allocation path have a hard dependency on si->lock:
> allocation always has to acquire si->lock first to set SWP_SCANNING
> and check SWP_WRITEOK.
>
> This commit removes this flag, and just uses the existing per-CPU
> refcount instead to prevent UAF in step 3, which serves well for
> such usage without any dependency on si->lock, and scales very well
> too. Just hold a reference during the whole scan and allocation
> process. Swapoff will kill and wait for the counter.
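
(An aside for readers following along: the lifecycle described above
maps onto the existing percpu_ref API roughly as below. This is my own
minimal sketch, not code from the patch; the helper names scan_sketch()
and swapoff_sketch() are purely illustrative.)

	/* Allocation side: pin the device for the whole scan. */
	static bool scan_sketch(struct swap_info_struct *si)
	{
		/* Fails once swapoff has killed the counter. */
		if (!percpu_ref_tryget_live(&si->users))
			return false;
		/*
		 * ... scan clusters and allocate entries; si->lock is
		 * no longer needed just to keep the device alive ...
		 */
		percpu_ref_put(&si->users);	/* drop the ref when done */
		return true;
	}

	/* Swapoff side: kill the counter, then wait out all holders. */
	static void swapoff_sketch(struct swap_info_struct *si)
	{
		percpu_ref_kill(&si->users);	/* tryget_live() now fails */
		wait_for_completion(&si->comp);	/* fired by the last put */
	}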
>
> And for preventing any allocation from happening after step 1, so the
> unuse in step 2 can ensure all slots are free, swapoff will acquire
> the ci->lock of each cluster one by one to ensure all allocations
> see ~SWP_WRITEOK and abort.

Changing to use si->users is great, but I am wondering why we need to
acquire each ci->lock now. After step 1, we have cleared SWP_WRITEOK
and taken the si off the swap_avail_heads list. No matter what, we just
need to wait for p->comp's completion and continue, so why bother
looping to acquire each ci->lock?
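
(To make the pattern under discussion concrete: an empty lock/unlock
pair acts as a barrier that flushes out any allocator still inside its
critical section. Below is my own minimal sketch of the two sides, not
code quoted from the patch; alloc_from_cluster() is a stand-in name.)

	/* Allocator, under ci->lock, checking the flag: */
	spin_lock(&ci->lock);
	if (si->flags & SWP_WRITEOK)
		alloc_from_cluster(si, ci);	/* illustrative helper */
	spin_unlock(&ci->lock);

	/*
	 * Swapoff, after clearing SWP_WRITEOK: once it has taken and
	 * dropped a cluster's lock, no allocation that sampled the old
	 * flags can still be inside that critical section, so the
	 * unuse pass in step 2 sees a stable set of slots.
	 */
	spin_lock(&ci->lock);
	spin_unlock(&ci->lock);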
>
> This way these dependencies on si->lock are gone. And worth noting we
> can't kill the refcount as the first step for swapoff, as the unuse
> process has to acquire the refcount.
>
> Signed-off-by: Kairui Song
> ---
>  include/linux/swap.h |  1 -
>  mm/swapfile.c        | 90 ++++++++++++++++++++++++++++----------------
>  2 files changed, 57 insertions(+), 34 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index e1eeea6307cd..02120f1005d5 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -219,7 +219,6 @@ enum {
>  	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
>  	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
>  	/* add others here before... */
> -	SWP_SCANNING	= (1 << 14),	/* refcount in scan_swap_map */
>  };
>  
>  #define SWAP_CLUSTER_MAX 32UL
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index e6e58cfb5178..99fd0b0d84a2 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -658,6 +658,8 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
>  {
>  	unsigned int nr_pages = 1 << order;
>  
> +	lockdep_assert_held(&ci->lock);
> +
>  	if (!(si->flags & SWP_WRITEOK))
>  		return false;
>  
> @@ -1059,8 +1061,6 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
>  {
>  	int n_ret = 0;
>  
> -	si->flags += SWP_SCANNING;
> -
>  	while (n_ret < nr) {
>  		unsigned long offset = cluster_alloc_swap_entry(si, order, usage);
>  
> @@ -1069,8 +1069,6 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
>  		slots[n_ret++] = swp_entry(si->type, offset);
>  	}
>  
> -	si->flags -= SWP_SCANNING;
> -
>  	return n_ret;
>  }
>  
> @@ -1112,6 +1110,22 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
>  	return cluster_alloc_swap(si, usage, nr, slots, order);
>  }
>  
> +static bool get_swap_device_info(struct swap_info_struct *si)
> +{
> +	if (!percpu_ref_tryget_live(&si->users))
> +		return false;
> +	/*
> +	 * Guarantee the si->users are checked before accessing other
> +	 * fields of swap_info_struct, and si->flags (SWP_WRITEOK) is
> +	 * up to date.
> +	 *
> +	 * Paired with the spin_unlock() after setup_swap_info() in
> +	 * enable_swap_info(), and smp_wmb() in swapoff.
> +	 */
> +	smp_rmb();
> +	return true;
> +}
> +
>  int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
>  {
>  	int order = swap_entry_order(entry_order);
> @@ -1139,13 +1153,16 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
>  			/* requeue si to after same-priority siblings */
>  			plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
>  			spin_unlock(&swap_avail_lock);
> -			spin_lock(&si->lock);
> -			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> -					n_goal, swp_entries, order);
> -			spin_unlock(&si->lock);
> -			if (n_ret || size > 1)
> -				goto check_out;
> -			cond_resched();
> +			if (get_swap_device_info(si)) {
> +				spin_lock(&si->lock);
> +				n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
> +						n_goal, swp_entries, order);
> +				spin_unlock(&si->lock);
> +				put_swap_device(si);
> +				if (n_ret || size > 1)
> +					goto check_out;
> +				cond_resched();
> +			}
>  
>  			spin_lock(&swap_avail_lock);
>  			/*
> @@ -1296,16 +1313,8 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>  	si = swp_swap_info(entry);
>  	if (!si)
>  		goto bad_nofile;
> -	if (!percpu_ref_tryget_live(&si->users))
> +	if (!get_swap_device_info(si))
>  		goto out;
> -	/*
> -	 * Guarantee the si->users are checked before accessing other
> -	 * fields of swap_info_struct.
> -	 *
> -	 * Paired with the spin_unlock() after setup_swap_info() in
> -	 * enable_swap_info().
> -	 */
> -	smp_rmb();
>  	offset = swp_offset(entry);
>  	if (offset >= si->max)
>  		goto put_out;
> @@ -1785,10 +1794,13 @@ swp_entry_t get_swap_page_of_type(int type)
>  		goto fail;
>  
>  	/* This is called for allocating swap entry, not cache */
> -	spin_lock(&si->lock);
> -	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
> -		atomic_long_dec(&nr_swap_pages);
> -	spin_unlock(&si->lock);
> +	if (get_swap_device_info(si)) {
> +		spin_lock(&si->lock);
> +		if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
> +			atomic_long_dec(&nr_swap_pages);
> +		spin_unlock(&si->lock);
> +		put_swap_device(si);
> +	}
>  fail:
>  	return entry;
>  }
> @@ -2562,6 +2574,25 @@ bool has_usable_swap(void)
>  	return ret;
>  }
>  
> +/*
> + * Called after clearing SWP_WRITEOK, ensures cluster_alloc_range
> + * see the updated flags, so there will be no more allocations.
> + */
> +static void wait_for_allocation(struct swap_info_struct *si)
> +{
> +	unsigned long offset;
> +	unsigned long end = ALIGN(si->max, SWAPFILE_CLUSTER);
> +	struct swap_cluster_info *ci;
> +
> +	BUG_ON(si->flags & SWP_WRITEOK);
> +
> +	for (offset = 0; offset < end; offset += SWAPFILE_CLUSTER) {
> +		ci = lock_cluster(si, offset);
> +		unlock_cluster(ci);
> +		offset += SWAPFILE_CLUSTER;
> +	}
> +}
> +
>  SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  {
>  	struct swap_info_struct *p = NULL;
> @@ -2632,6 +2663,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  	spin_unlock(&p->lock);
>  	spin_unlock(&swap_lock);
>  
> +	wait_for_allocation(p);
> +
>  	disable_swap_slots_cache_lock();
>  
>  	set_current_oom_origin();
> @@ -2674,15 +2707,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
>  	spin_lock(&p->lock);
>  	drain_mmlist();
>  
> -	/* wait for anyone still in scan_swap_map_slots */
> -	while (p->flags >= SWP_SCANNING) {
> -		spin_unlock(&p->lock);
> -		spin_unlock(&swap_lock);
> -		schedule_timeout_uninterruptible(1);
> -		spin_lock(&swap_lock);
> -		spin_lock(&p->lock);
> -	}
> -
>  	swap_file = p->swap_file;
>  	p->swap_file = NULL;
>  	p->max = 0;
> --
> 2.47.1
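
(A closing note on the barrier comment in get_swap_device_info() above:
this is the classic publish/observe pairing. Below is my own minimal
sketch of the two sides; only the tryget_live()/smp_rmb() half matches
the patch, the swapoff half is simplified from what the commit message
describes and is not a verbatim quote of the kernel code.)

	/* Swapoff side (simplified): clear the flag, then publish. */
	si->flags &= ~SWP_WRITEOK;	/* step 1 of the commit message */
	smp_wmb();			/* flag store visible before... */
	percpu_ref_kill(&si->users);	/* ...tryget_live() starts failing */

	/* Allocator side, as in get_swap_device_info(): */
	if (!percpu_ref_tryget_live(&si->users))
		return false;		/* device is going away, abort */
	smp_rmb();			/* pairs with the smp_wmb() above: a
					 * successful tryget also observes
					 * the current si->flags */
	return true;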