From: Kairui Song
To: linux-mm@kvack.org
Cc: Andrew Morton, Chris Li, Barry Song, Ryan Roberts, Hugh Dickins,
 Yosry Ahmed, "Huang, Ying", Nhat Pham, Johannes Weiner, Kalesh Singh,
 linux-kernel@vger.kernel.org, Kairui Song
Subject: [PATCH v3 07/13] mm, swap: hold a reference during scan and cleanup flag usage
Date: Tue, 31 Dec 2024 01:46:15 +0800
Message-ID: <20241230174621.61185-8-ryncsn@gmail.com>
In-Reply-To: <20241230174621.61185-1-ryncsn@gmail.com>
References: <20241230174621.61185-1-ryncsn@gmail.com>
Reply-To: Kairui Song
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
From: Kairui Song

The flag SWP_SCANNING was used as an indicator of whether a device is
being scanned for allocation, and to prevent swapoff.
Combined with SWP_WRITEOK, they work as a set of barriers for a clean
swapoff:

1. Swapoff clears SWP_WRITEOK, so allocation requests will see
   ~SWP_WRITEOK and abort, as this is serialized by si->lock.
2. Swapoff unuses all allocated entries.
3. Swapoff waits for the SWP_SCANNING flag to be cleared, so ongoing
   allocations will stop, preventing UAF.
4. Now swapoff can free everything safely.

This gives the allocation path a hard dependency on si->lock:
allocations always have to acquire si->lock first to set SWP_SCANNING
and check SWP_WRITEOK.

This commit removes the flag and instead uses the existing per-CPU
refcount to prevent UAF in step 3. The refcount serves this purpose
well, has no dependency on si->lock, and scales very well: just hold a
reference during the whole scan and allocation process. Swapoff kills
the counter and waits for it to drain.

And to prevent any allocation from happening after step 1, so the unuse
in step 2 can ensure all slots are free, swapoff acquires the ci->lock
of each cluster one by one, ensuring all allocations see ~SWP_WRITEOK
and abort.

This way these dependencies on si->lock are gone. Worth noting, we
can't kill the refcount as the first step of swapoff, as the unuse
process has to acquire the refcount.

Signed-off-by: Kairui Song
---
 include/linux/swap.h |  1 -
 mm/swapfile.c        | 90 ++++++++++++++++++++++++++++----------------
 2 files changed, 57 insertions(+), 34 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index e1eeea6307cd..02120f1005d5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -219,7 +219,6 @@ enum {
 	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
 	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
 	/* add others here before... */
-	SWP_SCANNING	= (1 << 14), /* refcount in scan_swap_map */
 };

 #define SWAP_CLUSTER_MAX 32UL

diff --git a/mm/swapfile.c b/mm/swapfile.c
index e6e58cfb5178..99fd0b0d84a2 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -658,6 +658,8 @@ static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster
 {
 	unsigned int nr_pages = 1 << order;

+	lockdep_assert_held(&ci->lock);
+
 	if (!(si->flags & SWP_WRITEOK))
 		return false;

@@ -1059,8 +1061,6 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
 {
 	int n_ret = 0;

-	si->flags += SWP_SCANNING;
-
 	while (n_ret < nr) {
 		unsigned long offset = cluster_alloc_swap_entry(si, order, usage);

@@ -1069,8 +1069,6 @@ static int cluster_alloc_swap(struct swap_info_struct *si,
 		slots[n_ret++] = swp_entry(si->type, offset);
 	}

-	si->flags -= SWP_SCANNING;
-
 	return n_ret;
 }

@@ -1112,6 +1110,22 @@ static int scan_swap_map_slots(struct swap_info_struct *si,
 	return cluster_alloc_swap(si, usage, nr, slots, order);
 }

+static bool get_swap_device_info(struct swap_info_struct *si)
+{
+	if (!percpu_ref_tryget_live(&si->users))
+		return false;
+	/*
+	 * Guarantee the si->users are checked before accessing other
+	 * fields of swap_info_struct, and si->flags (SWP_WRITEOK) is
+	 * up to date.
+	 *
+	 * Paired with the spin_unlock() after setup_swap_info() in
+	 * enable_swap_info(), and smp_wmb() in swapoff.
+	 */
+	smp_rmb();
+	return true;
+}
+
 int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 {
 	int order = swap_entry_order(entry_order);
@@ -1139,13 +1153,16 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order)
 			/* requeue si to after same-priority siblings */
 			plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 			spin_unlock(&swap_avail_lock);
-			spin_lock(&si->lock);
-			n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
-					n_goal, swp_entries, order);
-			spin_unlock(&si->lock);
-			if (n_ret || size > 1)
-				goto check_out;
-			cond_resched();
+			if (get_swap_device_info(si)) {
+				spin_lock(&si->lock);
+				n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE,
+						n_goal, swp_entries, order);
+				spin_unlock(&si->lock);
+				put_swap_device(si);
+				if (n_ret || size > 1)
+					goto check_out;
+				cond_resched();
+			}

 			spin_lock(&swap_avail_lock);
 			/*
@@ -1296,16 +1313,8 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 	si = swp_swap_info(entry);
 	if (!si)
 		goto bad_nofile;
-	if (!percpu_ref_tryget_live(&si->users))
+	if (!get_swap_device_info(si))
 		goto out;
-	/*
-	 * Guarantee the si->users are checked before accessing other
-	 * fields of swap_info_struct.
-	 *
-	 * Paired with the spin_unlock() after setup_swap_info() in
-	 * enable_swap_info().
-	 */
-	smp_rmb();

 	offset = swp_offset(entry);
 	if (offset >= si->max)
 		goto put_out;
@@ -1785,10 +1794,13 @@ swp_entry_t get_swap_page_of_type(int type)
 		goto fail;

 	/* This is called for allocating swap entry, not cache */
-	spin_lock(&si->lock);
-	if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
-		atomic_long_dec(&nr_swap_pages);
-	spin_unlock(&si->lock);
+	if (get_swap_device_info(si)) {
+		spin_lock(&si->lock);
+		if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0))
+			atomic_long_dec(&nr_swap_pages);
+		spin_unlock(&si->lock);
+		put_swap_device(si);
+	}
 fail:
 	return entry;
 }
@@ -2562,6 +2574,25 @@ bool has_usable_swap(void)
 	return ret;
 }

+/*
+ * Called after clearing SWP_WRITEOK, ensures cluster_alloc_range
+ * sees the updated flags, so there will be no more allocations.
+ */
+static void wait_for_allocation(struct swap_info_struct *si)
+{
+	unsigned long offset;
+	unsigned long end = ALIGN(si->max, SWAPFILE_CLUSTER);
+	struct swap_cluster_info *ci;
+
+	BUG_ON(si->flags & SWP_WRITEOK);
+
+	for (offset = 0; offset < end; offset += SWAPFILE_CLUSTER) {
+		ci = lock_cluster(si, offset);
+		unlock_cluster(ci);
+	}
+}
+
 SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 {
 	struct swap_info_struct *p = NULL;
@@ -2632,6 +2663,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	spin_unlock(&p->lock);
 	spin_unlock(&swap_lock);

+	wait_for_allocation(p);
+
 	disable_swap_slots_cache_lock();

 	set_current_oom_origin();
@@ -2674,15 +2707,6 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	spin_lock(&p->lock);
 	drain_mmlist();

-	/* wait for anyone still in scan_swap_map_slots */
-	while (p->flags >= SWP_SCANNING) {
-		spin_unlock(&p->lock);
-		spin_unlock(&swap_lock);
-		schedule_timeout_uninterruptible(1);
-		spin_lock(&swap_lock);
-		spin_lock(&p->lock);
-	}
-
 	swap_file = p->swap_file;
 	p->swap_file = NULL;
 	p->max = 0;
-- 
2.47.1