From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id E8A81D41C13 for ; Wed, 13 Nov 2024 08:24:26 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 660E86B00B0; Wed, 13 Nov 2024 03:24:26 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 60F946B00C2; Wed, 13 Nov 2024 03:24:26 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 4B0006B00C3; Wed, 13 Nov 2024 03:24:26 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0012.hostedemail.com [216.40.44.12]) by kanga.kvack.org (Postfix) with ESMTP id 2B8E56B00B0 for ; Wed, 13 Nov 2024 03:24:26 -0500 (EST) Received: from smtpin06.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id CD540C08B4 for ; Wed, 13 Nov 2024 08:24:25 +0000 (UTC) X-FDA: 82780383252.06.C03A83F Received: from mgamail.intel.com (mgamail.intel.com [192.198.163.16]) by imf05.hostedemail.com (Postfix) with ESMTP id 3168910000F for ; Wed, 13 Nov 2024 08:23:01 +0000 (UTC) Authentication-Results: imf05.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b=bpPE2xMc; spf=pass (imf05.hostedemail.com: domain of ying.huang@intel.com designates 192.198.163.16 as permitted sender) smtp.mailfrom=ying.huang@intel.com; dmarc=pass (policy=none) header.from=intel.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1731486070; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=5/qhqag9kCpknmmzgAmab6E49m02+MxSAenzjOTt4g8=; b=Rd2ilonpdOPWxFfKc/KVQdmu37nIP+805jQf8aIUG16RCL/nukI0BgI+KbS9p7v2UEOI5d FaY7F8ZhQGv1rbMaW8agd0/YnxGfJu1V3v7x9o9RWWw2c6ByGzB6NewUoY1mbcK1nc4QTJ 1A+qFvxY56fIHT6Ov29H9a0nP9+sHUk= ARC-Authentication-Results: i=1; imf05.hostedemail.com; dkim=pass header.d=intel.com header.s=Intel header.b=bpPE2xMc; spf=pass (imf05.hostedemail.com: domain of ying.huang@intel.com designates 192.198.163.16 as permitted sender) smtp.mailfrom=ying.huang@intel.com; dmarc=pass (policy=none) header.from=intel.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1731486070; a=rsa-sha256; cv=none; b=5BJL3SarXXi7lCUmc+lbx36PrjJdWo9OV1DYpzM1a1SVoGpH/Lf1qrFR4A21FXf63Vd5r3 K9pWiwK0YO4heTzQVu5XUZTx+KyUsMscR2a1bzEtGGcTrCfxrdp4MBsH2y6uXV3ojIYezk S9/63NTmLrvti7LqQgDOmDmo5CqKf/Q= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1731486263; x=1763022263; h=from:to:cc:subject:in-reply-to:references:date: message-id:mime-version; bh=E6NmNe5r0OVhK9kcBFpuSvzgak15G7aCepFbbnbQmD4=; b=bpPE2xMcFnVSlcVwTA4nuTLsYk4ToHJBD8iEzaH7MbC8HNiIeOH3SFn4 JSCOkEPyr32s3wk33i96zw+x9w7OiKSOY4C3YcntTOspPMivjWE1tyKnn lhetcn71oeVQ5PfurU8aQA3Hq7ulR3vyUjn+6CuxUWiKWXZaQSf2JP+26 hJwTLLBA8LxKW5Cucms2o/MzRiYJQWDCl3TzOBd1O8NoS6v61KoT3y3r1 QYu8nScaaUn6Dabsu1fZmOL2U9f9YzD6AtD9vORUP5bhcr7ksLc3A4B2S lIOXtlHOw3gZo1wCNF833bj4XWJUQoT+Ji9l/hr4Z5H0Tc7NmXmxiHJ8D A==; X-CSE-ConnectionGUID: XUHnW+zpRz6LJb+ey4AXUg== X-CSE-MsgGUID: blFAceKnRbmyeP2+CE2MbA== X-IronPort-AV: E=McAfee;i="6700,10204,11254"; a="18975282" X-IronPort-AV: E=Sophos;i="6.12,150,1728975600"; d="scan'208";a="18975282" Received: from orviesa005.jf.intel.com ([10.64.159.145]) by fmvoesa110.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Nov 2024 00:23:26 -0800 X-CSE-ConnectionGUID: czodun8uQgyH/FGZDqMEdg== X-CSE-MsgGUID: 8UOD8/y0RIGVLrXeGQ4vCg== X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="6.12,150,1728975600"; d="scan'208";a="92757051" Received: from yhuang6-desk2.sh.intel.com (HELO yhuang6-desk2.ccr.corp.intel.com) ([10.238.208.55]) by orviesa005-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 13 Nov 2024 00:23:24 -0800 From: "Huang, Ying" To: Kairui Song Cc: linux-mm@kvack.org, Kairui Song , Andrew Morton , Chris Li , Barry Song , Ryan Roberts , Hugh Dickins , Kalesh Singh , linux-kernel@vger.kernel.org Subject: Re: [PATCH] mm, swap: fix allocation and scanning race with swapoff In-Reply-To: <20241112083414.78174-1-ryncsn@gmail.com> (Kairui Song's message of "Tue, 12 Nov 2024 16:34:14 +0800") References: <20241112083414.78174-1-ryncsn@gmail.com> Date: Wed, 13 Nov 2024 16:19:50 +0800 Message-ID: <87cyizzgt5.fsf@yhuang6-desk2.ccr.corp.intel.com> User-Agent: Gnus/5.13 (Gnus v5.13) MIME-Version: 1.0 Content-Type: text/plain; charset=ascii X-Rspamd-Server: rspam07 X-Rspamd-Queue-Id: 3168910000F X-Stat-Signature: ar6mztu8doissfqx15i4rw9rwr61pk9q X-Rspam-User: X-HE-Tag: 1731486181-743043 X-HE-Meta: U2FsdGVkX18sbMzL0xf3TLb+13uc/hbFpupmjAGu0oCkgWFh5F8mxENSf5p3PBFLDrE8F4XE34jC/4PrEtOLTuYPT7cNx+IORXFYmE9Rf7EVVvphe/ypbzqtwqMm5/tbOi2kJNNyDyC3QN7m2owIkC+3Tn9a/276PPZ++InKhynL0OzpeETktKd0TygIR+kJfO+hiyRimXhuQFoYU2cuQYPDdruhgsghQ/wpu4r9VbHoc+fQk6I4WQ2xXCJz3W82DEFdkVCt5qpojg7UKoTSS02tQhLDWr9VrmJ6IPAs8sOVb0wDwVugsW9Cubwt6QxGc0I2tdCPBSvUO7IFqXdfz/a8P4idHXTxJhckhsglCSEVwylnZhHfljw8LMZPGucRB+po0A70J2A5AIa1tFl5jDhlT+1nq0Yba5a0H6rnTipX4oQbgS9U+lzr78zkV2Sy6USJ0BNT8CDiIKTlDTEYwG+h5JPlqy4QiOMdrzh7QCJH4gG7Sbfo4/koK+FJwy9M6ZH8mv5iMaRzRcW2s6D9t0DsLjayi+GdKWCY+aZNMjZg7xidFRvlKk4G39kGdFu1whEKeVoRbfOtOoH09qZzaFk1YPLD5hrQ1Qyk2Rkn8LLD/ieGdCNi1Oj74GG++aB0C2jZacaMBnF32/LCLy20JHnd0NVDMct0GsZ21XpdHlH52JBcWUr2J6YfBBY+nvzvmSR1/Js+OhfU1S6dw7O1c4YpvMLag9oFD+r0WouJhgobV82SEuEZvcO8g4v+yA16DFLSJvx2wRhuAMoljhslqLalmUk6vmXQZ843c/0ZJzOWt3Fivz4ksZ7yqf9iizBgi1ZKDPrANLdfQgz9N0rX3SqBJgEsGzDgaYbZbzHghMMdIatoC4eC0ue93ATZ7r9IpXZjjUZNQFThuPcXGyy4nVoRK9p2FQ1g1QvWqC4WqalfIwYaHN5CGvR5uvew8UytWDW8brqJdYHYKx1oMMP UN8nWvB0 fDli+67bnjD9dYoFq5MlUxL42aUhc/CqHusJ7J6tTe+rEKpUsS4fQrP0D06p/id9WVGf5+8XlkWqP5/b/YHAWy/paxeqRT4avbdjbzKLVRoWXDajfvZMT6u7QeP7OhH4VR/tREbOiJg2FEnCbgNnv7sWKCmipdXTfm86wSDVDl24CEG3I5Y6Iz5z+qC0T4gam4MXGQOFG7+wqCy5SUKd3KCyJAATpQ7wyCi1VW4+pU/tk4hbTqSIKuKzga1o9OFA+6hNU79LQaAq9OSA= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Kairui Song writes: > From: Kairui Song > > There are two flags used to synchronize allocation and scanning with > swapoff: SWP_WRITEOK and SWP_SCANNING. > > SWP_WRITEOK: Swapoff will first unset this flag, at this point any > further swap allocation or scanning on this device should just abort > so no more new entries will be referencing this device. Swapoff > will then unuse all existing swap entries. > > SWP_SCANNING: This flag is set when device is being scanned. Swapoff > will wait for all scanner to stop before the final release of the swap > device structures to avoid UAF. Note this flag is the highest used bit > of si->flags so it could be added up arithmetically, if there are > multiple scanner. > > commit 5f843a9a3a1e ("mm: swap: separate SSD allocation from > scan_swap_map_slots()") ignored SWP_SCANNING and SWP_WRITEOK flags while > separating cluster allocation path from the old allocation path. Add > the flags back to fix swapoff race. The race is hard to trigger as > si->lock prevents most parallel operations, but si->lock could be > dropped for reclaim or discard. This issue is found during code review. > > This commit fixes this problem. For SWP_SCANNING, Just like before, > set the flag before scan and remove it afterwards. > > For SWP_WRITEOK, there are several places where si->lock could > be dropped, it will be error-prone and make the code hard to follow > if we try to cover these places one by one. So just do one check before > the real allocation, which is also very similar like before. > With new cluster allocator it may waste a bit of time iterating > the clusters but won't take long, and swapoff is not performance > sensitive. > > Reported-by: "Huang, Ying" > Closes: https://lore.kernel.org/linux-mm/87a5es3f1f.fsf@yhuang6-desk2.ccr.corp.intel.com/ > Fixes: 5f843a9a3a1e ("mm: swap: separate SSD allocation from scan_swap_map_slots()") > Signed-off-by: Kairui Song > --- > mm/swapfile.c | 22 +++++++++++++++++++--- > 1 file changed, 19 insertions(+), 3 deletions(-) > > diff --git a/mm/swapfile.c b/mm/swapfile.c > index 9c85bd46ab7f..b0a9071cfe1d 100644 > --- a/mm/swapfile.c > +++ b/mm/swapfile.c > @@ -664,12 +664,15 @@ static bool cluster_scan_range(struct swap_info_struct *si, > return true; > } > > -static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci, > +static bool cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster_info *ci, > unsigned int start, unsigned char usage, > unsigned int order) > { > unsigned int nr_pages = 1 << order; > > + if (!(si->flags & SWP_WRITEOK)) > + return false; > + > if (cluster_is_free(ci)) { > if (nr_pages < SWAPFILE_CLUSTER) { > list_move_tail(&ci->list, &si->nonfull_clusters[order]); > @@ -690,6 +693,8 @@ static void cluster_alloc_range(struct swap_info_struct *si, struct swap_cluster > list_move_tail(&ci->list, &si->full_clusters); > ci->flags = CLUSTER_FLAG_FULL; > } > + > + return true; > } > > static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigned long offset, > @@ -713,7 +718,10 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigne > > while (offset <= end) { > if (cluster_scan_range(si, ci, offset, nr_pages)) { > - cluster_alloc_range(si, ci, offset, usage, order); > + if (!cluster_alloc_range(si, ci, offset, usage, order)) { > + offset = SWAP_NEXT_INVALID; We set offset to SWAP_NEXT_INVALID for 3 times in this function. Can we use a local variable to remove the duplication? For example, unsigned long ret = SWAP_NEXT_INVALID; And set ret if we allocate swap entry successfully. We can do this in a separate patch. Otherwise, LGTM, Thanks! Feel free to add, Reviewed-by: "Huang, Ying" > + goto done; > + } > *foundp = offset; > if (ci->count == SWAPFILE_CLUSTER) { > offset = SWAP_NEXT_INVALID; > @@ -805,7 +813,11 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int o > if (!list_empty(&si->free_clusters)) { > ci = list_first_entry(&si->free_clusters, struct swap_cluster_info, list); > offset = alloc_swap_scan_cluster(si, cluster_offset(si, ci), &found, order, usage); > - VM_BUG_ON(!found); > + /* > + * Either we didn't touch the cluster due to swapoff, > + * or the allocation must success. > + */ > + VM_BUG_ON((si->flags & SWP_WRITEOK) && !found); > goto done; > } > > @@ -1041,6 +1053,8 @@ static int cluster_alloc_swap(struct swap_info_struct *si, > > VM_BUG_ON(!si->cluster_info); > > + si->flags += SWP_SCANNING; > + > while (n_ret < nr) { > unsigned long offset = cluster_alloc_swap_entry(si, order, usage); > > @@ -1049,6 +1063,8 @@ static int cluster_alloc_swap(struct swap_info_struct *si, > slots[n_ret++] = swp_entry(si->type, offset); > } > > + si->flags -= SWP_SCANNING; > + > return n_ret; > } -- Best Regards, Huang, Ying