From: "Huang, Ying"
To: Ryan Roberts, Miaohe Lin
Cc: Andrew Morton, David Hildenbrand
Subject: Re: [PATCH v1] mm: swap: Fix race between free_swap_and_cache() and swapoff()
In-Reply-To: <20240305151349.3781428-1-ryan.roberts@arm.com> (Ryan Roberts's message of "Tue, 5 Mar 2024 15:13:49 +0000")
References: <20240305151349.3781428-1-ryan.roberts@arm.com>
Date: Wed, 06 Mar 2024 10:52:13 +0800
Message-ID: <875xy0842q.fsf@yhuang6-desk2.ccr.corp.intel.com>

Ryan Roberts writes:

> There was previously a theoretical window where swapoff() could run and
> tear down a swap_info_struct while a call to free_swap_and_cache() was
> running in another thread. This could cause, amongst other bad
> possibilities, swap_page_trans_huge_swapped() (called by
> free_swap_and_cache()) to access the freed memory for swap_map.
>
> This is a theoretical problem and I haven't been able to provoke it from
> a test case. But there has been agreement based on code review that this
> is possible (see link below).
>
> Fix it by using get_swap_device()/put_swap_device(), which will stall
> swapoff(). There was an extra check in _swap_info_get() to confirm that
> the swap entry was valid. This wasn't present in get_swap_device(), so
> I've added it. I couldn't find any existing get_swap_device() call sites
> where this extra check would cause any false alarms.
>
> Details of how to provoke one possible issue (thanks to David Hildenbrand
> for deriving this):
>
> --8<-----
>
> __swap_entry_free() might be the last user and result in
> "count == SWAP_HAS_CACHE".
>
> swapoff->try_to_unuse() will stop as soon as si->inuse_pages==0.
>
> So the question is: could someone reclaim the folio and turn
> si->inuse_pages==0 before we completed swap_page_trans_huge_swapped()?
>
> Imagine the following: a 2 MiB folio in the swapcache.
> Only 2 subpages are still referenced by swap entries.
>
> Process 1 still references subpage 0 via swap entry.
> Process 2 still references subpage 1 via swap entry.
>
> Process 1 quits. Calls free_swap_and_cache().
> -> count == SWAP_HAS_CACHE
> [then, preempted in the hypervisor etc.]
>
> Process 2 quits. Calls free_swap_and_cache().
> -> count == SWAP_HAS_CACHE
>
> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
> __try_to_reclaim_swap().
>
> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
> put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
> swap_entry_free()->swap_range_free()->
> ...
> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>
> What stops swapoff from succeeding after process 2 reclaimed the swap cache
> but before process 1 finished its call to swap_page_trans_huge_swapped()?
>
> --8<-----

I think that this can be simplified.  Even for a 4K folio, this could
happen:

CPU0                                    CPU1
----                                    ----
zap_pte_range
  free_swap_and_cache
    __swap_entry_free
    /* swap count becomes 0 */
                                        swapoff
                                          try_to_unuse
                                            filemap_get_folio
                                              folio_free_swap
                                              /* remove swap cache */
                                            /* free si->swap_map[] */
    swap_page_trans_huge_swapped  <-- access freed si->swap_map !!!

> Fixes: 7c00bafee87c ("mm/swap: free swap slots in batch")
> Closes: https://lore.kernel.org/linux-mm/65a66eb9-41f8-4790-8db2-0c70ea15979f@redhat.com/
> Cc: stable@vger.kernel.org
> Signed-off-by: Ryan Roberts
> ---
>
> Applies on top of v6.8-rc6 and mm-unstable (b38c34939fe4).
>
> Thanks,
> Ryan
>
>  mm/swapfile.c | 14 +++++++++++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 2b3a2d85e350..f580e6abc674 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -1281,7 +1281,9 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>  	smp_rmb();
>  	offset = swp_offset(entry);
>  	if (offset >= si->max)
> -		goto put_out;
> +		goto bad_offset;
> +	if (data_race(!si->swap_map[swp_offset(entry)]))
> +		goto bad_free;
>
>  	return si;
>  bad_nofile:
> @@ -1289,9 +1291,14 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
>  out:
>  	return NULL;
>  put_out:
> -	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
>  	percpu_ref_put(&si->users);
>  	return NULL;
> +bad_offset:
> +	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
> +	goto put_out;
> +bad_free:
> +	pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val);
> +	goto put_out;
>  }

I don't think that it's a good idea to warn for bad free entries.
get_swap_device() can be called without enough locking to prevent
parallel swap operations on the entry, so false positives are possible
there.

I think that it's good to add more checks in general.  For example, in
free_swap_and_cache() we can check more, because we are sure that the
swap entry will not be freed by parallel swap operations.  Just don't
put the check in the general get_swap_device(); we can add another
helper to do that check.

I found that there are other checks in get_swap_device() already.  I
think that we may need to revert commit 23b230ba8ac3 ("mm/swap: print
bad swap offset entry in get_swap_device"), which introduced them, and
add the checks in appropriate places.
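To make the idea concrete, a caller-side check could look roughly like
the sketch below.  This only illustrates the suggestion, it is not
existing code; the helper name swap_entry_in_use() and the WARN
placement are made up here.

/*
 * Hypothetical helper (illustration only): callers such as
 * free_swap_and_cache() still hold a reference to the entry via a page
 * table entry, so the swap count cannot drop to 0 under them and a
 * zero count really would indicate a bug rather than a racy false
 * positive.
 */
static inline bool swap_entry_in_use(struct swap_info_struct *si,
                                     swp_entry_t entry)
{
        /* data_race(): an unlocked read is fine for a sanity check. */
        return data_race(si->swap_map[swp_offset(entry)]) != 0;
}

Callers that pin the entry could then do something like:

        p = get_swap_device(entry);
        if (p) {
                WARN_ON_ONCE(!swap_entry_in_use(p, entry));
                count = __swap_entry_free(p, entry);
                ...
        }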
--
Best Regards,
Huang, Ying

>  static unsigned char __swap_entry_free(struct swap_info_struct *p,
> @@ -1609,13 +1616,14 @@ int free_swap_and_cache(swp_entry_t entry)
>  	if (non_swap_entry(entry))
>  		return 1;
>
> -	p = _swap_info_get(entry);
> +	p = get_swap_device(entry);
>  	if (p) {
>  		count = __swap_entry_free(p, entry);
>  		if (count == SWAP_HAS_CACHE &&
>  		    !swap_page_trans_huge_swapped(p, entry))
>  			__try_to_reclaim_swap(p, swp_offset(entry),
>  					      TTRS_UNMAPPED | TTRS_FULL);
> +		put_swap_device(p);
>  	}
>  	return p != NULL;
>  }
> --
> 2.25.1
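As background for why the conversion above closes the race:
get_swap_device() pins the device via the si->users percpu_ref, and
swapoff() waits for that reference to drain before it frees
si->swap_map.  Roughly (a simplified sketch of the existing mechanism,
with most details elided):

        /* Reader side, e.g. free_swap_and_cache() after this patch. */
        si = get_swap_device(entry);    /* percpu_ref_tryget_live(&si->users) */
        ...                             /* si->swap_map[] cannot be freed here */
        put_swap_device(si);            /* percpu_ref_put(&si->users) */

        /* swapoff() side, before freeing swap_map. */
        percpu_ref_kill(&p->users);     /* new get_swap_device() calls now fail */
        synchronize_rcu();
        wait_for_completion(&p->comp);  /* wait for readers to drop si->users */

So any free_swap_and_cache() that took the reference before swapoff()
started keeps si->swap_map valid until it calls put_swap_device().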