From: "Huang, Ying" <ying.huang@intel.com>
To: Miaohe Lin
Cc: Ryan Roberts, Andrew Morton, David Hildenbrand, linux-mm@kvack.org
Subject: Re: [PATCH v1] mm: swap: Fix race between free_swap_and_cache() and swapoff()
In-Reply-To: <0925807f-d226-7f08-51d1-ab771b1a6c24@huawei.com> (Miaohe Lin's message of "Thu, 7 Mar 2024 14:50:10 +0800")
References: <20240305151349.3781428-1-ryan.roberts@arm.com>
	<875xy0842q.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<87bk7q7ffp.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<0925807f-d226-7f08-51d1-ab771b1a6c24@huawei.com>
Date: Thu, 07 Mar 2024 15:34:42 +0800
Message-ID: <8734t27awd.fsf@yhuang6-desk2.ccr.corp.intel.com>

Miaohe Lin writes:

> On 2024/3/7 13:56, Huang, Ying wrote:
>> Miaohe Lin writes:
>>
>>> On 2024/3/6 17:31, Ryan Roberts wrote:
>>>> On 06/03/2024 08:51, Miaohe Lin wrote:
>>>>> On 2024/3/6 10:52, Huang, Ying wrote:
>>>>>> Ryan Roberts writes:
>>>>>>
>>>>>>> There was previously a theoretical window where swapoff() could run and
>>>>>>> teardown a swap_info_struct while a call to free_swap_and_cache() was
>>>>>>> running in another thread. This could cause, amongst other bad
>>>>>>> possibilities, swap_page_trans_huge_swapped() (called by
>>>>>>> free_swap_and_cache()) to access the freed memory for swap_map.
>>>>>>>
>>>>>>> This is a theoretical problem and I haven't been able to provoke it from
>>>>>>> a test case. But there has been agreement based on code review that this
>>>>>>> is possible (see link below).
>>>>>>>
>>>>>>> Fix it by using get_swap_device()/put_swap_device(), which will stall
>>>>>>> swapoff(). There was an extra check in _swap_info_get() to confirm that
>>>>>>> the swap entry was valid. This wasn't present in get_swap_device() so
>>>>>>> I've added it. I couldn't find any existing get_swap_device() call sites
>>>>>>> where this extra check would cause any false alarms.
>>>>>>>
>>>>>>> Details of how to provoke one possible issue (thanks to David
>>>>>>> Hildenbrand for deriving this):
>>>>>>>
>>>>>>> --8<-----
>>>>>>>
>>>>>>> __swap_entry_free() might be the last user and result in
>>>>>>> "count == SWAP_HAS_CACHE".
>>>>>>>
>>>>>>> swapoff->try_to_unuse() will stop as soon as si->inuse_pages==0.
>>>>>>>
>>>>>>> So the question is: could someone reclaim the folio and turn
>>>>>>> si->inuse_pages==0, before we completed swap_page_trans_huge_swapped().
>>>>>>>
>>>>>>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
>>>>>>> still referenced by swap entries.
>>>>>>>
>>>>>>> Process 1 still references subpage 0 via swap entry.
>>>>>>> Process 2 still references subpage 1 via swap entry.
>>>>>>>
>>>>>>> Process 1 quits. Calls free_swap_and_cache().
>>>>>>> -> count == SWAP_HAS_CACHE
>>>>>>> [then, preempted in the hypervisor etc.]
>>>>>>>
>>>>>>> Process 2 quits. Calls free_swap_and_cache().
>>>>>>> -> count == SWAP_HAS_CACHE
>>>>>>>
>>>>>>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>>>>>>> __try_to_reclaim_swap().
>>>>>>>
>>>>>>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
>>>>>>> put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
>>>>>>> swap_entry_free()->swap_range_free()->
>>>>>>> ...
>>>>>>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>>>>>>
>>>>>>> What stops swapoff from succeeding after process 2 reclaimed the swap
>>>>>>> cache but before process 1 finished its call to
>>>>>>> swap_page_trans_huge_swapped()?
>>>>>>>
>>>>>>> --8<-----
>>>>>>
>>>>>> I think that this can be simplified. Even for a 4K folio, this could
>>>>>> happen.
>>>>>>
>>>>>> CPU0                               CPU1
>>>>>> ----                               ----
>>>>>>
>>>>>> zap_pte_range
>>>>>>   free_swap_and_cache
>>>>>>     __swap_entry_free
>>>>>>     /* swap count become 0 */
>>>>>>                                    swapoff
>>>>>>                                      try_to_unuse
>>>>>>                                        filemap_get_folio
>>>>>>                                        folio_free_swap
>>>>>>                                        /* remove swap cache */
>>>>>>                                      /* free si->swap_map[] */
>>>>>>
>>>>>>     swap_page_trans_huge_swapped   <-- access freed si->swap_map !!!
>>>>>
>>>>> Sorry for jumping into the discussion here. IMHO, free_swap_and_cache is
>>>>> called with pte lock held.
>>>>
>>>> I don't believe it has the PTL when called by shmem.
>>>
>>> In the case of shmem, folio_lock is used to guard against the race.
>>
>> I don't find folio is locked for shmem. find_lock_entries() will only
>> lock the folio if (!xa_is_value()), that is, not swap entry. Can you
>> point out where the folio is locked for shmem?
>
> You're right, the folio is locked if it is not a swap entry. That's my
> mistake. But it seems the above race is still nonexistent. shmem_unuse()
> will first be called to read all the shared memory data that resides in
> the swap device back into memory when doing swapoff. In that case, all the
> swapped pages are moved to the page cache, thus there won't be any
> xa_is_value(folio) cases when calling shmem_undo_range().
> free_swap_and_cache() won't even be called from shmem_undo_range() after
> shmem_unuse(). Or am I missing something?

I think the following situation is possible. Right?

CPU0                               CPU1
----                               ----
shmem_undo_range
  shmem_free_swap
    xa_cmpxchg_irq
    free_swap_and_cache
      __swap_entry_free
      /* swap count become 0 */
                                   swapoff
                                     try_to_unuse
                                       shmem_unuse /* cannot find swap entry */
                                       find_next_to_unuse
                                       filemap_get_folio
                                       folio_free_swap
                                       /* remove swap cache */
                                     /* free si->swap_map[] */

      swap_page_trans_huge_swapped <-- access freed si->swap_map !!!

shmem_undo_range can run earlier.

--
Best Regards,
Huang, Ying
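
For illustration, here is a minimal sketch of the get_swap_device()/
put_swap_device() pattern that the patch description above refers to. It
assumes free_swap_and_cache() otherwise keeps its current shape in
mm/swapfile.c; the actual patch may differ in detail.

/*
 * Sketch only -- not the actual patch. The idea: replace the plain
 * _swap_info_get() lookup with get_swap_device(), which takes a
 * reference on the swap device. swapoff() waits for all such
 * references to be dropped before it frees si->swap_map, so the
 * use-after-free window described above is closed.
 */
int free_swap_and_cache(swp_entry_t entry)
{
	struct swap_info_struct *p;
	unsigned char count;

	if (non_swap_entry(entry))
		return 1;

	/* Pin the device; stalls swapoff() until put_swap_device(). */
	p = get_swap_device(entry);
	if (p) {
		count = __swap_entry_free(p, entry);
		if (count == SWAP_HAS_CACHE &&
		    !swap_page_trans_huge_swapped(p, entry))
			__try_to_reclaim_swap(p, swp_offset(entry),
					      TTRS_UNMAPPED | TTRS_FULL);
		put_swap_device(p);
	}
	return p != NULL;
}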