From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Huang, Ying" <ying.huang@intel.com>
To: Miaohe Lin
Cc: Ryan Roberts, Andrew Morton, David Hildenbrand
Subject: Re: [PATCH v1] mm: swap: Fix race between free_swap_and_cache() and swapoff()
In-Reply-To: (Miaohe Lin's message of "Thu, 7 Mar 2024 10:38:47 +0800")
References: <20240305151349.3781428-1-ryan.roberts@arm.com> <875xy0842q.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Thu, 07 Mar 2024 13:56:42 +0800
Message-ID: <87bk7q7ffp.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13)
MIME-Version: 1.0
Content-Type: text/plain; charset=ascii
Miaohe Lin writes:

> On 2024/3/6 17:31, Ryan Roberts wrote:
>> On 06/03/2024 08:51, Miaohe Lin wrote:
>>> On 2024/3/6 10:52, Huang, Ying wrote:
>>>> Ryan Roberts writes:
>>>>
>>>>> There was previously a theoretical window where swapoff() could run and
>>>>> tear down a swap_info_struct while a call to free_swap_and_cache() was
>>>>> running in another thread. This could cause, amongst other bad
>>>>> possibilities, swap_page_trans_huge_swapped() (called by
>>>>> free_swap_and_cache()) to access the freed memory for swap_map.
>>>>>
>>>>> This is a theoretical problem and I haven't been able to provoke it from
>>>>> a test case. But there has been agreement based on code review that this
>>>>> is possible (see link below).
>>>>>
>>>>> Fix it by using get_swap_device()/put_swap_device(), which will stall
>>>>> swapoff(). There was an extra check in _swap_info_get() to confirm that
>>>>> the swap entry was valid. This wasn't present in get_swap_device() so
>>>>> I've added it. I couldn't find any existing get_swap_device() call sites
>>>>> where this extra check would cause any false alarms.
>>>>>
>>>>> Details of how to provoke one possible issue (thanks to David
>>>>> Hildenbrand for deriving this):
>>>>>
>>>>> --8<-----
>>>>>
>>>>> __swap_entry_free() might be the last user and result in
>>>>> "count == SWAP_HAS_CACHE".
>>>>>
>>>>> swapoff->try_to_unuse() will stop as soon as si->inuse_pages==0.
>>>>>
>>>>> So the question is: could someone reclaim the folio and turn
>>>>> si->inuse_pages==0, before we completed swap_page_trans_huge_swapped()?
>>>>>
>>>>> Imagine the following: 2 MiB folio in the swapcache. Only 2 subpages are
>>>>> still referenced by swap entries.
>>>>>
>>>>> Process 1 still references subpage 0 via swap entry.
>>>>> Process 2 still references subpage 1 via swap entry.
>>>>>
>>>>> Process 1 quits.
>>>>> Calls free_swap_and_cache().
>>>>> -> count == SWAP_HAS_CACHE
>>>>> [then, preempted in the hypervisor etc.]
>>>>>
>>>>> Process 2 quits. Calls free_swap_and_cache().
>>>>> -> count == SWAP_HAS_CACHE
>>>>>
>>>>> Process 2 goes ahead, passes swap_page_trans_huge_swapped(), and calls
>>>>> __try_to_reclaim_swap().
>>>>>
>>>>> __try_to_reclaim_swap()->folio_free_swap()->delete_from_swap_cache()->
>>>>> put_swap_folio()->free_swap_slot()->swapcache_free_entries()->
>>>>> swap_entry_free()->swap_range_free()->
>>>>> ...
>>>>> WRITE_ONCE(si->inuse_pages, si->inuse_pages - nr_entries);
>>>>>
>>>>> What stops swapoff succeeding after process 2 reclaimed the swap cache
>>>>> but before process 1 finished its call to
>>>>> swap_page_trans_huge_swapped()?
>>>>>
>>>>> --8<-----
>>>>
>>>> I think that this can be simplified. Even for a 4K folio, this could
>>>> happen.
>>>>
>>>> CPU0                               CPU1
>>>> ----                               ----
>>>>
>>>> zap_pte_range
>>>>   free_swap_and_cache
>>>>     __swap_entry_free
>>>>     /* swap count becomes 0 */
>>>>                                    swapoff
>>>>                                      try_to_unuse
>>>>                                        filemap_get_folio
>>>>                                        folio_free_swap
>>>>                                        /* remove swap cache */
>>>>                                      /* free si->swap_map[] */
>>>>
>>>>     swap_page_trans_huge_swapped <-- access freed si->swap_map !!!
>>>
>>> Sorry for jumping into the discussion here. IMHO, free_swap_and_cache()
>>> is called with the pte lock held.
>>
>> I don't believe it has the PTL when called by shmem.
>
> In the case of shmem, folio_lock is used to guard against the race.

I don't find that the folio is locked for shmem. find_lock_entries() will
only lock the folio if (!xa_is_value()), that is, if it is not a swap
entry. Can you point out where the folio is locked for shmem?

--
Best Regards,
Huang, Ying

>>
>>> So synchronize_rcu (called by swapoff) will wait for zap_pte_range to
>>> release the pte lock. So this theoretical problem can't happen. Or am I
>>> missing something?
>>
>> For Huang Ying's example, I agree this can't happen because
>> try_to_unuse() will be waiting for the PTL (see the reply I just sent).
>
> Do you mean the below message?
> "
> I don't think si->inuse_pages is decremented until __try_to_reclaim_swap()
> is called (per David, above), which is called after
> swap_page_trans_huge_swapped() has executed. So in CPU1, try_to_unuse()
> wouldn't see si->inuse_pages being zero until after CPU0 has completed
> accessing si->swap_map, so if swapoff starts where you have put it, it
> would get stalled waiting for the PTL which CPU0 has.
> "
>
> I agree try_to_unuse() will wait for si->inuse_pages to become zero. But
> why will it wait for the PTL? It seems the PTL is not used to protect
> si->inuse_pages. Or am I missing something?
>
>>
>>>
>>> CPU0                               CPU1
>>> ----                               ----
>>>
>>> zap_pte_range
>>>   pte_offset_map_lock -- spin_lock is held.
>>>   free_swap_and_cache
>>>     __swap_entry_free
>>>     /* swap count becomes 0 */
>>>                                    swapoff
>>>                                      try_to_unuse
>>>                                        filemap_get_folio
>>>                                        folio_free_swap
>>>                                        /* remove swap cache */
>>>                                      percpu_ref_kill(&p->users);
>>>     swap_page_trans_huge_swapped
>>>   pte_unmap_unlock -- spin_lock is released.
>>>                                      synchronize_rcu(); --> Will wait
>>>                                      for pte_unmap_unlock to be called?
>>
>> Perhaps you can educate me here; I thought that synchronize_rcu() will
>> only wait for RCU critical sections to complete. The PTL is a spin lock,
>> so why would synchronize_rcu() wait for the PTL to become unlocked?
>
> I assume the PTL will always disable preemption, which blocks a grace
> period until the PTL is released. But this might be fragile and I'm not
> really sure. I might be wrong.
>
> Thanks.
>>
>>
>>> /* free si->swap_map[] */
>>>
>>> Thanks.
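As an aside, the stall discipline that get_swap_device()/put_swap_device()
gives free_swap_and_cache() can be sketched in plain userspace C. This is
only a toy analogue under stated assumptions: the names toy_si, toy_get(),
toy_put(), toy_swapoff() and demo() are invented for illustration, a plain
atomic counter stands in for the kernel's percpu_ref (si->users), and a
busy-wait stands in for swapoff()'s real waiting (percpu_ref_kill() plus
synchronize_rcu()).

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy stand-in for swap_info_struct: swap_map may be freed by "swapoff". */
struct toy_si {
	atomic_int users;		/* stands in for the percpu_ref si->users */
	unsigned char *swap_map;	/* freed once there are no pinned users */
};

/* get_swap_device() analogue: pin the device so "swapoff" must wait. */
bool toy_get(struct toy_si *si)
{
	atomic_fetch_add(&si->users, 1);
	if (!si->swap_map) {		/* swapoff already went through */
		atomic_fetch_sub(&si->users, 1);
		return false;
	}
	return true;
}

/* put_swap_device() analogue: drop the pin. */
void toy_put(struct toy_si *si)
{
	atomic_fetch_sub(&si->users, 1);
}

/* swapoff() analogue: refuse new users, wait out pinned ones, then free. */
void *toy_swapoff(void *arg)
{
	struct toy_si *si = arg;
	unsigned char *map = si->swap_map;

	si->swap_map = NULL;			/* new toy_get() calls now fail */
	while (atomic_load(&si->users) > 0)	/* stall while anyone is pinned */
		;
	free(map);
	return NULL;
}

/* free_swap_and_cache() analogue: pin, touch swap_map, unpin.
 * Returns 0 when every step behaved as the scheme promises. */
int demo(void)
{
	struct toy_si si = { .swap_map = calloc(128, 1) };
	pthread_t t;

	if (!toy_get(&si))
		return 1;
	si.swap_map[0] = 1;	/* safe: toy_swapoff() cannot free it yet */

	pthread_create(&t, NULL, toy_swapoff, &si);
	toy_put(&si);		/* lets the concurrent "swapoff" finish */
	pthread_join(t, NULL);

	return toy_get(&si) ? 2 : 0;	/* after swapoff, pinning must fail */
}
```

The ordering is the point: toy_swapoff() first makes new pins fail and only
then waits for the pin count to drain, so any reader that got true from
toy_get() may safely dereference swap_map until it calls toy_put(). In the
real kernel the wait is not a spin loop but percpu_ref teardown plus RCU.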