From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.3 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,NICE_REPLY_A,SPF_HELO_NONE, SPF_PASS,USER_AGENT_SANE_1 autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 43EFEC433ED for ; Fri, 9 Apr 2021 17:17:24 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 81508610F9 for ; Fri, 9 Apr 2021 17:17:23 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 81508610F9 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 08BC16B006C; Fri, 9 Apr 2021 13:17:23 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 03AFE6B006E; Fri, 9 Apr 2021 13:17:22 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id DF6E46B0070; Fri, 9 Apr 2021 13:17:22 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id C119C6B006C for ; Fri, 9 Apr 2021 13:17:22 -0400 (EDT) Received: from smtpin04.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id 85BD2184B5031 for ; Fri, 9 Apr 2021 17:17:22 +0000 (UTC) X-FDA: 78013484724.04.7C8566B Received: from mga02.intel.com (mga02.intel.com [134.134.136.20]) by imf27.hostedemail.com (Postfix) with ESMTP id DB77480192D5 for ; Fri, 9 Apr 2021 17:17:13 +0000 (UTC) IronPort-SDR: h/6BzBGs9lDgoLp5mJJhX/MSi0edXYpPLK7aDtyxu/U4P1t1PQmoCd9ewqSqDUPNP2JcO+pqO1 TjDSI8gcY6qg== X-IronPort-AV: E=McAfee;i="6000,8403,9949"; a="180933747" X-IronPort-AV: E=Sophos;i="5.82,210,1613462400"; d="scan'208";a="180933747" Received: from orsmga008.jf.intel.com ([10.7.209.65]) by orsmga101.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Apr 2021 10:17:18 -0700 IronPort-SDR: zIQTy+bctBXPiRuqxwpT1wrH6pJVXZokV3DAjz/I+HarzA2RDLmg86m83HkfN0zyK+DOohZ9EJ 9IQhhLrlsnQA== X-IronPort-AV: E=Sophos;i="5.82,210,1613462400"; d="scan'208";a="422819313" Received: from schen9-mobl.amr.corp.intel.com ([10.209.107.191]) by orsmga008-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 09 Apr 2021 10:17:18 -0700 Subject: Re: [PATCH 2/5] swap: fix do_swap_page() race with swapoff To: Miaohe Lin , akpm@linux-foundation.org Cc: hannes@cmpxchg.org, mhocko@suse.com, iamjoonsoo.kim@lge.com, vbabka@suse.cz, alex.shi@linux.alibaba.com, willy@infradead.org, minchan@kernel.org, richard.weiyang@gmail.com, ying.huang@intel.com, hughd@google.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org References: <20210408130820.48233-1-linmiaohe@huawei.com> <20210408130820.48233-3-linmiaohe@huawei.com> <7684b3de-2824-9b1f-f033-d4bc14f9e195@linux.intel.com> <50d34b02-c155-bad7-da1f-03807ad31275@huawei.com> From: Tim Chen Message-ID: <995a130b-f07a-4771-1fe3-477d2f3c1e8e@linux.intel.com> Date: Fri, 9 Apr 2021 10:17:17 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.6.0 MIME-Version: 1.0 In-Reply-To: <50d34b02-c155-bad7-da1f-03807ad31275@huawei.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit X-Rspamd-Queue-Id: DB77480192D5 X-Stat-Signature: w7h1amfm6xczr6n96rj7znftguueqsps X-Rspamd-Server: rspam02 Received-SPF: none (linux.intel.com>: No applicable sender policy available) receiver=imf27; identity=mailfrom; envelope-from=""; helo=mga02.intel.com; client-ip=134.134.136.20 X-HE-DKIM-Result: none/none X-HE-Tag: 1617988633-127527 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 4/9/21 1:42 AM, Miaohe Lin wrote: > On 2021/4/9 5:34, Tim Chen wrote: >> >> >> On 4/8/21 6:08 AM, Miaohe Lin wrote: >>> When I was investigating the swap code, I found the below possible race >>> window: >>> >>> CPU 1 CPU 2 >>> ----- ----- >>> do_swap_page >>> synchronous swap_readpage >>> alloc_page_vma >>> swapoff >>> release swap_file, bdev, or ... >> > > Many thanks for quick review and reply! > >> Perhaps I'm missing something. The release of swap_file, bdev etc >> happens after we have cleared the SWP_VALID bit in si->flags in destroy_swap_extents >> if I read the swapoff code correctly. > Agree. Let's look this more close: > CPU1 CPU2 > ----- ----- > swap_readpage > if (data_race(sis->flags & SWP_FS_OPS)) { > swapoff > p->swap_file = NULL; > struct file *swap_file = sis->swap_file; > struct address_space *mapping = swap_file->f_mapping;[oops!] > ... > p->flags = 0; > ... > > Does this make sense for you? p->swapfile = NULL happens after the p->flags &= ~SWP_VALID, synchronize_rcu(), destroy_swap_extents() sequence in swapoff(). So I don't think the sequence you illustrated on CPU2 is in the right order. That said, without get_swap_device/put_swap_device in swap_readpage, you could potentially blow pass synchronize_rcu() on CPU2 and causes a problem. so I think the problematic race looks something like the following: CPU1 CPU2 ----- ----- swap_readpage if (data_race(sis->flags & SWP_FS_OPS)) { swapoff p->flags = &= ~SWP_VALID; .. synchronize_rcu(); .. p->swap_file = NULL; struct file *swap_file = sis->swap_file; struct address_space *mapping = swap_file->f_mapping;[oops!] ... ... By adding get_swap_device/put_swap_device, then the race is fixed. CPU1 CPU2 ----- ----- swap_readpage get_swap_device() .. if (data_race(sis->flags & SWP_FS_OPS)) { swapoff p->flags = &= ~SWP_VALID; .. struct file *swap_file = sis->swap_file; struct address_space *mapping = swap_file->f_mapping;[valid value] .. put_swap_device() synchronize_rcu(); .. p->swap_file = NULL; > >>> >>> swap_readpage >>> check sis->flags is ok >>> access swap_file, bdev...[oops!] >>> si->flags = 0 >> >> This happens after we clear the si->flags >> synchronize_rcu() >> release swap_file, bdev, in destroy_swap_extents() >> >> So I think if we have get_swap_device/put_swap_device in do_swap_page, >> it should fix the race you've pointed out here. >> Then synchronize_rcu() will wait till we have completed do_swap_page and >> call put_swap_device. > > Right, get_swap_device/put_swap_device could fix this race. __But__ rcu_read_lock() > in get_swap_device() could disable preempt and do_swap_page() may take a really long > time because it involves I/O. It may not be acceptable to disable preempt for such a > long time. :( I can see that it is not a good idea to hold rcu read lock for a long time over slow file I/O operation, which will be the side effect of introducing get/put_swap_device to swap_readpage. So using percpu_ref will then be preferable for synchronization once we introduce get/put_swap_device into swap_readpage. Tim