Date: Mon, 6 Feb 2023 08:50:00 +1100
From: Dave Chinner
To: Shiyang Ruan
Cc: linux-kernel@vger.kernel.org, linux-xfs@vger.kernel.org,
	nvdimm@lists.linux.dev, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, djwong@kernel.org,
	dan.j.williams@intel.com, hch@infradead.org, jane.chu@oracle.com
Subject: Re: [PATCH v9 3/3] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
Message-ID: <20230205215000.GT360264@dread.disaster.area>
References: <1675522718-88-1-git-send-email-ruansy.fnst@fujitsu.com>
	<1675522718-88-4-git-send-email-ruansy.fnst@fujitsu.com>
In-Reply-To: <1675522718-88-4-git-send-email-ruansy.fnst@fujitsu.com>

On Sat, Feb 04, 2023 at 02:58:38PM +0000, Shiyang Ruan wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1].  With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> (or mapped device) on it to unmap all files in use and notify processes
> who are using those files.
.....
> @@ -182,12 +188,24 @@ xfs_dax_notify_failure(
>  	struct xfs_mount	*mp = dax_holder(dax_dev);
>  	u64			ddev_start;
>  	u64			ddev_end;
> +	int			error;
>
>  	if (!(mp->m_super->s_flags & SB_BORN)) {
>  		xfs_warn(mp, "filesystem is not ready for notify_failure()!");
>  		return -EIO;
>  	}
>
> +	if (mf_flags & MF_MEM_PRE_REMOVE) {
> +		xfs_info(mp, "device is about to be removed!");
> +		down_write(&mp->m_super->s_umount);
> +		error = sync_filesystem(mp->m_super);
> +		/* invalidate_inode_pages2() invalidates dax mapping */
> +		super_drop_pagecache(mp->m_super, invalidate_inode_pages2);
> +		up_write(&mp->m_super->s_umount);

I really don't like this. super_drop_pagecache() doesn't guarantee
that everything is removed from cache. It is racy - it doesn't touch
inodes being freed or being instantiated, nor does it prevent
concurrent accesses to the inodes from re-instantiating page cache
pages and dirtying them after the inode has been scanned by
super_drop_pagecache().

If we are about to remove the block device and we want to guarantee
that the filesystem is cleaned and stable before the device gets
yanked out from under running applications, then we have to
guarantee that we stall the running applications trying to modify
the filesystem between the MF_MEM_PRE_REMOVE and the actual removal
event that then shuts down the filesystem. Invalidating the page
cache is not enough to guarantee this.

Keep in mind that we're going to walk the rmap after writing the
data to kill any processes that have mmap()d files in the filesystem
after we've dropped the page cache - the page cache invalidation
doesn't change this at all - and this will kill any active userspace
DAX mappings before the device is unplugged. So I don't actually see
how walking the page cache to invalidate it here actually helps
"invalidate dax mapping" reliably as new write page faults on dax
VMAs can still occur between super_drop_pagecache() and the rmap
walk triggering kills on processes with DAX mapped VMAs.

We also don't care if read-only operations race with device unplug -
they are going to get EIO the moment the device is actually
unplugged or the filesystem is shut down anyway, so it doesn't matter
if reads race with the device remove event. Hence all we really care
about here is not dirtying the filesystem after we've started
processing the MF_MEM_PRE_REMOVE event.

Realistically, I think we need to freeze the filesystem here to
prevent racing modifications occurring during the rmap + VMA walk +
proc kill. That could be write() IO dirtying new data or other
transactions running dirtying the journal/metadata. Both
sync_filesystem() and super_drop_pagecache() operate on current
state - they don't prevent future dax mapping instantiation or
dirtying from happening on the device, so they don't prevent this...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
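
Editorial illustration, not part of the original message: the point that
sync + invalidate only acts on current state, while a freeze gates future
modifications, can be sketched with a small userspace toy model. This is
not kernel code and does not use any kernel API. Every name here
(toy_fs, toy_write, toy_clean_inode, toy_pre_remove) is hypothetical, and
the racing writer is simulated deterministically inside the walk rather
than with real threads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model, not kernel code: one "filesystem", a few inodes. */
#define TOY_NR_INODES 4

struct toy_fs {
	bool frozen;                 /* models a filesystem freeze gate */
	bool dirty[TOY_NR_INODES];   /* models dirty cached data per inode */
};

/* A write dirties the inode unless the freeze gate is closed. */
static bool toy_write(struct toy_fs *fs, size_t ino)
{
	assert(ino < TOY_NR_INODES);
	if (fs->frozen)
		return false;        /* a real writer would block here */
	fs->dirty[ino] = true;
	return true;
}

/* Clean one inode: models sync + invalidate of what is cached right now. */
static void toy_clean_inode(struct toy_fs *fs, size_t ino)
{
	assert(ino < TOY_NR_INODES);
	fs->dirty[ino] = false;
}

/* Count inodes that still hold dirty data. */
static size_t toy_count_dirty(const struct toy_fs *fs)
{
	size_t n = 0;

	for (size_t i = 0; i < TOY_NR_INODES; i++)
		n += fs->dirty[i] ? 1 : 0;
	return n;
}

/*
 * Walk all inodes and clean them, with one write racing in against
 * inode 0 after the walk has already passed it. Returns the number of
 * dirty inodes left when the walk finishes.
 */
static size_t toy_pre_remove(struct toy_fs *fs, bool freeze_first)
{
	if (freeze_first)
		fs->frozen = true;

	for (size_t i = 0; i < TOY_NR_INODES; i++) {
		toy_clean_inode(fs, i);
		if (i == 1)
			toy_write(fs, 0);   /* hits an already-scanned inode */
	}
	return toy_count_dirty(fs);
}
```

In this model, calling toy_pre_remove() with freeze_first set to false
leaves inode 0 dirty after the walk, because the write lands behind the
scan. With freeze_first set to true the same write is refused and nothing
is dirty when the walk completes. A real freeze also has to cover
transactions and page faults, which this sketch does not attempt to model.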