Message-ID: <999e83ca-df65-4a43-9d32-ff13a252c2d7@fujitsu.com>
Date: Thu, 24 Aug 2023 17:41:50 +0800
From: Shiyang Ruan
To: "Darrick J. Wong"
Cc: linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org, dan.j.williams@intel.com, willy@infradead.org, jack@suse.cz, akpm@linux-foundation.org, mcgrof@kernel.org
Subject: Re: [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE for unbind
In-Reply-To: <20230823233601.GH11263@frogsfrogsfrogs>
References: <20230629081651.253626-3-ruansy.fnst@fujitsu.com> <20230823081706.2970430-1-ruansy.fnst@fujitsu.com> <20230823233601.GH11263@frogsfrogsfrogs>

On 2023/8/24 7:36, Darrick J. Wong wrote:
> On Wed, Aug 23, 2023 at 04:17:06PM +0800, Shiyang Ruan wrote:
>> ====
>> Changes since v12:
>>  1. correct flag name in subject (MF_MEM_REMOVE => MF_MEM_PRE_REMOVE)
>>  2. complete the behavior when fs has already frozen by kernel call
>>       NOTICE: Instead of "call notify_failure() again w/o PRE_REMOVE",
>>               I tried this proposal[0].
>>  3. call xfs_dax_notify_failure_freeze() and _thaw() in same function
>>  4. rebase on: xfs/xfs-linux.git vfs-for-next
>> ====
>>
>> Now, if we suddenly remove a PMEM device(by calling unbind) which
>> contains FSDAX while programs are still accessing data in this device,
>> e.g.:
>> ```
>>  $FSSTRESS_PROG -d $SCRATCH_MNT -n 99999 -p 4 &
>>  # $FSX_PROG -N 1000000 -o 8192 -l 500000 $SCRATCH_MNT/t001 &
>>  echo "pfn1.1" > /sys/bus/nd/drivers/nd_pmem/unbind
>> ```
>> it could come into an unacceptable state:
>>   1. device has gone but mount point still exists, and umount will fail
>>        with "target is busy"
>>   2. programs will hang and cannot be killed
>>   3. may crash with NULL pointer dereference
>>
>> To fix this, we introduce a MF_MEM_PRE_REMOVE flag to let it know that we
>> are going to remove the whole device, and make sure all related processes
>> could be notified so that they could end up gracefully.
>>
>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>> dev_pagemap_failure()"[1].  With the help of dax_holder and
>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>> on it to unmap all files in use, and notify processes who are using
>> those files.
>>
>> Call trace:
>> trigger unbind
>>  -> unbind_store()
>>   -> ... (skip)
>>    -> devres_release_all()
>>     -> kill_dax()
>>      -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>       -> xfs_dax_notify_failure()
>>       `-> freeze_super()             // freeze (kernel call)
>>       `-> do xfs rmap
>>       ` -> mf_dax_kill_procs()
>>       `  -> collect_procs_fsdax()    // all associated processes
>>       `  -> unmap_and_kill()
>>       ` -> invalidate_inode_pages2_range() // drop file's cache
>>       `-> thaw_super()               // thaw (both kernel & user call)
>>
>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>> event.  Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
>> new dax mapping from being created.  Do not shutdown filesystem directly
>> if configuration is not supported, or if failure range includes metadata
>> area.
>> Make sure all files and processes(not only the current progress)
>> are handled correctly.  Also drop the cache of associated files before
>> pmem is removed.
>>
>> [0]: https://lore.kernel.org/linux-xfs/25cf6700-4db0-a346-632c-ec9fc291793a@fujitsu.com/
>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>> [2]: https://lore.kernel.org/linux-xfs/169116275623.3187159.16862410128731457358.stg-ugh@frogsfrogsfrogs/
>>
>> Signed-off-by: Shiyang Ruan
>> ---
>>  drivers/dax/super.c         |  3 +-
>>  fs/xfs/xfs_notify_failure.c | 99 ++++++++++++++++++++++++++++++++++---
>>  include/linux/mm.h          |  1 +
>>  mm/memory-failure.c         | 17 +++++--
>>  4 files changed, 109 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>> index c4c4728a36e4..2e1a35e82fce 100644
>> --- a/drivers/dax/super.c
>> +++ b/drivers/dax/super.c
>> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>>  		return;
>>
>>  	if (dax_dev->holder_data != NULL)
>> -		dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>> +		dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>> +				MF_MEM_PRE_REMOVE);
>>
>>  	clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>>  	synchronize_srcu(&dax_srcu);
>> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
>> index 4a9bbd3fe120..6496c32a9172 100644
>> --- a/fs/xfs/xfs_notify_failure.c
>> +++ b/fs/xfs/xfs_notify_failure.c
>> @@ -22,6 +22,7 @@
>>
>>  #include
>>  #include
>> +#include
>>
>>  struct xfs_failure_info {
>>  	xfs_agblock_t	startblock;
>> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>>  	struct xfs_mount	*mp = cur->bc_mp;
>>  	struct xfs_inode	*ip;
>>  	struct xfs_failure_info	*notify = data;
>> +	struct address_space	*mapping;
>> +	pgoff_t	pgoff;
>> +	unsigned long	pgcnt;
>>  	int	error = 0;
>>
>>  	if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>>  	    (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
>> +		/* Continue the query because this isn't a failure. */
>> +		if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>> +			return 0;
>>  		notify->want_shutdown = true;
>>  		return 0;
>>  	}
>> @@ -92,14 +99,60 @@ xfs_dax_failure_fn(
>>  		return 0;
>>  	}
>>
>> -	error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
>> -				  xfs_failure_pgoff(mp, rec, notify),
>> -				  xfs_failure_pgcnt(mp, rec, notify),
>> -				  notify->mf_flags);
>> +	mapping = VFS_I(ip)->i_mapping;
>> +	pgoff = xfs_failure_pgoff(mp, rec, notify);
>> +	pgcnt = xfs_failure_pgcnt(mp, rec, notify);
>> +
>> +	/* Continue the rmap query if the inode isn't a dax file. */
>> +	if (dax_mapping(mapping))
>> +		error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
>> +					  notify->mf_flags);
>> +
>> +	/* Invalidate the cache in dax pages. */
>> +	if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>> +		invalidate_inode_pages2_range(mapping, pgoff,
>> +					      pgoff + pgcnt - 1);
>> +
>>  	xfs_irele(ip);
>>  	return error;
>>  }
>>
>> +static int
>> +xfs_dax_notify_failure_freeze(
>> +	struct xfs_mount	*mp)
>> +{
>> +	struct super_block	*sb = mp->m_super;
>> +	int	error;
>> +
>> +	error = freeze_super(sb, FREEZE_HOLDER_KERNEL);
>> +	if (error)
>> +		xfs_emerg(mp, "already frozen by kernel, err=%d", error);
>> +
>> +	return error;
>> +}
>> +
>> +static void
>> +xfs_dax_notify_failure_thaw(
>> +	struct xfs_mount	*mp,
>> +	bool	kernel_frozen)
>> +{
>> +	struct super_block	*sb = mp->m_super;
>> +	int	error;
>> +
>> +	if (!kernel_frozen) {
>> +		error = thaw_super(sb, FREEZE_HOLDER_KERNEL);
>> +		if (error)
>> +			xfs_emerg(mp, "still frozen after notify failure, err=%d",
>> +				error);
>> +	}
>> +
>> +	/*
>> +	 * Also thaw userspace call anyway because the device is about to be
>> +	 * removed immediately.
>
> Does a userspace freeze inhibit or otherwise break device removal?

It doesn't; the device can still be removed.  But afterwards the mount
point still exists, `umount /mnt/scratch` fails with "target is busy",
and `xfs_freeze -u /mnt/scratch` does not work either.

So I think the unconditional thaw_super() here is needed.
>
>> +	 */
>> +	thaw_super(sb, FREEZE_HOLDER_USERSPACE);
>> +}
>> +
>>  static int
>>  xfs_dax_notify_ddev_failure(
>>  	struct xfs_mount	*mp,
>> @@ -112,15 +165,29 @@ xfs_dax_notify_ddev_failure(
>>  	struct xfs_btree_cur	*cur = NULL;
>>  	struct xfs_buf	*agf_bp = NULL;
>>  	int	error = 0;
>> +	bool	kernel_frozen = false;
>>  	xfs_fsblock_t	fsbno = XFS_DADDR_TO_FSB(mp, daddr);
>>  	xfs_agnumber_t	agno = XFS_FSB_TO_AGNO(mp, fsbno);
>>  	xfs_fsblock_t	end_fsbno = XFS_DADDR_TO_FSB(mp,
>>  						     daddr + bblen - 1);
>>  	xfs_agnumber_t	end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
>>
>> +	if (mf_flags & MF_MEM_PRE_REMOVE) {
>> +		xfs_info(mp, "Device is about to be removed!");
>> +		/* Freeze fs to prevent new mappings from being created. */
>> +		error = xfs_dax_notify_failure_freeze(mp);
>> +		if (error) {
>> +			/* Keep going on if filesystem is frozen by kernel. */
>> +			if (error == -EBUSY)
>> +				kernel_frozen = true;
>
> EBUSY means that xfs_dax_notify_failure_freeze did /not/ succeed in
> kernel-freezing the fs.  Someone else did, and they're expecting that
> thaw_super will undo that.
>
> 	switch (error) {
> 	case -EBUSY:
> 		/* someone else froze the fs, keep going */
> 		break;
> 	case 0:
> 		/* we froze the fs */
> 		kernel_frozen = true;
> 		break;
> 	default:
> 		/* something else broke, should we continue anyway? */
> 		return error;
> 	}
>
> TBH I wonder why all that isn't just:
>
> 	kernel_frozen = xfs_dax_notify_failure_freeze(mp) == 0;
>
> Since we'd want to keep going even if (say) the pmem was already
> starting to fail and the freeze actually failed due to EIO, right?

Yes.  So the _freeze() here is really just a *try*: whatever its result
is, we continue.  In that case I think `kernel_frozen` becomes useless as
well, because we should try to call both _thaw(KERNEL_CALL) and
_thaw(USER_CALL) anyway, to make sure umount works after the device is
gone.
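
To make the "try to freeze, always continue, then try both thaws" flow
concrete, here is a minimal userspace sketch.  It is not kernel code: the
model_* names are hypothetical stand-ins for freeze_super()/thaw_super(),
the holder bookkeeping is a simplified assumption based on the exclusive
freeze/thaw series[2], and apart from the -EBUSY case discussed above the
return values are assumed as well.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's freeze holder flags. */
enum model_holder {
	MODEL_HOLDER_KERNEL	= 1 << 0,
	MODEL_HOLDER_USERSPACE	= 1 << 1,
};

/* Simplified model of a superblock: it only tracks who holds a freeze. */
struct model_sb {
	unsigned int holders;
};

/* Assumed semantics: -EBUSY if this kind of holder already froze the fs. */
static int model_freeze(struct model_sb *sb, enum model_holder who)
{
	assert(sb != NULL);
	if (sb->holders & who)
		return -EBUSY;
	sb->holders |= who;
	return 0;
}

/* Assumed semantics: -EINVAL if this holder has no freeze to drop. */
static int model_thaw(struct model_sb *sb, enum model_holder who)
{
	assert(sb != NULL);
	if (!(sb->holders & who))
		return -EINVAL;
	sb->holders &= ~who;
	return 0;
}

/* The model counts as frozen as long as any holder is left. */
static int model_is_frozen(const struct model_sb *sb)
{
	return sb->holders != 0;
}

/* A *try*: the result is ignored and the caller continues regardless. */
static void model_try_freeze(struct model_sb *sb)
{
	(void)model_freeze(sb, MODEL_HOLDER_KERNEL);
}

/* Drop both kinds of freeze so that umount can work afterwards. */
static void model_try_thaw(struct model_sb *sb)
{
	(void)model_thaw(sb, MODEL_HOLDER_KERNEL);
	(void)model_thaw(sb, MODEL_HOLDER_USERSPACE);
}

/* Model of the PRE_REMOVE path; returns nonzero if still frozen at the end. */
static int model_pre_remove(struct model_sb *sb)
{
	model_try_freeze(sb);
	/* ... rmap query, kill processes, invalidate mappings ... */
	model_try_thaw(sb);
	return model_is_frozen(sb);
}
```

Whoever held a freeze beforehand (nobody, userspace, or another kernel
caller), model_pre_remove() leaves the model unfrozen.  Note that in this
model a freeze taken by another kernel caller is dropped too, which is the
trade-off of no longer tracking `kernel_frozen`.
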
So I think it is better to rename them from
  `static int xfs_dax_notify_failure_freeze()`,
  `static void xfs_dax_notify_failure_thaw()`
to
  `static void xfs_dax_notify_failure_try_freeze()`,
  `static void xfs_dax_notify_failure_try_thaw()`.

--
Thanks,
Ruan.

>
> --D
>
>> +			else
>> +				return error;
>> +		}
>> +	}
>> +
>>  	error = xfs_trans_alloc_empty(mp, &tp);
>>  	if (error)
>> -		return error;
>> +		goto out;
>>
>>  	for (; agno <= end_agno; agno++) {
>>  		struct xfs_rmap_irec	ri_low = { };
>> @@ -165,11 +232,23 @@ xfs_dax_notify_ddev_failure(
>>  	}
>>
>>  	xfs_trans_cancel(tp);
>> +
>> +	/*
>> +	 * Determine how to shutdown the filesystem according to the
>> +	 * error code and flags.
>> +	 */
>>  	if (error || notify.want_shutdown) {
>>  		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>  		if (!error)
>>  			error = -EFSCORRUPTED;
>> -	}
>> +	} else if (mf_flags & MF_MEM_PRE_REMOVE)
>> +		xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
>> +
>> +out:
>> +	/* Thaw the fs if it is frozen before. */
>> +	if (mf_flags & MF_MEM_PRE_REMOVE)
>> +		xfs_dax_notify_failure_thaw(mp, kernel_frozen);
>> +
>>  	return error;
>>  }
>>
>> @@ -197,6 +276,8 @@ xfs_dax_notify_failure(
>>
>>  	if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>>  	    mp->m_logdev_targp != mp->m_ddev_targp) {
>> +		if (mf_flags & MF_MEM_PRE_REMOVE)
>> +			return 0;
>>  		xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>>  		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>  		return -EFSCORRUPTED;
>> @@ -210,6 +291,12 @@ xfs_dax_notify_failure(
>>  	ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>>  	ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>>
>> +	/* Notify failure on the whole device. */
>> +	if (offset == 0 && len == U64_MAX) {
>> +		offset = ddev_start;
>> +		len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
>> +	}
>> +
>>  	/* Ignore the range out of filesystem area */
>>  	if (offset + len - 1 < ddev_start)
>>  		return -ENXIO;
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 799836e84840..944a1165a321 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3577,6 +3577,7 @@ enum mf_flags {
>>  	MF_UNPOISON = 1 << 4,
>>  	MF_SW_SIMULATED = 1 << 5,
>>  	MF_NO_RETRY = 1 << 6,
>> +	MF_MEM_PRE_REMOVE = 1 << 7,
>>  };
>>  int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>  		      unsigned long count, int mf_flags);
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index dc5ff7dd4e50..92f18c9e0aaf 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p,
>>   */
>>  static void collect_procs_fsdax(struct page *page,
>>  		struct address_space *mapping, pgoff_t pgoff,
>> -		struct list_head *to_kill)
>> +		struct list_head *to_kill, bool pre_remove)
>>  {
>>  	struct vm_area_struct *vma;
>>  	struct task_struct *tsk;
>> @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page,
>>  	i_mmap_lock_read(mapping);
>>  	read_lock(&tasklist_lock);
>>  	for_each_process(tsk) {
>> -		struct task_struct *t = task_early_kill(tsk, true);
>> +		struct task_struct *t = tsk;
>>
>> +		/*
>> +		 * Search for all tasks while MF_MEM_PRE_REMOVE is set, because
>> +		 * the current may not be the one accessing the fsdax page.
>> +		 * Otherwise, search for the current task.
>> +		 */
>> +		if (!pre_remove)
>> +			t = task_early_kill(tsk, true);
>>  		if (!t)
>>  			continue;
>>  		vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>> @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>  	dax_entry_t cookie;
>>  	struct page *page;
>>  	size_t end = index + count;
>> +	bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>>
>>  	mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>>
>> @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>  		if (!page)
>>  			goto unlock;
>>
>> -		SetPageHWPoison(page);
>> +		if (!pre_remove)
>> +			SetPageHWPoison(page);
>>
>> -		collect_procs_fsdax(page, mapping, index, &to_kill);
>> +		collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>>  		unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>>  				index, mf_flags);
>>  unlock:
>> --
>> 2.41.0
>>