From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7D27DC001DE for ; Mon, 31 Jul 2023 09:36:47 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id D98106B0088; Mon, 31 Jul 2023 05:36:46 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id D476B6B0089; Mon, 31 Jul 2023 05:36:46 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id BE85628001A; Mon, 31 Jul 2023 05:36:46 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id ACA046B0088 for ; Mon, 31 Jul 2023 05:36:46 -0400 (EDT) Received: from smtpin16.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay01.hostedemail.com (Postfix) with ESMTP id 80E8F1C8E20 for ; Mon, 31 Jul 2023 09:36:46 +0000 (UTC) X-FDA: 81071402412.16.3912C9B Received: from esa6.hc1455-7.c3s2.iphmx.com (esa6.hc1455-7.c3s2.iphmx.com [68.232.139.139]) by imf14.hostedemail.com (Postfix) with ESMTP id A3E91100011 for ; Mon, 31 Jul 2023 09:36:43 +0000 (UTC) Authentication-Results: imf14.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=fujitsu.com; spf=pass (imf14.hostedemail.com: domain of ruansy.fnst@fujitsu.com designates 68.232.139.139 as permitted sender) smtp.mailfrom=ruansy.fnst@fujitsu.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1690796204; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=TdWiw6dA4IOU1Knlv+sEyik6ae6yZnsMAl+XC1FLpnA=; b=u979Br2UwseCHKmOc4iFU/DgWStcImT8t/gy044/KKzhyHHO8JKzL0tuBANID3uSEANkl/ Wq25I57Ja94au/F6zZB5MGtMTQ045ObiaWztyWRuh2mDETv1kkORBWtkX2hcnKatx5+XOD tYWoySTbEirREHNjeTX0AAEKV8i7tsA= ARC-Authentication-Results: i=1; imf14.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=fujitsu.com; spf=pass (imf14.hostedemail.com: domain of ruansy.fnst@fujitsu.com designates 68.232.139.139 as permitted sender) smtp.mailfrom=ruansy.fnst@fujitsu.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1690796204; a=rsa-sha256; cv=none; b=5odv6SpcyjUjLWPP1uMwKGwWomFoWOH/dZPNkjuS1OXBsKQ2RBRXDGPYNXvBsl+7Iovbuy rNMqpd7OckCraNZFkwop3hiMjv7E4P/yT1ImDTQkVh1euVk8IXVWA3JnGw10gKoBFg4a5G GJUTj19h0ZC2E/0/lFvJKThx2i+575c= X-IronPort-AV: E=McAfee;i="6600,9927,10787"; a="127731122" X-IronPort-AV: E=Sophos;i="6.01,244,1684767600"; d="scan'208";a="127731122" Received: from unknown (HELO yto-r1.gw.nic.fujitsu.com) ([218.44.52.217]) by esa6.hc1455-7.c3s2.iphmx.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 31 Jul 2023 18:36:41 +0900 Received: from yto-m2.gw.nic.fujitsu.com (yto-nat-yto-m2.gw.nic.fujitsu.com [192.168.83.65]) by yto-r1.gw.nic.fujitsu.com (Postfix) with ESMTP id 074ACDB3C2 for ; Mon, 31 Jul 2023 18:36:39 +0900 (JST) Received: from kws-ab3.gw.nic.fujitsu.com (kws-ab3.gw.nic.fujitsu.com [192.51.206.21]) by yto-m2.gw.nic.fujitsu.com (Postfix) with ESMTP id 346A3D5E97 for ; Mon, 31 Jul 2023 18:36:38 +0900 (JST) Received: from edo.cn.fujitsu.com (edo.cn.fujitsu.com [10.167.33.5]) by kws-ab3.gw.nic.fujitsu.com (Postfix) with ESMTP id AE21620074726 for ; Mon, 31 Jul 2023 18:36:37 +0900 (JST) Received: from [192.168.50.5] (unknown [10.167.234.230]) by edo.cn.fujitsu.com (Postfix) with ESMTP id BBDA81A0070; Mon, 31 Jul 2023 17:36:36 +0800 (CST) Message-ID: Date: Mon, 31 Jul 2023 17:36:36 +0800 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind To: "Darrick J. Wong" Cc: linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org, dan.j.williams@intel.com, willy@infradead.org, jack@suse.cz, akpm@linux-foundation.org, mcgrof@kernel.org References: <20230629081651.253626-1-ruansy.fnst@fujitsu.com> <20230629081651.253626-3-ruansy.fnst@fujitsu.com> <20230729151506.GI11352@frogsfrogsfrogs> Content-Language: en-US From: Shiyang Ruan In-Reply-To: <20230729151506.GI11352@frogsfrogsfrogs> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-TM-AS-GCONF: 00 X-TM-AS-Product-Ver: IMSS-9.1.0.1417-9.0.0.1002-27784.006 X-TM-AS-User-Approved-Sender: Yes X-TMASE-Version: IMSS-9.1.0.1417-9.0.1002-27784.006 X-TMASE-Result: 10--21.513100-10.000000 X-TMASE-MatchedRID: fQs1WIo05PGPvrMjLFD6eHchRkqzj/bEC/ExpXrHizxBqLOmHiM3wxem 4OLSGy1EiNggJ6Pbm2FbtzD5SJbjLpGZilTi8ctSxDiakrJ+Splt9UVWhqbRIWtEzrC9eANpEJm hpJ8aMPMs4TH6G8STucqWFlCQS6PJay2H+VAa8iXTCZHfjFFBzxokPBiBBj9/WAuSz3ewb23jE7 v208scT4Cx+Toe1sV/EUEPr56O2WIv+0FNnM7lDRFbgtHjUWLyGB9/bxS68hPMtotGtpF5VgimM t6TSXWl9IAP3W8IJ6ROaA8tMUkyucwitucT3dE79Ib/6w+1lWQ0YL9SJPufX7E9dgiHWXp2dCIZ l7SNYAwqrSgxSiVy6T4BGdad1mqh8p7loYJT/FuDpW5ZeDjLZEEe5VjFzwNbFCmwHLoxcsa20nP Q2eZ7CcNUkvwuYWUMsNZuYAtJw80900H1KQL9bG03YawHJvPCTfK5j0EZbyuHvxzu1FG4kbEOoi tFQVcIGBKlWUwuGka5HjqbLog5/l7FLEpyoHYAgnMtC97jHVQXivwflisSrHd17Y6gGqDCbxGoC yAt4uiBwOiAXd70LJNHG+DwJ1Q9najGxrJsU4OeAiCmPx4NwFkMvWAuahr8AsMBg/gBdVHudjnW XAurT7xAi7jPoeEQftwZ3X11IV0= X-TMASE-SNAP-Result: 1.821001.0001-0-1-22:0,33:0,34:0-0 X-Rspamd-Queue-Id: A3E91100011 X-Rspam-User: X-Rspamd-Server: rspam04 X-Stat-Signature: k8py1o7hust9cbas7af7gmh5smfim4xo X-HE-Tag: 1690796203-252810 X-HE-Meta: U2FsdGVkX181BcDOpUVzQ9ZAHpZDOJBLcojRVwlxJ/8mjJxxdgkNmMOtru1pmUeFcWwmI4OvTa9n5i849ja5K7A13VmE4sG/C0g+74q7rTH7XEUKKpwtsOELTrN7BBzRr3CN+3fawGLebaSiOmLC1Hkbpxu30keKQRLCyKrlE/a1nmbMxzgGuNB8zjQzT80PC6QcdzltRpAxTa8ZYmspfN+1FKlpseMbiidnFxWMCqfnd+CWwGSJhz0OLX1rRnQwv7NoVHp+lcD0fck5f+t/SyTcMROfs3p2ly6BVlG5Z1zmrFT91m/tFjFkSahn9fPjCnLeyQKTA6FMXNbdhLIyQJVEkKi1yPtGmMWdVyDPqL/T5PeQ4PSJS6Wc89ZrUtbNr0pJiL4QOfRqegjlYha3qk6jp+Zr82N+rSXOxplkcnjqhE/rq0PDew7FxBeYbjckeiWv/EiOniCBKoIaE2qYJnYnBNbucgVINm6oQ2pJmP5emk0BDaFJ7YCbCYkn/vXnwRbP9Une2oyPThiCCCBw/XsdWOclz5e0Tho9g+v+p7ABJC0HXARex8rpRdd5G43GT7tFYvOeDXk6cULoKytQJkNa6RD2DGw8aKMKvvMIg1MNkygpbzYzZGQB7iZv3HSjghHy7NknCTYL/GdoKCx8QMNJqWjfOYZcztvKPotAOWydP8zaUMK8Ac63WYu5GCowmIAOhGHBFbd5TWAjHB+KbAqcP2cOg8wtwk8V79AMJDflp6SbL/FrP+L2A6UcmbQ97DEgBzIUW23JQMSeuVTloI6L/ybwSJ/DfN/EjT+IbLKK83q7DMKyB4GogjjJnjZedNvn/lAN9ZQIFeSDzwwI+i4TpIDCLFwgobZuuKp4W4T4iMWoETAHWE3ENd/FArPYUZx0JEhHMYcRom3pLfNwVLaa02N9epycZ3FmViTZNIJxyf9bJHdaogaje7GLEWm4PaEGSz2yHiAyMY/UyXO +070AJQc xOWfywStKTTHQ4ITNMSj0g3PbGcpTi8APqhUWezXdNxoTmdl8lBCXEyPcICpCAFNZGLTEq1sjPpHNrxvqn05mi7skUMkBL7B09BDZNtwZr0QF4h/hvWQm806Rwbi/W9o1Y+Cd8lPY+M4Ucmip+Yq0ruHJeOLK9F/1HLe6nl9Vu6CtiaLkTVajXs61sXyFx9E2+47DB22W7R7cWjb8v3V7G6areFJ5FBRILyvBsYS6vNHk9MGK4efp1ah1YwqyDoexLi7E5kRMMmih3lefVry1j9X7dUuS7fPz8jgf1r+qm77hxEQeiM5pYC+hmYzZ+rlrbxrr65vHdYnDKSa69lnwfd9mIw== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: 在 2023/7/29 23:15, Darrick J. Wong 写道: > On Thu, Jun 29, 2023 at 04:16:51PM +0800, Shiyang Ruan wrote: >> This patch is inspired by Dan's "mm, dax, pmem: Introduce >> dev_pagemap_failure()"[1]. With the help of dax_holder and >> ->notify_failure() mechanism, the pmem driver is able to ask filesystem >> on it to unmap all files in use, and notify processes who are using >> those files. >> >> Call trace: >> trigger unbind >> -> unbind_store() >> -> ... (skip) >> -> devres_release_all() >> -> kill_dax() >> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE) >> -> xfs_dax_notify_failure() >> `-> freeze_super() // freeze (kernel call) >> `-> do xfs rmap >> ` -> mf_dax_kill_procs() >> ` -> collect_procs_fsdax() // all associated processes >> ` -> unmap_and_kill() >> ` -> invalidate_inode_pages2_range() // drop file's cache >> `-> thaw_super() // thaw (both kernel & user call) >> >> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove >> event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent >> new dax mapping from being created. Do not shutdown filesystem directly >> if configuration is not supported, or if failure range includes metadata >> area. Make sure all files and processes(not only the current progress) >> are handled correctly. Also drop the cache of associated files before >> pmem is removed. >> >> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/ >> [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/ >> >> Signed-off-by: Shiyang Ruan >> --- >> drivers/dax/super.c | 3 +- >> fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++--- >> include/linux/mm.h | 1 + >> mm/memory-failure.c | 17 ++++++-- >> 4 files changed, 96 insertions(+), 11 deletions(-) >> >> diff --git a/drivers/dax/super.c b/drivers/dax/super.c >> index c4c4728a36e4..2e1a35e82fce 100644 >> --- a/drivers/dax/super.c >> +++ b/drivers/dax/super.c >> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev) >> return; >> >> if (dax_dev->holder_data != NULL) >> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0); >> + dax_holder_notify_failure(dax_dev, 0, U64_MAX, >> + MF_MEM_PRE_REMOVE); >> >> clear_bit(DAXDEV_ALIVE, &dax_dev->flags); >> synchronize_srcu(&dax_srcu); >> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c >> index 4a9bbd3fe120..f6ec56b76db6 100644 >> --- a/fs/xfs/xfs_notify_failure.c >> +++ b/fs/xfs/xfs_notify_failure.c >> @@ -22,6 +22,7 @@ >> >> #include >> #include >> +#include >> >> struct xfs_failure_info { >> xfs_agblock_t startblock; >> @@ -73,10 +74,16 @@ xfs_dax_failure_fn( >> struct xfs_mount *mp = cur->bc_mp; >> struct xfs_inode *ip; >> struct xfs_failure_info *notify = data; >> + struct address_space *mapping; >> + pgoff_t pgoff; >> + unsigned long pgcnt; >> int error = 0; >> >> if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) || >> (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) { >> + /* Continue the query because this isn't a failure. */ >> + if (notify->mf_flags & MF_MEM_PRE_REMOVE) >> + return 0; >> notify->want_shutdown = true; >> return 0; >> } >> @@ -92,14 +99,55 @@ xfs_dax_failure_fn( >> return 0; >> } >> >> - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping, >> - xfs_failure_pgoff(mp, rec, notify), >> - xfs_failure_pgcnt(mp, rec, notify), >> - notify->mf_flags); >> + mapping = VFS_I(ip)->i_mapping; >> + pgoff = xfs_failure_pgoff(mp, rec, notify); >> + pgcnt = xfs_failure_pgcnt(mp, rec, notify); >> + >> + /* Continue the rmap query if the inode isn't a dax file. */ >> + if (dax_mapping(mapping)) >> + error = mf_dax_kill_procs(mapping, pgoff, pgcnt, >> + notify->mf_flags); >> + >> + /* Invalidate the cache in dax pages. */ >> + if (notify->mf_flags & MF_MEM_PRE_REMOVE) >> + invalidate_inode_pages2_range(mapping, pgoff, >> + pgoff + pgcnt - 1); >> + >> xfs_irele(ip); >> return error; >> } >> >> +static void >> +xfs_dax_notify_failure_freeze( >> + struct xfs_mount *mp) >> +{ >> + struct super_block *sb = mp->m_super; > > Nit: extra space right ^ here. > >> + >> + /* Wait until no one is holding the FREEZE_HOLDER_KERNEL. */ >> + while (freeze_super(sb, FREEZE_HOLDER_KERNEL) != 0) { >> + // Shall we just wait, or print warning then return -EBUSY? > > Hm. PRE_REMOVE gets called before the pmem gets unplugged, right? So > we'll send a second notification after it goes away, right? For the first question, yes. But I'm not sure about the second one. Do you mean: we'll send this notification again if unbind didn't success because freeze_super() returns -EBUSY? In other words, if the previous unbind operation did not work, we could unbind the device again. > > If so, then I'd say return the error here instead of looping, and live > with a kernel-frozen fs discarding the PRE_REMOVE message. > >> + delay(HZ / 10); >> + } >> +} >> + >> +static void >> +xfs_dax_notify_failure_thaw( >> + struct xfs_mount *mp) >> +{ >> + struct super_block *sb = mp->m_super; >> + int error; >> + >> + error = thaw_super(sb, FREEZE_HOLDER_KERNEL); >> + if (error) >> + xfs_emerg(mp, "still frozen after notify failure, err=%d", >> + error); >> + /* >> + * Also thaw userspace call anyway because the device is about to be >> + * removed immediately. >> + */ >> + thaw_super(sb, FREEZE_HOLDER_USERSPACE); >> +} >> + >> static int >> xfs_dax_notify_ddev_failure( >> struct xfs_mount *mp, >> @@ -120,7 +168,7 @@ xfs_dax_notify_ddev_failure( >> >> error = xfs_trans_alloc_empty(mp, &tp); >> if (error) >> - return error; >> + goto out; >> >> for (; agno <= end_agno; agno++) { >> struct xfs_rmap_irec ri_low = { }; >> @@ -165,11 +213,23 @@ xfs_dax_notify_ddev_failure( >> } >> >> xfs_trans_cancel(tp); >> + >> + /* >> + * Determine how to shutdown the filesystem according to the >> + * error code and flags. >> + */ >> if (error || notify.want_shutdown) { >> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK); >> if (!error) >> error = -EFSCORRUPTED; >> - } >> + } else if (mf_flags & MF_MEM_PRE_REMOVE) >> + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT); >> + >> +out: >> + /* Thaw the fs if it is freezed before. */ >> + if (mf_flags & MF_MEM_PRE_REMOVE) >> + xfs_dax_notify_failure_thaw(mp); > > _thaw should be called from the same function that called _freeze. Will fix this. > > The rest of the patch seems ok to me. Thank you! -- Ruan. > > --D > >> + >> return error; >> } >> >> @@ -197,6 +257,8 @@ xfs_dax_notify_failure( >> >> if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev && >> mp->m_logdev_targp != mp->m_ddev_targp) { >> + if (mf_flags & MF_MEM_PRE_REMOVE) >> + return 0; >> xfs_err(mp, "ondisk log corrupt, shutting down fs!"); >> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK); >> return -EFSCORRUPTED; >> @@ -210,6 +272,12 @@ xfs_dax_notify_failure( >> ddev_start = mp->m_ddev_targp->bt_dax_part_off; >> ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1; >> >> + /* Notify failure on the whole device. */ >> + if (offset == 0 && len == U64_MAX) { >> + offset = ddev_start; >> + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev); >> + } >> + >> /* Ignore the range out of filesystem area */ >> if (offset + len - 1 < ddev_start) >> return -ENXIO; >> @@ -226,6 +294,12 @@ xfs_dax_notify_failure( >> if (offset + len - 1 > ddev_end) >> len = ddev_end - offset + 1; >> >> + if (mf_flags & MF_MEM_PRE_REMOVE) { >> + xfs_info(mp, "device is about to be removed!"); >> + /* Freeze fs to prevent new mappings from being created. */ >> + xfs_dax_notify_failure_freeze(mp); >> + } >> + >> return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len), >> mf_flags); >> } >> diff --git a/include/linux/mm.h b/include/linux/mm.h >> index 27ce77080c79..a80c255b88d2 100644 >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -3576,6 +3576,7 @@ enum mf_flags { >> MF_UNPOISON = 1 << 4, >> MF_SW_SIMULATED = 1 << 5, >> MF_NO_RETRY = 1 << 6, >> + MF_MEM_PRE_REMOVE = 1 << 7, >> }; >> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index, >> unsigned long count, int mf_flags); >> diff --git a/mm/memory-failure.c b/mm/memory-failure.c >> index 5b663eca1f29..483b75f2fcfb 100644 >> --- a/mm/memory-failure.c >> +++ b/mm/memory-failure.c >> @@ -688,7 +688,7 @@ static void add_to_kill_fsdax(struct task_struct *tsk, struct page *p, >> */ >> static void collect_procs_fsdax(struct page *page, >> struct address_space *mapping, pgoff_t pgoff, >> - struct list_head *to_kill) >> + struct list_head *to_kill, bool pre_remove) >> { >> struct vm_area_struct *vma; >> struct task_struct *tsk; >> @@ -696,8 +696,15 @@ static void collect_procs_fsdax(struct page *page, >> i_mmap_lock_read(mapping); >> read_lock(&tasklist_lock); >> for_each_process(tsk) { >> - struct task_struct *t = task_early_kill(tsk, true); >> + struct task_struct *t = tsk; >> >> + /* >> + * Search for all tasks while MF_MEM_PRE_REMOVE, because the >> + * current may not be the one accessing the fsdax page. >> + * Otherwise, search for the current task. >> + */ >> + if (!pre_remove) >> + t = task_early_kill(tsk, true); >> if (!t) >> continue; >> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) { >> @@ -1793,6 +1800,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index, >> dax_entry_t cookie; >> struct page *page; >> size_t end = index + count; >> + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE; >> >> mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL; >> >> @@ -1804,9 +1812,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index, >> if (!page) >> goto unlock; >> >> - SetPageHWPoison(page); >> + if (!pre_remove) >> + SetPageHWPoison(page); >> >> - collect_procs_fsdax(page, mapping, index, &to_kill); >> + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove); >> unmap_and_kill(&to_kill, page_to_pfn(page), mapping, >> index, mf_flags); >> unlock: >> -- >> 2.40.1 >>