From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id A26ABEE49AF for ; Wed, 23 Aug 2023 08:36:38 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id D8BBB280061; Wed, 23 Aug 2023 04:36:37 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id D3B89280059; Wed, 23 Aug 2023 04:36:37 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id BDC77280061; Wed, 23 Aug 2023 04:36:37 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id ABEE1280059 for ; Wed, 23 Aug 2023 04:36:37 -0400 (EDT) Received: from smtpin25.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id 7C1394050C for ; Wed, 23 Aug 2023 08:36:37 +0000 (UTC) X-FDA: 81154713234.25.0754148 Received: from esa11.hc1455-7.c3s2.iphmx.com (esa11.hc1455-7.c3s2.iphmx.com [207.54.90.137]) by imf29.hostedemail.com (Postfix) with ESMTP id 03A9F120011 for ; Wed, 23 Aug 2023 08:36:34 +0000 (UTC) Authentication-Results: imf29.hostedemail.com; dkim=none; spf=pass (imf29.hostedemail.com: domain of ruansy.fnst@fujitsu.com designates 207.54.90.137 as permitted sender) smtp.mailfrom=ruansy.fnst@fujitsu.com; dmarc=pass (policy=none) header.from=fujitsu.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1692779795; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=xCmKP51x+5Pq5qqTdnwx+yQ5+IHsAQLfGFq907/wYzc=; b=7mXj5M5SI43w8caChk81I3C5EZobOBcrE68fYjlsUepT3/oxzhQ8qPoiPqj43rzba/1uGd 5CrP+f/mQZFpCJIStMibNN9ZjkkCHldoF1Bf6QH+oUa6hlVaj4DK6FNBAZTDRkQczCv+0m HHXyGt9kkUmNIh8xLlrxn6Clp1v4tGM= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1692779795; a=rsa-sha256; cv=none; b=kKtizTbBsvBLBskiKVIVZlpyFrTqgu1IIVZnUhicozngwrqg6FWp/s+FsoPssRf2t1QqKp 0tyLTmHaBsanpzJmkZPVT+8THaNosIxUiw07+Kb9K1dekSmmrg+Dod2/mXbt+oOpZm6pkM m3x3pR2GkzB7beumB4LyYhKd1FVh5yw= ARC-Authentication-Results: i=1; imf29.hostedemail.com; dkim=none; spf=pass (imf29.hostedemail.com: domain of ruansy.fnst@fujitsu.com designates 207.54.90.137 as permitted sender) smtp.mailfrom=ruansy.fnst@fujitsu.com; dmarc=pass (policy=none) header.from=fujitsu.com X-IronPort-AV: E=McAfee;i="6600,9927,10810"; a="108493204" X-IronPort-AV: E=Sophos;i="6.01,195,1684767600"; d="scan'208";a="108493204" Received: from unknown (HELO yto-r1.gw.nic.fujitsu.com) ([218.44.52.217]) by esa11.hc1455-7.c3s2.iphmx.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 23 Aug 2023 17:36:33 +0900 Received: from yto-m1.gw.nic.fujitsu.com (yto-nat-yto-m1.gw.nic.fujitsu.com [192.168.83.64]) by yto-r1.gw.nic.fujitsu.com (Postfix) with ESMTP id 7BB8EDB3C1 for ; Wed, 23 Aug 2023 17:36:30 +0900 (JST) Received: from kws-ab4.gw.nic.fujitsu.com (kws-ab4.gw.nic.fujitsu.com [192.51.206.22]) by yto-m1.gw.nic.fujitsu.com (Postfix) with ESMTP id BAC2BCF7C0 for ; Wed, 23 Aug 2023 17:36:29 +0900 (JST) Received: from edo.cn.fujitsu.com (edo.cn.fujitsu.com [10.167.33.5]) by kws-ab4.gw.nic.fujitsu.com (Postfix) with ESMTP id 502822289E7 for ; Wed, 23 Aug 2023 17:36:29 +0900 (JST) Received: from [192.168.50.5] (unknown [10.167.234.230]) by edo.cn.fujitsu.com (Postfix) with ESMTP id 644271A008F; Wed, 23 Aug 2023 16:36:28 +0800 (CST) Message-ID: Date: Wed, 23 Aug 2023 16:36:28 +0800 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind To: Dan Williams , linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev, linux-xfs@vger.kernel.org, linux-mm@kvack.org Cc: willy@infradead.org, jack@suse.cz, akpm@linux-foundation.org, djwong@kernel.org, mcgrof@kernel.org References: <20230629081651.253626-1-ruansy.fnst@fujitsu.com> <20230629081651.253626-3-ruansy.fnst@fujitsu.com> <64d18cd6c6e09_5ea6e294fb@dwillia2-xfh.jf.intel.com.notmuch> Content-Language: en-US From: Shiyang Ruan In-Reply-To: <64d18cd6c6e09_5ea6e294fb@dwillia2-xfh.jf.intel.com.notmuch> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-TM-AS-GCONF: 00 X-TM-AS-Product-Ver: IMSS-9.1.0.1417-9.0.0.1002-27830.005 X-TM-AS-User-Approved-Sender: Yes X-TMASE-Version: IMSS-9.1.0.1417-9.0.1002-27830.005 X-TMASE-Result: 10--29.976500-10.000000 X-TMASE-MatchedRID: UomY2wWC/KyPvrMjLFD6eCkMR2LAnMRp2q80vLACqaeqvcIF1TcLYI5a mlqOXwWavgUEAzu+ZVNbtzD5SJbjLpGZilTi8ctSxDiakrJ+Splt9UVWhqbRIWtEzrC9eANpEJm hpJ8aMPMs4TH6G8STucqWFlCQS6PJay2H+VAa8iXTCZHfjFFBzxokPBiBBj9/WAuSz3ewb23jE7 v208scT7xxBqTj7wDsupZIYbG9DPyAsoqIP3ENkIvefyp1glN0MzbF1gbxlQZFms6YEs23D3Nrl Eb9qTKUfNghynX3+4Fh7mDBbvo6VLHUOxwKYaKnYCs5AYZvXC/4qCLIu0mtIMC5DTEMxpeQ3JB1 YIf9iAoPNxz2EvpSIL/t8hsriyc8pANxWw95dxPY6EGR3k/uitG3Y6ijqBt3teXjSBMYnmklVDj pDxeIWyolWOAMg2fcFR2lGwd2c9E/O/cxE+VGcUWX0DfhVamwvD4YKo9SttxAzPYUSDzxTFDczN /SMpBY+u+xXVO1LUDeLyT0oSNhxXAA9eFj9SfYngIgpj8eDcBZDL1gLmoa/ALDAYP4AXVR7nY51 lwLq0+8QIu4z6HhEH7cGd19dSFd X-TMASE-SNAP-Result: 1.821001.0001-0-1-22:0,33:0,34:0-0 X-Rspamd-Queue-Id: 03A9F120011 X-Rspam-User: X-Stat-Signature: gyzpcus9z737zwnozfap841cd3x1ks1b X-Rspamd-Server: rspam03 X-HE-Tag: 1692779794-197479 X-HE-Meta: U2FsdGVkX1+XExK6LkOQkb+MPFeohJaCDuTH38R0zjLrEVgcYsE/UdZZS3NZpBgifUnOYX3EmKk9GUMBcYCUJfb4WEB0kyEnhuVcsNjRbSdP0gpooKAT6OH5MjACDfWtHf1ZieNDnteEQ/w6YjHOP6AQAyx31eelcKmNynYminZlPybcxPtzU9fZERanFM3j9IGLjsbjsfXwtrZGecXz/cmsMwKXbM4vE66WVz436M9Cl/E8oXAT5jvZx870tkKKvyxO3b9nNLYqdq09xc0hKCLH0+ntuvQXucNbZWtdnAPTd3ny84nBYBLV96uxkkQQngxycI323wtiwKpF+Hy8aWPZXKtnvdytFdjynul41NimT/GJP1dDUg42PP4zX+tkNUyrHRSXB504gzMNqX/2ay94Pcm7R8Sfa8nw/sM68aFmVN6W6553x/Stxz6l7M3ZBdQLGgsw9qgD8s1UrIOAdNoHsbdaZYQ1wWzYYStFYLUFe4zeHD7e9mprsYBEjcfCEpKWzDqxW+m7axv48w/exFYPDadHEGDu4JJZ8GgJ2hJzAYG+YOvwTTnSeBj5FEWc8w0ceAJRurp0zkjHm530EmCrBUOfCQb4cisSYCU3QOS1xoBW1g+j5v3bXlqwG8fH3uEkFrPyMDHAZIuWU5VcRtoYXw9Snatt+eqiA/sMM0FlJa3CsO4Osp0bC3GiyI52ckPa/hJiPfe6hBxf7ep6QImdYQA5BN3c3AP47pnYStuOl9d0wExNxpg/Uk7o6N3Z9o/i3R9nLFVM67qretVnF0qztCyoGvhFV/VNpFUO015fQCFTJ6OZ1k7rEPSDzqi2HLB7TjLvLepSprfc0gosKAQTtgQ3zb3/Qvd1UkNDYOhJ3iWV1Fa3JkBFd4jnBB9gzybB/8cXo5YbKFmLaRLfQeCqmJia2vW9zoWQaOyGF4lUOgc0V6tBers5IhDn0atTHH08QmD2yCzFIGq81D+ RfCdHUiv TQRK1eLX7/hEpQ2NDgtbyTJ2x5nQh1BNWLDMGxGgdDMAVz/0d/0vCP5ZAjMn+qe8YAu3hLvMno8qcqx0luheGboYNJCD3Wcehc5LQ8ZeAD744g+c13rcOIJRim7dWyxx3oyHtSD/R4H0wpvlgonYxvp7bG3ImFVfxW9YDu5Y71b7bX38POpp5NUGMkEoEBOvlEkjVW/X+xhSo56KmhwwQ2L2AoOiTdj+7SUW6uIN0/7GV22lm1i8+NTuCUtbzuaSCFkS+pXJfGsFzdxjIUKO9NgMC1vfa02eXuCGfIsie5IWRi9QxOZBQLp5uNA0rB//DfrOpdnwfagKYYqAjhoE/5SLDCQ== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: 在 2023/8/8 8:31, Dan Williams 写道: > Shiyang Ruan wrote: >> This patch is inspired by Dan's "mm, dax, pmem: Introduce >> dev_pagemap_failure()"[1]. With the help of dax_holder and >> ->notify_failure() mechanism, the pmem driver is able to ask filesystem >> on it to unmap all files in use, and notify processes who are using >> those files. >> >> Call trace: >> trigger unbind >> -> unbind_store() >> -> ... (skip) >> -> devres_release_all() >> -> kill_dax() >> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE) >> -> xfs_dax_notify_failure() >> `-> freeze_super() // freeze (kernel call) >> `-> do xfs rmap >> ` -> mf_dax_kill_procs() >> ` -> collect_procs_fsdax() // all associated processes >> ` -> unmap_and_kill() >> ` -> invalidate_inode_pages2_range() // drop file's cache >> `-> thaw_super() // thaw (both kernel & user call) >> >> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove >> event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent >> new dax mapping from being created. Do not shutdown filesystem directly >> if configuration is not supported, or if failure range includes metadata >> area. Make sure all files and processes(not only the current progress) >> are handled correctly. Also drop the cache of associated files before >> pmem is removed. > > I would say more about why this is important for DAX users. Yes, the > devm_memremap_pages() vs get_user_pages() infrastructure can be improved > if it has a mechanism to revoke all pages that it has handed out for a > given device, but that's not an end user visible effect. > > The end user impact needs to be clear. Is this for existing deployed > pmem where a user accidentally removes a device and wants failures and > process killing instead of hangs? > > The reason Linux has got along without this for so long is because pmem > is difficult to remove (and with the sunset of Optane, difficult to > acquire). One motivation to pursue this is CXL where hotplug is better > defined and use cases like dynamic capacity devices where making forward > progress to kill processes is better than hanging. > > It would help to have an example of what happens without this patch. > >> >> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/ >> [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/ >> >> Signed-off-by: Shiyang Ruan >> --- >> drivers/dax/super.c | 3 +- >> fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++--- >> include/linux/mm.h | 1 + >> mm/memory-failure.c | 17 ++++++-- >> 4 files changed, 96 insertions(+), 11 deletions(-) >> >> diff --git a/drivers/dax/super.c b/drivers/dax/super.c >> index c4c4728a36e4..2e1a35e82fce 100644 >> --- a/drivers/dax/super.c >> +++ b/drivers/dax/super.c >> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev) >> return; >> >> if (dax_dev->holder_data != NULL) >> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0); >> + dax_holder_notify_failure(dax_dev, 0, U64_MAX, >> + MF_MEM_PRE_REMOVE); > > The motivation in the original proposal was to convey the death of > large extents to memory_failure(). However, that proposal predated your > mf_dax_kill_procs() approach. With mf_dax_kill_procs() the need for a > new bulk memory_failure() API is gone. > > This is where the end user impact needs to be clear. It seems that > without this patch the filesystem may assume failure while the device is > already present, but that seems ok. The goal is forward progress after a > mistake not necessarily minimizing damage after a mistake. The fact that > the current code is not as gentle could be considered a feature because > graceful shutdown should always unmount before unplug, and if one > unplugs before unmount it is already understood that they get to keep > the pieces. > > Because the driver ->remove() callback can not enforce that the device > is still present it seems unnecessary to optimize for the case where the > filesystem is the device is being removed from an actively mounted > filesystem, but the device is still present. > > The dax_holder_notify_failure(dax_dev, 0, U64_MAX) is sufficient to say > "userspace failed to umount before hardware eject, stop trying to access > this range", rather than "try to finish up in this range, but it might > already be too late". Hi Dan, I added an simple example of "accidentally remove pmem device" and its consequences of not having this patch in the latest version. Please review. -- Thanks, Ruan.