From: Dan Williams <dan.j.williams@intel.com>
To: Shiyang Ruan <ruansy.fnst@fujitsu.com>,
<linux-fsdevel@vger.kernel.org>, <nvdimm@lists.linux.dev>,
<linux-xfs@vger.kernel.org>, <linux-mm@kvack.org>
Cc: <dan.j.williams@intel.com>, <willy@infradead.org>, <jack@suse.cz>,
<akpm@linux-foundation.org>, <djwong@kernel.org>,
<mcgrof@kernel.org>
Subject: RE: [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
Date: Mon, 7 Aug 2023 17:31:18 -0700 [thread overview]
Message-ID: <64d18cd6c6e09_5ea6e294fb@dwillia2-xfh.jf.intel.com.notmuch> (raw)
In-Reply-To: <20230629081651.253626-3-ruansy.fnst@fujitsu.com>
Shiyang Ruan wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1]. With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> on it to unmap all files in use, and notify processes who are using
> those files.
>
> Call trace:
> trigger unbind
> -> unbind_store()
> -> ... (skip)
> -> devres_release_all()
> -> kill_dax()
> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> -> xfs_dax_notify_failure()
> `-> freeze_super() // freeze (kernel call)
> `-> do xfs rmap
> ` -> mf_dax_kill_procs()
> ` -> collect_procs_fsdax() // all associated processes
> ` -> unmap_and_kill()
> ` -> invalidate_inode_pages2_range() // drop file's cache
> `-> thaw_super() // thaw (both kernel & user call)
>
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event. Use the exclusive freeze/thaw[2] to lock the filesystem to prevent
> new dax mapping from being created. Do not shutdown filesystem directly
> if configuration is not supported, or if failure range includes metadata
> area. Make sure all files and processes(not only the current progress)
> are handled correctly. Also drop the cache of associated files before
> pmem is removed.
I would say more about why this is important for DAX users. Yes, the
devm_memremap_pages() vs get_user_pages() infrastructure can be improved
if it has a mechanism to revoke all pages that it has handed out for a
given device, but that's not an end user visible effect.
The end user impact needs to be clear. Is this for existing deployed
pmem where a user accidentally removes a device and wants failures and
process killing instead of hangs?
The reason Linux has got along without this for so long is because pmem
is difficult to remove (and with the sunset of Optane, difficult to
acquire). One motivation to pursue this is CXL where hotplug is better
defined and use cases like dynamic capacity devices where making forward
progress to kill processes is better than hanging.
It would help to have an example of what happens without this patch.
>
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> [2]: https://lore.kernel.org/linux-xfs/168688010689.860947.1788875898367401950.stgit@frogsfrogsfrogs/
>
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
> drivers/dax/super.c | 3 +-
> fs/xfs/xfs_notify_failure.c | 86 ++++++++++++++++++++++++++++++++++---
> include/linux/mm.h | 1 +
> mm/memory-failure.c | 17 ++++++--
> 4 files changed, 96 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index c4c4728a36e4..2e1a35e82fce 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> return;
>
> if (dax_dev->holder_data != NULL)
> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> + MF_MEM_PRE_REMOVE);
The motivation in the original proposal was to convey the death of
large extents to memory_failure(). However, that proposal predated your
mf_dax_kill_procs() approach. With mf_dax_kill_procs() the need for a
new bulk memory_failure() API is gone.
This is where the end user impact needs to be clear. It seems that
without this patch the filesystem may assume failure while the device is
already present, but that seems ok. The goal is forward progress after a
mistake not necessarily minimizing damage after a mistake. The fact that
the current code is not as gentle could be considered a feature because
graceful shutdown should always unmount before unplug, and if one
unplugs before unmount it is already understood that they get to keep
the pieces.
Because the driver ->remove() callback can not enforce that the device
is still present it seems unnecessary to optimize for the case where the
filesystem is the device is being removed from an actively mounted
filesystem, but the device is still present.
The dax_holder_notify_failure(dax_dev, 0, U64_MAX) is sufficient to say
"userspace failed to umount before hardware eject, stop trying to access
this range", rather than "try to finish up in this range, but it might
already be too late".
next prev parent reply other threads:[~2023-08-08 0:31 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-06-29 8:16 [PATCH v12 0/2] " Shiyang Ruan
2023-06-29 8:16 ` [PATCH v12 1/2] xfs: fix the calculation for "end" and "length" Shiyang Ruan
2023-06-29 8:16 ` [PATCH v12 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
2023-06-29 12:02 ` kernel test robot
2023-07-14 9:07 ` Shiyang Ruan
2023-07-14 14:18 ` Darrick J. Wong
2023-07-20 1:50 ` Shiyang Ruan
2023-07-29 10:01 ` Shiyang Ruan
2023-07-29 15:15 ` Darrick J. Wong
2023-07-29 15:15 ` Darrick J. Wong
2023-07-31 9:36 ` Shiyang Ruan
2023-08-01 3:25 ` Darrick J. Wong
2023-08-03 10:44 ` Shiyang Ruan
2023-08-08 0:31 ` Dan Williams [this message]
2023-08-23 8:36 ` Shiyang Ruan
2023-08-23 8:17 ` [PATCH v13] mm, pmem, xfs: Introduce MF_MEM_PRE_REMOVE " Shiyang Ruan
2023-08-23 23:36 ` Darrick J. Wong
2023-08-24 9:41 ` Shiyang Ruan
2023-08-24 23:57 ` Darrick J. Wong
2023-08-25 3:52 ` Shiyang Ruan
2023-08-26 0:17 ` Darrick J. Wong
2023-08-28 6:57 ` [PATCH v14] " Shiyang Ruan
2023-08-30 15:34 ` Darrick J. Wong
2023-09-27 8:17 ` Dan Williams
2023-09-27 9:18 ` Shiyang Ruan
2023-09-28 10:32 ` [PATCH v15] " Shiyang Ruan
2023-09-29 18:31 ` Dan Williams
2023-10-01 1:43 ` kernel test robot
2023-10-02 11:57 ` Shiyang Ruan
2023-10-20 9:56 ` Chandan Babu R
2023-10-20 15:40 ` Darrick J. Wong
2023-10-23 6:40 ` Chandan Babu R
2023-10-23 7:26 ` Shiyang Ruan
2023-10-23 12:21 ` Chandan Babu R
2023-10-23 7:20 ` [PATCH v15.1] " Shiyang Ruan
2024-01-11 22:24 ` [PATCH v12 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE " Bill O'Donnell
2024-01-12 1:56 ` Shiyang Ruan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=64d18cd6c6e09_5ea6e294fb@dwillia2-xfh.jf.intel.com.notmuch \
--to=dan.j.williams@intel.com \
--cc=akpm@linux-foundation.org \
--cc=djwong@kernel.org \
--cc=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-xfs@vger.kernel.org \
--cc=mcgrof@kernel.org \
--cc=nvdimm@lists.linux.dev \
--cc=ruansy.fnst@fujitsu.com \
--cc=willy@infradead.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox