* [PATCH v11 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
@ 2023-03-28 9:41 Shiyang Ruan
2023-03-28 9:41 ` [PATCH v11 1/2] xfs: fix the calculation of length and end Shiyang Ruan
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Shiyang Ruan @ 2023-03-28 9:41 UTC (permalink / raw)
To: linux-fsdevel, nvdimm, linux-xfs, linux-mm
Cc: dan.j.williams, willy, jack, akpm, djwong
This patchset is to add gracefully unbind support for pmem.
Patch1 corrects the calculation of length and end of a given range.
Patch2 introduces a new flag call MF_MEM_REMOVE, to let dax holder know
it is a remove event. With the help of notify_failure mechanism, we are
able to shutdown the filesystem on the pmem gracefully.
Changes since v10:
Patch1:
1. correct the count calculation in xfs_failure_pgcnt().
Patch2:
2. drop the patch which introduces super_drop_pagecache().
3. in mf_dax_kill_procs(), don't SetPageHWPoison() and search for all
tasks while mf_flags has MF_MEM_PRE_REMOVE.
4. only do mf_dax_kill_procs() on dax mapping.
5. do invalidate_inode_pages2_range() for each file found during rmap,
to make sure the dax entry are disassociated before pmem is gone.
Otherwise, umount filesystem after unbind will cause crash because
the dax entries have to be disassociated but now the pmem is not
exist.
For detail analysis of this change, please refer this link[1].
[1] https://lore.kernel.org/linux-xfs/b1d9fc03-1a71-a75f-f87b-5819991e4eb2@fujitsu.com/
Shiyang Ruan (2):
xfs: fix the calculation of length and end
mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
drivers/dax/super.c | 3 +-
fs/xfs/xfs_notify_failure.c | 66 +++++++++++++++++++++++++++++++------
include/linux/mm.h | 1 +
mm/memory-failure.c | 17 +++++++---
4 files changed, 72 insertions(+), 15 deletions(-)
--
2.39.2
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v11 1/2] xfs: fix the calculation of length and end
2023-03-28 9:41 [PATCH v11 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
@ 2023-03-28 9:41 ` Shiyang Ruan
2023-03-28 9:41 ` [PATCH v11 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
2023-04-04 4:33 ` [PATCH v11 0/2] " Shiyang Ruan
2 siblings, 0 replies; 18+ messages in thread
From: Shiyang Ruan @ 2023-03-28 9:41 UTC (permalink / raw)
To: linux-fsdevel, nvdimm, linux-xfs, linux-mm
Cc: dan.j.williams, willy, jack, akpm, djwong
The end should be start + length - 1. Also fix the calculation of the
length when seeking for intersection of notify range and device.
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
fs/xfs/xfs_notify_failure.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index c4078d0ec108..1e2eddb8f90f 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -61,7 +61,7 @@ xfs_failure_pgcnt(
end_notify = notify->startblock + notify->blockcount;
end_cross = min(end_rec, end_notify);
- return XFS_FSB_TO_B(mp, end_cross - start_cross) >> PAGE_SHIFT;
+ return XFS_FSB_TO_B(mp, end_cross - start_cross + 1) >> PAGE_SHIFT;
}
static int
@@ -114,7 +114,7 @@ xfs_dax_notify_ddev_failure(
int error = 0;
xfs_fsblock_t fsbno = XFS_DADDR_TO_FSB(mp, daddr);
xfs_agnumber_t agno = XFS_FSB_TO_AGNO(mp, fsbno);
- xfs_fsblock_t end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen);
+ xfs_fsblock_t end_fsbno = XFS_DADDR_TO_FSB(mp, daddr + bblen - 1);
xfs_agnumber_t end_agno = XFS_FSB_TO_AGNO(mp, end_fsbno);
error = xfs_trans_alloc_empty(mp, &tp);
@@ -210,7 +210,7 @@ xfs_dax_notify_failure(
ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
/* Ignore the range out of filesystem area */
- if (offset + len < ddev_start)
+ if (offset + len - 1 < ddev_start)
return -ENXIO;
if (offset > ddev_end)
return -ENXIO;
@@ -222,8 +222,8 @@ xfs_dax_notify_failure(
len -= ddev_start - offset;
offset = 0;
}
- if (offset + len > ddev_end)
- len -= ddev_end - offset;
+ if (offset + len - 1 > ddev_end)
+ len = ddev_end - offset + 1;
return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
mf_flags);
--
2.39.2
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v11 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-03-28 9:41 [PATCH v11 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
2023-03-28 9:41 ` [PATCH v11 1/2] xfs: fix the calculation of length and end Shiyang Ruan
@ 2023-03-28 9:41 ` Shiyang Ruan
2023-04-04 17:45 ` Darrick J. Wong
2023-04-12 10:52 ` [RFC PATCH v11.1 " Shiyang Ruan
2023-04-04 4:33 ` [PATCH v11 0/2] " Shiyang Ruan
2 siblings, 2 replies; 18+ messages in thread
From: Shiyang Ruan @ 2023-03-28 9:41 UTC (permalink / raw)
To: linux-fsdevel, nvdimm, linux-xfs, linux-mm
Cc: dan.j.williams, willy, jack, akpm, djwong
This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1]. With the help of dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask filesystem
(or mapped device) on it to unmap all files in use and notify processes
who are using those files.
Call trace:
trigger unbind
-> unbind_store()
-> ... (skip)
-> devres_release_all()
-> kill_dax()
-> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
-> xfs_dax_notify_failure()
`-> freeze_super()
`-> do xfs rmap
` -> mf_dax_kill_procs()
` -> collect_procs_fsdax() // all associated
` -> unmap_and_kill()
` -> invalidate_inode_pages2() // drop file's cache
`-> thaw_super()
Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
event. Freeze the filesystem to prevent new dax mapping being created.
And do not shutdown filesystem directly if something not supported, or
if failure range includes metadata area. Make sure all files and
processes are handled correctly. Also drop the cache of associated
files before pmem is removed.
[1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
drivers/dax/super.c | 3 +-
fs/xfs/xfs_notify_failure.c | 56 +++++++++++++++++++++++++++++++++----
include/linux/mm.h | 1 +
mm/memory-failure.c | 17 ++++++++---
4 files changed, 67 insertions(+), 10 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index c4c4728a36e4..2e1a35e82fce 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
return;
if (dax_dev->holder_data != NULL)
- dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+ dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+ MF_MEM_PRE_REMOVE);
clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 1e2eddb8f90f..1b4eff43f9b5 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
#include <linux/mm.h>
#include <linux/dax.h>
+#include <linux/fs.h>
struct xfs_failure_info {
xfs_agblock_t startblock;
@@ -73,10 +74,16 @@ xfs_dax_failure_fn(
struct xfs_mount *mp = cur->bc_mp;
struct xfs_inode *ip;
struct xfs_failure_info *notify = data;
+ struct address_space *mapping;
+ pgoff_t pgoff;
+ unsigned long pgcnt;
int error = 0;
if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+ /* The device is about to be removed. Not a really failure. */
+ if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+ return 0;
notify->want_shutdown = true;
return 0;
}
@@ -92,10 +99,18 @@ xfs_dax_failure_fn(
return 0;
}
- error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
- xfs_failure_pgoff(mp, rec, notify),
- xfs_failure_pgcnt(mp, rec, notify),
- notify->mf_flags);
+ mapping = VFS_I(ip)->i_mapping;
+ pgoff = xfs_failure_pgoff(mp, rec, notify);
+ pgcnt = xfs_failure_pgcnt(mp, rec, notify);
+
+ /* Continue the rmap query if the inode isn't a dax file. */
+ if (dax_mapping(mapping))
+ error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
+ notify->mf_flags);
+
+ /* Invalidate the cache anyway. */
+ invalidate_inode_pages2_range(mapping, pgoff, pgoff + pgcnt - 1);
+
xfs_irele(ip);
return error;
}
@@ -164,11 +179,25 @@ xfs_dax_notify_ddev_failure(
}
xfs_trans_cancel(tp);
+
+ /* Unfreeze filesystem anyway if it is freezed before. */
+ if (mf_flags & MF_MEM_PRE_REMOVE) {
+ error = thaw_super(mp->m_super);
+ if (error)
+ return error;
+ }
+
+ /*
+ * Determine how to shutdown the filesystem according to the
+ * error code and flags.
+ */
if (error || notify.want_shutdown) {
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
if (!error)
error = -EFSCORRUPTED;
- }
+ } else if (mf_flags & MF_MEM_PRE_REMOVE)
+ xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
+
return error;
}
@@ -182,6 +211,7 @@ xfs_dax_notify_failure(
struct xfs_mount *mp = dax_holder(dax_dev);
u64 ddev_start;
u64 ddev_end;
+ int error;
if (!(mp->m_super->s_flags & SB_BORN)) {
xfs_warn(mp, "filesystem is not ready for notify_failure()!");
@@ -196,6 +226,8 @@ xfs_dax_notify_failure(
if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
mp->m_logdev_targp != mp->m_ddev_targp) {
+ if (mf_flags & MF_MEM_PRE_REMOVE)
+ return 0;
xfs_err(mp, "ondisk log corrupt, shutting down fs!");
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
@@ -209,6 +241,12 @@ xfs_dax_notify_failure(
ddev_start = mp->m_ddev_targp->bt_dax_part_off;
ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
+ /* Notify failure on the whole device. */
+ if (offset == 0 && len == U64_MAX) {
+ offset = ddev_start;
+ len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+ }
+
/* Ignore the range out of filesystem area */
if (offset + len - 1 < ddev_start)
return -ENXIO;
@@ -225,6 +263,14 @@ xfs_dax_notify_failure(
if (offset + len - 1 > ddev_end)
len = ddev_end - offset + 1;
+ if (mf_flags & MF_MEM_PRE_REMOVE) {
+ xfs_info(mp, "device is about to be removed!");
+ /* Freeze the filesystem to prevent new mappings created. */
+ error = freeze_super(mp->m_super);
+ if (error)
+ return error;
+ }
+
return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
mf_flags);
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1f79667824eb..ac3f22c20e1d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3436,6 +3436,7 @@ enum mf_flags {
MF_UNPOISON = 1 << 4,
MF_SW_SIMULATED = 1 << 5,
MF_NO_RETRY = 1 << 6,
+ MF_MEM_PRE_REMOVE = 1 << 7,
};
int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
unsigned long count, int mf_flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fae9baf3be16..6e6acec45568 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -623,7 +623,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
*/
static void collect_procs_fsdax(struct page *page,
struct address_space *mapping, pgoff_t pgoff,
- struct list_head *to_kill)
+ struct list_head *to_kill, bool pre_remove)
{
struct vm_area_struct *vma;
struct task_struct *tsk;
@@ -631,8 +631,15 @@ static void collect_procs_fsdax(struct page *page,
i_mmap_lock_read(mapping);
read_lock(&tasklist_lock);
for_each_process(tsk) {
- struct task_struct *t = task_early_kill(tsk, true);
+ struct task_struct *t = tsk;
+ /*
+ * Search for all tasks while MF_MEM_PRE_REMOVE, because the
+ * current may not be the one accessing the fsdax page.
+ * Otherwise, search for the current task.
+ */
+ if (!pre_remove)
+ t = task_early_kill(tsk, true);
if (!t)
continue;
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
@@ -1732,6 +1739,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
dax_entry_t cookie;
struct page *page;
size_t end = index + count;
+ bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
@@ -1743,9 +1751,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
if (!page)
goto unlock;
- SetPageHWPoison(page);
+ if (!pre_remove)
+ SetPageHWPoison(page);
- collect_procs_fsdax(page, mapping, index, &to_kill);
+ collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
index, mf_flags);
unlock:
--
2.39.2
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v11 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-03-28 9:41 [PATCH v11 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
2023-03-28 9:41 ` [PATCH v11 1/2] xfs: fix the calculation of length and end Shiyang Ruan
2023-03-28 9:41 ` [PATCH v11 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
@ 2023-04-04 4:33 ` Shiyang Ruan
2 siblings, 0 replies; 18+ messages in thread
From: Shiyang Ruan @ 2023-04-04 4:33 UTC (permalink / raw)
To: linux-fsdevel, nvdimm, linux-xfs, linux-mm
Cc: dan.j.williams, willy, jack, akpm, djwong
Ping
在 2023/3/28 17:41, Shiyang Ruan 写道:
> This patchset is to add gracefully unbind support for pmem.
> Patch1 corrects the calculation of length and end of a given range.
> Patch2 introduces a new flag call MF_MEM_REMOVE, to let dax holder know
> it is a remove event. With the help of notify_failure mechanism, we are
> able to shutdown the filesystem on the pmem gracefully.
>
> Changes since v10:
> Patch1:
> 1. correct the count calculation in xfs_failure_pgcnt().
> Patch2:
> 2. drop the patch which introduces super_drop_pagecache().
> 3. in mf_dax_kill_procs(), don't SetPageHWPoison() and search for all
> tasks while mf_flags has MF_MEM_PRE_REMOVE.
> 4. only do mf_dax_kill_procs() on dax mapping.
> 5. do invalidate_inode_pages2_range() for each file found during rmap,
> to make sure the dax entry are disassociated before pmem is gone.
> Otherwise, umount filesystem after unbind will cause crash because
> the dax entries have to be disassociated but now the pmem is not
> exist.
>
> For detail analysis of this change, please refer this link[1].
>
> [1] https://lore.kernel.org/linux-xfs/b1d9fc03-1a71-a75f-f87b-5819991e4eb2@fujitsu.com/
>
> Shiyang Ruan (2):
> xfs: fix the calculation of length and end
> mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
>
> drivers/dax/super.c | 3 +-
> fs/xfs/xfs_notify_failure.c | 66 +++++++++++++++++++++++++++++++------
> include/linux/mm.h | 1 +
> mm/memory-failure.c | 17 +++++++---
> 4 files changed, 72 insertions(+), 15 deletions(-)
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v11 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-03-28 9:41 ` [PATCH v11 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
@ 2023-04-04 17:45 ` Darrick J. Wong
2023-04-05 1:28 ` Yasunori Gotou (Fujitsu)
` (2 more replies)
2023-04-12 10:52 ` [RFC PATCH v11.1 " Shiyang Ruan
1 sibling, 3 replies; 18+ messages in thread
From: Darrick J. Wong @ 2023-04-04 17:45 UTC (permalink / raw)
To: Shiyang Ruan
Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams,
willy, jack, akpm
On Tue, Mar 28, 2023 at 09:41:46AM +0000, Shiyang Ruan wrote:
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1]. With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> (or mapped device) on it to unmap all files in use and notify processes
> who are using those files.
>
> Call trace:
> trigger unbind
> -> unbind_store()
> -> ... (skip)
> -> devres_release_all()
> -> kill_dax()
> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> -> xfs_dax_notify_failure()
> `-> freeze_super()
> `-> do xfs rmap
> ` -> mf_dax_kill_procs()
> ` -> collect_procs_fsdax() // all associated
> ` -> unmap_and_kill()
> ` -> invalidate_inode_pages2() // drop file's cache
> `-> thaw_super()
>
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event. Freeze the filesystem to prevent new dax mapping being created.
> And do not shutdown filesystem directly if something not supported, or
> if failure range includes metadata area. Make sure all files and
> processes are handled correctly. Also drop the cache of associated
> files before pmem is removed.
>
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
> drivers/dax/super.c | 3 +-
> fs/xfs/xfs_notify_failure.c | 56 +++++++++++++++++++++++++++++++++----
> include/linux/mm.h | 1 +
> mm/memory-failure.c | 17 ++++++++---
> 4 files changed, 67 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index c4c4728a36e4..2e1a35e82fce 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> return;
>
> if (dax_dev->holder_data != NULL)
> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> + MF_MEM_PRE_REMOVE);
>
> clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index 1e2eddb8f90f..1b4eff43f9b5 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -22,6 +22,7 @@
>
> #include <linux/mm.h>
> #include <linux/dax.h>
> +#include <linux/fs.h>
>
> struct xfs_failure_info {
> xfs_agblock_t startblock;
> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
> struct xfs_mount *mp = cur->bc_mp;
> struct xfs_inode *ip;
> struct xfs_failure_info *notify = data;
> + struct address_space *mapping;
> + pgoff_t pgoff;
> + unsigned long pgcnt;
> int error = 0;
>
> if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> + /* The device is about to be removed. Not a really failure. */
> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
> notify->want_shutdown = true;
> return 0;
> }
> @@ -92,10 +99,18 @@ xfs_dax_failure_fn(
> return 0;
> }
>
> - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> - xfs_failure_pgoff(mp, rec, notify),
> - xfs_failure_pgcnt(mp, rec, notify),
> - notify->mf_flags);
> + mapping = VFS_I(ip)->i_mapping;
> + pgoff = xfs_failure_pgoff(mp, rec, notify);
> + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> +
> + /* Continue the rmap query if the inode isn't a dax file. */
> + if (dax_mapping(mapping))
> + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> + notify->mf_flags);
> +
> + /* Invalidate the cache anyway. */
> + invalidate_inode_pages2_range(mapping, pgoff, pgoff + pgcnt - 1);
> +
> xfs_irele(ip);
> return error;
> }
> @@ -164,11 +179,25 @@ xfs_dax_notify_ddev_failure(
> }
>
> xfs_trans_cancel(tp);
> +
> + /* Unfreeze filesystem anyway if it is freezed before. */
> + if (mf_flags & MF_MEM_PRE_REMOVE) {
> + error = thaw_super(mp->m_super);
> + if (error)
> + return error;
If someone *else* wanders in and thaws the fs, you'll get EINVAL here.
I guess that's useful for knowing if someone's screwed up the freeze
state on us, but ... really, don't you want to make sure you've gotten
the freeze and nobody else can take it away?
I think you want the kernel-initiated freeze proposed by Luis here:
https://lore.kernel.org/linux-fsdevel/20230114003409.1168311-4-mcgrof@kernel.org/
Also: Is Fujitsu still pursuing pmem products? Even though Optane is
dead? I'm no longer sure of what the roadmap is for all this fsdax code
and whatnot.
--D
> + }
> +
> + /*
> + * Determine how to shutdown the filesystem according to the
> + * error code and flags.
> + */
> if (error || notify.want_shutdown) {
> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> if (!error)
> error = -EFSCORRUPTED;
> - }
> + } else if (mf_flags & MF_MEM_PRE_REMOVE)
> + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> +
> return error;
> }
>
> @@ -182,6 +211,7 @@ xfs_dax_notify_failure(
> struct xfs_mount *mp = dax_holder(dax_dev);
> u64 ddev_start;
> u64 ddev_end;
> + int error;
>
> if (!(mp->m_super->s_flags & SB_BORN)) {
> xfs_warn(mp, "filesystem is not ready for notify_failure()!");
> @@ -196,6 +226,8 @@ xfs_dax_notify_failure(
>
> if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
> mp->m_logdev_targp != mp->m_ddev_targp) {
> + if (mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
> xfs_err(mp, "ondisk log corrupt, shutting down fs!");
> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> return -EFSCORRUPTED;
> @@ -209,6 +241,12 @@ xfs_dax_notify_failure(
> ddev_start = mp->m_ddev_targp->bt_dax_part_off;
> ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>
> + /* Notify failure on the whole device. */
> + if (offset == 0 && len == U64_MAX) {
> + offset = ddev_start;
> + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> + }
> +
> /* Ignore the range out of filesystem area */
> if (offset + len - 1 < ddev_start)
> return -ENXIO;
> @@ -225,6 +263,14 @@ xfs_dax_notify_failure(
> if (offset + len - 1 > ddev_end)
> len = ddev_end - offset + 1;
>
> + if (mf_flags & MF_MEM_PRE_REMOVE) {
> + xfs_info(mp, "device is about to be removed!");
> + /* Freeze the filesystem to prevent new mappings created. */
> + error = freeze_super(mp->m_super);
> + if (error)
> + return error;
> + }
> +
> return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
> mf_flags);
> }
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1f79667824eb..ac3f22c20e1d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3436,6 +3436,7 @@ enum mf_flags {
> MF_UNPOISON = 1 << 4,
> MF_SW_SIMULATED = 1 << 5,
> MF_NO_RETRY = 1 << 6,
> + MF_MEM_PRE_REMOVE = 1 << 7,
> };
> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> unsigned long count, int mf_flags);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index fae9baf3be16..6e6acec45568 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -623,7 +623,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
> */
> static void collect_procs_fsdax(struct page *page,
> struct address_space *mapping, pgoff_t pgoff,
> - struct list_head *to_kill)
> + struct list_head *to_kill, bool pre_remove)
> {
> struct vm_area_struct *vma;
> struct task_struct *tsk;
> @@ -631,8 +631,15 @@ static void collect_procs_fsdax(struct page *page,
> i_mmap_lock_read(mapping);
> read_lock(&tasklist_lock);
> for_each_process(tsk) {
> - struct task_struct *t = task_early_kill(tsk, true);
> + struct task_struct *t = tsk;
>
> + /*
> + * Search for all tasks while MF_MEM_PRE_REMOVE, because the
> + * current may not be the one accessing the fsdax page.
> + * Otherwise, search for the current task.
> + */
> + if (!pre_remove)
> + t = task_early_kill(tsk, true);
> if (!t)
> continue;
> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> @@ -1732,6 +1739,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> dax_entry_t cookie;
> struct page *page;
> size_t end = index + count;
> + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>
> mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>
> @@ -1743,9 +1751,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> if (!page)
> goto unlock;
>
> - SetPageHWPoison(page);
> + if (!pre_remove)
> + SetPageHWPoison(page);
>
> - collect_procs_fsdax(page, mapping, index, &to_kill);
> + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
> unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> index, mf_flags);
> unlock:
> --
> 2.39.2
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: [PATCH v11 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-04-04 17:45 ` Darrick J. Wong
@ 2023-04-05 1:28 ` Yasunori Gotou (Fujitsu)
2023-04-05 4:36 ` Dan Williams
2023-04-06 10:50 ` Shiyang Ruan
2 siblings, 0 replies; 18+ messages in thread
From: Yasunori Gotou (Fujitsu) @ 2023-04-05 1:28 UTC (permalink / raw)
To: 'Darrick J. Wong', ruansy.fnst
Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams,
willy, jack, akpm
Hello, Darrick-san,
> Also: Is Fujitsu still pursuing pmem products? Even though Optane is dead?
> I'm no longer sure of what the roadmap is for all this fsdax code and whatnot.
I need to talk about it.
Though Optane is certainly dead, "persistent memory" is not dead.
Because CXL specification supports persistent memory and FS-DAX
should be supported for it.
Actually, one of our customer tried to use Persistent memory for MySQL(*),
and I heard that they still expects CXL Pmem.
(*) https://www.docswell.com/s/ydnjp/K86GY5-2022-03-07-145233#p1
So, please continue.
Thanks,
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v11 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-04-04 17:45 ` Darrick J. Wong
2023-04-05 1:28 ` Yasunori Gotou (Fujitsu)
@ 2023-04-05 4:36 ` Dan Williams
2023-04-06 10:50 ` Shiyang Ruan
2 siblings, 0 replies; 18+ messages in thread
From: Dan Williams @ 2023-04-05 4:36 UTC (permalink / raw)
To: Darrick J. Wong, Shiyang Ruan
Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams,
willy, jack, akpm
Darrick J. Wong wrote:
> On Tue, Mar 28, 2023 at 09:41:46AM +0000, Shiyang Ruan wrote:
> > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > dev_pagemap_failure()"[1]. With the help of dax_holder and
> > ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> > (or mapped device) on it to unmap all files in use and notify processes
> > who are using those files.
> >
> > Call trace:
> > trigger unbind
> > -> unbind_store()
> > -> ... (skip)
> > -> devres_release_all()
> > -> kill_dax()
> > -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> > -> xfs_dax_notify_failure()
> > `-> freeze_super()
> > `-> do xfs rmap
> > ` -> mf_dax_kill_procs()
> > ` -> collect_procs_fsdax() // all associated
> > ` -> unmap_and_kill()
> > ` -> invalidate_inode_pages2() // drop file's cache
> > `-> thaw_super()
> >
> > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> > event. Freeze the filesystem to prevent new dax mapping being created.
> > And do not shutdown filesystem directly if something not supported, or
> > if failure range includes metadata area. Make sure all files and
> > processes are handled correctly. Also drop the cache of associated
> > files before pmem is removed.
> >
> > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> >
> > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > ---
> > drivers/dax/super.c | 3 +-
> > fs/xfs/xfs_notify_failure.c | 56 +++++++++++++++++++++++++++++++++----
> > include/linux/mm.h | 1 +
> > mm/memory-failure.c | 17 ++++++++---
> > 4 files changed, 67 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index c4c4728a36e4..2e1a35e82fce 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> > return;
> >
> > if (dax_dev->holder_data != NULL)
> > - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> > + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> > + MF_MEM_PRE_REMOVE);
> >
> > clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> > synchronize_srcu(&dax_srcu);
> > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> > index 1e2eddb8f90f..1b4eff43f9b5 100644
> > --- a/fs/xfs/xfs_notify_failure.c
> > +++ b/fs/xfs/xfs_notify_failure.c
> > @@ -22,6 +22,7 @@
> >
> > #include <linux/mm.h>
> > #include <linux/dax.h>
> > +#include <linux/fs.h>
> >
> > struct xfs_failure_info {
> > xfs_agblock_t startblock;
> > @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
> > struct xfs_mount *mp = cur->bc_mp;
> > struct xfs_inode *ip;
> > struct xfs_failure_info *notify = data;
> > + struct address_space *mapping;
> > + pgoff_t pgoff;
> > + unsigned long pgcnt;
> > int error = 0;
> >
> > if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> > (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> > + /* The device is about to be removed. Not a really failure. */
> > + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > + return 0;
> > notify->want_shutdown = true;
> > return 0;
> > }
> > @@ -92,10 +99,18 @@ xfs_dax_failure_fn(
> > return 0;
> > }
> >
> > - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> > - xfs_failure_pgoff(mp, rec, notify),
> > - xfs_failure_pgcnt(mp, rec, notify),
> > - notify->mf_flags);
> > + mapping = VFS_I(ip)->i_mapping;
> > + pgoff = xfs_failure_pgoff(mp, rec, notify);
> > + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> > +
> > + /* Continue the rmap query if the inode isn't a dax file. */
> > + if (dax_mapping(mapping))
> > + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> > + notify->mf_flags);
> > +
> > + /* Invalidate the cache anyway. */
> > + invalidate_inode_pages2_range(mapping, pgoff, pgoff + pgcnt - 1);
> > +
> > xfs_irele(ip);
> > return error;
> > }
> > @@ -164,11 +179,25 @@ xfs_dax_notify_ddev_failure(
> > }
> >
> > xfs_trans_cancel(tp);
> > +
> > + /* Unfreeze filesystem anyway if it is freezed before. */
> > + if (mf_flags & MF_MEM_PRE_REMOVE) {
> > + error = thaw_super(mp->m_super);
> > + if (error)
> > + return error;
>
> If someone *else* wanders in and thaws the fs, you'll get EINVAL here.
>
> I guess that's useful for knowing if someone's screwed up the freeze
> state on us, but ... really, don't you want to make sure you've gotten
> the freeze and nobody else can take it away?
>
> I think you want the kernel-initiated freeze proposed by Luis here:
> https://lore.kernel.org/linux-fsdevel/20230114003409.1168311-4-mcgrof@kernel.org/
>
> Also: Is Fujitsu still pursuing pmem products? Even though Optane is
> dead? I'm no longer sure of what the roadmap is for all this fsdax code
> and whatnot.
First, I need to spend more time on DAX patches, I have let CXL
monopolize too much of my time and you've relied reviewed these when you
have other XFS things to worry about.
As for the future of fsdax / pmem, I had written this up earlier:
https://lore.kernel.org/all/62ef05515b085_1b3c29434@dwillia2-xfh.jf.intel.com.notmuch/
That said, I would feel better if there were examples to point at doing
"PMEM over CXL" in the market. Maybe there are and I have missed them.
There are examples of vendors doing combined CXL memory with NVME flash,
but the CXL in that case seems to just be a way to move DMA buffers
closer to the device, not something more interesting like a
hardware-accelerated page-cache / pmem device.
For now the kernel is ready for PMEM CXL devices per the specification
(drivers/cxl/pmem.c).
Now, all that said the motivation for this patch is a bit different in
that it solves an architectural problem with the way current pmem
devices are shutdown and the missing step to properly evacuate usage of
'struct page' metadata that may be active on that device. So while CXL
hotplug is a practical trigger for this, it can also be achieved without
hardware via:
echo "namespaceX.Y" > /sys/bus/nd/drivers/nd_pmem/unbind
...to hot remove an in use namespace / volume. The observation is that
before the pmem driver can assume that it can rip active pages away from
the system it needs to tell anyone that cares about those page
disappearing to abandon their interest and shutdown. So the idea is for
surprise shutdown failures to tell dax-filesystems that all of the pmem
is going away at once.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v11 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-04-04 17:45 ` Darrick J. Wong
2023-04-05 1:28 ` Yasunori Gotou (Fujitsu)
2023-04-05 4:36 ` Dan Williams
@ 2023-04-06 10:50 ` Shiyang Ruan
2023-04-06 14:54 ` Darrick J. Wong
2 siblings, 1 reply; 18+ messages in thread
From: Shiyang Ruan @ 2023-04-06 10:50 UTC (permalink / raw)
To: Darrick J. Wong
Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams,
willy, jack, akpm
在 2023/4/5 1:45, Darrick J. Wong 写道:
> On Tue, Mar 28, 2023 at 09:41:46AM +0000, Shiyang Ruan wrote:
>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>> dev_pagemap_failure()"[1]. With the help of dax_holder and
>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>> (or mapped device) on it to unmap all files in use and notify processes
>> who are using those files.
>>
>> Call trace:
>> trigger unbind
>> -> unbind_store()
>> -> ... (skip)
>> -> devres_release_all()
>> -> kill_dax()
>> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>> -> xfs_dax_notify_failure()
>> `-> freeze_super()
>> `-> do xfs rmap
>> ` -> mf_dax_kill_procs()
>> ` -> collect_procs_fsdax() // all associated
>> ` -> unmap_and_kill()
>> ` -> invalidate_inode_pages2() // drop file's cache
>> `-> thaw_super()
>>
>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>> event. Freeze the filesystem to prevent new dax mapping being created.
>> And do not shutdown filesystem directly if something not supported, or
>> if failure range includes metadata area. Make sure all files and
>> processes are handled correctly. Also drop the cache of associated
>> files before pmem is removed.
>>
>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>>
>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>> ---
>> drivers/dax/super.c | 3 +-
>> fs/xfs/xfs_notify_failure.c | 56 +++++++++++++++++++++++++++++++++----
>> include/linux/mm.h | 1 +
>> mm/memory-failure.c | 17 ++++++++---
>> 4 files changed, 67 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>> index c4c4728a36e4..2e1a35e82fce 100644
>> --- a/drivers/dax/super.c
>> +++ b/drivers/dax/super.c
>> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>> return;
>>
>> if (dax_dev->holder_data != NULL)
>> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>> + MF_MEM_PRE_REMOVE);
>>
>> clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>> synchronize_srcu(&dax_srcu);
>> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
>> index 1e2eddb8f90f..1b4eff43f9b5 100644
>> --- a/fs/xfs/xfs_notify_failure.c
>> +++ b/fs/xfs/xfs_notify_failure.c
>> @@ -22,6 +22,7 @@
>>
>> #include <linux/mm.h>
>> #include <linux/dax.h>
>> +#include <linux/fs.h>
>>
>> struct xfs_failure_info {
>> xfs_agblock_t startblock;
>> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>> struct xfs_mount *mp = cur->bc_mp;
>> struct xfs_inode *ip;
>> struct xfs_failure_info *notify = data;
>> + struct address_space *mapping;
>> + pgoff_t pgoff;
>> + unsigned long pgcnt;
>> int error = 0;
>>
>> if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>> (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
>> + /* The device is about to be removed. Not a really failure. */
>> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>> + return 0;
>> notify->want_shutdown = true;
>> return 0;
>> }
>> @@ -92,10 +99,18 @@ xfs_dax_failure_fn(
>> return 0;
>> }
>>
>> - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
>> - xfs_failure_pgoff(mp, rec, notify),
>> - xfs_failure_pgcnt(mp, rec, notify),
>> - notify->mf_flags);
>> + mapping = VFS_I(ip)->i_mapping;
>> + pgoff = xfs_failure_pgoff(mp, rec, notify);
>> + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
>> +
>> + /* Continue the rmap query if the inode isn't a dax file. */
>> + if (dax_mapping(mapping))
>> + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
>> + notify->mf_flags);
>> +
>> + /* Invalidate the cache anyway. */
>> + invalidate_inode_pages2_range(mapping, pgoff, pgoff + pgcnt - 1);
>> +
>> xfs_irele(ip);
>> return error;
>> }
>> @@ -164,11 +179,25 @@ xfs_dax_notify_ddev_failure(
>> }
>>
>> xfs_trans_cancel(tp);
>> +
>> + /* Unfreeze filesystem anyway if it is freezed before. */
>> + if (mf_flags & MF_MEM_PRE_REMOVE) {
>> + error = thaw_super(mp->m_super);
>> + if (error)
>> + return error;
>
> If someone *else* wanders in and thaws the fs, you'll get EINVAL here.
>
> I guess that's useful for knowing if someone's screwed up the freeze
> state on us, but ... really, don't you want to make sure you've gotten
> the freeze and nobody else can take it away?
Ok, I know it now.
>
> I think you want the kernel-initiated freeze proposed by Luis here:
> https://lore.kernel.org/linux-fsdevel/20230114003409.1168311-4-mcgrof@kernel.org/
This patch gives userspace higher priority to do freeze/thaw then
kernelspace. Userspace can thaw it even when kernelspace needs the
freeze state. But I think it can't happen in this case. Kernelspace(in
this case) should hold the freeze state and, IOW, has higher priority
than userspace. I think we could change the @usercall to @priority.
-int freeze_super(struct super_block *sb)
+int freeze_super(struct super_block *sb, int priority)
And priority definitions like:
#define FREEZE_PRO_AUTO 0 // for auto freeze
#define FREEZE_PRO_USERCALL 1 // for user call
#define FREEZE_PRO_KERNELCALL 2 // for kernel call
--
Thanks,
Ruan.
>
> Also: Is Fujitsu still pursuing pmem products? Even though Optane is
> dead? I'm no longer sure of what the roadmap is for all this fsdax code
> and whatnot.
>
> --D
>
>> + }
>> +
>> + /*
>> + * Determine how to shutdown the filesystem according to the
>> + * error code and flags.
>> + */
>> if (error || notify.want_shutdown) {
>> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>> if (!error)
>> error = -EFSCORRUPTED;
>> - }
>> + } else if (mf_flags & MF_MEM_PRE_REMOVE)
>> + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
>> +
>> return error;
>> }
>>
>> @@ -182,6 +211,7 @@ xfs_dax_notify_failure(
>> struct xfs_mount *mp = dax_holder(dax_dev);
>> u64 ddev_start;
>> u64 ddev_end;
>> + int error;
>>
>> if (!(mp->m_super->s_flags & SB_BORN)) {
>> xfs_warn(mp, "filesystem is not ready for notify_failure()!");
>> @@ -196,6 +226,8 @@ xfs_dax_notify_failure(
>>
>> if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>> mp->m_logdev_targp != mp->m_ddev_targp) {
>> + if (mf_flags & MF_MEM_PRE_REMOVE)
>> + return 0;
>> xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>> return -EFSCORRUPTED;
>> @@ -209,6 +241,12 @@ xfs_dax_notify_failure(
>> ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>> ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>>
>> + /* Notify failure on the whole device. */
>> + if (offset == 0 && len == U64_MAX) {
>> + offset = ddev_start;
>> + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
>> + }
>> +
>> /* Ignore the range out of filesystem area */
>> if (offset + len - 1 < ddev_start)
>> return -ENXIO;
>> @@ -225,6 +263,14 @@ xfs_dax_notify_failure(
>> if (offset + len - 1 > ddev_end)
>> len = ddev_end - offset + 1;
>>
>> + if (mf_flags & MF_MEM_PRE_REMOVE) {
>> + xfs_info(mp, "device is about to be removed!");
>> + /* Freeze the filesystem to prevent new mappings created. */
>> + error = freeze_super(mp->m_super);
>> + if (error)
>> + return error;
>> + }
>> +
>> return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
>> mf_flags);
>> }
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 1f79667824eb..ac3f22c20e1d 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3436,6 +3436,7 @@ enum mf_flags {
>> MF_UNPOISON = 1 << 4,
>> MF_SW_SIMULATED = 1 << 5,
>> MF_NO_RETRY = 1 << 6,
>> + MF_MEM_PRE_REMOVE = 1 << 7,
>> };
>> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>> unsigned long count, int mf_flags);
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index fae9baf3be16..6e6acec45568 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -623,7 +623,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
>> */
>> static void collect_procs_fsdax(struct page *page,
>> struct address_space *mapping, pgoff_t pgoff,
>> - struct list_head *to_kill)
>> + struct list_head *to_kill, bool pre_remove)
>> {
>> struct vm_area_struct *vma;
>> struct task_struct *tsk;
>> @@ -631,8 +631,15 @@ static void collect_procs_fsdax(struct page *page,
>> i_mmap_lock_read(mapping);
>> read_lock(&tasklist_lock);
>> for_each_process(tsk) {
>> - struct task_struct *t = task_early_kill(tsk, true);
>> + struct task_struct *t = tsk;
>>
>> + /*
>> + * Search for all tasks while MF_MEM_PRE_REMOVE, because the
>> + * current may not be the one accessing the fsdax page.
>> + * Otherwise, search for the current task.
>> + */
>> + if (!pre_remove)
>> + t = task_early_kill(tsk, true);
>> if (!t)
>> continue;
>> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>> @@ -1732,6 +1739,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>> dax_entry_t cookie;
>> struct page *page;
>> size_t end = index + count;
>> + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>>
>> mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>>
>> @@ -1743,9 +1751,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>> if (!page)
>> goto unlock;
>>
>> - SetPageHWPoison(page);
>> + if (!pre_remove)
>> + SetPageHWPoison(page);
>>
>> - collect_procs_fsdax(page, mapping, index, &to_kill);
>> + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>> unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>> index, mf_flags);
>> unlock:
>> --
>> 2.39.2
>>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v11 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-04-06 10:50 ` Shiyang Ruan
@ 2023-04-06 14:54 ` Darrick J. Wong
2023-04-07 2:07 ` Shiyang Ruan
0 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2023-04-06 14:54 UTC (permalink / raw)
To: Shiyang Ruan
Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams,
willy, jack, akpm
On Thu, Apr 06, 2023 at 06:50:22PM +0800, Shiyang Ruan wrote:
>
>
> 在 2023/4/5 1:45, Darrick J. Wong 写道:
> > On Tue, Mar 28, 2023 at 09:41:46AM +0000, Shiyang Ruan wrote:
> > > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > > dev_pagemap_failure()"[1]. With the help of dax_holder and
> > > ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> > > (or mapped device) on it to unmap all files in use and notify processes
> > > who are using those files.
> > >
> > > Call trace:
> > > trigger unbind
> > > -> unbind_store()
> > > -> ... (skip)
> > > -> devres_release_all()
> > > -> kill_dax()
> > > -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> > > -> xfs_dax_notify_failure()
> > > `-> freeze_super()
> > > `-> do xfs rmap
> > > ` -> mf_dax_kill_procs()
> > > ` -> collect_procs_fsdax() // all associated
> > > ` -> unmap_and_kill()
> > > ` -> invalidate_inode_pages2() // drop file's cache
> > > `-> thaw_super()
> > >
> > > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> > > event. Freeze the filesystem to prevent new dax mapping being created.
> > > And do not shutdown filesystem directly if something not supported, or
> > > if failure range includes metadata area. Make sure all files and
> > > processes are handled correctly. Also drop the cache of associated
> > > files before pmem is removed.
> > >
> > > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> > >
> > > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > > ---
> > > drivers/dax/super.c | 3 +-
> > > fs/xfs/xfs_notify_failure.c | 56 +++++++++++++++++++++++++++++++++----
> > > include/linux/mm.h | 1 +
> > > mm/memory-failure.c | 17 ++++++++---
> > > 4 files changed, 67 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > > index c4c4728a36e4..2e1a35e82fce 100644
> > > --- a/drivers/dax/super.c
> > > +++ b/drivers/dax/super.c
> > > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> > > return;
> > > if (dax_dev->holder_data != NULL)
> > > - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> > > + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> > > + MF_MEM_PRE_REMOVE);
> > > clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> > > synchronize_srcu(&dax_srcu);
> > > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> > > index 1e2eddb8f90f..1b4eff43f9b5 100644
> > > --- a/fs/xfs/xfs_notify_failure.c
> > > +++ b/fs/xfs/xfs_notify_failure.c
> > > @@ -22,6 +22,7 @@
> > > #include <linux/mm.h>
> > > #include <linux/dax.h>
> > > +#include <linux/fs.h>
> > > struct xfs_failure_info {
> > > xfs_agblock_t startblock;
> > > @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
> > > struct xfs_mount *mp = cur->bc_mp;
> > > struct xfs_inode *ip;
> > > struct xfs_failure_info *notify = data;
> > > + struct address_space *mapping;
> > > + pgoff_t pgoff;
> > > + unsigned long pgcnt;
> > > int error = 0;
> > > if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> > > (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> > > + /* The device is about to be removed. Not a really failure. */
> > > + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > > + return 0;
> > > notify->want_shutdown = true;
> > > return 0;
> > > }
> > > @@ -92,10 +99,18 @@ xfs_dax_failure_fn(
> > > return 0;
> > > }
> > > - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> > > - xfs_failure_pgoff(mp, rec, notify),
> > > - xfs_failure_pgcnt(mp, rec, notify),
> > > - notify->mf_flags);
> > > + mapping = VFS_I(ip)->i_mapping;
> > > + pgoff = xfs_failure_pgoff(mp, rec, notify);
> > > + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> > > +
> > > + /* Continue the rmap query if the inode isn't a dax file. */
> > > + if (dax_mapping(mapping))
> > > + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> > > + notify->mf_flags);
> > > +
> > > + /* Invalidate the cache anyway. */
> > > + invalidate_inode_pages2_range(mapping, pgoff, pgoff + pgcnt - 1);
> > > +
> > > xfs_irele(ip);
> > > return error;
> > > }
> > > @@ -164,11 +179,25 @@ xfs_dax_notify_ddev_failure(
> > > }
> > > xfs_trans_cancel(tp);
> > > +
> > > + /* Unfreeze filesystem anyway if it is freezed before. */
> > > + if (mf_flags & MF_MEM_PRE_REMOVE) {
> > > + error = thaw_super(mp->m_super);
> > > + if (error)
> > > + return error;
> >
> > If someone *else* wanders in and thaws the fs, you'll get EINVAL here.
> >
> > I guess that's useful for knowing if someone's screwed up the freeze
> > state on us, but ... really, don't you want to make sure you've gotten
> > the freeze and nobody else can take it away?
>
> Ok, I know it now.
> >
> > I think you want the kernel-initiated freeze proposed by Luis here:
> > https://lore.kernel.org/linux-fsdevel/20230114003409.1168311-4-mcgrof@kernel.org/
>
> This patch gives userspace higher priority to do freeze/thaw then
> kernelspace. Userspace can thaw it even when kernelspace needs the freeze
> state. But I think it can't happen in this case. Kernelspace(in this case)
> should hold the freeze state and, IOW, has higher priority than userspace.
> I think we could change the @usercall to @priority.
>
> -int freeze_super(struct super_block *sb)
> +int freeze_super(struct super_block *sb, int priority)
>
> And priority definitions like:
> #define FREEZE_PRO_AUTO 0 // for auto freeze
What is an auto-freeze, and who would be calling it? I suspect we
could get by with "exclusive" and "non-exclusive". Non-exclusive is the
free-for-all we have now where any userspace can thaw, and exclusive is
for kernel users (like PRE_REMOVE) who want to block everyone else from
thawing.
> #define FREEZE_PRO_USERCALL 1 // for user call
> #define FREEZE_PRO_KERNELCALL 2 // for kernel call
"Linux 6.5, now with FREEZE PRO!!!!" ;)
THAW_PROT_* since we're really protecting who gets to thaw the fs; and
"PROT" is a more customary mnemonic for 'protection.
--D
>
>
> --
> Thanks,
> Ruan.
>
> >
> > Also: Is Fujitsu still pursuing pmem products? Even though Optane is
> > dead? I'm no longer sure of what the roadmap is for all this fsdax code
> > and whatnot.
> >
> > --D
> >
> > > + }
> > > +
> > > + /*
> > > + * Determine how to shutdown the filesystem according to the
> > > + * error code and flags.
> > > + */
> > > if (error || notify.want_shutdown) {
> > > xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > > if (!error)
> > > error = -EFSCORRUPTED;
> > > - }
> > > + } else if (mf_flags & MF_MEM_PRE_REMOVE)
> > > + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> > > +
> > > return error;
> > > }
> > > @@ -182,6 +211,7 @@ xfs_dax_notify_failure(
> > > struct xfs_mount *mp = dax_holder(dax_dev);
> > > u64 ddev_start;
> > > u64 ddev_end;
> > > + int error;
> > > if (!(mp->m_super->s_flags & SB_BORN)) {
> > > xfs_warn(mp, "filesystem is not ready for notify_failure()!");
> > > @@ -196,6 +226,8 @@ xfs_dax_notify_failure(
> > > if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
> > > mp->m_logdev_targp != mp->m_ddev_targp) {
> > > + if (mf_flags & MF_MEM_PRE_REMOVE)
> > > + return 0;
> > > xfs_err(mp, "ondisk log corrupt, shutting down fs!");
> > > xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > > return -EFSCORRUPTED;
> > > @@ -209,6 +241,12 @@ xfs_dax_notify_failure(
> > > ddev_start = mp->m_ddev_targp->bt_dax_part_off;
> > > ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
> > > + /* Notify failure on the whole device. */
> > > + if (offset == 0 && len == U64_MAX) {
> > > + offset = ddev_start;
> > > + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> > > + }
> > > +
> > > /* Ignore the range out of filesystem area */
> > > if (offset + len - 1 < ddev_start)
> > > return -ENXIO;
> > > @@ -225,6 +263,14 @@ xfs_dax_notify_failure(
> > > if (offset + len - 1 > ddev_end)
> > > len = ddev_end - offset + 1;
> > > + if (mf_flags & MF_MEM_PRE_REMOVE) {
> > > + xfs_info(mp, "device is about to be removed!");
> > > + /* Freeze the filesystem to prevent new mappings created. */
> > > + error = freeze_super(mp->m_super);
> > > + if (error)
> > > + return error;
> > > + }
> > > +
> > > return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
> > > mf_flags);
> > > }
> > > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > > index 1f79667824eb..ac3f22c20e1d 100644
> > > --- a/include/linux/mm.h
> > > +++ b/include/linux/mm.h
> > > @@ -3436,6 +3436,7 @@ enum mf_flags {
> > > MF_UNPOISON = 1 << 4,
> > > MF_SW_SIMULATED = 1 << 5,
> > > MF_NO_RETRY = 1 << 6,
> > > + MF_MEM_PRE_REMOVE = 1 << 7,
> > > };
> > > int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > > unsigned long count, int mf_flags);
> > > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > > index fae9baf3be16..6e6acec45568 100644
> > > --- a/mm/memory-failure.c
> > > +++ b/mm/memory-failure.c
> > > @@ -623,7 +623,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
> > > */
> > > static void collect_procs_fsdax(struct page *page,
> > > struct address_space *mapping, pgoff_t pgoff,
> > > - struct list_head *to_kill)
> > > + struct list_head *to_kill, bool pre_remove)
> > > {
> > > struct vm_area_struct *vma;
> > > struct task_struct *tsk;
> > > @@ -631,8 +631,15 @@ static void collect_procs_fsdax(struct page *page,
> > > i_mmap_lock_read(mapping);
> > > read_lock(&tasklist_lock);
> > > for_each_process(tsk) {
> > > - struct task_struct *t = task_early_kill(tsk, true);
> > > + struct task_struct *t = tsk;
> > > + /*
> > > + * Search for all tasks while MF_MEM_PRE_REMOVE, because the
> > > + * current may not be the one accessing the fsdax page.
> > > + * Otherwise, search for the current task.
> > > + */
> > > + if (!pre_remove)
> > > + t = task_early_kill(tsk, true);
> > > if (!t)
> > > continue;
> > > vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > > @@ -1732,6 +1739,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > > dax_entry_t cookie;
> > > struct page *page;
> > > size_t end = index + count;
> > > + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
> > > mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> > > @@ -1743,9 +1751,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > > if (!page)
> > > goto unlock;
> > > - SetPageHWPoison(page);
> > > + if (!pre_remove)
> > > + SetPageHWPoison(page);
> > > - collect_procs_fsdax(page, mapping, index, &to_kill);
> > > + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
> > > unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> > > index, mf_flags);
> > > unlock:
> > > --
> > > 2.39.2
> > >
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v11 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-04-06 14:54 ` Darrick J. Wong
@ 2023-04-07 2:07 ` Shiyang Ruan
0 siblings, 0 replies; 18+ messages in thread
From: Shiyang Ruan @ 2023-04-07 2:07 UTC (permalink / raw)
To: Darrick J. Wong
Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams,
willy, jack, akpm
在 2023/4/6 22:54, Darrick J. Wong 写道:
> On Thu, Apr 06, 2023 at 06:50:22PM +0800, Shiyang Ruan wrote:
>>
>>
>> 在 2023/4/5 1:45, Darrick J. Wong 写道:
>>> On Tue, Mar 28, 2023 at 09:41:46AM +0000, Shiyang Ruan wrote:
>>>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>>>> dev_pagemap_failure()"[1]. With the help of dax_holder and
>>>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>>>> (or mapped device) on it to unmap all files in use and notify processes
>>>> who are using those files.
>>>>
>>>> Call trace:
>>>> trigger unbind
>>>> -> unbind_store()
>>>> -> ... (skip)
>>>> -> devres_release_all()
>>>> -> kill_dax()
>>>> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>>> -> xfs_dax_notify_failure()
>>>> `-> freeze_super()
>>>> `-> do xfs rmap
>>>> ` -> mf_dax_kill_procs()
>>>> ` -> collect_procs_fsdax() // all associated
>>>> ` -> unmap_and_kill()
>>>> ` -> invalidate_inode_pages2() // drop file's cache
>>>> `-> thaw_super()
>>>>
>>>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>>>> event. Freeze the filesystem to prevent new dax mapping being created.
>>>> And do not shutdown filesystem directly if something not supported, or
>>>> if failure range includes metadata area. Make sure all files and
>>>> processes are handled correctly. Also drop the cache of associated
>>>> files before pmem is removed.
>>>>
>>>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>>>>
>>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>>> ---
>>>> drivers/dax/super.c | 3 +-
>>>> fs/xfs/xfs_notify_failure.c | 56 +++++++++++++++++++++++++++++++++----
>>>> include/linux/mm.h | 1 +
>>>> mm/memory-failure.c | 17 ++++++++---
>>>> 4 files changed, 67 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>>>> index c4c4728a36e4..2e1a35e82fce 100644
>>>> --- a/drivers/dax/super.c
>>>> +++ b/drivers/dax/super.c
>>>> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>>>> return;
>>>> if (dax_dev->holder_data != NULL)
>>>> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>>>> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>>>> + MF_MEM_PRE_REMOVE);
>>>> clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>>>> synchronize_srcu(&dax_srcu);
>>>> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
>>>> index 1e2eddb8f90f..1b4eff43f9b5 100644
>>>> --- a/fs/xfs/xfs_notify_failure.c
>>>> +++ b/fs/xfs/xfs_notify_failure.c
>>>> @@ -22,6 +22,7 @@
>>>> #include <linux/mm.h>
>>>> #include <linux/dax.h>
>>>> +#include <linux/fs.h>
>>>> struct xfs_failure_info {
>>>> xfs_agblock_t startblock;
>>>> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>>>> struct xfs_mount *mp = cur->bc_mp;
>>>> struct xfs_inode *ip;
>>>> struct xfs_failure_info *notify = data;
>>>> + struct address_space *mapping;
>>>> + pgoff_t pgoff;
>>>> + unsigned long pgcnt;
>>>> int error = 0;
>>>> if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>>>> (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
>>>> + /* The device is about to be removed. Not a really failure. */
>>>> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>>>> + return 0;
>>>> notify->want_shutdown = true;
>>>> return 0;
>>>> }
>>>> @@ -92,10 +99,18 @@ xfs_dax_failure_fn(
>>>> return 0;
>>>> }
>>>> - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
>>>> - xfs_failure_pgoff(mp, rec, notify),
>>>> - xfs_failure_pgcnt(mp, rec, notify),
>>>> - notify->mf_flags);
>>>> + mapping = VFS_I(ip)->i_mapping;
>>>> + pgoff = xfs_failure_pgoff(mp, rec, notify);
>>>> + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
>>>> +
>>>> + /* Continue the rmap query if the inode isn't a dax file. */
>>>> + if (dax_mapping(mapping))
>>>> + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
>>>> + notify->mf_flags);
>>>> +
>>>> + /* Invalidate the cache anyway. */
>>>> + invalidate_inode_pages2_range(mapping, pgoff, pgoff + pgcnt - 1);
>>>> +
>>>> xfs_irele(ip);
>>>> return error;
>>>> }
>>>> @@ -164,11 +179,25 @@ xfs_dax_notify_ddev_failure(
>>>> }
>>>> xfs_trans_cancel(tp);
>>>> +
>>>> + /* Unfreeze filesystem anyway if it is freezed before. */
>>>> + if (mf_flags & MF_MEM_PRE_REMOVE) {
>>>> + error = thaw_super(mp->m_super);
>>>> + if (error)
>>>> + return error;
>>>
>>> If someone *else* wanders in and thaws the fs, you'll get EINVAL here.
>>>
>>> I guess that's useful for knowing if someone's screwed up the freeze
>>> state on us, but ... really, don't you want to make sure you've gotten
>>> the freeze and nobody else can take it away?
>>
>> Ok, I know it now.
>>>
>>> I think you want the kernel-initiated freeze proposed by Luis here:
>>> https://lore.kernel.org/linux-fsdevel/20230114003409.1168311-4-mcgrof@kernel.org/
>>
>> This patch gives userspace higher priority to do freeze/thaw then
>> kernelspace. Userspace can thaw it even when kernelspace needs the freeze
>> state. But I think it can't happen in this case. Kernelspace(in this case)
>> should hold the freeze state and, IOW, has higher priority than userspace.
>> I think we could change the @usercall to @priority.
>>
>> -int freeze_super(struct super_block *sb)
>> +int freeze_super(struct super_block *sb, int priority)
>>
>> And priority definitions like:
>> #define FREEZE_PRO_AUTO 0 // for auto freeze
>
> What is an auto-freeze, and who would be calling it?
It was a concept Luis introduced in his patchset.
> I suspect we
> could get by with "exclusive" and "non-exclusive". Non-exclusive is the
> free-for-all we have now where any userspace can thaw, and exclusive is
> for kernel users (like PRE_REMOVE) who want to block everyone else from
> thawing.
Yes, I think this is better (after I reading that patchset and its
comments carefully).
>
>> #define FREEZE_PRO_USERCALL 1 // for user call
>> #define FREEZE_PRO_KERNELCALL 2 // for kernel call
>
> "Linux 6.5, now with FREEZE PRO!!!!" ;)
What I was going to say here was priority, which is abbreviated PRIO.
My mistake here.
>
> THAW_PROT_* since we're really protecting who gets to thaw the fs; and
> "PROT" is a more customary mnemonic for 'protection.
OK. This looks more appropriate. Thanks!
--
Ruan.
>
> --D
>
>>
>>
>> --
>> Thanks,
>> Ruan.
>>
>>>
>>> Also: Is Fujitsu still pursuing pmem products? Even though Optane is
>>> dead? I'm no longer sure of what the roadmap is for all this fsdax code
>>> and whatnot.
>>>
>>> --D
>>>
>>>> + }
>>>> +
>>>> + /*
>>>> + * Determine how to shutdown the filesystem according to the
>>>> + * error code and flags.
>>>> + */
>>>> if (error || notify.want_shutdown) {
>>>> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>>> if (!error)
>>>> error = -EFSCORRUPTED;
>>>> - }
>>>> + } else if (mf_flags & MF_MEM_PRE_REMOVE)
>>>> + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
>>>> +
>>>> return error;
>>>> }
>>>> @@ -182,6 +211,7 @@ xfs_dax_notify_failure(
>>>> struct xfs_mount *mp = dax_holder(dax_dev);
>>>> u64 ddev_start;
>>>> u64 ddev_end;
>>>> + int error;
>>>> if (!(mp->m_super->s_flags & SB_BORN)) {
>>>> xfs_warn(mp, "filesystem is not ready for notify_failure()!");
>>>> @@ -196,6 +226,8 @@ xfs_dax_notify_failure(
>>>> if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>>>> mp->m_logdev_targp != mp->m_ddev_targp) {
>>>> + if (mf_flags & MF_MEM_PRE_REMOVE)
>>>> + return 0;
>>>> xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>>>> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>>> return -EFSCORRUPTED;
>>>> @@ -209,6 +241,12 @@ xfs_dax_notify_failure(
>>>> ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>>>> ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>>>> + /* Notify failure on the whole device. */
>>>> + if (offset == 0 && len == U64_MAX) {
>>>> + offset = ddev_start;
>>>> + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
>>>> + }
>>>> +
>>>> /* Ignore the range out of filesystem area */
>>>> if (offset + len - 1 < ddev_start)
>>>> return -ENXIO;
>>>> @@ -225,6 +263,14 @@ xfs_dax_notify_failure(
>>>> if (offset + len - 1 > ddev_end)
>>>> len = ddev_end - offset + 1;
>>>> + if (mf_flags & MF_MEM_PRE_REMOVE) {
>>>> + xfs_info(mp, "device is about to be removed!");
>>>> + /* Freeze the filesystem to prevent new mappings created. */
>>>> + error = freeze_super(mp->m_super);
>>>> + if (error)
>>>> + return error;
>>>> + }
>>>> +
>>>> return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
>>>> mf_flags);
>>>> }
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index 1f79667824eb..ac3f22c20e1d 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -3436,6 +3436,7 @@ enum mf_flags {
>>>> MF_UNPOISON = 1 << 4,
>>>> MF_SW_SIMULATED = 1 << 5,
>>>> MF_NO_RETRY = 1 << 6,
>>>> + MF_MEM_PRE_REMOVE = 1 << 7,
>>>> };
>>>> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>> unsigned long count, int mf_flags);
>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>> index fae9baf3be16..6e6acec45568 100644
>>>> --- a/mm/memory-failure.c
>>>> +++ b/mm/memory-failure.c
>>>> @@ -623,7 +623,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
>>>> */
>>>> static void collect_procs_fsdax(struct page *page,
>>>> struct address_space *mapping, pgoff_t pgoff,
>>>> - struct list_head *to_kill)
>>>> + struct list_head *to_kill, bool pre_remove)
>>>> {
>>>> struct vm_area_struct *vma;
>>>> struct task_struct *tsk;
>>>> @@ -631,8 +631,15 @@ static void collect_procs_fsdax(struct page *page,
>>>> i_mmap_lock_read(mapping);
>>>> read_lock(&tasklist_lock);
>>>> for_each_process(tsk) {
>>>> - struct task_struct *t = task_early_kill(tsk, true);
>>>> + struct task_struct *t = tsk;
>>>> + /*
>>>> + * Search for all tasks while MF_MEM_PRE_REMOVE, because the
>>>> + * current may not be the one accessing the fsdax page.
>>>> + * Otherwise, search for the current task.
>>>> + */
>>>> + if (!pre_remove)
>>>> + t = task_early_kill(tsk, true);
>>>> if (!t)
>>>> continue;
>>>> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>>>> @@ -1732,6 +1739,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>> dax_entry_t cookie;
>>>> struct page *page;
>>>> size_t end = index + count;
>>>> + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>>>> mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>>>> @@ -1743,9 +1751,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>>> if (!page)
>>>> goto unlock;
>>>> - SetPageHWPoison(page);
>>>> + if (!pre_remove)
>>>> + SetPageHWPoison(page);
>>>> - collect_procs_fsdax(page, mapping, index, &to_kill);
>>>> + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>>>> unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>>>> index, mf_flags);
>>>> unlock:
>>>> --
>>>> 2.39.2
>>>>
^ permalink raw reply [flat|nested] 18+ messages in thread
* [RFC PATCH v11.1 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-03-28 9:41 ` [PATCH v11 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
2023-04-04 17:45 ` Darrick J. Wong
@ 2023-04-12 10:52 ` Shiyang Ruan
2023-04-20 2:07 ` Shiyang Ruan
1 sibling, 1 reply; 18+ messages in thread
From: Shiyang Ruan @ 2023-04-12 10:52 UTC (permalink / raw)
To: linux-fsdevel, nvdimm, linux-xfs, linux-mm
Cc: dan.j.williams, willy, jack, akpm, djwong
This is a RFC HOTFIX.
This hotfix adds a exclusive forzen state to make sure any others won't
thaw the fs during xfs_dax_notify_failure():
#define SB_FREEZE_EXCLUSIVE (SB_FREEZE_COMPLETE + 2)
Using +2 here is because Darrick's patch[0] is using +1. So, should we
make these definitions global?
Another thing I can't make up my mind is: when another freezer has freeze
the fs, should we wait unitl it finish, or print a warning in dmesg and
return -EBUSY?
Since there are at least 2 places needs exclusive forzen state, I think
we can refactor helper functions of freeze/thaw for them. e.g.
int freeze_super_exclusive(struct super_block *sb, int frozen);
int thaw_super_exclusive(struct super_block *sb, int frozen);
[0] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=repair-fscounters&id=c3a0d1de4d54ffb565dbc7092dfe1fb851940669
--- Original commit message ---
This patch is inspired by Dan's "mm, dax, pmem: Introduce
dev_pagemap_failure()"[1]. With the help of dax_holder and
->notify_failure() mechanism, the pmem driver is able to ask filesystem
(or mapped device) on it to unmap all files in use and notify processes
who are using those files.
Call trace:
trigger unbind
-> unbind_store()
-> ... (skip)
-> devres_release_all()
-> kill_dax()
-> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
-> xfs_dax_notify_failure()
`-> freeze_super()
`-> do xfs rmap
` -> mf_dax_kill_procs()
` -> collect_procs_fsdax() // all associated
` -> unmap_and_kill()
` -> invalidate_inode_pages2() // drop file's cache
`-> thaw_super()
Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
event. Also introduce a exclusive freeze/thaw to lock the filesystem to
prevent new dax mapping from being created. And do not shutdown
filesystem directly if something not supported, or if failure range
includes metadata area. Make sure all files and processes are handled
correctly. Also drop the cache of associated files before pmem is
removed.
[1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
---
drivers/dax/super.c | 3 +-
fs/xfs/xfs_notify_failure.c | 151 ++++++++++++++++++++++++++++++++++--
include/linux/mm.h | 1 +
mm/memory-failure.c | 17 +++-
4 files changed, 162 insertions(+), 10 deletions(-)
diff --git a/drivers/dax/super.c b/drivers/dax/super.c
index c4c4728a36e4..2e1a35e82fce 100644
--- a/drivers/dax/super.c
+++ b/drivers/dax/super.c
@@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
return;
if (dax_dev->holder_data != NULL)
- dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
+ dax_holder_notify_failure(dax_dev, 0, U64_MAX,
+ MF_MEM_PRE_REMOVE);
clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
synchronize_srcu(&dax_srcu);
diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
index 1e2eddb8f90f..796dd954d33a 100644
--- a/fs/xfs/xfs_notify_failure.c
+++ b/fs/xfs/xfs_notify_failure.c
@@ -22,6 +22,7 @@
#include <linux/mm.h>
#include <linux/dax.h>
+#include <linux/fs.h>
struct xfs_failure_info {
xfs_agblock_t startblock;
@@ -73,10 +74,16 @@ xfs_dax_failure_fn(
struct xfs_mount *mp = cur->bc_mp;
struct xfs_inode *ip;
struct xfs_failure_info *notify = data;
+ struct address_space *mapping;
+ pgoff_t pgoff;
+ unsigned long pgcnt;
int error = 0;
if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
(rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
+ /* The device is about to be removed. Not a really failure. */
+ if (notify->mf_flags & MF_MEM_PRE_REMOVE)
+ return 0;
notify->want_shutdown = true;
return 0;
}
@@ -92,14 +99,120 @@ xfs_dax_failure_fn(
return 0;
}
- error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
- xfs_failure_pgoff(mp, rec, notify),
- xfs_failure_pgcnt(mp, rec, notify),
- notify->mf_flags);
+ mapping = VFS_I(ip)->i_mapping;
+ pgoff = xfs_failure_pgoff(mp, rec, notify);
+ pgcnt = xfs_failure_pgcnt(mp, rec, notify);
+
+ /* Continue the rmap query if the inode isn't a dax file. */
+ if (dax_mapping(mapping))
+ error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
+ notify->mf_flags);
+
+ /* Invalidate the cache anyway. */
+ invalidate_inode_pages2_range(mapping, pgoff, pgoff + pgcnt - 1);
+
xfs_irele(ip);
return error;
}
+#define SB_FREEZE_EXCLUSIVE (SB_FREEZE_COMPLETE + 2)
+
+static int
+xfs_dax_notify_failure_freeze(
+ struct xfs_mount *mp)
+{
+ struct super_block *sb = mp->m_super;
+ int error = 0;
+ int level;
+
+ /* Wait until we're ready to freeze. */
+ down_write(&sb->s_umount);
+ while (sb->s_writers.frozen != SB_UNFROZEN) {
+ up_write(&sb->s_umount);
+
+ // just wait, or print warning in dmesg then return -EBUSY?
+
+ delay(HZ / 10);
+ down_write(&sb->s_umount);
+ }
+
+ if (sb_rdonly(sb)) {
+ sb->s_writers.frozen = SB_FREEZE_EXCLUSIVE;
+ goto out;
+ }
+
+ sb->s_writers.frozen = SB_FREEZE_WRITE;
+ /* Release s_umount to preserve sb_start_write -> s_umount ordering */
+ up_write(&sb->s_umount);
+ percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
+ down_write(&sb->s_umount);
+
+ /* Now we go and block page faults... */
+ sb->s_writers.frozen = SB_FREEZE_PAGEFAULT;
+ percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_PAGEFAULT - 1);
+
+ /* All writers are done so after syncing there won't be dirty data */
+ error = sync_filesystem(sb);
+ if (error) {
+ sb->s_writers.frozen = SB_UNFROZEN;
+ for (level = SB_FREEZE_PAGEFAULT - 1; level >= 0; level--)
+ percpu_up_write(sb->s_writers.rw_sem + level);
+ wake_up(&sb->s_writers.wait_unfrozen);
+ goto out;
+ }
+
+ /* Now wait for internal filesystem counter */
+ sb->s_writers.frozen = SB_FREEZE_FS;
+ percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_FS - 1);
+
+ /*
+ * To prevent anyone else from unfreezing us, set the VFS freeze level
+ * to one higher than SB_FREEZE_COMPLETE.
+ */
+ sb->s_writers.frozen = SB_FREEZE_EXCLUSIVE;
+ for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
+ percpu_rwsem_release(sb->s_writers.rw_sem + level, 0,
+ _THIS_IP_);
+
+out:
+ up_write(&sb->s_umount);
+ return error;
+}
+
+static void
+xfs_dax_notify_failure_thaw(
+ struct xfs_mount *mp)
+{
+ struct super_block *sb = mp->m_super;
+ int level;
+
+ down_write(&sb->s_umount);
+ if (sb->s_writers.frozen != SB_FREEZE_EXCLUSIVE) {
+ /* somebody snuck in and unfroze us? */
+ ASSERT(0);
+ up_write(&sb->s_umount);
+ return;
+ }
+
+ if (sb_rdonly(sb)) {
+ sb->s_writers.frozen = SB_UNFROZEN;
+ goto out;
+ }
+
+ for (level = 0; level < SB_FREEZE_LEVELS; ++level)
+ percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0,
+ _THIS_IP_);
+
+ sb->s_writers.frozen = SB_UNFROZEN;
+ for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
+ percpu_up_write(sb->s_writers.rw_sem + level);
+
+out:
+ wake_up(&sb->s_writers.wait_unfrozen);
+ up_write(&sb->s_umount);
+}
+
static int
xfs_dax_notify_ddev_failure(
struct xfs_mount *mp,
@@ -164,11 +277,22 @@ xfs_dax_notify_ddev_failure(
}
xfs_trans_cancel(tp);
+
+ /* Thaw the fs if it is freezed before. */
+ if (mf_flags & MF_MEM_PRE_REMOVE)
+ xfs_dax_notify_failure_thaw(mp);
+
+ /*
+ * Determine how to shutdown the filesystem according to the
+ * error code and flags.
+ */
if (error || notify.want_shutdown) {
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
if (!error)
error = -EFSCORRUPTED;
- }
+ } else if (mf_flags & MF_MEM_PRE_REMOVE)
+ xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
+
return error;
}
@@ -182,6 +306,7 @@ xfs_dax_notify_failure(
struct xfs_mount *mp = dax_holder(dax_dev);
u64 ddev_start;
u64 ddev_end;
+ int error;
if (!(mp->m_super->s_flags & SB_BORN)) {
xfs_warn(mp, "filesystem is not ready for notify_failure()!");
@@ -196,6 +321,8 @@ xfs_dax_notify_failure(
if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
mp->m_logdev_targp != mp->m_ddev_targp) {
+ if (mf_flags & MF_MEM_PRE_REMOVE)
+ return 0;
xfs_err(mp, "ondisk log corrupt, shutting down fs!");
xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
return -EFSCORRUPTED;
@@ -209,6 +336,12 @@ xfs_dax_notify_failure(
ddev_start = mp->m_ddev_targp->bt_dax_part_off;
ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
+ /* Notify failure on the whole device. */
+ if (offset == 0 && len == U64_MAX) {
+ offset = ddev_start;
+ len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
+ }
+
/* Ignore the range out of filesystem area */
if (offset + len - 1 < ddev_start)
return -ENXIO;
@@ -225,6 +358,14 @@ xfs_dax_notify_failure(
if (offset + len - 1 > ddev_end)
len = ddev_end - offset + 1;
+ if (mf_flags & MF_MEM_PRE_REMOVE) {
+ xfs_info(mp, "device is about to be removed!");
+ /* Freeze fs to prevent new mappings from being created. */
+ error = xfs_dax_notify_failure_freeze(mp);
+ if (error)
+ return error;
+ }
+
return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
mf_flags);
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1f79667824eb..ac3f22c20e1d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3436,6 +3436,7 @@ enum mf_flags {
MF_UNPOISON = 1 << 4,
MF_SW_SIMULATED = 1 << 5,
MF_NO_RETRY = 1 << 6,
+ MF_MEM_PRE_REMOVE = 1 << 7,
};
int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
unsigned long count, int mf_flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fae9baf3be16..6e6acec45568 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -623,7 +623,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
*/
static void collect_procs_fsdax(struct page *page,
struct address_space *mapping, pgoff_t pgoff,
- struct list_head *to_kill)
+ struct list_head *to_kill, bool pre_remove)
{
struct vm_area_struct *vma;
struct task_struct *tsk;
@@ -631,8 +631,15 @@ static void collect_procs_fsdax(struct page *page,
i_mmap_lock_read(mapping);
read_lock(&tasklist_lock);
for_each_process(tsk) {
- struct task_struct *t = task_early_kill(tsk, true);
+ struct task_struct *t = tsk;
+ /*
+ * Search for all tasks while MF_MEM_PRE_REMOVE, because the
+ * current may not be the one accessing the fsdax page.
+ * Otherwise, search for the current task.
+ */
+ if (!pre_remove)
+ t = task_early_kill(tsk, true);
if (!t)
continue;
vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
@@ -1732,6 +1739,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
dax_entry_t cookie;
struct page *page;
size_t end = index + count;
+ bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
@@ -1743,9 +1751,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
if (!page)
goto unlock;
- SetPageHWPoison(page);
+ if (!pre_remove)
+ SetPageHWPoison(page);
- collect_procs_fsdax(page, mapping, index, &to_kill);
+ collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
index, mf_flags);
unlock:
--
2.40.0
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH v11.1 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-04-12 10:52 ` [RFC PATCH v11.1 " Shiyang Ruan
@ 2023-04-20 2:07 ` Shiyang Ruan
2023-04-20 12:09 ` Jan Kara
0 siblings, 1 reply; 18+ messages in thread
From: Shiyang Ruan @ 2023-04-20 2:07 UTC (permalink / raw)
To: linux-fsdevel, nvdimm, linux-xfs, linux-mm
Cc: dan.j.williams, willy, jack, akpm, djwong
Ping~
在 2023/4/12 18:52, Shiyang Ruan 写道:
> This is a RFC HOTFIX.
>
> This hotfix adds a exclusive forzen state to make sure any others won't
> thaw the fs during xfs_dax_notify_failure():
>
> #define SB_FREEZE_EXCLUSIVE (SB_FREEZE_COMPLETE + 2)
> Using +2 here is because Darrick's patch[0] is using +1. So, should we
> make these definitions global?
>
> Another thing I can't make up my mind is: when another freezer has freeze
> the fs, should we wait unitl it finish, or print a warning in dmesg and
> return -EBUSY?
>
> Since there are at least 2 places needs exclusive forzen state, I think
> we can refactor helper functions of freeze/thaw for them. e.g.
> int freeze_super_exclusive(struct super_block *sb, int frozen);
> int thaw_super_exclusive(struct super_block *sb, int frozen);
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=repair-fscounters&id=c3a0d1de4d54ffb565dbc7092dfe1fb851940669
>
>
> --- Original commit message ---
> This patch is inspired by Dan's "mm, dax, pmem: Introduce
> dev_pagemap_failure()"[1]. With the help of dax_holder and
> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> (or mapped device) on it to unmap all files in use and notify processes
> who are using those files.
>
> Call trace:
> trigger unbind
> -> unbind_store()
> -> ... (skip)
> -> devres_release_all()
> -> kill_dax()
> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> -> xfs_dax_notify_failure()
> `-> freeze_super()
> `-> do xfs rmap
> ` -> mf_dax_kill_procs()
> ` -> collect_procs_fsdax() // all associated
> ` -> unmap_and_kill()
> ` -> invalidate_inode_pages2() // drop file's cache
> `-> thaw_super()
>
> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> event. Also introduce a exclusive freeze/thaw to lock the filesystem to
> prevent new dax mapping from being created. And do not shutdown
> filesystem directly if something not supported, or if failure range
> includes metadata area. Make sure all files and processes are handled
> correctly. Also drop the cache of associated files before pmem is
> removed.
>
> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>
> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> ---
> drivers/dax/super.c | 3 +-
> fs/xfs/xfs_notify_failure.c | 151 ++++++++++++++++++++++++++++++++++--
> include/linux/mm.h | 1 +
> mm/memory-failure.c | 17 +++-
> 4 files changed, 162 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> index c4c4728a36e4..2e1a35e82fce 100644
> --- a/drivers/dax/super.c
> +++ b/drivers/dax/super.c
> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> return;
>
> if (dax_dev->holder_data != NULL)
> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> + MF_MEM_PRE_REMOVE);
>
> clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> synchronize_srcu(&dax_srcu);
> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> index 1e2eddb8f90f..796dd954d33a 100644
> --- a/fs/xfs/xfs_notify_failure.c
> +++ b/fs/xfs/xfs_notify_failure.c
> @@ -22,6 +22,7 @@
>
> #include <linux/mm.h>
> #include <linux/dax.h>
> +#include <linux/fs.h>
>
> struct xfs_failure_info {
> xfs_agblock_t startblock;
> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
> struct xfs_mount *mp = cur->bc_mp;
> struct xfs_inode *ip;
> struct xfs_failure_info *notify = data;
> + struct address_space *mapping;
> + pgoff_t pgoff;
> + unsigned long pgcnt;
> int error = 0;
>
> if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> + /* The device is about to be removed. Not a really failure. */
> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
> notify->want_shutdown = true;
> return 0;
> }
> @@ -92,14 +99,120 @@ xfs_dax_failure_fn(
> return 0;
> }
>
> - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> - xfs_failure_pgoff(mp, rec, notify),
> - xfs_failure_pgcnt(mp, rec, notify),
> - notify->mf_flags);
> + mapping = VFS_I(ip)->i_mapping;
> + pgoff = xfs_failure_pgoff(mp, rec, notify);
> + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> +
> + /* Continue the rmap query if the inode isn't a dax file. */
> + if (dax_mapping(mapping))
> + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> + notify->mf_flags);
> +
> + /* Invalidate the cache anyway. */
> + invalidate_inode_pages2_range(mapping, pgoff, pgoff + pgcnt - 1);
> +
> xfs_irele(ip);
> return error;
> }
>
> +#define SB_FREEZE_EXCLUSIVE (SB_FREEZE_COMPLETE + 2)
> +
> +static int
> +xfs_dax_notify_failure_freeze(
> + struct xfs_mount *mp)
> +{
> + struct super_block *sb = mp->m_super;
> + int error = 0;
> + int level;
> +
> + /* Wait until we're ready to freeze. */
> + down_write(&sb->s_umount);
> + while (sb->s_writers.frozen != SB_UNFROZEN) {
> + up_write(&sb->s_umount);
> +
> + // just wait, or print warning in dmesg then return -EBUSY?
> +
> + delay(HZ / 10);
> + down_write(&sb->s_umount);
> + }
> +
> + if (sb_rdonly(sb)) {
> + sb->s_writers.frozen = SB_FREEZE_EXCLUSIVE;
> + goto out;
> + }
> +
> + sb->s_writers.frozen = SB_FREEZE_WRITE;
> + /* Release s_umount to preserve sb_start_write -> s_umount ordering */
> + up_write(&sb->s_umount);
> + percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
> + down_write(&sb->s_umount);
> +
> + /* Now we go and block page faults... */
> + sb->s_writers.frozen = SB_FREEZE_PAGEFAULT;
> + percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_PAGEFAULT - 1);
> +
> + /* All writers are done so after syncing there won't be dirty data */
> + error = sync_filesystem(sb);
> + if (error) {
> + sb->s_writers.frozen = SB_UNFROZEN;
> + for (level = SB_FREEZE_PAGEFAULT - 1; level >= 0; level--)
> + percpu_up_write(sb->s_writers.rw_sem + level);
> + wake_up(&sb->s_writers.wait_unfrozen);
> + goto out;
> + }
> +
> + /* Now wait for internal filesystem counter */
> + sb->s_writers.frozen = SB_FREEZE_FS;
> + percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_FS - 1);
> +
> + /*
> + * To prevent anyone else from unfreezing us, set the VFS freeze level
> + * to one higher than SB_FREEZE_COMPLETE.
> + */
> + sb->s_writers.frozen = SB_FREEZE_EXCLUSIVE;
> + for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
> + percpu_rwsem_release(sb->s_writers.rw_sem + level, 0,
> + _THIS_IP_);
> +
> +out:
> + up_write(&sb->s_umount);
> + return error;
> +}
> +
> +static void
> +xfs_dax_notify_failure_thaw(
> + struct xfs_mount *mp)
> +{
> + struct super_block *sb = mp->m_super;
> + int level;
> +
> + down_write(&sb->s_umount);
> + if (sb->s_writers.frozen != SB_FREEZE_EXCLUSIVE) {
> + /* somebody snuck in and unfroze us? */
> + ASSERT(0);
> + up_write(&sb->s_umount);
> + return;
> + }
> +
> + if (sb_rdonly(sb)) {
> + sb->s_writers.frozen = SB_UNFROZEN;
> + goto out;
> + }
> +
> + for (level = 0; level < SB_FREEZE_LEVELS; ++level)
> + percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0,
> + _THIS_IP_);
> +
> + sb->s_writers.frozen = SB_UNFROZEN;
> + for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
> + percpu_up_write(sb->s_writers.rw_sem + level);
> +
> +out:
> + wake_up(&sb->s_writers.wait_unfrozen);
> + up_write(&sb->s_umount);
> +}
> +
> static int
> xfs_dax_notify_ddev_failure(
> struct xfs_mount *mp,
> @@ -164,11 +277,22 @@ xfs_dax_notify_ddev_failure(
> }
>
> xfs_trans_cancel(tp);
> +
> + /* Thaw the fs if it is freezed before. */
> + if (mf_flags & MF_MEM_PRE_REMOVE)
> + xfs_dax_notify_failure_thaw(mp);
> +
> + /*
> + * Determine how to shutdown the filesystem according to the
> + * error code and flags.
> + */
> if (error || notify.want_shutdown) {
> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> if (!error)
> error = -EFSCORRUPTED;
> - }
> + } else if (mf_flags & MF_MEM_PRE_REMOVE)
> + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> +
> return error;
> }
>
> @@ -182,6 +306,7 @@ xfs_dax_notify_failure(
> struct xfs_mount *mp = dax_holder(dax_dev);
> u64 ddev_start;
> u64 ddev_end;
> + int error;
>
> if (!(mp->m_super->s_flags & SB_BORN)) {
> xfs_warn(mp, "filesystem is not ready for notify_failure()!");
> @@ -196,6 +321,8 @@ xfs_dax_notify_failure(
>
> if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
> mp->m_logdev_targp != mp->m_ddev_targp) {
> + if (mf_flags & MF_MEM_PRE_REMOVE)
> + return 0;
> xfs_err(mp, "ondisk log corrupt, shutting down fs!");
> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> return -EFSCORRUPTED;
> @@ -209,6 +336,12 @@ xfs_dax_notify_failure(
> ddev_start = mp->m_ddev_targp->bt_dax_part_off;
> ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>
> + /* Notify failure on the whole device. */
> + if (offset == 0 && len == U64_MAX) {
> + offset = ddev_start;
> + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> + }
> +
> /* Ignore the range out of filesystem area */
> if (offset + len - 1 < ddev_start)
> return -ENXIO;
> @@ -225,6 +358,14 @@ xfs_dax_notify_failure(
> if (offset + len - 1 > ddev_end)
> len = ddev_end - offset + 1;
>
> + if (mf_flags & MF_MEM_PRE_REMOVE) {
> + xfs_info(mp, "device is about to be removed!");
> + /* Freeze fs to prevent new mappings from being created. */
> + error = xfs_dax_notify_failure_freeze(mp);
> + if (error)
> + return error;
> + }
> +
> return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
> mf_flags);
> }
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 1f79667824eb..ac3f22c20e1d 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3436,6 +3436,7 @@ enum mf_flags {
> MF_UNPOISON = 1 << 4,
> MF_SW_SIMULATED = 1 << 5,
> MF_NO_RETRY = 1 << 6,
> + MF_MEM_PRE_REMOVE = 1 << 7,
> };
> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> unsigned long count, int mf_flags);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index fae9baf3be16..6e6acec45568 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -623,7 +623,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
> */
> static void collect_procs_fsdax(struct page *page,
> struct address_space *mapping, pgoff_t pgoff,
> - struct list_head *to_kill)
> + struct list_head *to_kill, bool pre_remove)
> {
> struct vm_area_struct *vma;
> struct task_struct *tsk;
> @@ -631,8 +631,15 @@ static void collect_procs_fsdax(struct page *page,
> i_mmap_lock_read(mapping);
> read_lock(&tasklist_lock);
> for_each_process(tsk) {
> - struct task_struct *t = task_early_kill(tsk, true);
> + struct task_struct *t = tsk;
>
> + /*
> + * Search for all tasks while MF_MEM_PRE_REMOVE, because the
> + * current may not be the one accessing the fsdax page.
> + * Otherwise, search for the current task.
> + */
> + if (!pre_remove)
> + t = task_early_kill(tsk, true);
> if (!t)
> continue;
> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> @@ -1732,6 +1739,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> dax_entry_t cookie;
> struct page *page;
> size_t end = index + count;
> + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>
> mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>
> @@ -1743,9 +1751,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> if (!page)
> goto unlock;
>
> - SetPageHWPoison(page);
> + if (!pre_remove)
> + SetPageHWPoison(page);
>
> - collect_procs_fsdax(page, mapping, index, &to_kill);
> + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
> unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> index, mf_flags);
> unlock:
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH v11.1 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-04-20 2:07 ` Shiyang Ruan
@ 2023-04-20 12:09 ` Jan Kara
2023-04-25 12:47 ` Shiyang Ruan
0 siblings, 1 reply; 18+ messages in thread
From: Jan Kara @ 2023-04-20 12:09 UTC (permalink / raw)
To: Shiyang Ruan
Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams,
willy, jack, akpm, djwong, Luis Chamberlain
On Thu 20-04-23 10:07:39, Shiyang Ruan wrote:
> 在 2023/4/12 18:52, Shiyang Ruan 写道:
> > This is a RFC HOTFIX.
> >
> > This hotfix adds a exclusive forzen state to make sure any others won't
> > thaw the fs during xfs_dax_notify_failure():
> >
> > #define SB_FREEZE_EXCLUSIVE (SB_FREEZE_COMPLETE + 2)
> > Using +2 here is because Darrick's patch[0] is using +1. So, should we
> > make these definitions global?
> >
> > Another thing I can't make up my mind is: when another freezer has freeze
> > the fs, should we wait unitl it finish, or print a warning in dmesg and
> > return -EBUSY?
> >
> > Since there are at least 2 places needs exclusive forzen state, I think
> > we can refactor helper functions of freeze/thaw for them. e.g.
> > int freeze_super_exclusive(struct super_block *sb, int frozen);
> > int thaw_super_exclusive(struct super_block *sb, int frozen);
> >
> > [0] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=repair-fscounters&id=c3a0d1de4d54ffb565dbc7092dfe1fb851940669
I'm OK with the idea of new freeze state that does not allow userspace to
thaw the filesystem. But I don't really like the guts of filesystem
freezing being replicated inside XFS. It is bad enough that they are
replicated in [0], replicating them *once more* in another XFS file shows
we are definitely doing something wrong. And Luis will need yet another
incantation of the exlusive freeze for suspend-to-disk. So please guys get
together and reorganize the generic freezing code so that it supports
exclusive freeze (for in-kernel users) and works for your usecases instead
of replicating it inside XFS...
Honza
> > --- Original commit message ---
> > This patch is inspired by Dan's "mm, dax, pmem: Introduce
> > dev_pagemap_failure()"[1]. With the help of dax_holder and
> > ->notify_failure() mechanism, the pmem driver is able to ask filesystem
> > (or mapped device) on it to unmap all files in use and notify processes
> > who are using those files.
> >
> > Call trace:
> > trigger unbind
> > -> unbind_store()
> > -> ... (skip)
> > -> devres_release_all()
> > -> kill_dax()
> > -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
> > -> xfs_dax_notify_failure()
> > `-> freeze_super()
> > `-> do xfs rmap
> > ` -> mf_dax_kill_procs()
> > ` -> collect_procs_fsdax() // all associated
> > ` -> unmap_and_kill()
> > ` -> invalidate_inode_pages2() // drop file's cache
> > `-> thaw_super()
> >
> > Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
> > event. Also introduce a exclusive freeze/thaw to lock the filesystem to
> > prevent new dax mapping from being created. And do not shutdown
> > filesystem directly if something not supported, or if failure range
> > includes metadata area. Make sure all files and processes are handled
> > correctly. Also drop the cache of associated files before pmem is
> > removed.
> >
> > [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
> >
> > Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
> > ---
> > drivers/dax/super.c | 3 +-
> > fs/xfs/xfs_notify_failure.c | 151 ++++++++++++++++++++++++++++++++++--
> > include/linux/mm.h | 1 +
> > mm/memory-failure.c | 17 +++-
> > 4 files changed, 162 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/dax/super.c b/drivers/dax/super.c
> > index c4c4728a36e4..2e1a35e82fce 100644
> > --- a/drivers/dax/super.c
> > +++ b/drivers/dax/super.c
> > @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
> > return;
> > if (dax_dev->holder_data != NULL)
> > - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
> > + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
> > + MF_MEM_PRE_REMOVE);
> > clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
> > synchronize_srcu(&dax_srcu);
> > diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
> > index 1e2eddb8f90f..796dd954d33a 100644
> > --- a/fs/xfs/xfs_notify_failure.c
> > +++ b/fs/xfs/xfs_notify_failure.c
> > @@ -22,6 +22,7 @@
> > #include <linux/mm.h>
> > #include <linux/dax.h>
> > +#include <linux/fs.h>
> > struct xfs_failure_info {
> > xfs_agblock_t startblock;
> > @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
> > struct xfs_mount *mp = cur->bc_mp;
> > struct xfs_inode *ip;
> > struct xfs_failure_info *notify = data;
> > + struct address_space *mapping;
> > + pgoff_t pgoff;
> > + unsigned long pgcnt;
> > int error = 0;
> > if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
> > (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
> > + /* The device is about to be removed. Not a really failure. */
> > + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
> > + return 0;
> > notify->want_shutdown = true;
> > return 0;
> > }
> > @@ -92,14 +99,120 @@ xfs_dax_failure_fn(
> > return 0;
> > }
> > - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
> > - xfs_failure_pgoff(mp, rec, notify),
> > - xfs_failure_pgcnt(mp, rec, notify),
> > - notify->mf_flags);
> > + mapping = VFS_I(ip)->i_mapping;
> > + pgoff = xfs_failure_pgoff(mp, rec, notify);
> > + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
> > +
> > + /* Continue the rmap query if the inode isn't a dax file. */
> > + if (dax_mapping(mapping))
> > + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
> > + notify->mf_flags);
> > +
> > + /* Invalidate the cache anyway. */
> > + invalidate_inode_pages2_range(mapping, pgoff, pgoff + pgcnt - 1);
> > +
> > xfs_irele(ip);
> > return error;
> > }
> > +#define SB_FREEZE_EXCLUSIVE (SB_FREEZE_COMPLETE + 2)
> > +
> > +static int
> > +xfs_dax_notify_failure_freeze(
> > + struct xfs_mount *mp)
> > +{
> > + struct super_block *sb = mp->m_super;
> > + int error = 0;
> > + int level;
> > +
> > + /* Wait until we're ready to freeze. */
> > + down_write(&sb->s_umount);
> > + while (sb->s_writers.frozen != SB_UNFROZEN) {
> > + up_write(&sb->s_umount);
> > +
> > + // just wait, or print warning in dmesg then return -EBUSY?
> > +
> > + delay(HZ / 10);
> > + down_write(&sb->s_umount);
> > + }
> > +
> > + if (sb_rdonly(sb)) {
> > + sb->s_writers.frozen = SB_FREEZE_EXCLUSIVE;
> > + goto out;
> > + }
> > +
> > + sb->s_writers.frozen = SB_FREEZE_WRITE;
> > + /* Release s_umount to preserve sb_start_write -> s_umount ordering */
> > + up_write(&sb->s_umount);
> > + percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
> > + down_write(&sb->s_umount);
> > +
> > + /* Now we go and block page faults... */
> > + sb->s_writers.frozen = SB_FREEZE_PAGEFAULT;
> > + percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_PAGEFAULT - 1);
> > +
> > + /* All writers are done so after syncing there won't be dirty data */
> > + error = sync_filesystem(sb);
> > + if (error) {
> > + sb->s_writers.frozen = SB_UNFROZEN;
> > + for (level = SB_FREEZE_PAGEFAULT - 1; level >= 0; level--)
> > + percpu_up_write(sb->s_writers.rw_sem + level);
> > + wake_up(&sb->s_writers.wait_unfrozen);
> > + goto out;
> > + }
> > +
> > + /* Now wait for internal filesystem counter */
> > + sb->s_writers.frozen = SB_FREEZE_FS;
> > + percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_FS - 1);
> > +
> > + /*
> > + * To prevent anyone else from unfreezing us, set the VFS freeze level
> > + * to one higher than SB_FREEZE_COMPLETE.
> > + */
> > + sb->s_writers.frozen = SB_FREEZE_EXCLUSIVE;
> > + for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
> > + percpu_rwsem_release(sb->s_writers.rw_sem + level, 0,
> > + _THIS_IP_);
> > +
> > +out:
> > + up_write(&sb->s_umount);
> > + return error;
> > +}
> > +
> > +static void
> > +xfs_dax_notify_failure_thaw(
> > + struct xfs_mount *mp)
> > +{
> > + struct super_block *sb = mp->m_super;
> > + int level;
> > +
> > + down_write(&sb->s_umount);
> > + if (sb->s_writers.frozen != SB_FREEZE_EXCLUSIVE) {
> > + /* somebody snuck in and unfroze us? */
> > + ASSERT(0);
> > + up_write(&sb->s_umount);
> > + return;
> > + }
> > +
> > + if (sb_rdonly(sb)) {
> > + sb->s_writers.frozen = SB_UNFROZEN;
> > + goto out;
> > + }
> > +
> > + for (level = 0; level < SB_FREEZE_LEVELS; ++level)
> > + percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0,
> > + _THIS_IP_);
> > +
> > + sb->s_writers.frozen = SB_UNFROZEN;
> > + for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
> > + percpu_up_write(sb->s_writers.rw_sem + level);
> > +
> > +out:
> > + wake_up(&sb->s_writers.wait_unfrozen);
> > + up_write(&sb->s_umount);
> > +}
> > +
> > static int
> > xfs_dax_notify_ddev_failure(
> > struct xfs_mount *mp,
> > @@ -164,11 +277,22 @@ xfs_dax_notify_ddev_failure(
> > }
> > xfs_trans_cancel(tp);
> > +
> > + /* Thaw the fs if it is freezed before. */
> > + if (mf_flags & MF_MEM_PRE_REMOVE)
> > + xfs_dax_notify_failure_thaw(mp);
> > +
> > + /*
> > + * Determine how to shutdown the filesystem according to the
> > + * error code and flags.
> > + */
> > if (error || notify.want_shutdown) {
> > xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > if (!error)
> > error = -EFSCORRUPTED;
> > - }
> > + } else if (mf_flags & MF_MEM_PRE_REMOVE)
> > + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
> > +
> > return error;
> > }
> > @@ -182,6 +306,7 @@ xfs_dax_notify_failure(
> > struct xfs_mount *mp = dax_holder(dax_dev);
> > u64 ddev_start;
> > u64 ddev_end;
> > + int error;
> > if (!(mp->m_super->s_flags & SB_BORN)) {
> > xfs_warn(mp, "filesystem is not ready for notify_failure()!");
> > @@ -196,6 +321,8 @@ xfs_dax_notify_failure(
> > if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
> > mp->m_logdev_targp != mp->m_ddev_targp) {
> > + if (mf_flags & MF_MEM_PRE_REMOVE)
> > + return 0;
> > xfs_err(mp, "ondisk log corrupt, shutting down fs!");
> > xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
> > return -EFSCORRUPTED;
> > @@ -209,6 +336,12 @@ xfs_dax_notify_failure(
> > ddev_start = mp->m_ddev_targp->bt_dax_part_off;
> > ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
> > + /* Notify failure on the whole device. */
> > + if (offset == 0 && len == U64_MAX) {
> > + offset = ddev_start;
> > + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
> > + }
> > +
> > /* Ignore the range out of filesystem area */
> > if (offset + len - 1 < ddev_start)
> > return -ENXIO;
> > @@ -225,6 +358,14 @@ xfs_dax_notify_failure(
> > if (offset + len - 1 > ddev_end)
> > len = ddev_end - offset + 1;
> > + if (mf_flags & MF_MEM_PRE_REMOVE) {
> > + xfs_info(mp, "device is about to be removed!");
> > + /* Freeze fs to prevent new mappings from being created. */
> > + error = xfs_dax_notify_failure_freeze(mp);
> > + if (error)
> > + return error;
> > + }
> > +
> > return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
> > mf_flags);
> > }
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 1f79667824eb..ac3f22c20e1d 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -3436,6 +3436,7 @@ enum mf_flags {
> > MF_UNPOISON = 1 << 4,
> > MF_SW_SIMULATED = 1 << 5,
> > MF_NO_RETRY = 1 << 6,
> > + MF_MEM_PRE_REMOVE = 1 << 7,
> > };
> > int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > unsigned long count, int mf_flags);
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index fae9baf3be16..6e6acec45568 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -623,7 +623,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
> > */
> > static void collect_procs_fsdax(struct page *page,
> > struct address_space *mapping, pgoff_t pgoff,
> > - struct list_head *to_kill)
> > + struct list_head *to_kill, bool pre_remove)
> > {
> > struct vm_area_struct *vma;
> > struct task_struct *tsk;
> > @@ -631,8 +631,15 @@ static void collect_procs_fsdax(struct page *page,
> > i_mmap_lock_read(mapping);
> > read_lock(&tasklist_lock);
> > for_each_process(tsk) {
> > - struct task_struct *t = task_early_kill(tsk, true);
> > + struct task_struct *t = tsk;
> > + /*
> > + * Search for all tasks while MF_MEM_PRE_REMOVE, because the
> > + * current may not be the one accessing the fsdax page.
> > + * Otherwise, search for the current task.
> > + */
> > + if (!pre_remove)
> > + t = task_early_kill(tsk, true);
> > if (!t)
> > continue;
> > vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> > @@ -1732,6 +1739,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > dax_entry_t cookie;
> > struct page *page;
> > size_t end = index + count;
> > + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
> > mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> > @@ -1743,9 +1751,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
> > if (!page)
> > goto unlock;
> > - SetPageHWPoison(page);
> > + if (!pre_remove)
> > + SetPageHWPoison(page);
> > - collect_procs_fsdax(page, mapping, index, &to_kill);
> > + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
> > unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
> > index, mf_flags);
> > unlock:
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH v11.1 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-04-20 12:09 ` Jan Kara
@ 2023-04-25 12:47 ` Shiyang Ruan
2023-04-25 13:23 ` Jan Kara
0 siblings, 1 reply; 18+ messages in thread
From: Shiyang Ruan @ 2023-04-25 12:47 UTC (permalink / raw)
To: Jan Kara, djwong, Luis Chamberlain
Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy, akpm
在 2023/4/20 20:09, Jan Kara 写道:
> On Thu 20-04-23 10:07:39, Shiyang Ruan wrote:
>> 在 2023/4/12 18:52, Shiyang Ruan 写道:
>>> This is a RFC HOTFIX.
>>>
>>> This hotfix adds a exclusive forzen state to make sure any others won't
>>> thaw the fs during xfs_dax_notify_failure():
>>>
>>> #define SB_FREEZE_EXCLUSIVE (SB_FREEZE_COMPLETE + 2)
>>> Using +2 here is because Darrick's patch[0] is using +1. So, should we
>>> make these definitions global?
>>>
>>> Another thing I can't make up my mind is: when another freezer has freeze
>>> the fs, should we wait unitl it finish, or print a warning in dmesg and
>>> return -EBUSY?
>>>
>>> Since there are at least 2 places needs exclusive forzen state, I think
>>> we can refactor helper functions of freeze/thaw for them. e.g.
>>> int freeze_super_exclusive(struct super_block *sb, int frozen);
>>> int thaw_super_exclusive(struct super_block *sb, int frozen);
>>>
>>> [0] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=repair-fscounters&id=c3a0d1de4d54ffb565dbc7092dfe1fb851940669
>
> I'm OK with the idea of new freeze state that does not allow userspace to
> thaw the filesystem. But I don't really like the guts of filesystem
> freezing being replicated inside XFS. It is bad enough that they are
> replicated in [0], replicating them *once more* in another XFS file shows
> we are definitely doing something wrong. And Luis will need yet another
> incantation of the exlusive freeze for suspend-to-disk. So please guys get
> together and reorganize the generic freezing code so that it supports
> exclusive freeze (for in-kernel users) and works for your usecases instead
> of replicating it inside XFS...
I agree that too much replicating code is not good. It's necessary to
create a generic exclusive freeze/thaw for all users. But for me, I
don't have the confidence to do it well, because it requires good design
and code changes will involve other filesystems. It's diffcult.
However, I hope to be able to make progress on this unbind feature.
Thus, I tend to refactor a common helper function for xfs first, and
update the code later when the generic freeze is done.
--
Thanks,
Ruan.
>
> Honza
>
>>> --- Original commit message ---
>>> This patch is inspired by Dan's "mm, dax, pmem: Introduce
>>> dev_pagemap_failure()"[1]. With the help of dax_holder and
>>> ->notify_failure() mechanism, the pmem driver is able to ask filesystem
>>> (or mapped device) on it to unmap all files in use and notify processes
>>> who are using those files.
>>>
>>> Call trace:
>>> trigger unbind
>>> -> unbind_store()
>>> -> ... (skip)
>>> -> devres_release_all()
>>> -> kill_dax()
>>> -> dax_holder_notify_failure(dax_dev, 0, U64_MAX, MF_MEM_PRE_REMOVE)
>>> -> xfs_dax_notify_failure()
>>> `-> freeze_super()
>>> `-> do xfs rmap
>>> ` -> mf_dax_kill_procs()
>>> ` -> collect_procs_fsdax() // all associated
>>> ` -> unmap_and_kill()
>>> ` -> invalidate_inode_pages2() // drop file's cache
>>> `-> thaw_super()
>>>
>>> Introduce MF_MEM_PRE_REMOVE to let filesystem know this is a remove
>>> event. Also introduce a exclusive freeze/thaw to lock the filesystem to
>>> prevent new dax mapping from being created. And do not shutdown
>>> filesystem directly if something not supported, or if failure range
>>> includes metadata area. Make sure all files and processes are handled
>>> correctly. Also drop the cache of associated files before pmem is
>>> removed.
>>>
>>> [1]: https://lore.kernel.org/linux-mm/161604050314.1463742.14151665140035795571.stgit@dwillia2-desk3.amr.corp.intel.com/
>>>
>>> Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
>>> ---
>>> drivers/dax/super.c | 3 +-
>>> fs/xfs/xfs_notify_failure.c | 151 ++++++++++++++++++++++++++++++++++--
>>> include/linux/mm.h | 1 +
>>> mm/memory-failure.c | 17 +++-
>>> 4 files changed, 162 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/drivers/dax/super.c b/drivers/dax/super.c
>>> index c4c4728a36e4..2e1a35e82fce 100644
>>> --- a/drivers/dax/super.c
>>> +++ b/drivers/dax/super.c
>>> @@ -323,7 +323,8 @@ void kill_dax(struct dax_device *dax_dev)
>>> return;
>>> if (dax_dev->holder_data != NULL)
>>> - dax_holder_notify_failure(dax_dev, 0, U64_MAX, 0);
>>> + dax_holder_notify_failure(dax_dev, 0, U64_MAX,
>>> + MF_MEM_PRE_REMOVE);
>>> clear_bit(DAXDEV_ALIVE, &dax_dev->flags);
>>> synchronize_srcu(&dax_srcu);
>>> diff --git a/fs/xfs/xfs_notify_failure.c b/fs/xfs/xfs_notify_failure.c
>>> index 1e2eddb8f90f..796dd954d33a 100644
>>> --- a/fs/xfs/xfs_notify_failure.c
>>> +++ b/fs/xfs/xfs_notify_failure.c
>>> @@ -22,6 +22,7 @@
>>> #include <linux/mm.h>
>>> #include <linux/dax.h>
>>> +#include <linux/fs.h>
>>> struct xfs_failure_info {
>>> xfs_agblock_t startblock;
>>> @@ -73,10 +74,16 @@ xfs_dax_failure_fn(
>>> struct xfs_mount *mp = cur->bc_mp;
>>> struct xfs_inode *ip;
>>> struct xfs_failure_info *notify = data;
>>> + struct address_space *mapping;
>>> + pgoff_t pgoff;
>>> + unsigned long pgcnt;
>>> int error = 0;
>>> if (XFS_RMAP_NON_INODE_OWNER(rec->rm_owner) ||
>>> (rec->rm_flags & (XFS_RMAP_ATTR_FORK | XFS_RMAP_BMBT_BLOCK))) {
>>> + /* The device is about to be removed. Not a really failure. */
>>> + if (notify->mf_flags & MF_MEM_PRE_REMOVE)
>>> + return 0;
>>> notify->want_shutdown = true;
>>> return 0;
>>> }
>>> @@ -92,14 +99,120 @@ xfs_dax_failure_fn(
>>> return 0;
>>> }
>>> - error = mf_dax_kill_procs(VFS_I(ip)->i_mapping,
>>> - xfs_failure_pgoff(mp, rec, notify),
>>> - xfs_failure_pgcnt(mp, rec, notify),
>>> - notify->mf_flags);
>>> + mapping = VFS_I(ip)->i_mapping;
>>> + pgoff = xfs_failure_pgoff(mp, rec, notify);
>>> + pgcnt = xfs_failure_pgcnt(mp, rec, notify);
>>> +
>>> + /* Continue the rmap query if the inode isn't a dax file. */
>>> + if (dax_mapping(mapping))
>>> + error = mf_dax_kill_procs(mapping, pgoff, pgcnt,
>>> + notify->mf_flags);
>>> +
>>> + /* Invalidate the cache anyway. */
>>> + invalidate_inode_pages2_range(mapping, pgoff, pgoff + pgcnt - 1);
>>> +
>>> xfs_irele(ip);
>>> return error;
>>> }
>>> +#define SB_FREEZE_EXCLUSIVE (SB_FREEZE_COMPLETE + 2)
>>> +
>>> +static int
>>> +xfs_dax_notify_failure_freeze(
>>> + struct xfs_mount *mp)
>>> +{
>>> + struct super_block *sb = mp->m_super;
>>> + int error = 0;
>>> + int level;
>>> +
>>> + /* Wait until we're ready to freeze. */
>>> + down_write(&sb->s_umount);
>>> + while (sb->s_writers.frozen != SB_UNFROZEN) {
>>> + up_write(&sb->s_umount);
>>> +
>>> + // just wait, or print warning in dmesg then return -EBUSY?
>>> +
>>> + delay(HZ / 10);
>>> + down_write(&sb->s_umount);
>>> + }
>>> +
>>> + if (sb_rdonly(sb)) {
>>> + sb->s_writers.frozen = SB_FREEZE_EXCLUSIVE;
>>> + goto out;
>>> + }
>>> +
>>> + sb->s_writers.frozen = SB_FREEZE_WRITE;
>>> + /* Release s_umount to preserve sb_start_write -> s_umount ordering */
>>> + up_write(&sb->s_umount);
>>> + percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE - 1);
>>> + down_write(&sb->s_umount);
>>> +
>>> + /* Now we go and block page faults... */
>>> + sb->s_writers.frozen = SB_FREEZE_PAGEFAULT;
>>> + percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_PAGEFAULT - 1);
>>> +
>>> + /* All writers are done so after syncing there won't be dirty data */
>>> + error = sync_filesystem(sb);
>>> + if (error) {
>>> + sb->s_writers.frozen = SB_UNFROZEN;
>>> + for (level = SB_FREEZE_PAGEFAULT - 1; level >= 0; level--)
>>> + percpu_up_write(sb->s_writers.rw_sem + level);
>>> + wake_up(&sb->s_writers.wait_unfrozen);
>>> + goto out;
>>> + }
>>> +
>>> + /* Now wait for internal filesystem counter */
>>> + sb->s_writers.frozen = SB_FREEZE_FS;
>>> + percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_FS - 1);
>>> +
>>> + /*
>>> + * To prevent anyone else from unfreezing us, set the VFS freeze level
>>> + * to one higher than SB_FREEZE_COMPLETE.
>>> + */
>>> + sb->s_writers.frozen = SB_FREEZE_EXCLUSIVE;
>>> + for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
>>> + percpu_rwsem_release(sb->s_writers.rw_sem + level, 0,
>>> + _THIS_IP_);
>>> +
>>> +out:
>>> + up_write(&sb->s_umount);
>>> + return error;
>>> +}
>>> +
>>> +static void
>>> +xfs_dax_notify_failure_thaw(
>>> + struct xfs_mount *mp)
>>> +{
>>> + struct super_block *sb = mp->m_super;
>>> + int level;
>>> +
>>> + down_write(&sb->s_umount);
>>> + if (sb->s_writers.frozen != SB_FREEZE_EXCLUSIVE) {
>>> + /* somebody snuck in and unfroze us? */
>>> + ASSERT(0);
>>> + up_write(&sb->s_umount);
>>> + return;
>>> + }
>>> +
>>> + if (sb_rdonly(sb)) {
>>> + sb->s_writers.frozen = SB_UNFROZEN;
>>> + goto out;
>>> + }
>>> +
>>> + for (level = 0; level < SB_FREEZE_LEVELS; ++level)
>>> + percpu_rwsem_acquire(sb->s_writers.rw_sem + level, 0,
>>> + _THIS_IP_);
>>> +
>>> + sb->s_writers.frozen = SB_UNFROZEN;
>>> + for (level = SB_FREEZE_LEVELS - 1; level >= 0; level--)
>>> + percpu_up_write(sb->s_writers.rw_sem + level);
>>> +
>>> +out:
>>> + wake_up(&sb->s_writers.wait_unfrozen);
>>> + up_write(&sb->s_umount);
>>> +}
>>> +
>>> static int
>>> xfs_dax_notify_ddev_failure(
>>> struct xfs_mount *mp,
>>> @@ -164,11 +277,22 @@ xfs_dax_notify_ddev_failure(
>>> }
>>> xfs_trans_cancel(tp);
>>> +
>>> + /* Thaw the fs if it is freezed before. */
>>> + if (mf_flags & MF_MEM_PRE_REMOVE)
>>> + xfs_dax_notify_failure_thaw(mp);
>>> +
>>> + /*
>>> + * Determine how to shutdown the filesystem according to the
>>> + * error code and flags.
>>> + */
>>> if (error || notify.want_shutdown) {
>>> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>> if (!error)
>>> error = -EFSCORRUPTED;
>>> - }
>>> + } else if (mf_flags & MF_MEM_PRE_REMOVE)
>>> + xfs_force_shutdown(mp, SHUTDOWN_FORCE_UMOUNT);
>>> +
>>> return error;
>>> }
>>> @@ -182,6 +306,7 @@ xfs_dax_notify_failure(
>>> struct xfs_mount *mp = dax_holder(dax_dev);
>>> u64 ddev_start;
>>> u64 ddev_end;
>>> + int error;
>>> if (!(mp->m_super->s_flags & SB_BORN)) {
>>> xfs_warn(mp, "filesystem is not ready for notify_failure()!");
>>> @@ -196,6 +321,8 @@ xfs_dax_notify_failure(
>>> if (mp->m_logdev_targp && mp->m_logdev_targp->bt_daxdev == dax_dev &&
>>> mp->m_logdev_targp != mp->m_ddev_targp) {
>>> + if (mf_flags & MF_MEM_PRE_REMOVE)
>>> + return 0;
>>> xfs_err(mp, "ondisk log corrupt, shutting down fs!");
>>> xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_ONDISK);
>>> return -EFSCORRUPTED;
>>> @@ -209,6 +336,12 @@ xfs_dax_notify_failure(
>>> ddev_start = mp->m_ddev_targp->bt_dax_part_off;
>>> ddev_end = ddev_start + bdev_nr_bytes(mp->m_ddev_targp->bt_bdev) - 1;
>>> + /* Notify failure on the whole device. */
>>> + if (offset == 0 && len == U64_MAX) {
>>> + offset = ddev_start;
>>> + len = bdev_nr_bytes(mp->m_ddev_targp->bt_bdev);
>>> + }
>>> +
>>> /* Ignore the range out of filesystem area */
>>> if (offset + len - 1 < ddev_start)
>>> return -ENXIO;
>>> @@ -225,6 +358,14 @@ xfs_dax_notify_failure(
>>> if (offset + len - 1 > ddev_end)
>>> len = ddev_end - offset + 1;
>>> + if (mf_flags & MF_MEM_PRE_REMOVE) {
>>> + xfs_info(mp, "device is about to be removed!");
>>> + /* Freeze fs to prevent new mappings from being created. */
>>> + error = xfs_dax_notify_failure_freeze(mp);
>>> + if (error)
>>> + return error;
>>> + }
>>> +
>>> return xfs_dax_notify_ddev_failure(mp, BTOBB(offset), BTOBB(len),
>>> mf_flags);
>>> }
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 1f79667824eb..ac3f22c20e1d 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -3436,6 +3436,7 @@ enum mf_flags {
>>> MF_UNPOISON = 1 << 4,
>>> MF_SW_SIMULATED = 1 << 5,
>>> MF_NO_RETRY = 1 << 6,
>>> + MF_MEM_PRE_REMOVE = 1 << 7,
>>> };
>>> int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>> unsigned long count, int mf_flags);
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index fae9baf3be16..6e6acec45568 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -623,7 +623,7 @@ static void collect_procs_file(struct page *page, struct list_head *to_kill,
>>> */
>>> static void collect_procs_fsdax(struct page *page,
>>> struct address_space *mapping, pgoff_t pgoff,
>>> - struct list_head *to_kill)
>>> + struct list_head *to_kill, bool pre_remove)
>>> {
>>> struct vm_area_struct *vma;
>>> struct task_struct *tsk;
>>> @@ -631,8 +631,15 @@ static void collect_procs_fsdax(struct page *page,
>>> i_mmap_lock_read(mapping);
>>> read_lock(&tasklist_lock);
>>> for_each_process(tsk) {
>>> - struct task_struct *t = task_early_kill(tsk, true);
>>> + struct task_struct *t = tsk;
>>> + /*
>>> + * Search for all tasks while MF_MEM_PRE_REMOVE, because the
>>> + * current may not be the one accessing the fsdax page.
>>> + * Otherwise, search for the current task.
>>> + */
>>> + if (!pre_remove)
>>> + t = task_early_kill(tsk, true);
>>> if (!t)
>>> continue;
>>> vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
>>> @@ -1732,6 +1739,7 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>> dax_entry_t cookie;
>>> struct page *page;
>>> size_t end = index + count;
>>> + bool pre_remove = mf_flags & MF_MEM_PRE_REMOVE;
>>> mf_flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
>>> @@ -1743,9 +1751,10 @@ int mf_dax_kill_procs(struct address_space *mapping, pgoff_t index,
>>> if (!page)
>>> goto unlock;
>>> - SetPageHWPoison(page);
>>> + if (!pre_remove)
>>> + SetPageHWPoison(page);
>>> - collect_procs_fsdax(page, mapping, index, &to_kill);
>>> + collect_procs_fsdax(page, mapping, index, &to_kill, pre_remove);
>>> unmap_and_kill(&to_kill, page_to_pfn(page), mapping,
>>> index, mf_flags);
>>> unlock:
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH v11.1 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-04-25 12:47 ` Shiyang Ruan
@ 2023-04-25 13:23 ` Jan Kara
2023-04-25 15:18 ` Darrick J. Wong
0 siblings, 1 reply; 18+ messages in thread
From: Jan Kara @ 2023-04-25 13:23 UTC (permalink / raw)
To: Shiyang Ruan
Cc: Jan Kara, djwong, Luis Chamberlain, linux-fsdevel, nvdimm,
linux-xfs, linux-mm, dan.j.williams, willy, akpm
On Tue 25-04-23 20:47:35, Shiyang Ruan wrote:
>
>
> 在 2023/4/20 20:09, Jan Kara 写道:
> > On Thu 20-04-23 10:07:39, Shiyang Ruan wrote:
> > > 在 2023/4/12 18:52, Shiyang Ruan 写道:
> > > > This is a RFC HOTFIX.
> > > >
> > > > This hotfix adds a exclusive forzen state to make sure any others won't
> > > > thaw the fs during xfs_dax_notify_failure():
> > > >
> > > > #define SB_FREEZE_EXCLUSIVE (SB_FREEZE_COMPLETE + 2)
> > > > Using +2 here is because Darrick's patch[0] is using +1. So, should we
> > > > make these definitions global?
> > > >
> > > > Another thing I can't make up my mind is: when another freezer has freeze
> > > > the fs, should we wait unitl it finish, or print a warning in dmesg and
> > > > return -EBUSY?
> > > >
> > > > Since there are at least 2 places needs exclusive forzen state, I think
> > > > we can refactor helper functions of freeze/thaw for them. e.g.
> > > > int freeze_super_exclusive(struct super_block *sb, int frozen);
> > > > int thaw_super_exclusive(struct super_block *sb, int frozen);
> > > >
> > > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=repair-fscounters&id=c3a0d1de4d54ffb565dbc7092dfe1fb851940669
> >
> > I'm OK with the idea of new freeze state that does not allow userspace to
> > thaw the filesystem. But I don't really like the guts of filesystem
> > freezing being replicated inside XFS. It is bad enough that they are
> > replicated in [0], replicating them *once more* in another XFS file shows
> > we are definitely doing something wrong. And Luis will need yet another
> > incantation of the exlusive freeze for suspend-to-disk. So please guys get
> > together and reorganize the generic freezing code so that it supports
> > exclusive freeze (for in-kernel users) and works for your usecases instead
> > of replicating it inside XFS...
>
> I agree that too much replicating code is not good. It's necessary to
> create a generic exclusive freeze/thaw for all users. But for me, I don't
> have the confidence to do it well, because it requires good design and code
> changes will involve other filesystems. It's diffcult.
>
> However, I hope to be able to make progress on this unbind feature. Thus, I
> tend to refactor a common helper function for xfs first, and update the code
> later when the generic freeze is done.
I think Darrick was thinking about working on a proper generic interface.
So please coordinate with him.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH v11.1 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-04-25 13:23 ` Jan Kara
@ 2023-04-25 15:18 ` Darrick J. Wong
2023-04-26 2:27 ` Shiyang Ruan
0 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2023-04-25 15:18 UTC (permalink / raw)
To: Jan Kara
Cc: Shiyang Ruan, Luis Chamberlain, linux-fsdevel, nvdimm, linux-xfs,
linux-mm, dan.j.williams, willy, akpm
On Tue, Apr 25, 2023 at 03:23:15PM +0200, Jan Kara wrote:
> On Tue 25-04-23 20:47:35, Shiyang Ruan wrote:
> >
> >
> > 在 2023/4/20 20:09, Jan Kara 写道:
> > > On Thu 20-04-23 10:07:39, Shiyang Ruan wrote:
> > > > 在 2023/4/12 18:52, Shiyang Ruan 写道:
> > > > > This is a RFC HOTFIX.
> > > > >
> > > > > This hotfix adds a exclusive forzen state to make sure any others won't
> > > > > thaw the fs during xfs_dax_notify_failure():
> > > > >
> > > > > #define SB_FREEZE_EXCLUSIVE (SB_FREEZE_COMPLETE + 2)
> > > > > Using +2 here is because Darrick's patch[0] is using +1. So, should we
> > > > > make these definitions global?
> > > > >
> > > > > Another thing I can't make up my mind is: when another freezer has freeze
> > > > > the fs, should we wait unitl it finish, or print a warning in dmesg and
> > > > > return -EBUSY?
> > > > >
> > > > > Since there are at least 2 places needs exclusive forzen state, I think
> > > > > we can refactor helper functions of freeze/thaw for them. e.g.
> > > > > int freeze_super_exclusive(struct super_block *sb, int frozen);
> > > > > int thaw_super_exclusive(struct super_block *sb, int frozen);
> > > > >
> > > > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=repair-fscounters&id=c3a0d1de4d54ffb565dbc7092dfe1fb851940669
> > >
> > > I'm OK with the idea of new freeze state that does not allow userspace to
> > > thaw the filesystem. But I don't really like the guts of filesystem
> > > freezing being replicated inside XFS. It is bad enough that they are
> > > replicated in [0], replicating them *once more* in another XFS file shows
> > > we are definitely doing something wrong. And Luis will need yet another
> > > incantation of the exlusive freeze for suspend-to-disk. So please guys get
> > > together and reorganize the generic freezing code so that it supports
> > > exclusive freeze (for in-kernel users) and works for your usecases instead
> > > of replicating it inside XFS...
> >
> > I agree that too much replicating code is not good. It's necessary to
> > create a generic exclusive freeze/thaw for all users. But for me, I don't
> > have the confidence to do it well, because it requires good design and code
> > changes will involve other filesystems. It's diffcult.
> >
> > However, I hope to be able to make progress on this unbind feature. Thus, I
> > tend to refactor a common helper function for xfs first, and update the code
> > later when the generic freeze is done.
>
> I think Darrick was thinking about working on a proper generic interface.
> So please coordinate with him.
I'll post a vfs generic kernelfreeze series later today.
One thing I haven't figured out yet is what's supposed to happen when
PREREMOVE is called on a frozen filesystem. We don't want userspace to
be able to thaw the fs while PREREMOVE is running, so I /guess/ that
means we need some method for the kernel to take over a userspace
freeze and then put it back when we're done?
--D
> Honza
>
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH v11.1 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-04-25 15:18 ` Darrick J. Wong
@ 2023-04-26 2:27 ` Shiyang Ruan
2023-04-26 2:37 ` Darrick J. Wong
0 siblings, 1 reply; 18+ messages in thread
From: Shiyang Ruan @ 2023-04-26 2:27 UTC (permalink / raw)
To: Darrick J. Wong, Jan Kara, Luis Chamberlain
Cc: linux-fsdevel, nvdimm, linux-xfs, linux-mm, dan.j.williams, willy, akpm
在 2023/4/25 23:18, Darrick J. Wong 写道:
> On Tue, Apr 25, 2023 at 03:23:15PM +0200, Jan Kara wrote:
>> On Tue 25-04-23 20:47:35, Shiyang Ruan wrote:
>>>
>>>
>>> 在 2023/4/20 20:09, Jan Kara 写道:
>>>> On Thu 20-04-23 10:07:39, Shiyang Ruan wrote:
>>>>> 在 2023/4/12 18:52, Shiyang Ruan 写道:
>>>>>> This is a RFC HOTFIX.
>>>>>>
>>>>>> This hotfix adds a exclusive forzen state to make sure any others won't
>>>>>> thaw the fs during xfs_dax_notify_failure():
>>>>>>
>>>>>> #define SB_FREEZE_EXCLUSIVE (SB_FREEZE_COMPLETE + 2)
>>>>>> Using +2 here is because Darrick's patch[0] is using +1. So, should we
>>>>>> make these definitions global?
>>>>>>
>>>>>> Another thing I can't make up my mind is: when another freezer has freeze
>>>>>> the fs, should we wait unitl it finish, or print a warning in dmesg and
>>>>>> return -EBUSY?
>>>>>>
>>>>>> Since there are at least 2 places needs exclusive forzen state, I think
>>>>>> we can refactor helper functions of freeze/thaw for them. e.g.
>>>>>> int freeze_super_exclusive(struct super_block *sb, int frozen);
>>>>>> int thaw_super_exclusive(struct super_block *sb, int frozen);
>>>>>>
>>>>>> [0] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=repair-fscounters&id=c3a0d1de4d54ffb565dbc7092dfe1fb851940669
>>>>
>>>> I'm OK with the idea of new freeze state that does not allow userspace to
>>>> thaw the filesystem. But I don't really like the guts of filesystem
>>>> freezing being replicated inside XFS. It is bad enough that they are
>>>> replicated in [0], replicating them *once more* in another XFS file shows
>>>> we are definitely doing something wrong. And Luis will need yet another
>>>> incantation of the exlusive freeze for suspend-to-disk. So please guys get
>>>> together and reorganize the generic freezing code so that it supports
>>>> exclusive freeze (for in-kernel users) and works for your usecases instead
>>>> of replicating it inside XFS...
>>>
>>> I agree that too much replicating code is not good. It's necessary to
>>> create a generic exclusive freeze/thaw for all users. But for me, I don't
>>> have the confidence to do it well, because it requires good design and code
>>> changes will involve other filesystems. It's diffcult.
>>>
>>> However, I hope to be able to make progress on this unbind feature. Thus, I
>>> tend to refactor a common helper function for xfs first, and update the code
>>> later when the generic freeze is done.
>>
>> I think Darrick was thinking about working on a proper generic interface.
>> So please coordinate with him.
>
> I'll post a vfs generic kernelfreeze series later today.
>
> One thing I haven't figured out yet is what's supposed to happen when
> PREREMOVE is called on a frozen filesystem.
call PREREMOVE when:
1. freezed by kernel: we wait unitl kernel thaws -> not sure
2. freezed by userspace: we take over the control of freeze state:
a. userspace can't thaw before PREREMOVE is done
b. kernel keeps freeze state after PREREMOVE is done and before
userspace thaws
Since the unbind interface doesn't return any other errcode except
-ENODEV, the only thing I can think of to do is wait for the other one
done? If another one doesn't thaw after a long time waitting, we print
a "waitting too long" warning in dmesg. But I'm not sure if this is good.
> We don't want userspace to
> be able to thaw the fs while PREREMOVE is running, so I /guess/ that
> means we need some method for the kernel to take over a userspace
> freeze and then put it back when we're done?
As is designed by Luis, we can add sb->s_writers.frozen_by_user flag to
distinguish whether current freeze state is initiated by kernel or
userspace. In his patch, userspace can take over kernel's freeze. We
just need to switch the order.
--
Thanks,
Ruan.
>
> --D
>
>> Honza
>>
>> --
>> Jan Kara <jack@suse.com>
>> SUSE Labs, CR
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH v11.1 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind
2023-04-26 2:27 ` Shiyang Ruan
@ 2023-04-26 2:37 ` Darrick J. Wong
0 siblings, 0 replies; 18+ messages in thread
From: Darrick J. Wong @ 2023-04-26 2:37 UTC (permalink / raw)
To: Shiyang Ruan
Cc: Jan Kara, Luis Chamberlain, linux-fsdevel, nvdimm, linux-xfs,
linux-mm, dan.j.williams, willy, akpm
On Wed, Apr 26, 2023 at 10:27:43AM +0800, Shiyang Ruan wrote:
>
>
> 在 2023/4/25 23:18, Darrick J. Wong 写道:
> > On Tue, Apr 25, 2023 at 03:23:15PM +0200, Jan Kara wrote:
> > > On Tue 25-04-23 20:47:35, Shiyang Ruan wrote:
> > > >
> > > >
> > > > 在 2023/4/20 20:09, Jan Kara 写道:
> > > > > On Thu 20-04-23 10:07:39, Shiyang Ruan wrote:
> > > > > > 在 2023/4/12 18:52, Shiyang Ruan 写道:
> > > > > > > This is a RFC HOTFIX.
> > > > > > >
> > > > > > > This hotfix adds a exclusive forzen state to make sure any others won't
> > > > > > > thaw the fs during xfs_dax_notify_failure():
> > > > > > >
> > > > > > > #define SB_FREEZE_EXCLUSIVE (SB_FREEZE_COMPLETE + 2)
> > > > > > > Using +2 here is because Darrick's patch[0] is using +1. So, should we
> > > > > > > make these definitions global?
> > > > > > >
> > > > > > > Another thing I can't make up my mind is: when another freezer has freeze
> > > > > > > the fs, should we wait unitl it finish, or print a warning in dmesg and
> > > > > > > return -EBUSY?
> > > > > > >
> > > > > > > Since there are at least 2 places needs exclusive forzen state, I think
> > > > > > > we can refactor helper functions of freeze/thaw for them. e.g.
> > > > > > > int freeze_super_exclusive(struct super_block *sb, int frozen);
> > > > > > > int thaw_super_exclusive(struct super_block *sb, int frozen);
> > > > > > >
> > > > > > > [0] https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=repair-fscounters&id=c3a0d1de4d54ffb565dbc7092dfe1fb851940669
> > > > >
> > > > > I'm OK with the idea of new freeze state that does not allow userspace to
> > > > > thaw the filesystem. But I don't really like the guts of filesystem
> > > > > freezing being replicated inside XFS. It is bad enough that they are
> > > > > replicated in [0], replicating them *once more* in another XFS file shows
> > > > > we are definitely doing something wrong. And Luis will need yet another
> > > > > incantation of the exlusive freeze for suspend-to-disk. So please guys get
> > > > > together and reorganize the generic freezing code so that it supports
> > > > > exclusive freeze (for in-kernel users) and works for your usecases instead
> > > > > of replicating it inside XFS...
> > > >
> > > > I agree that too much replicating code is not good. It's necessary to
> > > > create a generic exclusive freeze/thaw for all users. But for me, I don't
> > > > have the confidence to do it well, because it requires good design and code
> > > > changes will involve other filesystems. It's diffcult.
> > > >
> > > > However, I hope to be able to make progress on this unbind feature. Thus, I
> > > > tend to refactor a common helper function for xfs first, and update the code
> > > > later when the generic freeze is done.
> > >
> > > I think Darrick was thinking about working on a proper generic interface.
> > > So please coordinate with him.
> >
> > I'll post a vfs generic kernelfreeze series later today.
> >
> > One thing I haven't figured out yet is what's supposed to happen when
> > PREREMOVE is called on a frozen filesystem.
>
> call PREREMOVE when:
> 1. freezed by kernel: we wait unitl kernel thaws -> not sure
> 2. freezed by userspace: we take over the control of freeze state:
> a. userspace can't thaw before PREREMOVE is done
> b. kernel keeps freeze state after PREREMOVE is done and before
> userspace thaws
>
> Since the unbind interface doesn't return any other errcode except -ENODEV,
> the only thing I can think of to do is wait for the other one done? If
> another one doesn't thaw after a long time waitting, we print a "waitting
> too long" warning in dmesg. But I'm not sure if this is good.
>
> > We don't want userspace to
> > be able to thaw the fs while PREREMOVE is running, so I /guess/ that
> > means we need some method for the kernel to take over a userspace
> > freeze and then put it back when we're done?
>
> As is designed by Luis, we can add sb->s_writers.frozen_by_user flag to
> distinguish whether current freeze state is initiated by kernel or
> userspace. In his patch, userspace can take over kernel's freeze. We just
> need to switch the order.
<nod> How does this patchset
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=a97da76ed5256d692a02ece01b4032dbf68cbf89
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=93310faf77480265b3bc784f6883f5af9ccfce3b
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/commit/?h=djwong-wtf&id=a68cea1aa317775046372840ee4f0ba5bdb75d9f
strike you?
I think for #2 above I could write a freeze_super_excl variant that
turns a userspace freeze into a kernel freeze, and a thaw_super_excl
variant that changes it back.
--D
>
>
> --
> Thanks,
> Ruan.
>
> >
> > --D
> >
> > > Honza
> > >
> > > --
> > > Jan Kara <jack@suse.com>
> > > SUSE Labs, CR
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2023-04-26 2:37 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-28 9:41 [PATCH v11 0/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
2023-03-28 9:41 ` [PATCH v11 1/2] xfs: fix the calculation of length and end Shiyang Ruan
2023-03-28 9:41 ` [PATCH v11 2/2] mm, pmem, xfs: Introduce MF_MEM_REMOVE for unbind Shiyang Ruan
2023-04-04 17:45 ` Darrick J. Wong
2023-04-05 1:28 ` Yasunori Gotou (Fujitsu)
2023-04-05 4:36 ` Dan Williams
2023-04-06 10:50 ` Shiyang Ruan
2023-04-06 14:54 ` Darrick J. Wong
2023-04-07 2:07 ` Shiyang Ruan
2023-04-12 10:52 ` [RFC PATCH v11.1 " Shiyang Ruan
2023-04-20 2:07 ` Shiyang Ruan
2023-04-20 12:09 ` Jan Kara
2023-04-25 12:47 ` Shiyang Ruan
2023-04-25 13:23 ` Jan Kara
2023-04-25 15:18 ` Darrick J. Wong
2023-04-26 2:27 ` Shiyang Ruan
2023-04-26 2:37 ` Darrick J. Wong
2023-04-04 4:33 ` [PATCH v11 0/2] " Shiyang Ruan
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox