* [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts
@ 2025-01-10 6:00 Alistair Popple
2025-01-10 6:00 ` [PATCH v6 01/26] fuse: Fix dax truncate/punch_hole fault path Alistair Popple
` (26 more replies)
0 siblings, 27 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Main updates since v5:
- Reworked patch 1 based on Dan's feedback.
- Fixed build issues on PPC and when CONFIG_PGTABLE_HAS_HUGE_LEAVES
is not defined.
- Minor comment formatting and documentation fixes.
- Removed the PTE_DEVMAP definitions from LoongArch which were added since
this series was initially written.
Main updates since v4:
- Removed most of the devdax/fsdax checks in fs/proc/task_mmu.c. This
means smaps/pagemap may contain DAX pages.
- Fixed rmap accounting of PUD mapped pages.
- Minor code clean-ups.
Main updates since v3:
- Rebased onto next-20241216. The rebase wasn't too difficult, but in
the interests of getting this out sooner for Andrew to look at, as he
requested, I have yet to extensively build/run test this version of
the series.
- Fixed a bunch of build breakages reported by John Hubbard and the
kernel test robot due to various combinations of CONFIG options.
- Split the rmap changes into a separate patch as suggested by David H.
- Reworded the description for the P2PDMA change.
Main updates since v2:
- Rename the DAX specific dax_insert_XXX functions to vmf_insert_XXX
and have them pass the vmf struct.
- Separate out the device DAX changes.
- Restore the page share mapping counting and associated warnings.
- Rework truncate to require file-systems to have previously called
dax_break_layout() to remove the address space mapping for a
page. This found several bugs which are fixed by the first half of
the series. The motivation for this was initially to allow the FS
DAX page-cache mappings to hold a reference on the page.
However that turned out to be a dead-end (see the comments on patch
21), but it found several bugs and I think overall it is an
improvement so I have left it here.
Device and FS DAX pages have always maintained their own page
reference counts without following the normal rules for page reference
counting. In particular pages are considered free when the refcount
hits one rather than zero, and refcounts are not added when mapping
the page.
Tracking this requires special PTE bits (PTE_DEVMAP) and a secondary
mechanism for allowing GUP to hold references on the page (see
get_dev_pagemap). However there doesn't seem to be any reason why FS
DAX pages need their own reference counting scheme.
By treating the refcounts on these pages the same way as normal pages
we can remove a lot of special checks. In particular pXd_trans_huge()
becomes the same as pXd_leaf(), although I haven't made that change
here. It also frees up a valuable SW-defined PTE bit on architectures
that have devmap PTE bits defined.
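To illustrate the model change at the heart of the series, here is a
rough sketch only (not a hunk from the series; the helper names are
illustrative):
	/*
	 * Old scheme: a FS/device DAX page counts as free/idle once its
	 * refcount drops to one, with GUP pins tracked via PTE_DEVMAP
	 * plus get_dev_pagemap().
	 */
	static bool dax_page_idle_old(struct page *page)
	{
		return page_ref_count(page) == 1;
	}

	/*
	 * With normal refcounting: mappings and GUP take references like
	 * any other page and the page is free once the refcount hits zero.
	 */
	static bool dax_page_idle_new(struct page *page)
	{
		return page_ref_count(page) == 0;
	}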
It also almost certainly allows further clean-up of the devmap managed
functions, but I have left that as a future improvement. It also
enables support for compound ZONE_DEVICE pages which is one of my
primary motivators for doing this work.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Tested-by: Alison Schofield <alison.schofield@intel.com>
---
Cc: lina@asahilina.net
Cc: zhang.lyra@gmail.com
Cc: gerald.schaefer@linux.ibm.com
Cc: dan.j.williams@intel.com
Cc: vishal.l.verma@intel.com
Cc: dave.jiang@intel.com
Cc: logang@deltatee.com
Cc: bhelgaas@google.com
Cc: jack@suse.cz
Cc: jgg@ziepe.ca
Cc: catalin.marinas@arm.com
Cc: will@kernel.org
Cc: mpe@ellerman.id.au
Cc: npiggin@gmail.com
Cc: dave.hansen@linux.intel.com
Cc: ira.weiny@intel.com
Cc: willy@infradead.org
Cc: djwong@kernel.org
Cc: tytso@mit.edu
Cc: linmiaohe@huawei.com
Cc: david@redhat.com
Cc: peterx@redhat.com
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: nvdimm@lists.linux.dev
Cc: linux-cxl@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-ext4@vger.kernel.org
Cc: linux-xfs@vger.kernel.org
Cc: jhubbard@nvidia.com
Cc: hch@lst.de
Cc: david@fromorbit.com
Cc: chenhuacai@kernel.org
Cc: kernel@xen0n.name
Cc: loongarch@lists.linux.dev
Alistair Popple (26):
fuse: Fix dax truncate/punch_hole fault path
fs/dax: Return unmapped busy pages from dax_layout_busy_page_range()
fs/dax: Don't skip locked entries when scanning entries
fs/dax: Refactor wait for dax idle page
fs/dax: Create a common implementation to break DAX layouts
fs/dax: Always remove DAX page-cache entries when breaking layouts
fs/dax: Ensure all pages are idle prior to filesystem unmount
fs/dax: Remove PAGE_MAPPING_DAX_SHARED mapping flag
mm/gup: Remove redundant check for PCI P2PDMA page
mm/mm_init: Move p2pdma page refcount initialisation to p2pdma
mm: Allow compound zone device pages
mm/memory: Enhance insert_page_into_pte_locked() to create writable mappings
mm/memory: Add vmf_insert_page_mkwrite()
rmap: Add support for PUD sized mappings to rmap
huge_memory: Add vmf_insert_folio_pud()
huge_memory: Add vmf_insert_folio_pmd()
memremap: Add is_devdax_page() and is_fsdax_page() helpers
mm/gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages
proc/task_mmu: Mark devdax and fsdax pages as always unpinned
mm/mlock: Skip ZONE_DEVICE PMDs during mlock
fs/dax: Properly refcount fs dax pages
device/dax: Properly refcount device dax pages when mapping
mm: Remove pXX_devmap callers
mm: Remove devmap related functions and page table bits
Revert "riscv: mm: Add support for ZONE_DEVICE"
Revert "LoongArch: Add ARCH_HAS_PTE_DEVMAP support"
Documentation/mm/arch_pgtable_helpers.rst | 6 +-
arch/arm64/Kconfig | 1 +-
arch/arm64/include/asm/pgtable-prot.h | 1 +-
arch/arm64/include/asm/pgtable.h | 24 +-
arch/loongarch/Kconfig | 1 +-
arch/loongarch/include/asm/pgtable-bits.h | 6 +-
arch/loongarch/include/asm/pgtable.h | 19 +-
arch/powerpc/Kconfig | 1 +-
arch/powerpc/include/asm/book3s/64/hash-4k.h | 6 +-
arch/powerpc/include/asm/book3s/64/hash-64k.h | 7 +-
arch/powerpc/include/asm/book3s/64/pgtable.h | 53 +---
arch/powerpc/include/asm/book3s/64/radix.h | 14 +-
arch/powerpc/mm/book3s64/hash_hugepage.c | 2 +-
arch/powerpc/mm/book3s64/hash_pgtable.c | 3 +-
arch/powerpc/mm/book3s64/hugetlbpage.c | 2 +-
arch/powerpc/mm/book3s64/pgtable.c | 10 +-
arch/powerpc/mm/book3s64/radix_pgtable.c | 5 +-
arch/powerpc/mm/pgtable.c | 2 +-
arch/riscv/Kconfig | 1 +-
arch/riscv/include/asm/pgtable-64.h | 20 +-
arch/riscv/include/asm/pgtable-bits.h | 1 +-
arch/riscv/include/asm/pgtable.h | 17 +-
arch/x86/Kconfig | 1 +-
arch/x86/include/asm/pgtable.h | 51 +---
arch/x86/include/asm/pgtable_types.h | 5 +-
drivers/dax/device.c | 15 +-
drivers/gpu/drm/nouveau/nouveau_dmem.c | 3 +-
drivers/nvdimm/pmem.c | 4 +-
drivers/pci/p2pdma.c | 19 +-
fs/dax.c | 363 ++++++++++++++-----
fs/ext4/inode.c | 43 +--
fs/fuse/dax.c | 30 +--
fs/fuse/dir.c | 2 +-
fs/fuse/file.c | 4 +-
fs/fuse/virtio_fs.c | 3 +-
fs/proc/task_mmu.c | 2 +-
fs/userfaultfd.c | 2 +-
fs/xfs/xfs_inode.c | 40 +-
fs/xfs/xfs_inode.h | 3 +-
fs/xfs/xfs_super.c | 18 +-
include/linux/dax.h | 37 ++-
include/linux/huge_mm.h | 12 +-
include/linux/memremap.h | 28 +-
include/linux/migrate.h | 4 +-
include/linux/mm.h | 40 +--
include/linux/mm_types.h | 16 +-
include/linux/mmzone.h | 12 +-
include/linux/page-flags.h | 6 +-
include/linux/pfn_t.h | 20 +-
include/linux/pgtable.h | 21 +-
include/linux/rmap.h | 15 +-
lib/test_hmm.c | 3 +-
mm/Kconfig | 4 +-
mm/debug_vm_pgtable.c | 59 +---
mm/gup.c | 176 +---------
mm/hmm.c | 12 +-
mm/huge_memory.c | 220 +++++++-----
mm/internal.h | 2 +-
mm/khugepaged.c | 2 +-
mm/madvise.c | 8 +-
mm/mapping_dirty_helpers.c | 4 +-
mm/memory-failure.c | 6 +-
mm/memory.c | 118 ++++--
mm/memremap.c | 59 +--
mm/migrate_device.c | 9 +-
mm/mlock.c | 2 +-
mm/mm_init.c | 23 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 5 +-
mm/page_vma_mapped.c | 5 +-
mm/pagewalk.c | 14 +-
mm/pgtable-generic.c | 7 +-
mm/rmap.c | 67 +++-
mm/swap.c | 2 +-
mm/truncate.c | 16 +-
mm/userfaultfd.c | 5 +-
mm/vmscan.c | 5 +-
77 files changed, 895 insertions(+), 961 deletions(-)
base-commit: e25c8d66f6786300b680866c0e0139981273feba
--
git-series 0.9.1
* [PATCH v6 01/26] fuse: Fix dax truncate/punch_hole fault path
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-02-05 13:03 ` Vivek Goyal
2025-01-10 6:00 ` [PATCH v6 02/26] fs/dax: Return unmapped busy pages from dax_layout_busy_page_range() Alistair Popple
` (25 subsequent siblings)
26 siblings, 1 reply; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch, Vivek Goyal
FS DAX requires file systems to call into the DAX layout prior to unlinking
inodes to ensure there is no ongoing DMA or other remote access to the
direct mapped page. The fuse file system implements
fuse_dax_break_layouts() to do this, which includes a comment indicating
that passing dmap_end == 0 leads to unmapping of the whole file.
However this is not true - passing dmap_end == 0 will not unmap anything
before dmap_start, and furthermore dax_layout_busy_page_range() will not
scan any of the range to see if there may be ongoing DMA access to the
range. Fix this by passing -1 for dmap_end to fuse_dax_break_layouts()
which will invalidate the entire file range to
dax_layout_busy_page_range().
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Co-developed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Fixes: 6ae330cad6ef ("virtiofs: serialize truncate/punch_hole and dax fault path")
Cc: Vivek Goyal <vgoyal@redhat.com>
---
Changes for v6:
- Original patch had a misplaced hunk due to a bad rebase.
- Reworked fix based on Dan's comments.
---
fs/fuse/dax.c | 1 -
fs/fuse/dir.c | 2 +-
fs/fuse/file.c | 4 ++--
3 files changed, 3 insertions(+), 4 deletions(-)
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 9abbc2f..455c4a1 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -681,7 +681,6 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
0, 0, fuse_wait_dax_page(inode));
}
-/* dmap_end == 0 leads to unmapping of whole file */
int fuse_dax_break_layouts(struct inode *inode, u64 dmap_start,
u64 dmap_end)
{
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 0b2f856..bc6c893 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -1936,7 +1936,7 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
if (FUSE_IS_DAX(inode) && is_truncate) {
filemap_invalidate_lock(mapping);
fault_blocked = true;
- err = fuse_dax_break_layouts(inode, 0, 0);
+ err = fuse_dax_break_layouts(inode, 0, -1);
if (err) {
filemap_invalidate_unlock(mapping);
return err;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 082ee37..cef7a8f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -253,7 +253,7 @@ static int fuse_open(struct inode *inode, struct file *file)
if (dax_truncate) {
filemap_invalidate_lock(inode->i_mapping);
- err = fuse_dax_break_layouts(inode, 0, 0);
+ err = fuse_dax_break_layouts(inode, 0, -1);
if (err)
goto out_inode_unlock;
}
@@ -2890,7 +2890,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
inode_lock(inode);
if (block_faults) {
filemap_invalidate_lock(inode->i_mapping);
- err = fuse_dax_break_layouts(inode, 0, 0);
+ err = fuse_dax_break_layouts(inode, 0, -1);
if (err)
goto out;
}
--
git-series 0.9.1
* [PATCH v6 02/26] fs/dax: Return unmapped busy pages from dax_layout_busy_page_range()
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
2025-01-10 6:00 ` [PATCH v6 01/26] fuse: Fix dax truncate/punch_hole fault path Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-10 6:00 ` [PATCH v6 03/26] fs/dax: Don't skip locked entries when scanning entries Alistair Popple
` (24 subsequent siblings)
26 siblings, 0 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
dax_layout_busy_page_range() is used by file systems to scan the DAX
page-cache to unmap mapped pages from user-space and to determine if
any pages in the given range are busy, either due to ongoing DMA or
other get_user_pages() usage.
Currently it checks to see if the file mapping is mapped into user-space
with mapping_mapped() and returns early if not, skipping the check for
DMA busy pages. This is wrong as pages may still be undergoing DMA
access even if they have subsequently been unmapped from
user-space. Fix this by dropping the check for mapping_mapped().
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
fs/dax.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/dax.c b/fs/dax.c
index 21b4740..5133568 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -690,7 +690,7 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping,
if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
return NULL;
- if (!dax_mapping(mapping) || !mapping_mapped(mapping))
+ if (!dax_mapping(mapping))
return NULL;
/* If end == LLONG_MAX, all pages from start to till end of file */
--
git-series 0.9.1
* [PATCH v6 03/26] fs/dax: Don't skip locked entries when scanning entries
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
2025-01-10 6:00 ` [PATCH v6 01/26] fuse: Fix dax truncate/punch_hole fault path Alistair Popple
2025-01-10 6:00 ` [PATCH v6 02/26] fs/dax: Return unmapped busy pages from dax_layout_busy_page_range() Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-10 6:00 ` [PATCH v6 04/26] fs/dax: Refactor wait for dax idle page Alistair Popple
` (23 subsequent siblings)
26 siblings, 0 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Several functions internal to FS DAX use the following pattern when
trying to obtain an unlocked entry:
	xas_for_each(&xas, entry, end_idx) {
		if (dax_is_locked(entry))
			entry = get_unlocked_entry(&xas, 0);
This is problematic because get_unlocked_entry() will get the next
present entry in the range, and the next entry may not be
locked. Therefore any processing of the original locked entry will be
skipped. This can cause dax_layout_busy_page_range() to miss DMA-busy
pages in the range, leading file systems to free blocks whilst DMA
operations are ongoing which can lead to file system corruption.
Instead callers from within an xas_for_each() loop should be waiting
for the current entry to be unlocked without advancing the XArray
state, so a new function is introduced to do that wait.
Also, while we are here, rename get_unlocked_entry() to
get_next_unlocked_entry() to make it clear that it may advance the
iterator state.
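With the new helper, callers converted below wait on the entry they
are currently looking at rather than skipping ahead. Roughly, the
pattern becomes (a sketch mirroring the hunks in this patch):
	xas_for_each(&xas, entry, end_idx) {
		if (WARN_ON_ONCE(!xa_is_value(entry)))
			continue;
		entry = wait_entry_unlocked_exclusive(&xas, entry);
		if (!entry)
			continue;
		/* ... process the now-unlocked entry ... */
		put_unlocked_entry(&xas, entry, WAKE_NEXT);
	}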
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
fs/dax.c | 50 +++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 41 insertions(+), 9 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 5133568..d010c10 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -206,7 +206,7 @@ static void dax_wake_entry(struct xa_state *xas, void *entry,
*
* Must be called with the i_pages lock held.
*/
-static void *get_unlocked_entry(struct xa_state *xas, unsigned int order)
+static void *get_next_unlocked_entry(struct xa_state *xas, unsigned int order)
{
void *entry;
struct wait_exceptional_entry_queue ewait;
@@ -236,6 +236,37 @@ static void *get_unlocked_entry(struct xa_state *xas, unsigned int order)
}
/*
+ * Wait for the given entry to become unlocked. Caller must hold the i_pages
+ * lock and call either put_unlocked_entry() if it did not lock the entry or
+ * dax_unlock_entry() if it did. Returns an unlocked entry if still present.
+ */
+static void *wait_entry_unlocked_exclusive(struct xa_state *xas, void *entry)
+{
+ struct wait_exceptional_entry_queue ewait;
+ wait_queue_head_t *wq;
+
+ init_wait(&ewait.wait);
+ ewait.wait.func = wake_exceptional_entry_func;
+
+ while (unlikely(dax_is_locked(entry))) {
+ wq = dax_entry_waitqueue(xas, entry, &ewait.key);
+ prepare_to_wait_exclusive(wq, &ewait.wait,
+ TASK_UNINTERRUPTIBLE);
+ xas_pause(xas);
+ xas_unlock_irq(xas);
+ schedule();
+ finish_wait(wq, &ewait.wait);
+ xas_lock_irq(xas);
+ entry = xas_load(xas);
+ }
+
+ if (xa_is_internal(entry))
+ return NULL;
+
+ return entry;
+}
+
+/*
* The only thing keeping the address space around is the i_pages lock
* (it's cycled in clear_inode() after removing the entries from i_pages)
* After we call xas_unlock_irq(), we cannot touch xas->xa.
@@ -250,7 +281,7 @@ static void wait_entry_unlocked(struct xa_state *xas, void *entry)
wq = dax_entry_waitqueue(xas, entry, &ewait.key);
/*
- * Unlike get_unlocked_entry() there is no guarantee that this
+ * Unlike get_next_unlocked_entry() there is no guarantee that this
* path ever successfully retrieves an unlocked entry before an
* inode dies. Perform a non-exclusive wait in case this path
* never successfully performs its own wake up.
@@ -580,7 +611,7 @@ static void *grab_mapping_entry(struct xa_state *xas,
retry:
pmd_downgrade = false;
xas_lock_irq(xas);
- entry = get_unlocked_entry(xas, order);
+ entry = get_next_unlocked_entry(xas, order);
if (entry) {
if (dax_is_conflict(entry))
@@ -716,8 +747,7 @@ struct page *dax_layout_busy_page_range(struct address_space *mapping,
xas_for_each(&xas, entry, end_idx) {
if (WARN_ON_ONCE(!xa_is_value(entry)))
continue;
- if (unlikely(dax_is_locked(entry)))
- entry = get_unlocked_entry(&xas, 0);
+ entry = wait_entry_unlocked_exclusive(&xas, entry);
if (entry)
page = dax_busy_page(entry);
put_unlocked_entry(&xas, entry, WAKE_NEXT);
@@ -750,7 +780,7 @@ static int __dax_invalidate_entry(struct address_space *mapping,
void *entry;
xas_lock_irq(&xas);
- entry = get_unlocked_entry(&xas, 0);
+ entry = get_next_unlocked_entry(&xas, 0);
if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
goto out;
if (!trunc &&
@@ -776,7 +806,9 @@ static int __dax_clear_dirty_range(struct address_space *mapping,
xas_lock_irq(&xas);
xas_for_each(&xas, entry, end) {
- entry = get_unlocked_entry(&xas, 0);
+ entry = wait_entry_unlocked_exclusive(&xas, entry);
+ if (!entry)
+ continue;
xas_clear_mark(&xas, PAGECACHE_TAG_DIRTY);
xas_clear_mark(&xas, PAGECACHE_TAG_TOWRITE);
put_unlocked_entry(&xas, entry, WAKE_NEXT);
@@ -940,7 +972,7 @@ static int dax_writeback_one(struct xa_state *xas, struct dax_device *dax_dev,
if (unlikely(dax_is_locked(entry))) {
void *old_entry = entry;
- entry = get_unlocked_entry(xas, 0);
+ entry = get_next_unlocked_entry(xas, 0);
/* Entry got punched out / reallocated? */
if (!entry || WARN_ON_ONCE(!xa_is_value(entry)))
@@ -1949,7 +1981,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
vm_fault_t ret;
xas_lock_irq(&xas);
- entry = get_unlocked_entry(&xas, order);
+ entry = get_next_unlocked_entry(&xas, order);
/* Did we race with someone splitting entry or so? */
if (!entry || dax_is_conflict(entry) ||
(order == 0 && !dax_is_pte_entry(entry))) {
--
git-series 0.9.1
* [PATCH v6 04/26] fs/dax: Refactor wait for dax idle page
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (2 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 03/26] fs/dax: Don't skip locked entries when scanning entries Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-10 6:00 ` [PATCH v6 05/26] fs/dax: Create a common implementation to break DAX layouts Alistair Popple
` (22 subsequent siblings)
26 siblings, 0 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
A FS DAX page is considered idle when its refcount drops to one. This
is currently open-coded in all file systems supporting FS DAX. Move
the idle detection to a common function to make future changes easier.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>
---
fs/ext4/inode.c | 5 +----
fs/fuse/dax.c | 4 +---
fs/xfs/xfs_inode.c | 4 +---
include/linux/dax.h | 8 ++++++++
4 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7c54ae5..cc1acb1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3922,10 +3922,7 @@ int ext4_break_layouts(struct inode *inode)
if (!page)
return 0;
- error = ___wait_var_event(&page->_refcount,
- atomic_read(&page->_refcount) == 1,
- TASK_INTERRUPTIBLE, 0, 0,
- ext4_wait_dax_page(inode));
+ error = dax_wait_page_idle(page, ext4_wait_dax_page, inode);
} while (error == 0);
return error;
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index 455c4a1..d2ff482 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -676,9 +676,7 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
return 0;
*retry = true;
- return ___wait_var_event(&page->_refcount,
- atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
- 0, 0, fuse_wait_dax_page(inode));
+ return dax_wait_page_idle(page, fuse_wait_dax_page, inode);
}
int fuse_dax_break_layouts(struct inode *inode, u64 dmap_start,
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index c8ad260..42ea203 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3000,9 +3000,7 @@ xfs_break_dax_layouts(
return 0;
*retry = true;
- return ___wait_var_event(&page->_refcount,
- atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE,
- 0, 0, xfs_wait_dax_page(inode));
+ return dax_wait_page_idle(page, xfs_wait_dax_page, inode);
}
int
diff --git a/include/linux/dax.h b/include/linux/dax.h
index df41a00..9b1ce98 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -207,6 +207,14 @@ int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
const struct iomap_ops *ops);
+static inline int dax_wait_page_idle(struct page *page,
+ void (cb)(struct inode *),
+ struct inode *inode)
+{
+ return ___wait_var_event(page, page_ref_count(page) == 1,
+ TASK_INTERRUPTIBLE, 0, 0, cb(inode));
+}
+
#if IS_ENABLED(CONFIG_DAX)
int dax_read_lock(void);
void dax_read_unlock(int id);
--
git-series 0.9.1
* [PATCH v6 05/26] fs/dax: Create a common implementation to break DAX layouts
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (3 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 04/26] fs/dax: Refactor wait for dax idle page Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-10 16:44 ` Darrick J. Wong
` (3 more replies)
2025-01-10 6:00 ` [PATCH v6 06/26] fs/dax: Always remove DAX page-cache entries when breaking layouts Alistair Popple
` (21 subsequent siblings)
26 siblings, 4 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Prior to freeing a block, file systems supporting FS DAX must check
that the associated pages are both unmapped from user-space and not
undergoing DMA or other access from e.g. get_user_pages(). This is
achieved by unmapping the file range and scanning the FS DAX
page-cache to see if any pages within the mapping have an elevated
refcount.
This is done using two functions: dax_layout_busy_page_range(), which
returns a busy page in the range, and dax_wait_page_idle(), which
waits for that page's refcount to become idle. Rather than open-coding
this, introduce a common implementation to both unmap and wait for the
page to become idle.
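After this change a filesystem's break-layouts helper reduces to a
single call. A sketch based on the ext4 conversion in this series (the
wait callback drops and retakes the invalidate lock around schedule()):
	static void ext4_wait_dax_page(struct inode *inode)
	{
		filemap_invalidate_unlock(inode->i_mapping);
		schedule();
		filemap_invalidate_lock(inode->i_mapping);
	}

	int ext4_break_layouts(struct inode *inode)
	{
		return dax_break_mapping_inode(inode, ext4_wait_dax_page);
	}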
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
Changes for v5:
- Don't wait for idle pages on non-DAX mappings
Changes for v4:
- Fixed some build breakage due to missing symbol exports reported by
John Hubbard (thanks!).
---
fs/dax.c | 33 +++++++++++++++++++++++++++++++++
fs/ext4/inode.c | 10 +---------
fs/fuse/dax.c | 27 +++------------------------
fs/xfs/xfs_inode.c | 23 +++++------------------
fs/xfs/xfs_inode.h | 2 +-
include/linux/dax.h | 21 +++++++++++++++++++++
mm/madvise.c | 8 ++++----
7 files changed, 68 insertions(+), 56 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index d010c10..9c3bd07 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -845,6 +845,39 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
return ret;
}
+static int wait_page_idle(struct page *page,
+ void (cb)(struct inode *),
+ struct inode *inode)
+{
+ return ___wait_var_event(page, page_ref_count(page) == 1,
+ TASK_INTERRUPTIBLE, 0, 0, cb(inode));
+}
+
+/*
+ * Unmaps the inode and waits for any DMA to complete prior to deleting the
+ * DAX mapping entries for the range.
+ */
+int dax_break_mapping(struct inode *inode, loff_t start, loff_t end,
+ void (cb)(struct inode *))
+{
+ struct page *page;
+ int error;
+
+ if (!dax_mapping(inode->i_mapping))
+ return 0;
+
+ do {
+ page = dax_layout_busy_page_range(inode->i_mapping, start, end);
+ if (!page)
+ break;
+
+ error = wait_page_idle(page, cb, inode);
+ } while (error == 0);
+
+ return error;
+}
+EXPORT_SYMBOL_GPL(dax_break_mapping);
+
/*
* Invalidate DAX entry if it is clean.
*/
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cc1acb1..ee8e83f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3917,15 +3917,7 @@ int ext4_break_layouts(struct inode *inode)
if (WARN_ON_ONCE(!rwsem_is_locked(&inode->i_mapping->invalidate_lock)))
return -EINVAL;
- do {
- page = dax_layout_busy_page(inode->i_mapping);
- if (!page)
- return 0;
-
- error = dax_wait_page_idle(page, ext4_wait_dax_page, inode);
- } while (error == 0);
-
- return error;
+ return dax_break_mapping_inode(inode, ext4_wait_dax_page);
}
/*
diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
index d2ff482..410af88 100644
--- a/fs/fuse/dax.c
+++ b/fs/fuse/dax.c
@@ -665,33 +665,12 @@ static void fuse_wait_dax_page(struct inode *inode)
filemap_invalidate_lock(inode->i_mapping);
}
-/* Should be called with mapping->invalidate_lock held exclusively */
-static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
- loff_t start, loff_t end)
-{
- struct page *page;
-
- page = dax_layout_busy_page_range(inode->i_mapping, start, end);
- if (!page)
- return 0;
-
- *retry = true;
- return dax_wait_page_idle(page, fuse_wait_dax_page, inode);
-}
-
+/* Should be called with mapping->invalidate_lock held exclusively. */
int fuse_dax_break_layouts(struct inode *inode, u64 dmap_start,
u64 dmap_end)
{
- bool retry;
- int ret;
-
- do {
- retry = false;
- ret = __fuse_dax_break_layouts(inode, &retry, dmap_start,
- dmap_end);
- } while (ret == 0 && retry);
-
- return ret;
+ return dax_break_mapping(inode, dmap_start, dmap_end,
+ fuse_wait_dax_page);
}
ssize_t fuse_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 42ea203..295730a 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2715,21 +2715,17 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
struct xfs_inode *ip2)
{
int error;
- bool retry;
struct page *page;
if (ip1->i_ino > ip2->i_ino)
swap(ip1, ip2);
again:
- retry = false;
/* Lock the first inode */
xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
- error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
- if (error || retry) {
+ error = xfs_break_dax_layouts(VFS_I(ip1));
+ if (error) {
xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
- if (error == 0 && retry)
- goto again;
return error;
}
@@ -2988,19 +2984,11 @@ xfs_wait_dax_page(
int
xfs_break_dax_layouts(
- struct inode *inode,
- bool *retry)
+ struct inode *inode)
{
- struct page *page;
-
xfs_assert_ilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL);
- page = dax_layout_busy_page(inode->i_mapping);
- if (!page)
- return 0;
-
- *retry = true;
- return dax_wait_page_idle(page, xfs_wait_dax_page, inode);
+ return dax_break_mapping_inode(inode, xfs_wait_dax_page);
}
int
@@ -3018,8 +3006,7 @@ xfs_break_layouts(
retry = false;
switch (reason) {
case BREAK_UNMAP:
- error = xfs_break_dax_layouts(inode, &retry);
- if (error || retry)
+ if (xfs_break_dax_layouts(inode))
break;
fallthrough;
case BREAK_WRITE:
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 1648dc5..c4f03f6 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -593,7 +593,7 @@ xfs_itruncate_extents(
return xfs_itruncate_extents_flags(tpp, ip, whichfork, new_size, 0);
}
-int xfs_break_dax_layouts(struct inode *inode, bool *retry);
+int xfs_break_dax_layouts(struct inode *inode);
int xfs_break_layouts(struct inode *inode, uint *iolock,
enum layout_break_reason reason);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9b1ce98..f6583d3 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -228,6 +228,20 @@ static inline void dax_read_unlock(int id)
{
}
#endif /* CONFIG_DAX */
+
+#if !IS_ENABLED(CONFIG_FS_DAX)
+static inline int __must_check dax_break_mapping(struct inode *inode,
+ loff_t start, loff_t end, void (cb)(struct inode *))
+{
+ return 0;
+}
+
+static inline void dax_break_mapping_uninterruptible(struct inode *inode,
+ void (cb)(struct inode *))
+{
+}
+#endif
+
bool dax_alive(struct dax_device *dax_dev);
void *dax_get_private(struct dax_device *dax_dev);
long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
@@ -251,6 +265,13 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
pgoff_t index);
+int __must_check dax_break_mapping(struct inode *inode, loff_t start,
+ loff_t end, void (cb)(struct inode *));
+static inline int __must_check dax_break_mapping_inode(struct inode *inode,
+ void (cb)(struct inode *))
+{
+ return dax_break_mapping(inode, 0, LLONG_MAX, cb);
+}
int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
struct inode *dest, loff_t destoff,
loff_t len, bool *is_same,
diff --git a/mm/madvise.c b/mm/madvise.c
index 49f3a75..1f4c99e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1063,7 +1063,7 @@ static int guard_install_pud_entry(pud_t *pud, unsigned long addr,
pud_t pudval = pudp_get(pud);
/* If huge return >0 so we abort the operation + zap. */
- return pud_trans_huge(pudval) || pud_devmap(pudval);
+ return pud_trans_huge(pudval);
}
static int guard_install_pmd_entry(pmd_t *pmd, unsigned long addr,
@@ -1072,7 +1072,7 @@ static int guard_install_pmd_entry(pmd_t *pmd, unsigned long addr,
pmd_t pmdval = pmdp_get(pmd);
/* If huge return >0 so we abort the operation + zap. */
- return pmd_trans_huge(pmdval) || pmd_devmap(pmdval);
+ return pmd_trans_huge(pmdval);
}
static int guard_install_pte_entry(pte_t *pte, unsigned long addr,
@@ -1183,7 +1183,7 @@ static int guard_remove_pud_entry(pud_t *pud, unsigned long addr,
pud_t pudval = pudp_get(pud);
/* If huge, cannot have guard pages present, so no-op - skip. */
- if (pud_trans_huge(pudval) || pud_devmap(pudval))
+ if (pud_trans_huge(pudval))
walk->action = ACTION_CONTINUE;
return 0;
@@ -1195,7 +1195,7 @@ static int guard_remove_pmd_entry(pmd_t *pmd, unsigned long addr,
pmd_t pmdval = pmdp_get(pmd);
/* If huge, cannot have guard pages present, so no-op - skip. */
- if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval))
+ if (pmd_trans_huge(pmdval))
walk->action = ACTION_CONTINUE;
return 0;
--
git-series 0.9.1
* [PATCH v6 06/26] fs/dax: Always remove DAX page-cache entries when breaking layouts
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (4 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 05/26] fs/dax: Create a common implementation to break DAX layouts Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-13 23:31 ` Dan Williams
2025-01-10 6:00 ` [PATCH v6 07/26] fs/dax: Ensure all pages are idle prior to filesystem unmount Alistair Popple
` (20 subsequent siblings)
26 siblings, 1 reply; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Prior to any truncation operations, file systems call
dax_break_mapping() to ensure pages in the range are not undergoing
DMA. Later DAX page-cache entries will be removed by
truncate_folio_batch_exceptionals() in the generic page-cache code.
However this makes it possible for folios to be removed from the
page-cache even though they are still DMA busy if the file-system
hasn't called dax_break_mapping(). It also means they can never be
waited on in future because FS DAX will lose track of them once the
page-cache entry has been deleted.
Instead it is better to delete the FS DAX entry when the file-system
calls dax_break_mapping() as part of its truncate operation. This
ensures only idle pages can be removed from the FS DAX page-cache and
makes it easy to detect if a file-system hasn't called
dax_break_mapping() prior to a truncate operation.
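With this change the core of dax_break_mapping() both waits for the
pages in the range to become idle and then removes the DAX page-cache
entries. Roughly (a sketch of the resulting logic, see the hunk below):
	do {
		page = dax_layout_busy_page_range(inode->i_mapping, start, end);
		if (!page)
			break;

		error = wait_page_idle(page, cb, inode);
	} while (error == 0);

	/* Only remove the entries once every page in the range is idle. */
	if (!page)
		dax_delete_mapping_range(inode->i_mapping, start, end);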
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
Ideally I think we would move the whole wait-for-idle logic directly
into the truncate paths. However this is difficult for a few
reasons. Each filesystem needs it's own wait callback, although a new
address space operation could address that. More problematic is that
the wait-for-idle can fail as the wait is TASK_INTERRUPTIBLE, but none
of the generic truncate paths allow for failure.
So it ends up being easier to continue to let file systems call this
and check that they behave as expected.
---
fs/dax.c | 33 +++++++++++++++++++++++++++++++++
fs/xfs/xfs_inode.c | 6 ++++++
include/linux/dax.h | 2 ++
mm/truncate.c | 16 +++++++++++++++-
4 files changed, 56 insertions(+), 1 deletion(-)
diff --git a/fs/dax.c b/fs/dax.c
index 9c3bd07..7008a73 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -845,6 +845,36 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
return ret;
}
+void dax_delete_mapping_range(struct address_space *mapping,
+ loff_t start, loff_t end)
+{
+ void *entry;
+ pgoff_t start_idx = start >> PAGE_SHIFT;
+ pgoff_t end_idx;
+ XA_STATE(xas, &mapping->i_pages, start_idx);
+
+ /* If end == LLONG_MAX, all pages from start to till end of file */
+ if (end == LLONG_MAX)
+ end_idx = ULONG_MAX;
+ else
+ end_idx = end >> PAGE_SHIFT;
+
+ xas_lock_irq(&xas);
+ xas_for_each(&xas, entry, end_idx) {
+ if (!xa_is_value(entry))
+ continue;
+ entry = wait_entry_unlocked_exclusive(&xas, entry);
+ if (!entry)
+ continue;
+ dax_disassociate_entry(entry, mapping, true);
+ xas_store(&xas, NULL);
+ mapping->nrpages -= 1UL << dax_entry_order(entry);
+ put_unlocked_entry(&xas, entry, WAKE_ALL);
+ }
+ xas_unlock_irq(&xas);
+}
+EXPORT_SYMBOL_GPL(dax_delete_mapping_range);
+
static int wait_page_idle(struct page *page,
void (cb)(struct inode *),
struct inode *inode)
@@ -874,6 +904,9 @@ int dax_break_mapping(struct inode *inode, loff_t start, loff_t end,
error = wait_page_idle(page, cb, inode);
} while (error == 0);
+ if (!page)
+ dax_delete_mapping_range(inode->i_mapping, start, end);
+
return error;
}
EXPORT_SYMBOL_GPL(dax_break_mapping);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 295730a..4410b42 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2746,6 +2746,12 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
goto again;
}
+ /*
+ * Normally xfs_break_dax_layouts() would delete the mapping entries as well so
+ * do that here.
+ */
+ dax_delete_mapping_range(VFS_I(ip2)->i_mapping, 0, LLONG_MAX);
+
return 0;
}
diff --git a/include/linux/dax.h b/include/linux/dax.h
index f6583d3..ef9e02c 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -263,6 +263,8 @@ vm_fault_t dax_iomap_fault(struct vm_fault *vmf, unsigned int order,
vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
unsigned int order, pfn_t pfn);
int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
+void dax_delete_mapping_range(struct address_space *mapping,
+ loff_t start, loff_t end);
int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
pgoff_t index);
int __must_check dax_break_mapping(struct inode *inode, loff_t start,
diff --git a/mm/truncate.c b/mm/truncate.c
index 7c304d2..b7f51a6 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -78,8 +78,22 @@ static void truncate_folio_batch_exceptionals(struct address_space *mapping,
if (dax_mapping(mapping)) {
for (i = j; i < nr; i++) {
- if (xa_is_value(fbatch->folios[i]))
+ if (xa_is_value(fbatch->folios[i])) {
+ /*
+ * File systems should already have called
+ * dax_break_mapping_entry() to remove all DAX
+ * entries while holding a lock to prevent
+ * establishing new entries. Therefore we
+ * shouldn't find any here.
+ */
+ WARN_ON_ONCE(1);
+
+ /*
+ * Delete the mapping so truncate_pagecache()
+ * doesn't loop forever.
+ */
dax_delete_mapping_entry(mapping, indices[i]);
+ }
}
goto out;
}
--
git-series 0.9.1
* [PATCH v6 07/26] fs/dax: Ensure all pages are idle prior to filesystem unmount
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (5 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 06/26] fs/dax: Always remove DAX page-cache entries when breaking layouts Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-10 16:50 ` Darrick J. Wong
2025-01-13 23:42 ` Dan Williams
2025-01-10 6:00 ` [PATCH v6 08/26] fs/dax: Remove PAGE_MAPPING_DAX_SHARED mapping flag Alistair Popple
` (19 subsequent siblings)
26 siblings, 2 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
File systems call dax_break_mapping() prior to reallocating file
system blocks to ensure the page is not undergoing any DMA or other
accesses. Generally this is needed when a file is truncated to ensure
that if a block is reallocated nothing is writing to it. However
filesystems currently don't call this when an FS DAX inode is evicted.
This can cause problems when the file system is unmounted as a page
can still be undergoing DMA or other remote access after
unmount. This means that if the file system is remounted any truncate or
other operation which requires the underlying file system block to be
freed will not wait for the remote access to complete. Therefore a
busy block may be reallocated to a new file leading to corruption.
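The fix below hooks inode eviction so the DAX layout is broken
(uninterruptibly) before the page cache is torn down. In essence (a
sketch only; fs_wait_dax_page stands in for the filesystem's wait
callback, e.g. ext4_wait_dax_page, and the helper returns early for
non-DAX mappings):
	/* On inode eviction, before truncating the page cache: */
	dax_break_mapping_uninterruptible(inode, fs_wait_dax_page);
	truncate_inode_pages_final(&inode->i_data);
	clear_inode(inode);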
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
Changes for v5:
- Don't wait for pages to be idle in non-DAX mappings
---
fs/dax.c | 29 +++++++++++++++++++++++++++++
fs/ext4/inode.c | 32 ++++++++++++++------------------
fs/xfs/xfs_inode.c | 9 +++++++++
fs/xfs/xfs_inode.h | 1 +
fs/xfs/xfs_super.c | 18 ++++++++++++++++++
include/linux/dax.h | 2 ++
6 files changed, 73 insertions(+), 18 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 7008a73..4e49cc4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -883,6 +883,14 @@ static int wait_page_idle(struct page *page,
TASK_INTERRUPTIBLE, 0, 0, cb(inode));
}
+static void wait_page_idle_uninterruptible(struct page *page,
+ void (cb)(struct inode *),
+ struct inode *inode)
+{
+ ___wait_var_event(page, page_ref_count(page) == 1,
+ TASK_UNINTERRUPTIBLE, 0, 0, cb(inode));
+}
+
/*
* Unmaps the inode and waits for any DMA to complete prior to deleting the
* DAX mapping entries for the range.
@@ -911,6 +919,27 @@ int dax_break_mapping(struct inode *inode, loff_t start, loff_t end,
}
EXPORT_SYMBOL_GPL(dax_break_mapping);
+void dax_break_mapping_uninterruptible(struct inode *inode,
+ void (cb)(struct inode *))
+{
+ struct page *page;
+
+ if (!dax_mapping(inode->i_mapping))
+ return;
+
+ do {
+ page = dax_layout_busy_page_range(inode->i_mapping, 0,
+ LLONG_MAX);
+ if (!page)
+ break;
+
+ wait_page_idle_uninterruptible(page, cb, inode);
+ } while (true);
+
+ dax_delete_mapping_range(inode->i_mapping, 0, LLONG_MAX);
+}
+EXPORT_SYMBOL_GPL(dax_break_mapping_uninterruptible);
+
/*
* Invalidate DAX entry if it is clean.
*/
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ee8e83f..fa35161 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -163,6 +163,18 @@ int ext4_inode_is_fast_symlink(struct inode *inode)
(inode->i_size < EXT4_N_BLOCKS * 4);
}
+static void ext4_wait_dax_page(struct inode *inode)
+{
+ filemap_invalidate_unlock(inode->i_mapping);
+ schedule();
+ filemap_invalidate_lock(inode->i_mapping);
+}
+
+int ext4_break_layouts(struct inode *inode)
+{
+ return dax_break_mapping_inode(inode, ext4_wait_dax_page);
+}
+
/*
* Called at the last iput() if i_nlink is zero.
*/
@@ -181,6 +193,8 @@ void ext4_evict_inode(struct inode *inode)
trace_ext4_evict_inode(inode);
+ dax_break_mapping_uninterruptible(inode, ext4_wait_dax_page);
+
if (EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL)
ext4_evict_ea_inode(inode);
if (inode->i_nlink) {
@@ -3902,24 +3916,6 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
return ret;
}
-static void ext4_wait_dax_page(struct inode *inode)
-{
- filemap_invalidate_unlock(inode->i_mapping);
- schedule();
- filemap_invalidate_lock(inode->i_mapping);
-}
-
-int ext4_break_layouts(struct inode *inode)
-{
- struct page *page;
- int error;
-
- if (WARN_ON_ONCE(!rwsem_is_locked(&inode->i_mapping->invalidate_lock)))
- return -EINVAL;
-
- return dax_break_mapping_inode(inode, ext4_wait_dax_page);
-}
-
/*
* ext4_punch_hole: punches a hole in a file by releasing the blocks
* associated with the given offset and length
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 4410b42..c7ec5ab 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2997,6 +2997,15 @@ xfs_break_dax_layouts(
return dax_break_mapping_inode(inode, xfs_wait_dax_page);
}
+void
+xfs_break_dax_layouts_uninterruptible(
+ struct inode *inode)
+{
+ xfs_assert_ilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL);
+
+ dax_break_mapping_uninterruptible(inode, xfs_wait_dax_page);
+}
+
int
xfs_break_layouts(
struct inode *inode,
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index c4f03f6..613797a 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -594,6 +594,7 @@ xfs_itruncate_extents(
}
int xfs_break_dax_layouts(struct inode *inode);
+void xfs_break_dax_layouts_uninterruptible(struct inode *inode);
int xfs_break_layouts(struct inode *inode, uint *iolock,
enum layout_break_reason reason);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 8524b9d..73ec060 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -751,6 +751,23 @@ xfs_fs_drop_inode(
return generic_drop_inode(inode);
}
+STATIC void
+xfs_fs_evict_inode(
+ struct inode *inode)
+{
+ struct xfs_inode *ip = XFS_I(inode);
+ uint iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
+
+ if (IS_DAX(inode)) {
+ xfs_ilock(ip, iolock);
+ xfs_break_dax_layouts_uninterruptible(inode);
+ xfs_iunlock(ip, iolock);
+ }
+
+ truncate_inode_pages_final(&inode->i_data);
+ clear_inode(inode);
+}
+
static void
xfs_mount_free(
struct xfs_mount *mp)
@@ -1189,6 +1206,7 @@ static const struct super_operations xfs_super_operations = {
.destroy_inode = xfs_fs_destroy_inode,
.dirty_inode = xfs_fs_dirty_inode,
.drop_inode = xfs_fs_drop_inode,
+ .evict_inode = xfs_fs_evict_inode,
.put_super = xfs_fs_put_super,
.sync_fs = xfs_fs_sync_fs,
.freeze_fs = xfs_fs_freeze,
diff --git a/include/linux/dax.h b/include/linux/dax.h
index ef9e02c..7c3773f 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -274,6 +274,8 @@ static inline int __must_check dax_break_mapping_inode(struct inode *inode,
{
return dax_break_mapping(inode, 0, LLONG_MAX, cb);
}
+void dax_break_mapping_uninterruptible(struct inode *inode,
+ void (cb)(struct inode *));
int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
struct inode *dest, loff_t destoff,
loff_t len, bool *is_same,
--
git-series 0.9.1
* [PATCH v6 08/26] fs/dax: Remove PAGE_MAPPING_DAX_SHARED mapping flag
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (6 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 07/26] fs/dax: Ensure all pages are idle prior to filesystem unmount Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-14 0:52 ` Dan Williams
2025-01-14 14:47 ` David Hildenbrand
2025-01-10 6:00 ` [PATCH v6 09/26] mm/gup: Remove redundant check for PCI P2PDMA page Alistair Popple
` (18 subsequent siblings)
26 siblings, 2 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
PAGE_MAPPING_DAX_SHARED is the same as PAGE_MAPPING_ANON. This isn't
currently a problem because FS DAX pages are treated
specially. However a future change will make FS DAX pages more like
normal pages, so folio_test_anon() must not return true for a FS DAX
page.
We could explicitly test for a FS DAX page in folio_test_anon(),
etc. however the PAGE_MAPPING_DAX_SHARED flag isn't actually
needed. Instead we can use the page->mapping field to implicitly track
the first mapping of a page. If page->mapping is non-NULL it implies
the page is associated with a single mapping at page->index. If the
page is associated with a second mapping clear page->mapping and set
page->share to 1.
This is possible because a shared mapping implies the file-system
implements dax_holder_operations, which makes ->mapping and ->index
(a union with ->share) unused.
The page is considered shared when page->mapping == NULL and
page->share > 0; otherwise a non-NULL page->mapping implies the page
is associated with a single address space. This also makes it easier
for a future change to detect when a page is first mapped into an
address space which requires special handling.
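To summarise the states this relies on (illustrative only):
	page->mapping != NULL                  associated with a single
	                                       address space at page->index
	page->mapping == NULL, page->share > 0 shared; the filesystem tracks
	                                       the mappings via
	                                       dax_holder_operations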
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
fs/dax.c | 45 +++++++++++++++++++++++++--------------
include/linux/page-flags.h | 6 +-----
2 files changed, 29 insertions(+), 22 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 4e49cc4..d35dbe1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -351,38 +351,41 @@ static unsigned long dax_end_pfn(void *entry)
for (pfn = dax_to_pfn(entry); \
pfn < dax_end_pfn(entry); pfn++)
+/*
+ * A DAX page is considered shared if it has no mapping set and ->share (which
+ * shares the ->index field) is non-zero. Note this may return false even if the
+ * page is shared between multiple files but has not yet actually been mapped
+ * into multiple address spaces.
+ */
static inline bool dax_page_is_shared(struct page *page)
{
- return page->mapping == PAGE_MAPPING_DAX_SHARED;
+ return !page->mapping && page->share;
}
/*
- * Set the page->mapping with PAGE_MAPPING_DAX_SHARED flag, increase the
- * refcount.
+ * Increase the page share refcount, warning if the page is not marked as shared.
*/
static inline void dax_page_share_get(struct page *page)
{
- if (page->mapping != PAGE_MAPPING_DAX_SHARED) {
- /*
- * Reset the index if the page was already mapped
- * regularly before.
- */
- if (page->mapping)
- page->share = 1;
- page->mapping = PAGE_MAPPING_DAX_SHARED;
- }
+ WARN_ON_ONCE(!page->share);
+ WARN_ON_ONCE(page->mapping);
page->share++;
}
static inline unsigned long dax_page_share_put(struct page *page)
{
+ WARN_ON_ONCE(!page->share);
return --page->share;
}
/*
- * When it is called in dax_insert_entry(), the shared flag will indicate that
- * whether this entry is shared by multiple files. If so, set the page->mapping
- * PAGE_MAPPING_DAX_SHARED, and use page->share as refcount.
+ * When it is called in dax_insert_entry(), the shared flag will indicate
+ * whether this entry is shared by multiple files. If the page has not
+ * previously been associated with any mappings the ->mapping and ->index
+ * fields will be set. If it has already been associated with a mapping
+ * the mapping will be cleared and the share count set. It's then up to the
+ * file-system to track which mappings contain which pages, ie. by implementing
+ * dax_holder_operations.
*/
static void dax_associate_entry(void *entry, struct address_space *mapping,
struct vm_area_struct *vma, unsigned long address, bool shared)
@@ -397,7 +400,17 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
for_each_mapped_pfn(entry, pfn) {
struct page *page = pfn_to_page(pfn);
- if (shared) {
+ if (shared && page->mapping && page->share) {
+ if (page->mapping) {
+ page->mapping = NULL;
+
+ /*
+ * Page has already been mapped into one address
+ * space so set the share count.
+ */
+ page->share = 1;
+ }
+
dax_page_share_get(page);
} else {
WARN_ON_ONCE(page->mapping);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 691506b..598334e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -668,12 +668,6 @@ PAGEFLAG_FALSE(VmemmapSelfHosted, vmemmap_self_hosted)
#define PAGE_MAPPING_KSM (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
-/*
- * Different with flags above, this flag is used only for fsdax mode. It
- * indicates that this page->mapping is now under reflink case.
- */
-#define PAGE_MAPPING_DAX_SHARED ((void *)0x1)
-
static __always_inline bool folio_mapping_flags(const struct folio *folio)
{
return ((unsigned long)folio->mapping & PAGE_MAPPING_FLAGS) != 0;
--
git-series 0.9.1
* [PATCH v6 09/26] mm/gup: Remove redundant check for PCI P2PDMA page
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (7 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 08/26] fs/dax: Remove PAGE_MAPPING_DAX_SHARED mapping flag Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-10 6:00 ` [PATCH v6 10/26] mm/mm_init: Move p2pdma page refcount initialisation to p2pdma Alistair Popple
` (17 subsequent siblings)
26 siblings, 0 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch, Jason Gunthorpe
PCI P2PDMA pages are not mapped with pXX_devmap PTEs, therefore the
check in __gup_device_huge() is redundant. Remove it.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
mm/gup.c | 5 -----
1 file changed, 5 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index 2304175..9b587b5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3016,11 +3016,6 @@ static int gup_fast_devmap_leaf(unsigned long pfn, unsigned long addr,
break;
}
- if (!(flags & FOLL_PCI_P2PDMA) && is_pci_p2pdma_page(page)) {
- gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
- break;
- }
-
folio = try_grab_folio_fast(page, 1, flags);
if (!folio) {
gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
--
git-series 0.9.1
* [PATCH v6 10/26] mm/mm_init: Move p2pdma page refcount initialisation to p2pdma
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (8 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 09/26] mm/gup: Remove redundant check for PCI P2PDMA page Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-14 14:51 ` David Hildenbrand
2025-01-10 6:00 ` [PATCH v6 11/26] mm: Allow compound zone device pages Alistair Popple
` (16 subsequent siblings)
26 siblings, 1 reply; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Currently ZONE_DEVICE page reference counts are initialised by core
memory management code in __init_zone_device_page() as part of the
memremap() call which driver modules make to obtain ZONE_DEVICE
pages. This initialises page refcounts to 1 before returning them to
the driver.
This was presumably done because drivers had a reference of sorts
on the page. It also ensured the page could always be mapped with
vm_insert_page(), for example, and would never get freed (ie. have a
zero refcount), freeing drivers from manipulating page reference counts.
However it complicates figuring out whether a page is free from the mm
perspective because it is no longer possible to just look at the
refcount. Instead the page type must be known and, if GUP is used, a
secondary pgmap reference is also sometimes needed.
To simplify this it is desirable to stop initialising the page
refcount on the driver's behalf, so core mm can just use the refcount
without having to account for page type or do other types of
tracking. This is possible because drivers can always assume the page
is valid, as the core kernel will never offline or remove the struct
page.
This means it is now up to drivers to initialise the page refcount as
required. P2PDMA uses vm_insert_page() to map the page, and that
requires a non-zero reference count, so set the refcount to one when
the page is first mapped.
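For clarity, the pattern this establishes on the P2PDMA side can be
summarised as the following sketch (condensed from the diff below; the
wrapper function is illustrative only, not part of the patch):

static int p2pdma_map_one_page(struct vm_area_struct *vma,
                               unsigned long vaddr, void *kaddr)
{
        struct page *page = virt_to_page(kaddr);
        int ret;

        /* Freshly allocated, so no one else can hold a reference yet. */
        set_page_count(page, 1);
        ret = vm_insert_page(vma, vaddr, page); /* takes its own reference */
        if (!ret)
                put_page(page);                 /* drop the initial reference */
        return ret;
}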
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
Changes since v2:
- Initialise the page refcount for all pages covered by the kaddr
---
drivers/pci/p2pdma.c | 13 +++++++++++--
mm/memremap.c | 17 +++++++++++++----
mm/mm_init.c | 22 ++++++++++++++++++----
3 files changed, 42 insertions(+), 10 deletions(-)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 0cb7e0a..04773a8 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -140,13 +140,22 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
rcu_read_unlock();
for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
- ret = vm_insert_page(vma, vaddr, virt_to_page(kaddr));
+ struct page *page = virt_to_page(kaddr);
+
+ /*
+ * Initialise the refcount for the freshly allocated page. As
+ * we have just allocated the page no one else should be
+ * using it.
+ */
+ VM_WARN_ON_ONCE_PAGE(!page_ref_count(page), page);
+ set_page_count(page, 1);
+ ret = vm_insert_page(vma, vaddr, page);
if (ret) {
gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
return ret;
}
percpu_ref_get(ref);
- put_page(virt_to_page(kaddr));
+ put_page(page);
kaddr += PAGE_SIZE;
len -= PAGE_SIZE;
}
diff --git a/mm/memremap.c b/mm/memremap.c
index 40d4547..07bbe0e 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -488,15 +488,24 @@ void free_zone_device_folio(struct folio *folio)
folio->mapping = NULL;
folio->page.pgmap->ops->page_free(folio_page(folio, 0));
- if (folio->page.pgmap->type != MEMORY_DEVICE_PRIVATE &&
- folio->page.pgmap->type != MEMORY_DEVICE_COHERENT)
+ switch (folio->page.pgmap->type) {
+ case MEMORY_DEVICE_PRIVATE:
+ case MEMORY_DEVICE_COHERENT:
+ put_dev_pagemap(folio->page.pgmap);
+ break;
+
+ case MEMORY_DEVICE_FS_DAX:
+ case MEMORY_DEVICE_GENERIC:
/*
* Reset the refcount to 1 to prepare for handing out the page
* again.
*/
folio_set_count(folio, 1);
- else
- put_dev_pagemap(folio->page.pgmap);
+ break;
+
+ case MEMORY_DEVICE_PCI_P2PDMA:
+ break;
+ }
}
void zone_device_page_init(struct page *page)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 24b68b4..f021e63 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1017,12 +1017,26 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
}
/*
- * ZONE_DEVICE pages are released directly to the driver page allocator
- * which will set the page count to 1 when allocating the page.
+ * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC and
+ * MEMORY_TYPE_FS_DAX pages are released directly to the driver page
+ * allocator which will set the page count to 1 when allocating the
+ * page.
+ *
+ * MEMORY_TYPE_GENERIC and MEMORY_TYPE_FS_DAX pages automatically have
+ * their refcount reset to one whenever they are freed (ie. after
+ * their refcount drops to 0).
*/
- if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
- pgmap->type == MEMORY_DEVICE_COHERENT)
+ switch (pgmap->type) {
+ case MEMORY_DEVICE_PRIVATE:
+ case MEMORY_DEVICE_COHERENT:
+ case MEMORY_DEVICE_PCI_P2PDMA:
set_page_count(page, 0);
+ break;
+
+ case MEMORY_DEVICE_FS_DAX:
+ case MEMORY_DEVICE_GENERIC:
+ break;
+ }
}
/*
--
git-series 0.9.1
^ permalink raw reply [flat|nested] 97+ messages in thread
* [PATCH v6 11/26] mm: Allow compound zone device pages
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (9 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 10/26] mm/mm_init: Move p2pdma page refcount initialisation to p2pdma Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-14 14:59 ` David Hildenbrand
2025-01-10 6:00 ` [PATCH v6 12/26] mm/memory: Enhance insert_page_into_pte_locked() to create writable mappings Alistair Popple
` (15 subsequent siblings)
26 siblings, 1 reply; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch, Jason Gunthorpe
Zone device pages are used to represent various types of device memory
managed by device drivers. Currently compound zone device pages are
not supported. This is because MEMORY_DEVICE_FS_DAX pages are the only
user of higher order zone device pages and have their own page
reference counting.
A future change will unify FS DAX reference counting with normal page
reference counting rules and remove the special FS DAX reference
counting. Supporting that requires compound zone device pages.
Supporting compound zone device pages requires compound_head() to
distinguish between head and tail pages whilst still preserving the
special struct page fields that are specific to zone device pages.
A tail page is distinguished by having bit zero set in
page->compound_head, with the remaining bits pointing to the head
page. For zone device pages page->compound_head is shared with
page->pgmap.
The page->pgmap field is common to all pages within a memory section.
Therefore pgmap is the same for both head and tail pages and can be
moved into the folio, letting the standard scheme be used to find the
head page from a tail page.
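For reference, the standard compound_head() encoding relied on here is
roughly the following (a simplified sketch, not the exact kernel
implementation):

static inline struct page *simple_compound_head(struct page *page)
{
        unsigned long head = READ_ONCE(page->compound_head);

        /* Bit zero marks a tail page; the other bits point at the head. */
        if (head & 1)
                return (struct page *)(head - 1);
        return page;
}

With pgmap moved into the folio, the new page_pgmap() helper below is
then just page_folio(page)->pgmap, which works identically for head
and tail pages.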
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
Changes for v4:
- Fix build breakages reported by kernel test robot
Changes since v2:
- Indentation fix
- Rename page_dev_pagemap() to page_pgmap()
- Rename folio _unused field to _unused_pgmap_compound_head
- s/WARN_ON/VM_WARN_ON_ONCE_PAGE/
Changes since v1:
- Move pgmap to the folio as suggested by Matthew Wilcox
---
drivers/gpu/drm/nouveau/nouveau_dmem.c | 3 ++-
drivers/pci/p2pdma.c | 6 +++---
include/linux/memremap.h | 6 +++---
include/linux/migrate.h | 4 ++--
include/linux/mm_types.h | 9 +++++++--
include/linux/mmzone.h | 12 +++++++++++-
lib/test_hmm.c | 3 ++-
mm/hmm.c | 2 +-
mm/memory.c | 4 +++-
mm/memremap.c | 14 +++++++-------
mm/migrate_device.c | 7 +++++--
mm/mm_init.c | 2 +-
12 files changed, 47 insertions(+), 25 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 1a07256..61d0f41 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -88,7 +88,8 @@ struct nouveau_dmem {
static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page)
{
- return container_of(page->pgmap, struct nouveau_dmem_chunk, pagemap);
+ return container_of(page_pgmap(page), struct nouveau_dmem_chunk,
+ pagemap);
}
static struct nouveau_drm *page_to_drm(struct page *page)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 04773a8..19214ec 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -202,7 +202,7 @@ static const struct attribute_group p2pmem_group = {
static void p2pdma_page_free(struct page *page)
{
- struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
+ struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page));
/* safe to dereference while a reference is held to the percpu ref */
struct pci_p2pdma *p2pdma =
rcu_dereference_protected(pgmap->provider->p2pdma, 1);
@@ -1025,8 +1025,8 @@ enum pci_p2pdma_map_type
pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
struct scatterlist *sg)
{
- if (state->pgmap != sg_page(sg)->pgmap) {
- state->pgmap = sg_page(sg)->pgmap;
+ if (state->pgmap != page_pgmap(sg_page(sg))) {
+ state->pgmap = page_pgmap(sg_page(sg));
state->map = pci_p2pdma_map_type(state->pgmap, dev);
state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
}
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 3f7143a..0256a42 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -161,7 +161,7 @@ static inline bool is_device_private_page(const struct page *page)
{
return IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
is_zone_device_page(page) &&
- page->pgmap->type == MEMORY_DEVICE_PRIVATE;
+ page_pgmap(page)->type == MEMORY_DEVICE_PRIVATE;
}
static inline bool folio_is_device_private(const struct folio *folio)
@@ -173,13 +173,13 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
{
return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
is_zone_device_page(page) &&
- page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
+ page_pgmap(page)->type == MEMORY_DEVICE_PCI_P2PDMA;
}
static inline bool is_device_coherent_page(const struct page *page)
{
return is_zone_device_page(page) &&
- page->pgmap->type == MEMORY_DEVICE_COHERENT;
+ page_pgmap(page)->type == MEMORY_DEVICE_COHERENT;
}
static inline bool folio_is_device_coherent(const struct folio *folio)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 29919fa..61899ec 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -205,8 +205,8 @@ struct migrate_vma {
unsigned long end;
/*
- * Set to the owner value also stored in page->pgmap->owner for
- * migrating out of device private memory. The flags also need to
+ * Set to the owner value also stored in page_pgmap(page)->owner
+ * for migrating out of device private memory. The flags also need to
* be set to MIGRATE_VMA_SELECT_DEVICE_PRIVATE.
* The caller should always set this field when using mmu notifier
* callbacks to avoid device MMU invalidations for device private
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index df8f515..54b59b8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -129,8 +129,11 @@ struct page {
unsigned long compound_head; /* Bit zero is set */
};
struct { /* ZONE_DEVICE pages */
- /** @pgmap: Points to the hosting device page map. */
- struct dev_pagemap *pgmap;
+ /*
+ * The first word is used for compound_head or folio
+ * pgmap
+ */
+ void *_unused_pgmap_compound_head;
void *zone_device_data;
/*
* ZONE_DEVICE private pages are counted as being
@@ -299,6 +302,7 @@ typedef struct {
* @_refcount: Do not access this member directly. Use folio_ref_count()
* to find how many references there are to this folio.
* @memcg_data: Memory Control Group data.
+ * @pgmap: Metadata for ZONE_DEVICE mappings
* @virtual: Virtual address in the kernel direct map.
* @_last_cpupid: IDs of last CPU and last process that accessed the folio.
* @_entire_mapcount: Do not use directly, call folio_entire_mapcount().
@@ -337,6 +341,7 @@ struct folio {
/* private: */
};
/* public: */
+ struct dev_pagemap *pgmap;
};
struct address_space *mapping;
pgoff_t index;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c7ad4d6..fd492c3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1159,6 +1159,12 @@ static inline bool is_zone_device_page(const struct page *page)
return page_zonenum(page) == ZONE_DEVICE;
}
+static inline struct dev_pagemap *page_pgmap(const struct page *page)
+{
+ VM_WARN_ON_ONCE_PAGE(!is_zone_device_page(page), page);
+ return page_folio(page)->pgmap;
+}
+
/*
* Consecutive zone device pages should not be merged into the same sgl
* or bvec segment with other types of pages or if they belong to different
@@ -1174,7 +1180,7 @@ static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
return false;
if (!is_zone_device_page(a))
return true;
- return a->pgmap == b->pgmap;
+ return page_pgmap(a) == page_pgmap(b);
}
extern void memmap_init_zone_device(struct zone *, unsigned long,
@@ -1189,6 +1195,10 @@ static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
{
return true;
}
+static inline struct dev_pagemap *page_pgmap(const struct page *page)
+{
+ return NULL;
+}
#endif
static inline bool folio_is_zone_device(const struct folio *folio)
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 056f2e4..ffd0c6f 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -195,7 +195,8 @@ static int dmirror_fops_release(struct inode *inode, struct file *filp)
static struct dmirror_chunk *dmirror_page_to_chunk(struct page *page)
{
- return container_of(page->pgmap, struct dmirror_chunk, pagemap);
+ return container_of(page_pgmap(page), struct dmirror_chunk,
+ pagemap);
}
static struct dmirror_device *dmirror_page_to_device(struct page *page)
diff --git a/mm/hmm.c b/mm/hmm.c
index 7e0229a..082f7b7 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -248,7 +248,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
* just report the PFN.
*/
if (is_device_private_entry(entry) &&
- pfn_swap_entry_to_page(entry)->pgmap->owner ==
+ page_pgmap(pfn_swap_entry_to_page(entry))->owner ==
range->dev_private_owner) {
cpu_flags = HMM_PFN_VALID;
if (is_writable_device_private_entry(entry))
diff --git a/mm/memory.c b/mm/memory.c
index f09f20c..06bb29e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4316,6 +4316,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vmf->page = pfn_swap_entry_to_page(entry);
ret = remove_device_exclusive_entry(vmf);
} else if (is_device_private_entry(entry)) {
+ struct dev_pagemap *pgmap;
if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
/*
* migrate_to_ram is not yet ready to operate
@@ -4340,7 +4341,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
get_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
- ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
+ pgmap = page_pgmap(vmf->page);
+ ret = pgmap->ops->migrate_to_ram(vmf);
put_page(vmf->page);
} else if (is_hwpoison_entry(entry)) {
ret = VM_FAULT_HWPOISON;
diff --git a/mm/memremap.c b/mm/memremap.c
index 07bbe0e..68099af 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -458,8 +458,8 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
void free_zone_device_folio(struct folio *folio)
{
- if (WARN_ON_ONCE(!folio->page.pgmap->ops ||
- !folio->page.pgmap->ops->page_free))
+ if (WARN_ON_ONCE(!folio->pgmap->ops ||
+ !folio->pgmap->ops->page_free))
return;
mem_cgroup_uncharge(folio);
@@ -486,12 +486,12 @@ void free_zone_device_folio(struct folio *folio)
* to clear folio->mapping.
*/
folio->mapping = NULL;
- folio->page.pgmap->ops->page_free(folio_page(folio, 0));
+ folio->pgmap->ops->page_free(folio_page(folio, 0));
- switch (folio->page.pgmap->type) {
+ switch (folio->pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
case MEMORY_DEVICE_COHERENT:
- put_dev_pagemap(folio->page.pgmap);
+ put_dev_pagemap(folio->pgmap);
break;
case MEMORY_DEVICE_FS_DAX:
@@ -514,7 +514,7 @@ void zone_device_page_init(struct page *page)
* Drivers shouldn't be allocating pages after calling
* memunmap_pages().
*/
- WARN_ON_ONCE(!percpu_ref_tryget_live(&page->pgmap->ref));
+ WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
set_page_count(page, 1);
lock_page(page);
}
@@ -523,7 +523,7 @@ EXPORT_SYMBOL_GPL(zone_device_page_init);
#ifdef CONFIG_FS_DAX
bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
{
- if (folio->page.pgmap->type != MEMORY_DEVICE_FS_DAX)
+ if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
return false;
/*
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 9cf2659..2209070 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -106,6 +106,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
arch_enter_lazy_mmu_mode();
for (; addr < end; addr += PAGE_SIZE, ptep++) {
+ struct dev_pagemap *pgmap;
unsigned long mpfn = 0, pfn;
struct folio *folio;
struct page *page;
@@ -133,9 +134,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
goto next;
page = pfn_swap_entry_to_page(entry);
+ pgmap = page_pgmap(page);
if (!(migrate->flags &
MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
- page->pgmap->owner != migrate->pgmap_owner)
+ pgmap->owner != migrate->pgmap_owner)
goto next;
mpfn = migrate_pfn(page_to_pfn(page)) |
@@ -151,12 +153,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
goto next;
}
page = vm_normal_page(migrate->vma, addr, pte);
+ pgmap = page_pgmap(page);
if (page && !is_zone_device_page(page) &&
!(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
goto next;
else if (page && is_device_coherent_page(page) &&
(!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
- page->pgmap->owner != migrate->pgmap_owner))
+ pgmap->owner != migrate->pgmap_owner))
goto next;
mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
diff --git a/mm/mm_init.c b/mm/mm_init.c
index f021e63..cb73402 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -998,7 +998,7 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
* and zone_device_data. It is a bug if a ZONE_DEVICE page is
* ever freed or placed on a driver-private list.
*/
- page->pgmap = pgmap;
+ page_folio(page)->pgmap = pgmap;
page->zone_device_data = NULL;
/*
--
git-series 0.9.1
^ permalink raw reply [flat|nested] 97+ messages in thread
* [PATCH v6 12/26] mm/memory: Enhance insert_page_into_pte_locked() to create writable mappings
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (10 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 11/26] mm: Allow compound zone device pages Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-14 15:03 ` David Hildenbrand
[not found] ` <6785b90f300d8_20fa29465@dwillia2-xfh.jf.intel.com.notmuch>
2025-01-10 6:00 ` [PATCH v6 13/26] mm/memory: Add vmf_insert_page_mkwrite() Alistair Popple
` (14 subsequent siblings)
26 siblings, 2 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
In preparation for using insert_page() for DAX, enhance
insert_page_into_pte_locked() to handle establishing writable
mappings. Recall that DAX returns VM_FAULT_NOPAGE after installing a
PTE, which bypasses the typical set_pte_range() in finish_fault().
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Suggested-by: Dan Williams <dan.j.williams@intel.com>
---
Changes for v5:
- Minor comment/formatting fixes suggested by David Hildenbrand
Changes since v2:
- New patch split out from "mm/memory: Add dax_insert_pfn"
---
mm/memory.c | 37 +++++++++++++++++++++++++++++--------
1 file changed, 29 insertions(+), 8 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 06bb29e..8531acb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2126,19 +2126,40 @@ static int validate_page_before_insert(struct vm_area_struct *vma,
}
static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
- unsigned long addr, struct page *page, pgprot_t prot)
+ unsigned long addr, struct page *page,
+ pgprot_t prot, bool mkwrite)
{
struct folio *folio = page_folio(page);
+ pte_t entry = ptep_get(pte);
pte_t pteval;
- if (!pte_none(ptep_get(pte)))
- return -EBUSY;
+ if (!pte_none(entry)) {
+ if (!mkwrite)
+ return -EBUSY;
+
+ /* see insert_pfn(). */
+ if (pte_pfn(entry) != page_to_pfn(page)) {
+ WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry)));
+ return -EFAULT;
+ }
+ entry = maybe_mkwrite(entry, vma);
+ entry = pte_mkyoung(entry);
+ if (ptep_set_access_flags(vma, addr, pte, entry, 1))
+ update_mmu_cache(vma, addr, pte);
+ return 0;
+ }
+
/* Ok, finally just insert the thing.. */
pteval = mk_pte(page, prot);
if (unlikely(is_zero_folio(folio))) {
pteval = pte_mkspecial(pteval);
} else {
folio_get(folio);
+ entry = mk_pte(page, prot);
+ if (mkwrite) {
+ entry = pte_mkyoung(entry);
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ }
inc_mm_counter(vma->vm_mm, mm_counter_file(folio));
folio_add_file_rmap_pte(folio, page, vma);
}
@@ -2147,7 +2168,7 @@ static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
}
static int insert_page(struct vm_area_struct *vma, unsigned long addr,
- struct page *page, pgprot_t prot)
+ struct page *page, pgprot_t prot, bool mkwrite)
{
int retval;
pte_t *pte;
@@ -2160,7 +2181,7 @@ static int insert_page(struct vm_area_struct *vma, unsigned long addr,
pte = get_locked_pte(vma->vm_mm, addr, &ptl);
if (!pte)
goto out;
- retval = insert_page_into_pte_locked(vma, pte, addr, page, prot);
+ retval = insert_page_into_pte_locked(vma, pte, addr, page, prot, mkwrite);
pte_unmap_unlock(pte, ptl);
out:
return retval;
@@ -2174,7 +2195,7 @@ static int insert_page_in_batch_locked(struct vm_area_struct *vma, pte_t *pte,
err = validate_page_before_insert(vma, page);
if (err)
return err;
- return insert_page_into_pte_locked(vma, pte, addr, page, prot);
+ return insert_page_into_pte_locked(vma, pte, addr, page, prot, false);
}
/* insert_pages() amortizes the cost of spinlock operations
@@ -2310,7 +2331,7 @@ int vm_insert_page(struct vm_area_struct *vma, unsigned long addr,
BUG_ON(vma->vm_flags & VM_PFNMAP);
vm_flags_set(vma, VM_MIXEDMAP);
}
- return insert_page(vma, addr, page, vma->vm_page_prot);
+ return insert_page(vma, addr, page, vma->vm_page_prot, false);
}
EXPORT_SYMBOL(vm_insert_page);
@@ -2590,7 +2611,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
* result in pfn_t_has_page() == false.
*/
page = pfn_to_page(pfn_t_to_pfn(pfn));
- err = insert_page(vma, addr, page, pgprot);
+ err = insert_page(vma, addr, page, pgprot, mkwrite);
} else {
return insert_pfn(vma, addr, pfn, pgprot, mkwrite);
}
--
git-series 0.9.1
^ permalink raw reply [flat|nested] 97+ messages in thread
* [PATCH v6 13/26] mm/memory: Add vmf_insert_page_mkwrite()
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (11 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 12/26] mm/memory: Enhance insert_page_into_pte_locked() to create writable mappings Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-14 16:15 ` David Hildenbrand
2025-01-10 6:00 ` [PATCH v6 14/26] rmap: Add support for PUD sized mappings to rmap Alistair Popple
` (13 subsequent siblings)
26 siblings, 1 reply; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Currently to map a DAX page the DAX driver calls vmf_insert_pfn. This
creates a special devmap PTE entry for the pfn but does not take a
reference on the underlying struct page for the mapping. This is
because DAX page refcounts are treated specially, as indicated by the
presence of a devmap entry.
To allow DAX page refcounts to be managed the same as normal page
refcounts, introduce vmf_insert_page_mkwrite(). This will take a
reference on the underlying page much the same as vm_insert_page(),
except it also permits upgrading an existing mapping to be writable if
requested/possible.
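As an illustration only, a page fault handler built on the new helper
might look like the sketch below; my_dax_lookup_page() is a
hypothetical stand-in for however the filesystem resolves the page and
is not part of this patch:

static vm_fault_t my_dax_pte_fault(struct vm_fault *vmf)
{
        bool write = vmf->flags & FAULT_FLAG_WRITE;
        struct page *page = my_dax_lookup_page(vmf);    /* hypothetical */

        if (!page)
                return VM_FAULT_SIGBUS;

        /* Takes a page reference and upgrades to writable if requested. */
        return vmf_insert_page_mkwrite(vmf, page, write);
}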
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
Updates from v2:
- Rename function to make not DAX specific
- Split the insert_page_into_pte_locked() change into a separate
patch.
Updates from v1:
- Re-arrange code in insert_page_into_pte_locked() based on comments
from Jan Kara.
- Call mkdirty/mkyoung for the mkwrite case, also suggested by Jan.
---
include/linux/mm.h | 2 ++
mm/memory.c | 36 ++++++++++++++++++++++++++++++++++++
2 files changed, 38 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e790298..f267b06 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3620,6 +3620,8 @@ int vm_map_pages(struct vm_area_struct *vma, struct page **pages,
unsigned long num);
int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
unsigned long num);
+vm_fault_t vmf_insert_page_mkwrite(struct vm_fault *vmf, struct page *page,
+ bool write);
vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
unsigned long pfn);
vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index 8531acb..c60b819 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2624,6 +2624,42 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
return VM_FAULT_NOPAGE;
}
+vm_fault_t vmf_insert_page_mkwrite(struct vm_fault *vmf, struct page *page,
+ bool write)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ pgprot_t pgprot = vma->vm_page_prot;
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long addr = vmf->address;
+ int err;
+
+ if (addr < vma->vm_start || addr >= vma->vm_end)
+ return VM_FAULT_SIGBUS;
+
+ track_pfn_insert(vma, &pgprot, pfn_to_pfn_t(pfn));
+
+ if (!pfn_modify_allowed(pfn, pgprot))
+ return VM_FAULT_SIGBUS;
+
+ /*
+ * We refcount the page normally so make sure pfn_valid is true.
+ */
+ if (!pfn_valid(pfn))
+ return VM_FAULT_SIGBUS;
+
+ if (WARN_ON(is_zero_pfn(pfn) && write))
+ return VM_FAULT_SIGBUS;
+
+ err = insert_page(vma, addr, page, pgprot, write);
+ if (err == -ENOMEM)
+ return VM_FAULT_OOM;
+ if (err < 0 && err != -EBUSY)
+ return VM_FAULT_SIGBUS;
+
+ return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(vmf_insert_page_mkwrite);
+
vm_fault_t vmf_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
pfn_t pfn)
{
--
git-series 0.9.1
^ permalink raw reply [flat|nested] 97+ messages in thread
* [PATCH v6 14/26] rmap: Add support for PUD sized mappings to rmap
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (12 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 13/26] mm/memory: Add vmf_insert_page_mkwrite() Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-14 1:21 ` Dan Williams
2025-01-10 6:00 ` [PATCH v6 15/26] huge_memory: Add vmf_insert_folio_pud() Alistair Popple
` (12 subsequent siblings)
26 siblings, 1 reply; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
The rmap doesn't currently support adding a PUD mapping of a
folio. This patch adds support for entire PUD mappings of folios,
primarily to allow for more standard refcounting of device DAX
folios. Currently DAX is the only user of this and it doesn't require
support for partially mapped PUD-sized folios, so that is not
supported for now.
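As a rough, illustrative sketch of the intended add/remove pairing
(the real callers are added by the huge_memory patches later in the
series; these wrapper functions are not part of this patch and page
table/TLB handling is omitted):

static void map_folio_pud(struct folio *folio, struct vm_area_struct *vma)
{
        folio_get(folio);
        folio_add_file_rmap_pud(folio, &folio->page, vma);
        add_mm_counter(vma->vm_mm, mm_counter_file(folio), HPAGE_PUD_NR);
        /* ... install the huge PUD entry for folio_pfn(folio) ... */
}

static void unmap_folio_pud(struct folio *folio, struct page *page,
                            struct vm_area_struct *vma)
{
        /* ... clear and flush the PUD entry ... */
        folio_remove_rmap_pud(folio, page, vma);
        add_mm_counter(vma->vm_mm, mm_counter_file(folio), -HPAGE_PUD_NR);
        folio_put(folio);
}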
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
Changes for v6:
- Minor comment formatting fix
- Add an additional check for CONFIG_TRANSPARENT_HUGEPAGE to fix a
build breakage when CONFIG_PGTABLE_HAS_HUGE_LEAVES is not defined.
Changes for v5:
- Fixed accounting as suggested by David.
Changes for v4:
- New for v4, split out rmap changes as suggested by David.
---
include/linux/rmap.h | 15 ++++++++++-
mm/rmap.c | 67 ++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 78 insertions(+), 4 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 683a040..4509a43 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -192,6 +192,7 @@ typedef int __bitwise rmap_t;
enum rmap_level {
RMAP_LEVEL_PTE = 0,
RMAP_LEVEL_PMD,
+ RMAP_LEVEL_PUD,
};
static inline void __folio_rmap_sanity_checks(const struct folio *folio,
@@ -228,6 +229,14 @@ static inline void __folio_rmap_sanity_checks(const struct folio *folio,
VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PMD_NR, folio);
VM_WARN_ON_FOLIO(nr_pages != HPAGE_PMD_NR, folio);
break;
+ case RMAP_LEVEL_PUD:
+ /*
+ * Assume that we are creating a single "entire" mapping of the
+ * folio.
+ */
+ VM_WARN_ON_FOLIO(folio_nr_pages(folio) != HPAGE_PUD_NR, folio);
+ VM_WARN_ON_FOLIO(nr_pages != HPAGE_PUD_NR, folio);
+ break;
default:
VM_WARN_ON_ONCE(true);
}
@@ -251,12 +260,16 @@ void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
folio_add_file_rmap_ptes(folio, page, 1, vma)
void folio_add_file_rmap_pmd(struct folio *, struct page *,
struct vm_area_struct *);
+void folio_add_file_rmap_pud(struct folio *, struct page *,
+ struct vm_area_struct *);
void folio_remove_rmap_ptes(struct folio *, struct page *, int nr_pages,
struct vm_area_struct *);
#define folio_remove_rmap_pte(folio, page, vma) \
folio_remove_rmap_ptes(folio, page, 1, vma)
void folio_remove_rmap_pmd(struct folio *, struct page *,
struct vm_area_struct *);
+void folio_remove_rmap_pud(struct folio *, struct page *,
+ struct vm_area_struct *);
void hugetlb_add_anon_rmap(struct folio *, struct vm_area_struct *,
unsigned long address, rmap_t flags);
@@ -341,6 +354,7 @@ static __always_inline void __folio_dup_file_rmap(struct folio *folio,
atomic_add(orig_nr_pages, &folio->_large_mapcount);
break;
case RMAP_LEVEL_PMD:
+ case RMAP_LEVEL_PUD:
atomic_inc(&folio->_entire_mapcount);
atomic_inc(&folio->_large_mapcount);
break;
@@ -437,6 +451,7 @@ static __always_inline int __folio_try_dup_anon_rmap(struct folio *folio,
atomic_add(orig_nr_pages, &folio->_large_mapcount);
break;
case RMAP_LEVEL_PMD:
+ case RMAP_LEVEL_PUD:
if (PageAnonExclusive(page)) {
if (unlikely(maybe_pinned))
return -EBUSY;
diff --git a/mm/rmap.c b/mm/rmap.c
index c6c4d4e..fbcb58d 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1187,12 +1187,19 @@ static __always_inline unsigned int __folio_add_rmap(struct folio *folio,
atomic_add(orig_nr_pages, &folio->_large_mapcount);
break;
case RMAP_LEVEL_PMD:
+ case RMAP_LEVEL_PUD:
first = atomic_inc_and_test(&folio->_entire_mapcount);
if (first) {
nr = atomic_add_return_relaxed(ENTIRELY_MAPPED, mapped);
if (likely(nr < ENTIRELY_MAPPED + ENTIRELY_MAPPED)) {
- *nr_pmdmapped = folio_nr_pages(folio);
- nr = *nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED);
+ nr_pages = folio_nr_pages(folio);
+ /*
+ * We only track PMD mappings of PMD-sized
+ * folios separately.
+ */
+ if (level == RMAP_LEVEL_PMD)
+ *nr_pmdmapped = nr_pages;
+ nr = nr_pages - (nr & FOLIO_PAGES_MAPPED);
/* Raced ahead of a remove and another add? */
if (unlikely(nr < 0))
nr = 0;
@@ -1338,6 +1345,13 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
case RMAP_LEVEL_PMD:
SetPageAnonExclusive(page);
break;
+ case RMAP_LEVEL_PUD:
+ /*
+ * Keep the compiler happy, we don't support anonymous
+ * PUD mappings.
+ */
+ WARN_ON_ONCE(1);
+ break;
}
}
for (i = 0; i < nr_pages; i++) {
@@ -1531,6 +1545,27 @@ void folio_add_file_rmap_pmd(struct folio *folio, struct page *page,
#endif
}
+/**
+ * folio_add_file_rmap_pud - add a PUD mapping to a page range of a folio
+ * @folio: The folio to add the mapping to
+ * @page: The first page to add
+ * @vma: The vm area in which the mapping is added
+ *
+ * The page range of the folio is defined by [page, page + HPAGE_PUD_NR)
+ *
+ * The caller needs to hold the page table lock.
+ */
+void folio_add_file_rmap_pud(struct folio *folio, struct page *page,
+ struct vm_area_struct *vma)
+{
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
+ defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+ __folio_add_file_rmap(folio, page, HPAGE_PUD_NR, vma, RMAP_LEVEL_PUD);
+#else
+ WARN_ON_ONCE(true);
+#endif
+}
+
static __always_inline void __folio_remove_rmap(struct folio *folio,
struct page *page, int nr_pages, struct vm_area_struct *vma,
enum rmap_level level)
@@ -1560,13 +1595,16 @@ static __always_inline void __folio_remove_rmap(struct folio *folio,
partially_mapped = nr && atomic_read(mapped);
break;
case RMAP_LEVEL_PMD:
+ case RMAP_LEVEL_PUD:
atomic_dec(&folio->_large_mapcount);
last = atomic_add_negative(-1, &folio->_entire_mapcount);
if (last) {
nr = atomic_sub_return_relaxed(ENTIRELY_MAPPED, mapped);
if (likely(nr < ENTIRELY_MAPPED)) {
- nr_pmdmapped = folio_nr_pages(folio);
- nr = nr_pmdmapped - (nr & FOLIO_PAGES_MAPPED);
+ nr_pages = folio_nr_pages(folio);
+ if (level == RMAP_LEVEL_PMD)
+ nr_pmdmapped = nr_pages;
+ nr = nr_pages - (nr & FOLIO_PAGES_MAPPED);
/* Raced ahead of another remove and an add? */
if (unlikely(nr < 0))
nr = 0;
@@ -1640,6 +1678,27 @@ void folio_remove_rmap_pmd(struct folio *folio, struct page *page,
#endif
}
+/**
+ * folio_remove_rmap_pud - remove a PUD mapping from a page range of a folio
+ * @folio: The folio to remove the mapping from
+ * @page: The first page to remove
+ * @vma: The vm area from which the mapping is removed
+ *
+ * The page range of the folio is defined by [page, page + HPAGE_PUD_NR)
+ *
+ * The caller needs to hold the page table lock.
+ */
+void folio_remove_rmap_pud(struct folio *folio, struct page *page,
+ struct vm_area_struct *vma)
+{
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
+ defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+ __folio_remove_rmap(folio, page, HPAGE_PUD_NR, vma, RMAP_LEVEL_PUD);
+#else
+ WARN_ON_ONCE(true);
+#endif
+}
+
/*
* @arg: enum ttu_flags will be passed to this argument
*/
--
git-series 0.9.1
^ permalink raw reply [flat|nested] 97+ messages in thread
* [PATCH v6 15/26] huge_memory: Add vmf_insert_folio_pud()
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (13 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 14/26] rmap: Add support for PUD sized mappings to rmap Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-14 1:27 ` Dan Williams
2025-01-14 16:22 ` David Hildenbrand
2025-01-10 6:00 ` [PATCH v6 16/26] huge_memory: Add vmf_insert_folio_pmd() Alistair Popple
` (11 subsequent siblings)
26 siblings, 2 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Currently DAX folio/page reference counts are managed differently to
normal pages. To allow these to be managed the same as normal pages,
introduce vmf_insert_folio_pud(). This will map the entire PUD-sized folio
and take references as it would for a normally mapped page.
This is distinct from the current mechanism, vmf_insert_pfn_pud, which
simply inserts a special devmap PUD entry into the page table without
holding a reference to the page for the mapping.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
Changes for v5:
- Removed is_huge_zero_pud() as it's unlikely to ever be implemented.
- Minor code clean-up suggested by David.
---
include/linux/huge_mm.h | 1 +-
mm/huge_memory.c | 89 ++++++++++++++++++++++++++++++++++++------
2 files changed, 78 insertions(+), 12 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 93e509b..5bd1ff7 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -39,6 +39,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
+vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, bool write);
enum transparent_hugepage_flag {
TRANSPARENT_HUGEPAGE_UNSUPPORTED,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 120cd2c..256adc3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1482,19 +1482,17 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
struct mm_struct *mm = vma->vm_mm;
pgprot_t prot = vma->vm_page_prot;
pud_t entry;
- spinlock_t *ptl;
- ptl = pud_lock(mm, pud);
if (!pud_none(*pud)) {
if (write) {
if (WARN_ON_ONCE(pud_pfn(*pud) != pfn_t_to_pfn(pfn)))
- goto out_unlock;
+ return;
entry = pud_mkyoung(*pud);
entry = maybe_pud_mkwrite(pud_mkdirty(entry), vma);
if (pudp_set_access_flags(vma, addr, pud, entry, 1))
update_mmu_cache_pud(vma, addr, pud);
}
- goto out_unlock;
+ return;
}
entry = pud_mkhuge(pfn_t_pud(pfn, prot));
@@ -1508,9 +1506,6 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
}
set_pud_at(mm, addr, pud, entry);
update_mmu_cache_pud(vma, addr, pud);
-
-out_unlock:
- spin_unlock(ptl);
}
/**
@@ -1528,6 +1523,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
unsigned long addr = vmf->address & PUD_MASK;
struct vm_area_struct *vma = vmf->vma;
pgprot_t pgprot = vma->vm_page_prot;
+ spinlock_t *ptl;
/*
* If we had pud_special, we could avoid all these restrictions,
@@ -1545,10 +1541,48 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
track_pfn_insert(vma, &pgprot, pfn);
+ ptl = pud_lock(vma->vm_mm, vmf->pud);
insert_pfn_pud(vma, addr, vmf->pud, pfn, write);
+ spin_unlock(ptl);
+
return VM_FAULT_NOPAGE;
}
EXPORT_SYMBOL_GPL(vmf_insert_pfn_pud);
+
+/**
+ * vmf_insert_folio_pud - insert a pud size folio mapped by a pud entry
+ * @vmf: Structure describing the fault
+ * @folio: folio to insert
+ * @write: whether it's a write fault
+ *
+ * Return: vm_fault_t value.
+ */
+vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, bool write)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned long addr = vmf->address & PUD_MASK;
+ pud_t *pud = vmf->pud;
+ struct mm_struct *mm = vma->vm_mm;
+ spinlock_t *ptl;
+
+ if (addr < vma->vm_start || addr >= vma->vm_end)
+ return VM_FAULT_SIGBUS;
+
+ if (WARN_ON_ONCE(folio_order(folio) != PUD_ORDER))
+ return VM_FAULT_SIGBUS;
+
+ ptl = pud_lock(mm, pud);
+ if (pud_none(*vmf->pud)) {
+ folio_get(folio);
+ folio_add_file_rmap_pud(folio, &folio->page, vma);
+ add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR);
+ }
+ insert_pfn_pud(vma, addr, vmf->pud, pfn_to_pfn_t(folio_pfn(folio)), write);
+ spin_unlock(ptl);
+
+ return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(vmf_insert_folio_pud);
#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
@@ -2146,7 +2180,8 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
zap_deposited_table(tlb->mm, pmd);
spin_unlock(ptl);
} else if (is_huge_zero_pmd(orig_pmd)) {
- zap_deposited_table(tlb->mm, pmd);
+ if (!vma_is_dax(vma) || arch_needs_pgtable_deposit())
+ zap_deposited_table(tlb->mm, pmd);
spin_unlock(ptl);
} else {
struct folio *folio = NULL;
@@ -2634,12 +2669,23 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
orig_pud = pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
arch_check_zapped_pud(vma, orig_pud);
tlb_remove_pud_tlb_entry(tlb, pud, addr);
- if (vma_is_special_huge(vma)) {
+ if (!vma_is_dax(vma) && vma_is_special_huge(vma)) {
spin_unlock(ptl);
/* No zero page support yet */
} else {
- /* No support for anonymous PUD pages yet */
- BUG();
+ struct page *page = NULL;
+ struct folio *folio;
+
+ /* No support for anonymous PUD pages or migration yet */
+ VM_WARN_ON_ONCE(vma_is_anonymous(vma) || !pud_present(orig_pud));
+
+ page = pud_page(orig_pud);
+ folio = page_folio(page);
+ folio_remove_rmap_pud(folio, page, vma);
+ add_mm_counter(tlb->mm, mm_counter_file(folio), -HPAGE_PUD_NR);
+
+ spin_unlock(ptl);
+ tlb_remove_page_size(tlb, page, HPAGE_PUD_SIZE);
}
return 1;
}
@@ -2647,6 +2693,10 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
unsigned long haddr)
{
+ struct folio *folio;
+ struct page *page;
+ pud_t old_pud;
+
VM_BUG_ON(haddr & ~HPAGE_PUD_MASK);
VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PUD_SIZE, vma);
@@ -2654,7 +2704,22 @@ static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
count_vm_event(THP_SPLIT_PUD);
- pudp_huge_clear_flush(vma, haddr, pud);
+ old_pud = pudp_huge_clear_flush(vma, haddr, pud);
+
+ if (!vma_is_dax(vma))
+ return;
+
+ page = pud_page(old_pud);
+ folio = page_folio(page);
+
+ if (!folio_test_dirty(folio) && pud_dirty(old_pud))
+ folio_mark_dirty(folio);
+ if (!folio_test_referenced(folio) && pud_young(old_pud))
+ folio_set_referenced(folio);
+ folio_remove_rmap_pud(folio, page, vma);
+ folio_put(folio);
+ add_mm_counter(vma->vm_mm, mm_counter_file(folio),
+ -HPAGE_PUD_NR);
}
void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
--
git-series 0.9.1
^ permalink raw reply [flat|nested] 97+ messages in thread
* [PATCH v6 16/26] huge_memory: Add vmf_insert_folio_pmd()
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (14 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 15/26] huge_memory: Add vmf_insert_folio_pud() Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-14 2:04 ` Dan Williams
2025-01-14 16:40 ` David Hildenbrand
2025-01-10 6:00 ` [PATCH v6 17/26] memremap: Add is_devdax_page() and is_fsdax_page() helpers Alistair Popple
` (10 subsequent siblings)
26 siblings, 2 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Currently DAX folio/page reference counts are managed differently to
normal pages. To allow these to be managed the same as normal pages,
introduce vmf_insert_folio_pmd(). This will map the entire PMD-sized folio
and take references as it would for a normally mapped page.
This is distinct from the current mechanism, vmf_insert_pfn_pmd, which
simply inserts a special devmap PMD entry into the page table without
holding a reference to the page for the mapping.
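Purely for illustration, a huge fault handler using the new folio
insert helpers (this one and vmf_insert_folio_pud() from the previous
patch) might dispatch along these lines; my_dax_lookup_folio() is a
hypothetical helper and not part of this series:

static vm_fault_t my_dax_huge_fault(struct vm_fault *vmf, unsigned int order)
{
        struct folio *folio = my_dax_lookup_folio(vmf, order);  /* hypothetical */
        bool write = vmf->flags & FAULT_FLAG_WRITE;

        if (!folio)
                return VM_FAULT_FALLBACK;

        if (order == PMD_ORDER)
                return vmf_insert_folio_pmd(vmf, folio, write);
        if (order == PUD_ORDER)
                return vmf_insert_folio_pud(vmf, folio, write);

        return VM_FAULT_FALLBACK;
}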
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
Changes for v5:
- Minor code cleanup suggested by David
---
include/linux/huge_mm.h | 1 +-
mm/huge_memory.c | 54 ++++++++++++++++++++++++++++++++++--------
2 files changed, 45 insertions(+), 10 deletions(-)
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 5bd1ff7..3633bd3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -39,6 +39,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
+vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio, bool write);
vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, bool write);
enum transparent_hugepage_flag {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 256adc3..d1ea76e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1381,14 +1381,12 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
{
struct mm_struct *mm = vma->vm_mm;
pmd_t entry;
- spinlock_t *ptl;
- ptl = pmd_lock(mm, pmd);
if (!pmd_none(*pmd)) {
if (write) {
if (pmd_pfn(*pmd) != pfn_t_to_pfn(pfn)) {
WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
- goto out_unlock;
+ return;
}
entry = pmd_mkyoung(*pmd);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -1396,7 +1394,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
update_mmu_cache_pmd(vma, addr, pmd);
}
- goto out_unlock;
+ return;
}
entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
@@ -1417,11 +1415,6 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
set_pmd_at(mm, addr, pmd, entry);
update_mmu_cache_pmd(vma, addr, pmd);
-
-out_unlock:
- spin_unlock(ptl);
- if (pgtable)
- pte_free(mm, pgtable);
}
/**
@@ -1440,6 +1433,7 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
struct vm_area_struct *vma = vmf->vma;
pgprot_t pgprot = vma->vm_page_prot;
pgtable_t pgtable = NULL;
+ spinlock_t *ptl;
/*
* If we had pmd_special, we could avoid all these restrictions,
@@ -1462,12 +1456,52 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
}
track_pfn_insert(vma, &pgprot, pfn);
-
+ ptl = pmd_lock(vma->vm_mm, vmf->pmd);
insert_pfn_pmd(vma, addr, vmf->pmd, pfn, pgprot, write, pgtable);
+ spin_unlock(ptl);
+ if (pgtable)
+ pte_free(vma->vm_mm, pgtable);
+
return VM_FAULT_NOPAGE;
}
EXPORT_SYMBOL_GPL(vmf_insert_pfn_pmd);
+vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio, bool write)
+{
+ struct vm_area_struct *vma = vmf->vma;
+ unsigned long addr = vmf->address & PMD_MASK;
+ struct mm_struct *mm = vma->vm_mm;
+ spinlock_t *ptl;
+ pgtable_t pgtable = NULL;
+
+ if (addr < vma->vm_start || addr >= vma->vm_end)
+ return VM_FAULT_SIGBUS;
+
+ if (WARN_ON_ONCE(folio_order(folio) != PMD_ORDER))
+ return VM_FAULT_SIGBUS;
+
+ if (arch_needs_pgtable_deposit()) {
+ pgtable = pte_alloc_one(vma->vm_mm);
+ if (!pgtable)
+ return VM_FAULT_OOM;
+ }
+
+ ptl = pmd_lock(mm, vmf->pmd);
+ if (pmd_none(*vmf->pmd)) {
+ folio_get(folio);
+ folio_add_file_rmap_pmd(folio, &folio->page, vma);
+ add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
+ }
+ insert_pfn_pmd(vma, addr, vmf->pmd, pfn_to_pfn_t(folio_pfn(folio)),
+ vma->vm_page_prot, write, pgtable);
+ spin_unlock(ptl);
+ if (pgtable)
+ pte_free(mm, pgtable);
+
+ return VM_FAULT_NOPAGE;
+}
+EXPORT_SYMBOL_GPL(vmf_insert_folio_pmd);
+
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
static pud_t maybe_pud_mkwrite(pud_t pud, struct vm_area_struct *vma)
{
--
git-series 0.9.1
^ permalink raw reply [flat|nested] 97+ messages in thread
* [PATCH v6 17/26] memremap: Add is_devdax_page() and is_fsdax_page() helpers
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (15 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 16/26] huge_memory: Add vmf_insert_folio_pmd() Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-14 2:05 ` Dan Williams
2025-01-10 6:00 ` [PATCH v6 18/26] mm/gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages Alistair Popple
` (9 subsequent siblings)
26 siblings, 1 reply; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Add helpers to determine if a page or folio is a devdax or fsdax page
or folio.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
Changes for v5:
- Renamed is_device_dax_page() to is_devdax_page() for consistency.
---
include/linux/memremap.h | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 0256a42..54e8b57 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -187,6 +187,28 @@ static inline bool folio_is_device_coherent(const struct folio *folio)
return is_device_coherent_page(&folio->page);
}
+static inline bool is_fsdax_page(const struct page *page)
+{
+ return is_zone_device_page(page) &&
+ page_pgmap(page)->type == MEMORY_DEVICE_FS_DAX;
+}
+
+static inline bool folio_is_fsdax(const struct folio *folio)
+{
+ return is_fsdax_page(&folio->page);
+}
+
+static inline bool is_devdax_page(const struct page *page)
+{
+ return is_zone_device_page(page) &&
+ page_pgmap(page)->type == MEMORY_DEVICE_GENERIC;
+}
+
+static inline bool folio_is_devdax(const struct folio *folio)
+{
+ return is_devdax_page(&folio->page);
+}
+
#ifdef CONFIG_ZONE_DEVICE
void zone_device_page_init(struct page *page);
void *memremap_pages(struct dev_pagemap *pgmap, int nid);
--
git-series 0.9.1
^ permalink raw reply [flat|nested] 97+ messages in thread
* [PATCH v6 18/26] mm/gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (16 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 17/26] memremap: Add is_devdax_page() and is_fsdax_page() helpers Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-14 2:16 ` Dan Williams
2025-01-10 6:00 ` [PATCH v6 19/26] proc/task_mmu: Mark devdax and fsdax pages as always unpinned Alistair Popple
` (8 subsequent siblings)
26 siblings, 1 reply; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Longterm pinning of FS DAX pages should already be disallowed by
various pXX_devmap checks. However a future change will cause these
checks to no longer apply to FS DAX pages, so make
folio_is_longterm_pinnable() return false for FS DAX pages.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
include/linux/mm.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f267b06..01edca9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2078,6 +2078,10 @@ static inline bool folio_is_longterm_pinnable(struct folio *folio)
if (folio_is_device_coherent(folio))
return false;
+ /* DAX must also always allow eviction. */
+ if (folio_is_fsdax(folio))
+ return false;
+
/* Otherwise, non-movable zone folios can be pinned. */
return !folio_is_zone_movable(folio);
--
git-series 0.9.1
^ permalink raw reply [flat|nested] 97+ messages in thread
* [PATCH v6 19/26] proc/task_mmu: Mark devdax and fsdax pages as always unpinned
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (17 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 18/26] mm/gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-14 2:28 ` Dan Williams
2025-01-10 6:00 ` [PATCH v6 20/26] mm/mlock: Skip ZONE_DEVICE PMDs during mlock Alistair Popple
` (7 subsequent siblings)
26 siblings, 1 reply; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
The procfs mmu files such as smaps and pagemap currently ignore devdax and
fsdax pages because these pages are considered special. A future change
will start treating these as normal pages, meaning they can be exposed via
smaps and pagemap.
The only difference is that devdax and fsdax pages can never be pinned for
DMA via FOLL_LONGTERM, so add an explicit check in pte_is_pinned() to
reflect that.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
Changes for v5:
- After discussion with David remove the checks for DAX pages for
smaps and pagemap walkers. This means DAX pages will now appear in
those procfs files.
---
fs/proc/task_mmu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 38a5a3e..9a8a7d3 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1378,7 +1378,7 @@ static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr,
if (likely(!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags)))
return false;
folio = vm_normal_folio(vma, addr, pte);
- if (!folio)
+ if (!folio || folio_is_devdax(folio) || folio_is_fsdax(folio))
return false;
return folio_maybe_dma_pinned(folio);
}
--
git-series 0.9.1
^ permalink raw reply [flat|nested] 97+ messages in thread
* [PATCH v6 20/26] mm/mlock: Skip ZONE_DEVICE PMDs during mlock
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (18 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 19/26] proc/task_mmu: Mark devdax and fsdax pages as always unpinned Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-14 2:42 ` Dan Williams
2025-01-10 6:00 ` [PATCH v6 21/26] fs/dax: Properly refcount fs dax pages Alistair Popple
` (6 subsequent siblings)
26 siblings, 1 reply; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
At present mlock skips ptes mapping ZONE_DEVICE pages. A future change
to remove pmd_devmap will allow pmd_trans_huge_lock() to return
ZONE_DEVICE folios so make sure we continue to skip those.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
mm/mlock.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/mlock.c b/mm/mlock.c
index cde076f..3cb72b5 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -368,6 +368,8 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
if (is_huge_zero_pmd(*pmd))
goto out;
folio = pmd_folio(*pmd);
+ if (folio_is_zone_device(folio))
+ goto out;
if (vma->vm_flags & VM_LOCKED)
mlock_folio(folio);
else
--
git-series 0.9.1
^ permalink raw reply [flat|nested] 97+ messages in thread
* [PATCH v6 21/26] fs/dax: Properly refcount fs dax pages
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (19 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 20/26] mm/mlock: Skip ZONE_DEVICE PMDs during mlock Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-10 16:54 ` Darrick J. Wong
2025-01-14 3:35 ` Dan Williams
2025-01-10 6:00 ` [PATCH v6 22/26] device/dax: Properly refcount device dax pages when mapping Alistair Popple
` (5 subsequent siblings)
26 siblings, 2 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Currently fs dax pages are considered free when the refcount drops to
one and their refcounts are not increased when mapped via PTEs or
decreased when unmapped. This requires special logic in mm paths to
detect that these pages should not be properly refcounted, and to
detect when the refcount drops to one instead of zero.
On the other hand get_user_pages(), etc. will properly refcount fs dax
pages by taking a reference and dropping it when the page is
unpinned.
Tracking this special behaviour requires extra PTE bits
(e.g. pte_devmap) and introduces rules that are potentially confusing
and specific to FS DAX pages. To fix this, and to possibly allow
removal of the special PTE bits in future, convert the fs dax page
refcounts to be zero based and instead take a reference on the page
each time it is mapped, as is currently the case for normal pages.
This may also allow a future clean-up to remove the pgmap refcounting
that is currently done in mm/gup.c.
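As a rough illustration of the new model (a sketch only, condensed from
the dax_fault_iter() changes below, not a drop-in helper), mapping a fs
dax folio now holds an ordinary page reference around the insert rather
than relying on a 1-based count:

	/* Sketch: insert a fs dax folio using normal refcounting */
	static vm_fault_t dax_insert_folio_sketch(struct vm_fault *vmf,
						  struct folio *folio,
						  bool pmd, bool write)
	{
		vm_fault_t ret;

		/* Pin the folio across the insert like any other page... */
		folio_ref_inc(folio);
		if (pmd)
			ret = vmf_insert_folio_pmd(vmf, folio, write);
		else
			ret = vmf_insert_page_mkwrite(vmf, &folio->page, write);
		/* ...the insert itself takes the per-mapping references. */
		folio_put(folio);

		return ret;
	}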
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
Changes since v2:
Based on some questions from Dan I attempted to have the FS DAX page
cache (i.e. the address space) hold a reference to the folio whilst it
was mapped. However, I came to the strong conclusion that this was not
the right thing to do.
If the page refcount == 0 it means the page is:
1. not mapped into user-space
2. not subject to other access via DMA/GUP/etc.
I.e. from the core MM perspective the page is not in use.
The fact that a page may or may not be present in one or more address
space mappings is irrelevant to the core MM. It just means the page is
still in use or valid from the file system perspective, and it is the
responsibility of the file system to remove these mappings if the pfn
mapping becomes invalid (along with first making sure the MM state,
i.e. page->refcount, is idle). So we shouldn't be trying to track that
lifetime with MM refcounts.
Doing so just makes DMA-idle tracking more complex because there is
now another thing (one or more address spaces) which can hold
references on a page. And FS DAX can't even keep track of all the
address spaces which might contain a reference to the page in the
XFS/reflink case anyway.
We could do this if we made file systems invalidate all address space
mappings prior to calling dax_break_layouts(), but that isn't
currently necessary and would lead to increased faults just so we
could do some superfluous refcounting which the file system already
does.
I have, however, put the page sharing checks and WARN_ONs back, which
also turned out to be useful for figuring out when to re-initialise a
folio.
---
drivers/nvdimm/pmem.c | 4 +-
fs/dax.c | 212 +++++++++++++++++++++++-----------------
fs/fuse/virtio_fs.c | 3 +-
fs/xfs/xfs_inode.c | 2 +-
include/linux/dax.h | 6 +-
include/linux/mm.h | 27 +-----
include/linux/mm_types.h | 7 +-
mm/gup.c | 9 +--
mm/huge_memory.c | 6 +-
mm/internal.h | 2 +-
mm/memory-failure.c | 6 +-
mm/memory.c | 6 +-
mm/memremap.c | 47 ++++-----
mm/mm_init.c | 9 +--
mm/swap.c | 2 +-
15 files changed, 183 insertions(+), 165 deletions(-)
diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
index d81faa9..785b2d2 100644
--- a/drivers/nvdimm/pmem.c
+++ b/drivers/nvdimm/pmem.c
@@ -513,7 +513,7 @@ static int pmem_attach_disk(struct device *dev,
pmem->disk = disk;
pmem->pgmap.owner = pmem;
- pmem->pfn_flags = PFN_DEV;
+ pmem->pfn_flags = 0;
if (is_nd_pfn(dev)) {
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
pmem->pgmap.ops = &fsdax_pagemap_ops;
@@ -522,7 +522,6 @@ static int pmem_attach_disk(struct device *dev,
pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
pmem->pfn_pad = resource_size(res) -
range_len(&pmem->pgmap.range);
- pmem->pfn_flags |= PFN_MAP;
bb_range = pmem->pgmap.range;
bb_range.start += pmem->data_offset;
} else if (pmem_should_map_pages(dev)) {
@@ -532,7 +531,6 @@ static int pmem_attach_disk(struct device *dev,
pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
pmem->pgmap.ops = &fsdax_pagemap_ops;
addr = devm_memremap_pages(dev, &pmem->pgmap);
- pmem->pfn_flags |= PFN_MAP;
bb_range = pmem->pgmap.range;
} else {
addr = devm_memremap(dev, pmem->phys_addr,
diff --git a/fs/dax.c b/fs/dax.c
index d35dbe1..19f444e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -71,6 +71,11 @@ static unsigned long dax_to_pfn(void *entry)
return xa_to_value(entry) >> DAX_SHIFT;
}
+static struct folio *dax_to_folio(void *entry)
+{
+ return page_folio(pfn_to_page(dax_to_pfn(entry)));
+}
+
static void *dax_make_entry(pfn_t pfn, unsigned long flags)
{
return xa_mk_value(flags | (pfn_t_to_pfn(pfn) << DAX_SHIFT));
@@ -338,44 +343,88 @@ static unsigned long dax_entry_size(void *entry)
return PAGE_SIZE;
}
-static unsigned long dax_end_pfn(void *entry)
-{
- return dax_to_pfn(entry) + dax_entry_size(entry) / PAGE_SIZE;
-}
-
-/*
- * Iterate through all mapped pfns represented by an entry, i.e. skip
- * 'empty' and 'zero' entries.
- */
-#define for_each_mapped_pfn(entry, pfn) \
- for (pfn = dax_to_pfn(entry); \
- pfn < dax_end_pfn(entry); pfn++)
-
/*
* A DAX page is considered shared if it has no mapping set and ->share (which
* shares the ->index field) is non-zero. Note this may return false even if the
* page is shared between multiple files but has not yet actually been mapped
* into multiple address spaces.
*/
-static inline bool dax_page_is_shared(struct page *page)
+static inline bool dax_folio_is_shared(struct folio *folio)
{
- return !page->mapping && page->share;
+ return !folio->mapping && folio->share;
}
/*
- * Increase the page share refcount, warning if the page is not marked as shared.
+ * Increase the folio share refcount, warning if the folio is not marked as shared.
*/
-static inline void dax_page_share_get(struct page *page)
+static inline void dax_folio_share_get(void *entry)
{
- WARN_ON_ONCE(!page->share);
- WARN_ON_ONCE(page->mapping);
- page->share++;
+ struct folio *folio = dax_to_folio(entry);
+
+ WARN_ON_ONCE(!folio->share);
+ WARN_ON_ONCE(folio->mapping);
+ WARN_ON_ONCE(dax_entry_order(entry) != folio_order(folio));
+ folio->share++;
+}
+
+static inline unsigned long dax_folio_share_put(struct folio *folio)
+{
+ unsigned long ref;
+
+ if (!dax_folio_is_shared(folio))
+ ref = 0;
+ else
+ ref = --folio->share;
+
+ WARN_ON_ONCE(ref < 0);
+ if (!ref) {
+ folio->mapping = NULL;
+ if (folio_order(folio)) {
+ struct dev_pagemap *pgmap = page_pgmap(&folio->page);
+ unsigned int order = folio_order(folio);
+ unsigned int i;
+
+ for (i = 0; i < (1UL << order); i++) {
+ struct page *page = folio_page(folio, i);
+
+ ClearPageHead(page);
+ clear_compound_head(page);
+
+ /*
+ * Reset pgmap which was over-written by
+ * prep_compound_page().
+ */
+ page_folio(page)->pgmap = pgmap;
+
+ /* Make sure this isn't set to TAIL_MAPPING */
+ page->mapping = NULL;
+ page->share = 0;
+ WARN_ON_ONCE(page_ref_count(page));
+ }
+ }
+ }
+
+ return ref;
}
-static inline unsigned long dax_page_share_put(struct page *page)
+static void dax_device_folio_init(void *entry)
{
- WARN_ON_ONCE(!page->share);
- return --page->share;
+ struct folio *folio = dax_to_folio(entry);
+ int order = dax_entry_order(entry);
+
+ /*
+ * Folio should have been split back to order-0 pages in
+ * dax_folio_share_put() when they were removed from their
+ * final mapping.
+ */
+ WARN_ON_ONCE(folio_order(folio));
+
+ if (order > 0) {
+ prep_compound_page(&folio->page, order);
+ if (order > 1)
+ INIT_LIST_HEAD(&folio->_deferred_list);
+ WARN_ON_ONCE(folio_ref_count(folio));
+ }
}
/*
@@ -388,72 +437,58 @@ static inline unsigned long dax_page_share_put(struct page *page)
* dax_holder_operations.
*/
static void dax_associate_entry(void *entry, struct address_space *mapping,
- struct vm_area_struct *vma, unsigned long address, bool shared)
+ struct vm_area_struct *vma, unsigned long address, bool shared)
{
- unsigned long size = dax_entry_size(entry), pfn, index;
- int i = 0;
+ unsigned long size = dax_entry_size(entry), index;
+ struct folio *folio = dax_to_folio(entry);
if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
return;
index = linear_page_index(vma, address & ~(size - 1));
- for_each_mapped_pfn(entry, pfn) {
- struct page *page = pfn_to_page(pfn);
-
- if (shared && page->mapping && page->share) {
- if (page->mapping) {
- page->mapping = NULL;
+ if (shared && (folio->mapping || dax_folio_is_shared(folio))) {
+ if (folio->mapping) {
+ folio->mapping = NULL;
- /*
- * Page has already been mapped into one address
- * space so set the share count.
- */
- page->share = 1;
- }
-
- dax_page_share_get(page);
- } else {
- WARN_ON_ONCE(page->mapping);
- page->mapping = mapping;
- page->index = index + i++;
+ /*
+ * folio has already been mapped into one address
+ * space so set the share count.
+ */
+ folio->share = 1;
}
+
+ dax_folio_share_get(entry);
+ } else {
+ WARN_ON_ONCE(folio->mapping);
+ dax_device_folio_init(entry);
+ folio = dax_to_folio(entry);
+ folio->mapping = mapping;
+ folio->index = index;
}
}
static void dax_disassociate_entry(void *entry, struct address_space *mapping,
- bool trunc)
+ bool trunc)
{
- unsigned long pfn;
+ struct folio *folio = dax_to_folio(entry);
if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
return;
- for_each_mapped_pfn(entry, pfn) {
- struct page *page = pfn_to_page(pfn);
-
- WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
- if (dax_page_is_shared(page)) {
- /* keep the shared flag if this page is still shared */
- if (dax_page_share_put(page) > 0)
- continue;
- } else
- WARN_ON_ONCE(page->mapping && page->mapping != mapping);
- page->mapping = NULL;
- page->index = 0;
- }
+ dax_folio_share_put(folio);
}
static struct page *dax_busy_page(void *entry)
{
- unsigned long pfn;
+ struct folio *folio = dax_to_folio(entry);
- for_each_mapped_pfn(entry, pfn) {
- struct page *page = pfn_to_page(pfn);
+ if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry))
+ return NULL;
- if (page_ref_count(page) > 1)
- return page;
- }
- return NULL;
+ if (folio_ref_count(folio) - folio_mapcount(folio))
+ return &folio->page;
+ else
+ return NULL;
}
/**
@@ -786,7 +821,7 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
EXPORT_SYMBOL_GPL(dax_layout_busy_page);
static int __dax_invalidate_entry(struct address_space *mapping,
- pgoff_t index, bool trunc)
+ pgoff_t index, bool trunc)
{
XA_STATE(xas, &mapping->i_pages, index);
int ret = 0;
@@ -892,7 +927,7 @@ static int wait_page_idle(struct page *page,
void (cb)(struct inode *),
struct inode *inode)
{
- return ___wait_var_event(page, page_ref_count(page) == 1,
+ return ___wait_var_event(page, page_ref_count(page) == 0,
TASK_INTERRUPTIBLE, 0, 0, cb(inode));
}
@@ -900,7 +935,7 @@ static void wait_page_idle_uninterruptible(struct page *page,
void (cb)(struct inode *),
struct inode *inode)
{
- ___wait_var_event(page, page_ref_count(page) == 1,
+ ___wait_var_event(page, page_ref_count(page) == 0,
TASK_UNINTERRUPTIBLE, 0, 0, cb(inode));
}
@@ -949,7 +984,8 @@ void dax_break_mapping_uninterruptible(struct inode *inode,
wait_page_idle_uninterruptible(page, cb, inode);
} while (true);
- dax_delete_mapping_range(inode->i_mapping, 0, LLONG_MAX);
+ if (!page)
+ dax_delete_mapping_range(inode->i_mapping, 0, LLONG_MAX);
}
EXPORT_SYMBOL_GPL(dax_break_mapping_uninterruptible);
@@ -1035,8 +1071,10 @@ static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
void *old;
dax_disassociate_entry(entry, mapping, false);
- dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
- shared);
+ if (!(flags & DAX_ZERO_PAGE))
+ dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
+ shared);
+
/*
* Only swap our new entry into the page cache if the current
* entry is a zero page or an empty entry. If a normal PTE or
@@ -1224,9 +1262,7 @@ static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
goto out;
if (pfn_t_to_pfn(*pfnp) & (PHYS_PFN(size)-1))
goto out;
- /* For larger pages we need devmap */
- if (length > 1 && !pfn_t_devmap(*pfnp))
- goto out;
+
rc = 0;
out_check_addr:
@@ -1333,7 +1369,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
*entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);
- ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
+ ret = vmf_insert_page_mkwrite(vmf, pfn_t_to_page(pfn), false);
trace_dax_load_hole(inode, vmf, ret);
return ret;
}
@@ -1804,7 +1840,8 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
bool write = iter->flags & IOMAP_WRITE;
unsigned long entry_flags = pmd ? DAX_PMD : 0;
- int err = 0;
+ struct folio *folio;
+ int ret, err = 0;
pfn_t pfn;
void *kaddr;
@@ -1836,17 +1873,18 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
return dax_fault_return(err);
}
+ folio = dax_to_folio(*entry);
if (dax_fault_is_synchronous(iter, vmf->vma))
return dax_fault_synchronous_pfnp(pfnp, pfn);
- /* insert PMD pfn */
+ folio_ref_inc(folio);
if (pmd)
- return vmf_insert_pfn_pmd(vmf, pfn, write);
+ ret = vmf_insert_folio_pmd(vmf, pfn_folio(pfn_t_to_pfn(pfn)), write);
+ else
+ ret = vmf_insert_page_mkwrite(vmf, pfn_t_to_page(pfn), write);
+ folio_put(folio);
- /* insert PTE pfn */
- if (write)
- return vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
- return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+ return ret;
}
static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
@@ -2085,6 +2123,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
{
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, order);
+ struct folio *folio;
void *entry;
vm_fault_t ret;
@@ -2102,14 +2141,17 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
xas_set_mark(&xas, PAGECACHE_TAG_DIRTY);
dax_lock_entry(&xas, entry);
xas_unlock_irq(&xas);
+ folio = pfn_folio(pfn_t_to_pfn(pfn));
+ folio_ref_inc(folio);
if (order == 0)
- ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
+ ret = vmf_insert_page_mkwrite(vmf, &folio->page, true);
#ifdef CONFIG_FS_DAX_PMD
else if (order == PMD_ORDER)
- ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
+ ret = vmf_insert_folio_pmd(vmf, folio, FAULT_FLAG_WRITE);
#endif
else
ret = VM_FAULT_FALLBACK;
+ folio_put(folio);
dax_unlock_entry(&xas, entry);
trace_dax_insert_pfn_mkwrite(mapping->host, vmf, ret);
return ret;
diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index 82afe78..2c7b24c 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1017,8 +1017,7 @@ static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
if (kaddr)
*kaddr = fs->window_kaddr + offset;
if (pfn)
- *pfn = phys_to_pfn_t(fs->window_phys_addr + offset,
- PFN_DEV | PFN_MAP);
+ *pfn = phys_to_pfn_t(fs->window_phys_addr + offset, 0);
return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
}
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index c7ec5ab..7bfb4eb 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2740,7 +2740,7 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
* for this nested lock case.
*/
page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
- if (page && page_ref_count(page) != 1) {
+ if (page && page_ref_count(page) != 0) {
xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
goto again;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 7c3773f..dbefea1 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -211,8 +211,12 @@ static inline int dax_wait_page_idle(struct page *page,
void (cb)(struct inode *),
struct inode *inode)
{
- return ___wait_var_event(page, page_ref_count(page) == 1,
+ int ret;
+
+ ret = ___wait_var_event(page, !page_ref_count(page),
TASK_INTERRUPTIBLE, 0, 0, cb(inode));
+
+ return ret;
}
#if IS_ENABLED(CONFIG_DAX)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 01edca9..a734278 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1161,6 +1161,8 @@ int vma_is_stack_for_current(struct vm_area_struct *vma);
struct mmu_gather;
struct inode;
+extern void prep_compound_page(struct page *page, unsigned int order);
+
/*
* compound_order() can be called without holding a reference, which means
* that niceties like page_folio() don't work. These callers should be
@@ -1482,25 +1484,6 @@ vm_fault_t finish_fault(struct vm_fault *vmf);
* back into memory.
*/
-#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_FS_DAX)
-DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
-
-bool __put_devmap_managed_folio_refs(struct folio *folio, int refs);
-static inline bool put_devmap_managed_folio_refs(struct folio *folio, int refs)
-{
- if (!static_branch_unlikely(&devmap_managed_key))
- return false;
- if (!folio_is_zone_device(folio))
- return false;
- return __put_devmap_managed_folio_refs(folio, refs);
-}
-#else /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
-static inline bool put_devmap_managed_folio_refs(struct folio *folio, int refs)
-{
- return false;
-}
-#endif /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
-
/* 127: arbitrary random number, small enough to assemble well */
#define folio_ref_zero_or_close_to_overflow(folio) \
((unsigned int) folio_ref_count(folio) + 127u <= 127u)
@@ -1615,12 +1598,6 @@ static inline void put_page(struct page *page)
{
struct folio *folio = page_folio(page);
- /*
- * For some devmap managed pages we need to catch refcount transition
- * from 2 to 1:
- */
- if (put_devmap_managed_folio_refs(folio, 1))
- return;
folio_put(folio);
}
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 54b59b8..e308cb9 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -295,6 +295,8 @@ typedef struct {
* anonymous memory.
* @index: Offset within the file, in units of pages. For anonymous memory,
* this is the index from the beginning of the mmap.
+ * @share: number of DAX mappings that reference this folio. See
+ * dax_associate_entry.
* @private: Filesystem per-folio data (see folio_attach_private()).
* @swap: Used for swp_entry_t if folio_test_swapcache().
* @_mapcount: Do not access this member directly. Use folio_mapcount() to
@@ -344,7 +346,10 @@ struct folio {
struct dev_pagemap *pgmap;
};
struct address_space *mapping;
- pgoff_t index;
+ union {
+ pgoff_t index;
+ unsigned long share;
+ };
union {
void *private;
swp_entry_t swap;
diff --git a/mm/gup.c b/mm/gup.c
index 9b587b5..d6575ed 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -96,8 +96,7 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
* belongs to this folio.
*/
if (unlikely(page_folio(page) != folio)) {
- if (!put_devmap_managed_folio_refs(folio, refs))
- folio_put_refs(folio, refs);
+ folio_put_refs(folio, refs);
goto retry;
}
@@ -116,8 +115,7 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
refs *= GUP_PIN_COUNTING_BIAS;
}
- if (!put_devmap_managed_folio_refs(folio, refs))
- folio_put_refs(folio, refs);
+ folio_put_refs(folio, refs);
}
/**
@@ -565,8 +563,7 @@ static struct folio *try_grab_folio_fast(struct page *page, int refs,
*/
if (unlikely((flags & FOLL_LONGTERM) &&
!folio_is_longterm_pinnable(folio))) {
- if (!put_devmap_managed_folio_refs(folio, refs))
- folio_put_refs(folio, refs);
+ folio_put_refs(folio, refs);
return NULL;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d1ea76e..0cf1151 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2209,7 +2209,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
tlb->fullmm);
arch_check_zapped_pmd(vma, orig_pmd);
tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
- if (vma_is_special_huge(vma)) {
+ if (!vma_is_dax(vma) && vma_is_special_huge(vma)) {
if (arch_needs_pgtable_deposit())
zap_deposited_table(tlb->mm, pmd);
spin_unlock(ptl);
@@ -2853,13 +2853,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
*/
if (arch_needs_pgtable_deposit())
zap_deposited_table(mm, pmd);
- if (vma_is_special_huge(vma))
+ if (!vma_is_dax(vma) && vma_is_special_huge(vma))
return;
if (unlikely(is_pmd_migration_entry(old_pmd))) {
swp_entry_t entry;
entry = pmd_to_swp_entry(old_pmd);
folio = pfn_swap_entry_folio(entry);
+ } else if (is_huge_zero_pmd(old_pmd)) {
+ return;
} else {
page = pmd_page(old_pmd);
folio = page_folio(page);
diff --git a/mm/internal.h b/mm/internal.h
index 3922788..c4df0ad 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -733,8 +733,6 @@ static inline void prep_compound_tail(struct page *head, int tail_idx)
set_page_private(p, 0);
}
-extern void prep_compound_page(struct page *page, unsigned int order);
-
void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags);
extern bool free_pages_prepare(struct page *page, unsigned int order);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a7b8ccd..7838bf1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -419,18 +419,18 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
pud = pud_offset(p4d, address);
if (!pud_present(*pud))
return 0;
- if (pud_devmap(*pud))
+ if (pud_trans_huge(*pud))
return PUD_SHIFT;
pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
return 0;
- if (pmd_devmap(*pmd))
+ if (pmd_trans_huge(*pmd))
return PMD_SHIFT;
pte = pte_offset_map(pmd, address);
if (!pte)
return 0;
ptent = ptep_get(pte);
- if (pte_present(ptent) && pte_devmap(ptent))
+ if (pte_present(ptent))
ret = PAGE_SHIFT;
pte_unmap(pte);
return ret;
diff --git a/mm/memory.c b/mm/memory.c
index c60b819..02e12b0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3843,13 +3843,15 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
/*
* VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
- * VM_PFNMAP VMA.
+ * VM_PFNMAP VMA. FS DAX also wants ops->pfn_mkwrite called.
*
* We should not cow pages in a shared writeable mapping.
* Just mark the pages writable and/or call ops->pfn_mkwrite.
*/
- if (!vmf->page)
+ if (!vmf->page || is_fsdax_page(vmf->page)) {
+ vmf->page = NULL;
return wp_pfn_shared(vmf);
+ }
return wp_page_shared(vmf, folio);
}
diff --git a/mm/memremap.c b/mm/memremap.c
index 68099af..9a8879b 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -458,8 +458,13 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
void free_zone_device_folio(struct folio *folio)
{
- if (WARN_ON_ONCE(!folio->pgmap->ops ||
- !folio->pgmap->ops->page_free))
+ struct dev_pagemap *pgmap = folio->pgmap;
+
+ if (WARN_ON_ONCE(!pgmap->ops))
+ return;
+
+ if (WARN_ON_ONCE(pgmap->type != MEMORY_DEVICE_FS_DAX &&
+ !pgmap->ops->page_free))
return;
mem_cgroup_uncharge(folio);
@@ -484,26 +489,36 @@ void free_zone_device_folio(struct folio *folio)
* For other types of ZONE_DEVICE pages, migration is either
* handled differently or not done at all, so there is no need
* to clear folio->mapping.
+ *
+ * FS DAX pages clear the mapping when the folio->share count hits
+ * zero, which indicates the page has been removed from the file
+ * system mapping.
*/
- folio->mapping = NULL;
- folio->pgmap->ops->page_free(folio_page(folio, 0));
+ if (pgmap->type != MEMORY_DEVICE_FS_DAX)
+ folio->mapping = NULL;
- switch (folio->pgmap->type) {
+ switch (pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
case MEMORY_DEVICE_COHERENT:
- put_dev_pagemap(folio->pgmap);
+ pgmap->ops->page_free(folio_page(folio, 0));
+ put_dev_pagemap(pgmap);
break;
- case MEMORY_DEVICE_FS_DAX:
case MEMORY_DEVICE_GENERIC:
/*
* Reset the refcount to 1 to prepare for handing out the page
* again.
*/
+ pgmap->ops->page_free(folio_page(folio, 0));
folio_set_count(folio, 1);
break;
+ case MEMORY_DEVICE_FS_DAX:
+ wake_up_var(&folio->page);
+ break;
+
case MEMORY_DEVICE_PCI_P2PDMA:
+ pgmap->ops->page_free(folio_page(folio, 0));
break;
}
}
@@ -519,21 +534,3 @@ void zone_device_page_init(struct page *page)
lock_page(page);
}
EXPORT_SYMBOL_GPL(zone_device_page_init);
-
-#ifdef CONFIG_FS_DAX
-bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
-{
- if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
- return false;
-
- /*
- * fsdax page refcounts are 1-based, rather than 0-based: if
- * refcount is 1, then the page is free and the refcount is
- * stable because nobody holds a reference on the page.
- */
- if (folio_ref_sub_return(folio, refs) == 1)
- wake_up_var(&folio->_refcount);
- return true;
-}
-EXPORT_SYMBOL(__put_devmap_managed_folio_refs);
-#endif /* CONFIG_FS_DAX */
diff --git a/mm/mm_init.c b/mm/mm_init.c
index cb73402..0c12b29 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1017,23 +1017,22 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
}
/*
- * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC and
- * MEMORY_TYPE_FS_DAX pages are released directly to the driver page
- * allocator which will set the page count to 1 when allocating the
- * page.
+ * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC are released
+ * directly to the driver page allocator which will set the page count
+ * to 1 when allocating the page.
*
* MEMORY_TYPE_GENERIC and MEMORY_TYPE_FS_DAX pages automatically have
* their refcount reset to one whenever they are freed (ie. after
* their refcount drops to 0).
*/
switch (pgmap->type) {
+ case MEMORY_DEVICE_FS_DAX:
case MEMORY_DEVICE_PRIVATE:
case MEMORY_DEVICE_COHERENT:
case MEMORY_DEVICE_PCI_P2PDMA:
set_page_count(page, 0);
break;
- case MEMORY_DEVICE_FS_DAX:
case MEMORY_DEVICE_GENERIC:
break;
}
diff --git a/mm/swap.c b/mm/swap.c
index 062c856..a587842 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -952,8 +952,6 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
unlock_page_lruvec_irqrestore(lruvec, flags);
lruvec = NULL;
}
- if (put_devmap_managed_folio_refs(folio, nr_refs))
- continue;
if (folio_ref_sub_and_test(folio, nr_refs))
free_zone_device_folio(folio);
continue;
--
git-series 0.9.1
^ permalink raw reply [flat|nested] 97+ messages in thread
* [PATCH v6 22/26] device/dax: Properly refcount device dax pages when mapping
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (20 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 21/26] fs/dax: Properly refcount fs dax pages Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-14 6:12 ` Dan Williams
2025-01-10 6:00 ` [PATCH v6 23/26] mm: Remove pXX_devmap callers Alistair Popple
` (4 subsequent siblings)
26 siblings, 1 reply; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Device DAX pages are currently not reference counted when mapped,
instead relying on the devmap PTE bit to ensure mapping code will not
get/put references. This requires special handling in various page
table walkers, particularly GUP, to manage references on the
underlying pgmap to ensure the pages remain valid.
However there is no reason these pages can't be refcounted properly at
map time. Doing so eliminates the need for the devmap PTE bit, freeing
up a precious PTE bit. It also simplifies GUP as it no longer needs to
manage the special pgmap references and can instead just treat the
pages normally, as defined by vm_normal_page().
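Conceptually the fault-path conversion looks like this (a before/after
sketch distilled from the __dev_dax_pmd_fault() hunk below, not a
complete function):

	/* Before: special, unrefcounted mapping flagged PFN_DEV|PFN_MAP */
	pfn = phys_to_pfn_t(phys, PFN_DEV | PFN_MAP);
	return vmf_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);

	/* After: insert the folio so the mapping takes normal references */
	pfn = phys_to_pfn_t(phys, 0);
	return vmf_insert_folio_pmd(vmf, page_folio(pfn_t_to_page(pfn)),
				    vmf->flags & FAULT_FLAG_WRITE);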
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
drivers/dax/device.c | 15 +++++++++------
mm/memremap.c | 13 ++++++-------
2 files changed, 15 insertions(+), 13 deletions(-)
diff --git a/drivers/dax/device.c b/drivers/dax/device.c
index 6d74e62..fd22dbf 100644
--- a/drivers/dax/device.c
+++ b/drivers/dax/device.c
@@ -126,11 +126,12 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
return VM_FAULT_SIGBUS;
}
- pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
+ pfn = phys_to_pfn_t(phys, 0);
dax_set_mapping(vmf, pfn, fault_size);
- return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
+ return vmf_insert_page_mkwrite(vmf, pfn_t_to_page(pfn),
+ vmf->flags & FAULT_FLAG_WRITE);
}
static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
@@ -169,11 +170,12 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
return VM_FAULT_SIGBUS;
}
- pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
+ pfn = phys_to_pfn_t(phys, 0);
dax_set_mapping(vmf, pfn, fault_size);
- return vmf_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
+ return vmf_insert_folio_pmd(vmf, page_folio(pfn_t_to_page(pfn)),
+ vmf->flags & FAULT_FLAG_WRITE);
}
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
@@ -214,11 +216,12 @@ static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
return VM_FAULT_SIGBUS;
}
- pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
+ pfn = phys_to_pfn_t(phys, 0);
dax_set_mapping(vmf, pfn, fault_size);
- return vmf_insert_pfn_pud(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
+ return vmf_insert_folio_pud(vmf, page_folio(pfn_t_to_page(pfn)),
+ vmf->flags & FAULT_FLAG_WRITE);
}
#else
static vm_fault_t __dev_dax_pud_fault(struct dev_dax *dev_dax,
diff --git a/mm/memremap.c b/mm/memremap.c
index 9a8879b..532a52a 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -460,11 +460,10 @@ void free_zone_device_folio(struct folio *folio)
{
struct dev_pagemap *pgmap = folio->pgmap;
- if (WARN_ON_ONCE(!pgmap->ops))
- return;
-
- if (WARN_ON_ONCE(pgmap->type != MEMORY_DEVICE_FS_DAX &&
- !pgmap->ops->page_free))
+ if (WARN_ON_ONCE((!pgmap->ops &&
+ pgmap->type != MEMORY_DEVICE_GENERIC) ||
+ (pgmap->ops && !pgmap->ops->page_free &&
+ pgmap->type != MEMORY_DEVICE_FS_DAX)))
return;
mem_cgroup_uncharge(folio);
@@ -494,7 +493,8 @@ void free_zone_device_folio(struct folio *folio)
* zero which indicating the page has been removed from the file
* system mapping.
*/
- if (pgmap->type != MEMORY_DEVICE_FS_DAX)
+ if (pgmap->type != MEMORY_DEVICE_FS_DAX &&
+ pgmap->type != MEMORY_DEVICE_GENERIC)
folio->mapping = NULL;
switch (pgmap->type) {
@@ -509,7 +509,6 @@ void free_zone_device_folio(struct folio *folio)
* Reset the refcount to 1 to prepare for handing out the page
* again.
*/
- pgmap->ops->page_free(folio_page(folio, 0));
folio_set_count(folio, 1);
break;
--
git-series 0.9.1
^ permalink raw reply [flat|nested] 97+ messages in thread
* [PATCH v6 23/26] mm: Remove pXX_devmap callers
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (21 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 22/26] device/dax: Properly refcount device dax pages when mapping Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-14 18:50 ` Dan Williams
2025-01-10 6:00 ` [PATCH v6 24/26] mm: Remove devmap related functions and page table bits Alistair Popple
` (3 subsequent siblings)
26 siblings, 1 reply; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
The devmap PTE special bit was used to detect mappings of FS DAX
pages. This tracking was required to ensure the generic mm did not
manipulate the page reference counts, as FS DAX implemented its own
reference counting scheme.
Now that FS DAX pages have their references counted the same way as
normal pages, this tracking is no longer needed and can be removed.
Almost all existing uses of pmd_devmap() are paired with a check of
pmd_trans_huge(). As pmd_trans_huge() now returns true for FS DAX
pages, dropping the check in these cases doesn't change anything.
However care needs to be taken because pmd_trans_huge() also checks that
a page is not an FS DAX page. This is dealt with either by checking
!vma_is_dax() or relying on the fact that the page pointer was obtained
from a page list. This is possible because zone device pages cannot
appear in any page list due to sharing page->lru with page->pgmap.
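The typical conversion is mechanical, as in this sketch taken from the
pmd_trans_huge_lock() hunk below:

	/* Before: DAX PMDs were only caught via the devmap check */
	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
		return __pmd_trans_huge_lock(pmd, vma);

	/* After: pmd_trans_huge() now also covers DAX PMDs */
	if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd))
		return __pmd_trans_huge_lock(pmd, vma);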
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
arch/powerpc/mm/book3s64/hash_hugepage.c | 2 +-
arch/powerpc/mm/book3s64/hash_pgtable.c | 3 +-
arch/powerpc/mm/book3s64/hugetlbpage.c | 2 +-
arch/powerpc/mm/book3s64/pgtable.c | 10 +-
arch/powerpc/mm/book3s64/radix_pgtable.c | 5 +-
arch/powerpc/mm/pgtable.c | 2 +-
fs/dax.c | 5 +-
fs/userfaultfd.c | 2 +-
include/linux/huge_mm.h | 10 +-
include/linux/pgtable.h | 2 +-
mm/gup.c | 162 +------------------------
mm/hmm.c | 7 +-
mm/huge_memory.c | 71 +----------
mm/khugepaged.c | 2 +-
mm/mapping_dirty_helpers.c | 4 +-
mm/memory.c | 35 +----
mm/migrate_device.c | 2 +-
mm/mprotect.c | 2 +-
mm/mremap.c | 5 +-
mm/page_vma_mapped.c | 5 +-
mm/pagewalk.c | 14 +-
mm/pgtable-generic.c | 7 +-
mm/userfaultfd.c | 5 +-
mm/vmscan.c | 5 +-
24 files changed, 66 insertions(+), 303 deletions(-)
diff --git a/arch/powerpc/mm/book3s64/hash_hugepage.c b/arch/powerpc/mm/book3s64/hash_hugepage.c
index 15d6f3e..cdfd4fe 100644
--- a/arch/powerpc/mm/book3s64/hash_hugepage.c
+++ b/arch/powerpc/mm/book3s64/hash_hugepage.c
@@ -54,7 +54,7 @@ int __hash_page_thp(unsigned long ea, unsigned long access, unsigned long vsid,
/*
* Make sure this is thp or devmap entry
*/
- if (!(old_pmd & (H_PAGE_THP_HUGE | _PAGE_DEVMAP)))
+ if (!(old_pmd & H_PAGE_THP_HUGE))
return 0;
rflags = htab_convert_pte_flags(new_pmd, flags);
diff --git a/arch/powerpc/mm/book3s64/hash_pgtable.c b/arch/powerpc/mm/book3s64/hash_pgtable.c
index 988948d..82d3117 100644
--- a/arch/powerpc/mm/book3s64/hash_pgtable.c
+++ b/arch/powerpc/mm/book3s64/hash_pgtable.c
@@ -195,7 +195,7 @@ unsigned long hash__pmd_hugepage_update(struct mm_struct *mm, unsigned long addr
unsigned long old;
#ifdef CONFIG_DEBUG_VM
- WARN_ON(!hash__pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+ WARN_ON(!hash__pmd_trans_huge(*pmdp));
assert_spin_locked(pmd_lockptr(mm, pmdp));
#endif
@@ -227,7 +227,6 @@ pmd_t hash__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addres
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
VM_BUG_ON(pmd_trans_huge(*pmdp));
- VM_BUG_ON(pmd_devmap(*pmdp));
pmd = *pmdp;
pmd_clear(pmdp);
diff --git a/arch/powerpc/mm/book3s64/hugetlbpage.c b/arch/powerpc/mm/book3s64/hugetlbpage.c
index 83c3361..2bcbbf9 100644
--- a/arch/powerpc/mm/book3s64/hugetlbpage.c
+++ b/arch/powerpc/mm/book3s64/hugetlbpage.c
@@ -74,7 +74,7 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
} while(!pte_xchg(ptep, __pte(old_pte), __pte(new_pte)));
/* Make sure this is a hugetlb entry */
- if (old_pte & (H_PAGE_THP_HUGE | _PAGE_DEVMAP))
+ if (old_pte & H_PAGE_THP_HUGE)
return 0;
rflags = htab_convert_pte_flags(new_pte, flags);
diff --git a/arch/powerpc/mm/book3s64/pgtable.c b/arch/powerpc/mm/book3s64/pgtable.c
index 3745425..916b4ce 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -63,7 +63,7 @@ int pmdp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
{
int changed;
#ifdef CONFIG_DEBUG_VM
- WARN_ON(!pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+ WARN_ON(!pmd_trans_huge(*pmdp));
assert_spin_locked(pmd_lockptr(vma->vm_mm, pmdp));
#endif
changed = !pmd_same(*(pmdp), entry);
@@ -83,7 +83,6 @@ int pudp_set_access_flags(struct vm_area_struct *vma, unsigned long address,
{
int changed;
#ifdef CONFIG_DEBUG_VM
- WARN_ON(!pud_devmap(*pudp));
assert_spin_locked(pud_lockptr(vma->vm_mm, pudp));
#endif
changed = !pud_same(*(pudp), entry);
@@ -205,8 +204,8 @@ pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
{
pmd_t pmd;
VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
- VM_BUG_ON((pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
- !pmd_devmap(*pmdp)) || !pmd_present(*pmdp));
+ VM_BUG_ON((pmd_present(*pmdp) && !pmd_trans_huge(*pmdp)) ||
+ !pmd_present(*pmdp));
pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp);
/*
* if it not a fullmm flush, then we can possibly end up converting
@@ -224,8 +223,7 @@ pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma,
pud_t pud;
VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
- VM_BUG_ON((pud_present(*pudp) && !pud_devmap(*pudp)) ||
- !pud_present(*pudp));
+ VM_BUG_ON(!pud_present(*pudp));
pud = pudp_huge_get_and_clear(vma->vm_mm, addr, pudp);
/*
* if it not a fullmm flush, then we can possibly end up converting
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 311e211..f0b606d 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1412,7 +1412,7 @@ unsigned long radix__pmd_hugepage_update(struct mm_struct *mm, unsigned long add
unsigned long old;
#ifdef CONFIG_DEBUG_VM
- WARN_ON(!radix__pmd_trans_huge(*pmdp) && !pmd_devmap(*pmdp));
+ WARN_ON(!radix__pmd_trans_huge(*pmdp));
assert_spin_locked(pmd_lockptr(mm, pmdp));
#endif
@@ -1429,7 +1429,7 @@ unsigned long radix__pud_hugepage_update(struct mm_struct *mm, unsigned long add
unsigned long old;
#ifdef CONFIG_DEBUG_VM
- WARN_ON(!pud_devmap(*pudp));
+ WARN_ON(!pud_trans_huge(*pudp));
assert_spin_locked(pud_lockptr(mm, pudp));
#endif
@@ -1447,7 +1447,6 @@ pmd_t radix__pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long addre
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
VM_BUG_ON(radix__pmd_trans_huge(*pmdp));
- VM_BUG_ON(pmd_devmap(*pmdp));
/*
* khugepaged calls this for normal pmd
*/
diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 61df5ae..dfaa9fd 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -509,7 +509,7 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
return NULL;
#endif
- if (pmd_trans_huge(pmd) || pmd_devmap(pmd)) {
+ if (pmd_trans_huge(pmd)) {
if (is_thp)
*is_thp = true;
ret_pte = (pte_t *)pmdp;
diff --git a/fs/dax.c b/fs/dax.c
index 19f444e..facddd6 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1928,7 +1928,7 @@ static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
* the PTE we need to set up. If so just return and the fault will be
* retried.
*/
- if (pmd_trans_huge(*vmf->pmd) || pmd_devmap(*vmf->pmd)) {
+ if (pmd_trans_huge(*vmf->pmd)) {
ret = VM_FAULT_NOPAGE;
goto unlock_entry;
}
@@ -2049,8 +2049,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
* the PMD we need to set up. If so just return and the fault will be
* retried.
*/
- if (!pmd_none(*vmf->pmd) && !pmd_trans_huge(*vmf->pmd) &&
- !pmd_devmap(*vmf->pmd)) {
+ if (!pmd_none(*vmf->pmd) && !pmd_trans_huge(*vmf->pmd)) {
ret = 0;
goto unlock_entry;
}
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 7c0bd0b..c52b91f 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -304,7 +304,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
goto out;
ret = false;
- if (!pmd_present(_pmd) || pmd_devmap(_pmd))
+ if (!pmd_present(_pmd) || vma_is_dax(vmf->vma))
goto out;
if (pmd_trans_huge(_pmd)) {
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 3633bd3..9cb5227 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -368,8 +368,7 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
#define split_huge_pmd(__vma, __pmd, __address) \
do { \
pmd_t *____pmd = (__pmd); \
- if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd) \
- || pmd_devmap(*____pmd)) \
+ if (is_swap_pmd(*____pmd) || pmd_trans_huge(*____pmd)) \
__split_huge_pmd(__vma, __pmd, __address, \
false, NULL); \
} while (0)
@@ -395,8 +394,7 @@ change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
#define split_huge_pud(__vma, __pud, __address) \
do { \
pud_t *____pud = (__pud); \
- if (pud_trans_huge(*____pud) \
- || pud_devmap(*____pud)) \
+ if (pud_trans_huge(*____pud)) \
__split_huge_pud(__vma, __pud, __address); \
} while (0)
@@ -419,7 +417,7 @@ static inline int is_swap_pmd(pmd_t pmd)
static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
struct vm_area_struct *vma)
{
- if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd))
+ if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd))
return __pmd_trans_huge_lock(pmd, vma);
else
return NULL;
@@ -427,7 +425,7 @@ static inline spinlock_t *pmd_trans_huge_lock(pmd_t *pmd,
static inline spinlock_t *pud_trans_huge_lock(pud_t *pud,
struct vm_area_struct *vma)
{
- if (pud_trans_huge(*pud) || pud_devmap(*pud))
+ if (pud_trans_huge(*pud))
return __pud_trans_huge_lock(pud, vma);
else
return NULL;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 94d267d..00e4a06 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1635,7 +1635,7 @@ static inline int pud_trans_unstable(pud_t *pud)
defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
pud_t pudval = READ_ONCE(*pud);
- if (pud_none(pudval) || pud_trans_huge(pudval) || pud_devmap(pudval))
+ if (pud_none(pudval) || pud_trans_huge(pudval))
return 1;
if (unlikely(pud_bad(pudval))) {
pud_clear_bad(pud);
diff --git a/mm/gup.c b/mm/gup.c
index d6575ed..95be530 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -678,31 +678,9 @@ static struct page *follow_huge_pud(struct vm_area_struct *vma,
return NULL;
pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
-
- if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) &&
- pud_devmap(pud)) {
- /*
- * device mapped pages can only be returned if the caller
- * will manage the page reference count.
- *
- * At least one of FOLL_GET | FOLL_PIN must be set, so
- * assert that here:
- */
- if (!(flags & (FOLL_GET | FOLL_PIN)))
- return ERR_PTR(-EEXIST);
-
- if (flags & FOLL_TOUCH)
- touch_pud(vma, addr, pudp, flags & FOLL_WRITE);
-
- ctx->pgmap = get_dev_pagemap(pfn, ctx->pgmap);
- if (!ctx->pgmap)
- return ERR_PTR(-EFAULT);
- }
-
page = pfn_to_page(pfn);
- if (!pud_devmap(pud) && !pud_write(pud) &&
- gup_must_unshare(vma, flags, page))
+ if (!pud_write(pud) && gup_must_unshare(vma, flags, page))
return ERR_PTR(-EMLINK);
ret = try_grab_folio(page_folio(page), 1, flags);
@@ -861,8 +839,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
page = vm_normal_page(vma, address, pte);
/*
- * We only care about anon pages in can_follow_write_pte() and don't
- * have to worry about pte_devmap() because they are never anon.
+ * We only care about anon pages in can_follow_write_pte().
*/
if ((flags & FOLL_WRITE) &&
!can_follow_write_pte(pte, page, vma, flags)) {
@@ -870,18 +847,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
goto out;
}
- if (!page && pte_devmap(pte) && (flags & (FOLL_GET | FOLL_PIN))) {
- /*
- * Only return device mapping pages in the FOLL_GET or FOLL_PIN
- * case since they are only valid while holding the pgmap
- * reference.
- */
- *pgmap = get_dev_pagemap(pte_pfn(pte), *pgmap);
- if (*pgmap)
- page = pte_page(pte);
- else
- goto no_page;
- } else if (unlikely(!page)) {
+ if (unlikely(!page)) {
if (flags & FOLL_DUMP) {
/* Avoid special (like zero) pages in core dumps */
page = ERR_PTR(-EFAULT);
@@ -963,14 +929,6 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
return no_page_table(vma, flags, address);
if (!pmd_present(pmdval))
return no_page_table(vma, flags, address);
- if (pmd_devmap(pmdval)) {
- ptl = pmd_lock(mm, pmd);
- page = follow_devmap_pmd(vma, address, pmd, flags, &ctx->pgmap);
- spin_unlock(ptl);
- if (page)
- return page;
- return no_page_table(vma, flags, address);
- }
if (likely(!pmd_leaf(pmdval)))
return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
@@ -2892,7 +2850,7 @@ static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
int *nr)
{
struct dev_pagemap *pgmap = NULL;
- int nr_start = *nr, ret = 0;
+ int ret = 0;
pte_t *ptep, *ptem;
ptem = ptep = pte_offset_map(&pmd, addr);
@@ -2916,16 +2874,7 @@ static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
if (!pte_access_permitted(pte, flags & FOLL_WRITE))
goto pte_unmap;
- if (pte_devmap(pte)) {
- if (unlikely(flags & FOLL_LONGTERM))
- goto pte_unmap;
-
- pgmap = get_dev_pagemap(pte_pfn(pte), pgmap);
- if (unlikely(!pgmap)) {
- gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
- goto pte_unmap;
- }
- } else if (pte_special(pte))
+ if (pte_special(pte))
goto pte_unmap;
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
@@ -2996,91 +2945,6 @@ static int gup_fast_pte_range(pmd_t pmd, pmd_t *pmdp, unsigned long addr,
}
#endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
-#if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static int gup_fast_devmap_leaf(unsigned long pfn, unsigned long addr,
- unsigned long end, unsigned int flags, struct page **pages, int *nr)
-{
- int nr_start = *nr;
- struct dev_pagemap *pgmap = NULL;
-
- do {
- struct folio *folio;
- struct page *page = pfn_to_page(pfn);
-
- pgmap = get_dev_pagemap(pfn, pgmap);
- if (unlikely(!pgmap)) {
- gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
- break;
- }
-
- folio = try_grab_folio_fast(page, 1, flags);
- if (!folio) {
- gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
- break;
- }
- folio_set_referenced(folio);
- pages[*nr] = page;
- (*nr)++;
- pfn++;
- } while (addr += PAGE_SIZE, addr != end);
-
- put_dev_pagemap(pgmap);
- return addr == end;
-}
-
-static int gup_fast_devmap_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr,
- unsigned long end, unsigned int flags, struct page **pages,
- int *nr)
-{
- unsigned long fault_pfn;
- int nr_start = *nr;
-
- fault_pfn = pmd_pfn(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
- if (!gup_fast_devmap_leaf(fault_pfn, addr, end, flags, pages, nr))
- return 0;
-
- if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
- gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
- return 0;
- }
- return 1;
-}
-
-static int gup_fast_devmap_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr,
- unsigned long end, unsigned int flags, struct page **pages,
- int *nr)
-{
- unsigned long fault_pfn;
- int nr_start = *nr;
-
- fault_pfn = pud_pfn(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT);
- if (!gup_fast_devmap_leaf(fault_pfn, addr, end, flags, pages, nr))
- return 0;
-
- if (unlikely(pud_val(orig) != pud_val(*pudp))) {
- gup_fast_undo_dev_pagemap(nr, nr_start, flags, pages);
- return 0;
- }
- return 1;
-}
-#else
-static int gup_fast_devmap_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr,
- unsigned long end, unsigned int flags, struct page **pages,
- int *nr)
-{
- BUILD_BUG();
- return 0;
-}
-
-static int gup_fast_devmap_pud_leaf(pud_t pud, pud_t *pudp, unsigned long addr,
- unsigned long end, unsigned int flags, struct page **pages,
- int *nr)
-{
- BUILD_BUG();
- return 0;
-}
-#endif
-
static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr,
unsigned long end, unsigned int flags, struct page **pages,
int *nr)
@@ -3095,13 +2959,6 @@ static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr,
if (pmd_special(orig))
return 0;
- if (pmd_devmap(orig)) {
- if (unlikely(flags & FOLL_LONGTERM))
- return 0;
- return gup_fast_devmap_pmd_leaf(orig, pmdp, addr, end, flags,
- pages, nr);
- }
-
page = pmd_page(orig);
refs = record_subpages(page, PMD_SIZE, addr, end, pages + *nr);
@@ -3142,13 +2999,6 @@ static int gup_fast_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr,
if (pud_special(orig))
return 0;
- if (pud_devmap(orig)) {
- if (unlikely(flags & FOLL_LONGTERM))
- return 0;
- return gup_fast_devmap_pud_leaf(orig, pudp, addr, end, flags,
- pages, nr);
- }
-
page = pud_page(orig);
refs = record_subpages(page, PUD_SIZE, addr, end, pages + *nr);
@@ -3187,8 +3037,6 @@ static int gup_fast_pgd_leaf(pgd_t orig, pgd_t *pgdp, unsigned long addr,
if (!pgd_access_permitted(orig, flags & FOLL_WRITE))
return 0;
- BUILD_BUG_ON(pgd_devmap(orig));
-
page = pgd_page(orig);
refs = record_subpages(page, PGDIR_SIZE, addr, end, pages + *nr);
diff --git a/mm/hmm.c b/mm/hmm.c
index 082f7b7..285578e 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -298,7 +298,6 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
* fall through and treat it like a normal page.
*/
if (!vm_normal_page(walk->vma, addr, pte) &&
- !pte_devmap(pte) &&
!is_zero_pfn(pte_pfn(pte))) {
if (hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0)) {
pte_unmap(ptep);
@@ -351,7 +350,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
return hmm_pfns_fill(start, end, range, HMM_PFN_ERROR);
}
- if (pmd_devmap(pmd) || pmd_trans_huge(pmd)) {
+ if (pmd_trans_huge(pmd)) {
/*
* No need to take pmd_lock here, even if some other thread
* is splitting the huge pmd we will get that event through
@@ -362,7 +361,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
* values.
*/
pmd = pmdp_get_lockless(pmdp);
- if (!pmd_devmap(pmd) && !pmd_trans_huge(pmd))
+ if (!pmd_trans_huge(pmd))
goto again;
return hmm_vma_handle_pmd(walk, addr, end, hmm_pfns, pmd);
@@ -429,7 +428,7 @@ static int hmm_vma_walk_pud(pud_t *pudp, unsigned long start, unsigned long end,
return hmm_vma_walk_hole(start, end, -1, walk);
}
- if (pud_leaf(pud) && pud_devmap(pud)) {
+ if (pud_leaf(pud) && vma_is_dax(walk->vma)) {
unsigned long i, npages, pfn;
unsigned int required_fault;
unsigned long *hmm_pfns;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0cf1151..0d934eb 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1398,10 +1398,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
}
entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
- if (pfn_t_devmap(pfn))
- entry = pmd_mkdevmap(entry);
- else
- entry = pmd_mkspecial(entry);
+ entry = pmd_mkspecial(entry);
if (write) {
entry = pmd_mkyoung(pmd_mkdirty(entry));
entry = maybe_pmd_mkwrite(entry, vma);
@@ -1440,8 +1437,6 @@ vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write)
* but we need to be consistent with PTEs and architectures that
* can't support a 'special' bit.
*/
- BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
- !pfn_t_devmap(pfn));
BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
(VM_PFNMAP|VM_MIXEDMAP));
BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
@@ -1530,10 +1525,7 @@ static void insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
}
entry = pud_mkhuge(pfn_t_pud(pfn, prot));
- if (pfn_t_devmap(pfn))
- entry = pud_mkdevmap(entry);
- else
- entry = pud_mkspecial(entry);
+ entry = pud_mkspecial(entry);
if (write) {
entry = pud_mkyoung(pud_mkdirty(entry));
entry = maybe_pud_mkwrite(entry, vma);
@@ -1564,8 +1556,6 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write)
* but we need to be consistent with PTEs and architectures that
* can't support a 'special' bit.
*/
- BUG_ON(!(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) &&
- !pfn_t_devmap(pfn));
BUG_ON((vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP)) ==
(VM_PFNMAP|VM_MIXEDMAP));
BUG_ON((vma->vm_flags & VM_PFNMAP) && is_cow_mapping(vma->vm_flags));
@@ -1632,46 +1622,6 @@ void touch_pmd(struct vm_area_struct *vma, unsigned long addr,
update_mmu_cache_pmd(vma, addr, pmd);
}
-struct page *follow_devmap_pmd(struct vm_area_struct *vma, unsigned long addr,
- pmd_t *pmd, int flags, struct dev_pagemap **pgmap)
-{
- unsigned long pfn = pmd_pfn(*pmd);
- struct mm_struct *mm = vma->vm_mm;
- struct page *page;
- int ret;
-
- assert_spin_locked(pmd_lockptr(mm, pmd));
-
- if (flags & FOLL_WRITE && !pmd_write(*pmd))
- return NULL;
-
- if (pmd_present(*pmd) && pmd_devmap(*pmd))
- /* pass */;
- else
- return NULL;
-
- if (flags & FOLL_TOUCH)
- touch_pmd(vma, addr, pmd, flags & FOLL_WRITE);
-
- /*
- * device mapped pages can only be returned if the
- * caller will manage the page reference count.
- */
- if (!(flags & (FOLL_GET | FOLL_PIN)))
- return ERR_PTR(-EEXIST);
-
- pfn += (addr & ~PMD_MASK) >> PAGE_SHIFT;
- *pgmap = get_dev_pagemap(pfn, *pgmap);
- if (!*pgmap)
- return ERR_PTR(-EFAULT);
- page = pfn_to_page(pfn);
- ret = try_grab_folio(page_folio(page), 1, flags);
- if (ret)
- page = ERR_PTR(ret);
-
- return page;
-}
-
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
@@ -1823,7 +1773,7 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
ret = -EAGAIN;
pud = *src_pud;
- if (unlikely(!pud_trans_huge(pud) && !pud_devmap(pud)))
+ if (unlikely(!pud_trans_huge(pud)))
goto out_unlock;
/*
@@ -2665,8 +2615,7 @@ spinlock_t *__pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
{
spinlock_t *ptl;
ptl = pmd_lock(vma->vm_mm, pmd);
- if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) ||
- pmd_devmap(*pmd)))
+ if (likely(is_swap_pmd(*pmd) || pmd_trans_huge(*pmd)))
return ptl;
spin_unlock(ptl);
return NULL;
@@ -2683,7 +2632,7 @@ spinlock_t *__pud_trans_huge_lock(pud_t *pud, struct vm_area_struct *vma)
spinlock_t *ptl;
ptl = pud_lock(vma->vm_mm, pud);
- if (likely(pud_trans_huge(*pud) || pud_devmap(*pud)))
+ if (likely(pud_trans_huge(*pud)))
return ptl;
spin_unlock(ptl);
return NULL;
@@ -2734,7 +2683,7 @@ static void __split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
VM_BUG_ON(haddr & ~HPAGE_PUD_MASK);
VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PUD_SIZE, vma);
- VM_BUG_ON(!pud_trans_huge(*pud) && !pud_devmap(*pud));
+ VM_BUG_ON(!pud_trans_huge(*pud));
count_vm_event(THP_SPLIT_PUD);
@@ -2767,7 +2716,7 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
(address & HPAGE_PUD_MASK) + HPAGE_PUD_SIZE);
mmu_notifier_invalidate_range_start(&range);
ptl = pud_lock(vma->vm_mm, pud);
- if (unlikely(!pud_trans_huge(*pud) && !pud_devmap(*pud)))
+ if (unlikely(!pud_trans_huge(*pud)))
goto out;
__split_huge_pud_locked(vma, pud, range.start);
@@ -2840,8 +2789,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
- VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
- && !pmd_devmap(*pmd));
+ VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));
count_vm_event(THP_SPLIT_PMD);
@@ -3058,8 +3006,7 @@ void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
* require a folio to check the PMD against. Otherwise, there
* is a risk of replacing the wrong folio.
*/
- if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd) ||
- is_pmd_migration_entry(*pmd)) {
+ if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd)) {
if (folio && folio != pmd_folio(*pmd))
return;
__split_huge_pmd_locked(vma, pmd, address, freeze);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 99dc995..aedef75 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -957,8 +957,6 @@ static inline int check_pmd_state(pmd_t *pmd)
return SCAN_PMD_NULL;
if (pmd_trans_huge(pmde))
return SCAN_PMD_MAPPED;
- if (pmd_devmap(pmde))
- return SCAN_PMD_NULL;
if (pmd_bad(pmde))
return SCAN_PMD_NULL;
return SCAN_SUCCEED;
diff --git a/mm/mapping_dirty_helpers.c b/mm/mapping_dirty_helpers.c
index 2f8829b..208b428 100644
--- a/mm/mapping_dirty_helpers.c
+++ b/mm/mapping_dirty_helpers.c
@@ -129,7 +129,7 @@ static int wp_clean_pmd_entry(pmd_t *pmd, unsigned long addr, unsigned long end,
pmd_t pmdval = pmdp_get_lockless(pmd);
/* Do not split a huge pmd, present or migrated */
- if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval)) {
+ if (pmd_trans_huge(pmdval)) {
WARN_ON(pmd_write(pmdval) || pmd_dirty(pmdval));
walk->action = ACTION_CONTINUE;
}
@@ -152,7 +152,7 @@ static int wp_clean_pud_entry(pud_t *pud, unsigned long addr, unsigned long end,
pud_t pudval = READ_ONCE(*pud);
/* Do not split a huge pud */
- if (pud_trans_huge(pudval) || pud_devmap(pudval)) {
+ if (pud_trans_huge(pudval)) {
WARN_ON(pud_write(pudval) || pud_dirty(pudval));
walk->action = ACTION_CONTINUE;
}
diff --git a/mm/memory.c b/mm/memory.c
index 02e12b0..d39d1c5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -603,16 +603,6 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
return NULL;
if (is_zero_pfn(pfn))
return NULL;
- if (pte_devmap(pte))
- /*
- * NOTE: New users of ZONE_DEVICE will not set pte_devmap()
- * and will have refcounts incremented on their struct pages
- * when they are inserted into PTEs, thus they are safe to
- * return here. Legacy ZONE_DEVICE pages that set pte_devmap()
- * do not have refcounts. Example of legacy ZONE_DEVICE is
- * MEMORY_DEVICE_FS_DAX type in pmem or virtio_fs drivers.
- */
- return NULL;
print_bad_pte(vma, addr, pte, NULL);
return NULL;
@@ -690,8 +680,6 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
}
}
- if (pmd_devmap(pmd))
- return NULL;
if (is_huge_zero_pmd(pmd))
return NULL;
if (unlikely(pfn > highest_memmap_pfn))
@@ -1245,8 +1233,7 @@ copy_pmd_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
src_pmd = pmd_offset(src_pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
- || pmd_devmap(*src_pmd)) {
+ if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)) {
int err;
VM_BUG_ON_VMA(next-addr != HPAGE_PMD_SIZE, src_vma);
err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
@@ -1282,7 +1269,7 @@ copy_pud_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
src_pud = pud_offset(src_p4d, addr);
do {
next = pud_addr_end(addr, end);
- if (pud_trans_huge(*src_pud) || pud_devmap(*src_pud)) {
+ if (pud_trans_huge(*src_pud)) {
int err;
VM_BUG_ON_VMA(next-addr != HPAGE_PUD_SIZE, src_vma);
@@ -1797,7 +1784,7 @@ static inline unsigned long zap_pmd_range(struct mmu_gather *tlb,
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
- if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
+ if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
__split_huge_pmd(vma, pmd, addr, false, NULL);
else if (zap_huge_pmd(tlb, vma, pmd, addr)) {
@@ -1839,7 +1826,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
pud = pud_offset(p4d, addr);
do {
next = pud_addr_end(addr, end);
- if (pud_trans_huge(*pud) || pud_devmap(*pud)) {
+ if (pud_trans_huge(*pud)) {
if (next - addr != HPAGE_PUD_SIZE) {
mmap_assert_locked(tlb->mm);
split_huge_pud(vma, pud, addr);
@@ -2454,10 +2441,7 @@ static vm_fault_t insert_pfn(struct vm_area_struct *vma, unsigned long addr,
}
/* Ok, finally just insert the thing.. */
- if (pfn_t_devmap(pfn))
- entry = pte_mkdevmap(pfn_t_pte(pfn, prot));
- else
- entry = pte_mkspecial(pfn_t_pte(pfn, prot));
+ entry = pte_mkspecial(pfn_t_pte(pfn, prot));
if (mkwrite) {
entry = pte_mkyoung(entry);
@@ -2568,8 +2552,6 @@ static bool vm_mixed_ok(struct vm_area_struct *vma, pfn_t pfn, bool mkwrite)
/* these checks mirror the abort conditions in vm_normal_page */
if (vma->vm_flags & VM_MIXEDMAP)
return true;
- if (pfn_t_devmap(pfn))
- return true;
if (pfn_t_special(pfn))
return true;
if (is_zero_pfn(pfn_t_to_pfn(pfn)))
@@ -2601,8 +2583,7 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
* than insert_pfn). If a zero_pfn were inserted into a VM_MIXEDMAP
* without pte special, it would there be refcounted as a normal page.
*/
- if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) &&
- !pfn_t_devmap(pfn) && pfn_t_valid(pfn)) {
+ if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL) && pfn_t_valid(pfn)) {
struct page *page;
/*
@@ -6034,7 +6015,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
pud_t orig_pud = *vmf.pud;
barrier();
- if (pud_trans_huge(orig_pud) || pud_devmap(orig_pud)) {
+ if (pud_trans_huge(orig_pud)) {
/*
* TODO once we support anonymous PUDs: NUMA case and
@@ -6075,7 +6056,7 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
pmd_migration_entry_wait(mm, vmf.pmd);
return 0;
}
- if (pmd_trans_huge(vmf.orig_pmd) || pmd_devmap(vmf.orig_pmd)) {
+ if (pmd_trans_huge(vmf.orig_pmd)) {
if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
return do_huge_pmd_numa_page(&vmf);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 2209070..a721e0d 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -599,7 +599,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
pmdp = pmd_alloc(mm, pudp, addr);
if (!pmdp)
goto abort;
- if (pmd_trans_huge(*pmdp) || pmd_devmap(*pmdp))
+ if (pmd_trans_huge(*pmdp))
goto abort;
if (pte_alloc(mm, pmdp))
goto abort;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 516b1d8..31055a8 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -384,7 +384,7 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
goto next;
_pmd = pmdp_get_lockless(pmd);
- if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd) || pmd_devmap(_pmd)) {
+ if (is_swap_pmd(_pmd) || pmd_trans_huge(_pmd)) {
if ((next - addr != HPAGE_PMD_SIZE) ||
pgtable_split_needed(vma, cp_flags)) {
__split_huge_pmd(vma, pmd, addr, false, NULL);
diff --git a/mm/mremap.c b/mm/mremap.c
index 6047341..96fff18 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -603,7 +603,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
new_pud = alloc_new_pud(vma->vm_mm, vma, new_addr);
if (!new_pud)
break;
- if (pud_trans_huge(*old_pud) || pud_devmap(*old_pud)) {
+ if (pud_trans_huge(*old_pud)) {
if (extent == HPAGE_PUD_SIZE) {
move_pgt_entry(HPAGE_PUD, vma, old_addr, new_addr,
old_pud, new_pud, need_rmap_locks);
@@ -625,8 +625,7 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
if (!new_pmd)
break;
again:
- if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd) ||
- pmd_devmap(*old_pmd)) {
+ if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd)) {
if (extent == HPAGE_PMD_SIZE &&
move_pgt_entry(HPAGE_PMD, vma, old_addr, new_addr,
old_pmd, new_pmd, need_rmap_locks))
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 81839a9..18eadc5 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -242,8 +242,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
*/
pmde = pmdp_get_lockless(pvmw->pmd);
- if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde) ||
- (pmd_present(pmde) && pmd_devmap(pmde))) {
+ if (pmd_trans_huge(pmde) || is_pmd_migration_entry(pmde)) {
pvmw->ptl = pmd_lock(mm, pvmw->pmd);
pmde = *pvmw->pmd;
if (!pmd_present(pmde)) {
@@ -258,7 +257,7 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
return not_found(pvmw);
return true;
}
- if (likely(pmd_trans_huge(pmde) || pmd_devmap(pmde))) {
+ if (likely(pmd_trans_huge(pmde))) {
if (pvmw->flags & PVMW_MIGRATION)
return not_found(pvmw);
if (!check_pmd(pmd_pfn(pmde), pvmw))
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index e478777..6a7eb38 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -143,8 +143,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
* We are ONLY installing, so avoid unnecessarily
* splitting a present huge page.
*/
- if (pmd_present(*pmd) &&
- (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)))
+ if (pmd_present(*pmd) && pmd_trans_huge(*pmd))
continue;
}
@@ -210,8 +209,7 @@ static int walk_pud_range(p4d_t *p4d, unsigned long addr, unsigned long end,
* We are ONLY installing, so avoid unnecessarily
* splitting a present huge page.
*/
- if (pud_present(*pud) &&
- (pud_trans_huge(*pud) || pud_devmap(*pud)))
+ if (pud_present(*pud) && pud_trans_huge(*pud))
continue;
}
@@ -872,7 +870,7 @@ struct folio *folio_walk_start(struct folio_walk *fw,
* TODO: FW_MIGRATION support for PUD migration entries
* once there are relevant users.
*/
- if (!pud_present(pud) || pud_devmap(pud) || pud_special(pud)) {
+ if (!pud_present(pud) || pud_special(pud)) {
spin_unlock(ptl);
goto not_found;
} else if (!pud_leaf(pud)) {
@@ -884,6 +882,12 @@ struct folio *folio_walk_start(struct folio_walk *fw,
* support PUD mappings in VM_PFNMAP|VM_MIXEDMAP VMAs.
*/
page = pud_page(pud);
+
+ if (is_devdax_page(page)) {
+ spin_unlock(ptl);
+ goto not_found;
+ }
+
goto found;
}
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 5a882f2..567e2d0 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -139,8 +139,7 @@ pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
{
pmd_t pmd;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
- VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp) &&
- !pmd_devmap(*pmdp));
+ VM_BUG_ON(pmd_present(*pmdp) && !pmd_trans_huge(*pmdp));
pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);
flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
return pmd;
@@ -153,7 +152,7 @@ pud_t pudp_huge_clear_flush(struct vm_area_struct *vma, unsigned long address,
pud_t pud;
VM_BUG_ON(address & ~HPAGE_PUD_MASK);
- VM_BUG_ON(!pud_trans_huge(*pudp) && !pud_devmap(*pudp));
+ VM_BUG_ON(!pud_trans_huge(*pudp));
pud = pudp_huge_get_and_clear(vma->vm_mm, address, pudp);
flush_pud_tlb_range(vma, address, address + HPAGE_PUD_SIZE);
return pud;
@@ -293,7 +292,7 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
*pmdvalp = pmdval;
if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
goto nomap;
- if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
+ if (unlikely(pmd_trans_huge(pmdval)))
goto nomap;
if (unlikely(pmd_bad(pmdval))) {
pmd_clear_bad(pmd);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 4527c38..a03c6f1 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -790,8 +790,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
* (This includes the case where the PMD used to be THP and
* changed back to none after __pte_alloc().)
*/
- if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval) ||
- pmd_devmap(dst_pmdval))) {
+ if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval))) {
err = -EEXIST;
break;
}
@@ -1694,7 +1693,7 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsigned long dst_start,
ptl = pmd_trans_huge_lock(src_pmd, src_vma);
if (ptl) {
- if (pmd_devmap(*src_pmd)) {
+ if (vma_is_dax(src_vma)) {
spin_unlock(ptl);
err = -ENOENT;
break;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 39886f4..b0e25d1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3366,7 +3366,7 @@ static unsigned long get_pte_pfn(pte_t pte, struct vm_area_struct *vma, unsigned
if (!pte_present(pte) || is_zero_pfn(pfn))
return -1;
- if (WARN_ON_ONCE(pte_devmap(pte) || pte_special(pte)))
+ if (WARN_ON_ONCE(pte_special(pte)))
return -1;
if (!pte_young(pte) && !mm_has_notifiers(vma->vm_mm))
@@ -3391,9 +3391,6 @@ static unsigned long get_pmd_pfn(pmd_t pmd, struct vm_area_struct *vma, unsigned
if (!pmd_present(pmd) || is_huge_zero_pmd(pmd))
return -1;
- if (WARN_ON_ONCE(pmd_devmap(pmd)))
- return -1;
-
if (!pmd_young(pmd) && !mm_has_notifiers(vma->vm_mm))
return -1;
--
git-series 0.9.1
* [PATCH v6 24/26] mm: Remove devmap related functions and page table bits
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (22 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 23/26] mm: Remove pXX_devmap callers Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-11 10:08 ` Huacai Chen
2025-01-14 19:03 ` Dan Williams
2025-01-10 6:00 ` [PATCH v6 25/26] Revert "riscv: mm: Add support for ZONE_DEVICE" Alistair Popple
` (2 subsequent siblings)
26 siblings, 2 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Now that DAX and all other reference counts to ZONE_DEVICE pages are
managed normally there is no need for the special devmap PTE/PMD/PUD
page table bits. So drop all references to these, freeing up a
software defined page table bit on architectures supporting it.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Acked-by: Will Deacon <will@kernel.org> # arm64
---
Documentation/mm/arch_pgtable_helpers.rst | 6 +--
arch/arm64/Kconfig | 1 +-
arch/arm64/include/asm/pgtable-prot.h | 1 +-
arch/arm64/include/asm/pgtable.h | 24 +--------
arch/powerpc/Kconfig | 1 +-
arch/powerpc/include/asm/book3s/64/hash-4k.h | 6 +--
arch/powerpc/include/asm/book3s/64/hash-64k.h | 7 +--
arch/powerpc/include/asm/book3s/64/pgtable.h | 53 +------------------
arch/powerpc/include/asm/book3s/64/radix.h | 14 +-----
arch/x86/Kconfig | 1 +-
arch/x86/include/asm/pgtable.h | 51 +-----------------
arch/x86/include/asm/pgtable_types.h | 5 +--
include/linux/mm.h | 7 +--
include/linux/pfn_t.h | 20 +-------
include/linux/pgtable.h | 19 +------
mm/Kconfig | 4 +-
mm/debug_vm_pgtable.c | 59 +--------------------
mm/hmm.c | 3 +-
18 files changed, 11 insertions(+), 271 deletions(-)
diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst
index af24516..c88c7fa 100644
--- a/Documentation/mm/arch_pgtable_helpers.rst
+++ b/Documentation/mm/arch_pgtable_helpers.rst
@@ -30,8 +30,6 @@ PTE Page Table Helpers
+---------------------------+--------------------------------------------------+
| pte_protnone | Tests a PROT_NONE PTE |
+---------------------------+--------------------------------------------------+
-| pte_devmap | Tests a ZONE_DEVICE mapped PTE |
-+---------------------------+--------------------------------------------------+
| pte_soft_dirty | Tests a soft dirty PTE |
+---------------------------+--------------------------------------------------+
| pte_swp_soft_dirty | Tests a soft dirty swapped PTE |
@@ -104,8 +102,6 @@ PMD Page Table Helpers
+---------------------------+--------------------------------------------------+
| pmd_protnone | Tests a PROT_NONE PMD |
+---------------------------+--------------------------------------------------+
-| pmd_devmap | Tests a ZONE_DEVICE mapped PMD |
-+---------------------------+--------------------------------------------------+
| pmd_soft_dirty | Tests a soft dirty PMD |
+---------------------------+--------------------------------------------------+
| pmd_swp_soft_dirty | Tests a soft dirty swapped PMD |
@@ -177,8 +173,6 @@ PUD Page Table Helpers
+---------------------------+--------------------------------------------------+
| pud_write | Tests a writable PUD |
+---------------------------+--------------------------------------------------+
-| pud_devmap | Tests a ZONE_DEVICE mapped PUD |
-+---------------------------+--------------------------------------------------+
| pud_mkyoung | Creates a young PUD |
+---------------------------+--------------------------------------------------+
| pud_mkold | Creates an old PUD |
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 39310a4..81855d1 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -41,7 +41,6 @@ config ARM64
select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
select ARCH_HAS_NONLEAF_PMD_YOUNG if ARM64_HAFT
- select ARCH_HAS_PTE_DEVMAP
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_HW_PTE_YOUNG
select ARCH_HAS_SETUP_DMA_OPS
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 9f9cf13..49b51df 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -17,7 +17,6 @@
#define PTE_SWP_EXCLUSIVE (_AT(pteval_t, 1) << 2) /* only for swp ptes */
#define PTE_DIRTY (_AT(pteval_t, 1) << 55)
#define PTE_SPECIAL (_AT(pteval_t, 1) << 56)
-#define PTE_DEVMAP (_AT(pteval_t, 1) << 57)
/*
* PTE_PRESENT_INVALID=1 & PTE_VALID=0 indicates that the pte's fields should be
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index f8dac66..ea34e51 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -108,7 +108,6 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
#define pte_user(pte) (!!(pte_val(pte) & PTE_USER))
#define pte_user_exec(pte) (!(pte_val(pte) & PTE_UXN))
#define pte_cont(pte) (!!(pte_val(pte) & PTE_CONT))
-#define pte_devmap(pte) (!!(pte_val(pte) & PTE_DEVMAP))
#define pte_tagged(pte) ((pte_val(pte) & PTE_ATTRINDX_MASK) == \
PTE_ATTRINDX(MT_NORMAL_TAGGED))
@@ -290,11 +289,6 @@ static inline pmd_t pmd_mkcont(pmd_t pmd)
return __pmd(pmd_val(pmd) | PMD_SECT_CONT);
}
-static inline pte_t pte_mkdevmap(pte_t pte)
-{
- return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
-}
-
#ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
static inline int pte_uffd_wp(pte_t pte)
{
@@ -587,14 +581,6 @@ static inline int pmd_trans_huge(pmd_t pmd)
#define pmd_mkhuge(pmd) (__pmd(pmd_val(pmd) & ~PMD_TABLE_BIT))
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define pmd_devmap(pmd) pte_devmap(pmd_pte(pmd))
-#endif
-static inline pmd_t pmd_mkdevmap(pmd_t pmd)
-{
- return pte_pmd(set_pte_bit(pmd_pte(pmd), __pgprot(PTE_DEVMAP)));
-}
-
#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
#define pmd_special(pte) (!!((pmd_val(pte) & PTE_SPECIAL)))
static inline pmd_t pmd_mkspecial(pmd_t pmd)
@@ -1195,16 +1181,6 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
return __ptep_set_access_flags(vma, address, (pte_t *)pmdp,
pmd_pte(entry), dirty);
}
-
-static inline int pud_devmap(pud_t pud)
-{
- return 0;
-}
-
-static inline int pgd_devmap(pgd_t pgd)
-{
- return 0;
-}
#endif
#ifdef CONFIG_PAGE_TABLE_CHECK
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index da0ac66..3e85f89 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -147,7 +147,6 @@ config PPC
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
select ARCH_HAS_PHYS_TO_DMA
select ARCH_HAS_PMEM_API
- select ARCH_HAS_PTE_DEVMAP if PPC_BOOK3S_64
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64
select ARCH_HAS_SET_MEMORY
diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index c3efaca..b0546d3 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -160,12 +160,6 @@ extern pmd_t hash__pmdp_huge_get_and_clear(struct mm_struct *mm,
extern int hash__has_transparent_hugepage(void);
#endif
-static inline pmd_t hash__pmd_mkdevmap(pmd_t pmd)
-{
- BUG();
- return pmd;
-}
-
#endif /* !__ASSEMBLY__ */
#endif /* _ASM_POWERPC_BOOK3S_64_HASH_4K_H */
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index 0bf6fd0..0fb5b7d 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -259,7 +259,7 @@ static inline void mark_hpte_slot_valid(unsigned char *hpte_slot_array,
*/
static inline int hash__pmd_trans_huge(pmd_t pmd)
{
- return !!((pmd_val(pmd) & (_PAGE_PTE | H_PAGE_THP_HUGE | _PAGE_DEVMAP)) ==
+ return !!((pmd_val(pmd) & (_PAGE_PTE | H_PAGE_THP_HUGE)) ==
(_PAGE_PTE | H_PAGE_THP_HUGE));
}
@@ -281,11 +281,6 @@ extern pmd_t hash__pmdp_huge_get_and_clear(struct mm_struct *mm,
extern int hash__has_transparent_hugepage(void);
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-static inline pmd_t hash__pmd_mkdevmap(pmd_t pmd)
-{
- return __pmd(pmd_val(pmd) | (_PAGE_PTE | H_PAGE_THP_HUGE | _PAGE_DEVMAP));
-}
-
#endif /* __ASSEMBLY__ */
#endif /* _ASM_POWERPC_BOOK3S_64_HASH_64K_H */
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 6d98e6f..1d98d0a 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -88,7 +88,6 @@
#define _PAGE_SOFT_DIRTY _RPAGE_SW3 /* software: software dirty tracking */
#define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */
-#define _PAGE_DEVMAP _RPAGE_SW1 /* software: ZONE_DEVICE page */
/*
* Drivers request for cache inhibited pte mapping using _PAGE_NO_CACHE
@@ -109,7 +108,7 @@
*/
#define _HPAGE_CHG_MASK (PTE_RPN_MASK | _PAGE_HPTEFLAGS | _PAGE_DIRTY | \
_PAGE_ACCESSED | H_PAGE_THP_HUGE | _PAGE_PTE | \
- _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
+ _PAGE_SOFT_DIRTY)
/*
* user access blocked by key
*/
@@ -123,7 +122,7 @@
*/
#define _PAGE_CHG_MASK (PTE_RPN_MASK | _PAGE_HPTEFLAGS | _PAGE_DIRTY | \
_PAGE_ACCESSED | _PAGE_SPECIAL | _PAGE_PTE | \
- _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
+ _PAGE_SOFT_DIRTY)
/*
* We define 2 sets of base prot bits, one for basic pages (ie,
@@ -609,24 +608,6 @@ static inline pte_t pte_mkhuge(pte_t pte)
return pte;
}
-static inline pte_t pte_mkdevmap(pte_t pte)
-{
- return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_SPECIAL | _PAGE_DEVMAP));
-}
-
-/*
- * This is potentially called with a pmd as the argument, in which case it's not
- * safe to check _PAGE_DEVMAP unless we also confirm that _PAGE_PTE is set.
- * That's because the bit we use for _PAGE_DEVMAP is not reserved for software
- * use in page directory entries (ie. non-ptes).
- */
-static inline int pte_devmap(pte_t pte)
-{
- __be64 mask = cpu_to_be64(_PAGE_DEVMAP | _PAGE_PTE);
-
- return (pte_raw(pte) & mask) == mask;
-}
-
static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
{
/* FIXME!! check whether this need to be a conditional */
@@ -1380,36 +1361,6 @@ static inline bool arch_needs_pgtable_deposit(void)
}
extern void serialize_against_pte_lookup(struct mm_struct *mm);
-
-static inline pmd_t pmd_mkdevmap(pmd_t pmd)
-{
- if (radix_enabled())
- return radix__pmd_mkdevmap(pmd);
- return hash__pmd_mkdevmap(pmd);
-}
-
-static inline pud_t pud_mkdevmap(pud_t pud)
-{
- if (radix_enabled())
- return radix__pud_mkdevmap(pud);
- BUG();
- return pud;
-}
-
-static inline int pmd_devmap(pmd_t pmd)
-{
- return pte_devmap(pmd_pte(pmd));
-}
-
-static inline int pud_devmap(pud_t pud)
-{
- return pte_devmap(pud_pte(pud));
-}
-
-static inline int pgd_devmap(pgd_t pgd)
-{
- return 0;
-}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
#define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 8f55ff7..df23a82 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -264,7 +264,7 @@ static inline int radix__p4d_bad(p4d_t p4d)
static inline int radix__pmd_trans_huge(pmd_t pmd)
{
- return (pmd_val(pmd) & (_PAGE_PTE | _PAGE_DEVMAP)) == _PAGE_PTE;
+ return (pmd_val(pmd) & _PAGE_PTE) == _PAGE_PTE;
}
static inline pmd_t radix__pmd_mkhuge(pmd_t pmd)
@@ -274,7 +274,7 @@ static inline pmd_t radix__pmd_mkhuge(pmd_t pmd)
static inline int radix__pud_trans_huge(pud_t pud)
{
- return (pud_val(pud) & (_PAGE_PTE | _PAGE_DEVMAP)) == _PAGE_PTE;
+ return (pud_val(pud) & _PAGE_PTE) == _PAGE_PTE;
}
static inline pud_t radix__pud_mkhuge(pud_t pud)
@@ -315,16 +315,6 @@ static inline int radix__has_transparent_pud_hugepage(void)
}
#endif
-static inline pmd_t radix__pmd_mkdevmap(pmd_t pmd)
-{
- return __pmd(pmd_val(pmd) | (_PAGE_PTE | _PAGE_DEVMAP));
-}
-
-static inline pud_t radix__pud_mkdevmap(pud_t pud)
-{
- return __pud(pud_val(pud) | (_PAGE_PTE | _PAGE_DEVMAP));
-}
-
struct vmem_altmap;
struct dev_pagemap;
extern int __meminit radix__vmemmap_create_mapping(unsigned long start,
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 77f001c..acac373 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -97,7 +97,6 @@ config X86
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
select ARCH_HAS_PMEM_API if X86_64
select ARCH_HAS_PREEMPT_LAZY
- select ARCH_HAS_PTE_DEVMAP if X86_64
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_HW_PTE_YOUNG
select ARCH_HAS_NONLEAF_PMD_YOUNG if PGTABLE_LEVELS > 2
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 593f10a..77705be 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -308,16 +308,15 @@ static inline bool pmd_leaf(pmd_t pte)
}
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-/* NOTE: when predicate huge page, consider also pmd_devmap, or use pmd_leaf */
static inline int pmd_trans_huge(pmd_t pmd)
{
- return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
+ return (pmd_val(pmd) & _PAGE_PSE) == _PAGE_PSE;
}
#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
static inline int pud_trans_huge(pud_t pud)
{
- return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
+ return (pud_val(pud) & _PAGE_PSE) == _PAGE_PSE;
}
#endif
@@ -327,24 +326,6 @@ static inline int has_transparent_hugepage(void)
return boot_cpu_has(X86_FEATURE_PSE);
}
-#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
-static inline int pmd_devmap(pmd_t pmd)
-{
- return !!(pmd_val(pmd) & _PAGE_DEVMAP);
-}
-
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static inline int pud_devmap(pud_t pud)
-{
- return !!(pud_val(pud) & _PAGE_DEVMAP);
-}
-#else
-static inline int pud_devmap(pud_t pud)
-{
- return 0;
-}
-#endif
-
#ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
static inline bool pmd_special(pmd_t pmd)
{
@@ -368,12 +349,6 @@ static inline pud_t pud_mkspecial(pud_t pud)
return pud_set_flags(pud, _PAGE_SPECIAL);
}
#endif /* CONFIG_ARCH_SUPPORTS_PUD_PFNMAP */
-
-static inline int pgd_devmap(pgd_t pgd)
-{
- return 0;
-}
-#endif
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
static inline pte_t pte_set_flags(pte_t pte, pteval_t set)
@@ -534,11 +509,6 @@ static inline pte_t pte_mkspecial(pte_t pte)
return pte_set_flags(pte, _PAGE_SPECIAL);
}
-static inline pte_t pte_mkdevmap(pte_t pte)
-{
- return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
-}
-
/* See comments above mksaveddirty_shift() */
static inline pmd_t pmd_mksaveddirty(pmd_t pmd)
{
@@ -610,11 +580,6 @@ static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
return pmd_set_flags(pmd, _PAGE_DIRTY);
}
-static inline pmd_t pmd_mkdevmap(pmd_t pmd)
-{
- return pmd_set_flags(pmd, _PAGE_DEVMAP);
-}
-
static inline pmd_t pmd_mkhuge(pmd_t pmd)
{
return pmd_set_flags(pmd, _PAGE_PSE);
@@ -680,11 +645,6 @@ static inline pud_t pud_mkdirty(pud_t pud)
return pud_mksaveddirty(pud);
}
-static inline pud_t pud_mkdevmap(pud_t pud)
-{
- return pud_set_flags(pud, _PAGE_DEVMAP);
-}
-
static inline pud_t pud_mkhuge(pud_t pud)
{
return pud_set_flags(pud, _PAGE_PSE);
@@ -1012,13 +972,6 @@ static inline int pte_present(pte_t a)
return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
}
-#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
-static inline int pte_devmap(pte_t a)
-{
- return (pte_flags(a) & _PAGE_DEVMAP) == _PAGE_DEVMAP;
-}
-#endif
-
#define pte_accessible pte_accessible
static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
{
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 4b80453..e4c7b51 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -33,7 +33,6 @@
#define _PAGE_BIT_CPA_TEST _PAGE_BIT_SOFTW1
#define _PAGE_BIT_UFFD_WP _PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */
#define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
-#define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4
#ifdef CONFIG_X86_64
#define _PAGE_BIT_SAVED_DIRTY _PAGE_BIT_SOFTW5 /* Saved Dirty bit (leaf) */
@@ -119,11 +118,9 @@
#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
#define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX)
-#define _PAGE_DEVMAP (_AT(u64, 1) << _PAGE_BIT_DEVMAP)
#define _PAGE_SOFTW4 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW4)
#else
#define _PAGE_NX (_AT(pteval_t, 0))
-#define _PAGE_DEVMAP (_AT(pteval_t, 0))
#define _PAGE_SOFTW4 (_AT(pteval_t, 0))
#endif
@@ -152,7 +149,7 @@
#define _COMMON_PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
_PAGE_SPECIAL | _PAGE_ACCESSED | \
_PAGE_DIRTY_BITS | _PAGE_SOFT_DIRTY | \
- _PAGE_DEVMAP | _PAGE_CC | _PAGE_UFFD_WP)
+ _PAGE_CC | _PAGE_UFFD_WP)
#define _PAGE_CHG_MASK (_COMMON_PAGE_CHG_MASK | _PAGE_PAT)
#define _HPAGE_CHG_MASK (_COMMON_PAGE_CHG_MASK | _PAGE_PSE | _PAGE_PAT_LARGE)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a734278..23c4e9b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2769,13 +2769,6 @@ static inline pud_t pud_mkspecial(pud_t pud)
}
#endif /* CONFIG_ARCH_SUPPORTS_PUD_PFNMAP */
-#ifndef CONFIG_ARCH_HAS_PTE_DEVMAP
-static inline int pte_devmap(pte_t pte)
-{
- return 0;
-}
-#endif
-
extern pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
spinlock_t **ptl);
static inline pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h
index 2d91482..0100ad8 100644
--- a/include/linux/pfn_t.h
+++ b/include/linux/pfn_t.h
@@ -97,26 +97,6 @@ static inline pud_t pfn_t_pud(pfn_t pfn, pgprot_t pgprot)
#endif
#endif
-#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
-static inline bool pfn_t_devmap(pfn_t pfn)
-{
- const u64 flags = PFN_DEV|PFN_MAP;
-
- return (pfn.val & flags) == flags;
-}
-#else
-static inline bool pfn_t_devmap(pfn_t pfn)
-{
- return false;
-}
-pte_t pte_mkdevmap(pte_t pte);
-pmd_t pmd_mkdevmap(pmd_t pmd);
-#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
- defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
-pud_t pud_mkdevmap(pud_t pud);
-#endif
-#endif /* CONFIG_ARCH_HAS_PTE_DEVMAP */
-
#ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
static inline bool pfn_t_special(pfn_t pfn)
{
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 00e4a06..1c377de 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1606,21 +1606,6 @@ static inline int pud_write(pud_t pud)
}
#endif /* pud_write */
-#if !defined(CONFIG_ARCH_HAS_PTE_DEVMAP) || !defined(CONFIG_TRANSPARENT_HUGEPAGE)
-static inline int pmd_devmap(pmd_t pmd)
-{
- return 0;
-}
-static inline int pud_devmap(pud_t pud)
-{
- return 0;
-}
-static inline int pgd_devmap(pgd_t pgd)
-{
- return 0;
-}
-#endif
-
#if !defined(CONFIG_TRANSPARENT_HUGEPAGE) || \
!defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
static inline int pud_trans_huge(pud_t pud)
@@ -1875,8 +1860,8 @@ typedef unsigned int pgtbl_mod_mask;
* - It should contain a huge PFN, which points to a huge page larger than
* PAGE_SIZE of the platform. The PFN format isn't important here.
*
- * - It should cover all kinds of huge mappings (e.g., pXd_trans_huge(),
- * pXd_devmap(), or hugetlb mappings).
+ * - It should cover all kinds of huge mappings (i.e. pXd_trans_huge()
+ * or hugetlb mappings).
*/
#ifndef pgd_leaf
#define pgd_leaf(x) false
diff --git a/mm/Kconfig b/mm/Kconfig
index 7949ab1..e1d0981 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1044,9 +1044,6 @@ config ARCH_HAS_CURRENT_STACK_POINTER
register alias named "current_stack_pointer", this config can be
selected.
-config ARCH_HAS_PTE_DEVMAP
- bool
-
config ARCH_HAS_ZONE_DMA_SET
bool
@@ -1064,7 +1061,6 @@ config ZONE_DEVICE
depends on MEMORY_HOTPLUG
depends on MEMORY_HOTREMOVE
depends on SPARSEMEM_VMEMMAP
- depends on ARCH_HAS_PTE_DEVMAP
select XARRAY_MULTI
help
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index bc748f7..cf5ff92 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -348,12 +348,6 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args)
vaddr &= HPAGE_PUD_MASK;
pud = pfn_pud(args->pud_pfn, args->page_prot);
- /*
- * Some architectures have debug checks to make sure
- * huge pud mapping are only found with devmap entries
- * For now test with only devmap entries.
- */
- pud = pud_mkdevmap(pud);
set_pud_at(args->mm, vaddr, args->pudp, pud);
flush_dcache_page(page);
pudp_set_wrprotect(args->mm, vaddr, args->pudp);
@@ -366,7 +360,6 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args)
WARN_ON(!pud_none(pud));
#endif /* __PAGETABLE_PMD_FOLDED */
pud = pfn_pud(args->pud_pfn, args->page_prot);
- pud = pud_mkdevmap(pud);
pud = pud_wrprotect(pud);
pud = pud_mkclean(pud);
set_pud_at(args->mm, vaddr, args->pudp, pud);
@@ -384,7 +377,6 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args)
#endif /* __PAGETABLE_PMD_FOLDED */
pud = pfn_pud(args->pud_pfn, args->page_prot);
- pud = pud_mkdevmap(pud);
pud = pud_mkyoung(pud);
set_pud_at(args->mm, vaddr, args->pudp, pud);
flush_dcache_page(page);
@@ -693,53 +685,6 @@ static void __init pmd_protnone_tests(struct pgtable_debug_args *args)
static void __init pmd_protnone_tests(struct pgtable_debug_args *args) { }
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
-static void __init pte_devmap_tests(struct pgtable_debug_args *args)
-{
- pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot);
-
- pr_debug("Validating PTE devmap\n");
- WARN_ON(!pte_devmap(pte_mkdevmap(pte)));
-}
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static void __init pmd_devmap_tests(struct pgtable_debug_args *args)
-{
- pmd_t pmd;
-
- if (!has_transparent_hugepage())
- return;
-
- pr_debug("Validating PMD devmap\n");
- pmd = pfn_pmd(args->fixed_pmd_pfn, args->page_prot);
- WARN_ON(!pmd_devmap(pmd_mkdevmap(pmd)));
-}
-
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
-static void __init pud_devmap_tests(struct pgtable_debug_args *args)
-{
- pud_t pud;
-
- if (!has_transparent_pud_hugepage())
- return;
-
- pr_debug("Validating PUD devmap\n");
- pud = pfn_pud(args->fixed_pud_pfn, args->page_prot);
- WARN_ON(!pud_devmap(pud_mkdevmap(pud)));
-}
-#else /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
-static void __init pud_devmap_tests(struct pgtable_debug_args *args) { }
-#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
-#else /* CONFIG_TRANSPARENT_HUGEPAGE */
-static void __init pmd_devmap_tests(struct pgtable_debug_args *args) { }
-static void __init pud_devmap_tests(struct pgtable_debug_args *args) { }
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-#else
-static void __init pte_devmap_tests(struct pgtable_debug_args *args) { }
-static void __init pmd_devmap_tests(struct pgtable_debug_args *args) { }
-static void __init pud_devmap_tests(struct pgtable_debug_args *args) { }
-#endif /* CONFIG_ARCH_HAS_PTE_DEVMAP */
-
static void __init pte_soft_dirty_tests(struct pgtable_debug_args *args)
{
pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot);
@@ -1341,10 +1286,6 @@ static int __init debug_vm_pgtable(void)
pte_protnone_tests(&args);
pmd_protnone_tests(&args);
- pte_devmap_tests(&args);
- pmd_devmap_tests(&args);
- pud_devmap_tests(&args);
-
pte_soft_dirty_tests(&args);
pmd_soft_dirty_tests(&args);
pte_swap_soft_dirty_tests(&args);
diff --git a/mm/hmm.c b/mm/hmm.c
index 285578e..2a12879 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -395,8 +395,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
return 0;
}
-#if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && \
- defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+#if defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
static inline unsigned long pud_to_hmm_pfn_flags(struct hmm_range *range,
pud_t pud)
{
--
git-series 0.9.1
* [PATCH v6 25/26] Revert "riscv: mm: Add support for ZONE_DEVICE"
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (23 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 24/26] mm: Remove devmap related functions and page table bits Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-14 19:11 ` Dan Williams
2025-01-10 6:00 ` [PATCH v6 26/26] Revert "LoongArch: Add ARCH_HAS_PTE_DEVMAP support" Alistair Popple
2025-01-10 7:05 ` [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Dan Williams
26 siblings, 1 reply; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch, Björn Töpel
DEVMAP PTEs are no longer required to support ZONE_DEVICE, so remove
them.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Suggested-by: Chunyan Zhang <zhang.lyra@gmail.com>
Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
---
arch/riscv/Kconfig | 1 -
arch/riscv/include/asm/pgtable-64.h | 20 --------------------
arch/riscv/include/asm/pgtable-bits.h | 1 -
arch/riscv/include/asm/pgtable.h | 17 -----------------
4 files changed, 39 deletions(-)
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 7d57186..c285250 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -43,7 +43,6 @@ config RISCV
select ARCH_HAS_PMEM_API
select ARCH_HAS_PREEMPT_LAZY
select ARCH_HAS_PREPARE_SYNC_CORE_CMD
- select ARCH_HAS_PTE_DEVMAP if 64BIT && MMU
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_SET_DIRECT_MAP if MMU
select ARCH_HAS_SET_MEMORY if MMU
diff --git a/arch/riscv/include/asm/pgtable-64.h b/arch/riscv/include/asm/pgtable-64.h
index 0897dd9..8c36a88 100644
--- a/arch/riscv/include/asm/pgtable-64.h
+++ b/arch/riscv/include/asm/pgtable-64.h
@@ -398,24 +398,4 @@ static inline struct page *pgd_page(pgd_t pgd)
#define p4d_offset p4d_offset
p4d_t *p4d_offset(pgd_t *pgd, unsigned long address);
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-static inline int pte_devmap(pte_t pte);
-static inline pte_t pmd_pte(pmd_t pmd);
-
-static inline int pmd_devmap(pmd_t pmd)
-{
- return pte_devmap(pmd_pte(pmd));
-}
-
-static inline int pud_devmap(pud_t pud)
-{
- return 0;
-}
-
-static inline int pgd_devmap(pgd_t pgd)
-{
- return 0;
-}
-#endif
-
#endif /* _ASM_RISCV_PGTABLE_64_H */
diff --git a/arch/riscv/include/asm/pgtable-bits.h b/arch/riscv/include/asm/pgtable-bits.h
index a8f5205..179bd4a 100644
--- a/arch/riscv/include/asm/pgtable-bits.h
+++ b/arch/riscv/include/asm/pgtable-bits.h
@@ -19,7 +19,6 @@
#define _PAGE_SOFT (3 << 8) /* Reserved for software */
#define _PAGE_SPECIAL (1 << 8) /* RSW: 0x1 */
-#define _PAGE_DEVMAP (1 << 9) /* RSW, devmap */
#define _PAGE_TABLE _PAGE_PRESENT
/*
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index d4e99ee..9fa9d13 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -399,13 +399,6 @@ static inline int pte_special(pte_t pte)
return pte_val(pte) & _PAGE_SPECIAL;
}
-#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
-static inline int pte_devmap(pte_t pte)
-{
- return pte_val(pte) & _PAGE_DEVMAP;
-}
-#endif
-
/* static inline pte_t pte_rdprotect(pte_t pte) */
static inline pte_t pte_wrprotect(pte_t pte)
@@ -447,11 +440,6 @@ static inline pte_t pte_mkspecial(pte_t pte)
return __pte(pte_val(pte) | _PAGE_SPECIAL);
}
-static inline pte_t pte_mkdevmap(pte_t pte)
-{
- return __pte(pte_val(pte) | _PAGE_DEVMAP);
-}
-
static inline pte_t pte_mkhuge(pte_t pte)
{
return pte;
@@ -763,11 +751,6 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
return pte_pmd(pte_mkdirty(pmd_pte(pmd)));
}
-static inline pmd_t pmd_mkdevmap(pmd_t pmd)
-{
- return pte_pmd(pte_mkdevmap(pmd_pte(pmd)));
-}
-
static inline void set_pmd_at(struct mm_struct *mm, unsigned long addr,
pmd_t *pmdp, pmd_t pmd)
{
--
git-series 0.9.1
* [PATCH v6 26/26] Revert "LoongArch: Add ARCH_HAS_PTE_DEVMAP support"
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (24 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 25/26] Revert "riscv: mm: Add support for ZONE_DEVICE" Alistair Popple
@ 2025-01-10 6:00 ` Alistair Popple
2025-01-10 7:05 ` [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Dan Williams
26 siblings, 0 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-10 6:00 UTC (permalink / raw)
To: akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
DEVMAP PTEs are no longer required to support ZONE_DEVICE, so remove
them.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
---
arch/loongarch/Kconfig | 1 -
arch/loongarch/include/asm/pgtable-bits.h | 6 ++----
arch/loongarch/include/asm/pgtable.h | 19 -------------------
3 files changed, 2 insertions(+), 24 deletions(-)
diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index 1c4d13a..b7fc27f 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -25,7 +25,6 @@ config LOONGARCH
select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
select ARCH_HAS_PREEMPT_LAZY
- select ARCH_HAS_PTE_DEVMAP
select ARCH_HAS_PTE_SPECIAL
select ARCH_HAS_SET_MEMORY
select ARCH_HAS_SET_DIRECT_MAP
diff --git a/arch/loongarch/include/asm/pgtable-bits.h b/arch/loongarch/include/asm/pgtable-bits.h
index 82cd3a9..21319c1 100644
--- a/arch/loongarch/include/asm/pgtable-bits.h
+++ b/arch/loongarch/include/asm/pgtable-bits.h
@@ -22,7 +22,6 @@
#define _PAGE_PFN_SHIFT 12
#define _PAGE_SWP_EXCLUSIVE_SHIFT 23
#define _PAGE_PFN_END_SHIFT 48
-#define _PAGE_DEVMAP_SHIFT 59
#define _PAGE_PRESENT_INVALID_SHIFT 60
#define _PAGE_NO_READ_SHIFT 61
#define _PAGE_NO_EXEC_SHIFT 62
@@ -36,7 +35,6 @@
#define _PAGE_MODIFIED (_ULCAST_(1) << _PAGE_MODIFIED_SHIFT)
#define _PAGE_PROTNONE (_ULCAST_(1) << _PAGE_PROTNONE_SHIFT)
#define _PAGE_SPECIAL (_ULCAST_(1) << _PAGE_SPECIAL_SHIFT)
-#define _PAGE_DEVMAP (_ULCAST_(1) << _PAGE_DEVMAP_SHIFT)
/* We borrow bit 23 to store the exclusive marker in swap PTEs. */
#define _PAGE_SWP_EXCLUSIVE (_ULCAST_(1) << _PAGE_SWP_EXCLUSIVE_SHIFT)
@@ -76,8 +74,8 @@
#define __READABLE (_PAGE_VALID)
#define __WRITEABLE (_PAGE_DIRTY | _PAGE_WRITE)
-#define _PAGE_CHG_MASK (_PAGE_MODIFIED | _PAGE_SPECIAL | _PAGE_DEVMAP | _PFN_MASK | _CACHE_MASK | _PAGE_PLV)
-#define _HPAGE_CHG_MASK (_PAGE_MODIFIED | _PAGE_SPECIAL | _PAGE_DEVMAP | _PFN_MASK | _CACHE_MASK | _PAGE_PLV | _PAGE_HUGE)
+#define _PAGE_CHG_MASK (_PAGE_MODIFIED | _PAGE_SPECIAL | _PFN_MASK | _CACHE_MASK | _PAGE_PLV)
+#define _HPAGE_CHG_MASK (_PAGE_MODIFIED | _PAGE_SPECIAL | _PFN_MASK | _CACHE_MASK | _PAGE_PLV | _PAGE_HUGE)
#define PAGE_NONE __pgprot(_PAGE_PROTNONE | _PAGE_NO_READ | \
_PAGE_USER | _CACHE_CC)
diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
index da34673..d83b14b 100644
--- a/arch/loongarch/include/asm/pgtable.h
+++ b/arch/loongarch/include/asm/pgtable.h
@@ -410,9 +410,6 @@ static inline int pte_special(pte_t pte) { return pte_val(pte) & _PAGE_SPECIAL;
static inline pte_t pte_mkspecial(pte_t pte) { pte_val(pte) |= _PAGE_SPECIAL; return pte; }
#endif /* CONFIG_ARCH_HAS_PTE_SPECIAL */
-static inline int pte_devmap(pte_t pte) { return !!(pte_val(pte) & _PAGE_DEVMAP); }
-static inline pte_t pte_mkdevmap(pte_t pte) { pte_val(pte) |= _PAGE_DEVMAP; return pte; }
-
#define pte_accessible pte_accessible
static inline unsigned long pte_accessible(struct mm_struct *mm, pte_t a)
{
@@ -547,17 +544,6 @@ static inline pmd_t pmd_mkyoung(pmd_t pmd)
return pmd;
}
-static inline int pmd_devmap(pmd_t pmd)
-{
- return !!(pmd_val(pmd) & _PAGE_DEVMAP);
-}
-
-static inline pmd_t pmd_mkdevmap(pmd_t pmd)
-{
- pmd_val(pmd) |= _PAGE_DEVMAP;
- return pmd;
-}
-
static inline struct page *pmd_page(pmd_t pmd)
{
if (pmd_trans_huge(pmd))
@@ -613,11 +599,6 @@ static inline long pmd_protnone(pmd_t pmd)
#define pmd_leaf(pmd) ((pmd_val(pmd) & _PAGE_HUGE) != 0)
#define pud_leaf(pud) ((pud_val(pud) & _PAGE_HUGE) != 0)
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define pud_devmap(pud) (0)
-#define pgd_devmap(pgd) (0)
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
-
/*
* We provide our own get_unmapped area to cope with the virtual aliasing
* constraints placed on us by the cache architecture.
--
git-series 0.9.1
* Re: [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
` (25 preceding siblings ...)
2025-01-10 6:00 ` [PATCH v6 26/26] Revert "LoongArch: Add ARCH_HAS_PTE_DEVMAP support" Alistair Popple
@ 2025-01-10 7:05 ` Dan Williams
2025-01-11 1:30 ` Andrew Morton
26 siblings, 1 reply; 97+ messages in thread
From: Dan Williams @ 2025-01-10 7:05 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Alistair Popple wrote:
> Main updates since v5:
>
> - Reworked patch 1 based on Dan's feedback.
>
> - Fixed build issues on PPC and when CONFIG_PGTABLE_HAS_HUGE_LEAVES
> is not defined.
>
> - Minor comment formatting and documentation fixes.
>
> - Remove PTE_DEVMAP definitions from Loongarch which were added since
> this series was initially written.
[..]
>
> base-commit: e25c8d66f6786300b680866c0e0139981273feba
If this is going to go through nvdimm.git I will need it against a
mainline tag baseline. Linus will want to see the merge conflicts.
Otherwise, if that merge commit is too messy, or you would rather not
rebase, then it needs to go via one of two options:
- Andrew's tree which is the only tree I know of that can carry
patches relative to linux-next.
- Wait for v6.14-rc1 and get this into nvdimm.git early in the cycle
when the conflict storm will be low.
Last I attempted the merge conflict resolution with v4, the conflicts were
not *that* bad. However, that rebase may need to keep some definitions
around to avoid compile breakage and the need to expand the merge commit
to carry things like the Loongarch PTE_DEVMAP removal. I.e. move some
of the after-the-fact cleanups to a post-merge branch.
* Re: [PATCH v6 05/26] fs/dax: Create a common implementation to break DAX layouts
2025-01-10 6:00 ` [PATCH v6 05/26] fs/dax: Create a common implementation to break DAX layouts Alistair Popple
@ 2025-01-10 16:44 ` Darrick J. Wong
2025-01-13 0:47 ` Alistair Popple
2025-01-13 20:11 ` Dan Williams
` (2 subsequent siblings)
3 siblings, 1 reply; 97+ messages in thread
From: Darrick J. Wong @ 2025-01-10 16:44 UTC (permalink / raw)
To: Alistair Popple
Cc: akpm, dan.j.williams, linux-mm, alison.schofield, lina,
zhang.lyra, gerald.schaefer, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Fri, Jan 10, 2025 at 05:00:33PM +1100, Alistair Popple wrote:
> Prior to freeing a block, file systems supporting FS DAX must check
> that the associated pages are both unmapped from user-space and not
> undergoing DMA or other access from e.g. get_user_pages(). This is
> achieved by unmapping the file range and scanning the FS DAX
> page-cache to see if any pages within the mapping have an elevated
> refcount.
>
> This is done using two functions - dax_layout_busy_page_range() which
> returns a page on which to wait for the refcount to become idle. Rather than
> open-code this, introduce a common implementation to both unmap and
> wait for the page to become idle.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
So now that Dan Carpenter has complained, I guess I should look at
this...
> ---
>
> Changes for v5:
>
> - Don't wait for idle pages on non-DAX mappings
>
> Changes for v4:
>
> - Fixed some build breakage due to missing symbol exports reported by
> John Hubbard (thanks!).
> ---
> fs/dax.c | 33 +++++++++++++++++++++++++++++++++
> fs/ext4/inode.c | 10 +---------
> fs/fuse/dax.c | 27 +++------------------------
> fs/xfs/xfs_inode.c | 23 +++++------------------
> fs/xfs/xfs_inode.h | 2 +-
> include/linux/dax.h | 21 +++++++++++++++++++++
> mm/madvise.c | 8 ++++----
> 7 files changed, 68 insertions(+), 56 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index d010c10..9c3bd07 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -845,6 +845,39 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
> return ret;
> }
>
> +static int wait_page_idle(struct page *page,
> + void (cb)(struct inode *),
> + struct inode *inode)
> +{
> + return ___wait_var_event(page, page_ref_count(page) == 1,
> + TASK_INTERRUPTIBLE, 0, 0, cb(inode));
> +}
> +
> +/*
> + * Unmaps the inode and waits for any DMA to complete prior to deleting the
> + * DAX mapping entries for the range.
> + */
> +int dax_break_mapping(struct inode *inode, loff_t start, loff_t end,
> + void (cb)(struct inode *))
> +{
> + struct page *page;
> + int error;
> +
> + if (!dax_mapping(inode->i_mapping))
> + return 0;
> +
> + do {
> + page = dax_layout_busy_page_range(inode->i_mapping, start, end);
> + if (!page)
> + break;
> +
> + error = wait_page_idle(page, cb, inode);
> + } while (error == 0);
You didn't initialize error to 0, so it could be any value. What if
dax_layout_busy_page_range returns null the first time through the loop?
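(A minimal sketch of the kind of fix being suggested here, purely
illustrative and reusing the names from the hunk quoted above, would be
to start the loop with a defined value:

	int error = 0;

	do {
		page = dax_layout_busy_page_range(inode->i_mapping, start, end);
		if (!page)
			break;

		error = wait_page_idle(page, cb, inode);
	} while (error == 0);

	return error;

so that a NULL page on the first pass returns 0 rather than whatever
happened to be on the stack.)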
> +
> + return error;
> +}
> +EXPORT_SYMBOL_GPL(dax_break_mapping);
> +
> /*
> * Invalidate DAX entry if it is clean.
> */
<I'm no expert, skipping to xfs>
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 42ea203..295730a 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2715,21 +2715,17 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
> struct xfs_inode *ip2)
> {
> int error;
> - bool retry;
> struct page *page;
>
> if (ip1->i_ino > ip2->i_ino)
> swap(ip1, ip2);
>
> again:
> - retry = false;
> /* Lock the first inode */
> xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
> - error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
> - if (error || retry) {
> + error = xfs_break_dax_layouts(VFS_I(ip1));
> + if (error) {
> xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> - if (error == 0 && retry)
> - goto again;
Hmm, so the retry loop has moved into xfs_break_dax_layouts, which means
that we no longer cycle the MMAPLOCK. Why was the lock cycling
unnecessary?
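(For reference, and only as a sketch: the lock cycling has not gone away,
it has moved into the wait callback that xfs passes down to
dax_break_mapping(). Assuming xfs_wait_dax_page keeps its current mainline
shape, it drops and retakes the MMAPLOCK around schedule(), much like the
ext4_wait_dax_page shown later in this thread:

	static void
	xfs_wait_dax_page(
		struct inode		*inode)
	{
		struct xfs_inode	*ip = XFS_I(inode);

		xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
		schedule();
		xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
	}

so each retry still gets an unlock/relock cycle, just inside the common
helper rather than in the xfs caller.)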
> return error;
> }
>
> @@ -2988,19 +2984,11 @@ xfs_wait_dax_page(
>
> int
> xfs_break_dax_layouts(
> - struct inode *inode,
> - bool *retry)
> + struct inode *inode)
> {
> - struct page *page;
> -
> xfs_assert_ilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL);
>
> - page = dax_layout_busy_page(inode->i_mapping);
> - if (!page)
> - return 0;
> -
> - *retry = true;
> - return dax_wait_page_idle(page, xfs_wait_dax_page, inode);
> + return dax_break_mapping_inode(inode, xfs_wait_dax_page);
> }
>
> int
> @@ -3018,8 +3006,7 @@ xfs_break_layouts(
> retry = false;
> switch (reason) {
> case BREAK_UNMAP:
> - error = xfs_break_dax_layouts(inode, &retry);
> - if (error || retry)
> + if (xfs_break_dax_layouts(inode))
dax_break_mapping can return -ERESTARTSYS, right? So doesn't this need
to be:
error = xfs_break_dax_layouts(inode);
if (error)
break;
Hm?
--D
> break;
> fallthrough;
> case BREAK_WRITE:
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index 1648dc5..c4f03f6 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -593,7 +593,7 @@ xfs_itruncate_extents(
> return xfs_itruncate_extents_flags(tpp, ip, whichfork, new_size, 0);
> }
>
> -int xfs_break_dax_layouts(struct inode *inode, bool *retry);
> +int xfs_break_dax_layouts(struct inode *inode);
> int xfs_break_layouts(struct inode *inode, uint *iolock,
> enum layout_break_reason reason);
>
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 9b1ce98..f6583d3 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -228,6 +228,20 @@ static inline void dax_read_unlock(int id)
> {
> }
> #endif /* CONFIG_DAX */
> +
> +#if !IS_ENABLED(CONFIG_FS_DAX)
> +static inline int __must_check dax_break_mapping(struct inode *inode,
> + loff_t start, loff_t end, void (cb)(struct inode *))
> +{
> + return 0;
> +}
> +
> +static inline void dax_break_mapping_uninterruptible(struct inode *inode,
> + void (cb)(struct inode *))
> +{
> +}
> +#endif
> +
> bool dax_alive(struct dax_device *dax_dev);
> void *dax_get_private(struct dax_device *dax_dev);
> long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
> @@ -251,6 +265,13 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
> int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
> int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
> pgoff_t index);
> +int __must_check dax_break_mapping(struct inode *inode, loff_t start,
> + loff_t end, void (cb)(struct inode *));
> +static inline int __must_check dax_break_mapping_inode(struct inode *inode,
> + void (cb)(struct inode *))
> +{
> + return dax_break_mapping(inode, 0, LLONG_MAX, cb);
> +}
> int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> struct inode *dest, loff_t destoff,
> loff_t len, bool *is_same,
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 49f3a75..1f4c99e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1063,7 +1063,7 @@ static int guard_install_pud_entry(pud_t *pud, unsigned long addr,
> pud_t pudval = pudp_get(pud);
>
> /* If huge return >0 so we abort the operation + zap. */
> - return pud_trans_huge(pudval) || pud_devmap(pudval);
> + return pud_trans_huge(pudval);
> }
>
> static int guard_install_pmd_entry(pmd_t *pmd, unsigned long addr,
> @@ -1072,7 +1072,7 @@ static int guard_install_pmd_entry(pmd_t *pmd, unsigned long addr,
> pmd_t pmdval = pmdp_get(pmd);
>
> /* If huge return >0 so we abort the operation + zap. */
> - return pmd_trans_huge(pmdval) || pmd_devmap(pmdval);
> + return pmd_trans_huge(pmdval);
> }
>
> static int guard_install_pte_entry(pte_t *pte, unsigned long addr,
> @@ -1183,7 +1183,7 @@ static int guard_remove_pud_entry(pud_t *pud, unsigned long addr,
> pud_t pudval = pudp_get(pud);
>
> /* If huge, cannot have guard pages present, so no-op - skip. */
> - if (pud_trans_huge(pudval) || pud_devmap(pudval))
> + if (pud_trans_huge(pudval))
> walk->action = ACTION_CONTINUE;
>
> return 0;
> @@ -1195,7 +1195,7 @@ static int guard_remove_pmd_entry(pmd_t *pmd, unsigned long addr,
> pmd_t pmdval = pmdp_get(pmd);
>
> /* If huge, cannot have guard pages present, so no-op - skip. */
> - if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval))
> + if (pmd_trans_huge(pmdval))
> walk->action = ACTION_CONTINUE;
>
> return 0;
> --
> git-series 0.9.1
>
* Re: [PATCH v6 07/26] fs/dax: Ensure all pages are idle prior to filesystem unmount
2025-01-10 6:00 ` [PATCH v6 07/26] fs/dax: Ensure all pages are idle prior to filesystem unmount Alistair Popple
@ 2025-01-10 16:50 ` Darrick J. Wong
2025-01-13 0:57 ` Alistair Popple
2025-01-13 23:42 ` Dan Williams
1 sibling, 1 reply; 97+ messages in thread
From: Darrick J. Wong @ 2025-01-10 16:50 UTC (permalink / raw)
To: Alistair Popple
Cc: akpm, dan.j.williams, linux-mm, alison.schofield, lina,
zhang.lyra, gerald.schaefer, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Fri, Jan 10, 2025 at 05:00:35PM +1100, Alistair Popple wrote:
> File systems call dax_break_mapping() prior to reallocating file
> system blocks to ensure the page is not undergoing any DMA or other
> accesses. Generally this is needed when a file is truncated to ensure
> that if a block is reallocated nothing is writing to it. However
> filesystems currently don't call this when an FS DAX inode is evicted.
>
> This can cause problems when the file system is unmounted as a page
> can continue to be undergoing DMA or other remote access after
> unmount. This means if the file system is remounted any truncate or
> other operation which requires the underlying file system block to be
> freed will not wait for the remote access to complete. Therefore a
> busy block may be reallocated to a new file leading to corruption.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>
> ---
>
> Changes for v5:
>
> - Don't wait for pages to be idle in non-DAX mappings
> ---
> fs/dax.c | 29 +++++++++++++++++++++++++++++
> fs/ext4/inode.c | 32 ++++++++++++++------------------
> fs/xfs/xfs_inode.c | 9 +++++++++
> fs/xfs/xfs_inode.h | 1 +
> fs/xfs/xfs_super.c | 18 ++++++++++++++++++
> include/linux/dax.h | 2 ++
> 6 files changed, 73 insertions(+), 18 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 7008a73..4e49cc4 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -883,6 +883,14 @@ static int wait_page_idle(struct page *page,
> TASK_INTERRUPTIBLE, 0, 0, cb(inode));
> }
>
> +static void wait_page_idle_uninterruptible(struct page *page,
> + void (cb)(struct inode *),
> + struct inode *inode)
> +{
> + ___wait_var_event(page, page_ref_count(page) == 1,
> + TASK_UNINTERRUPTIBLE, 0, 0, cb(inode));
> +}
> +
> /*
> * Unmaps the inode and waits for any DMA to complete prior to deleting the
> * DAX mapping entries for the range.
> @@ -911,6 +919,27 @@ int dax_break_mapping(struct inode *inode, loff_t start, loff_t end,
> }
> EXPORT_SYMBOL_GPL(dax_break_mapping);
>
> +void dax_break_mapping_uninterruptible(struct inode *inode,
> + void (cb)(struct inode *))
> +{
> + struct page *page;
> +
> + if (!dax_mapping(inode->i_mapping))
> + return;
> +
> + do {
> + page = dax_layout_busy_page_range(inode->i_mapping, 0,
> + LLONG_MAX);
> + if (!page)
> + break;
> +
> + wait_page_idle_uninterruptible(page, cb, inode);
> + } while (true);
> +
> + dax_delete_mapping_range(inode->i_mapping, 0, LLONG_MAX);
> +}
> +EXPORT_SYMBOL_GPL(dax_break_mapping_uninterruptible);
> +
> /*
> * Invalidate DAX entry if it is clean.
> */
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index ee8e83f..fa35161 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -163,6 +163,18 @@ int ext4_inode_is_fast_symlink(struct inode *inode)
> (inode->i_size < EXT4_N_BLOCKS * 4);
> }
>
> +static void ext4_wait_dax_page(struct inode *inode)
> +{
> + filemap_invalidate_unlock(inode->i_mapping);
> + schedule();
> + filemap_invalidate_lock(inode->i_mapping);
> +}
> +
> +int ext4_break_layouts(struct inode *inode)
> +{
> + return dax_break_mapping_inode(inode, ext4_wait_dax_page);
> +}
> +
> /*
> * Called at the last iput() if i_nlink is zero.
> */
> @@ -181,6 +193,8 @@ void ext4_evict_inode(struct inode *inode)
>
> trace_ext4_evict_inode(inode);
>
> + dax_break_mapping_uninterruptible(inode, ext4_wait_dax_page);
> +
> if (EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL)
> ext4_evict_ea_inode(inode);
> if (inode->i_nlink) {
> @@ -3902,24 +3916,6 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
> return ret;
> }
>
> -static void ext4_wait_dax_page(struct inode *inode)
> -{
> - filemap_invalidate_unlock(inode->i_mapping);
> - schedule();
> - filemap_invalidate_lock(inode->i_mapping);
> -}
> -
> -int ext4_break_layouts(struct inode *inode)
> -{
> - struct page *page;
> - int error;
> -
> - if (WARN_ON_ONCE(!rwsem_is_locked(&inode->i_mapping->invalidate_lock)))
> - return -EINVAL;
> -
> - return dax_break_mapping_inode(inode, ext4_wait_dax_page);
> -}
> -
> /*
> * ext4_punch_hole: punches a hole in a file by releasing the blocks
> * associated with the given offset and length
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 4410b42..c7ec5ab 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2997,6 +2997,15 @@ xfs_break_dax_layouts(
> return dax_break_mapping_inode(inode, xfs_wait_dax_page);
> }
>
> +void
> +xfs_break_dax_layouts_uninterruptible(
> + struct inode *inode)
> +{
> + xfs_assert_ilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL);
> +
> + dax_break_mapping_uninterruptible(inode, xfs_wait_dax_page);
> +}
> +
> int
> xfs_break_layouts(
> struct inode *inode,
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index c4f03f6..613797a 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -594,6 +594,7 @@ xfs_itruncate_extents(
> }
>
> int xfs_break_dax_layouts(struct inode *inode);
> +void xfs_break_dax_layouts_uninterruptible(struct inode *inode);
> int xfs_break_layouts(struct inode *inode, uint *iolock,
> enum layout_break_reason reason);
>
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 8524b9d..73ec060 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -751,6 +751,23 @@ xfs_fs_drop_inode(
> return generic_drop_inode(inode);
> }
>
> +STATIC void
> +xfs_fs_evict_inode(
> + struct inode *inode)
> +{
> + struct xfs_inode *ip = XFS_I(inode);
> + uint iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> +
> + if (IS_DAX(inode)) {
> + xfs_ilock(ip, iolock);
> + xfs_break_dax_layouts_uninterruptible(inode);
> + xfs_iunlock(ip, iolock);
If we're evicting the inode, why is it necessary to take i_rwsem and the
mmap invalidation lock? Shouldn't the evicting thread be the only one
with access to this inode?
--D
> + }
> +
> + truncate_inode_pages_final(&inode->i_data);
> + clear_inode(inode);
> +}
> +
> static void
> xfs_mount_free(
> struct xfs_mount *mp)
> @@ -1189,6 +1206,7 @@ static const struct super_operations xfs_super_operations = {
> .destroy_inode = xfs_fs_destroy_inode,
> .dirty_inode = xfs_fs_dirty_inode,
> .drop_inode = xfs_fs_drop_inode,
> + .evict_inode = xfs_fs_evict_inode,
> .put_super = xfs_fs_put_super,
> .sync_fs = xfs_fs_sync_fs,
> .freeze_fs = xfs_fs_freeze,
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index ef9e02c..7c3773f 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -274,6 +274,8 @@ static inline int __must_check dax_break_mapping_inode(struct inode *inode,
> {
> return dax_break_mapping(inode, 0, LLONG_MAX, cb);
> }
> +void dax_break_mapping_uninterruptible(struct inode *inode,
> + void (cb)(struct inode *));
> int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> struct inode *dest, loff_t destoff,
> loff_t len, bool *is_same,
> --
> git-series 0.9.1
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 21/26] fs/dax: Properly refcount fs dax pages
2025-01-10 6:00 ` [PATCH v6 21/26] fs/dax: Properly refcount fs dax pages Alistair Popple
@ 2025-01-10 16:54 ` Darrick J. Wong
2025-01-13 3:18 ` Alistair Popple
2025-01-14 3:35 ` Dan Williams
1 sibling, 1 reply; 97+ messages in thread
From: Darrick J. Wong @ 2025-01-10 16:54 UTC (permalink / raw)
To: Alistair Popple
Cc: akpm, dan.j.williams, linux-mm, alison.schofield, lina,
zhang.lyra, gerald.schaefer, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Fri, Jan 10, 2025 at 05:00:49PM +1100, Alistair Popple wrote:
> Currently fs dax pages are considered free when the refcount drops to
> one and their refcounts are not increased when mapped via PTEs or
> decreased when unmapped. This requires special logic in mm paths to
> detect that these pages are not refcounted in the normal way, and to
> detect when the refcount drops to one instead of zero.
>
> On the other hand get_user_pages(), etc. will properly refcount fs dax
> pages by taking a reference and dropping it when the page is
> unpinned.
>
> Tracking this special behaviour requires extra PTE bits
> (eg. pte_devmap) and introduces rules that are potentially confusing
> and specific to FS DAX pages. To fix this, and to possibly allow
> removal of the special PTE bits in future, convert the fs dax page
> refcounts to be zero based and instead take a reference on the page
> each time it is mapped as is currently the case for normal pages.
>
> This may also allow a future clean-up to remove the pgmap refcounting
> that is currently done in mm/gup.c.
>
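The fault-path side of this is easiest to see in the dax_fault_iter() hunk further down; roughly (a simplified sketch of that hunk, not new code), the folio is pinned across the insert while the vmf_insert_*() helpers account the mapping's reference the same way they would for any other page:

	folio = dax_to_folio(*entry);
	folio_ref_inc(folio);		/* keep the folio alive over the insert */
	if (pmd)
		ret = vmf_insert_folio_pmd(vmf, folio, write);
	else
		ret = vmf_insert_page_mkwrite(vmf, &folio->page, write);
	folio_put(folio);		/* drop the transient reference */
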
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>
> ---
>
> Changes since v2:
>
> Based on some questions from Dan I attempted to have the FS DAX page
> cache (ie. address space) hold a reference to the folio whilst it was
> mapped. However I came to the strong conclusion that this was not the
> right thing to do.
>
> If the page refcount == 0 it means the page is:
>
> 1. not mapped into user-space
> 2. not subject to other access via DMA/GUP/etc.
>
> Ie. From the core MM perspective the page is not in use.
>
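Concretely, with a zero-based count the DMA-idle test in the layout-break path becomes a plain wait for the refcount to reach zero (this mirrors the wait_page_idle() hunk below, not new behaviour):

	___wait_var_event(page, page_ref_count(page) == 0,
			  TASK_INTERRUPTIBLE, 0, 0, cb(inode));
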
> The fact a page may or may not be present in one or more address space
> mappings is irrelevant for core MM. It just means the page is still in
> use or valid from the file system perspective, and it's a
> responsibility of the file system to remove these mappings if the pfn
> mapping becomes invalid (along with first making sure the MM state,
> ie. page->refcount, is idle). So we shouldn't be trying to track that
> lifetime with MM refcounts.
>
> Doing so just makes DMA-idle tracking more complex because there is
> now another thing (one or more address spaces) which can hold
> references on a page. And FS DAX can't even keep track of all the
> address spaces which might contain a reference to the page in the
> XFS/reflink case anyway.
>
> We could do this if we made file systems invalidate all address space
> mappings prior to calling dax_break_layouts(), but that isn't
> currently necessary and would lead to increased faults just so we
> could do some superfluous refcounting which the file system already
> does.
>
> I have however put the page sharing checks and WARN_ON's back which
> also turned out to be useful for figuring out when to re-initialise
> a folio.
> ---
> drivers/nvdimm/pmem.c | 4 +-
> fs/dax.c | 212 +++++++++++++++++++++++-----------------
> fs/fuse/virtio_fs.c | 3 +-
> fs/xfs/xfs_inode.c | 2 +-
> include/linux/dax.h | 6 +-
> include/linux/mm.h | 27 +-----
> include/linux/mm_types.h | 7 +-
> mm/gup.c | 9 +--
> mm/huge_memory.c | 6 +-
> mm/internal.h | 2 +-
> mm/memory-failure.c | 6 +-
> mm/memory.c | 6 +-
> mm/memremap.c | 47 ++++-----
> mm/mm_init.c | 9 +--
> mm/swap.c | 2 +-
> 15 files changed, 183 insertions(+), 165 deletions(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index d81faa9..785b2d2 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -513,7 +513,7 @@ static int pmem_attach_disk(struct device *dev,
>
> pmem->disk = disk;
> pmem->pgmap.owner = pmem;
> - pmem->pfn_flags = PFN_DEV;
> + pmem->pfn_flags = 0;
> if (is_nd_pfn(dev)) {
> pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
> pmem->pgmap.ops = &fsdax_pagemap_ops;
> @@ -522,7 +522,6 @@ static int pmem_attach_disk(struct device *dev,
> pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
> pmem->pfn_pad = resource_size(res) -
> range_len(&pmem->pgmap.range);
> - pmem->pfn_flags |= PFN_MAP;
> bb_range = pmem->pgmap.range;
> bb_range.start += pmem->data_offset;
> } else if (pmem_should_map_pages(dev)) {
> @@ -532,7 +531,6 @@ static int pmem_attach_disk(struct device *dev,
> pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
> pmem->pgmap.ops = &fsdax_pagemap_ops;
> addr = devm_memremap_pages(dev, &pmem->pgmap);
> - pmem->pfn_flags |= PFN_MAP;
> bb_range = pmem->pgmap.range;
> } else {
> addr = devm_memremap(dev, pmem->phys_addr,
> diff --git a/fs/dax.c b/fs/dax.c
> index d35dbe1..19f444e 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -71,6 +71,11 @@ static unsigned long dax_to_pfn(void *entry)
> return xa_to_value(entry) >> DAX_SHIFT;
> }
>
> +static struct folio *dax_to_folio(void *entry)
> +{
> + return page_folio(pfn_to_page(dax_to_pfn(entry)));
> +}
> +
> static void *dax_make_entry(pfn_t pfn, unsigned long flags)
> {
> return xa_mk_value(flags | (pfn_t_to_pfn(pfn) << DAX_SHIFT));
> @@ -338,44 +343,88 @@ static unsigned long dax_entry_size(void *entry)
> return PAGE_SIZE;
> }
>
> -static unsigned long dax_end_pfn(void *entry)
> -{
> - return dax_to_pfn(entry) + dax_entry_size(entry) / PAGE_SIZE;
> -}
> -
> -/*
> - * Iterate through all mapped pfns represented by an entry, i.e. skip
> - * 'empty' and 'zero' entries.
> - */
> -#define for_each_mapped_pfn(entry, pfn) \
> - for (pfn = dax_to_pfn(entry); \
> - pfn < dax_end_pfn(entry); pfn++)
> -
> /*
> * A DAX page is considered shared if it has no mapping set and ->share (which
> * shares the ->index field) is non-zero. Note this may return false even if the
> * page is shared between multiple files but has not yet actually been mapped
> * into multiple address spaces.
> */
> -static inline bool dax_page_is_shared(struct page *page)
> +static inline bool dax_folio_is_shared(struct folio *folio)
> {
> - return !page->mapping && page->share;
> + return !folio->mapping && folio->share;
> }
>
> /*
> - * Increase the page share refcount, warning if the page is not marked as shared.
> + * Increase the folio share refcount, warning if the folio is not marked as shared.
> */
> -static inline void dax_page_share_get(struct page *page)
> +static inline void dax_folio_share_get(void *entry)
> {
> - WARN_ON_ONCE(!page->share);
> - WARN_ON_ONCE(page->mapping);
> - page->share++;
> + struct folio *folio = dax_to_folio(entry);
> +
> + WARN_ON_ONCE(!folio->share);
> + WARN_ON_ONCE(folio->mapping);
> + WARN_ON_ONCE(dax_entry_order(entry) != folio_order(folio));
> + folio->share++;
> +}
> +
> +static inline unsigned long dax_folio_share_put(struct folio *folio)
> +{
> + unsigned long ref;
> +
> + if (!dax_folio_is_shared(folio))
> + ref = 0;
> + else
> + ref = --folio->share;
> +
> + WARN_ON_ONCE(ref < 0);
> + if (!ref) {
> + folio->mapping = NULL;
> + if (folio_order(folio)) {
> + struct dev_pagemap *pgmap = page_pgmap(&folio->page);
> + unsigned int order = folio_order(folio);
> + unsigned int i;
> +
> + for (i = 0; i < (1UL << order); i++) {
> + struct page *page = folio_page(folio, i);
> +
> + ClearPageHead(page);
> + clear_compound_head(page);
> +
> + /*
> + * Reset pgmap which was over-written by
> + * prep_compound_page().
> + */
> + page_folio(page)->pgmap = pgmap;
> +
> + /* Make sure this isn't set to TAIL_MAPPING */
> + page->mapping = NULL;
> + page->share = 0;
> + WARN_ON_ONCE(page_ref_count(page));
> + }
> + }
> + }
> +
> + return ref;
> }
>
> -static inline unsigned long dax_page_share_put(struct page *page)
> +static void dax_device_folio_init(void *entry)
> {
> - WARN_ON_ONCE(!page->share);
> - return --page->share;
> + struct folio *folio = dax_to_folio(entry);
> + int order = dax_entry_order(entry);
> +
> + /*
> + * Folio should have been split back to order-0 pages in
> + * dax_folio_share_put() when they were removed from their
> + * final mapping.
> + */
> + WARN_ON_ONCE(folio_order(folio));
> +
> + if (order > 0) {
> + prep_compound_page(&folio->page, order);
> + if (order > 1)
> + INIT_LIST_HEAD(&folio->_deferred_list);
> + WARN_ON_ONCE(folio_ref_count(folio));
> + }
> }
>
> /*
> @@ -388,72 +437,58 @@ static inline unsigned long dax_page_share_put(struct page *page)
> * dax_holder_operations.
> */
> static void dax_associate_entry(void *entry, struct address_space *mapping,
> - struct vm_area_struct *vma, unsigned long address, bool shared)
> + struct vm_area_struct *vma, unsigned long address, bool shared)
> {
> - unsigned long size = dax_entry_size(entry), pfn, index;
> - int i = 0;
> + unsigned long size = dax_entry_size(entry), index;
> + struct folio *folio = dax_to_folio(entry);
>
> if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> return;
>
> index = linear_page_index(vma, address & ~(size - 1));
> - for_each_mapped_pfn(entry, pfn) {
> - struct page *page = pfn_to_page(pfn);
> -
> - if (shared && page->mapping && page->share) {
> - if (page->mapping) {
> - page->mapping = NULL;
> + if (shared && (folio->mapping || dax_folio_is_shared(folio))) {
> + if (folio->mapping) {
> + folio->mapping = NULL;
>
> - /*
> - * Page has already been mapped into one address
> - * space so set the share count.
> - */
> - page->share = 1;
> - }
> -
> - dax_page_share_get(page);
> - } else {
> - WARN_ON_ONCE(page->mapping);
> - page->mapping = mapping;
> - page->index = index + i++;
> + /*
> + * folio has already been mapped into one address
> + * space so set the share count.
> + */
> + folio->share = 1;
> }
> +
> + dax_folio_share_get(entry);
> + } else {
> + WARN_ON_ONCE(folio->mapping);
> + dax_device_folio_init(entry);
> + folio = dax_to_folio(entry);
> + folio->mapping = mapping;
> + folio->index = index;
> }
> }
>
> static void dax_disassociate_entry(void *entry, struct address_space *mapping,
> - bool trunc)
> + bool trunc)
> {
> - unsigned long pfn;
> + struct folio *folio = dax_to_folio(entry);
>
> if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> return;
>
> - for_each_mapped_pfn(entry, pfn) {
> - struct page *page = pfn_to_page(pfn);
> -
> - WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> - if (dax_page_is_shared(page)) {
> - /* keep the shared flag if this page is still shared */
> - if (dax_page_share_put(page) > 0)
> - continue;
> - } else
> - WARN_ON_ONCE(page->mapping && page->mapping != mapping);
> - page->mapping = NULL;
> - page->index = 0;
> - }
> + dax_folio_share_put(folio);
> }
>
> static struct page *dax_busy_page(void *entry)
> {
> - unsigned long pfn;
> + struct folio *folio = dax_to_folio(entry);
>
> - for_each_mapped_pfn(entry, pfn) {
> - struct page *page = pfn_to_page(pfn);
> + if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry))
> + return NULL;
>
> - if (page_ref_count(page) > 1)
> - return page;
> - }
> - return NULL;
> + if (folio_ref_count(folio) - folio_mapcount(folio))
> + return &folio->page;
> + else
> + return NULL;
> }
>
> /**
> @@ -786,7 +821,7 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
> EXPORT_SYMBOL_GPL(dax_layout_busy_page);
>
> static int __dax_invalidate_entry(struct address_space *mapping,
> - pgoff_t index, bool trunc)
> + pgoff_t index, bool trunc)
> {
> XA_STATE(xas, &mapping->i_pages, index);
> int ret = 0;
> @@ -892,7 +927,7 @@ static int wait_page_idle(struct page *page,
> void (cb)(struct inode *),
> struct inode *inode)
> {
> - return ___wait_var_event(page, page_ref_count(page) == 1,
> + return ___wait_var_event(page, page_ref_count(page) == 0,
> TASK_INTERRUPTIBLE, 0, 0, cb(inode));
> }
>
> @@ -900,7 +935,7 @@ static void wait_page_idle_uninterruptible(struct page *page,
> void (cb)(struct inode *),
> struct inode *inode)
> {
> - ___wait_var_event(page, page_ref_count(page) == 1,
> + ___wait_var_event(page, page_ref_count(page) == 0,
> TASK_UNINTERRUPTIBLE, 0, 0, cb(inode));
> }
>
> @@ -949,7 +984,8 @@ void dax_break_mapping_uninterruptible(struct inode *inode,
> wait_page_idle_uninterruptible(page, cb, inode);
> } while (true);
>
> - dax_delete_mapping_range(inode->i_mapping, 0, LLONG_MAX);
> + if (!page)
> + dax_delete_mapping_range(inode->i_mapping, 0, LLONG_MAX);
> }
> EXPORT_SYMBOL_GPL(dax_break_mapping_uninterruptible);
>
> @@ -1035,8 +1071,10 @@ static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
> void *old;
>
> dax_disassociate_entry(entry, mapping, false);
> - dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
> - shared);
> + if (!(flags & DAX_ZERO_PAGE))
> + dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
> + shared);
> +
> /*
> * Only swap our new entry into the page cache if the current
> * entry is a zero page or an empty entry. If a normal PTE or
> @@ -1224,9 +1262,7 @@ static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
> goto out;
> if (pfn_t_to_pfn(*pfnp) & (PHYS_PFN(size)-1))
> goto out;
> - /* For larger pages we need devmap */
> - if (length > 1 && !pfn_t_devmap(*pfnp))
> - goto out;
> +
> rc = 0;
>
> out_check_addr:
> @@ -1333,7 +1369,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
>
> *entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);
>
> - ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
> + ret = vmf_insert_page_mkwrite(vmf, pfn_t_to_page(pfn), false);
> trace_dax_load_hole(inode, vmf, ret);
> return ret;
> }
> @@ -1804,7 +1840,8 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
> loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
> bool write = iter->flags & IOMAP_WRITE;
> unsigned long entry_flags = pmd ? DAX_PMD : 0;
> - int err = 0;
> + struct folio *folio;
> + int ret, err = 0;
> pfn_t pfn;
> void *kaddr;
>
> @@ -1836,17 +1873,18 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
> return dax_fault_return(err);
> }
>
> + folio = dax_to_folio(*entry);
> if (dax_fault_is_synchronous(iter, vmf->vma))
> return dax_fault_synchronous_pfnp(pfnp, pfn);
>
> - /* insert PMD pfn */
> + folio_ref_inc(folio);
> if (pmd)
> - return vmf_insert_pfn_pmd(vmf, pfn, write);
> + ret = vmf_insert_folio_pmd(vmf, pfn_folio(pfn_t_to_pfn(pfn)), write);
> + else
> + ret = vmf_insert_page_mkwrite(vmf, pfn_t_to_page(pfn), write);
> + folio_put(folio);
>
> - /* insert PTE pfn */
> - if (write)
> - return vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
> - return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
> + return ret;
> }
>
> static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
> @@ -2085,6 +2123,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
> {
> struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, order);
> + struct folio *folio;
> void *entry;
> vm_fault_t ret;
>
> @@ -2102,14 +2141,17 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
> xas_set_mark(&xas, PAGECACHE_TAG_DIRTY);
> dax_lock_entry(&xas, entry);
> xas_unlock_irq(&xas);
> + folio = pfn_folio(pfn_t_to_pfn(pfn));
> + folio_ref_inc(folio);
> if (order == 0)
> - ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
> + ret = vmf_insert_page_mkwrite(vmf, &folio->page, true);
> #ifdef CONFIG_FS_DAX_PMD
> else if (order == PMD_ORDER)
> - ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
> + ret = vmf_insert_folio_pmd(vmf, folio, FAULT_FLAG_WRITE);
> #endif
> else
> ret = VM_FAULT_FALLBACK;
> + folio_put(folio);
> dax_unlock_entry(&xas, entry);
> trace_dax_insert_pfn_mkwrite(mapping->host, vmf, ret);
> return ret;
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 82afe78..2c7b24c 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -1017,8 +1017,7 @@ static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> if (kaddr)
> *kaddr = fs->window_kaddr + offset;
> if (pfn)
> - *pfn = phys_to_pfn_t(fs->window_phys_addr + offset,
> - PFN_DEV | PFN_MAP);
> + *pfn = phys_to_pfn_t(fs->window_phys_addr + offset, 0);
> return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
> }
>
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index c7ec5ab..7bfb4eb 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2740,7 +2740,7 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
> * for this nested lock case.
> */
> page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
> - if (page && page_ref_count(page) != 1) {
> + if (page && page_ref_count(page) != 0) {
You might want to wrap this weird detail for the next filesystem that
uses it, so that the fine details of fsdax aren't opencoded in xfs:
static inline bool dax_page_in_use(struct page *page)
{
	return page && page_ref_count(page) != 0;
}

	page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
	if (dax_page_in_use(page)) {
		/* unlock and retry... */
	}
--D
> xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
> xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> goto again;
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 7c3773f..dbefea1 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -211,8 +211,12 @@ static inline int dax_wait_page_idle(struct page *page,
> void (cb)(struct inode *),
> struct inode *inode)
> {
> - return ___wait_var_event(page, page_ref_count(page) == 1,
> + int ret;
> +
> + ret = ___wait_var_event(page, !page_ref_count(page),
> TASK_INTERRUPTIBLE, 0, 0, cb(inode));
> +
> + return ret;
> }
>
> #if IS_ENABLED(CONFIG_DAX)
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 01edca9..a734278 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1161,6 +1161,8 @@ int vma_is_stack_for_current(struct vm_area_struct *vma);
> struct mmu_gather;
> struct inode;
>
> +extern void prep_compound_page(struct page *page, unsigned int order);
> +
> /*
> * compound_order() can be called without holding a reference, which means
> * that niceties like page_folio() don't work. These callers should be
> @@ -1482,25 +1484,6 @@ vm_fault_t finish_fault(struct vm_fault *vmf);
> * back into memory.
> */
>
> -#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_FS_DAX)
> -DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> -
> -bool __put_devmap_managed_folio_refs(struct folio *folio, int refs);
> -static inline bool put_devmap_managed_folio_refs(struct folio *folio, int refs)
> -{
> - if (!static_branch_unlikely(&devmap_managed_key))
> - return false;
> - if (!folio_is_zone_device(folio))
> - return false;
> - return __put_devmap_managed_folio_refs(folio, refs);
> -}
> -#else /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
> -static inline bool put_devmap_managed_folio_refs(struct folio *folio, int refs)
> -{
> - return false;
> -}
> -#endif /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
> -
> /* 127: arbitrary random number, small enough to assemble well */
> #define folio_ref_zero_or_close_to_overflow(folio) \
> ((unsigned int) folio_ref_count(folio) + 127u <= 127u)
> @@ -1615,12 +1598,6 @@ static inline void put_page(struct page *page)
> {
> struct folio *folio = page_folio(page);
>
> - /*
> - * For some devmap managed pages we need to catch refcount transition
> - * from 2 to 1:
> - */
> - if (put_devmap_managed_folio_refs(folio, 1))
> - return;
> folio_put(folio);
> }
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 54b59b8..e308cb9 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -295,6 +295,8 @@ typedef struct {
> * anonymous memory.
> * @index: Offset within the file, in units of pages. For anonymous memory,
> * this is the index from the beginning of the mmap.
> + * @share: number of DAX mappings that reference this folio. See
> + * dax_associate_entry.
> * @private: Filesystem per-folio data (see folio_attach_private()).
> * @swap: Used for swp_entry_t if folio_test_swapcache().
> * @_mapcount: Do not access this member directly. Use folio_mapcount() to
> @@ -344,7 +346,10 @@ struct folio {
> struct dev_pagemap *pgmap;
> };
> struct address_space *mapping;
> - pgoff_t index;
> + union {
> + pgoff_t index;
> + unsigned long share;
> + };
> union {
> void *private;
> swp_entry_t swap;
> diff --git a/mm/gup.c b/mm/gup.c
> index 9b587b5..d6575ed 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -96,8 +96,7 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> * belongs to this folio.
> */
> if (unlikely(page_folio(page) != folio)) {
> - if (!put_devmap_managed_folio_refs(folio, refs))
> - folio_put_refs(folio, refs);
> + folio_put_refs(folio, refs);
> goto retry;
> }
>
> @@ -116,8 +115,7 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
> refs *= GUP_PIN_COUNTING_BIAS;
> }
>
> - if (!put_devmap_managed_folio_refs(folio, refs))
> - folio_put_refs(folio, refs);
> + folio_put_refs(folio, refs);
> }
>
> /**
> @@ -565,8 +563,7 @@ static struct folio *try_grab_folio_fast(struct page *page, int refs,
> */
> if (unlikely((flags & FOLL_LONGTERM) &&
> !folio_is_longterm_pinnable(folio))) {
> - if (!put_devmap_managed_folio_refs(folio, refs))
> - folio_put_refs(folio, refs);
> + folio_put_refs(folio, refs);
> return NULL;
> }
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d1ea76e..0cf1151 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2209,7 +2209,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> tlb->fullmm);
> arch_check_zapped_pmd(vma, orig_pmd);
> tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> - if (vma_is_special_huge(vma)) {
> + if (!vma_is_dax(vma) && vma_is_special_huge(vma)) {
> if (arch_needs_pgtable_deposit())
> zap_deposited_table(tlb->mm, pmd);
> spin_unlock(ptl);
> @@ -2853,13 +2853,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> */
> if (arch_needs_pgtable_deposit())
> zap_deposited_table(mm, pmd);
> - if (vma_is_special_huge(vma))
> + if (!vma_is_dax(vma) && vma_is_special_huge(vma))
> return;
> if (unlikely(is_pmd_migration_entry(old_pmd))) {
> swp_entry_t entry;
>
> entry = pmd_to_swp_entry(old_pmd);
> folio = pfn_swap_entry_folio(entry);
> + } else if (is_huge_zero_pmd(old_pmd)) {
> + return;
> } else {
> page = pmd_page(old_pmd);
> folio = page_folio(page);
> diff --git a/mm/internal.h b/mm/internal.h
> index 3922788..c4df0ad 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -733,8 +733,6 @@ static inline void prep_compound_tail(struct page *head, int tail_idx)
> set_page_private(p, 0);
> }
>
> -extern void prep_compound_page(struct page *page, unsigned int order);
> -
> void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags);
> extern bool free_pages_prepare(struct page *page, unsigned int order);
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index a7b8ccd..7838bf1 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -419,18 +419,18 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
> pud = pud_offset(p4d, address);
> if (!pud_present(*pud))
> return 0;
> - if (pud_devmap(*pud))
> + if (pud_trans_huge(*pud))
> return PUD_SHIFT;
> pmd = pmd_offset(pud, address);
> if (!pmd_present(*pmd))
> return 0;
> - if (pmd_devmap(*pmd))
> + if (pmd_trans_huge(*pmd))
> return PMD_SHIFT;
> pte = pte_offset_map(pmd, address);
> if (!pte)
> return 0;
> ptent = ptep_get(pte);
> - if (pte_present(ptent) && pte_devmap(ptent))
> + if (pte_present(ptent))
> ret = PAGE_SHIFT;
> pte_unmap(pte);
> return ret;
> diff --git a/mm/memory.c b/mm/memory.c
> index c60b819..02e12b0 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3843,13 +3843,15 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
> /*
> * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
> - * VM_PFNMAP VMA.
> + * VM_PFNMAP VMA. FS DAX also wants ops->pfn_mkwrite called.
> *
> * We should not cow pages in a shared writeable mapping.
> * Just mark the pages writable and/or call ops->pfn_mkwrite.
> */
> - if (!vmf->page)
> + if (!vmf->page || is_fsdax_page(vmf->page)) {
> + vmf->page = NULL;
> return wp_pfn_shared(vmf);
> + }
> return wp_page_shared(vmf, folio);
> }
>
> diff --git a/mm/memremap.c b/mm/memremap.c
> index 68099af..9a8879b 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -458,8 +458,13 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
>
> void free_zone_device_folio(struct folio *folio)
> {
> - if (WARN_ON_ONCE(!folio->pgmap->ops ||
> - !folio->pgmap->ops->page_free))
> + struct dev_pagemap *pgmap = folio->pgmap;
> +
> + if (WARN_ON_ONCE(!pgmap->ops))
> + return;
> +
> + if (WARN_ON_ONCE(pgmap->type != MEMORY_DEVICE_FS_DAX &&
> + !pgmap->ops->page_free))
> return;
>
> mem_cgroup_uncharge(folio);
> @@ -484,26 +489,36 @@ void free_zone_device_folio(struct folio *folio)
> * For other types of ZONE_DEVICE pages, migration is either
> * handled differently or not done at all, so there is no need
> * to clear folio->mapping.
> + *
> + * FS DAX pages clear the mapping when the folio->share count hits
> + * zero, which indicates the page has been removed from the file
> + * system mapping.
> */
> - folio->mapping = NULL;
> - folio->pgmap->ops->page_free(folio_page(folio, 0));
> + if (pgmap->type != MEMORY_DEVICE_FS_DAX)
> + folio->mapping = NULL;
>
> - switch (folio->pgmap->type) {
> + switch (pgmap->type) {
> case MEMORY_DEVICE_PRIVATE:
> case MEMORY_DEVICE_COHERENT:
> - put_dev_pagemap(folio->pgmap);
> + pgmap->ops->page_free(folio_page(folio, 0));
> + put_dev_pagemap(pgmap);
> break;
>
> - case MEMORY_DEVICE_FS_DAX:
> case MEMORY_DEVICE_GENERIC:
> /*
> * Reset the refcount to 1 to prepare for handing out the page
> * again.
> */
> + pgmap->ops->page_free(folio_page(folio, 0));
> folio_set_count(folio, 1);
> break;
>
> + case MEMORY_DEVICE_FS_DAX:
> + wake_up_var(&folio->page);
> + break;
> +
> case MEMORY_DEVICE_PCI_P2PDMA:
> + pgmap->ops->page_free(folio_page(folio, 0));
> break;
> }
> }
> @@ -519,21 +534,3 @@ void zone_device_page_init(struct page *page)
> lock_page(page);
> }
> EXPORT_SYMBOL_GPL(zone_device_page_init);
> -
> -#ifdef CONFIG_FS_DAX
> -bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
> -{
> - if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
> - return false;
> -
> - /*
> - * fsdax page refcounts are 1-based, rather than 0-based: if
> - * refcount is 1, then the page is free and the refcount is
> - * stable because nobody holds a reference on the page.
> - */
> - if (folio_ref_sub_return(folio, refs) == 1)
> - wake_up_var(&folio->_refcount);
> - return true;
> -}
> -EXPORT_SYMBOL(__put_devmap_managed_folio_refs);
> -#endif /* CONFIG_FS_DAX */
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index cb73402..0c12b29 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1017,23 +1017,22 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
> }
>
> /*
> - * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC and
> - * MEMORY_TYPE_FS_DAX pages are released directly to the driver page
> - * allocator which will set the page count to 1 when allocating the
> - * page.
> + * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC are released
> + * directly to the driver page allocator which will set the page count
> + * to 1 when allocating the page.
> *
> * MEMORY_TYPE_GENERIC and MEMORY_TYPE_FS_DAX pages automatically have
> * their refcount reset to one whenever they are freed (ie. after
> * their refcount drops to 0).
> */
> switch (pgmap->type) {
> + case MEMORY_DEVICE_FS_DAX:
> case MEMORY_DEVICE_PRIVATE:
> case MEMORY_DEVICE_COHERENT:
> case MEMORY_DEVICE_PCI_P2PDMA:
> set_page_count(page, 0);
> break;
>
> - case MEMORY_DEVICE_FS_DAX:
> case MEMORY_DEVICE_GENERIC:
> break;
> }
> diff --git a/mm/swap.c b/mm/swap.c
> index 062c856..a587842 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -952,8 +952,6 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
> unlock_page_lruvec_irqrestore(lruvec, flags);
> lruvec = NULL;
> }
> - if (put_devmap_managed_folio_refs(folio, nr_refs))
> - continue;
> if (folio_ref_sub_and_test(folio, nr_refs))
> free_zone_device_folio(folio);
> continue;
> --
> git-series 0.9.1
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts
2025-01-10 7:05 ` [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Dan Williams
@ 2025-01-11 1:30 ` Andrew Morton
2025-01-11 3:35 ` Dan Williams
0 siblings, 1 reply; 97+ messages in thread
From: Andrew Morton @ 2025-01-11 1:30 UTC (permalink / raw)
To: Dan Williams
Cc: Alistair Popple, linux-mm, alison.schofield, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Thu, 9 Jan 2025 23:05:56 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
> > - Remove PTE_DEVMAP definitions from Loongarch which were added since
> > this series was initially written.
> [..]
> >
> > base-commit: e25c8d66f6786300b680866c0e0139981273feba
>
> If this is going to go through nvdimm.git I will need it against a
> mainline tag baseline. Linus will want to see the merge conflicts.
>
> Otherwise if that merge commit is too messy, or you would rather not
> rebase, then it needs to go via one of two options:
>
> - Andrew's tree which is the only tree I know of that can carry
> patches relative to linux-next.
I used to be able to do that but haven't got around to setting up such
a thing with mm.git. This is the first time the need has arisen,
really.
> - Wait for v6.14-rc1
I'm thinking so. Darrick's review comments indicate that we'll be seeing a v7.
> and get this into nvdimm.git early in the cycle
> when the conflict storm will be low.
erk. This patchset hits mm/ a lot, and nvdimm hardly at all. Is it
not practical to carry this in mm.git?
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts
2025-01-11 1:30 ` Andrew Morton
@ 2025-01-11 3:35 ` Dan Williams
2025-01-13 1:05 ` Alistair Popple
0 siblings, 1 reply; 97+ messages in thread
From: Dan Williams @ 2025-01-11 3:35 UTC (permalink / raw)
To: Andrew Morton, Dan Williams
Cc: Alistair Popple, linux-mm, alison.schofield, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Andrew Morton wrote:
> On Thu, 9 Jan 2025 23:05:56 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
>
> > > - Remove PTE_DEVMAP definitions from Loongarch which were added since
> > > this series was initially written.
> > [..]
> > >
> > > base-commit: e25c8d66f6786300b680866c0e0139981273feba
> >
> > If this is going to go through nvdimm.git I will need it against a
> > mainline tag baseline. Linus will want to see the merge conflicts.
> >
> > Otherwise if that merge commit is too messy, or you would rather not
> > rebase, then it needs to go via one of two options:
> >
> > - Andrew's tree which is the only tree I know of that can carry
> > patches relative to linux-next.
>
> I used to be able to do that but haven't got around to setting up such
> a thing with mm.git. This is the first time the need has arisen,
> really.
Oh, good to know.
>
> > - Wait for v6.14-rc1
>
> I'm thinking so. Darrick's review comments indicate that we'll be seeing a v7.
>
> > and get this into nvdimm.git early in the cycle
> > when the conflict storm will be low.
>
> erk. This patchset hits mm/ a lot, and nvdimm hardly at all. Is it
> not practical to carry this in mm.git?
I'm totally fine with it going through mm.git. nvdimm.git is just the
historical path for touches to fs/dax.c, and git blame points mostly to
me for the issues Alistair is fixing. I am happy to review and ack and
watch this go through mm.git.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 24/26] mm: Remove devmap related functions and page table bits
2025-01-10 6:00 ` [PATCH v6 24/26] mm: Remove devmap related functions and page table bits Alistair Popple
@ 2025-01-11 10:08 ` Huacai Chen
2025-01-14 19:03 ` Dan Williams
1 sibling, 0 replies; 97+ messages in thread
From: Huacai Chen @ 2025-01-11 10:08 UTC (permalink / raw)
To: Alistair Popple
Cc: akpm, dan.j.williams, linux-mm, alison.schofield, lina,
zhang.lyra, gerald.schaefer, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, david,
peterx, linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev,
nvdimm, linux-cxl, linux-fsdevel, linux-ext4, linux-xfs,
jhubbard, hch, david, kernel, loongarch
Hi, Alistair,
I think the last two patches can be squashed into this one.
Huacai
On Fri, Jan 10, 2025 at 2:03 PM Alistair Popple <apopple@nvidia.com> wrote:
>
> Now that DAX and all other reference counts to ZONE_DEVICE pages are
> managed normally there is no need for the special devmap PTE/PMD/PUD
> page table bits. So drop all references to these, freeing up a
> software defined page table bit on architectures supporting it.
>
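The shape of the change is the same on each architecture; for example the x86 pmd_trans_huge() hunk below simply stops masking out the devmap bit:

	/* before */
	return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
	/* after */
	return (pmd_val(pmd) & _PAGE_PSE) == _PAGE_PSE;
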
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Acked-by: Will Deacon <will@kernel.org> # arm64
> ---
> Documentation/mm/arch_pgtable_helpers.rst | 6 +--
> arch/arm64/Kconfig | 1 +-
> arch/arm64/include/asm/pgtable-prot.h | 1 +-
> arch/arm64/include/asm/pgtable.h | 24 +--------
> arch/powerpc/Kconfig | 1 +-
> arch/powerpc/include/asm/book3s/64/hash-4k.h | 6 +--
> arch/powerpc/include/asm/book3s/64/hash-64k.h | 7 +--
> arch/powerpc/include/asm/book3s/64/pgtable.h | 53 +------------------
> arch/powerpc/include/asm/book3s/64/radix.h | 14 +-----
> arch/x86/Kconfig | 1 +-
> arch/x86/include/asm/pgtable.h | 51 +-----------------
> arch/x86/include/asm/pgtable_types.h | 5 +--
> include/linux/mm.h | 7 +--
> include/linux/pfn_t.h | 20 +-------
> include/linux/pgtable.h | 19 +------
> mm/Kconfig | 4 +-
> mm/debug_vm_pgtable.c | 59 +--------------------
> mm/hmm.c | 3 +-
> 18 files changed, 11 insertions(+), 271 deletions(-)
>
> diff --git a/Documentation/mm/arch_pgtable_helpers.rst b/Documentation/mm/arch_pgtable_helpers.rst
> index af24516..c88c7fa 100644
> --- a/Documentation/mm/arch_pgtable_helpers.rst
> +++ b/Documentation/mm/arch_pgtable_helpers.rst
> @@ -30,8 +30,6 @@ PTE Page Table Helpers
> +---------------------------+--------------------------------------------------+
> | pte_protnone | Tests a PROT_NONE PTE |
> +---------------------------+--------------------------------------------------+
> -| pte_devmap | Tests a ZONE_DEVICE mapped PTE |
> -+---------------------------+--------------------------------------------------+
> | pte_soft_dirty | Tests a soft dirty PTE |
> +---------------------------+--------------------------------------------------+
> | pte_swp_soft_dirty | Tests a soft dirty swapped PTE |
> @@ -104,8 +102,6 @@ PMD Page Table Helpers
> +---------------------------+--------------------------------------------------+
> | pmd_protnone | Tests a PROT_NONE PMD |
> +---------------------------+--------------------------------------------------+
> -| pmd_devmap | Tests a ZONE_DEVICE mapped PMD |
> -+---------------------------+--------------------------------------------------+
> | pmd_soft_dirty | Tests a soft dirty PMD |
> +---------------------------+--------------------------------------------------+
> | pmd_swp_soft_dirty | Tests a soft dirty swapped PMD |
> @@ -177,8 +173,6 @@ PUD Page Table Helpers
> +---------------------------+--------------------------------------------------+
> | pud_write | Tests a writable PUD |
> +---------------------------+--------------------------------------------------+
> -| pud_devmap | Tests a ZONE_DEVICE mapped PUD |
> -+---------------------------+--------------------------------------------------+
> | pud_mkyoung | Creates a young PUD |
> +---------------------------+--------------------------------------------------+
> | pud_mkold | Creates an old PUD |
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 39310a4..81855d1 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -41,7 +41,6 @@ config ARM64
> select ARCH_HAS_NMI_SAFE_THIS_CPU_OPS
> select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
> select ARCH_HAS_NONLEAF_PMD_YOUNG if ARM64_HAFT
> - select ARCH_HAS_PTE_DEVMAP
> select ARCH_HAS_PTE_SPECIAL
> select ARCH_HAS_HW_PTE_YOUNG
> select ARCH_HAS_SETUP_DMA_OPS
> diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
> index 9f9cf13..49b51df 100644
> --- a/arch/arm64/include/asm/pgtable-prot.h
> +++ b/arch/arm64/include/asm/pgtable-prot.h
> @@ -17,7 +17,6 @@
> #define PTE_SWP_EXCLUSIVE (_AT(pteval_t, 1) << 2) /* only for swp ptes */
> #define PTE_DIRTY (_AT(pteval_t, 1) << 55)
> #define PTE_SPECIAL (_AT(pteval_t, 1) << 56)
> -#define PTE_DEVMAP (_AT(pteval_t, 1) << 57)
>
> /*
> * PTE_PRESENT_INVALID=1 & PTE_VALID=0 indicates that the pte's fields should be
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index f8dac66..ea34e51 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -108,7 +108,6 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
> #define pte_user(pte) (!!(pte_val(pte) & PTE_USER))
> #define pte_user_exec(pte) (!(pte_val(pte) & PTE_UXN))
> #define pte_cont(pte) (!!(pte_val(pte) & PTE_CONT))
> -#define pte_devmap(pte) (!!(pte_val(pte) & PTE_DEVMAP))
> #define pte_tagged(pte) ((pte_val(pte) & PTE_ATTRINDX_MASK) == \
> PTE_ATTRINDX(MT_NORMAL_TAGGED))
>
> @@ -290,11 +289,6 @@ static inline pmd_t pmd_mkcont(pmd_t pmd)
> return __pmd(pmd_val(pmd) | PMD_SECT_CONT);
> }
>
> -static inline pte_t pte_mkdevmap(pte_t pte)
> -{
> - return set_pte_bit(pte, __pgprot(PTE_DEVMAP | PTE_SPECIAL));
> -}
> -
> #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
> static inline int pte_uffd_wp(pte_t pte)
> {
> @@ -587,14 +581,6 @@ static inline int pmd_trans_huge(pmd_t pmd)
>
> #define pmd_mkhuge(pmd) (__pmd(pmd_val(pmd) & ~PMD_TABLE_BIT))
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -#define pmd_devmap(pmd) pte_devmap(pmd_pte(pmd))
> -#endif
> -static inline pmd_t pmd_mkdevmap(pmd_t pmd)
> -{
> - return pte_pmd(set_pte_bit(pmd_pte(pmd), __pgprot(PTE_DEVMAP)));
> -}
> -
> #ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
> #define pmd_special(pte) (!!((pmd_val(pte) & PTE_SPECIAL)))
> static inline pmd_t pmd_mkspecial(pmd_t pmd)
> @@ -1195,16 +1181,6 @@ static inline int pmdp_set_access_flags(struct vm_area_struct *vma,
> return __ptep_set_access_flags(vma, address, (pte_t *)pmdp,
> pmd_pte(entry), dirty);
> }
> -
> -static inline int pud_devmap(pud_t pud)
> -{
> - return 0;
> -}
> -
> -static inline int pgd_devmap(pgd_t pgd)
> -{
> - return 0;
> -}
> #endif
>
> #ifdef CONFIG_PAGE_TABLE_CHECK
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index da0ac66..3e85f89 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -147,7 +147,6 @@ config PPC
> select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
> select ARCH_HAS_PHYS_TO_DMA
> select ARCH_HAS_PMEM_API
> - select ARCH_HAS_PTE_DEVMAP if PPC_BOOK3S_64
> select ARCH_HAS_PTE_SPECIAL
> select ARCH_HAS_SCALED_CPUTIME if VIRT_CPU_ACCOUNTING_NATIVE && PPC_BOOK3S_64
> select ARCH_HAS_SET_MEMORY
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> index c3efaca..b0546d3 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
> @@ -160,12 +160,6 @@ extern pmd_t hash__pmdp_huge_get_and_clear(struct mm_struct *mm,
> extern int hash__has_transparent_hugepage(void);
> #endif
>
> -static inline pmd_t hash__pmd_mkdevmap(pmd_t pmd)
> -{
> - BUG();
> - return pmd;
> -}
> -
> #endif /* !__ASSEMBLY__ */
>
> #endif /* _ASM_POWERPC_BOOK3S_64_HASH_4K_H */
> diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> index 0bf6fd0..0fb5b7d 100644
> --- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
> +++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
> @@ -259,7 +259,7 @@ static inline void mark_hpte_slot_valid(unsigned char *hpte_slot_array,
> */
> static inline int hash__pmd_trans_huge(pmd_t pmd)
> {
> - return !!((pmd_val(pmd) & (_PAGE_PTE | H_PAGE_THP_HUGE | _PAGE_DEVMAP)) ==
> + return !!((pmd_val(pmd) & (_PAGE_PTE | H_PAGE_THP_HUGE)) ==
> (_PAGE_PTE | H_PAGE_THP_HUGE));
> }
>
> @@ -281,11 +281,6 @@ extern pmd_t hash__pmdp_huge_get_and_clear(struct mm_struct *mm,
> extern int hash__has_transparent_hugepage(void);
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> -static inline pmd_t hash__pmd_mkdevmap(pmd_t pmd)
> -{
> - return __pmd(pmd_val(pmd) | (_PAGE_PTE | H_PAGE_THP_HUGE | _PAGE_DEVMAP));
> -}
> -
> #endif /* __ASSEMBLY__ */
>
> #endif /* _ASM_POWERPC_BOOK3S_64_HASH_64K_H */
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 6d98e6f..1d98d0a 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -88,7 +88,6 @@
>
> #define _PAGE_SOFT_DIRTY _RPAGE_SW3 /* software: software dirty tracking */
> #define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */
> -#define _PAGE_DEVMAP _RPAGE_SW1 /* software: ZONE_DEVICE page */
>
> /*
> * Drivers request for cache inhibited pte mapping using _PAGE_NO_CACHE
> @@ -109,7 +108,7 @@
> */
> #define _HPAGE_CHG_MASK (PTE_RPN_MASK | _PAGE_HPTEFLAGS | _PAGE_DIRTY | \
> _PAGE_ACCESSED | H_PAGE_THP_HUGE | _PAGE_PTE | \
> - _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
> + _PAGE_SOFT_DIRTY)
> /*
> * user access blocked by key
> */
> @@ -123,7 +122,7 @@
> */
> #define _PAGE_CHG_MASK (PTE_RPN_MASK | _PAGE_HPTEFLAGS | _PAGE_DIRTY | \
> _PAGE_ACCESSED | _PAGE_SPECIAL | _PAGE_PTE | \
> - _PAGE_SOFT_DIRTY | _PAGE_DEVMAP)
> + _PAGE_SOFT_DIRTY)
>
> /*
> * We define 2 sets of base prot bits, one for basic pages (ie,
> @@ -609,24 +608,6 @@ static inline pte_t pte_mkhuge(pte_t pte)
> return pte;
> }
>
> -static inline pte_t pte_mkdevmap(pte_t pte)
> -{
> - return __pte_raw(pte_raw(pte) | cpu_to_be64(_PAGE_SPECIAL | _PAGE_DEVMAP));
> -}
> -
> -/*
> - * This is potentially called with a pmd as the argument, in which case it's not
> - * safe to check _PAGE_DEVMAP unless we also confirm that _PAGE_PTE is set.
> - * That's because the bit we use for _PAGE_DEVMAP is not reserved for software
> - * use in page directory entries (ie. non-ptes).
> - */
> -static inline int pte_devmap(pte_t pte)
> -{
> - __be64 mask = cpu_to_be64(_PAGE_DEVMAP | _PAGE_PTE);
> -
> - return (pte_raw(pte) & mask) == mask;
> -}
> -
> static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
> {
> /* FIXME!! check whether this need to be a conditional */
> @@ -1380,36 +1361,6 @@ static inline bool arch_needs_pgtable_deposit(void)
> }
> extern void serialize_against_pte_lookup(struct mm_struct *mm);
>
> -
> -static inline pmd_t pmd_mkdevmap(pmd_t pmd)
> -{
> - if (radix_enabled())
> - return radix__pmd_mkdevmap(pmd);
> - return hash__pmd_mkdevmap(pmd);
> -}
> -
> -static inline pud_t pud_mkdevmap(pud_t pud)
> -{
> - if (radix_enabled())
> - return radix__pud_mkdevmap(pud);
> - BUG();
> - return pud;
> -}
> -
> -static inline int pmd_devmap(pmd_t pmd)
> -{
> - return pte_devmap(pmd_pte(pmd));
> -}
> -
> -static inline int pud_devmap(pud_t pud)
> -{
> - return pte_devmap(pud_pte(pud));
> -}
> -
> -static inline int pgd_devmap(pgd_t pgd)
> -{
> - return 0;
> -}
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> #define __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION
> diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
> index 8f55ff7..df23a82 100644
> --- a/arch/powerpc/include/asm/book3s/64/radix.h
> +++ b/arch/powerpc/include/asm/book3s/64/radix.h
> @@ -264,7 +264,7 @@ static inline int radix__p4d_bad(p4d_t p4d)
>
> static inline int radix__pmd_trans_huge(pmd_t pmd)
> {
> - return (pmd_val(pmd) & (_PAGE_PTE | _PAGE_DEVMAP)) == _PAGE_PTE;
> + return (pmd_val(pmd) & _PAGE_PTE) == _PAGE_PTE;
> }
>
> static inline pmd_t radix__pmd_mkhuge(pmd_t pmd)
> @@ -274,7 +274,7 @@ static inline pmd_t radix__pmd_mkhuge(pmd_t pmd)
>
> static inline int radix__pud_trans_huge(pud_t pud)
> {
> - return (pud_val(pud) & (_PAGE_PTE | _PAGE_DEVMAP)) == _PAGE_PTE;
> + return (pud_val(pud) & _PAGE_PTE) == _PAGE_PTE;
> }
>
> static inline pud_t radix__pud_mkhuge(pud_t pud)
> @@ -315,16 +315,6 @@ static inline int radix__has_transparent_pud_hugepage(void)
> }
> #endif
>
> -static inline pmd_t radix__pmd_mkdevmap(pmd_t pmd)
> -{
> - return __pmd(pmd_val(pmd) | (_PAGE_PTE | _PAGE_DEVMAP));
> -}
> -
> -static inline pud_t radix__pud_mkdevmap(pud_t pud)
> -{
> - return __pud(pud_val(pud) | (_PAGE_PTE | _PAGE_DEVMAP));
> -}
> -
> struct vmem_altmap;
> struct dev_pagemap;
> extern int __meminit radix__vmemmap_create_mapping(unsigned long start,
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 77f001c..acac373 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -97,7 +97,6 @@ config X86
> select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
> select ARCH_HAS_PMEM_API if X86_64
> select ARCH_HAS_PREEMPT_LAZY
> - select ARCH_HAS_PTE_DEVMAP if X86_64
> select ARCH_HAS_PTE_SPECIAL
> select ARCH_HAS_HW_PTE_YOUNG
> select ARCH_HAS_NONLEAF_PMD_YOUNG if PGTABLE_LEVELS > 2
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 593f10a..77705be 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -308,16 +308,15 @@ static inline bool pmd_leaf(pmd_t pte)
> }
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -/* NOTE: when predicate huge page, consider also pmd_devmap, or use pmd_leaf */
> static inline int pmd_trans_huge(pmd_t pmd)
> {
> - return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
> + return (pmd_val(pmd) & _PAGE_PSE) == _PAGE_PSE;
> }
>
> #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> static inline int pud_trans_huge(pud_t pud)
> {
> - return (pud_val(pud) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
> + return (pud_val(pud) & _PAGE_PSE) == _PAGE_PSE;
> }
> #endif
>
> @@ -327,24 +326,6 @@ static inline int has_transparent_hugepage(void)
> return boot_cpu_has(X86_FEATURE_PSE);
> }
>
> -#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
> -static inline int pmd_devmap(pmd_t pmd)
> -{
> - return !!(pmd_val(pmd) & _PAGE_DEVMAP);
> -}
> -
> -#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> -static inline int pud_devmap(pud_t pud)
> -{
> - return !!(pud_val(pud) & _PAGE_DEVMAP);
> -}
> -#else
> -static inline int pud_devmap(pud_t pud)
> -{
> - return 0;
> -}
> -#endif
> -
> #ifdef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
> static inline bool pmd_special(pmd_t pmd)
> {
> @@ -368,12 +349,6 @@ static inline pud_t pud_mkspecial(pud_t pud)
> return pud_set_flags(pud, _PAGE_SPECIAL);
> }
> #endif /* CONFIG_ARCH_SUPPORTS_PUD_PFNMAP */
> -
> -static inline int pgd_devmap(pgd_t pgd)
> -{
> - return 0;
> -}
> -#endif
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> static inline pte_t pte_set_flags(pte_t pte, pteval_t set)
> @@ -534,11 +509,6 @@ static inline pte_t pte_mkspecial(pte_t pte)
> return pte_set_flags(pte, _PAGE_SPECIAL);
> }
>
> -static inline pte_t pte_mkdevmap(pte_t pte)
> -{
> - return pte_set_flags(pte, _PAGE_SPECIAL|_PAGE_DEVMAP);
> -}
> -
> /* See comments above mksaveddirty_shift() */
> static inline pmd_t pmd_mksaveddirty(pmd_t pmd)
> {
> @@ -610,11 +580,6 @@ static inline pmd_t pmd_mkwrite_shstk(pmd_t pmd)
> return pmd_set_flags(pmd, _PAGE_DIRTY);
> }
>
> -static inline pmd_t pmd_mkdevmap(pmd_t pmd)
> -{
> - return pmd_set_flags(pmd, _PAGE_DEVMAP);
> -}
> -
> static inline pmd_t pmd_mkhuge(pmd_t pmd)
> {
> return pmd_set_flags(pmd, _PAGE_PSE);
> @@ -680,11 +645,6 @@ static inline pud_t pud_mkdirty(pud_t pud)
> return pud_mksaveddirty(pud);
> }
>
> -static inline pud_t pud_mkdevmap(pud_t pud)
> -{
> - return pud_set_flags(pud, _PAGE_DEVMAP);
> -}
> -
> static inline pud_t pud_mkhuge(pud_t pud)
> {
> return pud_set_flags(pud, _PAGE_PSE);
> @@ -1012,13 +972,6 @@ static inline int pte_present(pte_t a)
> return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
> }
>
> -#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
> -static inline int pte_devmap(pte_t a)
> -{
> - return (pte_flags(a) & _PAGE_DEVMAP) == _PAGE_DEVMAP;
> -}
> -#endif
> -
> #define pte_accessible pte_accessible
> static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
> {
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 4b80453..e4c7b51 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -33,7 +33,6 @@
> #define _PAGE_BIT_CPA_TEST _PAGE_BIT_SOFTW1
> #define _PAGE_BIT_UFFD_WP _PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */
> #define _PAGE_BIT_SOFT_DIRTY _PAGE_BIT_SOFTW3 /* software dirty tracking */
> -#define _PAGE_BIT_DEVMAP _PAGE_BIT_SOFTW4
>
> #ifdef CONFIG_X86_64
> #define _PAGE_BIT_SAVED_DIRTY _PAGE_BIT_SOFTW5 /* Saved Dirty bit (leaf) */
> @@ -119,11 +118,9 @@
>
> #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
> #define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX)
> -#define _PAGE_DEVMAP (_AT(u64, 1) << _PAGE_BIT_DEVMAP)
> #define _PAGE_SOFTW4 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW4)
> #else
> #define _PAGE_NX (_AT(pteval_t, 0))
> -#define _PAGE_DEVMAP (_AT(pteval_t, 0))
> #define _PAGE_SOFTW4 (_AT(pteval_t, 0))
> #endif
>
> @@ -152,7 +149,7 @@
> #define _COMMON_PAGE_CHG_MASK (PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT | \
> _PAGE_SPECIAL | _PAGE_ACCESSED | \
> _PAGE_DIRTY_BITS | _PAGE_SOFT_DIRTY | \
> - _PAGE_DEVMAP | _PAGE_CC | _PAGE_UFFD_WP)
> + _PAGE_CC | _PAGE_UFFD_WP)
> #define _PAGE_CHG_MASK (_COMMON_PAGE_CHG_MASK | _PAGE_PAT)
> #define _HPAGE_CHG_MASK (_COMMON_PAGE_CHG_MASK | _PAGE_PSE | _PAGE_PAT_LARGE)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index a734278..23c4e9b 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2769,13 +2769,6 @@ static inline pud_t pud_mkspecial(pud_t pud)
> }
> #endif /* CONFIG_ARCH_SUPPORTS_PUD_PFNMAP */
>
> -#ifndef CONFIG_ARCH_HAS_PTE_DEVMAP
> -static inline int pte_devmap(pte_t pte)
> -{
> - return 0;
> -}
> -#endif
> -
> extern pte_t *__get_locked_pte(struct mm_struct *mm, unsigned long addr,
> spinlock_t **ptl);
> static inline pte_t *get_locked_pte(struct mm_struct *mm, unsigned long addr,
> diff --git a/include/linux/pfn_t.h b/include/linux/pfn_t.h
> index 2d91482..0100ad8 100644
> --- a/include/linux/pfn_t.h
> +++ b/include/linux/pfn_t.h
> @@ -97,26 +97,6 @@ static inline pud_t pfn_t_pud(pfn_t pfn, pgprot_t pgprot)
> #endif
> #endif
>
> -#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
> -static inline bool pfn_t_devmap(pfn_t pfn)
> -{
> - const u64 flags = PFN_DEV|PFN_MAP;
> -
> - return (pfn.val & flags) == flags;
> -}
> -#else
> -static inline bool pfn_t_devmap(pfn_t pfn)
> -{
> - return false;
> -}
> -pte_t pte_mkdevmap(pte_t pte);
> -pmd_t pmd_mkdevmap(pmd_t pmd);
> -#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
> - defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
> -pud_t pud_mkdevmap(pud_t pud);
> -#endif
> -#endif /* CONFIG_ARCH_HAS_PTE_DEVMAP */
> -
> #ifdef CONFIG_ARCH_HAS_PTE_SPECIAL
> static inline bool pfn_t_special(pfn_t pfn)
> {
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 00e4a06..1c377de 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1606,21 +1606,6 @@ static inline int pud_write(pud_t pud)
> }
> #endif /* pud_write */
>
> -#if !defined(CONFIG_ARCH_HAS_PTE_DEVMAP) || !defined(CONFIG_TRANSPARENT_HUGEPAGE)
> -static inline int pmd_devmap(pmd_t pmd)
> -{
> - return 0;
> -}
> -static inline int pud_devmap(pud_t pud)
> -{
> - return 0;
> -}
> -static inline int pgd_devmap(pgd_t pgd)
> -{
> - return 0;
> -}
> -#endif
> -
> #if !defined(CONFIG_TRANSPARENT_HUGEPAGE) || \
> !defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
> static inline int pud_trans_huge(pud_t pud)
> @@ -1875,8 +1860,8 @@ typedef unsigned int pgtbl_mod_mask;
> * - It should contain a huge PFN, which points to a huge page larger than
> * PAGE_SIZE of the platform. The PFN format isn't important here.
> *
> - * - It should cover all kinds of huge mappings (e.g., pXd_trans_huge(),
> - * pXd_devmap(), or hugetlb mappings).
> + * - It should cover all kinds of huge mappings (i.e. pXd_trans_huge()
> + * or hugetlb mappings).
> */
> #ifndef pgd_leaf
> #define pgd_leaf(x) false
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 7949ab1..e1d0981 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1044,9 +1044,6 @@ config ARCH_HAS_CURRENT_STACK_POINTER
> register alias named "current_stack_pointer", this config can be
> selected.
>
> -config ARCH_HAS_PTE_DEVMAP
> - bool
> -
> config ARCH_HAS_ZONE_DMA_SET
> bool
>
> @@ -1064,7 +1061,6 @@ config ZONE_DEVICE
> depends on MEMORY_HOTPLUG
> depends on MEMORY_HOTREMOVE
> depends on SPARSEMEM_VMEMMAP
> - depends on ARCH_HAS_PTE_DEVMAP
> select XARRAY_MULTI
>
> help
> diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
> index bc748f7..cf5ff92 100644
> --- a/mm/debug_vm_pgtable.c
> +++ b/mm/debug_vm_pgtable.c
> @@ -348,12 +348,6 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args)
> vaddr &= HPAGE_PUD_MASK;
>
> pud = pfn_pud(args->pud_pfn, args->page_prot);
> - /*
> - * Some architectures have debug checks to make sure
> - * huge pud mapping are only found with devmap entries
> - * For now test with only devmap entries.
> - */
> - pud = pud_mkdevmap(pud);
> set_pud_at(args->mm, vaddr, args->pudp, pud);
> flush_dcache_page(page);
> pudp_set_wrprotect(args->mm, vaddr, args->pudp);
> @@ -366,7 +360,6 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args)
> WARN_ON(!pud_none(pud));
> #endif /* __PAGETABLE_PMD_FOLDED */
> pud = pfn_pud(args->pud_pfn, args->page_prot);
> - pud = pud_mkdevmap(pud);
> pud = pud_wrprotect(pud);
> pud = pud_mkclean(pud);
> set_pud_at(args->mm, vaddr, args->pudp, pud);
> @@ -384,7 +377,6 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args)
> #endif /* __PAGETABLE_PMD_FOLDED */
>
> pud = pfn_pud(args->pud_pfn, args->page_prot);
> - pud = pud_mkdevmap(pud);
> pud = pud_mkyoung(pud);
> set_pud_at(args->mm, vaddr, args->pudp, pud);
> flush_dcache_page(page);
> @@ -693,53 +685,6 @@ static void __init pmd_protnone_tests(struct pgtable_debug_args *args)
> static void __init pmd_protnone_tests(struct pgtable_debug_args *args) { }
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> -#ifdef CONFIG_ARCH_HAS_PTE_DEVMAP
> -static void __init pte_devmap_tests(struct pgtable_debug_args *args)
> -{
> - pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot);
> -
> - pr_debug("Validating PTE devmap\n");
> - WARN_ON(!pte_devmap(pte_mkdevmap(pte)));
> -}
> -
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -static void __init pmd_devmap_tests(struct pgtable_debug_args *args)
> -{
> - pmd_t pmd;
> -
> - if (!has_transparent_hugepage())
> - return;
> -
> - pr_debug("Validating PMD devmap\n");
> - pmd = pfn_pmd(args->fixed_pmd_pfn, args->page_prot);
> - WARN_ON(!pmd_devmap(pmd_mkdevmap(pmd)));
> -}
> -
> -#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> -static void __init pud_devmap_tests(struct pgtable_debug_args *args)
> -{
> - pud_t pud;
> -
> - if (!has_transparent_pud_hugepage())
> - return;
> -
> - pr_debug("Validating PUD devmap\n");
> - pud = pfn_pud(args->fixed_pud_pfn, args->page_prot);
> - WARN_ON(!pud_devmap(pud_mkdevmap(pud)));
> -}
> -#else /* !CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
> -static void __init pud_devmap_tests(struct pgtable_debug_args *args) { }
> -#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
> -#else /* CONFIG_TRANSPARENT_HUGEPAGE */
> -static void __init pmd_devmap_tests(struct pgtable_debug_args *args) { }
> -static void __init pud_devmap_tests(struct pgtable_debug_args *args) { }
> -#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> -#else
> -static void __init pte_devmap_tests(struct pgtable_debug_args *args) { }
> -static void __init pmd_devmap_tests(struct pgtable_debug_args *args) { }
> -static void __init pud_devmap_tests(struct pgtable_debug_args *args) { }
> -#endif /* CONFIG_ARCH_HAS_PTE_DEVMAP */
> -
> static void __init pte_soft_dirty_tests(struct pgtable_debug_args *args)
> {
> pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot);
> @@ -1341,10 +1286,6 @@ static int __init debug_vm_pgtable(void)
> pte_protnone_tests(&args);
> pmd_protnone_tests(&args);
>
> - pte_devmap_tests(&args);
> - pmd_devmap_tests(&args);
> - pud_devmap_tests(&args);
> -
> pte_soft_dirty_tests(&args);
> pmd_soft_dirty_tests(&args);
> pte_swap_soft_dirty_tests(&args);
> diff --git a/mm/hmm.c b/mm/hmm.c
> index 285578e..2a12879 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -395,8 +395,7 @@ static int hmm_vma_walk_pmd(pmd_t *pmdp,
> return 0;
> }
>
> -#if defined(CONFIG_ARCH_HAS_PTE_DEVMAP) && \
> - defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
> +#if defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
> static inline unsigned long pud_to_hmm_pfn_flags(struct hmm_range *range,
> pud_t pud)
> {
> --
> git-series 0.9.1
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 05/26] fs/dax: Create a common implementation to break DAX layouts
2025-01-10 16:44 ` Darrick J. Wong
@ 2025-01-13 0:47 ` Alistair Popple
2025-01-13 2:47 ` Darrick J. Wong
0 siblings, 1 reply; 97+ messages in thread
From: Alistair Popple @ 2025-01-13 0:47 UTC (permalink / raw)
To: Darrick J. Wong
Cc: akpm, dan.j.williams, linux-mm, alison.schofield, lina,
zhang.lyra, gerald.schaefer, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Fri, Jan 10, 2025 at 08:44:38AM -0800, Darrick J. Wong wrote:
> On Fri, Jan 10, 2025 at 05:00:33PM +1100, Alistair Popple wrote:
> > Prior to freeing a block, file systems supporting FS DAX must check
> > that the associated pages are both unmapped from user-space and not
> > undergoing DMA or other access from eg. get_user_pages(). This is
> > achieved by unmapping the file range and scanning the FS DAX
> > page-cache to see if any pages within the mapping have an elevated
> > refcount.
> >
> > This is done using two functions - dax_layout_busy_page_range(), which
> > returns a page whose refcount must be waited on to become idle, and
> > dax_wait_page_idle(), which performs that wait. Rather than open-code
> > this, introduce a common implementation to both unmap and wait for the
> > page to become idle.
> >
> > Signed-off-by: Alistair Popple <apopple@nvidia.com>
>
> So now that Dan Carpenter has complained, I guess I should look at
> this...
>
> > ---
> >
> > Changes for v5:
> >
> > - Don't wait for idle pages on non-DAX mappings
> >
> > Changes for v4:
> >
> > - Fixed some build breakage due to missing symbol exports reported by
> > John Hubbard (thanks!).
> > ---
> > fs/dax.c | 33 +++++++++++++++++++++++++++++++++
> > fs/ext4/inode.c | 10 +---------
> > fs/fuse/dax.c | 27 +++------------------------
> > fs/xfs/xfs_inode.c | 23 +++++------------------
> > fs/xfs/xfs_inode.h | 2 +-
> > include/linux/dax.h | 21 +++++++++++++++++++++
> > mm/madvise.c | 8 ++++----
> > 7 files changed, 68 insertions(+), 56 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index d010c10..9c3bd07 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -845,6 +845,39 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
> > return ret;
> > }
> >
> > +static int wait_page_idle(struct page *page,
> > + void (cb)(struct inode *),
> > + struct inode *inode)
> > +{
> > + return ___wait_var_event(page, page_ref_count(page) == 1,
> > + TASK_INTERRUPTIBLE, 0, 0, cb(inode));
> > +}
> > +
> > +/*
> > + * Unmaps the inode and waits for any DMA to complete prior to deleting the
> > + * DAX mapping entries for the range.
> > + */
> > +int dax_break_mapping(struct inode *inode, loff_t start, loff_t end,
> > + void (cb)(struct inode *))
> > +{
> > + struct page *page;
> > + int error;
> > +
> > + if (!dax_mapping(inode->i_mapping))
> > + return 0;
> > +
> > + do {
> > + page = dax_layout_busy_page_range(inode->i_mapping, start, end);
> > + if (!page)
> > + break;
> > +
> > + error = wait_page_idle(page, cb, inode);
> > + } while (error == 0);
>
> You didn't initialize error to 0, so it could be any value. What if
> dax_layout_busy_page_range returns null the first time through the loop?
Yes. I went down the rabbit hole of figuring out why this didn't produce a
compiler warning and forgot to go back and fix it. Thanks.
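
For the archives, the fix is simply to initialise the variable so that a NULL
page on the first pass falls out as success. Roughly (sketch only, not the
final respin):

    int dax_break_mapping(struct inode *inode, loff_t start, loff_t end,
            void (cb)(struct inode *))
    {
        struct page *page;
        int error = 0;  /* no busy page at all means success */

        if (!dax_mapping(inode->i_mapping))
            return 0;

        do {
            page = dax_layout_busy_page_range(inode->i_mapping,
                            start, end);
            if (!page)
                break;

            error = wait_page_idle(page, cb, inode);
        } while (error == 0);

        return error;
    }
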
> > +
> > + return error;
> > +}
> > +EXPORT_SYMBOL_GPL(dax_break_mapping);
> > +
> > /*
> > * Invalidate DAX entry if it is clean.
> > */
>
> <I'm no expert, skipping to xfs>
>
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 42ea203..295730a 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -2715,21 +2715,17 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
> > struct xfs_inode *ip2)
> > {
> > int error;
> > - bool retry;
> > struct page *page;
> >
> > if (ip1->i_ino > ip2->i_ino)
> > swap(ip1, ip2);
> >
> > again:
> > - retry = false;
> > /* Lock the first inode */
> > xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
> > - error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
> > - if (error || retry) {
> > + error = xfs_break_dax_layouts(VFS_I(ip1));
> > + if (error) {
> > xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> > - if (error == 0 && retry)
> > - goto again;
>
> Hmm, so the retry loop has moved into xfs_break_dax_layouts, which means
> that we no longer cycle the MMAPLOCK. Why was the lock cycling
> unnecessary?
Because the lock cycling is already happening in the xfs_wait_dax_page()
callback which is called as part of the retry loop in dax_break_mapping().
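
For reference, that callback is basically just a lock cycle around schedule(),
much like the ext4_wait_dax_page() variant elsewhere in the series (sketch from
memory, not a verbatim copy of fs/xfs/xfs_inode.c):

    static void
    xfs_wait_dax_page(
        struct inode        *inode)
    {
        struct xfs_inode    *ip = XFS_I(inode);

        /* drop the lock so whoever holds the page reference can drop it */
        xfs_iunlock(ip, XFS_MMAPLOCK_EXCL);
        schedule();
        xfs_ilock(ip, XFS_MMAPLOCK_EXCL);
    }

So every pass of the retry loop in dax_break_mapping() still releases and
re-takes XFS_MMAPLOCK_EXCL while sleeping.
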
> > return error;
> > }
> >
> > @@ -2988,19 +2984,11 @@ xfs_wait_dax_page(
> >
> > int
> > xfs_break_dax_layouts(
> > - struct inode *inode,
> > - bool *retry)
> > + struct inode *inode)
> > {
> > - struct page *page;
> > -
> > xfs_assert_ilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL);
> >
> > - page = dax_layout_busy_page(inode->i_mapping);
> > - if (!page)
> > - return 0;
> > -
> > - *retry = true;
> > - return dax_wait_page_idle(page, xfs_wait_dax_page, inode);
> > + return dax_break_mapping_inode(inode, xfs_wait_dax_page);
> > }
> >
> > int
> > @@ -3018,8 +3006,7 @@ xfs_break_layouts(
> > retry = false;
> > switch (reason) {
> > case BREAK_UNMAP:
> > - error = xfs_break_dax_layouts(inode, &retry);
> > - if (error || retry)
> > + if (xfs_break_dax_layouts(inode))
>
> dax_break_mapping can return -ERESTARTSYS, right? So doesn't this need
> to be:
> error = xfs_break_dax_layouts(inode);
> if (error)
> break;
>
> Hm?
Right. Thanks for the review, have fixed for the next respin.
- Alistair
> --D
>
> > break;
> > fallthrough;
> > case BREAK_WRITE:
> > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > index 1648dc5..c4f03f6 100644
> > --- a/fs/xfs/xfs_inode.h
> > +++ b/fs/xfs/xfs_inode.h
> > @@ -593,7 +593,7 @@ xfs_itruncate_extents(
> > return xfs_itruncate_extents_flags(tpp, ip, whichfork, new_size, 0);
> > }
> >
> > -int xfs_break_dax_layouts(struct inode *inode, bool *retry);
> > +int xfs_break_dax_layouts(struct inode *inode);
> > int xfs_break_layouts(struct inode *inode, uint *iolock,
> > enum layout_break_reason reason);
> >
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index 9b1ce98..f6583d3 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -228,6 +228,20 @@ static inline void dax_read_unlock(int id)
> > {
> > }
> > #endif /* CONFIG_DAX */
> > +
> > +#if !IS_ENABLED(CONFIG_FS_DAX)
> > +static inline int __must_check dax_break_mapping(struct inode *inode,
> > + loff_t start, loff_t end, void (cb)(struct inode *))
> > +{
> > + return 0;
> > +}
> > +
> > +static inline void dax_break_mapping_uninterruptible(struct inode *inode,
> > + void (cb)(struct inode *))
> > +{
> > +}
> > +#endif
> > +
> > bool dax_alive(struct dax_device *dax_dev);
> > void *dax_get_private(struct dax_device *dax_dev);
> > long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
> > @@ -251,6 +265,13 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
> > int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
> > int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
> > pgoff_t index);
> > +int __must_check dax_break_mapping(struct inode *inode, loff_t start,
> > + loff_t end, void (cb)(struct inode *));
> > +static inline int __must_check dax_break_mapping_inode(struct inode *inode,
> > + void (cb)(struct inode *))
> > +{
> > + return dax_break_mapping(inode, 0, LLONG_MAX, cb);
> > +}
> > int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> > struct inode *dest, loff_t destoff,
> > loff_t len, bool *is_same,
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 49f3a75..1f4c99e 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1063,7 +1063,7 @@ static int guard_install_pud_entry(pud_t *pud, unsigned long addr,
> > pud_t pudval = pudp_get(pud);
> >
> > /* If huge return >0 so we abort the operation + zap. */
> > - return pud_trans_huge(pudval) || pud_devmap(pudval);
> > + return pud_trans_huge(pudval);
> > }
> >
> > static int guard_install_pmd_entry(pmd_t *pmd, unsigned long addr,
> > @@ -1072,7 +1072,7 @@ static int guard_install_pmd_entry(pmd_t *pmd, unsigned long addr,
> > pmd_t pmdval = pmdp_get(pmd);
> >
> > /* If huge return >0 so we abort the operation + zap. */
> > - return pmd_trans_huge(pmdval) || pmd_devmap(pmdval);
> > + return pmd_trans_huge(pmdval);
> > }
> >
> > static int guard_install_pte_entry(pte_t *pte, unsigned long addr,
> > @@ -1183,7 +1183,7 @@ static int guard_remove_pud_entry(pud_t *pud, unsigned long addr,
> > pud_t pudval = pudp_get(pud);
> >
> > /* If huge, cannot have guard pages present, so no-op - skip. */
> > - if (pud_trans_huge(pudval) || pud_devmap(pudval))
> > + if (pud_trans_huge(pudval))
> > walk->action = ACTION_CONTINUE;
> >
> > return 0;
> > @@ -1195,7 +1195,7 @@ static int guard_remove_pmd_entry(pmd_t *pmd, unsigned long addr,
> > pmd_t pmdval = pmdp_get(pmd);
> >
> > /* If huge, cannot have guard pages present, so no-op - skip. */
> > - if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval))
> > + if (pmd_trans_huge(pmdval))
> > walk->action = ACTION_CONTINUE;
> >
> > return 0;
> > --
> > git-series 0.9.1
> >
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 07/26] fs/dax: Ensure all pages are idle prior to filesystem unmount
2025-01-10 16:50 ` Darrick J. Wong
@ 2025-01-13 0:57 ` Alistair Popple
2025-01-13 2:49 ` Darrick J. Wong
0 siblings, 1 reply; 97+ messages in thread
From: Alistair Popple @ 2025-01-13 0:57 UTC (permalink / raw)
To: Darrick J. Wong
Cc: akpm, dan.j.williams, linux-mm, alison.schofield, lina,
zhang.lyra, gerald.schaefer, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Fri, Jan 10, 2025 at 08:50:19AM -0800, Darrick J. Wong wrote:
> On Fri, Jan 10, 2025 at 05:00:35PM +1100, Alistair Popple wrote:
> > File systems call dax_break_mapping() prior to reallocating file
> > system blocks to ensure the page is not undergoing any DMA or other
> > accesses. Generally this is needed when a file is truncated to ensure
> > that if a block is reallocated nothing is writing to it. However
> > filesystems currently don't call this when an FS DAX inode is evicted.
> >
> > This can cause problems when the file system is unmounted as a page
> > can continue to be undergoing DMA or other remote access after
> > unmount. This means if the file system is remounted any truncate or
> > other operation which requires the underlying file system block to be
> > freed will not wait for the remote access to complete. Therefore a
> > busy block may be reallocated to a new file leading to corruption.
> >
> > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> >
> > ---
> >
> > Changes for v5:
> >
> > - Don't wait for pages to be idle in non-DAX mappings
> > ---
> > fs/dax.c | 29 +++++++++++++++++++++++++++++
> > fs/ext4/inode.c | 32 ++++++++++++++------------------
> > fs/xfs/xfs_inode.c | 9 +++++++++
> > fs/xfs/xfs_inode.h | 1 +
> > fs/xfs/xfs_super.c | 18 ++++++++++++++++++
> > include/linux/dax.h | 2 ++
> > 6 files changed, 73 insertions(+), 18 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 7008a73..4e49cc4 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -883,6 +883,14 @@ static int wait_page_idle(struct page *page,
> > TASK_INTERRUPTIBLE, 0, 0, cb(inode));
> > }
> >
> > +static void wait_page_idle_uninterruptible(struct page *page,
> > + void (cb)(struct inode *),
> > + struct inode *inode)
> > +{
> > + ___wait_var_event(page, page_ref_count(page) == 1,
> > + TASK_UNINTERRUPTIBLE, 0, 0, cb(inode));
> > +}
> > +
> > /*
> > * Unmaps the inode and waits for any DMA to complete prior to deleting the
> > * DAX mapping entries for the range.
> > @@ -911,6 +919,27 @@ int dax_break_mapping(struct inode *inode, loff_t start, loff_t end,
> > }
> > EXPORT_SYMBOL_GPL(dax_break_mapping);
> >
> > +void dax_break_mapping_uninterruptible(struct inode *inode,
> > + void (cb)(struct inode *))
> > +{
> > + struct page *page;
> > +
> > + if (!dax_mapping(inode->i_mapping))
> > + return;
> > +
> > + do {
> > + page = dax_layout_busy_page_range(inode->i_mapping, 0,
> > + LLONG_MAX);
> > + if (!page)
> > + break;
> > +
> > + wait_page_idle_uninterruptible(page, cb, inode);
> > + } while (true);
> > +
> > + dax_delete_mapping_range(inode->i_mapping, 0, LLONG_MAX);
> > +}
> > +EXPORT_SYMBOL_GPL(dax_break_mapping_uninterruptible);
> > +
> > /*
> > * Invalidate DAX entry if it is clean.
> > */
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index ee8e83f..fa35161 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -163,6 +163,18 @@ int ext4_inode_is_fast_symlink(struct inode *inode)
> > (inode->i_size < EXT4_N_BLOCKS * 4);
> > }
> >
> > +static void ext4_wait_dax_page(struct inode *inode)
> > +{
> > + filemap_invalidate_unlock(inode->i_mapping);
> > + schedule();
> > + filemap_invalidate_lock(inode->i_mapping);
> > +}
> > +
> > +int ext4_break_layouts(struct inode *inode)
> > +{
> > + return dax_break_mapping_inode(inode, ext4_wait_dax_page);
> > +}
> > +
> > /*
> > * Called at the last iput() if i_nlink is zero.
> > */
> > @@ -181,6 +193,8 @@ void ext4_evict_inode(struct inode *inode)
> >
> > trace_ext4_evict_inode(inode);
> >
> > + dax_break_mapping_uninterruptible(inode, ext4_wait_dax_page);
> > +
> > if (EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL)
> > ext4_evict_ea_inode(inode);
> > if (inode->i_nlink) {
> > @@ -3902,24 +3916,6 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
> > return ret;
> > }
> >
> > -static void ext4_wait_dax_page(struct inode *inode)
> > -{
> > - filemap_invalidate_unlock(inode->i_mapping);
> > - schedule();
> > - filemap_invalidate_lock(inode->i_mapping);
> > -}
> > -
> > -int ext4_break_layouts(struct inode *inode)
> > -{
> > - struct page *page;
> > - int error;
> > -
> > - if (WARN_ON_ONCE(!rwsem_is_locked(&inode->i_mapping->invalidate_lock)))
> > - return -EINVAL;
> > -
> > - return dax_break_mapping_inode(inode, ext4_wait_dax_page);
> > -}
> > -
> > /*
> > * ext4_punch_hole: punches a hole in a file by releasing the blocks
> > * associated with the given offset and length
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 4410b42..c7ec5ab 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -2997,6 +2997,15 @@ xfs_break_dax_layouts(
> > return dax_break_mapping_inode(inode, xfs_wait_dax_page);
> > }
> >
> > +void
> > +xfs_break_dax_layouts_uninterruptible(
> > + struct inode *inode)
> > +{
> > + xfs_assert_ilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL);
> > +
> > + dax_break_mapping_uninterruptible(inode, xfs_wait_dax_page);
> > +}
> > +
> > int
> > xfs_break_layouts(
> > struct inode *inode,
> > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > index c4f03f6..613797a 100644
> > --- a/fs/xfs/xfs_inode.h
> > +++ b/fs/xfs/xfs_inode.h
> > @@ -594,6 +594,7 @@ xfs_itruncate_extents(
> > }
> >
> > int xfs_break_dax_layouts(struct inode *inode);
> > +void xfs_break_dax_layouts_uninterruptible(struct inode *inode);
> > int xfs_break_layouts(struct inode *inode, uint *iolock,
> > enum layout_break_reason reason);
> >
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 8524b9d..73ec060 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -751,6 +751,23 @@ xfs_fs_drop_inode(
> > return generic_drop_inode(inode);
> > }
> >
> > +STATIC void
> > +xfs_fs_evict_inode(
> > + struct inode *inode)
> > +{
> > + struct xfs_inode *ip = XFS_I(inode);
> > + uint iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> > +
> > + if (IS_DAX(inode)) {
> > + xfs_ilock(ip, iolock);
> > + xfs_break_dax_layouts_uninterruptible(inode);
> > + xfs_iunlock(ip, iolock);
>
> If we're evicting the inode, why is it necessary to take i_rwsem and the
> mmap invalidation lock? Shouldn't the evicting thread be the only one
> with access to this inode?
Hmm, good point. I think you're right. I can easily stop taking
XFS_IOLOCK_EXCL. Not taking XFS_MMAPLOCK_EXCL is slightly more difficult because
xfs_wait_dax_page() expects it to be taken. Do you think it is worth creating a
separate callback (xfs_wait_dax_page_unlocked()?) specifically for this path or
would you be happy with a comment explaining why we take the XFS_MMAPLOCK_EXCL
lock here?
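
Something like the following is what I had in mind - purely illustrative, the
name and shape are just a suggestion:

    /*
     * Hypothetical eviction-path variant: no MMAPLOCK is held here, so
     * there is nothing to cycle - just sleep until the refcount drops
     * and the wait loop re-checks the page.
     */
    static void
    xfs_wait_dax_page_unlocked(
        struct inode    *inode)
    {
        schedule();
    }
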
- Alistair
> --D
>
> > + }
> > +
> > + truncate_inode_pages_final(&inode->i_data);
> > + clear_inode(inode);
> > +}
> > +
> > static void
> > xfs_mount_free(
> > struct xfs_mount *mp)
> > @@ -1189,6 +1206,7 @@ static const struct super_operations xfs_super_operations = {
> > .destroy_inode = xfs_fs_destroy_inode,
> > .dirty_inode = xfs_fs_dirty_inode,
> > .drop_inode = xfs_fs_drop_inode,
> > + .evict_inode = xfs_fs_evict_inode,
> > .put_super = xfs_fs_put_super,
> > .sync_fs = xfs_fs_sync_fs,
> > .freeze_fs = xfs_fs_freeze,
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index ef9e02c..7c3773f 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -274,6 +274,8 @@ static inline int __must_check dax_break_mapping_inode(struct inode *inode,
> > {
> > return dax_break_mapping(inode, 0, LLONG_MAX, cb);
> > }
> > +void dax_break_mapping_uninterruptible(struct inode *inode,
> > + void (cb)(struct inode *));
> > int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> > struct inode *dest, loff_t destoff,
> > loff_t len, bool *is_same,
> > --
> > git-series 0.9.1
> >
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts
2025-01-11 3:35 ` Dan Williams
@ 2025-01-13 1:05 ` Alistair Popple
0 siblings, 0 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-13 1:05 UTC (permalink / raw)
To: Dan Williams
Cc: Andrew Morton, linux-mm, alison.schofield, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Fri, Jan 10, 2025 at 07:35:57PM -0800, Dan Williams wrote:
> Andrew Morton wrote:
> > On Thu, 9 Jan 2025 23:05:56 -0800 Dan Williams <dan.j.williams@intel.com> wrote:
> >
> > > > - Remove PTE_DEVMAP definitions from Loongarch which were added since
> > > > this series was initially written.
> > > [..]
> > > >
> > > > base-commit: e25c8d66f6786300b680866c0e0139981273feba
> > >
> > > If this is going to go through nvdimm.git I will need it against a
> > > mainline tag baseline. Linus will want to see the merge conflicts.
> > >
> > > Otherwise if that merge commit is too messy, or you would rather not
> > > rebase, then it either needs to go one of two options:
> > >
> > > - Andrew's tree which is the only tree I know of that can carry
> > > patches relative to linux-next.
> >
> > I used to be able to do that but haven't got around to setting up such
> > a thing with mm.git. This is the first time the need has arisen,
> > really.
>
> Oh, good to know.
>
> >
> > > - Wait for v6.14-rc1
> >
> > I'm thinking so. Darrick's review comments indicate that we'll be seeing a v7.
I'm ok with that. It could do with a decent soak in linux-next anyway given it
touches a lot of mm and fs.
Once v6.14-rc1 is released I will do a rebase on top of that.
> > > and get this into nvdimm.git early in the cycle
> > > when the conflict storm will be low.
> >
> > erk. This patchset hits mm/ a lot, and nvdimm hardly at all. Is it
> > not practical to carry this in mm.git?
>
> I'm totally fine with it going through mm.git. nvdimm.git is just the
> historical path for touches to fs/dax.c, and git blame points mostly to
> me for the issues Alistair is fixing. I am happy to review and ack and
> watch this go through mm.git.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 05/26] fs/dax: Create a common implementation to break DAX layouts
2025-01-13 0:47 ` Alistair Popple
@ 2025-01-13 2:47 ` Darrick J. Wong
0 siblings, 0 replies; 97+ messages in thread
From: Darrick J. Wong @ 2025-01-13 2:47 UTC (permalink / raw)
To: Alistair Popple
Cc: akpm, dan.j.williams, linux-mm, alison.schofield, lina,
zhang.lyra, gerald.schaefer, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Mon, Jan 13, 2025 at 11:47:41AM +1100, Alistair Popple wrote:
> On Fri, Jan 10, 2025 at 08:44:38AM -0800, Darrick J. Wong wrote:
> > On Fri, Jan 10, 2025 at 05:00:33PM +1100, Alistair Popple wrote:
> > > Prior to freeing a block, file systems supporting FS DAX must check
> > > that the associated pages are both unmapped from user-space and not
> > > undergoing DMA or other access from eg. get_user_pages(). This is
> > > achieved by unmapping the file range and scanning the FS DAX
> > > page-cache to see if any pages within the mapping have an elevated
> > > refcount.
> > >
> > > This is done using two functions - dax_layout_busy_page_range(), which
> > > returns a page whose refcount must be waited on to become idle, and
> > > dax_wait_page_idle(), which performs that wait. Rather than open-code
> > > this, introduce a common implementation to both unmap and wait for the
> > > page to become idle.
> > >
> > > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> >
> > So now that Dan Carpenter has complained, I guess I should look at
> > this...
> >
> > > ---
> > >
> > > Changes for v5:
> > >
> > > - Don't wait for idle pages on non-DAX mappings
> > >
> > > Changes for v4:
> > >
> > > - Fixed some build breakage due to missing symbol exports reported by
> > > John Hubbard (thanks!).
> > > ---
> > > fs/dax.c | 33 +++++++++++++++++++++++++++++++++
> > > fs/ext4/inode.c | 10 +---------
> > > fs/fuse/dax.c | 27 +++------------------------
> > > fs/xfs/xfs_inode.c | 23 +++++------------------
> > > fs/xfs/xfs_inode.h | 2 +-
> > > include/linux/dax.h | 21 +++++++++++++++++++++
> > > mm/madvise.c | 8 ++++----
> > > 7 files changed, 68 insertions(+), 56 deletions(-)
> > >
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index d010c10..9c3bd07 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -845,6 +845,39 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
> > > return ret;
> > > }
> > >
> > > +static int wait_page_idle(struct page *page,
> > > + void (cb)(struct inode *),
> > > + struct inode *inode)
> > > +{
> > > + return ___wait_var_event(page, page_ref_count(page) == 1,
> > > + TASK_INTERRUPTIBLE, 0, 0, cb(inode));
> > > +}
> > > +
> > > +/*
> > > + * Unmaps the inode and waits for any DMA to complete prior to deleting the
> > > + * DAX mapping entries for the range.
> > > + */
> > > +int dax_break_mapping(struct inode *inode, loff_t start, loff_t end,
> > > + void (cb)(struct inode *))
> > > +{
> > > + struct page *page;
> > > + int error;
> > > +
> > > + if (!dax_mapping(inode->i_mapping))
> > > + return 0;
> > > +
> > > + do {
> > > + page = dax_layout_busy_page_range(inode->i_mapping, start, end);
> > > + if (!page)
> > > + break;
> > > +
> > > + error = wait_page_idle(page, cb, inode);
> > > + } while (error == 0);
> >
> > You didn't initialize error to 0, so it could be any value. What if
> > dax_layout_busy_page_range returns null the first time through the loop?
>
> Yes. I went down the rabbit hole of figuring out why this didn't produce a
> compiler warning and forgot to go back and fix it. Thanks.
>
> > > +
> > > + return error;
> > > +}
> > > +EXPORT_SYMBOL_GPL(dax_break_mapping);
> > > +
> > > /*
> > > * Invalidate DAX entry if it is clean.
> > > */
> >
> > <I'm no expert, skipping to xfs>
> >
> > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > index 42ea203..295730a 100644
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > > @@ -2715,21 +2715,17 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
> > > struct xfs_inode *ip2)
> > > {
> > > int error;
> > > - bool retry;
> > > struct page *page;
> > >
> > > if (ip1->i_ino > ip2->i_ino)
> > > swap(ip1, ip2);
> > >
> > > again:
> > > - retry = false;
> > > /* Lock the first inode */
> > > xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
> > > - error = xfs_break_dax_layouts(VFS_I(ip1), &retry);
> > > - if (error || retry) {
> > > + error = xfs_break_dax_layouts(VFS_I(ip1));
> > > + if (error) {
> > > xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> > > - if (error == 0 && retry)
> > > - goto again;
> >
> > Hmm, so the retry loop has moved into xfs_break_dax_layouts, which means
> > that we no longer cycle the MMAPLOCK. Why was the lock cycling
> > unnecessary?
>
> Because the lock cycling is already happening in the xfs_wait_dax_page()
> callback which is called as part of the retry loop in dax_break_mapping().
Aha, good point.
--D
> > > return error;
> > > }
> > >
> > > @@ -2988,19 +2984,11 @@ xfs_wait_dax_page(
> > >
> > > int
> > > xfs_break_dax_layouts(
> > > - struct inode *inode,
> > > - bool *retry)
> > > + struct inode *inode)
> > > {
> > > - struct page *page;
> > > -
> > > xfs_assert_ilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL);
> > >
> > > - page = dax_layout_busy_page(inode->i_mapping);
> > > - if (!page)
> > > - return 0;
> > > -
> > > - *retry = true;
> > > - return dax_wait_page_idle(page, xfs_wait_dax_page, inode);
> > > + return dax_break_mapping_inode(inode, xfs_wait_dax_page);
> > > }
> > >
> > > int
> > > @@ -3018,8 +3006,7 @@ xfs_break_layouts(
> > > retry = false;
> > > switch (reason) {
> > > case BREAK_UNMAP:
> > > - error = xfs_break_dax_layouts(inode, &retry);
> > > - if (error || retry)
> > > + if (xfs_break_dax_layouts(inode))
> >
> > dax_break_mapping can return -ERESTARTSYS, right? So doesn't this need
> > to be:
> > error = xfs_break_dax_layouts(inode);
> > if (error)
> > break;
> >
> > Hm?
>
> Right. Thanks for the review, have fixed for the next respin.
>
> - Alistair
>
> > --D
> >
> > > break;
> > > fallthrough;
> > > case BREAK_WRITE:
> > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > > index 1648dc5..c4f03f6 100644
> > > --- a/fs/xfs/xfs_inode.h
> > > +++ b/fs/xfs/xfs_inode.h
> > > @@ -593,7 +593,7 @@ xfs_itruncate_extents(
> > > return xfs_itruncate_extents_flags(tpp, ip, whichfork, new_size, 0);
> > > }
> > >
> > > -int xfs_break_dax_layouts(struct inode *inode, bool *retry);
> > > +int xfs_break_dax_layouts(struct inode *inode);
> > > int xfs_break_layouts(struct inode *inode, uint *iolock,
> > > enum layout_break_reason reason);
> > >
> > > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > > index 9b1ce98..f6583d3 100644
> > > --- a/include/linux/dax.h
> > > +++ b/include/linux/dax.h
> > > @@ -228,6 +228,20 @@ static inline void dax_read_unlock(int id)
> > > {
> > > }
> > > #endif /* CONFIG_DAX */
> > > +
> > > +#if !IS_ENABLED(CONFIG_FS_DAX)
> > > +static inline int __must_check dax_break_mapping(struct inode *inode,
> > > + loff_t start, loff_t end, void (cb)(struct inode *))
> > > +{
> > > + return 0;
> > > +}
> > > +
> > > +static inline void dax_break_mapping_uninterruptible(struct inode *inode,
> > > + void (cb)(struct inode *))
> > > +{
> > > +}
> > > +#endif
> > > +
> > > bool dax_alive(struct dax_device *dax_dev);
> > > void *dax_get_private(struct dax_device *dax_dev);
> > > long dax_direct_access(struct dax_device *dax_dev, pgoff_t pgoff, long nr_pages,
> > > @@ -251,6 +265,13 @@ vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
> > > int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
> > > int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
> > > pgoff_t index);
> > > +int __must_check dax_break_mapping(struct inode *inode, loff_t start,
> > > + loff_t end, void (cb)(struct inode *));
> > > +static inline int __must_check dax_break_mapping_inode(struct inode *inode,
> > > + void (cb)(struct inode *))
> > > +{
> > > + return dax_break_mapping(inode, 0, LLONG_MAX, cb);
> > > +}
> > > int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> > > struct inode *dest, loff_t destoff,
> > > loff_t len, bool *is_same,
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 49f3a75..1f4c99e 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -1063,7 +1063,7 @@ static int guard_install_pud_entry(pud_t *pud, unsigned long addr,
> > > pud_t pudval = pudp_get(pud);
> > >
> > > /* If huge return >0 so we abort the operation + zap. */
> > > - return pud_trans_huge(pudval) || pud_devmap(pudval);
> > > + return pud_trans_huge(pudval);
> > > }
> > >
> > > static int guard_install_pmd_entry(pmd_t *pmd, unsigned long addr,
> > > @@ -1072,7 +1072,7 @@ static int guard_install_pmd_entry(pmd_t *pmd, unsigned long addr,
> > > pmd_t pmdval = pmdp_get(pmd);
> > >
> > > /* If huge return >0 so we abort the operation + zap. */
> > > - return pmd_trans_huge(pmdval) || pmd_devmap(pmdval);
> > > + return pmd_trans_huge(pmdval);
> > > }
> > >
> > > static int guard_install_pte_entry(pte_t *pte, unsigned long addr,
> > > @@ -1183,7 +1183,7 @@ static int guard_remove_pud_entry(pud_t *pud, unsigned long addr,
> > > pud_t pudval = pudp_get(pud);
> > >
> > > /* If huge, cannot have guard pages present, so no-op - skip. */
> > > - if (pud_trans_huge(pudval) || pud_devmap(pudval))
> > > + if (pud_trans_huge(pudval))
> > > walk->action = ACTION_CONTINUE;
> > >
> > > return 0;
> > > @@ -1195,7 +1195,7 @@ static int guard_remove_pmd_entry(pmd_t *pmd, unsigned long addr,
> > > pmd_t pmdval = pmdp_get(pmd);
> > >
> > > /* If huge, cannot have guard pages present, so no-op - skip. */
> > > - if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval))
> > > + if (pmd_trans_huge(pmdval))
> > > walk->action = ACTION_CONTINUE;
> > >
> > > return 0;
> > > --
> > > git-series 0.9.1
> > >
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 07/26] fs/dax: Ensure all pages are idle prior to filesystem unmount
2025-01-13 0:57 ` Alistair Popple
@ 2025-01-13 2:49 ` Darrick J. Wong
2025-01-13 5:48 ` Alistair Popple
0 siblings, 1 reply; 97+ messages in thread
From: Darrick J. Wong @ 2025-01-13 2:49 UTC (permalink / raw)
To: Alistair Popple
Cc: akpm, dan.j.williams, linux-mm, alison.schofield, lina,
zhang.lyra, gerald.schaefer, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Mon, Jan 13, 2025 at 11:57:18AM +1100, Alistair Popple wrote:
> On Fri, Jan 10, 2025 at 08:50:19AM -0800, Darrick J. Wong wrote:
> > On Fri, Jan 10, 2025 at 05:00:35PM +1100, Alistair Popple wrote:
> > > File systems call dax_break_mapping() prior to reallocating file
> > > system blocks to ensure the page is not undergoing any DMA or other
> > > accesses. Generally this is needed when a file is truncated to ensure
> > > that if a block is reallocated nothing is writing to it. However
> > > filesystems currently don't call this when an FS DAX inode is evicted.
> > >
> > > This can cause problems when the file system is unmounted as a page
> > > can continue to be undergoing DMA or other remote access after
> > > unmount. This means if the file system is remounted any truncate or
> > > other operation which requires the underlying file system block to be
> > > freed will not wait for the remote access to complete. Therefore a
> > > busy block may be reallocated to a new file leading to corruption.
> > >
> > > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> > >
> > > ---
> > >
> > > Changes for v5:
> > >
> > > - Don't wait for pages to be idle in non-DAX mappings
> > > ---
> > > fs/dax.c | 29 +++++++++++++++++++++++++++++
> > > fs/ext4/inode.c | 32 ++++++++++++++------------------
> > > fs/xfs/xfs_inode.c | 9 +++++++++
> > > fs/xfs/xfs_inode.h | 1 +
> > > fs/xfs/xfs_super.c | 18 ++++++++++++++++++
> > > include/linux/dax.h | 2 ++
> > > 6 files changed, 73 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/fs/dax.c b/fs/dax.c
> > > index 7008a73..4e49cc4 100644
> > > --- a/fs/dax.c
> > > +++ b/fs/dax.c
> > > @@ -883,6 +883,14 @@ static int wait_page_idle(struct page *page,
> > > TASK_INTERRUPTIBLE, 0, 0, cb(inode));
> > > }
> > >
> > > +static void wait_page_idle_uninterruptible(struct page *page,
> > > + void (cb)(struct inode *),
> > > + struct inode *inode)
> > > +{
> > > + ___wait_var_event(page, page_ref_count(page) == 1,
> > > + TASK_UNINTERRUPTIBLE, 0, 0, cb(inode));
> > > +}
> > > +
> > > /*
> > > * Unmaps the inode and waits for any DMA to complete prior to deleting the
> > > * DAX mapping entries for the range.
> > > @@ -911,6 +919,27 @@ int dax_break_mapping(struct inode *inode, loff_t start, loff_t end,
> > > }
> > > EXPORT_SYMBOL_GPL(dax_break_mapping);
> > >
> > > +void dax_break_mapping_uninterruptible(struct inode *inode,
> > > + void (cb)(struct inode *))
> > > +{
> > > + struct page *page;
> > > +
> > > + if (!dax_mapping(inode->i_mapping))
> > > + return;
> > > +
> > > + do {
> > > + page = dax_layout_busy_page_range(inode->i_mapping, 0,
> > > + LLONG_MAX);
> > > + if (!page)
> > > + break;
> > > +
> > > + wait_page_idle_uninterruptible(page, cb, inode);
> > > + } while (true);
> > > +
> > > + dax_delete_mapping_range(inode->i_mapping, 0, LLONG_MAX);
> > > +}
> > > +EXPORT_SYMBOL_GPL(dax_break_mapping_uninterruptible);
> > > +
> > > /*
> > > * Invalidate DAX entry if it is clean.
> > > */
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index ee8e83f..fa35161 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -163,6 +163,18 @@ int ext4_inode_is_fast_symlink(struct inode *inode)
> > > (inode->i_size < EXT4_N_BLOCKS * 4);
> > > }
> > >
> > > +static void ext4_wait_dax_page(struct inode *inode)
> > > +{
> > > + filemap_invalidate_unlock(inode->i_mapping);
> > > + schedule();
> > > + filemap_invalidate_lock(inode->i_mapping);
> > > +}
> > > +
> > > +int ext4_break_layouts(struct inode *inode)
> > > +{
> > > + return dax_break_mapping_inode(inode, ext4_wait_dax_page);
> > > +}
> > > +
> > > /*
> > > * Called at the last iput() if i_nlink is zero.
> > > */
> > > @@ -181,6 +193,8 @@ void ext4_evict_inode(struct inode *inode)
> > >
> > > trace_ext4_evict_inode(inode);
> > >
> > > + dax_break_mapping_uninterruptible(inode, ext4_wait_dax_page);
> > > +
> > > if (EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL)
> > > ext4_evict_ea_inode(inode);
> > > if (inode->i_nlink) {
> > > @@ -3902,24 +3916,6 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
> > > return ret;
> > > }
> > >
> > > -static void ext4_wait_dax_page(struct inode *inode)
> > > -{
> > > - filemap_invalidate_unlock(inode->i_mapping);
> > > - schedule();
> > > - filemap_invalidate_lock(inode->i_mapping);
> > > -}
> > > -
> > > -int ext4_break_layouts(struct inode *inode)
> > > -{
> > > - struct page *page;
> > > - int error;
> > > -
> > > - if (WARN_ON_ONCE(!rwsem_is_locked(&inode->i_mapping->invalidate_lock)))
> > > - return -EINVAL;
> > > -
> > > - return dax_break_mapping_inode(inode, ext4_wait_dax_page);
> > > -}
> > > -
> > > /*
> > > * ext4_punch_hole: punches a hole in a file by releasing the blocks
> > > * associated with the given offset and length
> > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > index 4410b42..c7ec5ab 100644
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > > @@ -2997,6 +2997,15 @@ xfs_break_dax_layouts(
> > > return dax_break_mapping_inode(inode, xfs_wait_dax_page);
> > > }
> > >
> > > +void
> > > +xfs_break_dax_layouts_uninterruptible(
> > > + struct inode *inode)
> > > +{
> > > + xfs_assert_ilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL);
> > > +
> > > + dax_break_mapping_uninterruptible(inode, xfs_wait_dax_page);
> > > +}
> > > +
> > > int
> > > xfs_break_layouts(
> > > struct inode *inode,
> > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > > index c4f03f6..613797a 100644
> > > --- a/fs/xfs/xfs_inode.h
> > > +++ b/fs/xfs/xfs_inode.h
> > > @@ -594,6 +594,7 @@ xfs_itruncate_extents(
> > > }
> > >
> > > int xfs_break_dax_layouts(struct inode *inode);
> > > +void xfs_break_dax_layouts_uninterruptible(struct inode *inode);
> > > int xfs_break_layouts(struct inode *inode, uint *iolock,
> > > enum layout_break_reason reason);
> > >
> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index 8524b9d..73ec060 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -751,6 +751,23 @@ xfs_fs_drop_inode(
> > > return generic_drop_inode(inode);
> > > }
> > >
> > > +STATIC void
> > > +xfs_fs_evict_inode(
> > > + struct inode *inode)
> > > +{
> > > + struct xfs_inode *ip = XFS_I(inode);
> > > + uint iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> > > +
> > > + if (IS_DAX(inode)) {
> > > + xfs_ilock(ip, iolock);
> > > + xfs_break_dax_layouts_uninterruptible(inode);
> > > + xfs_iunlock(ip, iolock);
> >
> > If we're evicting the inode, why is it necessary to take i_rwsem and the
> > mmap invalidation lock? Shouldn't the evicting thread be the only one
> > with access to this inode?
>
> Hmm, good point. I think you're right. I can easily stop taking
> XFS_IOLOCK_EXCL. Not taking XFS_MMAPLOCK_EXCL is slightly more difficult because
> xfs_wait_dax_page() expects it to be taken. Do you think it is worth creating a
> separate callback (xfs_wait_dax_page_unlocked()?) specifically for this path or
> would you be happy with a comment explaining why we take the XFS_MMAPLOCK_EXCL
> lock here?
There shouldn't be any other threads removing "pages" from i_mapping
during eviction, right? If so, I think you can just call schedule()
directly from dax_break_mapping_uninterruptible.
(dax mappings aren't supposed to persist beyond unmount /
eviction, just like regular pagecache, right??)
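
To be concrete, something like this is all I mean (sketch only, not tested):

    static void wait_page_idle_uninterruptible(struct page *page,
            struct inode *inode)
    {
        /* eviction path: no locks held, so no callback/lock cycling needed */
        ___wait_var_event(page, page_ref_count(page) == 1,
                TASK_UNINTERRUPTIBLE, 0, 0, schedule());
    }
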
--D
> - Alistair
>
> > --D
> >
> > > + }
> > > +
> > > + truncate_inode_pages_final(&inode->i_data);
> > > + clear_inode(inode);
> > > +}
> > > +
> > > static void
> > > xfs_mount_free(
> > > struct xfs_mount *mp)
> > > @@ -1189,6 +1206,7 @@ static const struct super_operations xfs_super_operations = {
> > > .destroy_inode = xfs_fs_destroy_inode,
> > > .dirty_inode = xfs_fs_dirty_inode,
> > > .drop_inode = xfs_fs_drop_inode,
> > > + .evict_inode = xfs_fs_evict_inode,
> > > .put_super = xfs_fs_put_super,
> > > .sync_fs = xfs_fs_sync_fs,
> > > .freeze_fs = xfs_fs_freeze,
> > > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > > index ef9e02c..7c3773f 100644
> > > --- a/include/linux/dax.h
> > > +++ b/include/linux/dax.h
> > > @@ -274,6 +274,8 @@ static inline int __must_check dax_break_mapping_inode(struct inode *inode,
> > > {
> > > return dax_break_mapping(inode, 0, LLONG_MAX, cb);
> > > }
> > > +void dax_break_mapping_uninterruptible(struct inode *inode,
> > > + void (cb)(struct inode *));
> > > int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> > > struct inode *dest, loff_t destoff,
> > > loff_t len, bool *is_same,
> > > --
> > > git-series 0.9.1
> > >
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 21/26] fs/dax: Properly refcount fs dax pages
2025-01-10 16:54 ` Darrick J. Wong
@ 2025-01-13 3:18 ` Alistair Popple
0 siblings, 0 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-13 3:18 UTC (permalink / raw)
To: Darrick J. Wong
Cc: akpm, dan.j.williams, linux-mm, alison.schofield, lina,
zhang.lyra, gerald.schaefer, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Fri, Jan 10, 2025 at 08:54:55AM -0800, Darrick J. Wong wrote:
> On Fri, Jan 10, 2025 at 05:00:49PM +1100, Alistair Popple wrote:
> > Currently fs dax pages are considered free when the refcount drops to
> > one and their refcounts are not increased when mapped via PTEs or
> > decreased when unmapped. This requires special logic in mm paths to
> > detect that these pages should not be properly refcounted, and to
> > detect when the refcount drops to one instead of zero.
> >
> > On the other hand get_user_pages(), etc. will properly refcount fs dax
> > pages by taking a reference and dropping it when the page is
> > unpinned.
> >
> > Tracking this special behaviour requires extra PTE bits
> > (eg. pte_devmap) and introduces rules that are potentially confusing
> > and specific to FS DAX pages. To fix this, and to possibly allow
> > removal of the special PTE bits in future, convert the fs dax page
> > refcounts to be zero based and instead take a reference on the page
> > each time it is mapped as is currently the case for normal pages.
> >
> > This may also allow a future clean-up to remove the pgmap refcounting
> > that is currently done in mm/gup.c.
> >
> > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> >
> > ---
> >
> > Changes since v2:
> >
> > Based on some questions from Dan I attempted to have the FS DAX page
> > cache (ie. address space) hold a reference to the folio whilst it was
> > mapped. However I came to the strong conclusion that this was not the
> > right thing to do.
> >
> > If the page refcount == 0 it means the page is:
> >
> > 1. not mapped into user-space
> > 2. not subject to other access via DMA/GUP/etc.
> >
> > Ie. From the core MM perspective the page is not in use.
> >
> > The fact a page may or may not be present in one or more address space
> > mappings is irrelevant for core MM. It just means the page is still in
> > use or valid from the file system perspective, and it's a
> > responsibility of the file system to remove these mappings if the pfn
> > mapping becomes invalid (along with first making sure the MM state,
> > ie. page->refcount, is idle). So we shouldn't be trying to track that
> > lifetime with MM refcounts.
> >
> > Doing so just makes DMA-idle tracking more complex because there is
> > now another thing (one or more address spaces) which can hold
> > references on a page. And FS DAX can't even keep track of all the
> > address spaces which might contain a reference to the page in the
> > XFS/reflink case anyway.
> >
> > We could do this if we made file systems invalidate all address space
> > mappings prior to calling dax_break_layouts(), but that isn't
> > currently neccessary and would lead to increased faults just so we
> > could do some superfluous refcounting which the file system already
> > does.
> >
> > I have however put the page sharing checks and WARN_ON's back which
> > also turned out to be useful for figuring out when to re-initialise
> > a folio.
> > ---
> > drivers/nvdimm/pmem.c | 4 +-
> > fs/dax.c | 212 +++++++++++++++++++++++-----------------
> > fs/fuse/virtio_fs.c | 3 +-
> > fs/xfs/xfs_inode.c | 2 +-
> > include/linux/dax.h | 6 +-
> > include/linux/mm.h | 27 +-----
> > include/linux/mm_types.h | 7 +-
> > mm/gup.c | 9 +--
> > mm/huge_memory.c | 6 +-
> > mm/internal.h | 2 +-
> > mm/memory-failure.c | 6 +-
> > mm/memory.c | 6 +-
> > mm/memremap.c | 47 ++++-----
> > mm/mm_init.c | 9 +--
> > mm/swap.c | 2 +-
> > 15 files changed, 183 insertions(+), 165 deletions(-)
> >
> > diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> > index d81faa9..785b2d2 100644
> > --- a/drivers/nvdimm/pmem.c
> > +++ b/drivers/nvdimm/pmem.c
> > @@ -513,7 +513,7 @@ static int pmem_attach_disk(struct device *dev,
> >
> > pmem->disk = disk;
> > pmem->pgmap.owner = pmem;
> > - pmem->pfn_flags = PFN_DEV;
> > + pmem->pfn_flags = 0;
> > if (is_nd_pfn(dev)) {
> > pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
> > pmem->pgmap.ops = &fsdax_pagemap_ops;
> > @@ -522,7 +522,6 @@ static int pmem_attach_disk(struct device *dev,
> > pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
> > pmem->pfn_pad = resource_size(res) -
> > range_len(&pmem->pgmap.range);
> > - pmem->pfn_flags |= PFN_MAP;
> > bb_range = pmem->pgmap.range;
> > bb_range.start += pmem->data_offset;
> > } else if (pmem_should_map_pages(dev)) {
> > @@ -532,7 +531,6 @@ static int pmem_attach_disk(struct device *dev,
> > pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
> > pmem->pgmap.ops = &fsdax_pagemap_ops;
> > addr = devm_memremap_pages(dev, &pmem->pgmap);
> > - pmem->pfn_flags |= PFN_MAP;
> > bb_range = pmem->pgmap.range;
> > } else {
> > addr = devm_memremap(dev, pmem->phys_addr,
> > diff --git a/fs/dax.c b/fs/dax.c
> > index d35dbe1..19f444e 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -71,6 +71,11 @@ static unsigned long dax_to_pfn(void *entry)
> > return xa_to_value(entry) >> DAX_SHIFT;
> > }
> >
> > +static struct folio *dax_to_folio(void *entry)
> > +{
> > + return page_folio(pfn_to_page(dax_to_pfn(entry)));
> > +}
> > +
> > static void *dax_make_entry(pfn_t pfn, unsigned long flags)
> > {
> > return xa_mk_value(flags | (pfn_t_to_pfn(pfn) << DAX_SHIFT));
> > @@ -338,44 +343,88 @@ static unsigned long dax_entry_size(void *entry)
> > return PAGE_SIZE;
> > }
> >
> > -static unsigned long dax_end_pfn(void *entry)
> > -{
> > - return dax_to_pfn(entry) + dax_entry_size(entry) / PAGE_SIZE;
> > -}
> > -
> > -/*
> > - * Iterate through all mapped pfns represented by an entry, i.e. skip
> > - * 'empty' and 'zero' entries.
> > - */
> > -#define for_each_mapped_pfn(entry, pfn) \
> > - for (pfn = dax_to_pfn(entry); \
> > - pfn < dax_end_pfn(entry); pfn++)
> > -
> > /*
> > * A DAX page is considered shared if it has no mapping set and ->share (which
> > * shares the ->index field) is non-zero. Note this may return false even if the
> > * page is shared between multiple files but has not yet actually been mapped
> > * into multiple address spaces.
> > */
> > -static inline bool dax_page_is_shared(struct page *page)
> > +static inline bool dax_folio_is_shared(struct folio *folio)
> > {
> > - return !page->mapping && page->share;
> > + return !folio->mapping && folio->share;
> > }
> >
> > /*
> > - * Increase the page share refcount, warning if the page is not marked as shared.
> > + * Increase the folio share refcount, warning if the folio is not marked as shared.
> > */
> > -static inline void dax_page_share_get(struct page *page)
> > +static inline void dax_folio_share_get(void *entry)
> > {
> > - WARN_ON_ONCE(!page->share);
> > - WARN_ON_ONCE(page->mapping);
> > - page->share++;
> > + struct folio *folio = dax_to_folio(entry);
> > +
> > + WARN_ON_ONCE(!folio->share);
> > + WARN_ON_ONCE(folio->mapping);
> > + WARN_ON_ONCE(dax_entry_order(entry) != folio_order(folio));
> > + folio->share++;
> > +}
> > +
> > +static inline unsigned long dax_folio_share_put(struct folio *folio)
> > +{
> > + unsigned long ref;
> > +
> > + if (!dax_folio_is_shared(folio))
> > + ref = 0;
> > + else
> > + ref = --folio->share;
> > +
> > + WARN_ON_ONCE(ref < 0);
> > + if (!ref) {
> > + folio->mapping = NULL;
> > + if (folio_order(folio)) {
> > + struct dev_pagemap *pgmap = page_pgmap(&folio->page);
> > + unsigned int order = folio_order(folio);
> > + unsigned int i;
> > +
> > + for (i = 0; i < (1UL << order); i++) {
> > + struct page *page = folio_page(folio, i);
> > +
> > + ClearPageHead(page);
> > + clear_compound_head(page);
> > +
> > + /*
> > + * Reset pgmap which was over-written by
> > + * prep_compound_page().
> > + */
> > + page_folio(page)->pgmap = pgmap;
> > +
> > + /* Make sure this isn't set to TAIL_MAPPING */
> > + page->mapping = NULL;
> > + page->share = 0;
> > + WARN_ON_ONCE(page_ref_count(page));
> > + }
> > + }
> > + }
> > +
> > + return ref;
> > }
> >
> > -static inline unsigned long dax_page_share_put(struct page *page)
> > +static void dax_device_folio_init(void *entry)
> > {
> > - WARN_ON_ONCE(!page->share);
> > - return --page->share;
> > + struct folio *folio = dax_to_folio(entry);
> > + int order = dax_entry_order(entry);
> > +
> > + /*
> > + * Folio should have been split back to order-0 pages in
> > + * dax_folio_share_put() when they were removed from their
> > + * final mapping.
> > + */
> > + WARN_ON_ONCE(folio_order(folio));
> > +
> > + if (order > 0) {
> > + prep_compound_page(&folio->page, order);
> > + if (order > 1)
> > + INIT_LIST_HEAD(&folio->_deferred_list);
> > + WARN_ON_ONCE(folio_ref_count(folio));
> > + }
> > }
> >
> > /*
> > @@ -388,72 +437,58 @@ static inline unsigned long dax_page_share_put(struct page *page)
> > * dax_holder_operations.
> > */
> > static void dax_associate_entry(void *entry, struct address_space *mapping,
> > - struct vm_area_struct *vma, unsigned long address, bool shared)
> > + struct vm_area_struct *vma, unsigned long address, bool shared)
> > {
> > - unsigned long size = dax_entry_size(entry), pfn, index;
> > - int i = 0;
> > + unsigned long size = dax_entry_size(entry), index;
> > + struct folio *folio = dax_to_folio(entry);
> >
> > if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> > return;
> >
> > index = linear_page_index(vma, address & ~(size - 1));
> > - for_each_mapped_pfn(entry, pfn) {
> > - struct page *page = pfn_to_page(pfn);
> > -
> > - if (shared && page->mapping && page->share) {
> > - if (page->mapping) {
> > - page->mapping = NULL;
> > + if (shared && (folio->mapping || dax_folio_is_shared(folio))) {
> > + if (folio->mapping) {
> > + folio->mapping = NULL;
> >
> > - /*
> > - * Page has already been mapped into one address
> > - * space so set the share count.
> > - */
> > - page->share = 1;
> > - }
> > -
> > - dax_page_share_get(page);
> > - } else {
> > - WARN_ON_ONCE(page->mapping);
> > - page->mapping = mapping;
> > - page->index = index + i++;
> > + /*
> > + * folio has already been mapped into one address
> > + * space so set the share count.
> > + */
> > + folio->share = 1;
> > }
> > +
> > + dax_folio_share_get(entry);
> > + } else {
> > + WARN_ON_ONCE(folio->mapping);
> > + dax_device_folio_init(entry);
> > + folio = dax_to_folio(entry);
> > + folio->mapping = mapping;
> > + folio->index = index;
> > }
> > }
> >
> > static void dax_disassociate_entry(void *entry, struct address_space *mapping,
> > - bool trunc)
> > + bool trunc)
> > {
> > - unsigned long pfn;
> > + struct folio *folio = dax_to_folio(entry);
> >
> > if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> > return;
> >
> > - for_each_mapped_pfn(entry, pfn) {
> > - struct page *page = pfn_to_page(pfn);
> > -
> > - WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> > - if (dax_page_is_shared(page)) {
> > - /* keep the shared flag if this page is still shared */
> > - if (dax_page_share_put(page) > 0)
> > - continue;
> > - } else
> > - WARN_ON_ONCE(page->mapping && page->mapping != mapping);
> > - page->mapping = NULL;
> > - page->index = 0;
> > - }
> > + dax_folio_share_put(folio);
> > }
> >
> > static struct page *dax_busy_page(void *entry)
> > {
> > - unsigned long pfn;
> > + struct folio *folio = dax_to_folio(entry);
> >
> > - for_each_mapped_pfn(entry, pfn) {
> > - struct page *page = pfn_to_page(pfn);
> > + if (dax_is_zero_entry(entry) || dax_is_empty_entry(entry))
> > + return NULL;
> >
> > - if (page_ref_count(page) > 1)
> > - return page;
> > - }
> > - return NULL;
> > + if (folio_ref_count(folio) - folio_mapcount(folio))
> > + return &folio->page;
> > + else
> > + return NULL;
> > }
> >
> > /**
> > @@ -786,7 +821,7 @@ struct page *dax_layout_busy_page(struct address_space *mapping)
> > EXPORT_SYMBOL_GPL(dax_layout_busy_page);
> >
> > static int __dax_invalidate_entry(struct address_space *mapping,
> > - pgoff_t index, bool trunc)
> > + pgoff_t index, bool trunc)
> > {
> > XA_STATE(xas, &mapping->i_pages, index);
> > int ret = 0;
> > @@ -892,7 +927,7 @@ static int wait_page_idle(struct page *page,
> > void (cb)(struct inode *),
> > struct inode *inode)
> > {
> > - return ___wait_var_event(page, page_ref_count(page) == 1,
> > + return ___wait_var_event(page, page_ref_count(page) == 0,
> > TASK_INTERRUPTIBLE, 0, 0, cb(inode));
> > }
> >
> > @@ -900,7 +935,7 @@ static void wait_page_idle_uninterruptible(struct page *page,
> > void (cb)(struct inode *),
> > struct inode *inode)
> > {
> > - ___wait_var_event(page, page_ref_count(page) == 1,
> > + ___wait_var_event(page, page_ref_count(page) == 0,
> > TASK_UNINTERRUPTIBLE, 0, 0, cb(inode));
> > }
> >
> > @@ -949,7 +984,8 @@ void dax_break_mapping_uninterruptible(struct inode *inode,
> > wait_page_idle_uninterruptible(page, cb, inode);
> > } while (true);
> >
> > - dax_delete_mapping_range(inode->i_mapping, 0, LLONG_MAX);
> > + if (!page)
> > + dax_delete_mapping_range(inode->i_mapping, 0, LLONG_MAX);
> > }
> > EXPORT_SYMBOL_GPL(dax_break_mapping_uninterruptible);
> >
> > @@ -1035,8 +1071,10 @@ static void *dax_insert_entry(struct xa_state *xas, struct vm_fault *vmf,
> > void *old;
> >
> > dax_disassociate_entry(entry, mapping, false);
> > - dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
> > - shared);
> > + if (!(flags & DAX_ZERO_PAGE))
> > + dax_associate_entry(new_entry, mapping, vmf->vma, vmf->address,
> > + shared);
> > +
> > /*
> > * Only swap our new entry into the page cache if the current
> > * entry is a zero page or an empty entry. If a normal PTE or
> > @@ -1224,9 +1262,7 @@ static int dax_iomap_direct_access(const struct iomap *iomap, loff_t pos,
> > goto out;
> > if (pfn_t_to_pfn(*pfnp) & (PHYS_PFN(size)-1))
> > goto out;
> > - /* For larger pages we need devmap */
> > - if (length > 1 && !pfn_t_devmap(*pfnp))
> > - goto out;
> > +
> > rc = 0;
> >
> > out_check_addr:
> > @@ -1333,7 +1369,7 @@ static vm_fault_t dax_load_hole(struct xa_state *xas, struct vm_fault *vmf,
> >
> > *entry = dax_insert_entry(xas, vmf, iter, *entry, pfn, DAX_ZERO_PAGE);
> >
> > - ret = vmf_insert_mixed(vmf->vma, vaddr, pfn);
> > + ret = vmf_insert_page_mkwrite(vmf, pfn_t_to_page(pfn), false);
> > trace_dax_load_hole(inode, vmf, ret);
> > return ret;
> > }
> > @@ -1804,7 +1840,8 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
> > loff_t pos = (loff_t)xas->xa_index << PAGE_SHIFT;
> > bool write = iter->flags & IOMAP_WRITE;
> > unsigned long entry_flags = pmd ? DAX_PMD : 0;
> > - int err = 0;
> > + struct folio *folio;
> > + int ret, err = 0;
> > pfn_t pfn;
> > void *kaddr;
> >
> > @@ -1836,17 +1873,18 @@ static vm_fault_t dax_fault_iter(struct vm_fault *vmf,
> > return dax_fault_return(err);
> > }
> >
> > + folio = dax_to_folio(*entry);
> > if (dax_fault_is_synchronous(iter, vmf->vma))
> > return dax_fault_synchronous_pfnp(pfnp, pfn);
> >
> > - /* insert PMD pfn */
> > + folio_ref_inc(folio);
> > if (pmd)
> > - return vmf_insert_pfn_pmd(vmf, pfn, write);
> > + ret = vmf_insert_folio_pmd(vmf, pfn_folio(pfn_t_to_pfn(pfn)), write);
> > + else
> > + ret = vmf_insert_page_mkwrite(vmf, pfn_t_to_page(pfn), write);
> > + folio_put(folio);
> >
> > - /* insert PTE pfn */
> > - if (write)
> > - return vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
> > - return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
> > + return ret;
> > }
> >
> > static vm_fault_t dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
> > @@ -2085,6 +2123,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
> > {
> > struct address_space *mapping = vmf->vma->vm_file->f_mapping;
> > XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, order);
> > + struct folio *folio;
> > void *entry;
> > vm_fault_t ret;
> >
> > @@ -2102,14 +2141,17 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
> > xas_set_mark(&xas, PAGECACHE_TAG_DIRTY);
> > dax_lock_entry(&xas, entry);
> > xas_unlock_irq(&xas);
> > + folio = pfn_folio(pfn_t_to_pfn(pfn));
> > + folio_ref_inc(folio);
> > if (order == 0)
> > - ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
> > + ret = vmf_insert_page_mkwrite(vmf, &folio->page, true);
> > #ifdef CONFIG_FS_DAX_PMD
> > else if (order == PMD_ORDER)
> > - ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
> > + ret = vmf_insert_folio_pmd(vmf, folio, FAULT_FLAG_WRITE);
> > #endif
> > else
> > ret = VM_FAULT_FALLBACK;
> > + folio_put(folio);
> > dax_unlock_entry(&xas, entry);
> > trace_dax_insert_pfn_mkwrite(mapping->host, vmf, ret);
> > return ret;
> > diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> > index 82afe78..2c7b24c 100644
> > --- a/fs/fuse/virtio_fs.c
> > +++ b/fs/fuse/virtio_fs.c
> > @@ -1017,8 +1017,7 @@ static long virtio_fs_direct_access(struct dax_device *dax_dev, pgoff_t pgoff,
> > if (kaddr)
> > *kaddr = fs->window_kaddr + offset;
> > if (pfn)
> > - *pfn = phys_to_pfn_t(fs->window_phys_addr + offset,
> > - PFN_DEV | PFN_MAP);
> > + *pfn = phys_to_pfn_t(fs->window_phys_addr + offset, 0);
> > return nr_pages > max_nr_pages ? max_nr_pages : nr_pages;
> > }
> >
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index c7ec5ab..7bfb4eb 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -2740,7 +2740,7 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
> > * for this nested lock case.
> > */
> > page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
> > - if (page && page_ref_count(page) != 1) {
> > + if (page && page_ref_count(page) != 0) {
>
> You might want to wrap this weird detail for the next filesystem that
> uses it, so that the fine details of fsdax aren't opencoded in xfs:
Good idea. I will introduce this in patch 5 as dax_page_is_idle() and do this
change there. That way I can replace some of the other page_ref_count() checks
in the dax code as well.
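
Something like this minimal sketch is what I have in mind (note the exact
refcount test is still to be decided - when introduced in patch 5 it would
check for a refcount of one and only become zero-based once the refcount
conversion later in the series lands):

static inline bool dax_page_is_idle(struct page *page)
{
	/* no outstanding DMA / GUP references on the page */
	return !page_ref_count(page);
}

with the xfs call site then becoming:

	page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
	if (page && !dax_page_is_idle(page)) {
		/* unlock and retry... */
	}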
- Alistair
>
> static inline bool dax_page_in_use(struct page *page)
> {
> return page && page_ref_count(page) != 0;
> }
>
> page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
> if (dax_page_in_use(page)) {
> /* unlock and retry... */
> }
>
> --D
>
> > xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
> > xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
> > goto again;
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index 7c3773f..dbefea1 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -211,8 +211,12 @@ static inline int dax_wait_page_idle(struct page *page,
> > void (cb)(struct inode *),
> > struct inode *inode)
> > {
> > - return ___wait_var_event(page, page_ref_count(page) == 1,
> > + int ret;
> > +
> > + ret = ___wait_var_event(page, !page_ref_count(page),
> > TASK_INTERRUPTIBLE, 0, 0, cb(inode));
> > +
> > + return ret;
> > }
> >
> > #if IS_ENABLED(CONFIG_DAX)
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 01edca9..a734278 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -1161,6 +1161,8 @@ int vma_is_stack_for_current(struct vm_area_struct *vma);
> > struct mmu_gather;
> > struct inode;
> >
> > +extern void prep_compound_page(struct page *page, unsigned int order);
> > +
> > /*
> > * compound_order() can be called without holding a reference, which means
> > * that niceties like page_folio() don't work. These callers should be
> > @@ -1482,25 +1484,6 @@ vm_fault_t finish_fault(struct vm_fault *vmf);
> > * back into memory.
> > */
> >
> > -#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_FS_DAX)
> > -DECLARE_STATIC_KEY_FALSE(devmap_managed_key);
> > -
> > -bool __put_devmap_managed_folio_refs(struct folio *folio, int refs);
> > -static inline bool put_devmap_managed_folio_refs(struct folio *folio, int refs)
> > -{
> > - if (!static_branch_unlikely(&devmap_managed_key))
> > - return false;
> > - if (!folio_is_zone_device(folio))
> > - return false;
> > - return __put_devmap_managed_folio_refs(folio, refs);
> > -}
> > -#else /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
> > -static inline bool put_devmap_managed_folio_refs(struct folio *folio, int refs)
> > -{
> > - return false;
> > -}
> > -#endif /* CONFIG_ZONE_DEVICE && CONFIG_FS_DAX */
> > -
> > /* 127: arbitrary random number, small enough to assemble well */
> > #define folio_ref_zero_or_close_to_overflow(folio) \
> > ((unsigned int) folio_ref_count(folio) + 127u <= 127u)
> > @@ -1615,12 +1598,6 @@ static inline void put_page(struct page *page)
> > {
> > struct folio *folio = page_folio(page);
> >
> > - /*
> > - * For some devmap managed pages we need to catch refcount transition
> > - * from 2 to 1:
> > - */
> > - if (put_devmap_managed_folio_refs(folio, 1))
> > - return;
> > folio_put(folio);
> > }
> >
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index 54b59b8..e308cb9 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -295,6 +295,8 @@ typedef struct {
> > * anonymous memory.
> > * @index: Offset within the file, in units of pages. For anonymous memory,
> > * this is the index from the beginning of the mmap.
> > + * @share: number of DAX mappings that reference this folio. See
> > + * dax_associate_entry.
> > * @private: Filesystem per-folio data (see folio_attach_private()).
> > * @swap: Used for swp_entry_t if folio_test_swapcache().
> > * @_mapcount: Do not access this member directly. Use folio_mapcount() to
> > @@ -344,7 +346,10 @@ struct folio {
> > struct dev_pagemap *pgmap;
> > };
> > struct address_space *mapping;
> > - pgoff_t index;
> > + union {
> > + pgoff_t index;
> > + unsigned long share;
> > + };
> > union {
> > void *private;
> > swp_entry_t swap;
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 9b587b5..d6575ed 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -96,8 +96,7 @@ static inline struct folio *try_get_folio(struct page *page, int refs)
> > * belongs to this folio.
> > */
> > if (unlikely(page_folio(page) != folio)) {
> > - if (!put_devmap_managed_folio_refs(folio, refs))
> > - folio_put_refs(folio, refs);
> > + folio_put_refs(folio, refs);
> > goto retry;
> > }
> >
> > @@ -116,8 +115,7 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
> > refs *= GUP_PIN_COUNTING_BIAS;
> > }
> >
> > - if (!put_devmap_managed_folio_refs(folio, refs))
> > - folio_put_refs(folio, refs);
> > + folio_put_refs(folio, refs);
> > }
> >
> > /**
> > @@ -565,8 +563,7 @@ static struct folio *try_grab_folio_fast(struct page *page, int refs,
> > */
> > if (unlikely((flags & FOLL_LONGTERM) &&
> > !folio_is_longterm_pinnable(folio))) {
> > - if (!put_devmap_managed_folio_refs(folio, refs))
> > - folio_put_refs(folio, refs);
> > + folio_put_refs(folio, refs);
> > return NULL;
> > }
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index d1ea76e..0cf1151 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2209,7 +2209,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > tlb->fullmm);
> > arch_check_zapped_pmd(vma, orig_pmd);
> > tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
> > - if (vma_is_special_huge(vma)) {
> > + if (!vma_is_dax(vma) && vma_is_special_huge(vma)) {
> > if (arch_needs_pgtable_deposit())
> > zap_deposited_table(tlb->mm, pmd);
> > spin_unlock(ptl);
> > @@ -2853,13 +2853,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> > */
> > if (arch_needs_pgtable_deposit())
> > zap_deposited_table(mm, pmd);
> > - if (vma_is_special_huge(vma))
> > + if (!vma_is_dax(vma) && vma_is_special_huge(vma))
> > return;
> > if (unlikely(is_pmd_migration_entry(old_pmd))) {
> > swp_entry_t entry;
> >
> > entry = pmd_to_swp_entry(old_pmd);
> > folio = pfn_swap_entry_folio(entry);
> > + } else if (is_huge_zero_pmd(old_pmd)) {
> > + return;
> > } else {
> > page = pmd_page(old_pmd);
> > folio = page_folio(page);
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 3922788..c4df0ad 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -733,8 +733,6 @@ static inline void prep_compound_tail(struct page *head, int tail_idx)
> > set_page_private(p, 0);
> > }
> >
> > -extern void prep_compound_page(struct page *page, unsigned int order);
> > -
> > void post_alloc_hook(struct page *page, unsigned int order, gfp_t gfp_flags);
> > extern bool free_pages_prepare(struct page *page, unsigned int order);
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index a7b8ccd..7838bf1 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -419,18 +419,18 @@ static unsigned long dev_pagemap_mapping_shift(struct vm_area_struct *vma,
> > pud = pud_offset(p4d, address);
> > if (!pud_present(*pud))
> > return 0;
> > - if (pud_devmap(*pud))
> > + if (pud_trans_huge(*pud))
> > return PUD_SHIFT;
> > pmd = pmd_offset(pud, address);
> > if (!pmd_present(*pmd))
> > return 0;
> > - if (pmd_devmap(*pmd))
> > + if (pmd_trans_huge(*pmd))
> > return PMD_SHIFT;
> > pte = pte_offset_map(pmd, address);
> > if (!pte)
> > return 0;
> > ptent = ptep_get(pte);
> > - if (pte_present(ptent) && pte_devmap(ptent))
> > + if (pte_present(ptent))
> > ret = PAGE_SHIFT;
> > pte_unmap(pte);
> > return ret;
> > diff --git a/mm/memory.c b/mm/memory.c
> > index c60b819..02e12b0 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3843,13 +3843,15 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
> > if (vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) {
> > /*
> > * VM_MIXEDMAP !pfn_valid() case, or VM_SOFTDIRTY clear on a
> > - * VM_PFNMAP VMA.
> > + * VM_PFNMAP VMA. FS DAX also wants ops->pfn_mkwrite called.
> > *
> > * We should not cow pages in a shared writeable mapping.
> > * Just mark the pages writable and/or call ops->pfn_mkwrite.
> > */
> > - if (!vmf->page)
> > + if (!vmf->page || is_fsdax_page(vmf->page)) {
> > + vmf->page = NULL;
> > return wp_pfn_shared(vmf);
> > + }
> > return wp_page_shared(vmf, folio);
> > }
> >
> > diff --git a/mm/memremap.c b/mm/memremap.c
> > index 68099af..9a8879b 100644
> > --- a/mm/memremap.c
> > +++ b/mm/memremap.c
> > @@ -458,8 +458,13 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
> >
> > void free_zone_device_folio(struct folio *folio)
> > {
> > - if (WARN_ON_ONCE(!folio->pgmap->ops ||
> > - !folio->pgmap->ops->page_free))
> > + struct dev_pagemap *pgmap = folio->pgmap;
> > +
> > + if (WARN_ON_ONCE(!pgmap->ops))
> > + return;
> > +
> > + if (WARN_ON_ONCE(pgmap->type != MEMORY_DEVICE_FS_DAX &&
> > + !pgmap->ops->page_free))
> > return;
> >
> > mem_cgroup_uncharge(folio);
> > @@ -484,26 +489,36 @@ void free_zone_device_folio(struct folio *folio)
> > * For other types of ZONE_DEVICE pages, migration is either
> > * handled differently or not done at all, so there is no need
> > * to clear folio->mapping.
> > + *
> > + * FS DAX pages clear the mapping when the folio->share count hits
> > + * zero, which indicates the page has been removed from the file
> > + * system mapping.
> > */
> > - folio->mapping = NULL;
> > - folio->pgmap->ops->page_free(folio_page(folio, 0));
> > + if (pgmap->type != MEMORY_DEVICE_FS_DAX)
> > + folio->mapping = NULL;
> >
> > - switch (folio->pgmap->type) {
> > + switch (pgmap->type) {
> > case MEMORY_DEVICE_PRIVATE:
> > case MEMORY_DEVICE_COHERENT:
> > - put_dev_pagemap(folio->pgmap);
> > + pgmap->ops->page_free(folio_page(folio, 0));
> > + put_dev_pagemap(pgmap);
> > break;
> >
> > - case MEMORY_DEVICE_FS_DAX:
> > case MEMORY_DEVICE_GENERIC:
> > /*
> > * Reset the refcount to 1 to prepare for handing out the page
> > * again.
> > */
> > + pgmap->ops->page_free(folio_page(folio, 0));
> > folio_set_count(folio, 1);
> > break;
> >
> > + case MEMORY_DEVICE_FS_DAX:
> > + wake_up_var(&folio->page);
> > + break;
> > +
> > case MEMORY_DEVICE_PCI_P2PDMA:
> > + pgmap->ops->page_free(folio_page(folio, 0));
> > break;
> > }
> > }
> > @@ -519,21 +534,3 @@ void zone_device_page_init(struct page *page)
> > lock_page(page);
> > }
> > EXPORT_SYMBOL_GPL(zone_device_page_init);
> > -
> > -#ifdef CONFIG_FS_DAX
> > -bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
> > -{
> > - if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
> > - return false;
> > -
> > - /*
> > - * fsdax page refcounts are 1-based, rather than 0-based: if
> > - * refcount is 1, then the page is free and the refcount is
> > - * stable because nobody holds a reference on the page.
> > - */
> > - if (folio_ref_sub_return(folio, refs) == 1)
> > - wake_up_var(&folio->_refcount);
> > - return true;
> > -}
> > -EXPORT_SYMBOL(__put_devmap_managed_folio_refs);
> > -#endif /* CONFIG_FS_DAX */
> > diff --git a/mm/mm_init.c b/mm/mm_init.c
> > index cb73402..0c12b29 100644
> > --- a/mm/mm_init.c
> > +++ b/mm/mm_init.c
> > @@ -1017,23 +1017,22 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
> > }
> >
> > /*
> > - * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC and
> > - * MEMORY_TYPE_FS_DAX pages are released directly to the driver page
> > - * allocator which will set the page count to 1 when allocating the
> > - * page.
> > + * ZONE_DEVICE pages other than MEMORY_TYPE_GENERIC are released
> > + * directly to the driver page allocator which will set the page count
> > + * to 1 when allocating the page.
> > *
> > * MEMORY_TYPE_GENERIC and MEMORY_TYPE_FS_DAX pages automatically have
> > * their refcount reset to one whenever they are freed (ie. after
> > * their refcount drops to 0).
> > */
> > switch (pgmap->type) {
> > + case MEMORY_DEVICE_FS_DAX:
> > case MEMORY_DEVICE_PRIVATE:
> > case MEMORY_DEVICE_COHERENT:
> > case MEMORY_DEVICE_PCI_P2PDMA:
> > set_page_count(page, 0);
> > break;
> >
> > - case MEMORY_DEVICE_FS_DAX:
> > case MEMORY_DEVICE_GENERIC:
> > break;
> > }
> > diff --git a/mm/swap.c b/mm/swap.c
> > index 062c856..a587842 100644
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -952,8 +952,6 @@ void folios_put_refs(struct folio_batch *folios, unsigned int *refs)
> > unlock_page_lruvec_irqrestore(lruvec, flags);
> > lruvec = NULL;
> > }
> > - if (put_devmap_managed_folio_refs(folio, nr_refs))
> > - continue;
> > if (folio_ref_sub_and_test(folio, nr_refs))
> > free_zone_device_folio(folio);
> > continue;
> > --
> > git-series 0.9.1
> >
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 07/26] fs/dax: Ensure all pages are idle prior to filesystem unmount
2025-01-13 2:49 ` Darrick J. Wong
@ 2025-01-13 5:48 ` Alistair Popple
2025-01-13 16:39 ` Darrick J. Wong
0 siblings, 1 reply; 97+ messages in thread
From: Alistair Popple @ 2025-01-13 5:48 UTC (permalink / raw)
To: Darrick J. Wong
Cc: akpm, dan.j.williams, linux-mm, alison.schofield, lina,
zhang.lyra, gerald.schaefer, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Sun, Jan 12, 2025 at 06:49:40PM -0800, Darrick J. Wong wrote:
> On Mon, Jan 13, 2025 at 11:57:18AM +1100, Alistair Popple wrote:
> > On Fri, Jan 10, 2025 at 08:50:19AM -0800, Darrick J. Wong wrote:
> > > On Fri, Jan 10, 2025 at 05:00:35PM +1100, Alistair Popple wrote:
> > > > File systems call dax_break_mapping() prior to reallocating file
> > > > system blocks to ensure the page is not undergoing any DMA or other
> > > > accesses. Generally this is needed when a file is truncated to ensure
> > > > that if a block is reallocated nothing is writing to it. However
> > > > filesystems currently don't call this when an FS DAX inode is evicted.
> > > >
> > > > This can cause problems when the file system is unmounted as a page
> > > > can continue to be undergoing DMA or other remote access after
> > > > unmount. This means if the file system is remounted any truncate or
> > > > other operation which requires the underlying file system block to be
> > > > freed will not wait for the remote access to complete. Therefore a
> > > > busy block may be reallocated to a new file leading to corruption.
> > > >
> > > > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> > > >
> > > > ---
> > > >
> > > > Changes for v5:
> > > >
> > > > - Don't wait for pages to be idle in non-DAX mappings
> > > > ---
> > > > fs/dax.c | 29 +++++++++++++++++++++++++++++
> > > > fs/ext4/inode.c | 32 ++++++++++++++------------------
> > > > fs/xfs/xfs_inode.c | 9 +++++++++
> > > > fs/xfs/xfs_inode.h | 1 +
> > > > fs/xfs/xfs_super.c | 18 ++++++++++++++++++
> > > > include/linux/dax.h | 2 ++
> > > > 6 files changed, 73 insertions(+), 18 deletions(-)
> > > >
> > > > diff --git a/fs/dax.c b/fs/dax.c
> > > > index 7008a73..4e49cc4 100644
> > > > --- a/fs/dax.c
> > > > +++ b/fs/dax.c
> > > > @@ -883,6 +883,14 @@ static int wait_page_idle(struct page *page,
> > > > TASK_INTERRUPTIBLE, 0, 0, cb(inode));
> > > > }
> > > >
> > > > +static void wait_page_idle_uninterruptible(struct page *page,
> > > > + void (cb)(struct inode *),
> > > > + struct inode *inode)
> > > > +{
> > > > + ___wait_var_event(page, page_ref_count(page) == 1,
> > > > + TASK_UNINTERRUPTIBLE, 0, 0, cb(inode));
> > > > +}
> > > > +
> > > > /*
> > > > * Unmaps the inode and waits for any DMA to complete prior to deleting the
> > > > * DAX mapping entries for the range.
> > > > @@ -911,6 +919,27 @@ int dax_break_mapping(struct inode *inode, loff_t start, loff_t end,
> > > > }
> > > > EXPORT_SYMBOL_GPL(dax_break_mapping);
> > > >
> > > > +void dax_break_mapping_uninterruptible(struct inode *inode,
> > > > + void (cb)(struct inode *))
> > > > +{
> > > > + struct page *page;
> > > > +
> > > > + if (!dax_mapping(inode->i_mapping))
> > > > + return;
> > > > +
> > > > + do {
> > > > + page = dax_layout_busy_page_range(inode->i_mapping, 0,
> > > > + LLONG_MAX);
> > > > + if (!page)
> > > > + break;
> > > > +
> > > > + wait_page_idle_uninterruptible(page, cb, inode);
> > > > + } while (true);
> > > > +
> > > > + dax_delete_mapping_range(inode->i_mapping, 0, LLONG_MAX);
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(dax_break_mapping_uninterruptible);
> > > > +
> > > > /*
> > > > * Invalidate DAX entry if it is clean.
> > > > */
> > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > index ee8e83f..fa35161 100644
> > > > --- a/fs/ext4/inode.c
> > > > +++ b/fs/ext4/inode.c
> > > > @@ -163,6 +163,18 @@ int ext4_inode_is_fast_symlink(struct inode *inode)
> > > > (inode->i_size < EXT4_N_BLOCKS * 4);
> > > > }
> > > >
> > > > +static void ext4_wait_dax_page(struct inode *inode)
> > > > +{
> > > > + filemap_invalidate_unlock(inode->i_mapping);
> > > > + schedule();
> > > > + filemap_invalidate_lock(inode->i_mapping);
> > > > +}
> > > > +
> > > > +int ext4_break_layouts(struct inode *inode)
> > > > +{
> > > > + return dax_break_mapping_inode(inode, ext4_wait_dax_page);
> > > > +}
> > > > +
> > > > /*
> > > > * Called at the last iput() if i_nlink is zero.
> > > > */
> > > > @@ -181,6 +193,8 @@ void ext4_evict_inode(struct inode *inode)
> > > >
> > > > trace_ext4_evict_inode(inode);
> > > >
> > > > + dax_break_mapping_uninterruptible(inode, ext4_wait_dax_page);
> > > > +
> > > > if (EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL)
> > > > ext4_evict_ea_inode(inode);
> > > > if (inode->i_nlink) {
> > > > @@ -3902,24 +3916,6 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
> > > > return ret;
> > > > }
> > > >
> > > > -static void ext4_wait_dax_page(struct inode *inode)
> > > > -{
> > > > - filemap_invalidate_unlock(inode->i_mapping);
> > > > - schedule();
> > > > - filemap_invalidate_lock(inode->i_mapping);
> > > > -}
> > > > -
> > > > -int ext4_break_layouts(struct inode *inode)
> > > > -{
> > > > - struct page *page;
> > > > - int error;
> > > > -
> > > > - if (WARN_ON_ONCE(!rwsem_is_locked(&inode->i_mapping->invalidate_lock)))
> > > > - return -EINVAL;
> > > > -
> > > > - return dax_break_mapping_inode(inode, ext4_wait_dax_page);
> > > > -}
> > > > -
> > > > /*
> > > > * ext4_punch_hole: punches a hole in a file by releasing the blocks
> > > > * associated with the given offset and length
> > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > index 4410b42..c7ec5ab 100644
> > > > --- a/fs/xfs/xfs_inode.c
> > > > +++ b/fs/xfs/xfs_inode.c
> > > > @@ -2997,6 +2997,15 @@ xfs_break_dax_layouts(
> > > > return dax_break_mapping_inode(inode, xfs_wait_dax_page);
> > > > }
> > > >
> > > > +void
> > > > +xfs_break_dax_layouts_uninterruptible(
> > > > + struct inode *inode)
> > > > +{
> > > > + xfs_assert_ilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL);
> > > > +
> > > > + dax_break_mapping_uninterruptible(inode, xfs_wait_dax_page);
> > > > +}
> > > > +
> > > > int
> > > > xfs_break_layouts(
> > > > struct inode *inode,
> > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > > > index c4f03f6..613797a 100644
> > > > --- a/fs/xfs/xfs_inode.h
> > > > +++ b/fs/xfs/xfs_inode.h
> > > > @@ -594,6 +594,7 @@ xfs_itruncate_extents(
> > > > }
> > > >
> > > > int xfs_break_dax_layouts(struct inode *inode);
> > > > +void xfs_break_dax_layouts_uninterruptible(struct inode *inode);
> > > > int xfs_break_layouts(struct inode *inode, uint *iolock,
> > > > enum layout_break_reason reason);
> > > >
> > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > index 8524b9d..73ec060 100644
> > > > --- a/fs/xfs/xfs_super.c
> > > > +++ b/fs/xfs/xfs_super.c
> > > > @@ -751,6 +751,23 @@ xfs_fs_drop_inode(
> > > > return generic_drop_inode(inode);
> > > > }
> > > >
> > > > +STATIC void
> > > > +xfs_fs_evict_inode(
> > > > + struct inode *inode)
> > > > +{
> > > > + struct xfs_inode *ip = XFS_I(inode);
> > > > + uint iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> > > > +
> > > > + if (IS_DAX(inode)) {
> > > > + xfs_ilock(ip, iolock);
> > > > + xfs_break_dax_layouts_uninterruptible(inode);
> > > > + xfs_iunlock(ip, iolock);
> > >
> > > If we're evicting the inode, why is it necessary to take i_rwsem and the
> > > mmap invalidation lock? Shouldn't the evicting thread be the only one
> > > with access to this inode?
> >
> > Hmm, good point. I think you're right. I can easily stop taking
> > XFS_IOLOCK_EXCL. Not taking XFS_MMAPLOCK_EXCL is slightly more difficult because
> > xfs_wait_dax_page() expects it to be taken. Do you think it is worth creating a
> > separate callback (xfs_wait_dax_page_unlocked()?) specifically for this path or
> > would you be happy with a comment explaining why we take the XFS_MMAPLOCK_EXCL
> > lock here?
>
> There shouldn't be any other threads removing "pages" from i_mapping
> during eviction, right? If so, I think you can just call schedule()
> directly from dax_break_mapping_uninterruptble.
Oh right, and I guess you are saying the same would apply to ext4, so there is
no need to cycle the filemap lock there either (which I've just noticed is
buggy anyway). That means I can just remove the callback entirely for
dax_break_mapping_uninterruptible.
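
Untested, but roughly what I have in mind once the callback goes away
(open-coding the wait since wait_page_idle_uninterruptible() would no longer
have a callback to pass; the refcount test still follows the current helpers
at this point in the series):

void dax_break_mapping_uninterruptible(struct inode *inode)
{
	struct page *page;

	if (!dax_mapping(inode->i_mapping))
		return;

	do {
		page = dax_layout_busy_page_range(inode->i_mapping, 0,
						  LLONG_MAX);
		if (!page)
			break;

		/* Eviction is single threaded so just wait for DMA to finish */
		___wait_var_event(page, page_ref_count(page) == 1,
				  TASK_UNINTERRUPTIBLE, 0, 0, schedule());
	} while (true);

	dax_delete_mapping_range(inode->i_mapping, 0, LLONG_MAX);
}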
> (dax mappings aren't supposed to persist beyond unmount /
> eviction, just like regular pagecache, right??)
Right they're not *supposed* to, but until at least this patch is applied they
can ;-)
- Alistair
> --D
>
> > - Alistair
> >
> > > --D
> > >
> > > > + }
> > > > +
> > > > + truncate_inode_pages_final(&inode->i_data);
> > > > + clear_inode(inode);
> > > > +}
> > > > +
> > > > static void
> > > > xfs_mount_free(
> > > > struct xfs_mount *mp)
> > > > @@ -1189,6 +1206,7 @@ static const struct super_operations xfs_super_operations = {
> > > > .destroy_inode = xfs_fs_destroy_inode,
> > > > .dirty_inode = xfs_fs_dirty_inode,
> > > > .drop_inode = xfs_fs_drop_inode,
> > > > + .evict_inode = xfs_fs_evict_inode,
> > > > .put_super = xfs_fs_put_super,
> > > > .sync_fs = xfs_fs_sync_fs,
> > > > .freeze_fs = xfs_fs_freeze,
> > > > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > > > index ef9e02c..7c3773f 100644
> > > > --- a/include/linux/dax.h
> > > > +++ b/include/linux/dax.h
> > > > @@ -274,6 +274,8 @@ static inline int __must_check dax_break_mapping_inode(struct inode *inode,
> > > > {
> > > > return dax_break_mapping(inode, 0, LLONG_MAX, cb);
> > > > }
> > > > +void dax_break_mapping_uninterruptible(struct inode *inode,
> > > > + void (cb)(struct inode *));
> > > > int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> > > > struct inode *dest, loff_t destoff,
> > > > loff_t len, bool *is_same,
> > > > --
> > > > git-series 0.9.1
> > > >
> >
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 07/26] fs/dax: Ensure all pages are idle prior to filesystem unmount
2025-01-13 5:48 ` Alistair Popple
@ 2025-01-13 16:39 ` Darrick J. Wong
0 siblings, 0 replies; 97+ messages in thread
From: Darrick J. Wong @ 2025-01-13 16:39 UTC (permalink / raw)
To: Alistair Popple
Cc: akpm, dan.j.williams, linux-mm, alison.schofield, lina,
zhang.lyra, gerald.schaefer, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Mon, Jan 13, 2025 at 04:48:31PM +1100, Alistair Popple wrote:
> On Sun, Jan 12, 2025 at 06:49:40PM -0800, Darrick J. Wong wrote:
> > On Mon, Jan 13, 2025 at 11:57:18AM +1100, Alistair Popple wrote:
> > > On Fri, Jan 10, 2025 at 08:50:19AM -0800, Darrick J. Wong wrote:
> > > > On Fri, Jan 10, 2025 at 05:00:35PM +1100, Alistair Popple wrote:
> > > > > File systems call dax_break_mapping() prior to reallocating file
> > > > > system blocks to ensure the page is not undergoing any DMA or other
> > > > > accesses. Generally this is needed when a file is truncated to ensure
> > > > > that if a block is reallocated nothing is writing to it. However
> > > > > filesystems currently don't call this when an FS DAX inode is evicted.
> > > > >
> > > > > This can cause problems when the file system is unmounted as a page
> > > > > can continue to be undergoing DMA or other remote access after
> > > > > unmount. This means if the file system is remounted any truncate or
> > > > > other operation which requires the underlying file system block to be
> > > > > freed will not wait for the remote access to complete. Therefore a
> > > > > busy block may be reallocated to a new file leading to corruption.
> > > > >
> > > > > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> > > > >
> > > > > ---
> > > > >
> > > > > Changes for v5:
> > > > >
> > > > > - Don't wait for pages to be idle in non-DAX mappings
> > > > > ---
> > > > > fs/dax.c | 29 +++++++++++++++++++++++++++++
> > > > > fs/ext4/inode.c | 32 ++++++++++++++------------------
> > > > > fs/xfs/xfs_inode.c | 9 +++++++++
> > > > > fs/xfs/xfs_inode.h | 1 +
> > > > > fs/xfs/xfs_super.c | 18 ++++++++++++++++++
> > > > > include/linux/dax.h | 2 ++
> > > > > 6 files changed, 73 insertions(+), 18 deletions(-)
> > > > >
> > > > > diff --git a/fs/dax.c b/fs/dax.c
> > > > > index 7008a73..4e49cc4 100644
> > > > > --- a/fs/dax.c
> > > > > +++ b/fs/dax.c
> > > > > @@ -883,6 +883,14 @@ static int wait_page_idle(struct page *page,
> > > > > TASK_INTERRUPTIBLE, 0, 0, cb(inode));
> > > > > }
> > > > >
> > > > > +static void wait_page_idle_uninterruptible(struct page *page,
> > > > > + void (cb)(struct inode *),
> > > > > + struct inode *inode)
> > > > > +{
> > > > > + ___wait_var_event(page, page_ref_count(page) == 1,
> > > > > + TASK_UNINTERRUPTIBLE, 0, 0, cb(inode));
> > > > > +}
> > > > > +
> > > > > /*
> > > > > * Unmaps the inode and waits for any DMA to complete prior to deleting the
> > > > > * DAX mapping entries for the range.
> > > > > @@ -911,6 +919,27 @@ int dax_break_mapping(struct inode *inode, loff_t start, loff_t end,
> > > > > }
> > > > > EXPORT_SYMBOL_GPL(dax_break_mapping);
> > > > >
> > > > > +void dax_break_mapping_uninterruptible(struct inode *inode,
> > > > > + void (cb)(struct inode *))
> > > > > +{
> > > > > + struct page *page;
> > > > > +
> > > > > + if (!dax_mapping(inode->i_mapping))
> > > > > + return;
> > > > > +
> > > > > + do {
> > > > > + page = dax_layout_busy_page_range(inode->i_mapping, 0,
> > > > > + LLONG_MAX);
> > > > > + if (!page)
> > > > > + break;
> > > > > +
> > > > > + wait_page_idle_uninterruptible(page, cb, inode);
> > > > > + } while (true);
> > > > > +
> > > > > + dax_delete_mapping_range(inode->i_mapping, 0, LLONG_MAX);
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(dax_break_mapping_uninterruptible);
> > > > > +
> > > > > /*
> > > > > * Invalidate DAX entry if it is clean.
> > > > > */
> > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > > index ee8e83f..fa35161 100644
> > > > > --- a/fs/ext4/inode.c
> > > > > +++ b/fs/ext4/inode.c
> > > > > @@ -163,6 +163,18 @@ int ext4_inode_is_fast_symlink(struct inode *inode)
> > > > > (inode->i_size < EXT4_N_BLOCKS * 4);
> > > > > }
> > > > >
> > > > > +static void ext4_wait_dax_page(struct inode *inode)
> > > > > +{
> > > > > + filemap_invalidate_unlock(inode->i_mapping);
> > > > > + schedule();
> > > > > + filemap_invalidate_lock(inode->i_mapping);
> > > > > +}
> > > > > +
> > > > > +int ext4_break_layouts(struct inode *inode)
> > > > > +{
> > > > > + return dax_break_mapping_inode(inode, ext4_wait_dax_page);
> > > > > +}
> > > > > +
> > > > > /*
> > > > > * Called at the last iput() if i_nlink is zero.
> > > > > */
> > > > > @@ -181,6 +193,8 @@ void ext4_evict_inode(struct inode *inode)
> > > > >
> > > > > trace_ext4_evict_inode(inode);
> > > > >
> > > > > + dax_break_mapping_uninterruptible(inode, ext4_wait_dax_page);
> > > > > +
> > > > > if (EXT4_I(inode)->i_flags & EXT4_EA_INODE_FL)
> > > > > ext4_evict_ea_inode(inode);
> > > > > if (inode->i_nlink) {
> > > > > @@ -3902,24 +3916,6 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
> > > > > return ret;
> > > > > }
> > > > >
> > > > > -static void ext4_wait_dax_page(struct inode *inode)
> > > > > -{
> > > > > - filemap_invalidate_unlock(inode->i_mapping);
> > > > > - schedule();
> > > > > - filemap_invalidate_lock(inode->i_mapping);
> > > > > -}
> > > > > -
> > > > > -int ext4_break_layouts(struct inode *inode)
> > > > > -{
> > > > > - struct page *page;
> > > > > - int error;
> > > > > -
> > > > > - if (WARN_ON_ONCE(!rwsem_is_locked(&inode->i_mapping->invalidate_lock)))
> > > > > - return -EINVAL;
> > > > > -
> > > > > - return dax_break_mapping_inode(inode, ext4_wait_dax_page);
> > > > > -}
> > > > > -
> > > > > /*
> > > > > * ext4_punch_hole: punches a hole in a file by releasing the blocks
> > > > > * associated with the given offset and length
> > > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > > > index 4410b42..c7ec5ab 100644
> > > > > --- a/fs/xfs/xfs_inode.c
> > > > > +++ b/fs/xfs/xfs_inode.c
> > > > > @@ -2997,6 +2997,15 @@ xfs_break_dax_layouts(
> > > > > return dax_break_mapping_inode(inode, xfs_wait_dax_page);
> > > > > }
> > > > >
> > > > > +void
> > > > > +xfs_break_dax_layouts_uninterruptible(
> > > > > + struct inode *inode)
> > > > > +{
> > > > > + xfs_assert_ilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL);
> > > > > +
> > > > > + dax_break_mapping_uninterruptible(inode, xfs_wait_dax_page);
> > > > > +}
> > > > > +
> > > > > int
> > > > > xfs_break_layouts(
> > > > > struct inode *inode,
> > > > > diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> > > > > index c4f03f6..613797a 100644
> > > > > --- a/fs/xfs/xfs_inode.h
> > > > > +++ b/fs/xfs/xfs_inode.h
> > > > > @@ -594,6 +594,7 @@ xfs_itruncate_extents(
> > > > > }
> > > > >
> > > > > int xfs_break_dax_layouts(struct inode *inode);
> > > > > +void xfs_break_dax_layouts_uninterruptible(struct inode *inode);
> > > > > int xfs_break_layouts(struct inode *inode, uint *iolock,
> > > > > enum layout_break_reason reason);
> > > > >
> > > > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > > > index 8524b9d..73ec060 100644
> > > > > --- a/fs/xfs/xfs_super.c
> > > > > +++ b/fs/xfs/xfs_super.c
> > > > > @@ -751,6 +751,23 @@ xfs_fs_drop_inode(
> > > > > return generic_drop_inode(inode);
> > > > > }
> > > > >
> > > > > +STATIC void
> > > > > +xfs_fs_evict_inode(
> > > > > + struct inode *inode)
> > > > > +{
> > > > > + struct xfs_inode *ip = XFS_I(inode);
> > > > > + uint iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
> > > > > +
> > > > > + if (IS_DAX(inode)) {
> > > > > + xfs_ilock(ip, iolock);
> > > > > + xfs_break_dax_layouts_uninterruptible(inode);
> > > > > + xfs_iunlock(ip, iolock);
> > > >
> > > > If we're evicting the inode, why is it necessary to take i_rwsem and the
> > > > mmap invalidation lock? Shouldn't the evicting thread be the only one
> > > > with access to this inode?
> > >
> > > Hmm, good point. I think you're right. I can easily stop taking
> > > XFS_IOLOCK_EXCL. Not taking XFS_MMAPLOCK_EXCL is slightly more difficult because
> > > xfs_wait_dax_page() expects it to be taken. Do you think it is worth creating a
> > > separate callback (xfs_wait_dax_page_unlocked()?) specifically for this path or
> > > would you be happy with a comment explaining why we take the XFS_MMAPLOCK_EXCL
> > > lock here?
> >
> > There shouldn't be any other threads removing "pages" from i_mapping
> > during eviction, right? If so, I think you can just call schedule()
> > directly from dax_break_mapping_uninterruptble.
>
> Oh right, and I guess you are saying the same would apply to ext4 so no need to
> cycle the filemap lock there either, which I've just noticed is buggy anyway. So
> I can just remove the callback entirely for dax_break_mapping_uninterruptible.
Right. You might want to rename dax_break_layouts_uninterruptible to
make it clearer that it's for evictions and doesn't go through the
mmap invalidation lock.
> > (dax mappings aren't supposed to persist beyond unmount /
> > eviction, just like regular pagecache, right??)
>
> Right they're not *supposed* to, but until at least this patch is applied they
> can ;-)
Yikes!
--D
> - Alistair
>
> > --D
> >
> > > - Alistair
> > >
> > > > --D
> > > >
> > > > > + }
> > > > > +
> > > > > + truncate_inode_pages_final(&inode->i_data);
> > > > > + clear_inode(inode);
> > > > > +}
> > > > > +
> > > > > static void
> > > > > xfs_mount_free(
> > > > > struct xfs_mount *mp)
> > > > > @@ -1189,6 +1206,7 @@ static const struct super_operations xfs_super_operations = {
> > > > > .destroy_inode = xfs_fs_destroy_inode,
> > > > > .dirty_inode = xfs_fs_dirty_inode,
> > > > > .drop_inode = xfs_fs_drop_inode,
> > > > > + .evict_inode = xfs_fs_evict_inode,
> > > > > .put_super = xfs_fs_put_super,
> > > > > .sync_fs = xfs_fs_sync_fs,
> > > > > .freeze_fs = xfs_fs_freeze,
> > > > > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > > > > index ef9e02c..7c3773f 100644
> > > > > --- a/include/linux/dax.h
> > > > > +++ b/include/linux/dax.h
> > > > > @@ -274,6 +274,8 @@ static inline int __must_check dax_break_mapping_inode(struct inode *inode,
> > > > > {
> > > > > return dax_break_mapping(inode, 0, LLONG_MAX, cb);
> > > > > }
> > > > > +void dax_break_mapping_uninterruptible(struct inode *inode,
> > > > > + void (cb)(struct inode *));
> > > > > int dax_dedupe_file_range_compare(struct inode *src, loff_t srcoff,
> > > > > struct inode *dest, loff_t destoff,
> > > > > loff_t len, bool *is_same,
> > > > > --
> > > > > git-series 0.9.1
> > > > >
> > >
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 05/26] fs/dax: Create a common implementation to break DAX layouts
2025-01-10 6:00 ` [PATCH v6 05/26] fs/dax: Create a common implementation to break DAX layouts Alistair Popple
2025-01-10 16:44 ` Darrick J. Wong
@ 2025-01-13 20:11 ` Dan Williams
2025-01-13 23:06 ` Dan Williams
2025-01-14 0:19 ` Dan Williams
3 siblings, 0 replies; 97+ messages in thread
From: Dan Williams @ 2025-01-13 20:11 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Alistair Popple wrote:
> Prior to freeing a block file systems supporting FS DAX must check
> that the associated pages are both unmapped from user-space and not
> undergoing DMA or other access from eg. get_user_pages(). This is
> achieved by unmapping the file range and scanning the FS DAX
> page-cache to see if any pages within the mapping have an elevated
> refcount.
>
> This is done using two functions - dax_layout_busy_page_range() which
> returns a page to wait for the refcount to become idle on. Rather than
> open-code this introduce a common implementation to both unmap and
> wait for the page to become idle.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
After resolving my confusion about retries, you can add:
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
...although some bikeshedding below that you can take or leave as you wish.
>
> ---
>
> Changes for v5:
>
> - Don't wait for idle pages on non-DAX mappings
>
> Changes for v4:
>
> - Fixed some build breakage due to missing symbol exports reported by
> John Hubbard (thanks!).
> ---
> fs/dax.c | 33 +++++++++++++++++++++++++++++++++
> fs/ext4/inode.c | 10 +---------
> fs/fuse/dax.c | 27 +++------------------------
> fs/xfs/xfs_inode.c | 23 +++++------------------
> fs/xfs/xfs_inode.h | 2 +-
> include/linux/dax.h | 21 +++++++++++++++++++++
> mm/madvise.c | 8 ++++----
> 7 files changed, 68 insertions(+), 56 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index d010c10..9c3bd07 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -845,6 +845,39 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
> return ret;
> }
>
> +static int wait_page_idle(struct page *page,
> + void (cb)(struct inode *),
> + struct inode *inode)
> +{
> + return ___wait_var_event(page, page_ref_count(page) == 1,
> + TASK_INTERRUPTIBLE, 0, 0, cb(inode));
> +}
> +
> +/*
> + * Unmaps the inode and waits for any DMA to complete prior to deleting the
> + * DAX mapping entries for the range.
> + */
> +int dax_break_mapping(struct inode *inode, loff_t start, loff_t end,
> + void (cb)(struct inode *))
> +{
> + struct page *page;
> + int error;
> +
> + if (!dax_mapping(inode->i_mapping))
> + return 0;
> +
> + do {
> + page = dax_layout_busy_page_range(inode->i_mapping, start, end);
> + if (!page)
> + break;
> +
> + error = wait_page_idle(page, cb, inode);
> + } while (error == 0);
> +
> + return error;
> +}
> +EXPORT_SYMBOL_GPL(dax_break_mapping);
It is not clear why this is called "mapping" vs "layout". The detail
about the file that is being "broken" is whether there are any live
subscriptions to the "layout" of the file, the pfn storage layout, not
the memory mapping.
For example the bulk of dax_break_layout() is performed after
invalidate_inode_pages() has torn down the memory mapping.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 05/26] fs/dax: Create a common implementation to break DAX layouts
2025-01-10 6:00 ` [PATCH v6 05/26] fs/dax: Create a common implementation to break DAX layouts Alistair Popple
2025-01-10 16:44 ` Darrick J. Wong
2025-01-13 20:11 ` Dan Williams
@ 2025-01-13 23:06 ` Dan Williams
2025-01-14 0:19 ` Dan Williams
3 siblings, 0 replies; 97+ messages in thread
From: Dan Williams @ 2025-01-13 23:06 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Alistair Popple wrote:
> Prior to freeing a block file systems supporting FS DAX must check
> that the associated pages are both unmapped from user-space and not
> undergoing DMA or other access from eg. get_user_pages(). This is
> achieved by unmapping the file range and scanning the FS DAX
> page-cache to see if any pages within the mapping have an elevated
> refcount.
>
> This is done using two functions - dax_layout_busy_page_range() which
> returns a page to wait for the refcount to become idle on. Rather than
> open-code this introduce a common implementation to both unmap and
> wait for the page to become idle.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>
> ---
[..]
Whoops, I hit send on the last mail before seeing this:
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 49f3a75..1f4c99e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
This hunk needs to move to the devmap removal patch, right?
With that fixed up the Reviewed-by still stands.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 06/26] fs/dax: Always remove DAX page-cache entries when breaking layouts
2025-01-10 6:00 ` [PATCH v6 06/26] fs/dax: Always remove DAX page-cache entries when breaking layouts Alistair Popple
@ 2025-01-13 23:31 ` Dan Williams
0 siblings, 0 replies; 97+ messages in thread
From: Dan Williams @ 2025-01-13 23:31 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Alistair Popple wrote:
> Prior to any truncation operations file systems call
> dax_break_mapping() to ensure pages in the range are not under going
> DMA. Later DAX page-cache entries will be removed by
> truncate_folio_batch_exceptionals() in the generic page-cache code.
>
> However this makes it possible for folios to be removed from the
> page-cache even though they are still DMA busy if the file-system
> hasn't called dax_break_mapping(). It also means they can never be
> waited on in future because FS DAX will lose track of them once the
> page-cache entry has been deleted.
>
> Instead it is better to delete the FS DAX entry when the file-system
> calls dax_break_mapping() as part of it's truncate operation. This
> ensures only idle pages can be removed from the FS DAX page-cache and
> makes it easy to detect if a file-system hasn't called
> dax_break_mapping() prior to a truncate operation.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>
> ---
>
> Ideally I think we would move the whole wait-for-idle logic directly
> into the truncate paths. However this is difficult for a few
> reasons. Each filesystem needs its own wait callback, although a new
> address space operation could address that. More problematic is that
> the wait-for-idle can fail as the wait is TASK_INTERRUPTIBLE, but none
> of the generic truncate paths allow for failure.
>
> So it ends up being easier to continue to let file systems call this
> and check that they behave as expected.
> ---
> fs/dax.c | 33 +++++++++++++++++++++++++++++++++
> fs/xfs/xfs_inode.c | 6 ++++++
> include/linux/dax.h | 2 ++
> mm/truncate.c | 16 +++++++++++++++-
> 4 files changed, 56 insertions(+), 1 deletion(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 9c3bd07..7008a73 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -845,6 +845,36 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
> return ret;
> }
>
> +void dax_delete_mapping_range(struct address_space *mapping,
> + loff_t start, loff_t end)
> +{
> + void *entry;
> + pgoff_t start_idx = start >> PAGE_SHIFT;
> + pgoff_t end_idx;
> + XA_STATE(xas, &mapping->i_pages, start_idx);
> +
> + /* If end == LLONG_MAX, all pages from start to till end of file */
> + if (end == LLONG_MAX)
> + end_idx = ULONG_MAX;
> + else
> + end_idx = end >> PAGE_SHIFT;
> +
> + xas_lock_irq(&xas);
> + xas_for_each(&xas, entry, end_idx) {
> + if (!xa_is_value(entry))
> + continue;
> + entry = wait_entry_unlocked_exclusive(&xas, entry);
> + if (!entry)
> + continue;
> + dax_disassociate_entry(entry, mapping, true);
> + xas_store(&xas, NULL);
> + mapping->nrpages -= 1UL << dax_entry_order(entry);
> + put_unlocked_entry(&xas, entry, WAKE_ALL);
> + }
> + xas_unlock_irq(&xas);
> +}
> +EXPORT_SYMBOL_GPL(dax_delete_mapping_range);
> +
> static int wait_page_idle(struct page *page,
> void (cb)(struct inode *),
> struct inode *inode)
> @@ -874,6 +904,9 @@ int dax_break_mapping(struct inode *inode, loff_t start, loff_t end,
> error = wait_page_idle(page, cb, inode);
> } while (error == 0);
>
> + if (!page)
> + dax_delete_mapping_range(inode->i_mapping, start, end);
> +
Just reinforcing the rename comment on the last patch...
I think this is an example where the
s/dax_break_mapping/dax_break_layout/ rename helps disambiguate what is
related to layout cleanup and what is related to mapping cleanup, as
dax_break_layout calls dax_delete_mapping.
> return error;
> }
> EXPORT_SYMBOL_GPL(dax_break_mapping);
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 295730a..4410b42 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2746,6 +2746,12 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
> goto again;
> }
>
> + /*
> + * Normally xfs_break_dax_layouts() would delete the mapping entries as well so
> + * do that here.
> + */
> + dax_delete_mapping_range(VFS_I(ip2)->i_mapping, 0, LLONG_MAX);
> +
I think it is unfortunate that dax_break_mapping is so close to being
useful for this case... how about this incremental cleanup?
diff --git a/fs/dax.c b/fs/dax.c
index facddd6c6bbb..1fa5521e5a2e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -942,12 +942,15 @@ static void wait_page_idle_uninterruptible(struct page *page,
/*
* Unmaps the inode and waits for any DMA to complete prior to deleting the
* DAX mapping entries for the range.
+ *
+ * For NOWAIT behavior, pass @cb as NULL to early-exit on first found
+ * busy page
*/
int dax_break_mapping(struct inode *inode, loff_t start, loff_t end,
void (cb)(struct inode *))
{
struct page *page;
- int error;
+ int error = 0;
if (!dax_mapping(inode->i_mapping))
return 0;
@@ -956,6 +959,10 @@ int dax_break_mapping(struct inode *inode, loff_t start, loff_t end,
page = dax_layout_busy_page_range(inode->i_mapping, start, end);
if (!page)
break;
+ if (!cb) {
+ error = -ERESTARTSYS;
+ break;
+ }
error = wait_page_idle(page, cb, inode);
} while (error == 0);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 7bfb4eb387c6..0988a9088259 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2739,19 +2739,13 @@ xfs_mmaplock_two_inodes_and_break_dax_layout(
* need to unlock & lock the XFS_MMAPLOCK_EXCL which is not suitable
* for this nested lock case.
*/
- page = dax_layout_busy_page(VFS_I(ip2)->i_mapping);
- if (page && page_ref_count(page) != 0) {
+ error = dax_break_layout(VFS_I(ip2), 0, -1, NULL);
+ if (error) {
xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
goto again;
}
- /*
- * Normally xfs_break_dax_layouts() would delete the mapping entries as well so
- * do that here.
- */
- dax_delete_mapping_range(VFS_I(ip2)->i_mapping, 0, LLONG_MAX);
-
return 0;
}
This also addresses Darrick's feedback around introducing
dax_page_in_use() which xfs does not really care about, only that no
more pages are busy.
> return 0;
> }
>
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index f6583d3..ef9e02c 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -263,6 +263,8 @@ vm_fault_t dax_iomap_fault(struct vm_fault *vmf, unsigned int order,
> vm_fault_t dax_finish_sync_fault(struct vm_fault *vmf,
> unsigned int order, pfn_t pfn);
> int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
> +void dax_delete_mapping_range(struct address_space *mapping,
> + loff_t start, loff_t end);
> int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
> pgoff_t index);
> int __must_check dax_break_mapping(struct inode *inode, loff_t start,
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 7c304d2..b7f51a6 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -78,8 +78,22 @@ static void truncate_folio_batch_exceptionals(struct address_space *mapping,
>
> if (dax_mapping(mapping)) {
> for (i = j; i < nr; i++) {
> - if (xa_is_value(fbatch->folios[i]))
> + if (xa_is_value(fbatch->folios[i])) {
> + /*
> + * File systems should already have called
> + * dax_break_mapping_entry() to remove all DAX
> + * entries while holding a lock to prevent
> + * establishing new entries. Therefore we
> + * shouldn't find any here.
> + */
> + WARN_ON_ONCE(1);
> +
> + /*
> + * Delete the mapping so truncate_pagecache()
> + * doesn't loop forever.
> + */
> dax_delete_mapping_entry(mapping, indices[i]);
> + }
Looks good.
With the above additional fixup you can add:
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 07/26] fs/dax: Ensure all pages are idle prior to filesystem unmount
2025-01-10 6:00 ` [PATCH v6 07/26] fs/dax: Ensure all pages are idle prior to filesystem unmount Alistair Popple
2025-01-10 16:50 ` Darrick J. Wong
@ 2025-01-13 23:42 ` Dan Williams
1 sibling, 0 replies; 97+ messages in thread
From: Dan Williams @ 2025-01-13 23:42 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Alistair Popple wrote:
> File systems call dax_break_mapping() prior to reallocating file
> system blocks to ensure the page is not undergoing any DMA or other
> accesses. Generally this is needed when a file is truncated to ensure
> that if a block is reallocated nothing is writing to it. However
> filesystems currently don't call this when an FS DAX inode is evicted.
>
> This can cause problems when the file system is unmounted as a page
> > can continue to be undergoing DMA or other remote access after
> unmount. This means if the file system is remounted any truncate or
> other operation which requires the underlying file system block to be
> freed will not wait for the remote access to complete. Therefore a
> busy block may be reallocated to a new file leading to corruption.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>
> ---
>
> Changes for v5:
>
> - Don't wait for pages to be idle in non-DAX mappings
> ---
> fs/dax.c | 29 +++++++++++++++++++++++++++++
> fs/ext4/inode.c | 32 ++++++++++++++------------------
> fs/xfs/xfs_inode.c | 9 +++++++++
> fs/xfs/xfs_inode.h | 1 +
> fs/xfs/xfs_super.c | 18 ++++++++++++++++++
> include/linux/dax.h | 2 ++
> 6 files changed, 73 insertions(+), 18 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 7008a73..4e49cc4 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -883,6 +883,14 @@ static int wait_page_idle(struct page *page,
> TASK_INTERRUPTIBLE, 0, 0, cb(inode));
> }
>
> +static void wait_page_idle_uninterruptible(struct page *page,
> + void (cb)(struct inode *),
> + struct inode *inode)
> +{
> + ___wait_var_event(page, page_ref_count(page) == 1,
> + TASK_UNINTERRUPTIBLE, 0, 0, cb(inode));
> +}
> +
> /*
> * Unmaps the inode and waits for any DMA to complete prior to deleting the
> * DAX mapping entries for the range.
> @@ -911,6 +919,27 @@ int dax_break_mapping(struct inode *inode, loff_t start, loff_t end,
> }
> EXPORT_SYMBOL_GPL(dax_break_mapping);
>
> +void dax_break_mapping_uninterruptible(struct inode *inode,
> + void (cb)(struct inode *))
> +{
> + struct page *page;
> +
> + if (!dax_mapping(inode->i_mapping))
> + return;
> +
> + do {
> + page = dax_layout_busy_page_range(inode->i_mapping, 0,
> + LLONG_MAX);
> + if (!page)
> + break;
> +
> + wait_page_idle_uninterruptible(page, cb, inode);
> + } while (true);
> +
> + dax_delete_mapping_range(inode->i_mapping, 0, LLONG_MAX);
> +}
> +EXPORT_SYMBOL_GPL(dax_break_mapping_uninterruptible);
Riffing off of Darrick's feedback, how about calling this
dax_break_layout_final()?
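i.e. keep the current signature and just have the name say what it is for:

	void dax_break_layout_final(struct inode *inode,
			void (cb)(struct inode *));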
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 05/26] fs/dax: Create a common implementation to break DAX layouts
2025-01-10 6:00 ` [PATCH v6 05/26] fs/dax: Create a common implementation to break DAX layouts Alistair Popple
` (2 preceding siblings ...)
2025-01-13 23:06 ` Dan Williams
@ 2025-01-14 0:19 ` Dan Williams
3 siblings, 0 replies; 97+ messages in thread
From: Dan Williams @ 2025-01-14 0:19 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Alistair Popple wrote:
> Prior to freeing a block file systems supporting FS DAX must check
> that the associated pages are both unmapped from user-space and not
> undergoing DMA or other access from eg. get_user_pages(). This is
> achieved by unmapping the file range and scanning the FS DAX
> page-cache to see if any pages within the mapping have an elevated
> refcount.
>
> This is done using two functions - dax_layout_busy_page_range(), which
> returns a page to wait for the refcount to become idle on, and
> dax_wait_page_idle(), which does the waiting. Rather than open-coding
> this, introduce a common implementation to both unmap and wait for the
> page to become idle.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>
> ---
>
> Changes for v5:
>
> - Don't wait for idle pages on non-DAX mappings
>
> Changes for v4:
>
> - Fixed some build breakage due to missing symbol exports reported by
> John Hubbard (thanks!).
[..]
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index cc1acb1..ee8e83f 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3917,15 +3917,7 @@ int ext4_break_layouts(struct inode *inode)
> if (WARN_ON_ONCE(!rwsem_is_locked(&inode->i_mapping->invalidate_lock)))
> return -EINVAL;
>
> - do {
> - page = dax_layout_busy_page(inode->i_mapping);
> - if (!page)
> - return 0;
> -
> - error = dax_wait_page_idle(page, ext4_wait_dax_page, inode);
> - } while (error == 0);
> -
> - return error;
> + return dax_break_mapping_inode(inode, ext4_wait_dax_page);
I hit this in my compile testing:
fs/ext4/inode.c: In function ‘ext4_break_layouts’:
fs/ext4/inode.c:3915:13: error: unused variable ‘error’ [-Werror=unused-variable]
3915 | int error;
| ^~~~~
fs/ext4/inode.c:3914:22: error: unused variable ‘page’ [-Werror=unused-variable]
3914 | struct page *page;
| ^~~~
cc1: all warnings being treated as errors
...which gets fixed up later on, but bisect breakage is unwanted.
The bots will probably find this too eventually.
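For reference, a minimal fixup to keep the series bisectable would be to
drop the now-unused locals in the same patch, leaving something like this
(a sketch based on the hunk quoted above, not the actual follow-up):

int ext4_break_layouts(struct inode *inode)
{
	if (WARN_ON_ONCE(!rwsem_is_locked(&inode->i_mapping->invalidate_lock)))
		return -EINVAL;

	return dax_break_mapping_inode(inode, ext4_wait_dax_page);
}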
* Re: [PATCH v6 08/26] fs/dax: Remove PAGE_MAPPING_DAX_SHARED mapping flag
2025-01-10 6:00 ` [PATCH v6 08/26] fs/dax: Remove PAGE_MAPPING_DAX_SHARED mapping flag Alistair Popple
@ 2025-01-14 0:52 ` Dan Williams
2025-01-15 5:32 ` Alistair Popple
2025-01-14 14:47 ` David Hildenbrand
1 sibling, 1 reply; 97+ messages in thread
From: Dan Williams @ 2025-01-14 0:52 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Alistair Popple wrote:
> PAGE_MAPPING_DAX_SHARED is the same as PAGE_MAPPING_ANON.
I think a bit more detail is warranted, how about?
The page ->mapping pointer can have magic values like
PAGE_MAPPING_DAX_SHARED and PAGE_MAPPING_ANON for page owner specific
usage. In fact, PAGE_MAPPING_DAX_SHARED and PAGE_MAPPING_ANON alias the
same value.
> This isn't currently a problem because FS DAX pages are treated
> specially.
s/are treated specially/are never seen by the anonymous mapping code and
vice versa/
> However a future change will make FS DAX pages more like
> normal pages, so folio_test_anon() must not return true for a FS DAX
> page.
>
> We could explicitly test for a FS DAX page in folio_test_anon(),
> etc. however the PAGE_MAPPING_DAX_SHARED flag isn't actually
> needed. Instead we can use the page->mapping field to implicitly track
> the first mapping of a page. If page->mapping is non-NULL it implies
> the page is associated with a single mapping at page->index. If the
> page is associated with a second mapping clear page->mapping and set
> page->share to 1.
>
> This is possible because a shared mapping implies the file-system
> implements dax_holder_operations which makes the ->mapping and
> ->index, which is a union with ->share, unused.
>
> The page is considered shared when page->mapping == NULL and
> page->share > 0 or page->mapping != NULL, implying it is present in at
> least one address space. This also makes it easier for a future change
> to detect when a page is first mapped into an address space which
> requires special handling.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
> fs/dax.c | 45 +++++++++++++++++++++++++--------------
> include/linux/page-flags.h | 6 +-----
> 2 files changed, 29 insertions(+), 22 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 4e49cc4..d35dbe1 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -351,38 +351,41 @@ static unsigned long dax_end_pfn(void *entry)
> for (pfn = dax_to_pfn(entry); \
> pfn < dax_end_pfn(entry); pfn++)
>
> +/*
> + * A DAX page is considered shared if it has no mapping set and ->share (which
> + * shares the ->index field) is non-zero. Note this may return false even if the
> + * page is shared between multiple files but has not yet actually been mapped
> + * into multiple address spaces.
> + */
> static inline bool dax_page_is_shared(struct page *page)
> {
> - return page->mapping == PAGE_MAPPING_DAX_SHARED;
> + return !page->mapping && page->share;
> }
>
> /*
> - * Set the page->mapping with PAGE_MAPPING_DAX_SHARED flag, increase the
> - * refcount.
> + * Increase the page share refcount, warning if the page is not marked as shared.
> */
> static inline void dax_page_share_get(struct page *page)
> {
> - if (page->mapping != PAGE_MAPPING_DAX_SHARED) {
> - /*
> - * Reset the index if the page was already mapped
> - * regularly before.
> - */
> - if (page->mapping)
> - page->share = 1;
> - page->mapping = PAGE_MAPPING_DAX_SHARED;
> - }
> + WARN_ON_ONCE(!page->share);
> + WARN_ON_ONCE(page->mapping);
Given the only caller of this function is dax_associate_entry() it seems
like overkill to check that a function only a few lines away manipulated
->mapping correctly.
I don't see much reason for dax_page_share_get() to exist after your
changes.
Perhaps all that is needed is a dax_make_shared() helper that does the
initial fiddling of '->mapping = NULL' and '->share = 1'?
> page->share++;
> }
>
> static inline unsigned long dax_page_share_put(struct page *page)
> {
> + WARN_ON_ONCE(!page->share);
> return --page->share;
> }
>
> /*
> - * When it is called in dax_insert_entry(), the shared flag will indicate that
> - * whether this entry is shared by multiple files. If so, set the page->mapping
> - * PAGE_MAPPING_DAX_SHARED, and use page->share as refcount.
> + * When it is called in dax_insert_entry(), the shared flag will indicate
> + * whether this entry is shared by multiple files. If the page has not
> + * previously been associated with any mappings the ->mapping and ->index
> + * fields will be set. If it has already been associated with a mapping
> + * the mapping will be cleared and the share count set. It's then up to the
> + * file-system to track which mappings contain which pages, ie. by implementing
> + * dax_holder_operations.
This feels like a good comment for a new dax_make_shared() not
dax_associate_entry().
I would also:
s/up to the file-system to track which mappings contain which pages, ie. by implementing
dax_holder_operations/up to reverse map users like memory_failure() to
call back into the filesystem to recover ->mapping and ->index
information/
> */
> static void dax_associate_entry(void *entry, struct address_space *mapping,
> struct vm_area_struct *vma, unsigned long address, bool shared)
> @@ -397,7 +400,17 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
> for_each_mapped_pfn(entry, pfn) {
> struct page *page = pfn_to_page(pfn);
>
> - if (shared) {
> + if (shared && page->mapping && page->share) {
How does this case happen? I don't think any page would ever enter with
both ->mapping and ->share set, right?
If the file was mapped and then reflinked, ->share should be zero at the
first shared mapping attempt. It might not literally be zero because it is
aliased with ->index until the page is converted to a shared page.
* Re: [PATCH v6 14/26] rmap: Add support for PUD sized mappings to rmap
2025-01-10 6:00 ` [PATCH v6 14/26] rmap: Add support for PUD sized mappings to rmap Alistair Popple
@ 2025-01-14 1:21 ` Dan Williams
0 siblings, 0 replies; 97+ messages in thread
From: Dan Williams @ 2025-01-14 1:21 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Alistair Popple wrote:
> The rmap doesn't currently support adding a PUD mapping of a
> folio. This patch adds support for entire PUD mappings of folios,
> primarily to allow for more standard refcounting of device DAX
> folios. Currently DAX is the only user of this and it doesn't require
> support for partially mapped PUD-sized folios, so we don't support
> that for now.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Acked-by: David Hildenbrand <david@redhat.com>
>
> ---
>
> Changes for v6:
>
> - Minor comment formatting fix
> - Add an additional check for CONFIG_TRANSPARENT_HUGEPAGE to fix a
> build breakage when CONFIG_PGTABLE_HAS_HUGE_LEAVES is not defined.
>
> Changes for v5:
>
> - Fixed accounting as suggested by David.
>
> Changes for v4:
>
> - New for v4, split out rmap changes as suggested by David.
> ---
> include/linux/rmap.h | 15 ++++++++++-
> mm/rmap.c | 67 ++++++++++++++++++++++++++++++++++++++++++---
> 2 files changed, 78 insertions(+), 4 deletions(-)
Looks mechanically correct to me.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
* Re: [PATCH v6 15/26] huge_memory: Add vmf_insert_folio_pud()
2025-01-10 6:00 ` [PATCH v6 15/26] huge_memory: Add vmf_insert_folio_pud() Alistair Popple
@ 2025-01-14 1:27 ` Dan Williams
2025-01-14 16:22 ` David Hildenbrand
1 sibling, 0 replies; 97+ messages in thread
From: Dan Williams @ 2025-01-14 1:27 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Alistair Popple wrote:
> Currently DAX folio/page reference counts are managed differently to
> normal pages. To allow these to be managed the same as normal pages
> introduce vmf_insert_folio_pud. This will map the entire PUD-sized folio
> and take references as it would for a normally mapped page.
>
> This is distinct from the current mechanism, vmf_insert_pfn_pud, which
> simply inserts a special devmap PUD entry into the page table without
> holding a reference to the page for the mapping.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
Looks correct for what it is:
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
* Re: [PATCH v6 16/26] huge_memory: Add vmf_insert_folio_pmd()
2025-01-10 6:00 ` [PATCH v6 16/26] huge_memory: Add vmf_insert_folio_pmd() Alistair Popple
@ 2025-01-14 2:04 ` Dan Williams
2025-01-14 16:40 ` David Hildenbrand
1 sibling, 0 replies; 97+ messages in thread
From: Dan Williams @ 2025-01-14 2:04 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Alistair Popple wrote:
> Currently DAX folio/page reference counts are managed differently to
> normal pages. To allow these to be managed the same as normal pages
> introduce vmf_insert_folio_pmd. This will map the entire PMD-sized folio
> and take references as it would for a normally mapped page.
>
> This is distinct from the current mechanism, vmf_insert_pfn_pmd, which
> simply inserts a special devmap PMD entry into the page table without
> holding a reference to the page for the mapping.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>
> ---
>
> Changes for v5:
> - Minor code cleanup suggested by David
> ---
> include/linux/huge_mm.h | 1 +-
> mm/huge_memory.c | 54 ++++++++++++++++++++++++++++++++++--------
> 2 files changed, 45 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 5bd1ff7..3633bd3 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -39,6 +39,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>
> vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
> vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
> +vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio, bool write);
> vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, bool write);
>
> enum transparent_hugepage_flag {
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 256adc3..d1ea76e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1381,14 +1381,12 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> {
> struct mm_struct *mm = vma->vm_mm;
> pmd_t entry;
> - spinlock_t *ptl;
>
> - ptl = pmd_lock(mm, pmd);
Apply this comment to the previous patch too, but I think this would be
more self-documenting as:
lockdep_assert_held(pmd_lockptr(mm, pmd));
...to make it clear in this diff and into the future what the locking
constraints of this function are.
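In other words, roughly this split (an illustrative sketch only; the
helper names and parameters here are made up):

/* The internal helper documents its locking requirement... */
static void __insert_pmd_entry(struct mm_struct *mm, pmd_t *pmd)
{
	lockdep_assert_held(pmd_lockptr(mm, pmd));
	/* build and set the PMD entry here */
}

/* ...and the caller is the one that actually takes the lock. */
static void insert_pmd_entry(struct mm_struct *mm, pmd_t *pmd)
{
	spinlock_t *ptl = pmd_lock(mm, pmd);

	__insert_pmd_entry(mm, pmd);
	spin_unlock(ptl);
}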
After that you can add:
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
* Re: [PATCH v6 17/26] memremap: Add is_devdax_page() and is_fsdax_page() helpers
2025-01-10 6:00 ` [PATCH v6 17/26] memremap: Add is_devdax_page() and is_fsdax_page() helpers Alistair Popple
@ 2025-01-14 2:05 ` Dan Williams
0 siblings, 0 replies; 97+ messages in thread
From: Dan Williams @ 2025-01-14 2:05 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Alistair Popple wrote:
> Add helpers to determine if a page or folio is a devdax or fsdax page
> or folio.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Acked-by: David Hildenbrand <david@redhat.com>
>
> ---
>
> Changes for v5:
> - Renamed is_device_dax_page() to is_devdax_page() for consistency.
> ---
> include/linux/memremap.h | 22 ++++++++++++++++++++++
> 1 file changed, 22 insertions(+)
Patch does what it says on the tin, but I am not a fan of patches this
tiny. Fold it in with the first user.
* Re: [PATCH v6 18/26] mm/gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages
2025-01-10 6:00 ` [PATCH v6 18/26] mm/gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages Alistair Popple
@ 2025-01-14 2:16 ` Dan Williams
0 siblings, 0 replies; 97+ messages in thread
From: Dan Williams @ 2025-01-14 2:16 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Alistair Popple wrote:
> Longterm pinning of FS DAX pages should already be disallowed by
> various pXX_devmap checks. However a future change will cause these
> checks to be invalid for FS DAX pages so make
> folio_is_longterm_pinnable() return false for FS DAX pages.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---
> include/linux/mm.h | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index f267b06..01edca9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2078,6 +2078,10 @@ static inline bool folio_is_longterm_pinnable(struct folio *folio)
> if (folio_is_device_coherent(folio))
> return false;
>
> + /* DAX must also always allow eviction. */
This 'eviction' terminology seems like it was copied from the
device-memory comment, but with fsdax it does not fit. How about:
/*
* Filesystems can only tolerate transient delays to truncate and
* hole-punch operations
*/
> + if (folio_is_fsdax(folio))
> + return false;
> +
After the comment fixup you can add:
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
* Re: [PATCH v6 19/26] proc/task_mmu: Mark devdax and fsdax pages as always unpinned
2025-01-10 6:00 ` [PATCH v6 19/26] proc/task_mmu: Mark devdax and fsdax pages as always unpinned Alistair Popple
@ 2025-01-14 2:28 ` Dan Williams
2025-01-14 16:45 ` David Hildenbrand
0 siblings, 1 reply; 97+ messages in thread
From: Dan Williams @ 2025-01-14 2:28 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Alistair Popple wrote:
> The procfs mmu files such as smaps and pagemap currently ignore devdax and
> fsdax pages because these pages are considered special. A future change
> will start treating these as normal pages, meaning they can be exposed via
> smaps and pagemap.
>
> The only difference is that devdax and fsdax pages can never be pinned for
> DMA via FOLL_LONGTERM, so add an explicit check in pte_is_pinned() to
> reflect that.
I don't understand this patch.
pin_user_pages() is also used for Direct-I/O page pinning, so the
comment about FOLL_LONGTERM is wrong, and I otherwise do not understand
what goes wrong if the only pte_is_pinned() user correctly detects the
pin state?
* Re: [PATCH v6 20/26] mm/mlock: Skip ZONE_DEVICE PMDs during mlock
2025-01-10 6:00 ` [PATCH v6 20/26] mm/mlock: Skip ZONE_DEVICE PMDs during mlock Alistair Popple
@ 2025-01-14 2:42 ` Dan Williams
2025-01-17 1:54 ` Alistair Popple
0 siblings, 1 reply; 97+ messages in thread
From: Dan Williams @ 2025-01-14 2:42 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Alistair Popple wrote:
> At present mlock skips ptes mapping ZONE_DEVICE pages. A future change
> to remove pmd_devmap will allow pmd_trans_huge_lock() to return
> ZONE_DEVICE folios so make sure we continue to skip those.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Acked-by: David Hildenbrand <david@redhat.com>
This looks like a fix in that mlock_pte_range() *does* call mlock_folio()
when pmd_trans_huge_lock() returns a non-NULL @ptl.
So it is not in preparation for a future change; it is making the pte and
pmd cases behave the same in dropping mlock requests.
The code change looks good, but do add a Fixes tag and reword the
changelog a bit before adding:
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
* Re: [PATCH v6 21/26] fs/dax: Properly refcount fs dax pages
2025-01-10 6:00 ` [PATCH v6 21/26] fs/dax: Properly refcount fs dax pages Alistair Popple
2025-01-10 16:54 ` Darrick J. Wong
@ 2025-01-14 3:35 ` Dan Williams
2025-02-07 5:31 ` Alistair Popple
1 sibling, 1 reply; 97+ messages in thread
From: Dan Williams @ 2025-01-14 3:35 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Alistair Popple wrote:
> Currently fs dax pages are considered free when the refcount drops to
> one and their refcounts are not increased when mapped via PTEs or
> decreased when unmapped. This requires special logic in mm paths to
> detect that these pages should not be properly refcounted, and to
> detect when the refcount drops to one instead of zero.
>
> On the other hand get_user_pages(), etc. will properly refcount fs dax
> pages by taking a reference and dropping it when the page is
> unpinned.
>
> Tracking this special behaviour requires extra PTE bits
> (eg. pte_devmap) and introduces rules that are potentially confusing
> and specific to FS DAX pages. To fix this, and to possibly allow
> removal of the special PTE bits in future, convert the fs dax page
> refcounts to be zero based and instead take a reference on the page
> each time it is mapped as is currently the case for normal pages.
>
> This may also allow a future clean-up to remove the pgmap refcounting
> that is currently done in mm/gup.c.
This patch depends on FS_DAX_LIMITED being abandoned first, so do
include the patch at the bottom of this reply in your series before this
patch.
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>
> ---
>
> Changes since v2:
>
> Based on some questions from Dan I attempted to have the FS DAX page
> cache (ie. address space) hold a reference to the folio whilst it was
> mapped. However I came to the strong conclusion that this was not the
> right thing to do.
>
> If the page refcount == 0 it means the page is:
>
> 1. not mapped into user-space
> 2. not subject to other access via DMA/GUP/etc.
>
> Ie. From the core MM perspective the page is not in use.
>
> The fact a page may or may not be present in one or more address space
> mappings is irrelevant for core MM. It just means the page is still in
> use or valid from the file system perspective, and it's a
> responsibility of the file system to remove these mappings if the pfn
> mapping becomes invalid (along with first making sure the MM state,
> ie. page->refcount, is idle). So we shouldn't be trying to track that
> lifetime with MM refcounts.
>
> Doing so just makes DMA-idle tracking more complex because there is
> now another thing (one or more address spaces) which can hold
> references on a page. And FS DAX can't even keep track of all the
> address spaces which might contain a reference to the page in the
> XFS/reflink case anyway.
>
> We could do this if we made file systems invalidate all address space
> mappings prior to calling dax_break_layouts(), but that isn't
> currently necessary and would lead to increased faults just so we
> could do some superfluous refcounting which the file system already
> does.
>
> I have however put the page sharing checks and WARN_ON's back which
> also turned out to be useful for figuring out when to re-initialise
> a folio.
I feel like these comments are a useful analysis that deserves not to be
lost to the sands of time on the list.
Perhaps capture a flavor of this relevant for future consideration in a
"DAX page Lifetime" section of Documentation/filesystems/dax.rst?
> ---
> drivers/nvdimm/pmem.c | 4 +-
> fs/dax.c | 212 +++++++++++++++++++++++-----------------
> fs/fuse/virtio_fs.c | 3 +-
> fs/xfs/xfs_inode.c | 2 +-
> include/linux/dax.h | 6 +-
> include/linux/mm.h | 27 +-----
> include/linux/mm_types.h | 7 +-
> mm/gup.c | 9 +--
> mm/huge_memory.c | 6 +-
> mm/internal.h | 2 +-
> mm/memory-failure.c | 6 +-
> mm/memory.c | 6 +-
> mm/memremap.c | 47 ++++-----
> mm/mm_init.c | 9 +--
> mm/swap.c | 2 +-
> 15 files changed, 183 insertions(+), 165 deletions(-)
>
> diff --git a/drivers/nvdimm/pmem.c b/drivers/nvdimm/pmem.c
> index d81faa9..785b2d2 100644
> --- a/drivers/nvdimm/pmem.c
> +++ b/drivers/nvdimm/pmem.c
> @@ -513,7 +513,7 @@ static int pmem_attach_disk(struct device *dev,
>
> pmem->disk = disk;
> pmem->pgmap.owner = pmem;
> - pmem->pfn_flags = PFN_DEV;
> + pmem->pfn_flags = 0;
> if (is_nd_pfn(dev)) {
> pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
> pmem->pgmap.ops = &fsdax_pagemap_ops;
> @@ -522,7 +522,6 @@ static int pmem_attach_disk(struct device *dev,
> pmem->data_offset = le64_to_cpu(pfn_sb->dataoff);
> pmem->pfn_pad = resource_size(res) -
> range_len(&pmem->pgmap.range);
> - pmem->pfn_flags |= PFN_MAP;
> bb_range = pmem->pgmap.range;
> bb_range.start += pmem->data_offset;
> } else if (pmem_should_map_pages(dev)) {
> @@ -532,7 +531,6 @@ static int pmem_attach_disk(struct device *dev,
> pmem->pgmap.type = MEMORY_DEVICE_FS_DAX;
> pmem->pgmap.ops = &fsdax_pagemap_ops;
> addr = devm_memremap_pages(dev, &pmem->pgmap);
> - pmem->pfn_flags |= PFN_MAP;
> bb_range = pmem->pgmap.range;
> } else {
> addr = devm_memremap(dev, pmem->phys_addr,
> diff --git a/fs/dax.c b/fs/dax.c
> index d35dbe1..19f444e 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -71,6 +71,11 @@ static unsigned long dax_to_pfn(void *entry)
> return xa_to_value(entry) >> DAX_SHIFT;
> }
>
> +static struct folio *dax_to_folio(void *entry)
> +{
> + return page_folio(pfn_to_page(dax_to_pfn(entry)));
> +}
> +
> static void *dax_make_entry(pfn_t pfn, unsigned long flags)
> {
> return xa_mk_value(flags | (pfn_t_to_pfn(pfn) << DAX_SHIFT));
> @@ -338,44 +343,88 @@ static unsigned long dax_entry_size(void *entry)
> return PAGE_SIZE;
> }
>
> -static unsigned long dax_end_pfn(void *entry)
> -{
> - return dax_to_pfn(entry) + dax_entry_size(entry) / PAGE_SIZE;
> -}
> -
> -/*
> - * Iterate through all mapped pfns represented by an entry, i.e. skip
> - * 'empty' and 'zero' entries.
> - */
> -#define for_each_mapped_pfn(entry, pfn) \
> - for (pfn = dax_to_pfn(entry); \
> - pfn < dax_end_pfn(entry); pfn++)
> -
> /*
> * A DAX page is considered shared if it has no mapping set and ->share (which
> * shares the ->index field) is non-zero. Note this may return false even if the
> * page is shared between multiple files but has not yet actually been mapped
> * into multiple address spaces.
> */
> -static inline bool dax_page_is_shared(struct page *page)
> +static inline bool dax_folio_is_shared(struct folio *folio)
> {
> - return !page->mapping && page->share;
> + return !folio->mapping && folio->share;
> }
>
> /*
> - * Increase the page share refcount, warning if the page is not marked as shared.
> + * Increase the folio share refcount, warning if the folio is not marked as shared.
> */
> -static inline void dax_page_share_get(struct page *page)
> +static inline void dax_folio_share_get(void *entry)
> {
> - WARN_ON_ONCE(!page->share);
> - WARN_ON_ONCE(page->mapping);
> - page->share++;
> + struct folio *folio = dax_to_folio(entry);
> +
> + WARN_ON_ONCE(!folio->share);
> + WARN_ON_ONCE(folio->mapping);
> + WARN_ON_ONCE(dax_entry_order(entry) != folio_order(folio));
> + folio->share++;
> +}
> +
> +static inline unsigned long dax_folio_share_put(struct folio *folio)
> +{
> + unsigned long ref;
> +
> + if (!dax_folio_is_shared(folio))
> + ref = 0;
> + else
> + ref = --folio->share;
> +
> + WARN_ON_ONCE(ref < 0);
> + if (!ref) {
> + folio->mapping = NULL;
> + if (folio_order(folio)) {
> + struct dev_pagemap *pgmap = page_pgmap(&folio->page);
> + unsigned int order = folio_order(folio);
> + unsigned int i;
> +
> + for (i = 0; i < (1UL << order); i++) {
> + struct page *page = folio_page(folio, i);
> +
> + ClearPageHead(page);
> + clear_compound_head(page);
> +
> + /*
> + * Reset pgmap which was over-written by
> + * prep_compound_page().
> + */
> + page_folio(page)->pgmap = pgmap;
> +
> + /* Make sure this isn't set to TAIL_MAPPING */
> + page->mapping = NULL;
> + page->share = 0;
> + WARN_ON_ONCE(page_ref_count(page));
> + }
> + }
> + }
> +
> + return ref;
> }
>
> -static inline unsigned long dax_page_share_put(struct page *page)
> +static void dax_device_folio_init(void *entry)
s/dax_device_folio_init/dax_folio_init/
...otherwise I do not see any connection to a "device" concept in this
file.
> {
> - WARN_ON_ONCE(!page->share);
> - return --page->share;
> + struct folio *folio = dax_to_folio(entry);
> + int order = dax_entry_order(entry);
> +
> + /*
> + * Folio should have been split back to order-0 pages in
> + * dax_folio_share_put() when they were removed from their
> + * final mapping.
> + */
> + WARN_ON_ONCE(folio_order(folio));
> +
> + if (order > 0) {
> + prep_compound_page(&folio->page, order);
> + if (order > 1)
> + INIT_LIST_HEAD(&folio->_deferred_list);
> + WARN_ON_ONCE(folio_ref_count(folio));
> + }
> }
>
> /*
> @@ -388,72 +437,58 @@ static inline unsigned long dax_page_share_put(struct page *page)
> * dax_holder_operations.
> */
> static void dax_associate_entry(void *entry, struct address_space *mapping,
> - struct vm_area_struct *vma, unsigned long address, bool shared)
> + struct vm_area_struct *vma, unsigned long address, bool shared)
> {
> - unsigned long size = dax_entry_size(entry), pfn, index;
> - int i = 0;
> + unsigned long size = dax_entry_size(entry), index;
> + struct folio *folio = dax_to_folio(entry);
>
> if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> return;
>
> index = linear_page_index(vma, address & ~(size - 1));
> - for_each_mapped_pfn(entry, pfn) {
> - struct page *page = pfn_to_page(pfn);
> -
> - if (shared && page->mapping && page->share) {
> - if (page->mapping) {
> - page->mapping = NULL;
> + if (shared && (folio->mapping || dax_folio_is_shared(folio))) {
This change in logic aligns with the previous feedback on the suspect
"if (shared && page->mapping && page->share)"
...statement, right?
...and maybe the dax_make_shared() suggestion makes the diff smaller
here.
> + if (folio->mapping) {
> + folio->mapping = NULL;
>
> - /*
> - * Page has already been mapped into one address
> - * space so set the share count.
> - */
> - page->share = 1;
> - }
> -
> - dax_page_share_get(page);
> - } else {
> - WARN_ON_ONCE(page->mapping);
> - page->mapping = mapping;
> - page->index = index + i++;
> + /*
> + * folio has already been mapped into one address
> + * space so set the share count.
> + */
> + folio->share = 1;
> }
> +
> + dax_folio_share_get(entry);
> + } else {
> + WARN_ON_ONCE(folio->mapping);
> + dax_device_folio_init(entry);
> + folio = dax_to_folio(entry);
> + folio->mapping = mapping;
> + folio->index = index;
> }
> }
>
> static void dax_disassociate_entry(void *entry, struct address_space *mapping,
> - bool trunc)
> + bool trunc)
> {
> - unsigned long pfn;
> + struct folio *folio = dax_to_folio(entry);
>
> if (IS_ENABLED(CONFIG_FS_DAX_LIMITED))
> return;
>
> - for_each_mapped_pfn(entry, pfn) {
> - struct page *page = pfn_to_page(pfn);
> -
> - WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
> - if (dax_page_is_shared(page)) {
> - /* keep the shared flag if this page is still shared */
> - if (dax_page_share_put(page) > 0)
> - continue;
> - } else
> - WARN_ON_ONCE(page->mapping && page->mapping != mapping);
> - page->mapping = NULL;
> - page->index = 0;
> - }
> + dax_folio_share_put(folio);
Probably should not call this "share_put" anymore since it is handling
both the shared and non-shared case.
> }
>
> static struct page *dax_busy_page(void *entry)
Hmm, will this ultimately become dax_busy_folio()?
[..]
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 54b59b8..e308cb9 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -295,6 +295,8 @@ typedef struct {
> * anonymous memory.
> * @index: Offset within the file, in units of pages. For anonymous memory,
> * this is the index from the beginning of the mmap.
> + * @share: number of DAX mappings that reference this folio. See
> + * dax_associate_entry.
> * @private: Filesystem per-folio data (see folio_attach_private()).
> * @swap: Used for swp_entry_t if folio_test_swapcache().
> * @_mapcount: Do not access this member directly. Use folio_mapcount() to
> @@ -344,7 +346,10 @@ struct folio {
> struct dev_pagemap *pgmap;
> };
> struct address_space *mapping;
> - pgoff_t index;
> + union {
> + pgoff_t index;
> + unsigned long share;
> + };
This feels like it should be an immediate follow-on change if only to
separate fsdax conversion bugs from ->index/->share aliasing bugs, and
due to the significance of touching 'struct page'.
[..]
As I only have cosmetic comments you can add:
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
...and here is that aforementioned patch:
-- 8< --
Subject: dcssblk: Mark DAX broken, remove FS_DAX_LIMITED support
From: Dan Williams <dan.j.williams@intel.com>
The dcssblk driver has long needed special-case support to enable
limited dax operation, the so-called CONFIG_FS_DAX_LIMITED. This mode
works around the incomplete support for ZONE_DEVICE on s390 by forgoing
the ability of dax-mapped pages to support GUP.
Now, pending cleanups to fsdax that fix its reference counting [1] depend on
the ability of all dax drivers to supply ZONE_DEVICE pages.
To allow that work to move forward, dax support needs to be paused for
dcssblk until ZONE_DEVICE support arrives. That work has been known for
a few years [2], and the removal of "pte_devmap" requirements [3] makes the
conversion easier.
For now, place the support behind CONFIG_BROKEN, and remove PFN_SPECIAL
(dcssblk was the only user).
Link: http://lore.kernel.org/cover.9f0e45d52f5cff58807831b6b867084d0b14b61c.1725941415.git-series.apopple@nvidia.com [1]
Link: http://lore.kernel.org/20210820210318.187742e8@thinkpad/ [2]
Link: http://lore.kernel.org/4511465a4f8429f45e2ac70d2e65dc5e1df1eb47.1725941415.git-series.apopple@nvidia.com [3]
Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Tested-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
---
Documentation/filesystems/dax.rst | 1 -
drivers/s390/block/Kconfig | 12 ++++++++++--
drivers/s390/block/dcssblk.c | 27 +++++++++++++++++----------
3 files changed, 27 insertions(+), 13 deletions(-)
diff --git a/Documentation/filesystems/dax.rst b/Documentation/filesystems/dax.rst
index 719e90f1988e..08dd5e254cc5 100644
--- a/Documentation/filesystems/dax.rst
+++ b/Documentation/filesystems/dax.rst
@@ -207,7 +207,6 @@ implement direct_access.
These block devices may be used for inspiration:
- brd: RAM backed block device driver
-- dcssblk: s390 dcss block device driver
- pmem: NVDIMM persistent memory driver
diff --git a/drivers/s390/block/Kconfig b/drivers/s390/block/Kconfig
index e3710a762aba..4bfe469c04aa 100644
--- a/drivers/s390/block/Kconfig
+++ b/drivers/s390/block/Kconfig
@@ -4,13 +4,21 @@ comment "S/390 block device drivers"
config DCSSBLK
def_tristate m
- select FS_DAX_LIMITED
- select DAX
prompt "DCSSBLK support"
depends on S390 && BLOCK
help
Support for dcss block device
+config DCSSBLK_DAX
+ def_bool y
+ depends on DCSSBLK
+ # requires S390 ZONE_DEVICE support
+ depends on BROKEN
+ select DAX
+ prompt "DCSSBLK DAX support"
+ help
+ Enable DAX operation for the dcss block device
+
config DASD
def_tristate y
prompt "Support for DASD devices"
diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
index 0f14d279d30b..7248e547fefb 100644
--- a/drivers/s390/block/dcssblk.c
+++ b/drivers/s390/block/dcssblk.c
@@ -534,6 +534,21 @@ static const struct attribute_group *dcssblk_dev_attr_groups[] = {
NULL,
};
+static int dcssblk_setup_dax(struct dcssblk_dev_info *dev_info)
+{
+ struct dax_device *dax_dev;
+
+ if (!IS_ENABLED(CONFIG_DCSSBLK_DAX))
+ return 0;
+
+ dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
+ if (IS_ERR(dax_dev))
+ return PTR_ERR(dax_dev);
+ set_dax_synchronous(dax_dev);
+ dev_info->dax_dev = dax_dev;
+ return dax_add_host(dev_info->dax_dev, dev_info->gd);
+}
+
/*
* device attribute for adding devices
*/
@@ -547,7 +562,6 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
int rc, i, j, num_of_segments;
struct dcssblk_dev_info *dev_info;
struct segment_info *seg_info, *temp;
- struct dax_device *dax_dev;
char *local_buf;
unsigned long seg_byte_size;
@@ -674,14 +688,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
if (rc)
goto put_dev;
- dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
- if (IS_ERR(dax_dev)) {
- rc = PTR_ERR(dax_dev);
- goto put_dev;
- }
- set_dax_synchronous(dax_dev);
- dev_info->dax_dev = dax_dev;
- rc = dax_add_host(dev_info->dax_dev, dev_info->gd);
+ rc = dcssblk_setup_dax(dev_info);
if (rc)
goto out_dax;
@@ -917,7 +924,7 @@ __dcssblk_direct_access(struct dcssblk_dev_info *dev_info, pgoff_t pgoff,
*kaddr = __va(dev_info->start + offset);
if (pfn)
*pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset),
- PFN_DEV|PFN_SPECIAL);
+ PFN_DEV);
return (dev_sz - offset) / PAGE_SIZE;
}
* Re: [PATCH v6 22/26] device/dax: Properly refcount device dax pages when mapping
2025-01-10 6:00 ` [PATCH v6 22/26] device/dax: Properly refcount device dax pages when mapping Alistair Popple
@ 2025-01-14 6:12 ` Dan Williams
2025-02-03 11:29 ` Alistair Popple
0 siblings, 1 reply; 97+ messages in thread
From: Dan Williams @ 2025-01-14 6:12 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Alistair Popple wrote:
> Device DAX pages are currently not reference counted when mapped,
> instead relying on the devmap PTE bit to ensure mapping code will not
> get/put references. This requires special handling in various page
> table walkers, particularly GUP, to manage references on the
> underlying pgmap to ensure the pages remain valid.
>
> However there is no reason these pages can't be refcounted properly at
> map time. Doing so eliminates the need for the devmap PTE bit,
> freeing up a precious PTE bit. It also simplifies GUP as it no longer
> needs to manage the special pgmap references and can instead just
> treat the pages normally as defined by vm_normal_page().
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> ---
> drivers/dax/device.c | 15 +++++++++------
> mm/memremap.c | 13 ++++++-------
> 2 files changed, 15 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> index 6d74e62..fd22dbf 100644
> --- a/drivers/dax/device.c
> +++ b/drivers/dax/device.c
> @@ -126,11 +126,12 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
> return VM_FAULT_SIGBUS;
> }
>
> - pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
> + pfn = phys_to_pfn_t(phys, 0);
>
> dax_set_mapping(vmf, pfn, fault_size);
>
> - return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
> + return vmf_insert_page_mkwrite(vmf, pfn_t_to_page(pfn),
> + vmf->flags & FAULT_FLAG_WRITE);
> }
>
> static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
> @@ -169,11 +170,12 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
> return VM_FAULT_SIGBUS;
> }
>
> - pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
> + pfn = phys_to_pfn_t(phys, 0);
>
> dax_set_mapping(vmf, pfn, fault_size);
>
> - return vmf_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
> + return vmf_insert_folio_pmd(vmf, page_folio(pfn_t_to_page(pfn)),
> + vmf->flags & FAULT_FLAG_WRITE);
This looks suspect without initializing the compound page metadata.
This might be getting compound pages by default with
CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP. The device-dax unit tests are ok
so far, but that is not super comforting until I can think about this a
bit more... but not tonight.
Might as well fix up device-dax refcounts in this series too, but I
won't ask you to do that, will send you something to include.
* Re: [PATCH v6 08/26] fs/dax: Remove PAGE_MAPPING_DAX_SHARED mapping flag
2025-01-10 6:00 ` [PATCH v6 08/26] fs/dax: Remove PAGE_MAPPING_DAX_SHARED mapping flag Alistair Popple
2025-01-14 0:52 ` Dan Williams
@ 2025-01-14 14:47 ` David Hildenbrand
1 sibling, 0 replies; 97+ messages in thread
From: David Hildenbrand @ 2025-01-14 14:47 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, lina, zhang.lyra, gerald.schaefer,
vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, peterx, linux-doc, linux-kernel,
linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl, linux-fsdevel,
linux-ext4, linux-xfs, jhubbard, hch, david, chenhuacai, kernel,
loongarch
On 10.01.25 07:00, Alistair Popple wrote:
> PAGE_MAPPING_DAX_SHARED is the same as PAGE_MAPPING_ANON. This isn't
> currently a problem because FS DAX pages are treated
> specially. However a future change will make FS DAX pages more like
> normal pages, so folio_test_anon() must not return true for a FS DAX
> page.
Yes, very nice to see PAGE_MAPPING_DAX_SHARED go!
--
Cheers,
David / dhildenb
* Re: [PATCH v6 10/26] mm/mm_init: Move p2pdma page refcount initialisation to p2pdma
2025-01-10 6:00 ` [PATCH v6 10/26] mm/mm_init: Move p2pdma page refcount initialisation to p2pdma Alistair Popple
@ 2025-01-14 14:51 ` David Hildenbrand
0 siblings, 0 replies; 97+ messages in thread
From: David Hildenbrand @ 2025-01-14 14:51 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, lina, zhang.lyra, gerald.schaefer,
vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, peterx, linux-doc, linux-kernel,
linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl, linux-fsdevel,
linux-ext4, linux-xfs, jhubbard, hch, david, chenhuacai, kernel,
loongarch
On 10.01.25 07:00, Alistair Popple wrote:
> Currently ZONE_DEVICE page reference counts are initialised by core
> memory management code in __init_zone_device_page() as part of the
> memremap() call which driver modules make to obtain ZONE_DEVICE
> pages. This initialises page refcounts to 1 before returning them to
> the driver.
>
> This was presumably done because drivers had a reference of sorts
> on the page. It also ensured the page could always be mapped with
> vm_insert_page() for example and would never get freed (ie. have a
> zero refcount), freeing drivers from manipulating page reference counts.
>
> However it complicates figuring out whether or not a page is free from
> the mm perspective because it is no longer possible to just look at
> the refcount. Instead the page type must be known and if GUP is used a
> secondary pgmap reference is also sometimes needed.
>
> To simplify this it is desirable to remove the page reference count
> for the driver, so core mm can just use the refcount without having to
> account for page type or do other types of tracking. This is possible
> because drivers can always assume the page is valid as core kernel
> will never offline or remove the struct page.
>
> This means it is now up to drivers to initialise the page refcount as
> required. P2PDMA uses vm_insert_page() to map the page, and that
> requires a non-zero reference count when initialising the page so set
> that when the page is first mapped.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
>
LGTM
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
* Re: [PATCH v6 11/26] mm: Allow compound zone device pages
2025-01-10 6:00 ` [PATCH v6 11/26] mm: Allow compound zone device pages Alistair Popple
@ 2025-01-14 14:59 ` David Hildenbrand
2025-01-17 1:05 ` Alistair Popple
0 siblings, 1 reply; 97+ messages in thread
From: David Hildenbrand @ 2025-01-14 14:59 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, lina, zhang.lyra, gerald.schaefer,
vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, peterx, linux-doc, linux-kernel,
linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl, linux-fsdevel,
linux-ext4, linux-xfs, jhubbard, hch, david, chenhuacai, kernel,
loongarch, Jason Gunthorpe
On 10.01.25 07:00, Alistair Popple wrote:
> Zone device pages are used to represent various type of device memory
> managed by device drivers. Currently compound zone device pages are
> not supported. This is because MEMORY_DEVICE_FS_DAX pages are the only
> user of higher order zone device pages and have their own page
> reference counting.
>
> A future change will unify FS DAX reference counting with normal page
> reference counting rules and remove the special FS DAX reference
> counting. Supporting that requires compound zone device pages.
>
> Supporting compound zone device pages requires compound_head() to
> distinguish between head and tail pages whilst still preserving the
> special struct page fields that are specific to zone device pages.
>
> A tail page is distinguished by having bit zero set in
> page->compound_head, with the remaining bits pointing to the head
> page. For zone device pages page->compound_head is shared with
> page->pgmap.
>
> The page->pgmap field is common to all pages within a memory section.
> Therefore pgmap is the same for both head and tail pages and can be
> moved into the folio and we can use the standard scheme to find
> compound_head from a tail page.
The more relevant thing is that the pgmap field must be common to all
pages in a folio, even if a folio exceeds memory sections (e.g., 128 MiB
on x86_64 where we have 1 GiB folios).
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
>
> ---
>
> Changes for v4:
> - Fix build breakages reported by kernel test robot
>
> Changes since v2:
>
> - Indentation fix
> - Rename page_dev_pagemap() to page_pgmap()
> - Rename folio _unused field to _unused_pgmap_compound_head
> - s/WARN_ON/VM_WARN_ON_ONCE_PAGE/
>
> Changes since v1:
>
> - Move pgmap to the folio as suggested by Matthew Wilcox
> ---
[...]
> static inline bool folio_is_device_coherent(const struct folio *folio)
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 29919fa..61899ec 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -205,8 +205,8 @@ struct migrate_vma {
> unsigned long end;
>
> /*
> - * Set to the owner value also stored in page->pgmap->owner for
> - * migrating out of device private memory. The flags also need to
> + * Set to the owner value also stored in page_pgmap(page)->owner
> + * for migrating out of device private memory. The flags also need to
> * be set to MIGRATE_VMA_SELECT_DEVICE_PRIVATE.
> * The caller should always set this field when using mmu notifier
> * callbacks to avoid device MMU invalidations for device private
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index df8f515..54b59b8 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -129,8 +129,11 @@ struct page {
> unsigned long compound_head; /* Bit zero is set */
> };
> struct { /* ZONE_DEVICE pages */
> - /** @pgmap: Points to the hosting device page map. */
> - struct dev_pagemap *pgmap;
> + /*
> + * The first word is used for compound_head or folio
> + * pgmap
> + */
> + void *_unused_pgmap_compound_head;
> void *zone_device_data;
> /*
> * ZONE_DEVICE private pages are counted as being
> @@ -299,6 +302,7 @@ typedef struct {
> * @_refcount: Do not access this member directly. Use folio_ref_count()
> * to find how many references there are to this folio.
> * @memcg_data: Memory Control Group data.
> + * @pgmap: Metadata for ZONE_DEVICE mappings
> * @virtual: Virtual address in the kernel direct map.
> * @_last_cpupid: IDs of last CPU and last process that accessed the folio.
> * @_entire_mapcount: Do not use directly, call folio_entire_mapcount().
> @@ -337,6 +341,7 @@ struct folio {
> /* private: */
> };
> /* public: */
> + struct dev_pagemap *pgmap;
Agreed, that should work.
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers,
David / dhildenb
* Re: [PATCH v6 12/26] mm/memory: Enhance insert_page_into_pte_locked() to create writable mappings
2025-01-10 6:00 ` [PATCH v6 12/26] mm/memory: Enhance insert_page_into_pte_locked() to create writable mappings Alistair Popple
@ 2025-01-14 15:03 ` David Hildenbrand
[not found] ` <6785b90f300d8_20fa29465@dwillia2-xfh.jf.intel.com.notmuch>
1 sibling, 0 replies; 97+ messages in thread
From: David Hildenbrand @ 2025-01-14 15:03 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, lina, zhang.lyra, gerald.schaefer,
vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, peterx, linux-doc, linux-kernel,
linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl, linux-fsdevel,
linux-ext4, linux-xfs, jhubbard, hch, david, chenhuacai, kernel,
loongarch
On 10.01.25 07:00, Alistair Popple wrote:
> In preparation for using insert_page() for DAX, enhance
> insert_page_into_pte_locked() to handle establishing writable
> mappings. Recall that DAX returns VM_FAULT_NOPAGE after installing a
> PTE which bypasses the typical set_pte_range() in finish_fault.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
>
> ---
>
> Changes for v5:
> - Minor comment/formatting fixes suggested by David Hildenbrand
>
> Changes since v2:
>
> - New patch split out from "mm/memory: Add dax_insert_pfn"
> ---
> mm/memory.c | 37 +++++++++++++++++++++++++++++--------
> 1 file changed, 29 insertions(+), 8 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 06bb29e..8531acb 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2126,19 +2126,40 @@ static int validate_page_before_insert(struct vm_area_struct *vma,
> }
>
> static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
> - unsigned long addr, struct page *page, pgprot_t prot)
> + unsigned long addr, struct page *page,
> + pgprot_t prot, bool mkwrite)
> {
> struct folio *folio = page_folio(page);
> + pte_t entry = ptep_get(pte);
> pte_t pteval;
>
Just drop "entry" and reuse "pteval"; even saves you from one bug below :)
pte_t pteval = ptep_get(pte);
> - if (!pte_none(ptep_get(pte)))
> - return -EBUSY;
> + if (!pte_none(entry)) {
> + if (!mkwrite)
> + return -EBUSY;
> +
> + /* see insert_pfn(). */
> + if (pte_pfn(entry) != page_to_pfn(page)) {
> + WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry)));
> + return -EFAULT;
> + }
> + entry = maybe_mkwrite(entry, vma);
> + entry = pte_mkyoung(entry);
> + if (ptep_set_access_flags(vma, addr, pte, entry, 1))
> + update_mmu_cache(vma, addr, pte);
> + return 0;
> + }
> +
> /* Ok, finally just insert the thing.. */
> pteval = mk_pte(page, prot);
> if (unlikely(is_zero_folio(folio))) {
> pteval = pte_mkspecial(pteval);
> } else {
> folio_get(folio);
> + entry = mk_pte(page, prot);
we already do "pteval = mk_pte(page, prot);" above?
And I think your change here does not do what you want, because you
modify the new "entry" but we do
set_pte_at(vma->vm_mm, addr, pte, pteval);
below ...
> + if (mkwrite) {
> + entry = pte_mkyoung(entry);
> + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> + }
So again, better just reuse pteval :)
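Putting that together, the function would look roughly like this (a
sketch based only on the hunks quoted above; the existing counter/rmap
accounting outside the quoted context is elided into a comment):

static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
				       unsigned long addr, struct page *page,
				       pgprot_t prot, bool mkwrite)
{
	struct folio *folio = page_folio(page);
	pte_t pteval = ptep_get(pte);

	if (!pte_none(pteval)) {
		if (!mkwrite)
			return -EBUSY;

		/* see insert_pfn(): only allow upgrading the same pfn */
		if (pte_pfn(pteval) != page_to_pfn(page)) {
			WARN_ON_ONCE(!is_zero_pfn(pte_pfn(pteval)));
			return -EFAULT;
		}
		pteval = maybe_mkwrite(pteval, vma);
		pteval = pte_mkyoung(pteval);
		if (ptep_set_access_flags(vma, addr, pte, pteval, 1))
			update_mmu_cache(vma, addr, pte);
		return 0;
	}

	/* Ok, finally just insert the thing.. */
	pteval = mk_pte(page, prot);
	if (unlikely(is_zero_folio(folio))) {
		pteval = pte_mkspecial(pteval);
	} else {
		folio_get(folio);
		if (mkwrite) {
			pteval = pte_mkyoung(pteval);
			pteval = maybe_mkwrite(pte_mkdirty(pteval), vma);
		}
		/* existing counter and rmap accounting goes here */
	}
	set_pte_at(vma->vm_mm, addr, pte, pteval);
	return 0;
}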
--
Cheers,
David / dhildenb
* Re: [PATCH v6 13/26] mm/memory: Add vmf_insert_page_mkwrite()
2025-01-10 6:00 ` [PATCH v6 13/26] mm/memory: Add vmf_insert_page_mkwrite() Alistair Popple
@ 2025-01-14 16:15 ` David Hildenbrand
2025-01-15 6:13 ` Alistair Popple
0 siblings, 1 reply; 97+ messages in thread
From: David Hildenbrand @ 2025-01-14 16:15 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, lina, zhang.lyra, gerald.schaefer,
vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, peterx, linux-doc, linux-kernel,
linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl, linux-fsdevel,
linux-ext4, linux-xfs, jhubbard, hch, david, chenhuacai, kernel,
loongarch
On 10.01.25 07:00, Alistair Popple wrote:
> Currently to map a DAX page the DAX driver calls vmf_insert_pfn. This
> creates a special devmap PTE entry for the pfn but does not take a
> reference on the underlying struct page for the mapping. This is
> because DAX page refcounts are treated specially, as indicated by the
> presence of a devmap entry.
>
> To allow DAX page refcounts to be managed the same as normal page
> refcounts introduce vmf_insert_page_mkwrite(). This will take a
> reference on the underlying page much the same as vmf_insert_page,
> except it also permits upgrading an existing mapping to be writable if
> requested/possible.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
>
> ---
>
> Updates from v2:
>
> - Rename function to make not DAX specific
>
> - Split the insert_page_into_pte_locked() change into a separate
> patch.
>
> Updates from v1:
>
> - Re-arrange code in insert_page_into_pte_locked() based on comments
> from Jan Kara.
>
> - Call mkdrity/mkyoung for the mkwrite case, also suggested by Jan.
> ---
> include/linux/mm.h | 2 ++
> mm/memory.c | 36 ++++++++++++++++++++++++++++++++++++
> 2 files changed, 38 insertions(+)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index e790298..f267b06 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3620,6 +3620,8 @@ int vm_map_pages(struct vm_area_struct *vma, struct page **pages,
> unsigned long num);
> int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
> unsigned long num);
> +vm_fault_t vmf_insert_page_mkwrite(struct vm_fault *vmf, struct page *page,
> + bool write);
> vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> unsigned long pfn);
> vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
> diff --git a/mm/memory.c b/mm/memory.c
> index 8531acb..c60b819 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2624,6 +2624,42 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
> return VM_FAULT_NOPAGE;
> }
>
> +vm_fault_t vmf_insert_page_mkwrite(struct vm_fault *vmf, struct page *page,
> + bool write)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + pgprot_t pgprot = vma->vm_page_prot;
> + unsigned long pfn = page_to_pfn(page);
> + unsigned long addr = vmf->address;
> + int err;
> +
> + if (addr < vma->vm_start || addr >= vma->vm_end)
> + return VM_FAULT_SIGBUS;
> +
> + track_pfn_insert(vma, &pgprot, pfn_to_pfn_t(pfn));
I think I raised this before: why is this track_pfn_insert() in here? It
only ever does something to VM_PFNMAP mappings, and that cannot possibly
be the case here (nothing in VM_PFNMAP is refcounted, ever)?
> +
> + if (!pfn_modify_allowed(pfn, pgprot))
> + return VM_FAULT_SIGBUS;
Why is that required? Why are we messing so much with PFNs? :)
Note that x86 does in there
/* If it's real memory always allow */
if (pfn_valid(pfn))
return true;
See below, when would we ever have a "struct page *" but !pfn_valid() ?
> +
> + /*
> + * We refcount the page normally so make sure pfn_valid is true.
> + */
> + if (!pfn_valid(pfn))
> + return VM_FAULT_SIGBUS;
Somebody gave us a "struct page", how could the pfn ever be invalid (not
have a struct page)?
I think all of the above regarding PFNs should be dropped -- unless I am
missing something important.
> +
> + if (WARN_ON(is_zero_pfn(pfn) && write))
> + return VM_FAULT_SIGBUS;
is_zero_page() if you already have the "page". But note that in
validate_page_before_insert() we do have a check that allows for
conditional insertion of the shared zeropage.
So maybe this hunk is also not required.
> +
> + err = insert_page(vma, addr, page, pgprot, write);
> + if (err == -ENOMEM)
> + return VM_FAULT_OOM;
> + if (err < 0 && err != -EBUSY)
> + return VM_FAULT_SIGBUS;
> +
> + return VM_FAULT_NOPAGE;
> +}
> +EXPORT_SYMBOL_GPL(vmf_insert_page_mkwrite);
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 15/26] huge_memory: Add vmf_insert_folio_pud()
2025-01-10 6:00 ` [PATCH v6 15/26] huge_memory: Add vmf_insert_folio_pud() Alistair Popple
2025-01-14 1:27 ` Dan Williams
@ 2025-01-14 16:22 ` David Hildenbrand
2025-01-15 6:38 ` Alistair Popple
1 sibling, 1 reply; 97+ messages in thread
From: David Hildenbrand @ 2025-01-14 16:22 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, lina, zhang.lyra, gerald.schaefer,
vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, peterx, linux-doc, linux-kernel,
linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl, linux-fsdevel,
linux-ext4, linux-xfs, jhubbard, hch, david, chenhuacai, kernel,
loongarch
On 10.01.25 07:00, Alistair Popple wrote:
> Currently DAX folio/page reference counts are managed differently to
> normal pages. To allow these to be managed the same as normal pages
> introduce vmf_insert_folio_pud. This will map the entire PUD-sized folio
> and take references as it would for a normally mapped page.
>
> This is distinct from the current mechanism, vmf_insert_pfn_pud, which
> simply inserts a special devmap PUD entry into the page table without
> holding a reference to the page for the mapping.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
[...]
> +/**
> + * vmf_insert_folio_pud - insert a pud size folio mapped by a pud entry
> + * @vmf: Structure describing the fault
> + * @folio: folio to insert
> + * @write: whether it's a write fault
> + *
> + * Return: vm_fault_t value.
> + */
> +vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, bool write)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + unsigned long addr = vmf->address & PUD_MASK;
> + pud_t *pud = vmf->pud;
> + struct mm_struct *mm = vma->vm_mm;
> + spinlock_t *ptl;
> +
> + if (addr < vma->vm_start || addr >= vma->vm_end)
> + return VM_FAULT_SIGBUS;
> +
> + if (WARN_ON_ONCE(folio_order(folio) != PUD_ORDER))
> + return VM_FAULT_SIGBUS;
> +
> + ptl = pud_lock(mm, pud);
> + if (pud_none(*vmf->pud)) {
> + folio_get(folio);
> + folio_add_file_rmap_pud(folio, &folio->page, vma);
> + add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR);
> + }
> + insert_pfn_pud(vma, addr, vmf->pud, pfn_to_pfn_t(folio_pfn(folio)), write);
This looks scary at first (inserting something when not taking a
reference), but insert_pfn_pud() seems to handle that. A comment here
would have been nice.
It's weird, though, that if there is already something else, that we
only WARN but don't actually return an error. So ...
> + spin_unlock(ptl);
> +
> + return VM_FAULT_NOPAGE;
I assume always returning VM_FAULT_NOPAGE, even when something went
wrong, is the right thing to do?
Apart from that LGTM.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 16/26] huge_memory: Add vmf_insert_folio_pmd()
2025-01-10 6:00 ` [PATCH v6 16/26] huge_memory: Add vmf_insert_folio_pmd() Alistair Popple
2025-01-14 2:04 ` Dan Williams
@ 2025-01-14 16:40 ` David Hildenbrand
2025-01-14 17:22 ` Dan Williams
1 sibling, 1 reply; 97+ messages in thread
From: David Hildenbrand @ 2025-01-14 16:40 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, lina, zhang.lyra, gerald.schaefer,
vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, peterx, linux-doc, linux-kernel,
linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl, linux-fsdevel,
linux-ext4, linux-xfs, jhubbard, hch, david, chenhuacai, kernel,
loongarch
> +vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio, bool write)
> +{
> + struct vm_area_struct *vma = vmf->vma;
> + unsigned long addr = vmf->address & PMD_MASK;
> + struct mm_struct *mm = vma->vm_mm;
> + spinlock_t *ptl;
> + pgtable_t pgtable = NULL;
> +
> + if (addr < vma->vm_start || addr >= vma->vm_end)
> + return VM_FAULT_SIGBUS;
> +
> + if (WARN_ON_ONCE(folio_order(folio) != PMD_ORDER))
> + return VM_FAULT_SIGBUS;
> +
> + if (arch_needs_pgtable_deposit()) {
> + pgtable = pte_alloc_one(vma->vm_mm);
> + if (!pgtable)
> + return VM_FAULT_OOM;
> + }
This is interesting and nasty at the same time (only to make ppc64 book3s
with hash tables happy). But it seems to be the right thing to do.
> +
> + ptl = pmd_lock(mm, vmf->pmd);
> + if (pmd_none(*vmf->pmd)) {
> + folio_get(folio);
> + folio_add_file_rmap_pmd(folio, &folio->page, vma);
> + add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
> + }
> + insert_pfn_pmd(vma, addr, vmf->pmd, pfn_to_pfn_t(folio_pfn(folio)),
> + vma->vm_page_prot, write, pgtable);
> + spin_unlock(ptl);
> + if (pgtable)
> + pte_free(mm, pgtable);
Ehm, are you unconditionally freeing the pgtable, even if consumed by
insert_pfn_pmd() ?
Note that setting pgtable to NULL in insert_pfn_pmd() when consumed will
not be visible here.
You'd have to pass a pointer to the ... pointer (&pgtable).
... unless I am missing something, staring at the diff.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 19/26] proc/task_mmu: Mark devdax and fsdax pages as always unpinned
2025-01-14 2:28 ` Dan Williams
@ 2025-01-14 16:45 ` David Hildenbrand
2025-01-17 1:28 ` Alistair Popple
0 siblings, 1 reply; 97+ messages in thread
From: David Hildenbrand @ 2025-01-14 16:45 UTC (permalink / raw)
To: Dan Williams, Alistair Popple, akpm, linux-mm
Cc: alison.schofield, lina, zhang.lyra, gerald.schaefer,
vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, peterx, linux-doc, linux-kernel,
linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl, linux-fsdevel,
linux-ext4, linux-xfs, jhubbard, hch, david, chenhuacai, kernel,
loongarch
On 14.01.25 03:28, Dan Williams wrote:
> Alistair Popple wrote:
>> The procfs mmu files such as smaps and pagemap currently ignore devdax and
>> fsdax pages because these pages are considered special. A future change
>> will start treating these as normal pages, meaning they can be exposed via
>> smaps and pagemap.
>>
>> The only difference is that devdax and fsdax pages can never be pinned for
>> DMA via FOLL_LONGTERM, so add an explicit check in pte_is_pinned() to
>> reflect that.
>
> I don't understand this patch.
This whole pte_is_pinned() should likely be ripped out (and I have had a
patch here to do that for a long time).
But that's a different discussion.
>
> pin_user_pages() is also used for Direct-I/O page pinning, so the
> comment about FOLL_LONGTERM is wrong, and I otherwise do not understand
> what goes wrong if the only pte_is_pinned() user correctly detects the
> pin state?
Yes, this patch should likely just be dropped.
Even if folio_maybe_dma_pinned() == true because of "false positives",
it will behave just like other order-0 pages with false positives, and
only affect soft-dirty tracking ... which nobody should be caring about
here at all.
We would always detect the PTE as soft-dirty because we never
pte_wrprotect(old_pte)
Yes, nobody should care.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 16/26] huge_memory: Add vmf_insert_folio_pmd()
2025-01-14 16:40 ` David Hildenbrand
@ 2025-01-14 17:22 ` Dan Williams
2025-01-15 7:05 ` Alistair Popple
0 siblings, 1 reply; 97+ messages in thread
From: Dan Williams @ 2025-01-14 17:22 UTC (permalink / raw)
To: David Hildenbrand, Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, lina, zhang.lyra, gerald.schaefer,
vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, peterx, linux-doc, linux-kernel,
linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl, linux-fsdevel,
linux-ext4, linux-xfs, jhubbard, hch, david, chenhuacai, kernel,
loongarch
David Hildenbrand wrote:
> > +vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio, bool write)
> > +{
> > + struct vm_area_struct *vma = vmf->vma;
> > + unsigned long addr = vmf->address & PMD_MASK;
> > + struct mm_struct *mm = vma->vm_mm;
> > + spinlock_t *ptl;
> > + pgtable_t pgtable = NULL;
> > +
> > + if (addr < vma->vm_start || addr >= vma->vm_end)
> > + return VM_FAULT_SIGBUS;
> > +
> > + if (WARN_ON_ONCE(folio_order(folio) != PMD_ORDER))
> > + return VM_FAULT_SIGBUS;
> > +
> > + if (arch_needs_pgtable_deposit()) {
> > + pgtable = pte_alloc_one(vma->vm_mm);
> > + if (!pgtable)
> > + return VM_FAULT_OOM;
> > + }
>
> This is interesting and nasty at the same time (only to make ppc64 book3s
> with hash tables happy). But it seems to be the right thing to do.
>
> > +
> > + ptl = pmd_lock(mm, vmf->pmd);
> > + if (pmd_none(*vmf->pmd)) {
> > + folio_get(folio);
> > + folio_add_file_rmap_pmd(folio, &folio->page, vma);
> > + add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
> > + }
> > + insert_pfn_pmd(vma, addr, vmf->pmd, pfn_to_pfn_t(folio_pfn(folio)),
> > + vma->vm_page_prot, write, pgtable);
> > + spin_unlock(ptl);
> > + if (pgtable)
> > + pte_free(mm, pgtable);
>
> Ehm, are you unconditionally freeing the pgtable, even if consumed by
> insert_pfn_pmd() ?
>
> Note that setting pgtable to NULL in insert_pfn_pmd() when consumed will
> not be visible here.
>
> You'd have to pass a pointer to the ... pointer (&pgtable).
>
> ... unless I am missing something, staring at the diff.
In fact I glossed over the fact that this has been commented on before
and assumed it was fixed:
http://lore.kernel.org/66f61ce4da80_964f2294fb@dwillia2-xfh.jf.intel.com.notmuch
So, yes, insert_pfn_pmd needs to take &pgtable to report back if the
allocation got consumed.
Good catch.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 23/26] mm: Remove pXX_devmap callers
2025-01-10 6:00 ` [PATCH v6 23/26] mm: Remove pXX_devmap callers Alistair Popple
@ 2025-01-14 18:50 ` Dan Williams
2025-01-15 7:27 ` Alistair Popple
0 siblings, 1 reply; 97+ messages in thread
From: Dan Williams @ 2025-01-14 18:50 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Alistair Popple wrote:
> The devmap PTE special bit was used to detect mappings of FS DAX
> pages. This tracking was required to ensure the generic mm did not
> manipulate the page reference counts as FS DAX implemented its own
> reference counting scheme.
>
> Now that FS DAX pages have their references counted the same way as
> normal pages this tracking is no longer needed and can be
> removed.
>
> Almost all existing uses of pmd_devmap() are paired with a check of
> pmd_trans_huge(). As pmd_trans_huge() now returns true for FS DAX pages
> dropping the check in these cases doesn't change anything.
>
> However care needs to be taken because pmd_trans_huge() also checks that
> a page is not an FS DAX page. This is dealt with either by checking
> !vma_is_dax() or relying on the fact that the page pointer was obtained
> from a page list. This is possible because zone device pages cannot
> appear in any page list due to sharing page->lru with page->pgmap.
While the patch looks straightforward I think part of taking "care" in
this case is to split it such that any of those careful conversions have
their own bisect point in the history.
Perhaps this can move to a follow-on series to not blow up the patch count
of the base series? ...but first I want to get your reaction to splitting
for bisect purposes.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 24/26] mm: Remove devmap related functions and page table bits
2025-01-10 6:00 ` [PATCH v6 24/26] mm: Remove devmap related functions and page table bits Alistair Popple
2025-01-11 10:08 ` Huacai Chen
@ 2025-01-14 19:03 ` Dan Williams
1 sibling, 0 replies; 97+ messages in thread
From: Dan Williams @ 2025-01-14 19:03 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Alistair Popple wrote:
> Now that DAX and all other reference counts to ZONE_DEVICE pages are
> managed normally there is no need for the special devmap PTE/PMD/PUD
> page table bits. So drop all references to these, freeing up a
> software defined page table bit on architectures supporting it.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Acked-by: Will Deacon <will@kernel.org> # arm64
Hooray! Looks good to me modulo breaking up the previous patch.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 25/26] Revert "riscv: mm: Add support for ZONE_DEVICE"
2025-01-10 6:00 ` [PATCH v6 25/26] Revert "riscv: mm: Add support for ZONE_DEVICE" Alistair Popple
@ 2025-01-14 19:11 ` Dan Williams
0 siblings, 0 replies; 97+ messages in thread
From: Dan Williams @ 2025-01-14 19:11 UTC (permalink / raw)
To: Alistair Popple, akpm, dan.j.williams, linux-mm
Cc: alison.schofield, Alistair Popple, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch, Björn Töpel
Alistair Popple wrote:
> DEVMAP PTEs are no longer required to support ZONE_DEVICE so remove
> them.
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Suggested-by: Chunyan Zhang <zhang.lyra@gmail.com>
> Reviewed-by: Björn Töpel <bjorn@rivosinc.com>
This and the next are candidates to squash into the previous remove
patch, right? ...and I am not sure a standalone "Revert" commit is
appropriate when other archs get an omnibus "Remove" cleanup with a
longer explanation in the changelog.
Patch looks good though and you can preserve my Reviewed-by on the
squash.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 08/26] fs/dax: Remove PAGE_MAPPING_DAX_SHARED mapping flag
2025-01-14 0:52 ` Dan Williams
@ 2025-01-15 5:32 ` Alistair Popple
2025-01-15 5:44 ` Dan Williams
0 siblings, 1 reply; 97+ messages in thread
From: Alistair Popple @ 2025-01-15 5:32 UTC (permalink / raw)
To: Dan Williams
Cc: akpm, linux-mm, alison.schofield, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Mon, Jan 13, 2025 at 04:52:34PM -0800, Dan Williams wrote:
> Alistair Popple wrote:
> > PAGE_MAPPING_DAX_SHARED is the same as PAGE_MAPPING_ANON.
>
> I think a bit a bit more detail is warranted, how about?
>
> The page ->mapping pointer can have magic values like
> PAGE_MAPPING_DAX_SHARED and PAGE_MAPPING_ANON for page owner specific
> usage. In fact, PAGE_MAPPING_DAX_SHARED and PAGE_MAPPING_ANON alias the
> same value.
Massaged it slightly but sounds good.
> > This isn't currently a problem because FS DAX pages are treated
> > specially.
>
> s/are treated specially/are never seen by the anonymous mapping code and
> vice versa/
>
> > However a future change will make FS DAX pages more like
> > normal pages, so folio_test_anon() must not return true for a FS DAX
> > page.
> >
> > We could explicitly test for a FS DAX page in folio_test_anon(),
> > etc. however the PAGE_MAPPING_DAX_SHARED flag isn't actually
> > needed. Instead we can use the page->mapping field to implicitly track
> > the first mapping of a page. If page->mapping is non-NULL it implies
> > the page is associated with a single mapping at page->index. If the
> > page is associated with a second mapping clear page->mapping and set
> > page->share to 1.
> >
> > This is possible because a shared mapping implies the file-system
> > implements dax_holder_operations which makes the ->mapping and
> > ->index, which is a union with ->share, unused.
> >
> > The page is considered shared when page->mapping == NULL and
> > page->share > 0 or page->mapping != NULL, implying it is present in at
> > least one address space. This also makes it easier for a future change
> > to detect when a page is first mapped into an address space which
> > requires special handling.
> >
> > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> > ---
> > fs/dax.c | 45 +++++++++++++++++++++++++--------------
> > include/linux/page-flags.h | 6 +-----
> > 2 files changed, 29 insertions(+), 22 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 4e49cc4..d35dbe1 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -351,38 +351,41 @@ static unsigned long dax_end_pfn(void *entry)
> > for (pfn = dax_to_pfn(entry); \
> > pfn < dax_end_pfn(entry); pfn++)
> >
> > +/*
> > + * A DAX page is considered shared if it has no mapping set and ->share (which
> > + * shares the ->index field) is non-zero. Note this may return false even if the
> > + * page is shared between multiple files but has not yet actually been mapped
> > + * into multiple address spaces.
> > + */
> > static inline bool dax_page_is_shared(struct page *page)
> > {
> > - return page->mapping == PAGE_MAPPING_DAX_SHARED;
> > + return !page->mapping && page->share;
> > }
> >
> > /*
> > - * Set the page->mapping with PAGE_MAPPING_DAX_SHARED flag, increase the
> > - * refcount.
> > + * Increase the page share refcount, warning if the page is not marked as shared.
> > */
> > static inline void dax_page_share_get(struct page *page)
> > {
> > - if (page->mapping != PAGE_MAPPING_DAX_SHARED) {
> > - /*
> > - * Reset the index if the page was already mapped
> > - * regularly before.
> > - */
> > - if (page->mapping)
> > - page->share = 1;
> > - page->mapping = PAGE_MAPPING_DAX_SHARED;
> > - }
> > + WARN_ON_ONCE(!page->share);
> > + WARN_ON_ONCE(page->mapping);
>
> Given the only caller of this function is dax_associate_entry() it seems
> like overkill to check that a function only a few lines away manipulated
> ->mapping correctly.
Good call.
> I don't see much reason for dax_page_share_get() to exist after your
> changes.
>
> Perhaps all that is needed is a dax_make_shared() helper that does the
> initial fiddling of '->mapping = NULL' and '->share = 1'?
Ok. I was going to make the argument that dax_make_shared() was overkill as
well, but as noted below it's a good place to put the comment describing how
this all works, so I have done that.
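Roughly something like this for now (a sketch only, the respin may differ):

/*
 * A page being shared between multiple files can no longer be tracked via
 * ->mapping/->index, so drop the mapping and switch to counting shares
 * instead. Reverse map information is then provided by the file-system
 * implementing dax_holder_operations.
 */
static inline void dax_make_shared(struct page *page)
{
	page->mapping = NULL;
	page->share = 1;
}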
> > page->share++;
> > }
> >
> > static inline unsigned long dax_page_share_put(struct page *page)
> > {
> > + WARN_ON_ONCE(!page->share);
> > return --page->share;
> > }
> >
> > /*
> > - * When it is called in dax_insert_entry(), the shared flag will indicate that
> > - * whether this entry is shared by multiple files. If so, set the page->mapping
> > - * PAGE_MAPPING_DAX_SHARED, and use page->share as refcount.
> > + * When it is called in dax_insert_entry(), the shared flag will indicate
> > + * whether this entry is shared by multiple files. If the page has not
> > + * previously been associated with any mappings the ->mapping and ->index
> > + * fields will be set. If it has already been associated with a mapping
> > + * the mapping will be cleared and the share count set. It's then up to the
> > + * file-system to track which mappings contain which pages, ie. by implementing
> > + * dax_holder_operations.
>
> This feels like a good comment for a new dax_make_shared() not
> dax_associate_entry().
>
> I would also:
>
> s/up to the file-system to track which mappings contain which pages, ie. by implementing
> dax_holder_operations/up to reverse map users like memory_failure() to
> call back into the filesystem to recover ->mapping and ->index
> information/
Sounds good, although I left a reference to dax_holder_operations in the comment
because it's not immediately obvious how file-systems do this currently and I
had to relearn that more times than I'd care to admit :-)
> > */
> > static void dax_associate_entry(void *entry, struct address_space *mapping,
> > struct vm_area_struct *vma, unsigned long address, bool shared)
> > @@ -397,7 +400,17 @@ static void dax_associate_entry(void *entry, struct address_space *mapping,
> > for_each_mapped_pfn(entry, pfn) {
> > struct page *page = pfn_to_page(pfn);
> >
> > - if (shared) {
> > + if (shared && page->mapping && page->share) {
>
> How does this case happen? I don't think any page would ever enter with
> both ->mapping and ->share set, right?
Sigh. You're right - it can't. This patch series is getting a little bit large
and unwieldy with all the prerequisite bugfixes and cleanups. Obviously I fixed
this when developing the main fs dax count fixup but forgot to rebase the fix
further back in the series.
Anyway I have fixed that now, thanks.
> If the file was mapped then reflinked then ->share should be zero at the
> first mapping attempt. It might not be zero because it is aliased with
> index until it is converted to a shared page.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 12/26] mm/memory: Enhance insert_page_into_pte_locked() to create writable mappings
[not found] ` <6785b90f300d8_20fa29465@dwillia2-xfh.jf.intel.com.notmuch>
@ 2025-01-15 5:36 ` Alistair Popple
0 siblings, 0 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-15 5:36 UTC (permalink / raw)
To: Dan Williams
Cc: akpm, linux-mm, alison.schofield, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Mon, Jan 13, 2025 at 05:08:31PM -0800, Dan Williams wrote:
> Alistair Popple wrote:
> > In preparation for using insert_page() for DAX, enhance
> > insert_page_into_pte_locked() to handle establishing writable
> > mappings. Recall that DAX returns VM_FAULT_NOPAGE after installing a
> > PTE which bypasses the typical set_pte_range() in finish_fault.
> >
> > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> > Suggested-by: Dan Williams <dan.j.williams@intel.com>
> >
> > ---
> >
> > Changes for v5:
> > - Minor comment/formatting fixes suggested by David Hildenbrand
> >
> > Changes since v2:
> >
> > - New patch split out from "mm/memory: Add dax_insert_pfn"
> > ---
> > mm/memory.c | 37 +++++++++++++++++++++++++++++--------
> > 1 file changed, 29 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 06bb29e..8531acb 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2126,19 +2126,40 @@ static int validate_page_before_insert(struct vm_area_struct *vma,
> > }
> >
> > static int insert_page_into_pte_locked(struct vm_area_struct *vma, pte_t *pte,
> > - unsigned long addr, struct page *page, pgprot_t prot)
> > + unsigned long addr, struct page *page,
> > + pgprot_t prot, bool mkwrite)
> > {
> > struct folio *folio = page_folio(page);
> > + pte_t entry = ptep_get(pte);
> > pte_t pteval;
> >
> > - if (!pte_none(ptep_get(pte)))
> > - return -EBUSY;
> > + if (!pte_none(entry)) {
> > + if (!mkwrite)
> > + return -EBUSY;
> > +
> > + /* see insert_pfn(). */
> > + if (pte_pfn(entry) != page_to_pfn(page)) {
> > + WARN_ON_ONCE(!is_zero_pfn(pte_pfn(entry)));
> > + return -EFAULT;
> > + }
> > + entry = maybe_mkwrite(entry, vma);
> > + entry = pte_mkyoung(entry);
> > + if (ptep_set_access_flags(vma, addr, pte, entry, 1))
> > + update_mmu_cache(vma, addr, pte);
> > + return 0;
> > + }
>
> This hunk feels like it is begging to be unified with insert_pfn() after
> pfn_t dies. Perhaps a TODO to remember to come back and unify them, or
> you can go append that work to your pfn_t removal series?
No one has complained about removing pfn_t so I do intend to clean that series
up once this has all been merged somewhere, so I will just go append this
work there.
> Other than that you can add:
>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 08/26] fs/dax: Remove PAGE_MAPPING_DAX_SHARED mapping flag
2025-01-15 5:32 ` Alistair Popple
@ 2025-01-15 5:44 ` Dan Williams
2025-01-17 0:54 ` Alistair Popple
0 siblings, 1 reply; 97+ messages in thread
From: Dan Williams @ 2025-01-15 5:44 UTC (permalink / raw)
To: Alistair Popple, Dan Williams
Cc: akpm, linux-mm, alison.schofield, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Alistair Popple wrote:
[..]
> > How does this case happen? I don't think any page would ever enter with
> > both ->mapping and ->share set, right?
>
> > Sigh. You're right - it can't. This patch series is getting a little bit large
> > and unwieldy with all the prerequisite bugfixes and cleanups. Obviously I fixed
> this when developing the main fs dax count fixup but forgot to rebase the fix
> further back in the series.
I assumed as much when I got to that patch.
> Anyway I have fixed that now, thanks.
You deserve a large helping of grace for waking and then slaying this
old dragon.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 13/26] mm/memory: Add vmf_insert_page_mkwrite()
2025-01-14 16:15 ` David Hildenbrand
@ 2025-01-15 6:13 ` Alistair Popple
0 siblings, 0 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-15 6:13 UTC (permalink / raw)
To: David Hildenbrand
Cc: akpm, dan.j.williams, linux-mm, alison.schofield, lina,
zhang.lyra, gerald.schaefer, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Tue, Jan 14, 2025 at 05:15:54PM +0100, David Hildenbrand wrote:
> On 10.01.25 07:00, Alistair Popple wrote:
> > Currently to map a DAX page the DAX driver calls vmf_insert_pfn. This
> > creates a special devmap PTE entry for the pfn but does not take a
> > reference on the underlying struct page for the mapping. This is
> > because DAX page refcounts are treated specially, as indicated by the
> > presence of a devmap entry.
> >
> > To allow DAX page refcounts to be managed the same as normal page
> > refcounts introduce vmf_insert_page_mkwrite(). This will take a
> > reference on the underlying page much the same as vmf_insert_page,
> > except it also permits upgrading an existing mapping to be writable if
> > requested/possible.
> >
> > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> >
> > ---
> >
> > Updates from v2:
> >
> > - Rename function to make not DAX specific
> >
> > - Split the insert_page_into_pte_locked() change into a separate
> > patch.
> >
> > Updates from v1:
> >
> > - Re-arrange code in insert_page_into_pte_locked() based on comments
> > from Jan Kara.
> >
> > - Call mkdirty/mkyoung for the mkwrite case, also suggested by Jan.
> > ---
> > include/linux/mm.h | 2 ++
> > mm/memory.c | 36 ++++++++++++++++++++++++++++++++++++
> > 2 files changed, 38 insertions(+)
> >
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index e790298..f267b06 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -3620,6 +3620,8 @@ int vm_map_pages(struct vm_area_struct *vma, struct page **pages,
> > unsigned long num);
> > int vm_map_pages_zero(struct vm_area_struct *vma, struct page **pages,
> > unsigned long num);
> > +vm_fault_t vmf_insert_page_mkwrite(struct vm_fault *vmf, struct page *page,
> > + bool write);
> > vm_fault_t vmf_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
> > unsigned long pfn);
> > vm_fault_t vmf_insert_pfn_prot(struct vm_area_struct *vma, unsigned long addr,
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 8531acb..c60b819 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -2624,6 +2624,42 @@ static vm_fault_t __vm_insert_mixed(struct vm_area_struct *vma,
> > return VM_FAULT_NOPAGE;
> > }
> > +vm_fault_t vmf_insert_page_mkwrite(struct vm_fault *vmf, struct page *page,
> > + bool write)
> > +{
> > + struct vm_area_struct *vma = vmf->vma;
> > + pgprot_t pgprot = vma->vm_page_prot;
> > + unsigned long pfn = page_to_pfn(page);
> > + unsigned long addr = vmf->address;
> > + int err;
> > +
> > + if (addr < vma->vm_start || addr >= vma->vm_end)
> > + return VM_FAULT_SIGBUS;
> > +
> > + track_pfn_insert(vma, &pgprot, pfn_to_pfn_t(pfn));
>
> I think I raised this before: why is this track_pfn_insert() in here? It
> only ever does something to VM_PFNMAP mappings, and that cannot possibly be
> the case here (nothing in VM_PFNMAP is refcounted, ever)?
Yes, I also had deja vu reading this comment and a vague recollection of fixing
them too. Your comments[1] were for vmf_insert_folio_pud() though, which explains
why I neglected to do the same clean-up here even though I should have, so thanks
for pointing them out.
[1] - https://lore.kernel.org/linux-mm/ee19854f-fa1f-4207-9176-3c7b79bccd07@redhat.com/
>
> > +
> > + if (!pfn_modify_allowed(pfn, pgprot))
> > + return VM_FAULT_SIGBUS;
>
> Why is that required? Why are we messing so much with PFNs? :)
>
> Note that x86 does in there
>
> /* If it's real memory always allow */
> if (pfn_valid(pfn))
> return true;
>
> See below, when would we ever have a "struct page *" but !pfn_valid() ?
>
>
> > +
> > + /*
> > + * We refcount the page normally so make sure pfn_valid is true.
> > + */
> > + if (!pfn_valid(pfn))
> > + return VM_FAULT_SIGBUS;
>
> Somebody gave us a "struct page", how could the pfn ever be invalid (not
> have a struct page)?
>
> I think all of the above regarding PFNs should be dropped -- unless I am
> missing something important.
>
> > +
> > + if (WARN_ON(is_zero_pfn(pfn) && write))
> > + return VM_FAULT_SIGBUS;
>
> is_zero_page() if you already have the "page". But note that in
> validate_page_before_insert() we do have a check that allows for conditional
> insertion of the shared zeropage.
>
> So maybe this hunk is also not required.
Yes, also not required. I have removed the above hunks as well because we don't
need any of this pfn stuff. Again it's just a hangover from an earlier version
of the series when I was passing pfns rather than pages here.
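With those hunks dropped the function basically reduces to a bounds check plus
insert_page() (roughly, from memory, so the respin may differ slightly):

vm_fault_t vmf_insert_page_mkwrite(struct vm_fault *vmf, struct page *page,
		bool write)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned long addr = vmf->address;
	int err;

	if (addr < vma->vm_start || addr >= vma->vm_end)
		return VM_FAULT_SIGBUS;

	/* insert_page() validates the page and takes the mapping reference */
	err = insert_page(vma, addr, page, vma->vm_page_prot, write);
	if (err == -ENOMEM)
		return VM_FAULT_OOM;
	if (err < 0 && err != -EBUSY)
		return VM_FAULT_SIGBUS;

	return VM_FAULT_NOPAGE;
}
EXPORT_SYMBOL_GPL(vmf_insert_page_mkwrite);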
> > +
> > + err = insert_page(vma, addr, page, pgprot, write);
> > + if (err == -ENOMEM)
> > + return VM_FAULT_OOM;
> > + if (err < 0 && err != -EBUSY)
> > + return VM_FAULT_SIGBUS;
> > +
> > + return VM_FAULT_NOPAGE;
> > +}
> > +EXPORT_SYMBOL_GPL(vmf_insert_page_mkwrite);
>
>
>
>
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 15/26] huge_memory: Add vmf_insert_folio_pud()
2025-01-14 16:22 ` David Hildenbrand
@ 2025-01-15 6:38 ` Alistair Popple
0 siblings, 0 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-15 6:38 UTC (permalink / raw)
To: David Hildenbrand
Cc: akpm, dan.j.williams, linux-mm, alison.schofield, lina,
zhang.lyra, gerald.schaefer, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Tue, Jan 14, 2025 at 05:22:15PM +0100, David Hildenbrand wrote:
> On 10.01.25 07:00, Alistair Popple wrote:
> > Currently DAX folio/page reference counts are managed differently to
> > normal pages. To allow these to be managed the same as normal pages
> > introduce vmf_insert_folio_pud. This will map the entire PUD-sized folio
> > and take references as it would for a normally mapped page.
> >
> > This is distinct from the current mechanism, vmf_insert_pfn_pud, which
> > simply inserts a special devmap PUD entry into the page table without
> > holding a reference to the page for the mapping.
> >
> > Signed-off-by: Alistair Popple <apopple@nvidia.com>
>
> [...]
>
> > +/**
> > + * vmf_insert_folio_pud - insert a pud size folio mapped by a pud entry
> > + * @vmf: Structure describing the fault
> > + * @folio: folio to insert
> > + * @write: whether it's a write fault
> > + *
> > + * Return: vm_fault_t value.
> > + */
> > +vm_fault_t vmf_insert_folio_pud(struct vm_fault *vmf, struct folio *folio, bool write)
> > +{
> > + struct vm_area_struct *vma = vmf->vma;
> > + unsigned long addr = vmf->address & PUD_MASK;
> > + pud_t *pud = vmf->pud;
> > + struct mm_struct *mm = vma->vm_mm;
> > + spinlock_t *ptl;
> > +
> > + if (addr < vma->vm_start || addr >= vma->vm_end)
> > + return VM_FAULT_SIGBUS;
> > +
> > + if (WARN_ON_ONCE(folio_order(folio) != PUD_ORDER))
> > + return VM_FAULT_SIGBUS;
> > +
> > + ptl = pud_lock(mm, pud);
> > + if (pud_none(*vmf->pud)) {
> > + folio_get(folio);
> > + folio_add_file_rmap_pud(folio, &folio->page, vma);
> > + add_mm_counter(mm, mm_counter_file(folio), HPAGE_PUD_NR);
> > + }
> > + insert_pfn_pud(vma, addr, vmf->pud, pfn_to_pfn_t(folio_pfn(folio)), write);
>
> This looks scary at first (inserting something when not taking a reference),
> but insert_pfn_pud() seems to handle that. A comment here would have been
> nice.
Indeed, I will add one.
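Probably something along these lines (wording to be finalised):

	/*
	 * If the PUD is already populated insert_pfn_pud() will not overwrite
	 * it; at most it upgrades an existing mapping of the same pfn to
	 * writable. So the reference/rmap accounting above is only needed for
	 * the pud_none() case.
	 */
	insert_pfn_pud(vma, addr, vmf->pud, pfn_to_pfn_t(folio_pfn(folio)), write);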
> It's weird, though, that if there is already something else, that we only
> WARN but don't actually return an error. So ...
Note we only WARN when there is already a mapping there and we're trying to
upgrade it to writeable. This just mimics the logic which currently exists in
insert_pfn() and insert_pfn_pmd().
The comment in insert_pfn() sheds more light:
/*
* For read faults on private mappings the PFN passed
* in may not match the PFN we have mapped if the
* mapped PFN is a writeable COW page. In the mkwrite
* case we are creating a writable PTE for a shared
* mapping and we expect the PFNs to match. If they
* don't match, we are likely racing with block
* allocation and mapping invalidation so just skip the
* update.
*/
> > + spin_unlock(ptl);
> > +
> > + return VM_FAULT_NOPAGE;
>
> I assume always returning VM_FAULT_NOPAGE, even when something went wrong,
> is the right thing to do?
Yes, I think so. I guess in the WARN case we could return something like
VM_FAULT_SIGBUS to kill the application, but the existing vmf_insert_*()
functions don't currently do that so I think that would be a separate clean-up.
> Apart from that LGTM.
>
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 16/26] huge_memory: Add vmf_insert_folio_pmd()
2025-01-14 17:22 ` Dan Williams
@ 2025-01-15 7:05 ` Alistair Popple
0 siblings, 0 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-15 7:05 UTC (permalink / raw)
To: Dan Williams
Cc: David Hildenbrand, akpm, linux-mm, alison.schofield, lina,
zhang.lyra, gerald.schaefer, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Tue, Jan 14, 2025 at 09:22:00AM -0800, Dan Williams wrote:
> David Hildenbrand wrote:
> > > +vm_fault_t vmf_insert_folio_pmd(struct vm_fault *vmf, struct folio *folio, bool write)
> > > +{
> > > + struct vm_area_struct *vma = vmf->vma;
> > > + unsigned long addr = vmf->address & PMD_MASK;
> > > + struct mm_struct *mm = vma->vm_mm;
> > > + spinlock_t *ptl;
> > > + pgtable_t pgtable = NULL;
> > > +
> > > + if (addr < vma->vm_start || addr >= vma->vm_end)
> > > + return VM_FAULT_SIGBUS;
> > > +
> > > + if (WARN_ON_ONCE(folio_order(folio) != PMD_ORDER))
> > > + return VM_FAULT_SIGBUS;
> > > +
> > > + if (arch_needs_pgtable_deposit()) {
> > > + pgtable = pte_alloc_one(vma->vm_mm);
> > > + if (!pgtable)
> > > + return VM_FAULT_OOM;
> > > + }
> >
> > This is interesting and nasty at the same time (only to make ppc64 book3s
> > with hash tables happy). But it seems to be the right thing to do.
> >
> > > +
> > > + ptl = pmd_lock(mm, vmf->pmd);
> > > + if (pmd_none(*vmf->pmd)) {
> > > + folio_get(folio);
> > > + folio_add_file_rmap_pmd(folio, &folio->page, vma);
> > > + add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
> > > + }
> > > + insert_pfn_pmd(vma, addr, vmf->pmd, pfn_to_pfn_t(folio_pfn(folio)),
> > > + vma->vm_page_prot, write, pgtable);
> > > + spin_unlock(ptl);
> > > + if (pgtable)
> > > + pte_free(mm, pgtable);
> >
> > Ehm, are you unconditionally freeing the pgtable, even if consumed by
> > insert_pfn_pmd() ?
> >
> > Note that setting pgtable to NULL in insert_pfn_pmd() when consumed will
> > not be visible here.
> >
> > You'd have to pass a pointer to the ... pointer (&pgtable).
> >
> > ... unless I am missing something, staring at the diff.
>
> In fact I glazed over the fact that this has been commented on before
> and assumed it was fixed:
>
> http://lore.kernel.org/66f61ce4da80_964f2294fb@dwillia2-xfh.jf.intel.com.notmuch
>
> So, yes, insert_pfn_pmd needs to take &pgtable to report back if the
> allocation got consumed.
>
> Good catch.
Yes, thanks Dave and Dan and apologies for missing that originally. Looking
at the thread I suspect I went down the rabbit hole of trying to implement
vmf_insert_folio() and when that wasn't possible forgot to come back and fix
this up. I have added a return code to insert_pfn_pmd() to indicate whether
or not the pgtable was consumed. I have also added a comment in the commit log
explaining why a vmf_insert_folio() isn't useful.
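I.e. roughly the following, assuming insert_pfn_pmd() returns non-zero when it
did not install the entry and therefore did not consume the deposit (a sketch,
the exact return convention may differ in the respin):

	ptl = pmd_lock(mm, vmf->pmd);
	if (pmd_none(*vmf->pmd)) {
		folio_get(folio);
		folio_add_file_rmap_pmd(folio, &folio->page, vma);
		add_mm_counter(mm, mm_counter_file(folio), HPAGE_PMD_NR);
	}
	error = insert_pfn_pmd(vma, addr, vmf->pmd, pfn_to_pfn_t(folio_pfn(folio)),
			       vma->vm_page_prot, write, pgtable);
	spin_unlock(ptl);

	/* Only free a preallocated page table that was not deposited */
	if (error && pgtable)
		pte_free(mm, pgtable);

	return VM_FAULT_NOPAGE;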
- Alistair
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 23/26] mm: Remove pXX_devmap callers
2025-01-14 18:50 ` Dan Williams
@ 2025-01-15 7:27 ` Alistair Popple
2025-02-04 19:06 ` Dan Williams
0 siblings, 1 reply; 97+ messages in thread
From: Alistair Popple @ 2025-01-15 7:27 UTC (permalink / raw)
To: Dan Williams
Cc: akpm, linux-mm, alison.schofield, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Tue, Jan 14, 2025 at 10:50:49AM -0800, Dan Williams wrote:
> Alistair Popple wrote:
> > The devmap PTE special bit was used to detect mappings of FS DAX
> > pages. This tracking was required to ensure the generic mm did not
> > manipulate the page reference counts as FS DAX implemented its own
> > reference counting scheme.
> >
> > Now that FS DAX pages have their references counted the same way as
> > normal pages this tracking is no longer needed and can be
> > removed.
> >
> > Almost all existing uses of pmd_devmap() are paired with a check of
> > pmd_trans_huge(). As pmd_trans_huge() now returns true for FS DAX pages
> > dropping the check in these cases doesn't change anything.
> >
> > However care needs to be taken because pmd_trans_huge() also checks that
> > a page is not an FS DAX page. This is dealt with either by checking
> > !vma_is_dax() or relying on the fact that the page pointer was obtained
> > from a page list. This is possible because zone device pages cannot
> > appear in any page list due to sharing page->lru with page->pgmap.
>
> While the patch looks straightforward I think part of taking "care" in
> this case is to split it such that any of those careful conversions have
> their own bisect point in the history.
>
> Perhaps this can move to follow-on series to not blow up the patch count
> of the base series? ...but first want to get your reaction to splitting
> for bisect purposes.
TBH I don't feel too strongly about it - I suppose it would make it easier to
bisect to the specific case we weren't careful enough about. However I think if
a bug is bisected to this particular patch it would be relatively easy based on
the context of the bug to narrow it down to a particular file or two.
I do however feel strongly about whether or not that should be done in a
follow-on series :-)
Rebasing such a large series has already become painful and error prone enough,
so if we want to split this change up it will definitely need to be a separate
series done once the rest of this has been merged. So I could be persuaded to
roll this and the pfn_t removal (as that depends on devmap going away) together.
Let me know what you think.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 08/26] fs/dax: Remove PAGE_MAPPING_DAX_SHARED mapping flag
2025-01-15 5:44 ` Dan Williams
@ 2025-01-17 0:54 ` Alistair Popple
0 siblings, 0 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-17 0:54 UTC (permalink / raw)
To: Dan Williams
Cc: akpm, linux-mm, alison.schofield, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Tue, Jan 14, 2025 at 09:44:38PM -0800, Dan Williams wrote:
> Alistair Popple wrote:
> [..]
> > > How does this case happen? I don't think any page would ever enter with
> > > both ->mapping and ->share set, right?
> >
> > > Sigh. You're right - it can't. This patch series is getting a little bit large
> > > and unwieldy with all the prerequisite bugfixes and cleanups. Obviously I fixed
> > this when developing the main fs dax count fixup but forgot to rebase the fix
> > further back in the series.
>
> I assumed as much when I got to that patch.
>
> > Anyway I have fixed that now, thanks.
>
> You deserve a large helping of grace for waking and then slaying this
> old dragon.
Heh, thanks. Let's hope this dragon doesn't have too many more heads :-)
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 11/26] mm: Allow compound zone device pages
2025-01-14 14:59 ` David Hildenbrand
@ 2025-01-17 1:05 ` Alistair Popple
0 siblings, 0 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-17 1:05 UTC (permalink / raw)
To: David Hildenbrand
Cc: akpm, dan.j.williams, linux-mm, alison.schofield, lina,
zhang.lyra, gerald.schaefer, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch, Jason Gunthorpe
On Tue, Jan 14, 2025 at 03:59:31PM +0100, David Hildenbrand wrote:
> On 10.01.25 07:00, Alistair Popple wrote:
> > Zone device pages are used to represent various type of device memory
> > managed by device drivers. Currently compound zone device pages are
> > not supported. This is because MEMORY_DEVICE_FS_DAX pages are the only
> > user of higher order zone device pages and have their own page
> > reference counting.
> >
> > A future change will unify FS DAX reference counting with normal page
> > reference counting rules and remove the special FS DAX reference
> > counting. Supporting that requires compound zone device pages.
> >
> > Supporting compound zone device pages requires compound_head() to
> > distinguish between head and tail pages whilst still preserving the
> > special struct page fields that are specific to zone device pages.
> >
> > A tail page is distinguished by having bit zero being set in
> > page->compound_head, with the remaining bits pointing to the head
> > page. For zone device pages page->compound_head is shared with
> > page->pgmap.
> >
> > The page->pgmap field is common to all pages within a memory section.
> > Therefore pgmap is the same for both head and tail pages and can be
> > moved into the folio and we can use the standard scheme to find
> > compound_head from a tail page.
>
> The more relevant thing is that the pgmap field must be common to all pages
> in a folio, even if a folio exceeds memory sections (e.g., 128 MiB on x86_64
> where we have 1 GiB folios).
Thanks for pointing that out. I had assumed folios couldn't cross a memory
section. Obviously that is wrong so I've updated the commit message accordingly.
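For reference the accessor ends up being little more than a folio dereference,
something like this sketch (the name follows the rename noted in the changelog
below):

static inline struct dev_pagemap *page_pgmap(const struct page *page)
{
	/* pgmap lives in the folio, so head and tail pages resolve the same */
	return page_folio(page)->pgmap;
}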
- Alistair
> > > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> > Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> >
> > ---
> >
> > Changes for v4:
> > - Fix build breakages reported by kernel test robot
> >
> > Changes since v2:
> >
> > - Indentation fix
> > - Rename page_dev_pagemap() to page_pgmap()
> > - Rename folio _unused field to _unused_pgmap_compound_head
> > - s/WARN_ON/VM_WARN_ON_ONCE_PAGE/
> >
> > Changes since v1:
> >
> > - Move pgmap to the folio as suggested by Matthew Wilcox
> > ---
>
> [...]
>
> > static inline bool folio_is_device_coherent(const struct folio *folio)
> > diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> > index 29919fa..61899ec 100644
> > --- a/include/linux/migrate.h
> > +++ b/include/linux/migrate.h
> > @@ -205,8 +205,8 @@ struct migrate_vma {
> > unsigned long end;
> > /*
> > - * Set to the owner value also stored in page->pgmap->owner for
> > - * migrating out of device private memory. The flags also need to
> > + * Set to the owner value also stored in page_pgmap(page)->owner
> > + * for migrating out of device private memory. The flags also need to
> > * be set to MIGRATE_VMA_SELECT_DEVICE_PRIVATE.
> > * The caller should always set this field when using mmu notifier
> > * callbacks to avoid device MMU invalidations for device private
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> > index df8f515..54b59b8 100644
> > --- a/include/linux/mm_types.h
> > +++ b/include/linux/mm_types.h
> > @@ -129,8 +129,11 @@ struct page {
> > unsigned long compound_head; /* Bit zero is set */
> > };
> > struct { /* ZONE_DEVICE pages */
> > - /** @pgmap: Points to the hosting device page map. */
> > - struct dev_pagemap *pgmap;
> > + /*
> > + * The first word is used for compound_head or folio
> > + * pgmap
> > + */
> > + void *_unused_pgmap_compound_head;
> > void *zone_device_data;
> > /*
> > * ZONE_DEVICE private pages are counted as being
> > @@ -299,6 +302,7 @@ typedef struct {
> > * @_refcount: Do not access this member directly. Use folio_ref_count()
> > * to find how many references there are to this folio.
> > * @memcg_data: Memory Control Group data.
> > + * @pgmap: Metadata for ZONE_DEVICE mappings
> > * @virtual: Virtual address in the kernel direct map.
> > * @_last_cpupid: IDs of last CPU and last process that accessed the folio.
> > * @_entire_mapcount: Do not use directly, call folio_entire_mapcount().
> > @@ -337,6 +341,7 @@ struct folio {
> > /* private: */
> > };
> > /* public: */
> > + struct dev_pagemap *pgmap;
>
> Agreed, that should work.
>
> Acked-by: David Hildenbrand <david@redhat.com>
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 19/26] proc/task_mmu: Mark devdax and fsdax pages as always unpinned
2025-01-14 16:45 ` David Hildenbrand
@ 2025-01-17 1:28 ` Alistair Popple
0 siblings, 0 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-17 1:28 UTC (permalink / raw)
To: David Hildenbrand
Cc: Dan Williams, akpm, linux-mm, alison.schofield, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, peterx, linux-doc,
linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch, david,
chenhuacai, kernel, loongarch
On Tue, Jan 14, 2025 at 05:45:46PM +0100, David Hildenbrand wrote:
> On 14.01.25 03:28, Dan Williams wrote:
> > Alistair Popple wrote:
> > > The procfs mmu files such as smaps and pagemap currently ignore devdax and
> > > fsdax pages because these pages are considered special. A future change
> > > will start treating these as normal pages, meaning they can be exposed via
> > > smaps and pagemap.
> > >
> > > The only difference is that devdax and fsdax pages can never be pinned for
> > > DMA via FOLL_LONGTERM, so add an explicit check in pte_is_pinned() to
> > > reflect that.
> >
> > I don't understand this patch.
>
>
> This whole pte_is_pinned() should likely be ripped out (and I have a patch
> here to do that for a long time).
Agreed.
> But that's a different discussion.
>
> >
> > pin_user_pages() is also used for Direct-I/O page pinning, so the
> > comment about FOLL_LONGTERM is wrong, and I otherwise do not understand
> > what goes wrong if the only pte_is_pinned() user correctly detects the
> > pin state?
>
> Yes, this patch should likely just be dropped.
Yeah, I think I was just being overly cautious about the change to
vm_normal_page(). Agree this can be dropped. Looking at task_mmu.c there is one
other user of vm_normal_page() - clear_refs_pte_range().
We will start clearing access/referenced bits on DAX PTEs there. But I think
that's actually the right thing to do given we do currently clear them for PMD
mapped DAX pages.
> Even if folio_maybe_dma_pinned() == true because of "false positives", it
> will behave just like other order-0 pages with false positives, and only
> affect soft-dirty tracking ... which nobody should be caring about here at
> all.
>
> We would always detect the PTE as soft-dirty because we never
> pte_wrprotect(old_pte)
>
> Yes, nobody should care.
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 20/26] mm/mlock: Skip ZONE_DEVICE PMDs during mlock
2025-01-14 2:42 ` Dan Williams
@ 2025-01-17 1:54 ` Alistair Popple
0 siblings, 0 replies; 97+ messages in thread
From: Alistair Popple @ 2025-01-17 1:54 UTC (permalink / raw)
To: Dan Williams, a
Cc: akpm, linux-mm, alison.schofield, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Mon, Jan 13, 2025 at 06:42:46PM -0800, Dan Williams wrote:
> Alistair Popple wrote:
> > At present mlock skips ptes mapping ZONE_DEVICE pages. A future change
> > to remove pmd_devmap will allow pmd_trans_huge_lock() to return
> > ZONE_DEVICE folios so make sure we continue to skip those.
> >
> > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> > Acked-by: David Hildenbrand <david@redhat.com>
>
> This looks like a fix in that mlock_pte_range() *does* call mlock_folio()
> when pmd_trans_huge_lock() returns a non-NULL @ptl.
>
> So it is not in preparation for a future change; it is making the pte and
> pmd cases behave the same to drop mlock requests.
>
> The code change looks good, but do add a Fixes tag and reword the
> changelog a bit before adding:
Yeah, that changelog is a bit whacked. In fact it's not a fix - because
mlock_fixup() (the only caller) already filters dax VMAs. So this is really
about fixing a possible future bug when we start having PMDs for other types of
ZONE_DEVICE pages (ie. private, coherent, etc).
So probably I should just roll this into "mm: Allow compound zone device pages".
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 22/26] device/dax: Properly refcount device dax pages when mapping
2025-01-14 6:12 ` Dan Williams
@ 2025-02-03 11:29 ` Alistair Popple
0 siblings, 0 replies; 97+ messages in thread
From: Alistair Popple @ 2025-02-03 11:29 UTC (permalink / raw)
To: Dan Williams
Cc: akpm, linux-mm, alison.schofield, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Mon, Jan 13, 2025 at 10:12:41PM -0800, Dan Williams wrote:
> Alistair Popple wrote:
> > Device DAX pages are currently not reference counted when mapped,
> > instead relying on the devmap PTE bit to ensure mapping code will not
> > get/put references. This requires special handling in various page
> > table walkers, particularly GUP, to manage references on the
> > underlying pgmap to ensure the pages remain valid.
> >
> > However there is no reason these pages can't be refcounted properly at
> > map time. Doing so eliminates the need for the devmap PTE bit,
> > freeing up a precious PTE bit. It also simplifies GUP as it no longer
> > needs to manage the special pgmap references and can instead just
> > treat the pages normally as defined by vm_normal_page().
> >
> > Signed-off-by: Alistair Popple <apopple@nvidia.com>
> > ---
> > drivers/dax/device.c | 15 +++++++++------
> > mm/memremap.c | 13 ++++++-------
> > 2 files changed, 15 insertions(+), 13 deletions(-)
> >
> > diff --git a/drivers/dax/device.c b/drivers/dax/device.c
> > index 6d74e62..fd22dbf 100644
> > --- a/drivers/dax/device.c
> > +++ b/drivers/dax/device.c
> > @@ -126,11 +126,12 @@ static vm_fault_t __dev_dax_pte_fault(struct dev_dax *dev_dax,
> > return VM_FAULT_SIGBUS;
> > }
> >
> > - pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
> > + pfn = phys_to_pfn_t(phys, 0);
> >
> > dax_set_mapping(vmf, pfn, fault_size);
> >
> > - return vmf_insert_mixed(vmf->vma, vmf->address, pfn);
> > + return vmf_insert_page_mkwrite(vmf, pfn_t_to_page(pfn),
> > + vmf->flags & FAULT_FLAG_WRITE);
> > }
> >
> > static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
> > @@ -169,11 +170,12 @@ static vm_fault_t __dev_dax_pmd_fault(struct dev_dax *dev_dax,
> > return VM_FAULT_SIGBUS;
> > }
> >
> > - pfn = phys_to_pfn_t(phys, PFN_DEV|PFN_MAP);
> > + pfn = phys_to_pfn_t(phys, 0);
> >
> > dax_set_mapping(vmf, pfn, fault_size);
> >
> > - return vmf_insert_pfn_pmd(vmf, pfn, vmf->flags & FAULT_FLAG_WRITE);
> > + return vmf_insert_folio_pmd(vmf, page_folio(pfn_t_to_page(pfn)),
> > + vmf->flags & FAULT_FLAG_WRITE);
>
> This looks suspect without initializing the compound page metadata.
I initially wondered about this too, but I think the compound page metadata
should be initialised by memmap_init_zone_device(). That said, I kind of get
lost in all the namespace/CXL/PMEM/DAX drivers in the stack, so maybe I've
overlooked something.
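Roughly what I have in mind, from my reading of mm/mm_init.c (a simplified
sketch with a made-up helper name, so treat the details as assumptions):
/*
 * When a dev_pagemap has a non-zero vmemmap_shift, memmap_init_zone_device()
 * initialises the device struct pages as a compound page of that order, so
 * page_folio()/pfn_t_to_page() in the fault path should already see
 * well-formed head and tail pages.
 */
static void sketch_init_compound_devmap(struct page *head, unsigned long head_pfn,
					struct dev_pagemap *pgmap,
					unsigned long nr_pages)
{
	unsigned long pfn;
	__SetPageHead(head);
	for (pfn = head_pfn + 1; pfn < head_pfn + nr_pages; pfn++)
		prep_compound_tail(head, pfn - head_pfn);
	prep_compound_head(head, pgmap->vmemmap_shift);
}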
> This might be getting compound pages by default with
> CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP. The device-dax unit tests are ok
> so far, but that is not super comforting until I can think about this a
> bit more... but not tonight.
From my reading of the code I don't _think_
CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP changes whether or not we get
compound pages by default, just that, if we do, some of the (tail?) struct
pages may be backed by the same physical vmemmap page.
> Might as well fix up device-dax refcounts in this series too, but I
> won't ask you to do that, will send you something to include.
Eh. That should be relatively straightforward. But then I thought that about FS
DAX too :-)
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 23/26] mm: Remove pXX_devmap callers
2025-01-15 7:27 ` Alistair Popple
@ 2025-02-04 19:06 ` Dan Williams
2025-02-05 9:57 ` Alistair Popple
0 siblings, 1 reply; 97+ messages in thread
From: Dan Williams @ 2025-02-04 19:06 UTC (permalink / raw)
To: Alistair Popple, Dan Williams
Cc: akpm, linux-mm, alison.schofield, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Alistair Popple wrote:
> On Tue, Jan 14, 2025 at 10:50:49AM -0800, Dan Williams wrote:
> > Alistair Popple wrote:
> > > The devmap PTE special bit was used to detect mappings of FS DAX
> > > pages. This tracking was required to ensure the generic mm did not
> > > manipulate the page reference counts as FS DAX implemented it's own
> > > reference counting scheme.
> > >
> > > Now that FS DAX pages have their references counted the same way as
> > > normal pages this tracking is no longer needed and can be
> > > removed.
> > >
> > > Almost all existing uses of pmd_devmap() are paired with a check of
> > > pmd_trans_huge(). As pmd_trans_huge() now returns true for FS DAX pages
> > > dropping the check in these cases doesn't change anything.
> > >
> > > However care needs to be taken because pmd_trans_huge() also checks that
> > > a page is not an FS DAX page. This is dealt with either by checking
> > > !vma_is_dax() or relying on the fact that the page pointer was obtained
> > > from a page list. This is possible because zone device pages cannot
> > > appear in any page list due to sharing page->lru with page->pgmap.
> >
> > While the patch looks straightforward I think part of taking "care" in
> > this case is to split it such that any of those careful conversions have
> > their own bisect point in the history.
> >
> > Perhaps this can move to follow-on series to not blow up the patch count
> > of the base series? ...but first want to get your reaction to splitting
> > for bisect purposes.
>
> TBH I don't feel too strongly about it - I suppose it would make it easier to
> bisect to the specific case we weren't careful enough about. However I think if
> a bug is bisected to this particular patch it would be relatively easy based on
> the context of the bug to narrow it down to a particular file or two.
>
> I do however feel strongly about whether or not that should be done in a
> follow-on series :-)
>
> Rebasing such a large series has already become painful and error prone enough
> so if we want to split this change up it will definitely need to be a separate
> series done once the rest of this has been merged. So I could be pursaded to
> roll this and the pfn_t removal (as that depends on devmap going away) together.
>
> Let me know what you think.
I tend to think that there's never any regrets for splitting a patch
along lines of risk. I am fine with keeping that in this series if that
makes things easier.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 23/26] mm: Remove pXX_devmap callers
2025-02-04 19:06 ` Dan Williams
@ 2025-02-05 9:57 ` Alistair Popple
0 siblings, 0 replies; 97+ messages in thread
From: Alistair Popple @ 2025-02-05 9:57 UTC (permalink / raw)
To: Dan Williams
Cc: akpm, linux-mm, alison.schofield, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Tue, Feb 04, 2025 at 11:06:08AM -0800, Dan Williams wrote:
> Alistair Popple wrote:
> > On Tue, Jan 14, 2025 at 10:50:49AM -0800, Dan Williams wrote:
> > > Alistair Popple wrote:
> > > > The devmap PTE special bit was used to detect mappings of FS DAX
> > > > pages. This tracking was required to ensure the generic mm did not
> > > > manipulate the page reference counts as FS DAX implemented it's own
> > > > reference counting scheme.
> > > >
> > > > Now that FS DAX pages have their references counted the same way as
> > > > normal pages this tracking is no longer needed and can be
> > > > removed.
> > > >
> > > > Almost all existing uses of pmd_devmap() are paired with a check of
> > > > pmd_trans_huge(). As pmd_trans_huge() now returns true for FS DAX pages
> > > > dropping the check in these cases doesn't change anything.
> > > >
> > > > However care needs to be taken because pmd_trans_huge() also checks that
> > > > a page is not an FS DAX page. This is dealt with either by checking
> > > > !vma_is_dax() or relying on the fact that the page pointer was obtained
> > > > from a page list. This is possible because zone device pages cannot
> > > > appear in any page list due to sharing page->lru with page->pgmap.
> > >
> > > While the patch looks straightforward I think part of taking "care" in
> > > this case is to split it such that any of those careful conversions have
> > > their own bisect point in the history.
> > >
> > > Perhaps this can move to follow-on series to not blow up the patch count
> > > of the base series? ...but first want to get your reaction to splitting
> > > for bisect purposes.
> >
> > TBH I don't feel too strongly about it - I suppose it would make it easier to
> > bisect to the specific case we weren't careful enough about. However I think if
> > a bug is bisected to this particular patch it would be relatively easy based on
> > the context of the bug to narrow it down to a particular file or two.
> >
> > I do however feel strongly about whether or not that should be done in a
> > follow-on series :-)
> >
> > Rebasing such a large series has already become painful and error prone enough
> > so if we want to split this change up it will definitely need to be a separate
> > series done once the rest of this has been merged. So I could be pursaded to
> > roll this and the pfn_t removal (as that depends on devmap going away) together.
> >
> > Let me know what you think.
>
> I tend to think that there's never any regrets for splitting a patch
> along lines of risk. I am fine with keeping that in this series if that
> makes things easier.
Yes, that is a reasonable point of view. You will notice I dropped these
clean-ups in my latest repost as I intend to post them as a separate clean-up
series to be applied on top of this one. My hope is that the clean-up series
would also make it into v6.15.
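For anyone skimming the thread, the conversions in question are mostly
mechanical and of this shape (illustrative only; handle_huge_pmd() is a
stand-in name, not a real function):
	pmd_t pmdval = pmdp_get_lockless(pmd);
	/* Before the series: FS DAX huge entries only matched via pmd_devmap() */
	if (pmd_trans_huge(pmdval) || pmd_devmap(pmdval))
		return handle_huge_pmd(vma, addr, pmd);
	/* After the series: pmd_trans_huge() is true for FS DAX entries too */
	if (pmd_trans_huge(pmdval))
		return handle_huge_pmd(vma, addr, pmd);
	/* Callers that must not see DAX folios gain an explicit VMA check */
	if (pmd_trans_huge(pmdval) && !vma_is_dax(vma))
		return handle_huge_pmd(vma, addr, pmd);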
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 01/26] fuse: Fix dax truncate/punch_hole fault path
2025-01-10 6:00 ` [PATCH v6 01/26] fuse: Fix dax truncate/punch_hole fault path Alistair Popple
@ 2025-02-05 13:03 ` Vivek Goyal
2025-02-06 0:10 ` Dan Williams
0 siblings, 1 reply; 97+ messages in thread
From: Vivek Goyal @ 2025-02-05 13:03 UTC (permalink / raw)
To: Alistair Popple
Cc: akpm, dan.j.williams, linux-mm, alison.schofield, lina,
zhang.lyra, gerald.schaefer, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, david,
peterx, linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev,
nvdimm, linux-cxl, linux-fsdevel, linux-ext4, linux-xfs,
jhubbard, hch, david, chenhuacai, kernel, loongarch,
Hanna Czenczek, German Maglione
On Fri, Jan 10, 2025 at 05:00:29PM +1100, Alistair Popple wrote:
> FS DAX requires file systems to call into the DAX layout prior to unlinking
> inodes to ensure there is no ongoing DMA or other remote access to the
> direct mapped page. The fuse file system implements
> fuse_dax_break_layouts() to do this which includes a comment indicating
> that passing dmap_end == 0 leads to unmapping of the whole file.
>
> However this is not true - passing dmap_end == 0 will not unmap anything
> before dmap_start, and further more dax_layout_busy_page_range() will not
> scan any of the range to see if there maybe ongoing DMA access to the
> range. Fix this by passing -1 for dmap_end to fuse_dax_break_layouts()
> which will invalidate the entire file range to
> dax_layout_busy_page_range().
Hi Alistair,
Thanks for fixing DAX related issues for virtiofs. I am wondering how are
you testing DAX with virtiofs. AFAIK, we don't have DAX support in Rust
virtiofsd. C version of virtiofsd used to have out of the tree patches
for DAX. But C version got deprecated long time ago.
Do you have another implementation of virtiofsd somewhere else which
supports DAX and allows for testing DAX related changes?
Thanks
Vivek
>
> Signed-off-by: Alistair Popple <apopple@nvidia.com>
> Co-developed-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> Fixes: 6ae330cad6ef ("virtiofs: serialize truncate/punch_hole and dax fault path")
> Cc: Vivek Goyal <vgoyal@redhat.com>
>
> ---
>
> Changes for v6:
>
> - Original patch had a misplaced hunk due to a bad rebase.
> - Reworked fix based on Dan's comments.
> ---
> fs/fuse/dax.c | 1 -
> fs/fuse/dir.c | 2 +-
> fs/fuse/file.c | 4 ++--
> 3 files changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/fs/fuse/dax.c b/fs/fuse/dax.c
> index 9abbc2f..455c4a1 100644
> --- a/fs/fuse/dax.c
> +++ b/fs/fuse/dax.c
> @@ -681,7 +681,6 @@ static int __fuse_dax_break_layouts(struct inode *inode, bool *retry,
> 0, 0, fuse_wait_dax_page(inode));
> }
>
> -/* dmap_end == 0 leads to unmapping of whole file */
> int fuse_dax_break_layouts(struct inode *inode, u64 dmap_start,
> u64 dmap_end)
> {
> diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
> index 0b2f856..bc6c893 100644
> --- a/fs/fuse/dir.c
> +++ b/fs/fuse/dir.c
> @@ -1936,7 +1936,7 @@ int fuse_do_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
> if (FUSE_IS_DAX(inode) && is_truncate) {
> filemap_invalidate_lock(mapping);
> fault_blocked = true;
> - err = fuse_dax_break_layouts(inode, 0, 0);
> + err = fuse_dax_break_layouts(inode, 0, -1);
> if (err) {
> filemap_invalidate_unlock(mapping);
> return err;
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 082ee37..cef7a8f 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -253,7 +253,7 @@ static int fuse_open(struct inode *inode, struct file *file)
>
> if (dax_truncate) {
> filemap_invalidate_lock(inode->i_mapping);
> - err = fuse_dax_break_layouts(inode, 0, 0);
> + err = fuse_dax_break_layouts(inode, 0, -1);
> if (err)
> goto out_inode_unlock;
> }
> @@ -2890,7 +2890,7 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
> inode_lock(inode);
> if (block_faults) {
> filemap_invalidate_lock(inode->i_mapping);
> - err = fuse_dax_break_layouts(inode, 0, 0);
> + err = fuse_dax_break_layouts(inode, 0, -1);
> if (err)
> goto out;
> }
> --
> git-series 0.9.1
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 01/26] fuse: Fix dax truncate/punch_hole fault path
2025-02-05 13:03 ` Vivek Goyal
@ 2025-02-06 0:10 ` Dan Williams
2025-02-06 12:41 ` Asahi Lina
2025-02-06 13:37 ` Vivek Goyal
0 siblings, 2 replies; 97+ messages in thread
From: Dan Williams @ 2025-02-06 0:10 UTC (permalink / raw)
To: Vivek Goyal, Alistair Popple
Cc: akpm, dan.j.williams, linux-mm, alison.schofield, lina,
zhang.lyra, gerald.schaefer, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, david,
peterx, linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev,
nvdimm, linux-cxl, linux-fsdevel, linux-ext4, linux-xfs,
jhubbard, hch, david, chenhuacai, kernel, loongarch,
Hanna Czenczek, German Maglione
Vivek Goyal wrote:
> On Fri, Jan 10, 2025 at 05:00:29PM +1100, Alistair Popple wrote:
> > FS DAX requires file systems to call into the DAX layout prior to unlinking
> > inodes to ensure there is no ongoing DMA or other remote access to the
> > direct mapped page. The fuse file system implements
> > fuse_dax_break_layouts() to do this which includes a comment indicating
> > that passing dmap_end == 0 leads to unmapping of the whole file.
> >
> > However this is not true - passing dmap_end == 0 will not unmap anything
> > before dmap_start, and further more dax_layout_busy_page_range() will not
> > scan any of the range to see if there maybe ongoing DMA access to the
> > range. Fix this by passing -1 for dmap_end to fuse_dax_break_layouts()
> > which will invalidate the entire file range to
> > dax_layout_busy_page_range().
>
> Hi Alistair,
>
> Thanks for fixing DAX related issues for virtiofs. I am wondering how are
> you testing DAX with virtiofs. AFAIK, we don't have DAX support in Rust
> virtiofsd. C version of virtiofsd used to have out of the tree patches
> for DAX. But C version got deprecated long time ago.
>
> Do you have another implementation of virtiofsd somewhere else which
> supports DAX and allows for testing DAX related changes?
I have personally never seen a virtiofs-dax test. It sounds like you are
saying we can deprecate that support if there are no longer any users.
Or, do you expect that C-virtiofsd is alive in the ecosystem?
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 01/26] fuse: Fix dax truncate/punch_hole fault path
2025-02-06 0:10 ` Dan Williams
@ 2025-02-06 12:41 ` Asahi Lina
2025-02-06 19:44 ` Dan Williams
2025-02-06 13:37 ` Vivek Goyal
1 sibling, 1 reply; 97+ messages in thread
From: Asahi Lina @ 2025-02-06 12:41 UTC (permalink / raw)
To: Dan Williams, Vivek Goyal, Alistair Popple
Cc: akpm, dan.j.williams, linux-mm, alison.schofield, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch, Hanna Czenczek,
German Maglione
Hi,
On February 6, 2025 1:10:15 AM GMT+01:00, Dan Williams <dan.j.williams@intel.com> wrote:
>Vivek Goyal wrote:
>> On Fri, Jan 10, 2025 at 05:00:29PM +1100, Alistair Popple wrote:
>> > FS DAX requires file systems to call into the DAX layout prior to unlinking
>> > inodes to ensure there is no ongoing DMA or other remote access to the
>> > direct mapped page. The fuse file system implements
>> > fuse_dax_break_layouts() to do this which includes a comment indicating
>> > that passing dmap_end == 0 leads to unmapping of the whole file.
>> >
>> > However this is not true - passing dmap_end == 0 will not unmap anything
>> > before dmap_start, and further more dax_layout_busy_page_range() will not
>> > scan any of the range to see if there maybe ongoing DMA access to the
>> > range. Fix this by passing -1 for dmap_end to fuse_dax_break_layouts()
>> > which will invalidate the entire file range to
>> > dax_layout_busy_page_range().
>>
>> Hi Alistair,
>>
>> Thanks for fixing DAX related issues for virtiofs. I am wondering how are
>> you testing DAX with virtiofs. AFAIK, we don't have DAX support in Rust
>> virtiofsd. C version of virtiofsd used to have out of the tree patches
>> for DAX. But C version got deprecated long time ago.
>>
>> Do you have another implementation of virtiofsd somewhere else which
>> supports DAX and allows for testing DAX related changes?
>
>I have personally never seen a virtiofs-dax test. It sounds like you are
>saying we can deprecate that support if there are no longer any users.
>Or, do you expect that C-virtiofsd is alive in the ecosystem?
I accidentally replied offlist, but I wanted to mention that libkrun supports DAX and we use it in muvm. It's a critical part of x11bridge functionality, since it uses DAX to share X11 shm fences between X11 clients in the VM and the XWayland server on the host, which only works if the mmaps are coherent.
(Sorry for the unwrapped reply, I'm on mobile right now.)
~~ Lina
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 01/26] fuse: Fix dax truncate/punch_hole fault path
2025-02-06 0:10 ` Dan Williams
2025-02-06 12:41 ` Asahi Lina
@ 2025-02-06 13:37 ` Vivek Goyal
2025-02-06 14:30 ` Stefan Hajnoczi
1 sibling, 1 reply; 97+ messages in thread
From: Vivek Goyal @ 2025-02-06 13:37 UTC (permalink / raw)
To: Dan Williams
Cc: Alistair Popple, akpm, linux-mm, alison.schofield, lina,
zhang.lyra, gerald.schaefer, vishal.l.verma, dave.jiang, logang,
bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, david,
peterx, linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev,
nvdimm, linux-cxl, linux-fsdevel, linux-ext4, linux-xfs,
jhubbard, hch, david, chenhuacai, kernel, loongarch,
Hanna Czenczek, German Maglione, Stefan Hajnoczi
On Wed, Feb 05, 2025 at 04:10:15PM -0800, Dan Williams wrote:
> Vivek Goyal wrote:
> > On Fri, Jan 10, 2025 at 05:00:29PM +1100, Alistair Popple wrote:
> > > FS DAX requires file systems to call into the DAX layout prior to unlinking
> > > inodes to ensure there is no ongoing DMA or other remote access to the
> > > direct mapped page. The fuse file system implements
> > > fuse_dax_break_layouts() to do this which includes a comment indicating
> > > that passing dmap_end == 0 leads to unmapping of the whole file.
> > >
> > > However this is not true - passing dmap_end == 0 will not unmap anything
> > > before dmap_start, and further more dax_layout_busy_page_range() will not
> > > scan any of the range to see if there maybe ongoing DMA access to the
> > > range. Fix this by passing -1 for dmap_end to fuse_dax_break_layouts()
> > > which will invalidate the entire file range to
> > > dax_layout_busy_page_range().
> >
> > Hi Alistair,
> >
> > Thanks for fixing DAX related issues for virtiofs. I am wondering how are
> > you testing DAX with virtiofs. AFAIK, we don't have DAX support in Rust
> > virtiofsd. C version of virtiofsd used to have out of the tree patches
> > for DAX. But C version got deprecated long time ago.
> >
> > Do you have another implementation of virtiofsd somewhere else which
> > supports DAX and allows for testing DAX related changes?
>
> I have personally never seen a virtiofs-dax test. It sounds like you are
> saying we can deprecate that support if there are no longer any users.
> Or, do you expect that C-virtiofsd is alive in the ecosystem?
Asahi Lina responded that they need and test DAX using libkrun.
The C version of virtiofsd is now gone. We are actively working on and testing
the Rust version of virtiofsd. We have not been able to add DAX support to
it yet for various reasons.
The biggest unsolved problem with virtiofsd DAX mode is that a guest process
should get a SIGBUS if it tries to access a mapping beyond the end of the
file. This can happen if the file has been truncated on the host (while it is
still mapped inside the guest).
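A hypothetical guest-side sketch of the case that needs to end in SIGBUS
(the path is made up and error handling is omitted):
	int fd = open("/mnt/virtiofs/datafile", O_RDWR);
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	/* ... the file is truncated to zero length on the host here ... */
	p[0] = 1;	/*
			 * must raise SIGBUS in the guest; delivering that fault
			 * through KVM and the DAX mapping is the hard part
			 */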
I had tried to summarize the problem in this presentation in the section
"KVM Page fault error handling".
https://kvm-forum.qemu.org/2020/KVMForum2020_APF.pdf
This is a tricky problem to handle. Once this gets handled, it becomes
safer to use DAX with virtiofs. Otherwise you can't share the filesystem
with other guests in DAX mode and use cases are limited.
And then there are challenges at QEMU level. virtiofsd needs additional
vhost-user commands to implement DAX and these never went upstream in
QEMU. I hope these challenges are sorted at some point of time.
I think virtiofs DAX is a very cool piece of technology and I would not like
to deprecate it. It has its own problems and challenges, and once we are able
to solve these, it might see wider usage/adoption.
Thanks
Vivek
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 01/26] fuse: Fix dax truncate/punch_hole fault path
2025-02-06 13:37 ` Vivek Goyal
@ 2025-02-06 14:30 ` Stefan Hajnoczi
2025-02-06 14:59 ` Albert Esteve
0 siblings, 1 reply; 97+ messages in thread
From: Stefan Hajnoczi @ 2025-02-06 14:30 UTC (permalink / raw)
To: Vivek Goyal
Cc: Dan Williams, Alistair Popple, akpm, linux-mm, alison.schofield,
lina, zhang.lyra, gerald.schaefer, vishal.l.verma, dave.jiang,
logang, bhelgaas, jack, jgg, catalin.marinas, will, mpe, npiggin,
dave.hansen, ira.weiny, willy, djwong, tytso, linmiaohe, david,
peterx, linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev,
nvdimm, linux-cxl, linux-fsdevel, linux-ext4, linux-xfs,
jhubbard, hch, david, chenhuacai, kernel, loongarch,
Hanna Czenczek, German Maglione, Albert Esteve
On Thu, Feb 06, 2025 at 08:37:07AM -0500, Vivek Goyal wrote:
> And then there are challenges at QEMU level. virtiofsd needs additional
> vhost-user commands to implement DAX and these never went upstream in
> QEMU. I hope these challenges are sorted at some point of time.
Albert Esteve has been working on QEMU support:
https://lore.kernel.org/qemu-devel/20240912145335.129447-1-aesteve@redhat.com/
He has a viable solution. I think the remaining issue is how to best
structure the memory regions. The reason for slow progress is not
because it can't be done, it's probably just because this is a
background task.
Please discuss with Albert if QEMU support is urgent.
Stefan
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 01/26] fuse: Fix dax truncate/punch_hole fault path
2025-02-06 14:30 ` Stefan Hajnoczi
@ 2025-02-06 14:59 ` Albert Esteve
2025-02-06 18:10 ` Stefan Hajnoczi
2025-02-06 18:22 ` David Hildenbrand
0 siblings, 2 replies; 97+ messages in thread
From: Albert Esteve @ 2025-02-06 14:59 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Vivek Goyal, Dan Williams, Alistair Popple, akpm, linux-mm,
alison.schofield, lina, zhang.lyra, gerald.schaefer,
vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch, david,
chenhuacai, kernel, loongarch, Hanna Czenczek, German Maglione
Hi!
On Thu, Feb 6, 2025 at 3:30 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>
> On Thu, Feb 06, 2025 at 08:37:07AM -0500, Vivek Goyal wrote:
> > And then there are challenges at QEMU level. virtiofsd needs additional
> > vhost-user commands to implement DAX and these never went upstream in
> > QEMU. I hope these challenges are sorted at some point of time.
>
> Albert Esteve has been working on QEMU support:
> https://lore.kernel.org/qemu-devel/20240912145335.129447-1-aesteve@redhat.com/
>
> He has a viable solution. I think the remaining issue is how to best
> structure the memory regions. The reason for slow progress is not
> because it can't be done, it's probably just because this is a
> background task.
It is partially that, indeed. But what has me blocked for now on posting the
next version is that I was reworking a bit the MMAP strategy.
Following David comments, I am relying more on RAMBlocks and
subregions for mmaps. But this turned out more difficult than anticipated.
I hope I can make it work this month and then post the next version.
If there are no major blockers/reworks, further iterations on the
patch shall go smoother.
I have a separate patch for the vhost-user spec which could
iterate faster, if that'd help.
BR,
Albert.
>
> Please discuss with Albert if QEMU support is urgent.
>
> Stefan
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 01/26] fuse: Fix dax truncate/punch_hole fault path
2025-02-06 14:59 ` Albert Esteve
@ 2025-02-06 18:10 ` Stefan Hajnoczi
2025-02-06 18:22 ` David Hildenbrand
1 sibling, 0 replies; 97+ messages in thread
From: Stefan Hajnoczi @ 2025-02-06 18:10 UTC (permalink / raw)
To: Albert Esteve
Cc: Vivek Goyal, Dan Williams, Alistair Popple, akpm, linux-mm,
alison.schofield, lina, zhang.lyra, gerald.schaefer,
vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch, david,
chenhuacai, kernel, loongarch, Hanna Czenczek, German Maglione
On Thu, Feb 06, 2025 at 03:59:03PM +0100, Albert Esteve wrote:
> Hi!
>
> On Thu, Feb 6, 2025 at 3:30 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Thu, Feb 06, 2025 at 08:37:07AM -0500, Vivek Goyal wrote:
> > > And then there are challenges at QEMU level. virtiofsd needs additional
> > > vhost-user commands to implement DAX and these never went upstream in
> > > QEMU. I hope these challenges are sorted at some point of time.
> >
> > Albert Esteve has been working on QEMU support:
> > https://lore.kernel.org/qemu-devel/20240912145335.129447-1-aesteve@redhat.com/
> >
> > He has a viable solution. I think the remaining issue is how to best
> > structure the memory regions. The reason for slow progress is not
> > because it can't be done, it's probably just because this is a
> > background task.
>
> It is partially that, indeed. But what has me blocked for now on posting the
> next version is that I was reworking a bit the MMAP strategy.
> Following David comments, I am relying more on RAMBlocks and
> subregions for mmaps. But this turned out more difficult than anticipated.
>
> I hope I can make it work this month and then post the next version.
> If there are no major blockers/reworks, further iterations on the
> patch shall go smoother.
>
> I have a separate patch for the vhost-user spec which could
> iterate faster, if that'd help.
Let's see if anyone needs the vhost-user spec extension now. Otherwise
it seems fine to merge it together with the implementation of that spec.
Stefan
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 01/26] fuse: Fix dax truncate/punch_hole fault path
2025-02-06 14:59 ` Albert Esteve
2025-02-06 18:10 ` Stefan Hajnoczi
@ 2025-02-06 18:22 ` David Hildenbrand
2025-02-07 16:16 ` Albert Esteve
1 sibling, 1 reply; 97+ messages in thread
From: David Hildenbrand @ 2025-02-06 18:22 UTC (permalink / raw)
To: Albert Esteve, Stefan Hajnoczi
Cc: Vivek Goyal, Dan Williams, Alistair Popple, akpm, linux-mm,
alison.schofield, lina, zhang.lyra, gerald.schaefer,
vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, peterx, linux-doc, linux-kernel,
linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl, linux-fsdevel,
linux-ext4, linux-xfs, jhubbard, hch, david, chenhuacai, kernel,
loongarch, Hanna Czenczek, German Maglione
On 06.02.25 15:59, Albert Esteve wrote:
> Hi!
>
> On Thu, Feb 6, 2025 at 3:30 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>
>> On Thu, Feb 06, 2025 at 08:37:07AM -0500, Vivek Goyal wrote:
>>> And then there are challenges at QEMU level. virtiofsd needs additional
>>> vhost-user commands to implement DAX and these never went upstream in
>>> QEMU. I hope these challenges are sorted at some point of time.
>>
>> Albert Esteve has been working on QEMU support:
>> https://lore.kernel.org/qemu-devel/20240912145335.129447-1-aesteve@redhat.com/
>>
>> He has a viable solution. I think the remaining issue is how to best
>> structure the memory regions. The reason for slow progress is not
>> because it can't be done, it's probably just because this is a
>> background task.
>
> It is partially that, indeed. But what has me blocked for now on posting the
> next version is that I was reworking a bit the MMAP strategy.
> Following David comments, I am relying more on RAMBlocks and
> subregions for mmaps. But this turned out more difficult than anticipated.
Yeah, if that turns out to be too painful, we could start with the
previous approach and work on that later. I also did not expect that to
become that complicated.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 01/26] fuse: Fix dax truncate/punch_hole fault path
2025-02-06 12:41 ` Asahi Lina
@ 2025-02-06 19:44 ` Dan Williams
2025-02-06 19:57 ` Asahi Lina
0 siblings, 1 reply; 97+ messages in thread
From: Dan Williams @ 2025-02-06 19:44 UTC (permalink / raw)
To: Asahi Lina, Dan Williams, Vivek Goyal, Alistair Popple
Cc: akpm, dan.j.williams, linux-mm, alison.schofield, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch, Hanna Czenczek,
German Maglione
Asahi Lina wrote:
> Hi,
>
> On February 6, 2025 1:10:15 AM GMT+01:00, Dan Williams <dan.j.williams@intel.com> wrote:
> >Vivek Goyal wrote:
> >> On Fri, Jan 10, 2025 at 05:00:29PM +1100, Alistair Popple wrote:
> >> > FS DAX requires file systems to call into the DAX layout prior to unlinking
> >> > inodes to ensure there is no ongoing DMA or other remote access to the
> >> > direct mapped page. The fuse file system implements
> >> > fuse_dax_break_layouts() to do this which includes a comment indicating
> >> > that passing dmap_end == 0 leads to unmapping of the whole file.
> >> >
> >> > However this is not true - passing dmap_end == 0 will not unmap anything
> >> > before dmap_start, and further more dax_layout_busy_page_range() will not
> >> > scan any of the range to see if there maybe ongoing DMA access to the
> >> > range. Fix this by passing -1 for dmap_end to fuse_dax_break_layouts()
> >> > which will invalidate the entire file range to
> >> > dax_layout_busy_page_range().
> >>
> >> Hi Alistair,
> >>
> >> Thanks for fixing DAX related issues for virtiofs. I am wondering how are
> >> you testing DAX with virtiofs. AFAIK, we don't have DAX support in Rust
> >> virtiofsd. C version of virtiofsd used to have out of the tree patches
> >> for DAX. But C version got deprecated long time ago.
> >>
> >> Do you have another implementation of virtiofsd somewhere else which
> >> supports DAX and allows for testing DAX related changes?
> >
> >I have personally never seen a virtiofs-dax test. It sounds like you are
> >saying we can deprecate that support if there are no longer any users.
> >Or, do you expect that C-virtiofsd is alive in the ecosystem?
>
> I accidentally replied offlist, but I wanted to mention that libkrun
> supports DAX and we use it in muvm. It's a critical part of x11bridge
> functionality, since it uses DAX to share X11 shm fences between X11
> clients in the VM and the XWayland server on the host, which only
> works if the mmaps are coherent.
Ah, good to hear. It would be lovely to integrate an muvm smoketest
somewhere in https://github.com/pmem/ndctl/tree/main/test so that we
have early warning on potential breakage.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 01/26] fuse: Fix dax truncate/punch_hole fault path
2025-02-06 19:44 ` Dan Williams
@ 2025-02-06 19:57 ` Asahi Lina
0 siblings, 0 replies; 97+ messages in thread
From: Asahi Lina @ 2025-02-06 19:57 UTC (permalink / raw)
To: Dan Williams, Vivek Goyal, Alistair Popple, Sergio Lopez Pascual
Cc: akpm, linux-mm, alison.schofield, zhang.lyra, gerald.schaefer,
vishal.l.verma, dave.jiang, logang, bhelgaas, jack, jgg,
catalin.marinas, will, mpe, npiggin, dave.hansen, ira.weiny,
willy, djwong, tytso, linmiaohe, david, peterx, linux-doc,
linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch, david,
chenhuacai, kernel, loongarch, Hanna Czenczek, German Maglione
On 2/7/25 4:44 AM, Dan Williams wrote:
> Asahi Lina wrote:
>> Hi,
>>
>> On February 6, 2025 1:10:15 AM GMT+01:00, Dan Williams <dan.j.williams@intel.com> wrote:
>>> Vivek Goyal wrote:
>>>> On Fri, Jan 10, 2025 at 05:00:29PM +1100, Alistair Popple wrote:
>>>>> FS DAX requires file systems to call into the DAX layout prior to unlinking
>>>>> inodes to ensure there is no ongoing DMA or other remote access to the
>>>>> direct mapped page. The fuse file system implements
>>>>> fuse_dax_break_layouts() to do this which includes a comment indicating
>>>>> that passing dmap_end == 0 leads to unmapping of the whole file.
>>>>>
>>>>> However this is not true - passing dmap_end == 0 will not unmap anything
>>>>> before dmap_start, and further more dax_layout_busy_page_range() will not
>>>>> scan any of the range to see if there maybe ongoing DMA access to the
>>>>> range. Fix this by passing -1 for dmap_end to fuse_dax_break_layouts()
>>>>> which will invalidate the entire file range to
>>>>> dax_layout_busy_page_range().
>>>>
>>>> Hi Alistair,
>>>>
>>>> Thanks for fixing DAX related issues for virtiofs. I am wondering how are
>>>> you testing DAX with virtiofs. AFAIK, we don't have DAX support in Rust
>>>> virtiofsd. C version of virtiofsd used to have out of the tree patches
>>>> for DAX. But C version got deprecated long time ago.
>>>>
>>>> Do you have another implementation of virtiofsd somewhere else which
>>>> supports DAX and allows for testing DAX related changes?
>>>
>>> I have personally never seen a virtiofs-dax test. It sounds like you are
>>> saying we can deprecate that support if there are no longer any users.
>>> Or, do you expect that C-virtiofsd is alive in the ecosystem?
>>
>> I accidentally replied offlist, but I wanted to mention that libkrun
>> supports DAX and we use it in muvm. It's a critical part of x11bridge
>> functionality, since it uses DAX to share X11 shm fences between X11
>> clients in the VM and the XWayland server on the host, which only
>> works if the mmaps are coherent.
>
> Ah, good to hear. It would be lovely to integrate an muvm smoketest
> somewhere in https://github.com/pmem/ndctl/tree/main/test so that we
> have early warning on potential breakage.
I think you'll probably want a smoke test using libkrun directly, since
muvm is quite application-specific. It's really easy to write a quick C
file to call into libkrun and spin up a VM.
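For example, something like this is enough to boot a microVM and run a command
(function names are from memory of the libkrun C API; please check libkrun.h
for the exact signatures before relying on them):
#include <libkrun.h>
int main(void)
{
	const char *const args[] = { "-c", "echo hello from the guest", NULL };
	const char *const env[] = { NULL };
	int ctx = krun_create_ctx();		/* new VM configuration context */
	if (ctx < 0)
		return 1;
	krun_set_vm_config(ctx, 1, 512);	/* 1 vCPU, 512 MiB of RAM */
	krun_set_root(ctx, "/path/to/rootfs");	/* shared with the guest via virtiofs */
	krun_set_exec(ctx, "/bin/sh", args, env); /* command to run inside the VM */
	return krun_start_enter(ctx);		/* boots the guest and runs it */
}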
If it's supposed to test an arbitrary kernel though, I'm not sure what
the test setup would look like. You'd need to run it on a host (whose
kernel is mostly irrelevant) and then use libkrun to spin up a VM with a
guest, which then runs the test. libkrun normally uses a bundled kernel
though (shipped as libkrunfw), so we'd need to add an API to specify an
external kernel binary, I guess?
I'm happy to help with that, but I'll need to know a bit more about the
intended usage first. I *think* most of the scaffolding for running
arbitrary kernels is already planned, since there was some talk of
running the host kernel as the guest kernel, so this wouldn't add much
work on top of that.
I definitely have a few tests in mind if we do put this together, since
I know of one or two things that are definitely broken in DAX upstream
right now (which I *think* this series fixes but I never got around to
testing it...).
Cc: slp for libkrun.
~~ Lina
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 21/26] fs/dax: Properly refcount fs dax pages
2025-01-14 3:35 ` Dan Williams
@ 2025-02-07 5:31 ` Alistair Popple
2025-02-07 5:50 ` Dan Williams
0 siblings, 1 reply; 97+ messages in thread
From: Alistair Popple @ 2025-02-07 5:31 UTC (permalink / raw)
To: Dan Williams
Cc: akpm, linux-mm, alison.schofield, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Mon, Jan 13, 2025 at 07:35:07PM -0800, Dan Williams wrote:
> Alistair Popple wrote:
[...]
> ...and here is that aformentioned patch:
This patch is different from what you originally posted here:
https://yhbt.net/lore/linux-s390/172721874675.497781.3277495908107141898.stgit@dwillia2-xfh.jf.intel.com/
> -- 8< --
> Subject: dcssblk: Mark DAX broken, remove FS_DAX_LIMITED support
>
> From: Dan Williams <dan.j.williams@intel.com>
>
> The dcssblk driver has long needed special case supoprt to enable
> limited dax operation, so called CONFIG_FS_DAX_LIMITED. This mode
> works around the incomplete support for ZONE_DEVICE on s390 by forgoing
> the ability of dax-mapped pages to support GUP.
>
> Now, pending cleanups to fsdax that fix its reference counting [1] depend on
> the ability of all dax drivers to supply ZONE_DEVICE pages.
>
> To allow that work to move forward, dax support needs to be paused for
> dcssblk until ZONE_DEVICE support arrives. That work has been known for
> a few years [2], and the removal of "pte_devmap" requirements [3] makes the
> conversion easier.
>
> For now, place the support behind CONFIG_BROKEN, and remove PFN_SPECIAL
> (dcssblk was the only user).
Specifically it no longer removes PFN_SPECIAL. Was this intentional? Or should I
really have picked up the original patch from the mailing list?
- Alistair
> Link: http://lore.kernel.org/cover.9f0e45d52f5cff58807831b6b867084d0b14b61c.1725941415.git-series.apopple@nvidia.com [1]
> Link: http://lore.kernel.org/20210820210318.187742e8@thinkpad/ [2]
> Link: http://lore.kernel.org/4511465a4f8429f45e2ac70d2e65dc5e1df1eb47.1725941415.git-series.apopple@nvidia.com [3]
> Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
> Tested-by: Alexander Gordeev <agordeev@linux.ibm.com>
> Acked-by: David Hildenbrand <david@redhat.com>
> Cc: Heiko Carstens <hca@linux.ibm.com>
> Cc: Vasily Gorbik <gor@linux.ibm.com>
> Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
> Cc: Sven Schnelle <svens@linux.ibm.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Christoph Hellwig <hch@lst.de>
> Cc: Alistair Popple <apopple@nvidia.com>
> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
> ---
> Documentation/filesystems/dax.rst | 1 -
> drivers/s390/block/Kconfig | 12 ++++++++++--
> drivers/s390/block/dcssblk.c | 27 +++++++++++++++++----------
> 3 files changed, 27 insertions(+), 13 deletions(-)
>
> diff --git a/Documentation/filesystems/dax.rst b/Documentation/filesystems/dax.rst
> index 719e90f1988e..08dd5e254cc5 100644
> --- a/Documentation/filesystems/dax.rst
> +++ b/Documentation/filesystems/dax.rst
> @@ -207,7 +207,6 @@ implement direct_access.
>
> These block devices may be used for inspiration:
> - brd: RAM backed block device driver
> -- dcssblk: s390 dcss block device driver
> - pmem: NVDIMM persistent memory driver
>
>
> diff --git a/drivers/s390/block/Kconfig b/drivers/s390/block/Kconfig
> index e3710a762aba..4bfe469c04aa 100644
> --- a/drivers/s390/block/Kconfig
> +++ b/drivers/s390/block/Kconfig
> @@ -4,13 +4,21 @@ comment "S/390 block device drivers"
>
> config DCSSBLK
> def_tristate m
> - select FS_DAX_LIMITED
> - select DAX
> prompt "DCSSBLK support"
> depends on S390 && BLOCK
> help
> Support for dcss block device
>
> +config DCSSBLK_DAX
> + def_bool y
> + depends on DCSSBLK
> + # requires S390 ZONE_DEVICE support
> + depends on BROKEN
> + select DAX
> + prompt "DCSSBLK DAX support"
> + help
> + Enable DAX operation for the dcss block device
> +
> config DASD
> def_tristate y
> prompt "Support for DASD devices"
> diff --git a/drivers/s390/block/dcssblk.c b/drivers/s390/block/dcssblk.c
> index 0f14d279d30b..7248e547fefb 100644
> --- a/drivers/s390/block/dcssblk.c
> +++ b/drivers/s390/block/dcssblk.c
> @@ -534,6 +534,21 @@ static const struct attribute_group *dcssblk_dev_attr_groups[] = {
> NULL,
> };
>
> +static int dcssblk_setup_dax(struct dcssblk_dev_info *dev_info)
> +{
> + struct dax_device *dax_dev;
> +
> + if (!IS_ENABLED(CONFIG_DCSSBLK_DAX))
> + return 0;
> +
> + dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
> + if (IS_ERR(dax_dev))
> + return PTR_ERR(dax_dev);
> + set_dax_synchronous(dax_dev);
> + dev_info->dax_dev = dax_dev;
> + return dax_add_host(dev_info->dax_dev, dev_info->gd);
> +}
> +
> /*
> * device attribute for adding devices
> */
> @@ -547,7 +562,6 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
> int rc, i, j, num_of_segments;
> struct dcssblk_dev_info *dev_info;
> struct segment_info *seg_info, *temp;
> - struct dax_device *dax_dev;
> char *local_buf;
> unsigned long seg_byte_size;
>
> @@ -674,14 +688,7 @@ dcssblk_add_store(struct device *dev, struct device_attribute *attr, const char
> if (rc)
> goto put_dev;
>
> - dax_dev = alloc_dax(dev_info, &dcssblk_dax_ops);
> - if (IS_ERR(dax_dev)) {
> - rc = PTR_ERR(dax_dev);
> - goto put_dev;
> - }
> - set_dax_synchronous(dax_dev);
> - dev_info->dax_dev = dax_dev;
> - rc = dax_add_host(dev_info->dax_dev, dev_info->gd);
> + rc = dcssblk_setup_dax(dev_info);
> if (rc)
> goto out_dax;
>
> @@ -917,7 +924,7 @@ __dcssblk_direct_access(struct dcssblk_dev_info *dev_info, pgoff_t pgoff,
> *kaddr = __va(dev_info->start + offset);
> if (pfn)
> *pfn = __pfn_to_pfn_t(PFN_DOWN(dev_info->start + offset),
> - PFN_DEV|PFN_SPECIAL);
> + PFN_DEV);
>
> return (dev_sz - offset) / PAGE_SIZE;
> }
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 21/26] fs/dax: Properly refcount fs dax pages
2025-02-07 5:31 ` Alistair Popple
@ 2025-02-07 5:50 ` Dan Williams
2025-02-09 23:35 ` Alistair Popple
0 siblings, 1 reply; 97+ messages in thread
From: Dan Williams @ 2025-02-07 5:50 UTC (permalink / raw)
To: Alistair Popple, Dan Williams
Cc: akpm, linux-mm, alison.schofield, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
Alistair Popple wrote:
> On Mon, Jan 13, 2025 at 07:35:07PM -0800, Dan Williams wrote:
> > Alistair Popple wrote:
>
> [...]
>
> > ...and here is that aformentioned patch:
>
> This patch is different from what you originally posted here:
> https://yhbt.net/lore/linux-s390/172721874675.497781.3277495908107141898.stgit@dwillia2-xfh.jf.intel.com/
>
> > -- 8< --
> > Subject: dcssblk: Mark DAX broken, remove FS_DAX_LIMITED support
> >
> > From: Dan Williams <dan.j.williams@intel.com>
> >
> > The dcssblk driver has long needed special case supoprt to enable
> > limited dax operation, so called CONFIG_FS_DAX_LIMITED. This mode
> > works around the incomplete support for ZONE_DEVICE on s390 by forgoing
> > the ability of dax-mapped pages to support GUP.
> >
> > Now, pending cleanups to fsdax that fix its reference counting [1] depend on
> > the ability of all dax drivers to supply ZONE_DEVICE pages.
> >
> > To allow that work to move forward, dax support needs to be paused for
> > dcssblk until ZONE_DEVICE support arrives. That work has been known for
> > a few years [2], and the removal of "pte_devmap" requirements [3] makes the
> > conversion easier.
> >
> > For now, place the support behind CONFIG_BROKEN, and remove PFN_SPECIAL
> > (dcssblk was the only user).
>
> Specifically it no longer removes PFN_SPECIAL. Was this intentional? Or should I
> really have picked up the original patch from the mailing list?
I think this patch that only removes the dccsblk usage of PFN_SPECIAL is
sufficient. Leave the rest to the pfn_t cleanup.
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 01/26] fuse: Fix dax truncate/punch_hole fault path
2025-02-06 18:22 ` David Hildenbrand
@ 2025-02-07 16:16 ` Albert Esteve
0 siblings, 0 replies; 97+ messages in thread
From: Albert Esteve @ 2025-02-07 16:16 UTC (permalink / raw)
To: David Hildenbrand
Cc: Stefan Hajnoczi, Vivek Goyal, Dan Williams, Alistair Popple,
akpm, linux-mm, alison.schofield, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, peterx, linux-doc,
linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm, linux-cxl,
linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch, david,
chenhuacai, kernel, loongarch, Hanna Czenczek, German Maglione
On Thu, Feb 6, 2025 at 7:22 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 06.02.25 15:59, Albert Esteve wrote:
> > Hi!
> >
> > On Thu, Feb 6, 2025 at 3:30 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >>
> >> On Thu, Feb 06, 2025 at 08:37:07AM -0500, Vivek Goyal wrote:
> >>> And then there are challenges at QEMU level. virtiofsd needs additional
> >>> vhost-user commands to implement DAX and these never went upstream in
> >>> QEMU. I hope these challenges are sorted at some point of time.
> >>
> >> Albert Esteve has been working on QEMU support:
> >> https://lore.kernel.org/qemu-devel/20240912145335.129447-1-aesteve@redhat.com/
> >>
> >> He has a viable solution. I think the remaining issue is how to best
> >> structure the memory regions. The reason for slow progress is not
> >> because it can't be done, it's probably just because this is a
> >> background task.
> >
> > It is partially that, indeed. But what has me blocked for now on posting the
> > next version is that I was reworking a bit the MMAP strategy.
> > Following David comments, I am relying more on RAMBlocks and
> > subregions for mmaps. But this turned out more difficult than anticipated.
>
> Yeah, if that turns out to be too painful, we could start with the
> previous approach and work on that later. I also did not expect that to
> become that complicated.
Thanks. I'd like to do it properly, so I think will try a bit more to get it
to work. Maybe another week. If I do not manage, I may do
what you suggested (I'll align with you first) to move the patch forward.
That said, if I end up doing that, I will definitively revisit it later.
BR,
Albert.
>
> --
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 97+ messages in thread
* Re: [PATCH v6 21/26] fs/dax: Properly refcount fs dax pages
2025-02-07 5:50 ` Dan Williams
@ 2025-02-09 23:35 ` Alistair Popple
0 siblings, 0 replies; 97+ messages in thread
From: Alistair Popple @ 2025-02-09 23:35 UTC (permalink / raw)
To: Dan Williams
Cc: akpm, linux-mm, alison.schofield, lina, zhang.lyra,
gerald.schaefer, vishal.l.verma, dave.jiang, logang, bhelgaas,
jack, jgg, catalin.marinas, will, mpe, npiggin, dave.hansen,
ira.weiny, willy, djwong, tytso, linmiaohe, david, peterx,
linux-doc, linux-kernel, linux-arm-kernel, linuxppc-dev, nvdimm,
linux-cxl, linux-fsdevel, linux-ext4, linux-xfs, jhubbard, hch,
david, chenhuacai, kernel, loongarch
On Thu, Feb 06, 2025 at 09:50:07PM -0800, Dan Williams wrote:
> Alistair Popple wrote:
> > On Mon, Jan 13, 2025 at 07:35:07PM -0800, Dan Williams wrote:
> > > Alistair Popple wrote:
> >
> > [...]
> >
> > > ...and here is that aformentioned patch:
> >
> > This patch is different from what you originally posted here:
> > https://yhbt.net/lore/linux-s390/172721874675.497781.3277495908107141898.stgit@dwillia2-xfh.jf.intel.com/
> >
> > > -- 8< --
> > > Subject: dcssblk: Mark DAX broken, remove FS_DAX_LIMITED support
> > >
> > > From: Dan Williams <dan.j.williams@intel.com>
> > >
> > > The dcssblk driver has long needed special case supoprt to enable
> > > limited dax operation, so called CONFIG_FS_DAX_LIMITED. This mode
> > > works around the incomplete support for ZONE_DEVICE on s390 by forgoing
> > > the ability of dax-mapped pages to support GUP.
> > >
> > > Now, pending cleanups to fsdax that fix its reference counting [1] depend on
> > > the ability of all dax drivers to supply ZONE_DEVICE pages.
> > >
> > > To allow that work to move forward, dax support needs to be paused for
> > > dcssblk until ZONE_DEVICE support arrives. That work has been known for
> > > a few years [2], and the removal of "pte_devmap" requirements [3] makes the
> > > conversion easier.
> > >
> > > For now, place the support behind CONFIG_BROKEN, and remove PFN_SPECIAL
> > > (dcssblk was the only user).
> >
> > Specifically it no longer removes PFN_SPECIAL. Was this intentional? Or should I
> > really have picked up the original patch from the mailing list?
>
> I think this patch that only removes the dccsblk usage of PFN_SPECIAL is
> sufficient. Leave the rest to the pfn_t cleanup.
Makes sense. I noticed it when rebasing the pfn_t cleanup because previously it
did remove PFN_SPECIAL, so I was just wondering if it was intentional. I will
add a patch removing PFN_SPECIAL to the pfn_t/pXX_devmap cleanup series I'm
writing now.
^ permalink raw reply [flat|nested] 97+ messages in thread
end of thread
Thread overview: 97+ messages
2025-01-10 6:00 [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
2025-01-10 6:00 ` [PATCH v6 01/26] fuse: Fix dax truncate/punch_hole fault path Alistair Popple
2025-02-05 13:03 ` Vivek Goyal
2025-02-06 0:10 ` Dan Williams
2025-02-06 12:41 ` Asahi Lina
2025-02-06 19:44 ` Dan Williams
2025-02-06 19:57 ` Asahi Lina
2025-02-06 13:37 ` Vivek Goyal
2025-02-06 14:30 ` Stefan Hajnoczi
2025-02-06 14:59 ` Albert Esteve
2025-02-06 18:10 ` Stefan Hajnoczi
2025-02-06 18:22 ` David Hildenbrand
2025-02-07 16:16 ` Albert Esteve
2025-01-10 6:00 ` [PATCH v6 02/26] fs/dax: Return unmapped busy pages from dax_layout_busy_page_range() Alistair Popple
2025-01-10 6:00 ` [PATCH v6 03/26] fs/dax: Don't skip locked entries when scanning entries Alistair Popple
2025-01-10 6:00 ` [PATCH v6 04/26] fs/dax: Refactor wait for dax idle page Alistair Popple
2025-01-10 6:00 ` [PATCH v6 05/26] fs/dax: Create a common implementation to break DAX layouts Alistair Popple
2025-01-10 16:44 ` Darrick J. Wong
2025-01-13 0:47 ` Alistair Popple
2025-01-13 2:47 ` Darrick J. Wong
2025-01-13 20:11 ` Dan Williams
2025-01-13 23:06 ` Dan Williams
2025-01-14 0:19 ` Dan Williams
2025-01-10 6:00 ` [PATCH v6 06/26] fs/dax: Always remove DAX page-cache entries when breaking layouts Alistair Popple
2025-01-13 23:31 ` Dan Williams
2025-01-10 6:00 ` [PATCH v6 07/26] fs/dax: Ensure all pages are idle prior to filesystem unmount Alistair Popple
2025-01-10 16:50 ` Darrick J. Wong
2025-01-13 0:57 ` Alistair Popple
2025-01-13 2:49 ` Darrick J. Wong
2025-01-13 5:48 ` Alistair Popple
2025-01-13 16:39 ` Darrick J. Wong
2025-01-13 23:42 ` Dan Williams
2025-01-10 6:00 ` [PATCH v6 08/26] fs/dax: Remove PAGE_MAPPING_DAX_SHARED mapping flag Alistair Popple
2025-01-14 0:52 ` Dan Williams
2025-01-15 5:32 ` Alistair Popple
2025-01-15 5:44 ` Dan Williams
2025-01-17 0:54 ` Alistair Popple
2025-01-14 14:47 ` David Hildenbrand
2025-01-10 6:00 ` [PATCH v6 09/26] mm/gup: Remove redundant check for PCI P2PDMA page Alistair Popple
2025-01-10 6:00 ` [PATCH v6 10/26] mm/mm_init: Move p2pdma page refcount initialisation to p2pdma Alistair Popple
2025-01-14 14:51 ` David Hildenbrand
2025-01-10 6:00 ` [PATCH v6 11/26] mm: Allow compound zone device pages Alistair Popple
2025-01-14 14:59 ` David Hildenbrand
2025-01-17 1:05 ` Alistair Popple
2025-01-10 6:00 ` [PATCH v6 12/26] mm/memory: Enhance insert_page_into_pte_locked() to create writable mappings Alistair Popple
2025-01-14 15:03 ` David Hildenbrand
[not found] ` <6785b90f300d8_20fa29465@dwillia2-xfh.jf.intel.com.notmuch>
2025-01-15 5:36 ` Alistair Popple
2025-01-10 6:00 ` [PATCH v6 13/26] mm/memory: Add vmf_insert_page_mkwrite() Alistair Popple
2025-01-14 16:15 ` David Hildenbrand
2025-01-15 6:13 ` Alistair Popple
2025-01-10 6:00 ` [PATCH v6 14/26] rmap: Add support for PUD sized mappings to rmap Alistair Popple
2025-01-14 1:21 ` Dan Williams
2025-01-10 6:00 ` [PATCH v6 15/26] huge_memory: Add vmf_insert_folio_pud() Alistair Popple
2025-01-14 1:27 ` Dan Williams
2025-01-14 16:22 ` David Hildenbrand
2025-01-15 6:38 ` Alistair Popple
2025-01-10 6:00 ` [PATCH v6 16/26] huge_memory: Add vmf_insert_folio_pmd() Alistair Popple
2025-01-14 2:04 ` Dan Williams
2025-01-14 16:40 ` David Hildenbrand
2025-01-14 17:22 ` Dan Williams
2025-01-15 7:05 ` Alistair Popple
2025-01-10 6:00 ` [PATCH v6 17/26] memremap: Add is_devdax_page() and is_fsdax_page() helpers Alistair Popple
2025-01-14 2:05 ` Dan Williams
2025-01-10 6:00 ` [PATCH v6 18/26] mm/gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages Alistair Popple
2025-01-14 2:16 ` Dan Williams
2025-01-10 6:00 ` [PATCH v6 19/26] proc/task_mmu: Mark devdax and fsdax pages as always unpinned Alistair Popple
2025-01-14 2:28 ` Dan Williams
2025-01-14 16:45 ` David Hildenbrand
2025-01-17 1:28 ` Alistair Popple
2025-01-10 6:00 ` [PATCH v6 20/26] mm/mlock: Skip ZONE_DEVICE PMDs during mlock Alistair Popple
2025-01-14 2:42 ` Dan Williams
2025-01-17 1:54 ` Alistair Popple
2025-01-10 6:00 ` [PATCH v6 21/26] fs/dax: Properly refcount fs dax pages Alistair Popple
2025-01-10 16:54 ` Darrick J. Wong
2025-01-13 3:18 ` Alistair Popple
2025-01-14 3:35 ` Dan Williams
2025-02-07 5:31 ` Alistair Popple
2025-02-07 5:50 ` Dan Williams
2025-02-09 23:35 ` Alistair Popple
2025-01-10 6:00 ` [PATCH v6 22/26] device/dax: Properly refcount device dax pages when mapping Alistair Popple
2025-01-14 6:12 ` Dan Williams
2025-02-03 11:29 ` Alistair Popple
2025-01-10 6:00 ` [PATCH v6 23/26] mm: Remove pXX_devmap callers Alistair Popple
2025-01-14 18:50 ` Dan Williams
2025-01-15 7:27 ` Alistair Popple
2025-02-04 19:06 ` Dan Williams
2025-02-05 9:57 ` Alistair Popple
2025-01-10 6:00 ` [PATCH v6 24/26] mm: Remove devmap related functions and page table bits Alistair Popple
2025-01-11 10:08 ` Huacai Chen
2025-01-14 19:03 ` Dan Williams
2025-01-10 6:00 ` [PATCH v6 25/26] Revert "riscv: mm: Add support for ZONE_DEVICE" Alistair Popple
2025-01-14 19:11 ` Dan Williams
2025-01-10 6:00 ` [PATCH v6 26/26] Revert "LoongArch: Add ARCH_HAS_PTE_DEVMAP support" Alistair Popple
2025-01-10 7:05 ` [PATCH v6 00/26] fs/dax: Fix ZONE_DEVICE page reference counts Dan Williams
2025-01-11 1:30 ` Andrew Morton
2025-01-11 3:35 ` Dan Williams
2025-01-13 1:05 ` Alistair Popple