From: Alistair Popple <apopple@nvidia.com>
To: akpm@linux-foundation.org, dan.j.williams@intel.com, linux-mm@kvack.org
Cc: Alistair Popple <apopple@nvidia.com>,
Alison Schofield <alison.schofield@intel.com>,
lina@asahilina.net, zhang.lyra@gmail.com,
gerald.schaefer@linux.ibm.com, vishal.l.verma@intel.com,
dave.jiang@intel.com, logang@deltatee.com, bhelgaas@google.com,
jack@suse.cz, jgg@ziepe.ca, catalin.marinas@arm.com,
will@kernel.org, mpe@ellerman.id.au, npiggin@gmail.com,
dave.hansen@linux.intel.com, ira.weiny@intel.com,
willy@infradead.org, djwong@kernel.org, tytso@mit.edu,
linmiaohe@huawei.com, david@redhat.com, peterx@redhat.com,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linuxppc-dev@lists.ozlabs.org, nvdimm@lists.linux.dev,
linux-cxl@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org,
jhubbard@nvidia.com, hch@lst.de, david@fromorbit.com,
chenhuacai@kernel.org, kernel@xen0n.name,
loongarch@lists.linux.dev, Balbir Singh <balbirs@nvidia.com>,
Jason Gunthorpe <jgg@nvidia.com>
Subject: [PATCH v9 11/20] mm: Allow compound zone device pages
Date: Fri, 28 Feb 2025 14:31:06 +1100 [thread overview]
Message-ID: <67055d772e6102accf85161d0b57b0b3944292bf.1740713401.git-series.apopple@nvidia.com> (raw)
In-Reply-To: <cover.8068ad144a7eea4a813670301f4d2a86a8e68ec4.1740713401.git-series.apopple@nvidia.com>
Zone device pages are used to represent various type of device memory
managed by device drivers. Currently compound zone device pages are
not supported. This is because MEMORY_DEVICE_FS_DAX pages are the only
user of higher order zone device pages and have their own page
reference counting.
A future change will unify FS DAX reference counting with normal page
reference counting rules and remove the special FS DAX reference
counting. Supporting that requires compound zone device pages.
Supporting compound zone device pages requires compound_head() to
distinguish between head and tail pages whilst still preserving the
special struct page fields that are specific to zone device pages.
A tail page is distinguished by having bit zero being set in
page->compound_head, with the remaining bits pointing to the head
page. For zone device pages page->compound_head is shared with
page->pgmap.
The page->pgmap field must be common to all pages within a folio, even
if the folio spans memory sections. Therefore pgmap is the same for
both head and tail pages and can be moved into the folio and we can
use the standard scheme to find compound_head from a tail page.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
Changes for v9:
- Fixes from Balbir
Changes for v7:
- Skip ZONE_DEVICE PMDs during mlock which was previously a separate
patch.
Changes for v4:
- Fix build breakages reported by kernel test robot
Changes since v2:
- Indentation fix
- Rename page_dev_pagemap() to page_pgmap()
- Rename folio _unused field to _unused_pgmap_compound_head
- s/WARN_ON/VM_WARN_ON_ONCE_PAGE/
Changes since v1:
- Move pgmap to the folio as suggested by Matthew Wilcox
---
drivers/gpu/drm/nouveau/nouveau_dmem.c | 3 ++-
drivers/pci/p2pdma.c | 6 +++---
include/linux/memremap.h | 6 +++---
include/linux/migrate.h | 4 ++--
include/linux/mm_types.h | 9 +++++++--
include/linux/mmzone.h | 12 +++++++++++-
lib/test_hmm.c | 3 ++-
mm/hmm.c | 2 +-
mm/memory.c | 4 +++-
mm/memremap.c | 14 +++++++-------
mm/migrate_device.c | 18 ++++++++++++------
mm/mlock.c | 2 ++
mm/mm_init.c | 2 +-
13 files changed, 56 insertions(+), 29 deletions(-)
diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index 1a07256..61d0f41 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -88,7 +88,8 @@ struct nouveau_dmem {
static struct nouveau_dmem_chunk *nouveau_page_to_chunk(struct page *page)
{
- return container_of(page->pgmap, struct nouveau_dmem_chunk, pagemap);
+ return container_of(page_pgmap(page), struct nouveau_dmem_chunk,
+ pagemap);
}
static struct nouveau_drm *page_to_drm(struct page *page)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 04773a8..19214ec 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -202,7 +202,7 @@ static const struct attribute_group p2pmem_group = {
static void p2pdma_page_free(struct page *page)
{
- struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page->pgmap);
+ struct pci_p2pdma_pagemap *pgmap = to_p2p_pgmap(page_pgmap(page));
/* safe to dereference while a reference is held to the percpu ref */
struct pci_p2pdma *p2pdma =
rcu_dereference_protected(pgmap->provider->p2pdma, 1);
@@ -1025,8 +1025,8 @@ enum pci_p2pdma_map_type
pci_p2pdma_map_segment(struct pci_p2pdma_map_state *state, struct device *dev,
struct scatterlist *sg)
{
- if (state->pgmap != sg_page(sg)->pgmap) {
- state->pgmap = sg_page(sg)->pgmap;
+ if (state->pgmap != page_pgmap(sg_page(sg))) {
+ state->pgmap = page_pgmap(sg_page(sg));
state->map = pci_p2pdma_map_type(state->pgmap, dev);
state->bus_off = to_p2p_pgmap(state->pgmap)->bus_offset;
}
diff --git a/include/linux/memremap.h b/include/linux/memremap.h
index 3f7143a..0256a42 100644
--- a/include/linux/memremap.h
+++ b/include/linux/memremap.h
@@ -161,7 +161,7 @@ static inline bool is_device_private_page(const struct page *page)
{
return IS_ENABLED(CONFIG_DEVICE_PRIVATE) &&
is_zone_device_page(page) &&
- page->pgmap->type == MEMORY_DEVICE_PRIVATE;
+ page_pgmap(page)->type == MEMORY_DEVICE_PRIVATE;
}
static inline bool folio_is_device_private(const struct folio *folio)
@@ -173,13 +173,13 @@ static inline bool is_pci_p2pdma_page(const struct page *page)
{
return IS_ENABLED(CONFIG_PCI_P2PDMA) &&
is_zone_device_page(page) &&
- page->pgmap->type == MEMORY_DEVICE_PCI_P2PDMA;
+ page_pgmap(page)->type == MEMORY_DEVICE_PCI_P2PDMA;
}
static inline bool is_device_coherent_page(const struct page *page)
{
return is_zone_device_page(page) &&
- page->pgmap->type == MEMORY_DEVICE_COHERENT;
+ page_pgmap(page)->type == MEMORY_DEVICE_COHERENT;
}
static inline bool folio_is_device_coherent(const struct folio *folio)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 29919fa..61899ec 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -205,8 +205,8 @@ struct migrate_vma {
unsigned long end;
/*
- * Set to the owner value also stored in page->pgmap->owner for
- * migrating out of device private memory. The flags also need to
+ * Set to the owner value also stored in page_pgmap(page)->owner
+ * for migrating out of device private memory. The flags also need to
* be set to MIGRATE_VMA_SELECT_DEVICE_PRIVATE.
* The caller should always set this field when using mmu notifier
* callbacks to avoid device MMU invalidations for device private
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 369f76a..6f2d6bb 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -130,8 +130,11 @@ struct page {
unsigned long compound_head; /* Bit zero is set */
};
struct { /* ZONE_DEVICE pages */
- /** @pgmap: Points to the hosting device page map. */
- struct dev_pagemap *pgmap;
+ /*
+ * The first word is used for compound_head or folio
+ * pgmap
+ */
+ void *_unused_pgmap_compound_head;
void *zone_device_data;
/*
* ZONE_DEVICE private pages are counted as being
@@ -300,6 +303,7 @@ typedef struct {
* @_refcount: Do not access this member directly. Use folio_ref_count()
* to find how many references there are to this folio.
* @memcg_data: Memory Control Group data.
+ * @pgmap: Metadata for ZONE_DEVICE mappings
* @virtual: Virtual address in the kernel direct map.
* @_last_cpupid: IDs of last CPU and last process that accessed the folio.
* @_entire_mapcount: Do not use directly, call folio_entire_mapcount().
@@ -338,6 +342,7 @@ struct folio {
/* private: */
};
/* public: */
+ struct dev_pagemap *pgmap;
};
struct address_space *mapping;
pgoff_t index;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9540b41..8aecbbb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1158,6 +1158,12 @@ static inline bool is_zone_device_page(const struct page *page)
return page_zonenum(page) == ZONE_DEVICE;
}
+static inline struct dev_pagemap *page_pgmap(const struct page *page)
+{
+ VM_WARN_ON_ONCE_PAGE(!is_zone_device_page(page), page);
+ return page_folio(page)->pgmap;
+}
+
/*
* Consecutive zone device pages should not be merged into the same sgl
* or bvec segment with other types of pages or if they belong to different
@@ -1173,7 +1179,7 @@ static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
return false;
if (!is_zone_device_page(a))
return true;
- return a->pgmap == b->pgmap;
+ return page_pgmap(a) == page_pgmap(b);
}
extern void memmap_init_zone_device(struct zone *, unsigned long,
@@ -1188,6 +1194,10 @@ static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
{
return true;
}
+static inline struct dev_pagemap *page_pgmap(const struct page *page)
+{
+ return NULL;
+}
#endif
static inline bool folio_is_zone_device(const struct folio *folio)
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index e4afca8..155b18c 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -195,7 +195,8 @@ static int dmirror_fops_release(struct inode *inode, struct file *filp)
static struct dmirror_chunk *dmirror_page_to_chunk(struct page *page)
{
- return container_of(page->pgmap, struct dmirror_chunk, pagemap);
+ return container_of(page_pgmap(page), struct dmirror_chunk,
+ pagemap);
}
static struct dmirror_device *dmirror_page_to_device(struct page *page)
diff --git a/mm/hmm.c b/mm/hmm.c
index 7e0229a..082f7b7 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -248,7 +248,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
* just report the PFN.
*/
if (is_device_private_entry(entry) &&
- pfn_swap_entry_to_page(entry)->pgmap->owner ==
+ page_pgmap(pfn_swap_entry_to_page(entry))->owner ==
range->dev_private_owner) {
cpu_flags = HMM_PFN_VALID;
if (is_writable_device_private_entry(entry))
diff --git a/mm/memory.c b/mm/memory.c
index d337eab..905ed2f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4316,6 +4316,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
vmf->page = pfn_swap_entry_to_page(entry);
ret = remove_device_exclusive_entry(vmf);
} else if (is_device_private_entry(entry)) {
+ struct dev_pagemap *pgmap;
if (vmf->flags & FAULT_FLAG_VMA_LOCK) {
/*
* migrate_to_ram is not yet ready to operate
@@ -4340,7 +4341,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
*/
get_page(vmf->page);
pte_unmap_unlock(vmf->pte, vmf->ptl);
- ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
+ pgmap = page_pgmap(vmf->page);
+ ret = pgmap->ops->migrate_to_ram(vmf);
put_page(vmf->page);
} else if (is_hwpoison_entry(entry)) {
ret = VM_FAULT_HWPOISON;
diff --git a/mm/memremap.c b/mm/memremap.c
index 07bbe0e..68099af 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -458,8 +458,8 @@ EXPORT_SYMBOL_GPL(get_dev_pagemap);
void free_zone_device_folio(struct folio *folio)
{
- if (WARN_ON_ONCE(!folio->page.pgmap->ops ||
- !folio->page.pgmap->ops->page_free))
+ if (WARN_ON_ONCE(!folio->pgmap->ops ||
+ !folio->pgmap->ops->page_free))
return;
mem_cgroup_uncharge(folio);
@@ -486,12 +486,12 @@ void free_zone_device_folio(struct folio *folio)
* to clear folio->mapping.
*/
folio->mapping = NULL;
- folio->page.pgmap->ops->page_free(folio_page(folio, 0));
+ folio->pgmap->ops->page_free(folio_page(folio, 0));
- switch (folio->page.pgmap->type) {
+ switch (folio->pgmap->type) {
case MEMORY_DEVICE_PRIVATE:
case MEMORY_DEVICE_COHERENT:
- put_dev_pagemap(folio->page.pgmap);
+ put_dev_pagemap(folio->pgmap);
break;
case MEMORY_DEVICE_FS_DAX:
@@ -514,7 +514,7 @@ void zone_device_page_init(struct page *page)
* Drivers shouldn't be allocating pages after calling
* memunmap_pages().
*/
- WARN_ON_ONCE(!percpu_ref_tryget_live(&page->pgmap->ref));
+ WARN_ON_ONCE(!percpu_ref_tryget_live(&page_pgmap(page)->ref));
set_page_count(page, 1);
lock_page(page);
}
@@ -523,7 +523,7 @@ EXPORT_SYMBOL_GPL(zone_device_page_init);
#ifdef CONFIG_FS_DAX
bool __put_devmap_managed_folio_refs(struct folio *folio, int refs)
{
- if (folio->page.pgmap->type != MEMORY_DEVICE_FS_DAX)
+ if (folio->pgmap->type != MEMORY_DEVICE_FS_DAX)
return false;
/*
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 5bd8882..7d0d64f 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -106,6 +106,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
arch_enter_lazy_mmu_mode();
for (; addr < end; addr += PAGE_SIZE, ptep++) {
+ struct dev_pagemap *pgmap;
unsigned long mpfn = 0, pfn;
struct folio *folio;
struct page *page;
@@ -133,9 +134,10 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
goto next;
page = pfn_swap_entry_to_page(entry);
+ pgmap = page_pgmap(page);
if (!(migrate->flags &
MIGRATE_VMA_SELECT_DEVICE_PRIVATE) ||
- page->pgmap->owner != migrate->pgmap_owner)
+ pgmap->owner != migrate->pgmap_owner)
goto next;
mpfn = migrate_pfn(page_to_pfn(page)) |
@@ -152,12 +154,16 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
}
page = vm_normal_page(migrate->vma, addr, pte);
if (page && !is_zone_device_page(page) &&
- !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM))
- goto next;
- else if (page && is_device_coherent_page(page) &&
- (!(migrate->flags & MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
- page->pgmap->owner != migrate->pgmap_owner))
+ !(migrate->flags & MIGRATE_VMA_SELECT_SYSTEM)) {
goto next;
+ } else if (page && is_device_coherent_page(page)) {
+ pgmap = page_pgmap(page);
+
+ if (!(migrate->flags &
+ MIGRATE_VMA_SELECT_DEVICE_COHERENT) ||
+ pgmap->owner != migrate->pgmap_owner)
+ goto next;
+ }
mpfn = migrate_pfn(pfn) | MIGRATE_PFN_MIGRATE;
mpfn |= pte_write(pte) ? MIGRATE_PFN_WRITE : 0;
}
diff --git a/mm/mlock.c b/mm/mlock.c
index cde076f..3cb72b5 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -368,6 +368,8 @@ static int mlock_pte_range(pmd_t *pmd, unsigned long addr,
if (is_huge_zero_pmd(*pmd))
goto out;
folio = pmd_folio(*pmd);
+ if (folio_is_zone_device(folio))
+ goto out;
if (vma->vm_flags & VM_LOCKED)
mlock_folio(folio);
else
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 6be9796..d0b5bef 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -998,7 +998,7 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
* and zone_device_data. It is a bug if a ZONE_DEVICE page is
* ever freed or placed on a driver-private list.
*/
- page->pgmap = pgmap;
+ page_folio(page)->pgmap = pgmap;
page->zone_device_data = NULL;
/*
--
git-series 0.9.1
next prev parent reply other threads:[~2025-02-28 3:35 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-02-28 3:30 [PATCH v9 00/20] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
2025-02-28 3:30 ` [PATCH v9 01/20] fuse: Fix dax truncate/punch_hole fault path Alistair Popple
2025-02-28 3:30 ` [PATCH v9 02/20] fs/dax: Return unmapped busy pages from dax_layout_busy_page_range() Alistair Popple
2025-02-28 3:30 ` [PATCH v9 03/20] fs/dax: Don't skip locked entries when scanning entries Alistair Popple
2025-02-28 3:30 ` [PATCH v9 04/20] fs/dax: Refactor wait for dax idle page Alistair Popple
2025-02-28 3:31 ` [PATCH v9 05/20] fs/dax: Create a common implementation to break DAX layouts Alistair Popple
2025-02-28 3:31 ` [PATCH v9 06/20] fs/dax: Always remove DAX page-cache entries when breaking layouts Alistair Popple
2025-02-28 3:31 ` [PATCH v9 07/20] fs/dax: Ensure all pages are idle prior to filesystem unmount Alistair Popple
2025-02-28 3:31 ` [PATCH v9 08/20] fs/dax: Remove PAGE_MAPPING_DAX_SHARED mapping flag Alistair Popple
2025-02-28 3:31 ` [PATCH v9 09/20] mm/gup: Remove redundant check for PCI P2PDMA page Alistair Popple
2025-02-28 3:31 ` [PATCH v9 10/20] mm/mm_init: Move p2pdma page refcount initialisation to p2pdma Alistair Popple
2025-02-28 3:31 ` Alistair Popple [this message]
2025-02-28 3:31 ` [PATCH v9 12/20] mm/memory: Enhance insert_page_into_pte_locked() to create writable mappings Alistair Popple
2025-02-28 3:31 ` [PATCH v9 13/20] mm/memory: Add vmf_insert_page_mkwrite() Alistair Popple
2025-02-28 3:31 ` [PATCH v9 14/20] mm/rmap: Add support for PUD sized mappings to rmap Alistair Popple
2025-02-28 3:31 ` [PATCH v9 15/20] mm/huge_memory: Add vmf_insert_folio_pud() Alistair Popple
2025-02-28 3:31 ` [PATCH v9 16/20] mm/huge_memory: Add vmf_insert_folio_pmd() Alistair Popple
2025-02-28 3:31 ` [PATCH v9 17/20] mm/gup: Don't allow FOLL_LONGTERM pinning of FS DAX pages Alistair Popple
2025-02-28 3:31 ` [PATCH v9 18/20] dcssblk: Mark DAX broken, remove FS_DAX_LIMITED support Alistair Popple
2025-02-28 3:31 ` [PATCH v9 19/20] fs/dax: Properly refcount fs dax pages Alistair Popple
2025-03-03 8:58 ` David Hildenbrand
2025-03-26 21:04 ` Dan Williams
2025-02-28 3:31 ` [PATCH v9 20/20] device/dax: Properly refcount device dax pages when mapping Alistair Popple
2025-02-28 3:42 ` [PATCH v9 00/20] fs/dax: Fix ZONE_DEVICE page reference counts Alistair Popple
2025-03-04 4:46 ` Andrew Morton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=67055d772e6102accf85161d0b57b0b3944292bf.1740713401.git-series.apopple@nvidia.com \
--to=apopple@nvidia.com \
--cc=akpm@linux-foundation.org \
--cc=alison.schofield@intel.com \
--cc=balbirs@nvidia.com \
--cc=bhelgaas@google.com \
--cc=catalin.marinas@arm.com \
--cc=chenhuacai@kernel.org \
--cc=dan.j.williams@intel.com \
--cc=dave.hansen@linux.intel.com \
--cc=dave.jiang@intel.com \
--cc=david@fromorbit.com \
--cc=david@redhat.com \
--cc=djwong@kernel.org \
--cc=gerald.schaefer@linux.ibm.com \
--cc=hch@lst.de \
--cc=ira.weiny@intel.com \
--cc=jack@suse.cz \
--cc=jgg@nvidia.com \
--cc=jgg@ziepe.ca \
--cc=jhubbard@nvidia.com \
--cc=kernel@xen0n.name \
--cc=lina@asahilina.net \
--cc=linmiaohe@huawei.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-xfs@vger.kernel.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=logang@deltatee.com \
--cc=loongarch@lists.linux.dev \
--cc=mpe@ellerman.id.au \
--cc=npiggin@gmail.com \
--cc=nvdimm@lists.linux.dev \
--cc=peterx@redhat.com \
--cc=tytso@mit.edu \
--cc=vishal.l.verma@intel.com \
--cc=will@kernel.org \
--cc=willy@infradead.org \
--cc=zhang.lyra@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox