From: Alistair Popple <apopple@nvidia.com>
To: akpm@linux-foundation.org, dan.j.williams@intel.com, linux-mm@kvack.org
Cc: Alistair Popple <apopple@nvidia.com>,
lina@asahilina.net, zhang.lyra@gmail.com,
gerald.schaefer@linux.ibm.com, vishal.l.verma@intel.com,
dave.jiang@intel.com, logang@deltatee.com, bhelgaas@google.com,
jack@suse.cz, jgg@ziepe.ca, catalin.marinas@arm.com,
will@kernel.org, mpe@ellerman.id.au, npiggin@gmail.com,
dave.hansen@linux.intel.com, ira.weiny@intel.com,
willy@infradead.org, djwong@kernel.org, tytso@mit.edu,
linmiaohe@huawei.com, david@redhat.com, peterx@redhat.com,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
linuxppc-dev@lists.ozlabs.org, nvdimm@lists.linux.dev,
linux-cxl@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org,
jhubbard@nvidia.com, hch@lst.de, david@fromorbit.com
Subject: [PATCH v5 10/25] mm/mm_init: Move p2pdma page refcount initialisation to p2pdma
Date: Tue, 7 Jan 2025 14:42:26 +1100
Message-ID: <09143f567302ccfe293972b7f20fc709fafdd8db.1736221254.git-series.apopple@nvidia.com>
In-Reply-To: <cover.425da7c4e76c2749d0ad1734f972b06114e02d52.1736221254.git-series.apopple@nvidia.com>
Currently ZONE_DEVICE page reference counts are initialised by core
memory management code in __init_zone_device_page() as part of the
memremap_pages() call which driver modules make to obtain ZONE_DEVICE
pages. For most pgmap types this initialises page refcounts to 1 before returning them to
the driver.
This was presumably done because drivers had a reference of sorts
on the page. It also ensured the page could always be mapped with,
for example, vm_insert_page() and would never get freed (i.e. reach a
zero refcount), so drivers never had to manipulate page reference counts.
However, this complicates determining whether or not a page is free
from the mm perspective, because it is no longer possible to just look
at the refcount. Instead the page type must be known, and if GUP is used
a secondary pgmap reference is sometimes also needed.
To simplify this it is desirable to remove the page reference held on
behalf of the driver, so core mm can just use the refcount without having
to account for the page type or do other kinds of tracking. This is
possible because drivers can always assume the page is valid, as the core
kernel will never offline or remove the struct page.
This means it is now up to drivers to initialise the page refcount as
required. P2PDMA uses vm_insert_page() to map the page, which requires a
non-zero reference count, so set the refcount to 1 when the page is first
mapped.
Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
Changes since v2:
- Initialise the page refcount for all pages covered by the kaddr
---
drivers/pci/p2pdma.c | 13 +++++++++++--
mm/memremap.c | 17 +++++++++++++----
mm/mm_init.c | 22 ++++++++++++++++++----
3 files changed, 42 insertions(+), 10 deletions(-)
diff --git a/drivers/pci/p2pdma.c b/drivers/pci/p2pdma.c
index 0cb7e0a..04773a8 100644
--- a/drivers/pci/p2pdma.c
+++ b/drivers/pci/p2pdma.c
@@ -140,13 +140,22 @@ static int p2pmem_alloc_mmap(struct file *filp, struct kobject *kobj,
rcu_read_unlock();
for (vaddr = vma->vm_start; vaddr < vma->vm_end; vaddr += PAGE_SIZE) {
- ret = vm_insert_page(vma, vaddr, virt_to_page(kaddr));
+ struct page *page = virt_to_page(kaddr);
+
+ /*
+ * Initialise the refcount for the freshly allocated page. As
+ * we have just allocated the page no one else should be
+ * using it.
+ */
+ VM_WARN_ON_ONCE_PAGE(page_ref_count(page), page);
+ set_page_count(page, 1);
+ ret = vm_insert_page(vma, vaddr, page);
if (ret) {
gen_pool_free(p2pdma->pool, (uintptr_t)kaddr, len);
return ret;
}
percpu_ref_get(ref);
- put_page(virt_to_page(kaddr));
+ put_page(page);
kaddr += PAGE_SIZE;
len -= PAGE_SIZE;
}
diff --git a/mm/memremap.c b/mm/memremap.c
index 40d4547..07bbe0e 100644
--- a/mm/memremap.c
+++ b/mm/memremap.c
@@ -488,15 +488,24 @@ void free_zone_device_folio(struct folio *folio)
folio->mapping = NULL;
folio->page.pgmap->ops->page_free(folio_page(folio, 0));
- if (folio->page.pgmap->type != MEMORY_DEVICE_PRIVATE &&
- folio->page.pgmap->type != MEMORY_DEVICE_COHERENT)
+ switch (folio->page.pgmap->type) {
+ case MEMORY_DEVICE_PRIVATE:
+ case MEMORY_DEVICE_COHERENT:
+ put_dev_pagemap(folio->page.pgmap);
+ break;
+
+ case MEMORY_DEVICE_FS_DAX:
+ case MEMORY_DEVICE_GENERIC:
/*
* Reset the refcount to 1 to prepare for handing out the page
* again.
*/
folio_set_count(folio, 1);
- else
- put_dev_pagemap(folio->page.pgmap);
+ break;
+
+ case MEMORY_DEVICE_PCI_P2PDMA:
+ break;
+ }
}
void zone_device_page_init(struct page *page)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 24b68b4..f021e63 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1017,12 +1017,26 @@ static void __ref __init_zone_device_page(struct page *page, unsigned long pfn,
}
/*
- * ZONE_DEVICE pages are released directly to the driver page allocator
- * which will set the page count to 1 when allocating the page.
+ * ZONE_DEVICE pages other than MEMORY_DEVICE_GENERIC and
+ * MEMORY_DEVICE_FS_DAX pages are released directly to the driver page
+ * allocator which will set the page count to 1 when allocating the
+ * page.
+ *
+ * MEMORY_DEVICE_GENERIC and MEMORY_DEVICE_FS_DAX pages automatically have
+ * their refcount reset to one whenever they are freed (i.e. after
+ * their refcount drops to 0).
*/
- if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
- pgmap->type == MEMORY_DEVICE_COHERENT)
+ switch (pgmap->type) {
+ case MEMORY_DEVICE_PRIVATE:
+ case MEMORY_DEVICE_COHERENT:
+ case MEMORY_DEVICE_PCI_P2PDMA:
set_page_count(page, 0);
+ break;
+
+ case MEMORY_DEVICE_FS_DAX:
+ case MEMORY_DEVICE_GENERIC:
+ break;
+ }
}
/*
--
git-series 0.9.1