linux-mm.kvack.org archive mirror
* [PATCH v4 0/5] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v4)
@ 2023-11-18  6:32 Vivek Kasireddy
  2023-11-18  6:32 ` [PATCH v4 1/5] udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap Vivek Kasireddy
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Vivek Kasireddy @ 2023-11-18  6:32 UTC (permalink / raw)
  To: dri-devel, linux-mm
  Cc: Vivek Kasireddy, David Hildenbrand, Daniel Vetter, Mike Kravetz,
	Hugh Dickins, Peter Xu, Jason Gunthorpe, Gerd Hoffmann,
	Dongwon Kim, Junxiao Chang

The first two patches were previously reviewed but not yet merged.
They need to be merged first because the fourth patch depends on
the changes introduced in them; they also fix bugs seen in
very specific scenarios (running Qemu with hugetlb=on and blob=true,
then rebooting the guest VM).

The third patch introduces the pin_user_pages_fd() API, and the fourth
patch shows how the udmabuf driver can make use of it to
longterm-pin the pages. The last patch adds two new udmabuf
selftests to verify data coherency after potential page migration.

v2:
- Updated the first patch to include review feedback from David and
  Jason. The main change in this version is allocating a page in the
  hugetlbfs case if it is not found in the page cache.

v3:
- Incorporated review feedback from David to improve the comments
  and the readability of the code
- Enclosed the hugepage alloc code with #ifdef CONFIG_HUGETLB_PAGE

v4:
- Augmented the commit message of the udmabuf patch that uses
  pin_user_pages_fd()
- Added previously reviewed but unmerged udmabuf patches to this
  series

Cc: David Hildenbrand <david@redhat.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Junxiao Chang <junxiao.chang@intel.com>

Vivek Kasireddy (5):
  udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap
  udmabuf: Add back support for mapping hugetlb pages (v3)
  mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file
    pages (v4)
  udmabuf: Pin the pages using pin_user_pages_fd() API (v3)
  selftests/dma-buf/udmabuf: Add tests to verify data after page
    migration

 drivers/dma-buf/udmabuf.c                     |  96 ++++++++---
 include/linux/mm.h                            |   2 +
 mm/gup.c                                      | 108 +++++++++++++
 .../selftests/drivers/dma-buf/udmabuf.c       | 151 +++++++++++++++++-
 4 files changed, 328 insertions(+), 29 deletions(-)

-- 
2.39.2




* [PATCH v4 1/5] udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap
  2023-11-18  6:32 [PATCH v4 0/5] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v4) Vivek Kasireddy
@ 2023-11-18  6:32 ` Vivek Kasireddy
  2023-11-18  6:32 ` [PATCH v4 2/5] udmabuf: Add back support for mapping hugetlb pages (v3) Vivek Kasireddy
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Vivek Kasireddy @ 2023-11-18  6:32 UTC (permalink / raw)
  To: dri-devel, linux-mm
  Cc: Vivek Kasireddy, David Hildenbrand, Daniel Vetter, Mike Kravetz,
	Hugh Dickins, Peter Xu, Jason Gunthorpe, Gerd Hoffmann,
	Dongwon Kim, Junxiao Chang

Add VM_PFNMAP to vm_flags in the mmap handler to ensure that
the mappings are managed without using struct page.

And, in the vm_fault handler, use vmf_insert_pfn to share the
page's pfn with userspace instead of directly sharing the page
(via struct page *).
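
(With VM_PFNMAP set, core mm code does not assume a struct page backs
the mapping's PTEs; vm_normal_page() returns NULL for them, so the
fault handler must insert a raw pfn rather than return vmf->page.)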

Cc: David Hildenbrand <david@redhat.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Junxiao Chang <junxiao.chang@intel.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
 drivers/dma-buf/udmabuf.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index c40645999648..820c993c8659 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -35,12 +35,13 @@ static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct udmabuf *ubuf = vma->vm_private_data;
 	pgoff_t pgoff = vmf->pgoff;
+	unsigned long pfn;
 
 	if (pgoff >= ubuf->pagecount)
 		return VM_FAULT_SIGBUS;
-	vmf->page = ubuf->pages[pgoff];
-	get_page(vmf->page);
-	return 0;
+
+	pfn = page_to_pfn(ubuf->pages[pgoff]);
+	return vmf_insert_pfn(vma, vmf->address, pfn);
 }
 
 static const struct vm_operations_struct udmabuf_vm_ops = {
@@ -56,6 +57,7 @@ static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma)
 
 	vma->vm_ops = &udmabuf_vm_ops;
 	vma->vm_private_data = ubuf;
+	vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
 	return 0;
 }
 
-- 
2.39.2




* [PATCH v4 2/5] udmabuf: Add back support for mapping hugetlb pages (v3)
  2023-11-18  6:32 [PATCH v4 0/5] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v4) Vivek Kasireddy
  2023-11-18  6:32 ` [PATCH v4 1/5] udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap Vivek Kasireddy
@ 2023-11-18  6:32 ` Vivek Kasireddy
  2023-11-18  6:32 ` [PATCH v4 3/5] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v4) Vivek Kasireddy
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Vivek Kasireddy @ 2023-11-18  6:32 UTC (permalink / raw)
  To: dri-devel, linux-mm
  Cc: Vivek Kasireddy, David Hildenbrand, Daniel Vetter, Mike Kravetz,
	Hugh Dickins, Peter Xu, Jason Gunthorpe, Gerd Hoffmann,
	Dongwon Kim, Junxiao Chang

A user or admin can configure a VMM (Qemu) Guest's memory to be
backed by hugetlb pages for various reasons. However, a Guest OS
would still allocate (and pin) buffers that are backed by regular
4k sized pages. In order to map these buffers and create dma-bufs
for them on the Host, we first need to find the hugetlb pages backing
the buffer allocations, then determine the offsets of the individual
chunks within those pages, and use this information to eventually
populate a scatterlist.
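
As a worked example (hypothetical numbers, assuming 4k base pages):
with 2 MB huge pages, a chunk at memfd offset 0x201000 resolves to

    mapidx   = 0x201000 >> 21              = 1  (the second huge page)
    chunkoff = (0x201000 & 0x1fffff) >> 12 = 1  (its second 4k chunk)

so the mmap fault handler maps pfn(huge page) + 1, and the matching
scatterlist entry carries a 4096 byte offset into the huge page.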

Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500 options
were passed to the Host kernel and Qemu was launched with these
relevant options: qemu-system-x86_64 -m 4096m....
-device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
-display gtk,gl=on
-object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
-machine memory-backend=mem1

Replacing -display gtk,gl=on with -display gtk,gl=off above would
exercise the mmap handler.

v2: Updated get_sg_table() to manually populate the scatterlist for
    both huge page and non-huge-page cases.

v3: s/offsets/subpgoff/g
    s/hpoff/mapidx/g

Cc: David Hildenbrand <david@redhat.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Junxiao Chang <junxiao.chang@intel.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com> (v2)
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
 drivers/dma-buf/udmabuf.c | 85 +++++++++++++++++++++++++++++++++------
 1 file changed, 72 insertions(+), 13 deletions(-)

diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index 820c993c8659..1a41c4a069ea 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -10,6 +10,7 @@
 #include <linux/miscdevice.h>
 #include <linux/module.h>
 #include <linux/shmem_fs.h>
+#include <linux/hugetlb.h>
 #include <linux/slab.h>
 #include <linux/udmabuf.h>
 #include <linux/vmalloc.h>
@@ -28,6 +29,7 @@ struct udmabuf {
 	struct page **pages;
 	struct sg_table *sg;
 	struct miscdevice *device;
+	pgoff_t *subpgoff;
 };
 
 static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
@@ -41,6 +43,10 @@ static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
 		return VM_FAULT_SIGBUS;
 
 	pfn = page_to_pfn(ubuf->pages[pgoff]);
+	if (ubuf->subpgoff) {
+		pfn += ubuf->subpgoff[pgoff] >> PAGE_SHIFT;
+	}
+
 	return vmf_insert_pfn(vma, vmf->address, pfn);
 }
 
@@ -90,23 +96,31 @@ static struct sg_table *get_sg_table(struct device *dev, struct dma_buf *buf,
 {
 	struct udmabuf *ubuf = buf->priv;
 	struct sg_table *sg;
+	struct scatterlist *sgl;
+	pgoff_t offset;
+	unsigned long i = 0;
 	int ret;
 
 	sg = kzalloc(sizeof(*sg), GFP_KERNEL);
 	if (!sg)
 		return ERR_PTR(-ENOMEM);
-	ret = sg_alloc_table_from_pages(sg, ubuf->pages, ubuf->pagecount,
-					0, ubuf->pagecount << PAGE_SHIFT,
-					GFP_KERNEL);
+
+	ret = sg_alloc_table(sg, ubuf->pagecount, GFP_KERNEL);
 	if (ret < 0)
-		goto err;
+		goto err_alloc;
+
+	for_each_sg(sg->sgl, sgl, ubuf->pagecount, i) {
+		offset = ubuf->subpgoff ? ubuf->subpgoff[i] : 0;
+		sg_set_page(sgl, ubuf->pages[i], PAGE_SIZE, offset);
+	}
 	ret = dma_map_sgtable(dev, sg, direction, 0);
 	if (ret < 0)
-		goto err;
+		goto err_map;
 	return sg;
 
-err:
+err_map:
 	sg_free_table(sg);
+err_alloc:
 	kfree(sg);
 	return ERR_PTR(ret);
 }
@@ -143,6 +157,7 @@ static void release_udmabuf(struct dma_buf *buf)
 
 	for (pg = 0; pg < ubuf->pagecount; pg++)
 		put_page(ubuf->pages[pg]);
+	kfree(ubuf->subpgoff);
 	kfree(ubuf->pages);
 	kfree(ubuf);
 }
@@ -206,7 +221,9 @@ static long udmabuf_create(struct miscdevice *device,
 	struct udmabuf *ubuf;
 	struct dma_buf *buf;
 	pgoff_t pgoff, pgcnt, pgidx, pgbuf = 0, pglimit;
-	struct page *page;
+	struct page *page, *hpage = NULL;
+	pgoff_t mapidx, chunkoff, maxchunks;
+	struct hstate *hpstate;
 	int seals, ret = -EINVAL;
 	u32 i, flags;
 
@@ -242,7 +259,7 @@ static long udmabuf_create(struct miscdevice *device,
 		if (!memfd)
 			goto err;
 		mapping = memfd->f_mapping;
-		if (!shmem_mapping(mapping))
+		if (!shmem_mapping(mapping) && !is_file_hugepages(memfd))
 			goto err;
 		seals = memfd_fcntl(memfd, F_GET_SEALS, 0);
 		if (seals == -EINVAL)
@@ -253,16 +270,57 @@ static long udmabuf_create(struct miscdevice *device,
 			goto err;
 		pgoff = list[i].offset >> PAGE_SHIFT;
 		pgcnt = list[i].size   >> PAGE_SHIFT;
+		if (is_file_hugepages(memfd)) {
+			if (!ubuf->subpgoff) {
+				ubuf->subpgoff = kmalloc_array(ubuf->pagecount,
+							       sizeof(*ubuf->subpgoff),
+							       GFP_KERNEL);
+				if (!ubuf->subpgoff) {
+					ret = -ENOMEM;
+					goto err;
+				}
+			}
+			hpstate = hstate_file(memfd);
+			mapidx = list[i].offset >> huge_page_shift(hpstate);
+			chunkoff = (list[i].offset &
+				    ~huge_page_mask(hpstate)) >> PAGE_SHIFT;
+			maxchunks = huge_page_size(hpstate) >> PAGE_SHIFT;
+		}
 		for (pgidx = 0; pgidx < pgcnt; pgidx++) {
-			page = shmem_read_mapping_page(mapping, pgoff + pgidx);
-			if (IS_ERR(page)) {
-				ret = PTR_ERR(page);
-				goto err;
+			if (is_file_hugepages(memfd)) {
+				if (!hpage) {
+					hpage = find_get_page_flags(mapping, mapidx,
+								    FGP_ACCESSED);
+					if (!hpage) {
+						ret = -EINVAL;
+						goto err;
+					}
+				}
+				get_page(hpage);
+				ubuf->pages[pgbuf] = hpage;
+				ubuf->subpgoff[pgbuf++] = chunkoff << PAGE_SHIFT;
+				if (++chunkoff == maxchunks) {
+					put_page(hpage);
+					hpage = NULL;
+					chunkoff = 0;
+					mapidx++;
+				}
+			} else {
+				mapidx = pgoff + pgidx;
+				page = shmem_read_mapping_page(mapping, mapidx);
+				if (IS_ERR(page)) {
+					ret = PTR_ERR(page);
+					goto err;
+				}
+				ubuf->pages[pgbuf++] = page;
 			}
-			ubuf->pages[pgbuf++] = page;
 		}
 		fput(memfd);
 		memfd = NULL;
+		if (hpage) {
+			put_page(hpage);
+			hpage = NULL;
+		}
 	}
 
 	exp_info.ops  = &udmabuf_ops;
@@ -287,6 +345,7 @@ static long udmabuf_create(struct miscdevice *device,
 		put_page(ubuf->pages[--pgbuf]);
 	if (memfd)
 		fput(memfd);
+	kfree(ubuf->subpgoff);
 	kfree(ubuf->pages);
 	kfree(ubuf);
 	return ret;
-- 
2.39.2




* [PATCH v4 3/5] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v4)
  2023-11-18  6:32 [PATCH v4 0/5] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v4) Vivek Kasireddy
  2023-11-18  6:32 ` [PATCH v4 1/5] udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap Vivek Kasireddy
  2023-11-18  6:32 ` [PATCH v4 2/5] udmabuf: Add back support for mapping hugetlb pages (v3) Vivek Kasireddy
@ 2023-11-18  6:32 ` Vivek Kasireddy
  2023-11-20  8:48   ` David Hildenbrand
  2023-11-18  6:32 ` [PATCH v4 4/5] udmabuf: Pin the pages using pin_user_pages_fd() API (v3) Vivek Kasireddy
  2023-11-18  6:32 ` [PATCH v4 5/5] selftests/dma-buf/udmabuf: Add tests to verify data after page migration Vivek Kasireddy
  4 siblings, 1 reply; 8+ messages in thread
From: Vivek Kasireddy @ 2023-11-18  6:32 UTC (permalink / raw)
  To: dri-devel, linux-mm
  Cc: Vivek Kasireddy, David Hildenbrand, Daniel Vetter, Mike Kravetz,
	Hugh Dickins, Peter Xu, Gerd Hoffmann, Dongwon Kim,
	Junxiao Chang, Jason Gunthorpe

For drivers that would like to longterm-pin the pages associated
with a file, the pin_user_pages_fd() API provides an option to
not only pin the pages via FOLL_PIN but also to check and migrate
them if they reside in movable zone or CMA block. This API
currently works with files that belong to either shmem or hugetlbfs.
Files belonging to other filesystems are rejected for now.

The pages need to be located first before pinning them via FOLL_PIN.
If they are found in the page cache, they can be immediately pinned.
Otherwise, they need to be allocated using the filesystem specific
APIs and then pinned.
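
As an illustration, a hypothetical caller (not part of this series;
the helper name and page count are made up) could look like this:

	static void example_longterm_pin(struct file *memfd)
	{
		struct page **pages;
		long ret;

		pages = kvmalloc_array(16, sizeof(*pages), GFP_KERNEL);
		if (!pages)
			return;

		/*
		 * Pin the first 16 pages of the shmem/hugetlbfs file.
		 * The pages are looked up in (or allocated into) the
		 * page cache and, if needed, migrated out of
		 * ZONE_MOVABLE/CMA before being pinned via FOLL_PIN.
		 */
		ret = pin_user_pages_fd(memfd, 0, 16, pages);
		if (ret > 0) {
			/* ... use the pages, e.g. to build an sg_table ... */
			unpin_user_pages(pages, ret);
		}
		kvfree(pages);
	}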

v2:
- Drop gup_flags and improve comments and commit message (David)
- Allocate a page if we cannot find in page cache for the hugetlbfs
  case as well (David)
- Don't unpin pages if there is a migration related failure (David)
- Drop the unnecessary nr_pages <= 0 check (Jason)
- Have the caller of the API pass in file * instead of fd (Jason)

v3: (David)
- Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE
  (Build error reported by kernel test robot <lkp@intel.com>)
- Don't forget memalloc_pin_restore() on non-migration related errors
- Improve the readability of the cleanup code associated with
  non-migration related errors
- Augment the comments by describing FOLL_LONGTERM like behavior
- Include the R-b tag from Jason

v4:
- Remove the local variable "page" and instead use 3 return statements
  in alloc_file_page() (David)
- Add the R-b tag from David

Cc: David Hildenbrand <david@redhat.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Junxiao Chang <junxiao.chang@intel.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> (v2)
Reviewed-by: David Hildenbrand <david@redhat.com> (v3)
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
 include/linux/mm.h |   2 +
 mm/gup.c           | 108 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 110 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 418d26608ece..1b675fa35059 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2472,6 +2472,8 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 		    struct page **pages, unsigned int gup_flags);
 long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 		    struct page **pages, unsigned int gup_flags);
+long pin_user_pages_fd(struct file *file, pgoff_t start,
+		       unsigned long nr_pages, struct page **pages);
 
 int get_user_pages_fast(unsigned long start, int nr_pages,
 			unsigned int gup_flags, struct page **pages);
diff --git a/mm/gup.c b/mm/gup.c
index 231711efa390..875c51d13ee5 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -3410,3 +3410,111 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
 				     &locked, gup_flags);
 }
 EXPORT_SYMBOL(pin_user_pages_unlocked);
+
+static struct page *alloc_file_page(struct file *file, pgoff_t idx)
+{
+#ifdef CONFIG_HUGETLB_PAGE
+	struct folio *folio;
+	int err;
+
+	if (is_file_hugepages(file)) {
+		folio = alloc_hugetlb_folio_nodemask(hstate_file(file),
+						     NUMA_NO_NODE,
+						     NULL,
+						     GFP_USER);
+		if (folio && folio_try_get(folio)) {
+			err = hugetlb_add_to_page_cache(folio,
+							file->f_mapping,
+							idx);
+			if (err) {
+				folio_put(folio);
+				free_huge_folio(folio);
+				return ERR_PTR(err);
+			}
+			return &folio->page;
+		}
+		return ERR_PTR(-ENOMEM);
+	}
+#endif
+	return shmem_read_mapping_page(file->f_mapping, idx);
+}
+
+/**
+ * pin_user_pages_fd() - pin user pages associated with a file
+ * @file:       the file whose pages are to be pinned
+ * @start:      starting file offset
+ * @nr_pages:   number of pages from start to pin
+ * @pages:      array that receives pointers to the pages pinned.
+ *              Should be at-least nr_pages long.
+ *
+ * Attempt to pin pages associated with a file that belongs to either shmem
+ * or hugetlb. The pages are either found in the page cache or allocated if
+ * necessary. Once the pages are located, they are all pinned via FOLL_PIN.
+ * And, these pinned pages need to be released either using unpin_user_pages()
+ * or unpin_user_page().
+ *
+ * It must be noted that the pages may be pinned for an indefinite amount
+ * of time. And, in most cases, the duration of time they may stay pinned
+ * would be controlled by the userspace. This behavior is effectively the
+ * same as using FOLL_LONGTERM with other GUP APIs.
+ *
+ * Returns number of pages pinned. This would be equal to the number of
+ * pages requested. If no pages were pinned, it returns -errno.
+ */
+long pin_user_pages_fd(struct file *file, pgoff_t start,
+		       unsigned long nr_pages, struct page **pages)
+{
+	struct page *page;
+	unsigned int flags, i;
+	long ret;
+
+	if (start < 0)
+		return -EINVAL;
+
+	if (!file)
+	    return -EINVAL;
+
+	if (!shmem_file(file) && !is_file_hugepages(file))
+	    return -EINVAL;
+
+	flags = memalloc_pin_save();
+	do {
+		for (i = 0; i < nr_pages; i++) {
+			/*
+ 			 * In most cases, we should be able to find the page
+			 * in the page cache. If we cannot find it, we try to
+			 * allocate one and add it to the page cache.
+			 */
+			page = find_get_page_flags(file->f_mapping,
+						   start + i,
+						   FGP_ACCESSED);
+			if (!page) {
+				page = alloc_file_page(file, start + i);
+				if (IS_ERR(page)) {
+					ret = PTR_ERR(page);
+					goto err;
+				}
+			}
+			ret = try_grab_page(page, FOLL_PIN);
+			if (unlikely(ret))
+				goto err;
+
+			pages[i] = page;
+			put_page(pages[i]);
+		}
+
+		ret = check_and_migrate_movable_pages(nr_pages, pages);
+	} while (ret == -EAGAIN);
+
+	memalloc_pin_restore(flags);
+	return ret ? ret : nr_pages;
+err:
+	memalloc_pin_restore(flags);
+	while (i-- > 0)
+		if (pages[i])
+			unpin_user_page(pages[i]);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(pin_user_pages_fd);
+
-- 
2.39.2




* [PATCH v4 4/5] udmabuf: Pin the pages using pin_user_pages_fd() API (v3)
  2023-11-18  6:32 [PATCH v4 0/5] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v4) Vivek Kasireddy
                   ` (2 preceding siblings ...)
  2023-11-18  6:32 ` [PATCH v4 3/5] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v4) Vivek Kasireddy
@ 2023-11-18  6:32 ` Vivek Kasireddy
  2023-11-18  6:32 ` [PATCH v4 5/5] selftests/dma-buf/udmabuf: Add tests to verify data after page migration Vivek Kasireddy
  4 siblings, 0 replies; 8+ messages in thread
From: Vivek Kasireddy @ 2023-11-18  6:32 UTC (permalink / raw)
  To: dri-devel, linux-mm
  Cc: Vivek Kasireddy, David Hildenbrand, Daniel Vetter, Mike Kravetz,
	Hugh Dickins, Peter Xu, Jason Gunthorpe, Gerd Hoffmann,
	Dongwon Kim, Junxiao Chang

Using pin_user_pages_fd() ensures that the pages are pinned
correctly via FOLL_PIN. It also ensures that we don't
accidentally break features such as memory hotunplug, as it
disallows pinning pages in the movable zone.

Using this new API also simplifies the code, as we no longer have
to deal with extracting individual pages from their mappings. As
a result, we can drop some of the local variables such as page,
hpage, mapping, etc.

v2:
- Adjust to the change in signature of pin_user_pages_fd() by
  passing in file * instead of fd.

v3:
- Limit the changes in this patch only to those that are required
  for using pin_user_pages_fd()
- Slightly improve the commit message

Cc: David Hildenbrand <david@redhat.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Junxiao Chang <junxiao.chang@intel.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
 drivers/dma-buf/udmabuf.c | 59 +++++++++++++++------------------------
 1 file changed, 22 insertions(+), 37 deletions(-)

diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index 1a41c4a069ea..883bd97e4076 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -156,7 +156,8 @@ static void release_udmabuf(struct dma_buf *buf)
 		put_sg_table(dev, ubuf->sg, DMA_BIDIRECTIONAL);
 
 	for (pg = 0; pg < ubuf->pagecount; pg++)
-		put_page(ubuf->pages[pg]);
+		unpin_user_page(ubuf->pages[pg]);
+
 	kfree(ubuf->subpgoff);
 	kfree(ubuf->pages);
 	kfree(ubuf);
@@ -217,14 +218,13 @@ static long udmabuf_create(struct miscdevice *device,
 {
 	DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
 	struct file *memfd = NULL;
-	struct address_space *mapping = NULL;
 	struct udmabuf *ubuf;
 	struct dma_buf *buf;
-	pgoff_t pgoff, pgcnt, pgidx, pgbuf = 0, pglimit;
-	struct page *page, *hpage = NULL;
+	pgoff_t pgcnt, pgbuf = 0, pglimit, nr_pages;
 	pgoff_t mapidx, chunkoff, maxchunks;
 	struct hstate *hpstate;
-	int seals, ret = -EINVAL;
+	long ret = -EINVAL;
+	int seals;
 	u32 i, flags;
 
 	ubuf = kzalloc(sizeof(*ubuf), GFP_KERNEL);
@@ -258,8 +258,7 @@ static long udmabuf_create(struct miscdevice *device,
 		memfd = fget(list[i].memfd);
 		if (!memfd)
 			goto err;
-		mapping = memfd->f_mapping;
-		if (!shmem_mapping(mapping) && !is_file_hugepages(memfd))
+		if (!shmem_file(memfd) && !is_file_hugepages(memfd))
 			goto err;
 		seals = memfd_fcntl(memfd, F_GET_SEALS, 0);
 		if (seals == -EINVAL)
@@ -268,7 +267,7 @@ static long udmabuf_create(struct miscdevice *device,
 		if ((seals & SEALS_WANTED) != SEALS_WANTED ||
 		    (seals & SEALS_DENIED) != 0)
 			goto err;
-		pgoff = list[i].offset >> PAGE_SHIFT;
+		mapidx = list[i].offset >> PAGE_SHIFT;
 		pgcnt = list[i].size   >> PAGE_SHIFT;
 		if (is_file_hugepages(memfd)) {
 			if (!ubuf->subpgoff) {
@@ -286,41 +285,26 @@ static long udmabuf_create(struct miscdevice *device,
 				    ~huge_page_mask(hpstate)) >> PAGE_SHIFT;
 			maxchunks = huge_page_size(hpstate) >> PAGE_SHIFT;
 		}
-		for (pgidx = 0; pgidx < pgcnt; pgidx++) {
+
+		do {
+			nr_pages = shmem_file(memfd) ? pgcnt : 1;
+			ret = pin_user_pages_fd(memfd, mapidx, nr_pages,
+						ubuf->pages + pgbuf);
+			if (ret < 0)
+				goto err;
+
 			if (is_file_hugepages(memfd)) {
-				if (!hpage) {
-					hpage = find_get_page_flags(mapping, mapidx,
-								    FGP_ACCESSED);
-					if (!hpage) {
-						ret = -EINVAL;
-						goto err;
-					}
-				}
-				get_page(hpage);
-				ubuf->pages[pgbuf] = hpage;
-				ubuf->subpgoff[pgbuf++] = chunkoff << PAGE_SHIFT;
+				ubuf->subpgoff[pgbuf] = chunkoff << PAGE_SHIFT;
 				if (++chunkoff == maxchunks) {
-					put_page(hpage);
-					hpage = NULL;
 					chunkoff = 0;
 					mapidx++;
 				}
-			} else {
-				mapidx = pgoff + pgidx;
-				page = shmem_read_mapping_page(mapping, mapidx);
-				if (IS_ERR(page)) {
-					ret = PTR_ERR(page);
-					goto err;
-				}
-				ubuf->pages[pgbuf++] = page;
 			}
-		}
+			pgbuf += nr_pages;
+			pgcnt -= nr_pages;
+		} while (pgcnt > 0);
 		fput(memfd);
 		memfd = NULL;
-		if (hpage) {
-			put_page(hpage);
-			hpage = NULL;
-		}
 	}
 
 	exp_info.ops  = &udmabuf_ops;
@@ -341,8 +325,9 @@ static long udmabuf_create(struct miscdevice *device,
 	return dma_buf_fd(buf, flags);
 
 err:
-	while (pgbuf > 0)
-		put_page(ubuf->pages[--pgbuf]);
+	while (pgbuf-- > 0)
+		if (ubuf->pages[pgbuf])
+			unpin_user_page(ubuf->pages[pgbuf]);
 	if (memfd)
 		fput(memfd);
 	kfree(ubuf->subpgoff);
-- 
2.39.2




* [PATCH v4 5/5] selftests/dma-buf/udmabuf: Add tests to verify data after page migration
  2023-11-18  6:32 [PATCH v4 0/5] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v4) Vivek Kasireddy
                   ` (3 preceding siblings ...)
  2023-11-18  6:32 ` [PATCH v4 4/5] udmabuf: Pin the pages using pin_user_pages_fd() API (v3) Vivek Kasireddy
@ 2023-11-18  6:32 ` Vivek Kasireddy
  4 siblings, 0 replies; 8+ messages in thread
From: Vivek Kasireddy @ 2023-11-18  6:32 UTC (permalink / raw)
  To: dri-devel, linux-mm
  Cc: Vivek Kasireddy, Shuah Khan, David Hildenbrand, Daniel Vetter,
	Mike Kravetz, Hugh Dickins, Peter Xu, Jason Gunthorpe,
	Gerd Hoffmann, Dongwon Kim, Junxiao Chang

Since the memfd pages associated with a udmabuf may be migrated
as part of udmabuf create, we need to verify data coherency
after a successful migration. The new tests added in this patch do
just that, using both 4k sized pages and 2 MB sized huge
pages for the memfd.

Successful completion of the tests means that there is no
disconnect between the memfd pages and the ones associated with
a udmabuf. These tests can also be augmented in the future
to test newer udmabuf features (such as handling memfd hole punch).
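
(Note: the huge page subtest backs a 2 GB memfd, i.e. MEMFD_SIZE pages
of 2 MB each, so running it presumably requires at least 1024 free
2 MB hugepages on the host.)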

Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Junxiao Chang <junxiao.chang@intel.com>
Based-on-patch-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
 .../selftests/drivers/dma-buf/udmabuf.c       | 151 +++++++++++++++++-
 1 file changed, 147 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/drivers/dma-buf/udmabuf.c b/tools/testing/selftests/drivers/dma-buf/udmabuf.c
index c812080e304e..d76c813fe652 100644
--- a/tools/testing/selftests/drivers/dma-buf/udmabuf.c
+++ b/tools/testing/selftests/drivers/dma-buf/udmabuf.c
@@ -9,26 +9,132 @@
 #include <errno.h>
 #include <fcntl.h>
 #include <malloc.h>
+#include <stdbool.h>
 
 #include <sys/ioctl.h>
 #include <sys/syscall.h>
+#include <sys/mman.h>
 #include <linux/memfd.h>
 #include <linux/udmabuf.h>
 
 #define TEST_PREFIX	"drivers/dma-buf/udmabuf"
 #define NUM_PAGES       4
+#define NUM_ENTRIES     4
+#define MEMFD_SIZE      1024 /* in pages */
 
-static int memfd_create(const char *name, unsigned int flags)
+static unsigned int page_size;
+
+static int create_memfd_with_seals(off64_t size, bool hpage)
+{
+	int memfd, ret;
+	unsigned int flags = MFD_ALLOW_SEALING;
+
+	if (hpage)
+		flags |= MFD_HUGETLB;
+
+	memfd = memfd_create("udmabuf-test", flags);
+	if (memfd < 0) {
+		printf("%s: [skip,no-memfd]\n", TEST_PREFIX);
+		exit(77);
+	}
+
+	ret = fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);
+	if (ret < 0) {
+		printf("%s: [skip,fcntl-add-seals]\n", TEST_PREFIX);
+		exit(77);
+	}
+
+	ret = ftruncate(memfd, size);
+	if (ret == -1) {
+		printf("%s: [FAIL,memfd-truncate]\n", TEST_PREFIX);
+		exit(1);
+	}
+
+	return memfd;
+}
+
+static int create_udmabuf_list(int devfd, int memfd, off64_t memfd_size)
+{
+	struct udmabuf_create_list *list;
+	int ubuf_fd, i;
+
+	list = malloc(sizeof(struct udmabuf_create_list) +
+		      sizeof(struct udmabuf_create_item) * NUM_ENTRIES);
+	if (!list) {
+		printf("%s: [FAIL, udmabuf-malloc]\n", TEST_PREFIX);
+		exit(1);
+	}
+
+	for (i = 0; i < NUM_ENTRIES; i++) {
+		list->list[i].memfd  = memfd;
+		list->list[i].offset = i * (memfd_size / NUM_ENTRIES);
+		list->list[i].size   = getpagesize() * NUM_PAGES;
+	}
+
+	list->count = NUM_ENTRIES;
+	list->flags = UDMABUF_FLAGS_CLOEXEC;
+	ubuf_fd = ioctl(devfd, UDMABUF_CREATE_LIST, list);
+	free(list);
+	if (ubuf_fd < 0) {
+		printf("%s: [FAIL, udmabuf-create]\n", TEST_PREFIX);
+		exit(1);
+	}
+
+	return ubuf_fd;
+}
+
+static void write_to_memfd(void *addr, off64_t size, char chr)
+{
+	int i;
+
+	for (i = 0; i < size / page_size; i++) {
+		*((char *)addr + (i * page_size)) = chr;
+	}
+}
+
+static void *mmap_fd(int fd, off64_t size)
 {
-	return syscall(__NR_memfd_create, name, flags);
+	void *addr;
+
+	addr = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+	if (addr == MAP_FAILED) {
+		printf("%s: ubuf_fd mmap fail\n", TEST_PREFIX);
+		exit(1);
+	}
+
+	return addr;
+}
+
+static int compare_chunks(void *addr1, void *addr2, off64_t memfd_size)
+{
+	off64_t off;
+	int i = 0, j, k = 0, ret = 0;
+	char char1, char2;
+
+	while (i < NUM_ENTRIES) {
+		off = i * (memfd_size / NUM_ENTRIES);
+		for (j = 0; j < NUM_PAGES; j++, k++) {
+			char1 = *((char *)addr1 + off + (j * getpagesize()));
+			char2 = *((char *)addr2 + (k * getpagesize()));
+			if (char1 != char2) {
+				ret = -1;
+				goto err;
+			}
+		}
+		i++;
+	}
+err:
+	munmap(addr1, memfd_size);
+	munmap(addr2, NUM_ENTRIES * NUM_PAGES * getpagesize());
+	return ret;
 }
 
 int main(int argc, char *argv[])
 {
 	struct udmabuf_create create;
 	int devfd, memfd, buf, ret;
-	off_t size;
-	void *mem;
+	off64_t size;
+	void *addr1, *addr2;
 
 	devfd = open("/dev/udmabuf", O_RDWR);
 	if (devfd < 0) {
@@ -90,6 +196,9 @@ int main(int argc, char *argv[])
 	}
 
 	/* should work */
+	page_size = getpagesize();
+	addr1 = mmap_fd(memfd, size);
+	write_to_memfd(addr1, size, 'a');
 	create.memfd  = memfd;
 	create.offset = 0;
 	create.size   = size;
@@ -98,6 +207,40 @@ int main(int argc, char *argv[])
 		printf("%s: [FAIL,test-4]\n", TEST_PREFIX);
 		exit(1);
 	}
+	munmap(addr1, size);
+	close(buf);
+	close(memfd);
+
+	/* should work (migration of 4k size pages)*/
+	size = MEMFD_SIZE * page_size;
+	memfd = create_memfd_with_seals(size, false);
+	addr1 = mmap_fd(memfd, size);
+	write_to_memfd(addr1, size, 'a');
+	buf = create_udmabuf_list(devfd, memfd, size);
+	addr2 = mmap_fd(buf, NUM_PAGES * NUM_ENTRIES * getpagesize());
+	write_to_memfd(addr1, size, 'b');
+	ret = compare_chunks(addr1, addr2, size);
+	if (ret < 0) {
+		printf("%s: [FAIL,test-5]\n", TEST_PREFIX);
+		exit(1);
+	}
+	close(buf);
+	close(memfd);
+
+	/* should work (migration of 2MB size huge pages)*/
+	page_size = getpagesize() * 512; /* 2 MB */
+	size = MEMFD_SIZE * page_size;
+	memfd = create_memfd_with_seals(size, true);
+	addr1 = mmap_fd(memfd, size);
+	write_to_memfd(addr1, size, 'a');
+	buf = create_udmabuf_list(devfd, memfd, size);
+	addr2 = mmap_fd(buf, NUM_PAGES * NUM_ENTRIES * getpagesize());
+	write_to_memfd(addr1, size, 'b');
+	ret = compare_chunks(addr1, addr2, size);
+	if (ret < 0) {
+		printf("%s: [FAIL,test-6]\n", TEST_PREFIX);
+		exit(1);
+	}
 
 	fprintf(stderr, "%s: ok\n", TEST_PREFIX);
 	close(buf);
-- 
2.39.2




* Re: [PATCH v4 3/5] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v4)
  2023-11-18  6:32 ` [PATCH v4 3/5] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v4) Vivek Kasireddy
@ 2023-11-20  8:48   ` David Hildenbrand
  2023-11-21  6:54     ` Kasireddy, Vivek
  0 siblings, 1 reply; 8+ messages in thread
From: David Hildenbrand @ 2023-11-20  8:48 UTC (permalink / raw)
  To: Vivek Kasireddy, dri-devel, linux-mm
  Cc: Daniel Vetter, Mike Kravetz, Hugh Dickins, Peter Xu,
	Gerd Hoffmann, Dongwon Kim, Junxiao Chang, Jason Gunthorpe

On 18.11.23 07:32, Vivek Kasireddy wrote:
> For drivers that would like to longterm-pin the pages associated
> with a file, the pin_user_pages_fd() API provides an option to
> not only pin the pages via FOLL_PIN but also to check and migrate
> them if they reside in movable zone or CMA block. This API
> currently works with files that belong to either shmem or hugetlbfs.
> Files belonging to other filesystems are rejected for now.
> 
> The pages need to be located first before pinning them via FOLL_PIN.
> If they are found in the page cache, they can be immediately pinned.
> Otherwise, they need to be allocated using the filesystem specific
> APIs and then pinned.
> 
> v2:
> - Drop gup_flags and improve comments and commit message (David)
> - Allocate a page if we cannot find in page cache for the hugetlbfs
>    case as well (David)
> - Don't unpin pages if there is a migration related failure (David)
> - Drop the unnecessary nr_pages <= 0 check (Jason)
> - Have the caller of the API pass in file * instead of fd (Jason)
> 
> v3: (David)
> - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE
>    (Build error reported by kernel test robot <lkp@intel.com>)
> - Don't forget memalloc_pin_restore() on non-migration related errors
> - Improve the readability of the cleanup code associated with
>    non-migration related errors
> - Augment the comments by describing FOLL_LONGTERM like behavior
> - Include the R-b tag from Jason
> 
> v4:
> - Remove the local variable "page" and instead use 3 return statements
>    in alloc_file_page() (David)
> - Add the R-b tag from David
> 
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Gerd Hoffmann <kraxel@redhat.com>
> Cc: Dongwon Kim <dongwon.kim@intel.com>
> Cc: Junxiao Chang <junxiao.chang@intel.com>
> Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> (v2)
> Reviewed-by: David Hildenbrand <david@redhat.com> (v3)
> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> ---


[...]


> +static struct page *alloc_file_page(struct file *file, pgoff_t idx)
> +{
> +#ifdef CONFIG_HUGETLB_PAGE
> +	struct folio *folio;
> +	int err;
> +
> +	if (is_file_hugepages(file)) {
> +		folio = alloc_hugetlb_folio_nodemask(hstate_file(file),
> +						     NUMA_NO_NODE,
> +						     NULL,
> +						     GFP_USER);
> +		if (folio && folio_try_get(folio)) {
> +			err = hugetlb_add_to_page_cache(folio,
> +							file->f_mapping,
> +							idx);
> +			if (err) {
> +				folio_put(folio);
> +				free_huge_folio(folio);
> +				return ERR_PTR(err);
> +			}
> +			return &folio->page;

While looking at the user of pin_user_pages_fd(), I realized something:

Assume idx is not aligned to the hugetlb page size. 
find_get_page_flags() would always return a tail page in that case, but 
you'd be returning the head page here.

See pagecache_get_page()->folio_file_page(folio, index);
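
For example: if idx points into the middle of a 2 MB hugetlb folio,
folio_file_page() resolves it to the matching tail page, while the
allocation path above hands back &folio->page, the head page.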

> +		}
> +		return ERR_PTR(-ENOMEM);
> +	}
> +#endif
> +	return shmem_read_mapping_page(file->f_mapping, idx);
> +}
> +
> +/**
> + * pin_user_pages_fd() - pin user pages associated with a file
> + * @file:       the file whose pages are to be pinned
> + * @start:      starting file offset
> + * @nr_pages:   number of pages from start to pin
> + * @pages:      array that receives pointers to the pages pinned.
> + *              Should be at-least nr_pages long.
> + *
> + * Attempt to pin pages associated with a file that belongs to either shmem
> + * or hugetlb. The pages are either found in the page cache or allocated if
> + * necessary. Once the pages are located, they are all pinned via FOLL_PIN.
> + * And, these pinned pages need to be released either using unpin_user_pages()
> + * or unpin_user_page().
> + *
> + * It must be noted that the pages may be pinned for an indefinite amount
> + * of time. And, in most cases, the duration of time they may stay pinned
> + * would be controlled by the userspace. This behavior is effectively the
> + * same as using FOLL_LONGTERM with other GUP APIs.
> + *
> + * Returns number of pages pinned. This would be equal to the number of
> + * pages requested. If no pages were pinned, it returns -errno.
> + */
> +long pin_user_pages_fd(struct file *file, pgoff_t start,
> +		       unsigned long nr_pages, struct page **pages)
> +{
> +	struct page *page;
> +	unsigned int flags, i;
> +	long ret;
> +
> +	if (start < 0)
> +		return -EINVAL;
> +
> +	if (!file)
> +	    return -EINVAL;
> +
> +	if (!shmem_file(file) && !is_file_hugepages(file))
> +	    return -EINVAL;
> +
> +	flags = memalloc_pin_save();
> +	do {
> +		for (i = 0; i < nr_pages; i++) {
> +			/*
> + 			 * In most cases, we should be able to find the page
> +			 * in the page cache. If we cannot find it, we try to
> +			 * allocate one and add it to the page cache.
> +			 */
> +			page = find_get_page_flags(file->f_mapping,
> +						   start + i,
> +						   FGP_ACCESSED);
> +			if (!page) {
> +				page = alloc_file_page(file, start + i);
> +				if (IS_ERR(page)) {
> +					ret = PTR_ERR(page);
> +					goto err;

While looking at above, I do wonder: what if two parties tried to alloc 
the page at the same time? I suspect we'd want to handle -EEXIST a bit 
nicer here, right?


-- 
Cheers,

David / dhildenb




* RE: [PATCH v4 3/5] mm/gup: Introduce pin_user_pages_fd() for pinning shmem/hugetlbfs file pages (v4)
  2023-11-20  8:48   ` David Hildenbrand
@ 2023-11-21  6:54     ` Kasireddy, Vivek
  0 siblings, 0 replies; 8+ messages in thread
From: Kasireddy, Vivek @ 2023-11-21  6:54 UTC (permalink / raw)
  To: David Hildenbrand, dri-devel, linux-mm
  Cc: Daniel Vetter, Mike Kravetz, Hugh Dickins, Peter Xu,
	Gerd Hoffmann, Kim, Dongwon, Chang, Junxiao, Jason Gunthorpe

Hi David,

> 
> On 18.11.23 07:32, Vivek Kasireddy wrote:
> > For drivers that would like to longterm-pin the pages associated
> > with a file, the pin_user_pages_fd() API provides an option to
> > not only pin the pages via FOLL_PIN but also to check and migrate
> > them if they reside in movable zone or CMA block. This API
> > currently works with files that belong to either shmem or hugetlbfs.
> > Files belonging to other filesystems are rejected for now.
> >
> > The pages need to be located first before pinning them via FOLL_PIN.
> > If they are found in the page cache, they can be immediately pinned.
> > Otherwise, they need to be allocated using the filesystem specific
> > APIs and then pinned.
> >
> > v2:
> > - Drop gup_flags and improve comments and commit message (David)
> > - Allocate a page if we cannot find in page cache for the hugetlbfs
> >    case as well (David)
> > - Don't unpin pages if there is a migration related failure (David)
> > - Drop the unnecessary nr_pages <= 0 check (Jason)
> > - Have the caller of the API pass in file * instead of fd (Jason)
> >
> > v3: (David)
> > - Enclose the huge page allocation code with #ifdef
> CONFIG_HUGETLB_PAGE
> >    (Build error reported by kernel test robot <lkp@intel.com>)
> > - Don't forget memalloc_pin_restore() on non-migration related errors
> > - Improve the readability of the cleanup code associated with
> >    non-migration related errors
> > - Augment the comments by describing FOLL_LONGTERM like behavior
> > - Include the R-b tag from Jason
> >
> > v4:
> > - Remove the local variable "page" and instead use 3 return statements
> >    in alloc_file_page() (David)
> > - Add the R-b tag from David
> >
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> > Cc: Mike Kravetz <mike.kravetz@oracle.com>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Gerd Hoffmann <kraxel@redhat.com>
> > Cc: Dongwon Kim <dongwon.kim@intel.com>
> > Cc: Junxiao Chang <junxiao.chang@intel.com>
> > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> (v2)
> > Reviewed-by: David Hildenbrand <david@redhat.com> (v3)
> > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> > ---
> 
> 
> [...]
> 
> 
> > +static struct page *alloc_file_page(struct file *file, pgoff_t idx)
> > +{
> > +#ifdef CONFIG_HUGETLB_PAGE
> > +	struct folio *folio;
> > +	int err;
> > +
> > +	if (is_file_hugepages(file)) {
> > +		folio = alloc_hugetlb_folio_nodemask(hstate_file(file),
> > +						     NUMA_NO_NODE,
> > +						     NULL,
> > +						     GFP_USER);
> > +		if (folio && folio_try_get(folio)) {
> > +			err = hugetlb_add_to_page_cache(folio,
> > +							file->f_mapping,
> > +							idx);
> > +			if (err) {
> > +				folio_put(folio);
> > +				free_huge_folio(folio);
> > +				return ERR_PTR(err);
> > +			}
> > +			return &folio->page;
> 
> While looking at the user of pin_user_pages_fd(), I realized something:
> 
> Assume idx is not aligned to the hugetlb page size.
> find_get_page_flags() would always return a tail page in that case, but
> you'd be returning the head page here.
> 
> See pagecache_get_page()->folio_file_page(folio, index);
Thank you for catching this. Looking at how udmabuf uses this API for
the hugetlb case:

	hpstate = hstate_file(memfd);
	mapidx = list[i].offset >> huge_page_shift(hpstate);
	do {
		nr_pages = shmem_file(memfd) ? pgcnt : 1;
		ret = pin_user_pages_fd(memfd, mapidx, nr_pages,
					ubuf->pages + pgbuf);

As the raw file offset is translated into huge page size units
(mapidx), I was expecting find_get_page_flags() to return a head page,
but I did not realize that it now returns tail pages; IIRC, it returned
head pages in the previous kernel versions I had tested. As my goal is
to grab only the head pages, __filemap_get_folio() seems like the right
API to use instead of find_get_page_flags(). With this change, the
hugetlb subtest (which I had not tested with kernels >= 6.7) that fails
with kernel 6.7-rc1 now works as expected.

> 
> > +		}
> > +		return ERR_PTR(-ENOMEM);
> > +	}
> > +#endif
> > +	return shmem_read_mapping_page(file->f_mapping, idx);
> > +}
> > +
> > +/**
> > + * pin_user_pages_fd() - pin user pages associated with a file
> > + * @file:       the file whose pages are to be pinned
> > + * @start:      starting file offset
> > + * @nr_pages:   number of pages from start to pin
> > + * @pages:      array that receives pointers to the pages pinned.
> > + *              Should be at-least nr_pages long.
> > + *
> > + * Attempt to pin pages associated with a file that belongs to either
> shmem
> > + * or hugetlb. The pages are either found in the page cache or allocated if
> > + * necessary. Once the pages are located, they are all pinned via FOLL_PIN.
> > + * And, these pinned pages need to be released either using
> unpin_user_pages()
> > + * or unpin_user_page().
> > + *
> > + * It must be noted that the pages may be pinned for an indefinite amount
> > + * of time. And, in most cases, the duration of time they may stay pinned
> > + * would be controlled by the userspace. This behavior is effectively the
> > + * same as using FOLL_LONGTERM with other GUP APIs.
> > + *
> > + * Returns number of pages pinned. This would be equal to the number of
> > + * pages requested. If no pages were pinned, it returns -errno.
> > + */
> > +long pin_user_pages_fd(struct file *file, pgoff_t start,
> > +		       unsigned long nr_pages, struct page **pages)
> > +{
> > +	struct page *page;
> > +	unsigned int flags, i;
> > +	long ret;
> > +
> > +	if (start < 0)
> > +		return -EINVAL;
> > +
> > +	if (!file)
> > +	    return -EINVAL;
> > +
> > +	if (!shmem_file(file) && !is_file_hugepages(file))
> > +	    return -EINVAL;
> > +
> > +	flags = memalloc_pin_save();
> > +	do {
> > +		for (i = 0; i < nr_pages; i++) {
> > +			/*
> > + 			 * In most cases, we should be able to find the page
> > +			 * in the page cache. If we cannot find it, we try to
> > +			 * allocate one and add it to the page cache.
> > +			 */
> > +			page = find_get_page_flags(file->f_mapping,
> > +						   start + i,
> > +						   FGP_ACCESSED);
> > +			if (!page) {
> > +				page = alloc_file_page(file, start + i);
> > +				if (IS_ERR(page)) {
> > +					ret = PTR_ERR(page);
> > +					goto err;
> 
> While looking at above, I do wonder: what if two parties tried to alloc
> the page at the same time? I suspect we'd want to handle -EEXIST a bit
> nicer here, right?
At least with the udmabuf use-case, there cannot be multiple entities
allocating and adding a page at a given offset in the memfd at the same
time. However, I can make the following changes to protect against this
potential failure mode:
        do {
                for (i = 0; i < nr_pages; i++) {
+retry:
+                       page = NULL;
                        /*
                         * In most cases, we should be able to find the page
                         * in the page cache. If we cannot find it, we try to
                         * allocate one and add it to the page cache.
                         */
-                       page = find_get_page_flags(file->f_mapping,
-                                                  start + i,
-                                                  FGP_ACCESSED);
+                       folio = __filemap_get_folio(file->f_mapping,
+                                                   start + i,
+                                                   FGP_ACCESSED, 0);
+                       if (!IS_ERR(folio))
+                               page = &folio->page;
+
                        if (!page) {
                                page = alloc_file_page(file, start + i);
                                if (IS_ERR(page)) {
                                        ret = PTR_ERR(page);
+                                       if (ret == -EEXIST)
+                                               goto retry;
                                        goto err;
                                }

Thanks,
Vivek

> 
> 
> --
> Cheers,
> 
> David / dhildenb


