* [PATCH v7 0/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)
@ 2023-12-12 7:37 Vivek Kasireddy
2023-12-12 7:37 ` [PATCH v7 1/6] udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap Vivek Kasireddy
` (6 more replies)
0 siblings, 7 replies; 18+ messages in thread
From: Vivek Kasireddy @ 2023-12-12 7:37 UTC
To: dri-devel, linux-mm
Cc: Vivek Kasireddy, David Hildenbrand, Christoph Hellwig,
Daniel Vetter, Mike Kravetz, Hugh Dickins, Peter Xu,
Jason Gunthorpe, Gerd Hoffmann, Dongwon Kim, Junxiao Chang
The first two patches were previously reviewed but not yet merged.
They need to be merged first because the fourth patch depends on
the changes they introduce, and they also fix bugs seen in very
specific scenarios (running Qemu with hugetlb=on and blob=true,
then rebooting the guest VM).
The third patch introduces the memfd_pin_folios() API and the fourth
patch converts the udmabuf driver to use folios. The fifth patch shows
how the udmabuf driver can make use of the new API to longterm-pin
the folios. The last patch adds two new udmabuf selftests to verify
data coherency after potential page migration.
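For reference, a minimal caller sketch of the new API (illustrative
only: error handling is abbreviated, and everything here other than
memfd_pin_folios(), folio_page() and unpin_user_page() is hypothetical):

long example_pin_memfd(struct file *memfd, unsigned long nr_pages)
{
	struct folio **folios;
	pgoff_t *offsets;
	long pinned, i;

	folios = kvmalloc_array(nr_pages, sizeof(*folios), GFP_KERNEL);
	offsets = kvmalloc_array(nr_pages, sizeof(*offsets), GFP_KERNEL);
	if (!folios || !offsets) {
		pinned = -ENOMEM;
		goto out;
	}

	/* Pin nr_pages pages starting at byte offset 0 of the memfd. */
	pinned = memfd_pin_folios(memfd, 0, nr_pages, folios, offsets);
	if (pinned < 0)
		goto out;

	/* ... use folio_page(folios[i], 0) and offsets[i] here ... */

	/* Each entry holds one pin; release them all. */
	for (i = 0; i < pinned; i++)
		unpin_user_page(folio_page(folios[i], 0));
out:
	kvfree(offsets);
	kvfree(folios);
	return pinned;
}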
v2:
- Updated the first patch to include review feedback from David and
Jason. The main change in this series is the allocation of a page
in the hugetlbfs case if it is not found in the page cache.
v3:
- Made changes to include review feedback from David to improve the
comments and readability of code
- Enclosed the hugepage alloc code with #ifdef CONFIG_HUGETLB_PAGE
v4:
- Augmented the commit message of the udmabuf patch that uses
pin_user_pages_fd()
- Added previously reviewed but unmerged udmabuf patches to this
series
v5:
- Updated the patch that adds pin_user_pages_fd() to include feedback
from David to handle simultaneous users trying to add a huge page
to the mapping
- Replaced find_get_page_flags() with __filemap_get_folio() in the
second and third patches to ensure that we only obtain head pages
from the mapping
v6: (Christoph)
- Renamed the new API to memfd_pin_user_pages()
- Improved the page cache lookup efficiency by using
filemap_get_folios_contig() which uses batches
v7:
- Renamed the new API to memfd_pin_folios() and made it return folios
and offsets (David)
- Added a new preparatory patch to this series to convert udmabuf
driver to use folios
This series is tested using the following methods:
- Run the subtests added in the fifth patch
- Run Qemu (master) with the following options and a few additional
patches to Spice:
qemu-system-x86_64 -m 4096m....
-device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
-spice port=3001,gl=on,disable-ticketing=on,preferred-codec=gstreamer:h264
-object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
-machine memory-backend=mem1
Cc: David Hildenbrand <david@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Junxiao Chang <junxiao.chang@intel.com>
Vivek Kasireddy (6):
udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap
udmabuf: Add back support for mapping hugetlb pages (v6)
mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)
udmabuf: Convert udmabuf driver to use folios
udmabuf: Pin the pages using memfd_pin_folios() API (v5)
selftests/dma-buf/udmabuf: Add tests to verify data after page
migration
drivers/dma-buf/udmabuf.c | 166 +++++++++++-------
include/linux/memfd.h | 5 +
include/linux/mm.h | 3 +
mm/gup.c | 155 ++++++++++++++++
mm/memfd.c | 34 ++++
.../selftests/drivers/dma-buf/udmabuf.c | 151 +++++++++++++++-
6 files changed, 451 insertions(+), 63 deletions(-)
--
2.39.2
* [PATCH v7 1/6] udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap
2023-12-12 7:37 [PATCH v7 0/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7) Vivek Kasireddy
@ 2023-12-12 7:37 ` Vivek Kasireddy
2023-12-12 7:37 ` [PATCH v7 2/6] udmabuf: Add back support for mapping hugetlb pages (v6) Vivek Kasireddy
` (5 subsequent siblings)
6 siblings, 0 replies; 18+ messages in thread
From: Vivek Kasireddy @ 2023-12-12 7:37 UTC
To: dri-devel, linux-mm
Cc: Vivek Kasireddy, David Hildenbrand, Daniel Vetter, Mike Kravetz,
Hugh Dickins, Peter Xu, Jason Gunthorpe, Gerd Hoffmann,
Dongwon Kim, Junxiao Chang
Add VM_PFNMAP to vm_flags in the mmap handler to ensure that
the mappings are managed without using struct page.
And, in the vm_fault handler, use vmf_insert_pfn to share the
page's pfn with userspace instead of directly sharing the page
(via struct page *).
Cc: David Hildenbrand <david@redhat.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Junxiao Chang <junxiao.chang@intel.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
drivers/dma-buf/udmabuf.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index c40645999648..820c993c8659 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -35,12 +35,13 @@ static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
struct vm_area_struct *vma = vmf->vma;
struct udmabuf *ubuf = vma->vm_private_data;
pgoff_t pgoff = vmf->pgoff;
+ unsigned long pfn;
if (pgoff >= ubuf->pagecount)
return VM_FAULT_SIGBUS;
- vmf->page = ubuf->pages[pgoff];
- get_page(vmf->page);
- return 0;
+
+ pfn = page_to_pfn(ubuf->pages[pgoff]);
+ return vmf_insert_pfn(vma, vmf->address, pfn);
}
static const struct vm_operations_struct udmabuf_vm_ops = {
@@ -56,6 +57,7 @@ static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma)
vma->vm_ops = &udmabuf_vm_ops;
vma->vm_private_data = ubuf;
+ vm_flags_set(vma, VM_PFNMAP | VM_DONTEXPAND | VM_DONTDUMP);
return 0;
}
--
2.39.2
* [PATCH v7 2/6] udmabuf: Add back support for mapping hugetlb pages (v6)
2023-12-12 7:37 [PATCH v7 0/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7) Vivek Kasireddy
2023-12-12 7:37 ` [PATCH v7 1/6] udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap Vivek Kasireddy
@ 2023-12-12 7:37 ` Vivek Kasireddy
2023-12-12 7:38 ` [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7) Vivek Kasireddy
` (4 subsequent siblings)
6 siblings, 0 replies; 18+ messages in thread
From: Vivek Kasireddy @ 2023-12-12 7:37 UTC
To: dri-devel, linux-mm
Cc: Vivek Kasireddy, David Hildenbrand, Daniel Vetter, Mike Kravetz,
Hugh Dickins, Peter Xu, Jason Gunthorpe, Gerd Hoffmann,
Dongwon Kim, Junxiao Chang
A user or admin can configure a VMM (Qemu) Guest's memory to be
backed by hugetlb pages for various reasons. However, a Guest OS
would still allocate (and pin) buffers that are backed by regular
4k-sized pages. In order to map these buffers and create dma-bufs
for them on the Host, we first need to find the hugetlb pages where
the buffer allocations are located, then determine the offsets of
the individual chunks within those pages, and use this information
to eventually populate a scatterlist.
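To make that offset math concrete, here is a sketch of the split that
handle_hugetlb_pages() in the diff below performs (an illustrative
fragment, assuming a 2MB huge page size):

	struct hstate *hpstate = hstate_file(memfd);
	/* index of the huge page containing the byte offset */
	pgoff_t mapidx = offset >> huge_page_shift(hpstate);	/* offset >> 21 */
	/* 4k sub-page offset of the chunk inside that huge page */
	pgoff_t subpgoff = (offset & ~huge_page_mask(hpstate)) >> PAGE_SHIFT;

	/* scale mapidx to the PAGE_SIZE-based page cache index that
	 * __filemap_get_folio() expects */
	mapidx <<= huge_page_order(hpstate);			/* mapidx << 9 */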
Testcase: default_hugepagesz=2M hugepagesz=2M hugepages=2500 options
were passed to the Host kernel and Qemu was launched with these
relevant options: qemu-system-x86_64 -m 4096m....
-device virtio-gpu-pci,max_outputs=1,blob=true,xres=1920,yres=1080
-display gtk,gl=on
-object memory-backend-memfd,hugetlb=on,id=mem1,size=4096M
-machine memory-backend=mem1
Replacing -display gtk,gl=on with -display gtk,gl=off above would
exercise the mmap handler.
v2: Updated get_sg_table() to manually populate the scatterlist for
both huge page and non-huge-page cases.
v3: s/offsets/subpgoff/g
s/hpoff/mapidx/g
v4: Replaced find_get_page_flags() with __filemap_get_folio() to
ensure that we only obtain head pages from the mapping
v5: Fix the calculation of mapidx to ensure that it is an order-n
page multiple
v6:
- Split the processing of hugetlb or shmem pages into helpers to
simplify the code in udmabuf_create() (Christoph)
- Move the creation of offsets array out of hugetlb context and
into common code
Cc: David Hildenbrand <david@redhat.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Junxiao Chang <junxiao.chang@intel.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com> (v2)
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
drivers/dma-buf/udmabuf.c | 122 +++++++++++++++++++++++++++++++-------
1 file changed, 101 insertions(+), 21 deletions(-)
diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index 820c993c8659..274defd3fa3e 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -10,6 +10,7 @@
#include <linux/miscdevice.h>
#include <linux/module.h>
#include <linux/shmem_fs.h>
+#include <linux/hugetlb.h>
#include <linux/slab.h>
#include <linux/udmabuf.h>
#include <linux/vmalloc.h>
@@ -28,6 +29,7 @@ struct udmabuf {
struct page **pages;
struct sg_table *sg;
struct miscdevice *device;
+ pgoff_t *offsets;
};
static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
@@ -41,6 +43,8 @@ static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
return VM_FAULT_SIGBUS;
pfn = page_to_pfn(ubuf->pages[pgoff]);
+ pfn += ubuf->offsets[pgoff] >> PAGE_SHIFT;
+
return vmf_insert_pfn(vma, vmf->address, pfn);
}
@@ -90,23 +94,29 @@ static struct sg_table *get_sg_table(struct device *dev, struct dma_buf *buf,
{
struct udmabuf *ubuf = buf->priv;
struct sg_table *sg;
+ struct scatterlist *sgl;
+ unsigned int i = 0;
int ret;
sg = kzalloc(sizeof(*sg), GFP_KERNEL);
if (!sg)
return ERR_PTR(-ENOMEM);
- ret = sg_alloc_table_from_pages(sg, ubuf->pages, ubuf->pagecount,
- 0, ubuf->pagecount << PAGE_SHIFT,
- GFP_KERNEL);
+
+ ret = sg_alloc_table(sg, ubuf->pagecount, GFP_KERNEL);
if (ret < 0)
- goto err;
+ goto err_alloc;
+
+ for_each_sg(sg->sgl, sgl, ubuf->pagecount, i)
+ sg_set_page(sgl, ubuf->pages[i], PAGE_SIZE, ubuf->offsets[i]);
+
ret = dma_map_sgtable(dev, sg, direction, 0);
if (ret < 0)
- goto err;
+ goto err_map;
return sg;
-err:
+err_map:
sg_free_table(sg);
+err_alloc:
kfree(sg);
return ERR_PTR(ret);
}
@@ -143,6 +153,7 @@ static void release_udmabuf(struct dma_buf *buf)
for (pg = 0; pg < ubuf->pagecount; pg++)
put_page(ubuf->pages[pg]);
+ kfree(ubuf->offsets);
kfree(ubuf->pages);
kfree(ubuf);
}
@@ -196,17 +207,77 @@ static const struct dma_buf_ops udmabuf_ops = {
#define SEALS_WANTED (F_SEAL_SHRINK)
#define SEALS_DENIED (F_SEAL_WRITE)
+static int handle_hugetlb_pages(struct udmabuf *ubuf, struct file *memfd,
+ pgoff_t offset, pgoff_t pgcnt,
+ pgoff_t *pgbuf)
+{
+ struct hstate *hpstate = hstate_file(memfd);
+ pgoff_t mapidx = offset >> huge_page_shift(hpstate);
+ pgoff_t subpgoff = (offset & ~huge_page_mask(hpstate)) >> PAGE_SHIFT;
+ pgoff_t maxsubpgs = huge_page_size(hpstate) >> PAGE_SHIFT;
+ struct page *hpage = NULL;
+ struct folio *folio;
+ pgoff_t pgidx;
+
+ mapidx <<= huge_page_order(hpstate);
+ for (pgidx = 0; pgidx < pgcnt; pgidx++) {
+ if (!hpage) {
+ folio = __filemap_get_folio(memfd->f_mapping,
+ mapidx,
+ FGP_ACCESSED, 0);
+ if (IS_ERR(folio))
+ return PTR_ERR(folio);
+
+ hpage = &folio->page;
+ }
+
+ get_page(hpage);
+ ubuf->pages[*pgbuf] = hpage;
+ ubuf->offsets[*pgbuf] = subpgoff << PAGE_SHIFT;
+ (*pgbuf)++;
+ if (++subpgoff == maxsubpgs) {
+ put_page(hpage);
+ hpage = NULL;
+ subpgoff = 0;
+ mapidx += pages_per_huge_page(hpstate);
+ }
+ }
+
+ if (hpage)
+ put_page(hpage);
+
+ return 0;
+}
+
+static int handle_shmem_pages(struct udmabuf *ubuf, struct file *memfd,
+ pgoff_t offset, pgoff_t pgcnt,
+ pgoff_t *pgbuf)
+{
+ pgoff_t pgidx, pgoff = offset >> PAGE_SHIFT;
+ struct page *page;
+
+ for (pgidx = 0; pgidx < pgcnt; pgidx++) {
+ page = shmem_read_mapping_page(memfd->f_mapping,
+ pgoff + pgidx);
+ if (IS_ERR(page))
+ return PTR_ERR(page);
+
+ ubuf->pages[*pgbuf] = page;
+ (*pgbuf)++;
+ }
+
+ return 0;
+}
+
static long udmabuf_create(struct miscdevice *device,
struct udmabuf_create_list *head,
struct udmabuf_create_item *list)
{
DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
struct file *memfd = NULL;
- struct address_space *mapping = NULL;
struct udmabuf *ubuf;
struct dma_buf *buf;
- pgoff_t pgoff, pgcnt, pgidx, pgbuf = 0, pglimit;
- struct page *page;
+ pgoff_t pgcnt, pgbuf = 0, pglimit;
int seals, ret = -EINVAL;
u32 i, flags;
@@ -234,6 +305,12 @@ static long udmabuf_create(struct miscdevice *device,
ret = -ENOMEM;
goto err;
}
+ ubuf->offsets = kcalloc(ubuf->pagecount, sizeof(*ubuf->offsets),
+ GFP_KERNEL);
+ if (!ubuf->offsets) {
+ ret = -ENOMEM;
+ goto err;
+ }
pgbuf = 0;
for (i = 0; i < head->count; i++) {
@@ -241,8 +318,7 @@ static long udmabuf_create(struct miscdevice *device,
memfd = fget(list[i].memfd);
if (!memfd)
goto err;
- mapping = memfd->f_mapping;
- if (!shmem_mapping(mapping))
+ if (!shmem_file(memfd) && !is_file_hugepages(memfd))
goto err;
seals = memfd_fcntl(memfd, F_GET_SEALS, 0);
if (seals == -EINVAL)
@@ -251,16 +327,19 @@ static long udmabuf_create(struct miscdevice *device,
if ((seals & SEALS_WANTED) != SEALS_WANTED ||
(seals & SEALS_DENIED) != 0)
goto err;
- pgoff = list[i].offset >> PAGE_SHIFT;
- pgcnt = list[i].size >> PAGE_SHIFT;
- for (pgidx = 0; pgidx < pgcnt; pgidx++) {
- page = shmem_read_mapping_page(mapping, pgoff + pgidx);
- if (IS_ERR(page)) {
- ret = PTR_ERR(page);
- goto err;
- }
- ubuf->pages[pgbuf++] = page;
- }
+
+ pgcnt = list[i].size >> PAGE_SHIFT;
+ if (is_file_hugepages(memfd))
+ ret = handle_hugetlb_pages(ubuf, memfd,
+ list[i].offset,
+ pgcnt, &pgbuf);
+ else
+ ret = handle_shmem_pages(ubuf, memfd,
+ list[i].offset,
+ pgcnt, &pgbuf);
+ if (ret < 0)
+ goto err;
+
fput(memfd);
memfd = NULL;
}
@@ -287,6 +366,7 @@ static long udmabuf_create(struct miscdevice *device,
put_page(ubuf->pages[--pgbuf]);
if (memfd)
fput(memfd);
+ kfree(ubuf->offsets);
kfree(ubuf->pages);
kfree(ubuf);
return ret;
--
2.39.2
* [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)
2023-12-12 7:37 [PATCH v7 0/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7) Vivek Kasireddy
2023-12-12 7:37 ` [PATCH v7 1/6] udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap Vivek Kasireddy
2023-12-12 7:37 ` [PATCH v7 2/6] udmabuf: Add back support for mapping hugetlb pages (v6) Vivek Kasireddy
@ 2023-12-12 7:38 ` Vivek Kasireddy
2023-12-12 12:13 ` David Hildenbrand
` (2 more replies)
2023-12-12 7:38 ` [PATCH v7 4/6] udmabuf: Convert udmabuf driver to use folios Vivek Kasireddy
` (3 subsequent siblings)
6 siblings, 3 replies; 18+ messages in thread
From: Vivek Kasireddy @ 2023-12-12 7:38 UTC
To: dri-devel, linux-mm
Cc: Vivek Kasireddy, David Hildenbrand, Christoph Hellwig,
Daniel Vetter, Mike Kravetz, Hugh Dickins, Peter Xu,
Gerd Hoffmann, Dongwon Kim, Junxiao Chang, Jason Gunthorpe,
Christoph Hellwig
For drivers that would like to longterm-pin the folios associated
with a memfd, the memfd_pin_folios() API provides an option to
not only pin the folios via FOLL_PIN but also to check and migrate
them if they reside in the movable zone or a CMA block. This API
currently works with memfds, but it should work with any file
that belongs to either shmem or hugetlbfs. Files belonging to
other filesystems are rejected for now.
The folios need to be located before they can be pinned via FOLL_PIN.
If they are found in the page cache, they can be pinned immediately.
Otherwise, they need to be allocated using the filesystem-specific
APIs and then pinned.
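Reduced to a sketch, that lookup-or-allocate loop has the following
shape (illustrative only; see memfd_pin_folios() in the diff below
for the real thing):

	while (start_idx <= end_idx) {
		/* batched page cache lookup */
		nr_folios = filemap_get_folios_contig(memfd->f_mapping,
						      &start_idx, end_idx,
						      &fbatch);
		if (!nr_folios) {
			/* nothing in the cache: allocate and add the folio;
			 * -EEXIST means another user raced us, so just
			 * retry the lookup */
			folio = memfd_alloc_folio(memfd, start_idx);
			continue;
		}
		/* pin each folio in fbatch via FOLL_PIN and record the
		 * per-page offsets ... */
	}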
v2:
- Drop gup_flags and improve comments and commit message (David)
- Allocate a page if we cannot find in page cache for the hugetlbfs
case as well (David)
- Don't unpin pages if there is a migration related failure (David)
- Drop the unnecessary nr_pages <= 0 check (Jason)
- Have the caller of the API pass in file * instead of fd (Jason)
v3: (David)
- Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE
(Build error reported by kernel test robot <lkp@intel.com>)
- Don't forget memalloc_pin_restore() on non-migration related errors
- Improve the readability of the cleanup code associated with
non-migration related errors
- Augment the comments by describing FOLL_LONGTERM like behavior
- Include the R-b tag from Jason
v4:
- Remove the local variable "page" and instead use 3 return statements
in alloc_file_page() (David)
- Add the R-b tag from David
v5: (David)
- For hugetlb case, ensure that we only obtain head pages from the
mapping by using __filemap_get_folio() instead of find_get_page_flags()
- Handle -EEXIST when two or more potential users try to simultaneously
add a huge page to the mapping by forcing them to retry on failure
v6: (Christoph)
- Rename this API to memfd_pin_user_pages() to make it clear that it
is intended for memfds
- Move the memfd page allocation helper from gup.c to memfd.c
- Fix indentation errors in memfd_pin_user_pages()
- For contiguous ranges of folios, use a helper such as
filemap_get_folios_contig() to lookup the page cache in batches
v7:
- Rename this API to memfd_pin_folios() and make it return folios
and offsets instead of pages (David)
- Don't continue processing the folios in the batch returned by
filemap_get_folios_contig() if they do not have correct next_idx
- Add the R-b tag from Christoph
Cc: David Hildenbrand <david@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Junxiao Chang <junxiao.chang@intel.com>
Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> (v2)
Reviewed-by: David Hildenbrand <david@redhat.com> (v3)
Reviewed-by: Christoph Hellwig <hch@lst.de> (v6)
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
include/linux/memfd.h | 5 ++
include/linux/mm.h | 3 +
mm/gup.c | 155 ++++++++++++++++++++++++++++++++++++++++++
mm/memfd.c | 34 +++++++++
4 files changed, 197 insertions(+)
diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index e7abf6fa4c52..b38fb35f4271 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -6,11 +6,16 @@
#ifdef CONFIG_MEMFD_CREATE
extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg);
+extern struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx);
#else
static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned int a)
{
return -EINVAL;
}
+static inline struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
+{
+ return ERR_PTR(-EINVAL);
+}
#endif
#endif /* __LINUX_MEMFD_H */
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 418d26608ece..537f40989837 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2472,6 +2472,9 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
struct page **pages, unsigned int gup_flags);
long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
struct page **pages, unsigned int gup_flags);
+long memfd_pin_folios(struct file *file, unsigned long start,
+ unsigned long nr_pages, struct folio **folios,
+ pgoff_t *offsets);
int get_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages);
diff --git a/mm/gup.c b/mm/gup.c
index 231711efa390..e798cdbc6a11 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -5,6 +5,7 @@
#include <linux/spinlock.h>
#include <linux/mm.h>
+#include <linux/memfd.h>
#include <linux/memremap.h>
#include <linux/pagemap.h>
#include <linux/rmap.h>
@@ -17,6 +18,7 @@
#include <linux/hugetlb.h>
#include <linux/migrate.h>
#include <linux/mm_inline.h>
+#include <linux/pagevec.h>
#include <linux/sched/mm.h>
#include <linux/shmem_fs.h>
@@ -3410,3 +3412,156 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
&locked, gup_flags);
}
EXPORT_SYMBOL(pin_user_pages_unlocked);
+
+/**
+ * memfd_pin_folios() - pin folios associated with a memfd
+ * @memfd: the memfd whose folios are to be pinned
+ * @start: starting memfd offset
+ * @nr_pages: number of pages from start to pin
+ * @folios: array that receives pointers to the folios pinned.
+ * Should be at-least nr_pages long.
+ * @offsets: array that receives offsets of pages in their folios.
+ * Should be at-least nr_pages long.
+ *
+ * Attempt to pin folios associated with a memfd; given that a memfd is
+ * either backed by shmem or hugetlb, the folios can either be found in
+ * the page cache or need to be allocated if necessary. Once the folios
+ * are located, they are all pinned via FOLL_PIN and the @offsets array
+ * is populated with offsets of the pages in their respective folios.
+ * Therefore, for each page the caller requested, there will be a
+ * corresponding entry in both @folios and @offsets. And, eventually,
+ * these pinned folios need to be released either using unpin_user_pages()
+ * or unpin_user_page().
+ *
+ * It must be noted that the folios may be pinned for an indefinite amount
+ * of time. And, in most cases, the duration of time they may stay pinned
+ * would be controlled by the userspace. This behavior is effectively the
+ * same as using FOLL_LONGTERM with other GUP APIs.
+ *
+ * Returns number of folios pinned. This would be equal to the number of
+ * pages requested. If no folios were pinned, it returns -errno.
+ */
+long memfd_pin_folios(struct file *memfd, unsigned long start,
+ unsigned long nr_pages, struct folio **folios,
+ pgoff_t *offsets)
+{
+ unsigned long end = start + (nr_pages << PAGE_SHIFT) - 1;
+ unsigned int max_pgs, pgoff, pgshift = PAGE_SHIFT;
+ pgoff_t start_idx, end_idx, next_idx;
+ unsigned int flags, nr_folios, i, j;
+ struct folio *folio = NULL;
+ struct folio_batch fbatch;
+ struct page **pages;
+ struct hstate *h;
+ long ret;
+
+ if (!nr_pages)
+ return -EINVAL;
+
+ if (!memfd)
+ return -EINVAL;
+
+ if (!shmem_file(memfd) && !is_file_hugepages(memfd))
+ return -EINVAL;
+
+ pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
+ if (!pages)
+ return -ENOMEM;
+
+ if (is_file_hugepages(memfd)) {
+ h = hstate_file(memfd);
+ pgshift = huge_page_shift(h);
+ }
+
+ flags = memalloc_pin_save();
+ do {
+ i = 0;
+ start_idx = start >> pgshift;
+ end_idx = end >> pgshift;
+ if (is_file_hugepages(memfd)) {
+ start_idx <<= huge_page_order(h);
+ end_idx <<= huge_page_order(h);
+ }
+
+ folio_batch_init(&fbatch);
+ while (start_idx <= end_idx) {
+ /*
+ * In most cases, we should be able to find the folios
+ * in the page cache. If we cannot find them for some
+ * reason, we try to allocate them and add them to the
+ * page cache.
+ */
+ nr_folios = filemap_get_folios_contig(memfd->f_mapping,
+ &start_idx,
+ end_idx,
+ &fbatch);
+ if (folio) {
+ folio_put(folio);
+ folio = NULL;
+ }
+
+ next_idx = 0;
+ for (j = 0; j < nr_folios; j++) {
+ if (next_idx &&
+ next_idx != folio_index(fbatch.folios[j]))
+ continue;
+
+ folio = try_grab_folio(&fbatch.folios[j]->page,
+ 1, FOLL_PIN);
+ if (!folio) {
+ folio_batch_release(&fbatch);
+ kfree(pages);
+ goto err;
+ }
+
+ max_pgs = folio_nr_pages(folio);
+ if (i == 0) {
+ pgoff = offset_in_folio(folio, start);
+ pgoff >>= PAGE_SHIFT;
+ }
+
+ do {
+ folios[i] = folio;
+ offsets[i] = pgoff << PAGE_SHIFT;
+ pages[i] = folio_page(folio, 0);
+ folio_add_pin(folio);
+
+ pgoff++;
+ i++;
+ } while (pgoff < max_pgs && i < nr_pages);
+
+ pgoff = 0;
+ next_idx = folio_next_index(folio);
+ gup_put_folio(folio, 1, FOLL_PIN);
+ }
+
+ folio = NULL;
+ folio_batch_release(&fbatch);
+ if (!nr_folios) {
+ folio = memfd_alloc_folio(memfd, start_idx);
+ if (IS_ERR(folio)) {
+ ret = PTR_ERR(folio);
+ if (ret != -EEXIST) {
+ kfree(pages);
+ goto err;
+ }
+ }
+ }
+ }
+
+ ret = check_and_migrate_movable_pages(nr_pages, pages);
+ } while (ret == -EAGAIN);
+
+ kfree(pages);
+ memalloc_pin_restore(flags);
+ return ret ? ret : nr_pages;
+err:
+ memalloc_pin_restore(flags);
+ while (i-- > 0)
+ if (folios[i])
+ gup_put_folio(folios[i], 1, FOLL_PIN);
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(memfd_pin_folios);
+
diff --git a/mm/memfd.c b/mm/memfd.c
index d3a1ba4208c9..36a75e8249f8 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -63,6 +63,40 @@ static void memfd_tag_pins(struct xa_state *xas)
xas_unlock_irq(xas);
}
+/*
+ * This is a helper function used by memfd_pin_folios() in GUP (gup.c).
+ * It is mainly called to allocate a folio in a memfd when the caller
+ * (memfd_pin_folios()) cannot find a folio in the page cache at a given
+ * index in the mapping.
+ */
+struct folio *memfd_alloc_folio(struct file *memfd, pgoff_t idx)
+{
+#ifdef CONFIG_HUGETLB_PAGE
+ struct folio *folio;
+ int err;
+
+ if (is_file_hugepages(memfd)) {
+ folio = alloc_hugetlb_folio_nodemask(hstate_file(memfd),
+ NUMA_NO_NODE,
+ NULL,
+ GFP_USER);
+ if (folio && folio_try_get(folio)) {
+ err = hugetlb_add_to_page_cache(folio,
+ memfd->f_mapping,
+ idx);
+ if (err) {
+ folio_put(folio);
+ free_huge_folio(folio);
+ return ERR_PTR(err);
+ }
+ return folio;
+ }
+ return ERR_PTR(-ENOMEM);
+ }
+#endif
+ return shmem_read_folio(memfd->f_mapping, idx);
+}
+
/*
* Setting SEAL_WRITE requires us to verify there's no pending writer. However,
* via get_user_pages(), drivers might have some pending I/O without any active
--
2.39.2
* [PATCH v7 4/6] udmabuf: Convert udmabuf driver to use folios
2023-12-12 7:37 [PATCH v7 0/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7) Vivek Kasireddy
` (2 preceding siblings ...)
2023-12-12 7:38 ` [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7) Vivek Kasireddy
@ 2023-12-12 7:38 ` Vivek Kasireddy
2023-12-13 18:00 ` Matthew Wilcox
2023-12-12 7:38 ` [PATCH v7 5/6] udmabuf: Pin the pages using memfd_pin_folios() API (v5) Vivek Kasireddy
` (2 subsequent siblings)
6 siblings, 1 reply; 18+ messages in thread
From: Vivek Kasireddy @ 2023-12-12 7:38 UTC
To: dri-devel, linux-mm
Cc: Vivek Kasireddy, David Hildenbrand, Daniel Vetter, Mike Kravetz,
Hugh Dickins, Peter Xu, Jason Gunthorpe, Gerd Hoffmann,
Dongwon Kim, Junxiao Chang
This is mainly a preparatory patch for using the memfd_pin_folios()
API to pin folios. Using folios instead of pages makes sense as
the udmabuf driver needs to handle both shmem and hugetlb cases.
However, the function vmap_udmabuf() still needs a list of pages;
so, in this case, we collect all the head pages into a local array.
Other changes in this patch include the addition of helpers for
checking the memfd seals and exporting the dmabuf. Moving code from
udmabuf_create() into these helpers improves readability, given
that udmabuf_create() is a bit long.
Cc: David Hildenbrand <david@redhat.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Junxiao Chang <junxiao.chang@intel.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
drivers/dma-buf/udmabuf.c | 139 +++++++++++++++++++++++---------------
1 file changed, 83 insertions(+), 56 deletions(-)
diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index 274defd3fa3e..e1b8da3c9b2a 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -26,7 +26,7 @@ MODULE_PARM_DESC(size_limit_mb, "Max size of a dmabuf, in megabytes. Default is
struct udmabuf {
pgoff_t pagecount;
- struct page **pages;
+ struct folio **folios;
struct sg_table *sg;
struct miscdevice *device;
pgoff_t *offsets;
@@ -42,7 +42,7 @@ static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
if (pgoff >= ubuf->pagecount)
return VM_FAULT_SIGBUS;
- pfn = page_to_pfn(ubuf->pages[pgoff]);
+ pfn = page_to_pfn(&ubuf->folios[pgoff]->page);
pfn += ubuf->offsets[pgoff] >> PAGE_SHIFT;
return vmf_insert_pfn(vma, vmf->address, pfn);
@@ -68,11 +68,21 @@ static int mmap_udmabuf(struct dma_buf *buf, struct vm_area_struct *vma)
static int vmap_udmabuf(struct dma_buf *buf, struct iosys_map *map)
{
struct udmabuf *ubuf = buf->priv;
+ struct page **pages;
void *vaddr;
+ pgoff_t pg;
dma_resv_assert_held(buf->resv);
- vaddr = vm_map_ram(ubuf->pages, ubuf->pagecount, -1);
+ pages = kmalloc_array(ubuf->pagecount, sizeof(*pages), GFP_KERNEL);
+ if (!pages)
+ return -ENOMEM;
+
+ for (pg = 0; pg < ubuf->pagecount; pg++)
+ pages[pg] = &ubuf->folios[pg]->page;
+
+ vaddr = vm_map_ram(pages, ubuf->pagecount, -1);
+ kfree(pages);
if (!vaddr)
return -EINVAL;
@@ -107,7 +117,8 @@ static struct sg_table *get_sg_table(struct device *dev, struct dma_buf *buf,
goto err_alloc;
for_each_sg(sg->sgl, sgl, ubuf->pagecount, i)
- sg_set_page(sgl, ubuf->pages[i], PAGE_SIZE, ubuf->offsets[i]);
+ sg_set_folio(sgl, ubuf->folios[i], PAGE_SIZE,
+ ubuf->offsets[i]);
ret = dma_map_sgtable(dev, sg, direction, 0);
if (ret < 0)
@@ -152,9 +163,9 @@ static void release_udmabuf(struct dma_buf *buf)
put_sg_table(dev, ubuf->sg, DMA_BIDIRECTIONAL);
for (pg = 0; pg < ubuf->pagecount; pg++)
- put_page(ubuf->pages[pg]);
+ folio_put(ubuf->folios[pg]);
kfree(ubuf->offsets);
- kfree(ubuf->pages);
+ kfree(ubuf->folios);
kfree(ubuf);
}
@@ -215,36 +226,33 @@ static int handle_hugetlb_pages(struct udmabuf *ubuf, struct file *memfd,
pgoff_t mapidx = offset >> huge_page_shift(hpstate);
pgoff_t subpgoff = (offset & ~huge_page_mask(hpstate)) >> PAGE_SHIFT;
pgoff_t maxsubpgs = huge_page_size(hpstate) >> PAGE_SHIFT;
- struct page *hpage = NULL;
- struct folio *folio;
+ struct folio *folio = NULL;
pgoff_t pgidx;
mapidx <<= huge_page_order(hpstate);
for (pgidx = 0; pgidx < pgcnt; pgidx++) {
- if (!hpage) {
+ if (!folio) {
folio = __filemap_get_folio(memfd->f_mapping,
mapidx,
FGP_ACCESSED, 0);
if (IS_ERR(folio))
return PTR_ERR(folio);
-
- hpage = &folio->page;
}
- get_page(hpage);
- ubuf->pages[*pgbuf] = hpage;
+ folio_get(folio);
+ ubuf->folios[*pgbuf] = folio;
ubuf->offsets[*pgbuf] = subpgoff << PAGE_SHIFT;
(*pgbuf)++;
if (++subpgoff == maxsubpgs) {
- put_page(hpage);
- hpage = NULL;
+ folio_put(folio);
+ folio = NULL;
subpgoff = 0;
mapidx += pages_per_huge_page(hpstate);
}
}
- if (hpage)
- put_page(hpage);
+ if (folio)
+ folio_put(folio);
return 0;
}
@@ -254,31 +262,70 @@ static int handle_shmem_pages(struct udmabuf *ubuf, struct file *memfd,
pgoff_t *pgbuf)
{
pgoff_t pgidx, pgoff = offset >> PAGE_SHIFT;
- struct page *page;
+ struct folio *folio = NULL;
for (pgidx = 0; pgidx < pgcnt; pgidx++) {
- page = shmem_read_mapping_page(memfd->f_mapping,
- pgoff + pgidx);
- if (IS_ERR(page))
- return PTR_ERR(page);
+ folio = shmem_read_folio(memfd->f_mapping,
+ pgoff + pgidx);
+ if (IS_ERR(folio))
+ return PTR_ERR(folio);
- ubuf->pages[*pgbuf] = page;
+ ubuf->folios[*pgbuf] = folio;
(*pgbuf)++;
}
return 0;
}
+static int check_memfd_seals(struct file *memfd)
+{
+ int seals;
+
+ if (!memfd)
+ return -EBADFD;
+
+ if (!shmem_file(memfd) && !is_file_hugepages(memfd))
+ return -EBADFD;
+
+ seals = memfd_fcntl(memfd, F_GET_SEALS, 0);
+ if (seals == -EINVAL)
+ return -EBADFD;
+
+ if ((seals & SEALS_WANTED) != SEALS_WANTED ||
+ (seals & SEALS_DENIED) != 0)
+ return -EINVAL;
+
+ return 0;
+}
+
+static int export_udmabuf(struct udmabuf *ubuf,
+ struct miscdevice *device,
+ u32 flags)
+{
+ DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
+ struct dma_buf *buf;
+
+ ubuf->device = device;
+ exp_info.ops = &udmabuf_ops;
+ exp_info.size = ubuf->pagecount << PAGE_SHIFT;
+ exp_info.priv = ubuf;
+ exp_info.flags = O_RDWR;
+
+ buf = dma_buf_export(&exp_info);
+ if (IS_ERR(buf))
+ return PTR_ERR(buf);
+
+ return dma_buf_fd(buf, flags);
+}
+
static long udmabuf_create(struct miscdevice *device,
struct udmabuf_create_list *head,
struct udmabuf_create_item *list)
{
- DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
+ pgoff_t pgcnt, pgbuf = 0, pglimit;
struct file *memfd = NULL;
struct udmabuf *ubuf;
- struct dma_buf *buf;
- pgoff_t pgcnt, pgbuf = 0, pglimit;
- int seals, ret = -EINVAL;
+ int ret = -EINVAL;
u32 i, flags;
ubuf = kzalloc(sizeof(*ubuf), GFP_KERNEL);
@@ -299,9 +346,9 @@ static long udmabuf_create(struct miscdevice *device,
if (!ubuf->pagecount)
goto err;
- ubuf->pages = kmalloc_array(ubuf->pagecount, sizeof(*ubuf->pages),
+ ubuf->folios = kmalloc_array(ubuf->pagecount, sizeof(*ubuf->folios),
GFP_KERNEL);
- if (!ubuf->pages) {
+ if (!ubuf->folios) {
ret = -ENOMEM;
goto err;
}
@@ -314,18 +361,8 @@ static long udmabuf_create(struct miscdevice *device,
pgbuf = 0;
for (i = 0; i < head->count; i++) {
- ret = -EBADFD;
memfd = fget(list[i].memfd);
- if (!memfd)
- goto err;
- if (!shmem_file(memfd) && !is_file_hugepages(memfd))
- goto err;
- seals = memfd_fcntl(memfd, F_GET_SEALS, 0);
- if (seals == -EINVAL)
- goto err;
- ret = -EINVAL;
- if ((seals & SEALS_WANTED) != SEALS_WANTED ||
- (seals & SEALS_DENIED) != 0)
+ if (check_memfd_seals(memfd) < 0)
goto err;
pgcnt = list[i].size >> PAGE_SHIFT;
@@ -344,30 +381,20 @@ static long udmabuf_create(struct miscdevice *device,
memfd = NULL;
}
- exp_info.ops = &udmabuf_ops;
- exp_info.size = ubuf->pagecount << PAGE_SHIFT;
- exp_info.priv = ubuf;
- exp_info.flags = O_RDWR;
-
- ubuf->device = device;
- buf = dma_buf_export(&exp_info);
- if (IS_ERR(buf)) {
- ret = PTR_ERR(buf);
+ flags = head->flags & UDMABUF_FLAGS_CLOEXEC ? O_CLOEXEC : 0;
+ ret = export_udmabuf(ubuf, device, flags);
+ if (ret < 0)
goto err;
- }
- flags = 0;
- if (head->flags & UDMABUF_FLAGS_CLOEXEC)
- flags |= O_CLOEXEC;
- return dma_buf_fd(buf, flags);
+ return ret;
err:
while (pgbuf > 0)
- put_page(ubuf->pages[--pgbuf]);
+ folio_put(ubuf->folios[--pgbuf]);
if (memfd)
fput(memfd);
kfree(ubuf->offsets);
- kfree(ubuf->pages);
+ kfree(ubuf->folios);
kfree(ubuf);
return ret;
}
--
2.39.2
* [PATCH v7 5/6] udmabuf: Pin the pages using memfd_pin_folios() API (v5)
2023-12-12 7:37 [PATCH v7 0/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7) Vivek Kasireddy
` (3 preceding siblings ...)
2023-12-12 7:38 ` [PATCH v7 4/6] udmabuf: Convert udmabuf driver to use folios Vivek Kasireddy
@ 2023-12-12 7:38 ` Vivek Kasireddy
2023-12-13 18:03 ` Matthew Wilcox
2023-12-12 7:38 ` [PATCH v7 6/6] selftests/dma-buf/udmabuf: Add tests to verify data after page migration Vivek Kasireddy
2023-12-12 12:15 ` [PATCH v7 0/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7) David Hildenbrand
6 siblings, 1 reply; 18+ messages in thread
From: Vivek Kasireddy @ 2023-12-12 7:38 UTC
To: dri-devel, linux-mm
Cc: Vivek Kasireddy, David Hildenbrand, Daniel Vetter, Mike Kravetz,
Hugh Dickins, Peter Xu, Jason Gunthorpe, Gerd Hoffmann,
Dongwon Kim, Junxiao Chang
Using memfd_pin_folios() ensures that the pages are pinned
correctly via FOLL_PIN. This also ensures that we don't
accidentally break features such as memory hotunplug, because the
API does not allow pinning pages in the movable zone.
Using this new API also simplifies the code, as we no longer have
to deal with extracting individual pages from their mappings or
handling the shmem and hugetlb cases separately.
v2:
- Adjust to the change in signature of pin_user_pages_fd() by
passing in file * instead of fd.
v3:
- Limit the changes in this patch only to those that are required
for using pin_user_pages_fd()
- Slightly improve the commit message
v4:
- Adjust to the change in name of the API (memfd_pin_user_pages)
v5:
- Adjust to the changes in memfd_pin_folios which now populates
a list of folios and offsets
Cc: David Hildenbrand <david@redhat.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Junxiao Chang <junxiao.chang@intel.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
drivers/dma-buf/udmabuf.c | 85 ++++++---------------------------------
1 file changed, 12 insertions(+), 73 deletions(-)
diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index e1b8da3c9b2a..a614e720837d 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -42,7 +42,7 @@ static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
if (pgoff >= ubuf->pagecount)
return VM_FAULT_SIGBUS;
- pfn = page_to_pfn(&ubuf->folios[pgoff]->page);
+ pfn = page_to_pfn(folio_page(ubuf->folios[pgoff], 0));
pfn += ubuf->offsets[pgoff] >> PAGE_SHIFT;
return vmf_insert_pfn(vma, vmf->address, pfn);
@@ -79,7 +79,7 @@ static int vmap_udmabuf(struct dma_buf *buf, struct iosys_map *map)
return -ENOMEM;
for (pg = 0; pg < ubuf->pagecount; pg++)
- pages[pg] = &ubuf->folios[pg]->page;
+ pages[pg] = folio_page(ubuf->folios[pg], 0);
vaddr = vm_map_ram(pages, ubuf->pagecount, -1);
kfree(pages);
@@ -163,7 +163,8 @@ static void release_udmabuf(struct dma_buf *buf)
put_sg_table(dev, ubuf->sg, DMA_BIDIRECTIONAL);
for (pg = 0; pg < ubuf->pagecount; pg++)
- folio_put(ubuf->folios[pg]);
+ unpin_user_page(folio_page(ubuf->folios[pg], 0));
+
kfree(ubuf->offsets);
kfree(ubuf->folios);
kfree(ubuf);
@@ -218,65 +219,6 @@ static const struct dma_buf_ops udmabuf_ops = {
#define SEALS_WANTED (F_SEAL_SHRINK)
#define SEALS_DENIED (F_SEAL_WRITE)
-static int handle_hugetlb_pages(struct udmabuf *ubuf, struct file *memfd,
- pgoff_t offset, pgoff_t pgcnt,
- pgoff_t *pgbuf)
-{
- struct hstate *hpstate = hstate_file(memfd);
- pgoff_t mapidx = offset >> huge_page_shift(hpstate);
- pgoff_t subpgoff = (offset & ~huge_page_mask(hpstate)) >> PAGE_SHIFT;
- pgoff_t maxsubpgs = huge_page_size(hpstate) >> PAGE_SHIFT;
- struct folio *folio = NULL;
- pgoff_t pgidx;
-
- mapidx <<= huge_page_order(hpstate);
- for (pgidx = 0; pgidx < pgcnt; pgidx++) {
- if (!folio) {
- folio = __filemap_get_folio(memfd->f_mapping,
- mapidx,
- FGP_ACCESSED, 0);
- if (IS_ERR(folio))
- return PTR_ERR(folio);
- }
-
- folio_get(folio);
- ubuf->folios[*pgbuf] = folio;
- ubuf->offsets[*pgbuf] = subpgoff << PAGE_SHIFT;
- (*pgbuf)++;
- if (++subpgoff == maxsubpgs) {
- folio_put(folio);
- folio = NULL;
- subpgoff = 0;
- mapidx += pages_per_huge_page(hpstate);
- }
- }
-
- if (folio)
- folio_put(folio);
-
- return 0;
-}
-
-static int handle_shmem_pages(struct udmabuf *ubuf, struct file *memfd,
- pgoff_t offset, pgoff_t pgcnt,
- pgoff_t *pgbuf)
-{
- pgoff_t pgidx, pgoff = offset >> PAGE_SHIFT;
- struct folio *folio = NULL;
-
- for (pgidx = 0; pgidx < pgcnt; pgidx++) {
- folio = shmem_read_folio(memfd->f_mapping,
- pgoff + pgidx);
- if (IS_ERR(folio))
- return PTR_ERR(folio);
-
- ubuf->folios[*pgbuf] = folio;
- (*pgbuf)++;
- }
-
- return 0;
-}
-
static int check_memfd_seals(struct file *memfd)
{
int seals;
@@ -325,7 +267,7 @@ static long udmabuf_create(struct miscdevice *device,
pgoff_t pgcnt, pgbuf = 0, pglimit;
struct file *memfd = NULL;
struct udmabuf *ubuf;
- int ret = -EINVAL;
+ long ret = -EINVAL;
u32 i, flags;
ubuf = kzalloc(sizeof(*ubuf), GFP_KERNEL);
@@ -366,17 +308,13 @@ static long udmabuf_create(struct miscdevice *device,
goto err;
pgcnt = list[i].size >> PAGE_SHIFT;
- if (is_file_hugepages(memfd))
- ret = handle_hugetlb_pages(ubuf, memfd,
- list[i].offset,
- pgcnt, &pgbuf);
- else
- ret = handle_shmem_pages(ubuf, memfd,
- list[i].offset,
- pgcnt, &pgbuf);
+ ret = memfd_pin_folios(memfd, list[i].offset, pgcnt,
+ ubuf->folios + pgbuf,
+ ubuf->offsets + pgbuf);
if (ret < 0)
goto err;
+ pgbuf += pgcnt;
fput(memfd);
memfd = NULL;
}
@@ -389,8 +327,9 @@ static long udmabuf_create(struct miscdevice *device,
return ret;
err:
- while (pgbuf > 0)
- folio_put(ubuf->folios[--pgbuf]);
+ while (pgbuf-- > 0)
+ if (ubuf->folios[pgbuf])
+ unpin_user_page(folio_page(ubuf->folios[pgbuf], 0));
if (memfd)
fput(memfd);
kfree(ubuf->offsets);
--
2.39.2
* [PATCH v7 6/6] selftests/dma-buf/udmabuf: Add tests to verify data after page migration
2023-12-12 7:37 [PATCH v7 0/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7) Vivek Kasireddy
` (4 preceding siblings ...)
2023-12-12 7:38 ` [PATCH v7 5/6] udmabuf: Pin the pages using memfd_pin_folios() API (v5) Vivek Kasireddy
@ 2023-12-12 7:38 ` Vivek Kasireddy
2023-12-12 12:15 ` [PATCH v7 0/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7) David Hildenbrand
6 siblings, 0 replies; 18+ messages in thread
From: Vivek Kasireddy @ 2023-12-12 7:38 UTC
To: dri-devel, linux-mm
Cc: Vivek Kasireddy, Shuah Khan, David Hildenbrand, Daniel Vetter,
Mike Kravetz, Hugh Dickins, Peter Xu, Jason Gunthorpe,
Gerd Hoffmann, Dongwon Kim, Junxiao Chang
Since the memfd pages associated with a udmabuf may be migrated
as part of udmabuf create, we need to verify data coherency after
a successful migration. The new tests added in this patch try to
do just that, using both 4k-sized pages and 2MB-sized huge pages
for the memfd.
Successful completion of the tests would mean that there is no
disconnect between the memfd pages and the ones associated with
a udmabuf. These tests can also be augmented in the future to
exercise newer udmabuf features (such as handling memfd hole punch).
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Gerd Hoffmann <kraxel@redhat.com>
Cc: Dongwon Kim <dongwon.kim@intel.com>
Cc: Junxiao Chang <junxiao.chang@intel.com>
Based-on-patch-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
---
.../selftests/drivers/dma-buf/udmabuf.c | 151 +++++++++++++++++-
1 file changed, 147 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/drivers/dma-buf/udmabuf.c b/tools/testing/selftests/drivers/dma-buf/udmabuf.c
index c812080e304e..d76c813fe652 100644
--- a/tools/testing/selftests/drivers/dma-buf/udmabuf.c
+++ b/tools/testing/selftests/drivers/dma-buf/udmabuf.c
@@ -9,26 +9,132 @@
#include <errno.h>
#include <fcntl.h>
#include <malloc.h>
+#include <stdbool.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
+#include <sys/mman.h>
#include <linux/memfd.h>
#include <linux/udmabuf.h>
#define TEST_PREFIX "drivers/dma-buf/udmabuf"
#define NUM_PAGES 4
+#define NUM_ENTRIES 4
+#define MEMFD_SIZE 1024 /* in pages */
-static int memfd_create(const char *name, unsigned int flags)
+static unsigned int page_size;
+
+static int create_memfd_with_seals(off64_t size, bool hpage)
+{
+ int memfd, ret;
+ unsigned int flags = MFD_ALLOW_SEALING;
+
+ if (hpage)
+ flags |= MFD_HUGETLB;
+
+ memfd = memfd_create("udmabuf-test", flags);
+ if (memfd < 0) {
+ printf("%s: [skip,no-memfd]\n", TEST_PREFIX);
+ exit(77);
+ }
+
+ ret = fcntl(memfd, F_ADD_SEALS, F_SEAL_SHRINK);
+ if (ret < 0) {
+ printf("%s: [skip,fcntl-add-seals]\n", TEST_PREFIX);
+ exit(77);
+ }
+
+ ret = ftruncate(memfd, size);
+ if (ret == -1) {
+ printf("%s: [FAIL,memfd-truncate]\n", TEST_PREFIX);
+ exit(1);
+ }
+
+ return memfd;
+}
+
+static int create_udmabuf_list(int devfd, int memfd, off64_t memfd_size)
+{
+ struct udmabuf_create_list *list;
+ int ubuf_fd, i;
+
+ list = malloc(sizeof(struct udmabuf_create_list) +
+ sizeof(struct udmabuf_create_item) * NUM_ENTRIES);
+ if (!list) {
+ printf("%s: [FAIL, udmabuf-malloc]\n", TEST_PREFIX);
+ exit(1);
+ }
+
+ for (i = 0; i < NUM_ENTRIES; i++) {
+ list->list[i].memfd = memfd;
+ list->list[i].offset = i * (memfd_size / NUM_ENTRIES);
+ list->list[i].size = getpagesize() * NUM_PAGES;
+ }
+
+ list->count = NUM_ENTRIES;
+ list->flags = UDMABUF_FLAGS_CLOEXEC;
+ ubuf_fd = ioctl(devfd, UDMABUF_CREATE_LIST, list);
+ free(list);
+ if (ubuf_fd < 0) {
+ printf("%s: [FAIL, udmabuf-create]\n", TEST_PREFIX);
+ exit(1);
+ }
+
+ return ubuf_fd;
+}
+
+static void write_to_memfd(void *addr, off64_t size, char chr)
+{
+ int i;
+
+ for (i = 0; i < size / page_size; i++) {
+ *((char *)addr + (i * page_size)) = chr;
+ }
+}
+
+static void *mmap_fd(int fd, off64_t size)
{
- return syscall(__NR_memfd_create, name, flags);
+ void *addr;
+
+ addr = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+ if (addr == MAP_FAILED) {
+ printf("%s: ubuf_fd mmap fail\n", TEST_PREFIX);
+ exit(1);
+ }
+
+ return addr;
+}
+
+static int compare_chunks(void *addr1, void *addr2, off64_t memfd_size)
+{
+ off64_t off;
+ int i = 0, j, k = 0, ret = 0;
+ char char1, char2;
+
+ while (i < NUM_ENTRIES) {
+ off = i * (memfd_size / NUM_ENTRIES);
+ for (j = 0; j < NUM_PAGES; j++, k++) {
+ char1 = *((char *)addr1 + off + (j * getpagesize()));
+ char2 = *((char *)addr2 + (k * getpagesize()));
+ if (char1 != char2) {
+ ret = -1;
+ goto err;
+ }
+ }
+ i++;
+ }
+err:
+ munmap(addr1, memfd_size);
+ munmap(addr2, NUM_ENTRIES * NUM_PAGES * getpagesize());
+ return ret;
}
int main(int argc, char *argv[])
{
struct udmabuf_create create;
int devfd, memfd, buf, ret;
- off_t size;
- void *mem;
+ off64_t size;
+ void *addr1, *addr2;
devfd = open("/dev/udmabuf", O_RDWR);
if (devfd < 0) {
@@ -90,6 +196,9 @@ int main(int argc, char *argv[])
}
/* should work */
+ page_size = getpagesize();
+ addr1 = mmap_fd(memfd, size);
+ write_to_memfd(addr1, size, 'a');
create.memfd = memfd;
create.offset = 0;
create.size = size;
@@ -98,6 +207,40 @@ int main(int argc, char *argv[])
printf("%s: [FAIL,test-4]\n", TEST_PREFIX);
exit(1);
}
+ munmap(addr1, size);
+ close(buf);
+ close(memfd);
+
+ /* should work (migration of 4k size pages)*/
+ size = MEMFD_SIZE * page_size;
+ memfd = create_memfd_with_seals(size, false);
+ addr1 = mmap_fd(memfd, size);
+ write_to_memfd(addr1, size, 'a');
+ buf = create_udmabuf_list(devfd, memfd, size);
+ addr2 = mmap_fd(buf, NUM_PAGES * NUM_ENTRIES * getpagesize());
+ write_to_memfd(addr1, size, 'b');
+ ret = compare_chunks(addr1, addr2, size);
+ if (ret < 0) {
+ printf("%s: [FAIL,test-5]\n", TEST_PREFIX);
+ exit(1);
+ }
+ close(buf);
+ close(memfd);
+
+ /* should work (migration of 2MB size huge pages)*/
+ page_size = getpagesize() * 512; /* 2 MB */
+ size = MEMFD_SIZE * page_size;
+ memfd = create_memfd_with_seals(size, true);
+ addr1 = mmap_fd(memfd, size);
+ write_to_memfd(addr1, size, 'a');
+ buf = create_udmabuf_list(devfd, memfd, size);
+ addr2 = mmap_fd(buf, NUM_PAGES * NUM_ENTRIES * getpagesize());
+ write_to_memfd(addr1, size, 'b');
+ ret = compare_chunks(addr1, addr2, size);
+ if (ret < 0) {
+ printf("%s: [FAIL,test-6]\n", TEST_PREFIX);
+ exit(1);
+ }
fprintf(stderr, "%s: ok\n", TEST_PREFIX);
close(buf);
--
2.39.2
* Re: [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)
2023-12-12 7:38 ` [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7) Vivek Kasireddy
@ 2023-12-12 12:13 ` David Hildenbrand
2023-12-13 8:44 ` Kasireddy, Vivek
2023-12-12 13:21 ` kernel test robot
2023-12-12 15:27 ` kernel test robot
2 siblings, 1 reply; 18+ messages in thread
From: David Hildenbrand @ 2023-12-12 12:13 UTC
To: Vivek Kasireddy, dri-devel, linux-mm
Cc: Christoph Hellwig, Daniel Vetter, Mike Kravetz, Hugh Dickins,
Peter Xu, Gerd Hoffmann, Dongwon Kim, Junxiao Chang,
Jason Gunthorpe, Christoph Hellwig
On 12.12.23 08:38, Vivek Kasireddy wrote:
> For drivers that would like to longterm-pin the folios associated
> with a memfd, the memfd_pin_folios() API provides an option to
> not only pin the folios via FOLL_PIN but also to check and migrate
> them if they reside in movable zone or CMA block. This API
> currently works with memfds but it should work with any files
> that belong to either shmemfs or hugetlbfs. Files belonging to
> other filesystems are rejected for now.
>
> The folios need to be located first before pinning them via FOLL_PIN.
> If they are found in the page cache, they can be immediately pinned.
> Otherwise, they need to be allocated using the filesystem specific
> APIs and then pinned.
>
> v2:
> - Drop gup_flags and improve comments and commit message (David)
> - Allocate a page if we cannot find in page cache for the hugetlbfs
> case as well (David)
> - Don't unpin pages if there is a migration related failure (David)
> - Drop the unnecessary nr_pages <= 0 check (Jason)
> - Have the caller of the API pass in file * instead of fd (Jason)
>
> v3: (David)
> - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE
> (Build error reported by kernel test robot <lkp@intel.com>)
> - Don't forget memalloc_pin_restore() on non-migration related errors
> - Improve the readability of the cleanup code associated with
> non-migration related errors
> - Augment the comments by describing FOLL_LONGTERM like behavior
> - Include the R-b tag from Jason
>
> v4:
> - Remove the local variable "page" and instead use 3 return statements
> in alloc_file_page() (David)
> - Add the R-b tag from David
>
> v5: (David)
> - For hugetlb case, ensure that we only obtain head pages from the
> mapping by using __filemap_get_folio() instead of find_get_page_flags()
> - Handle -EEXIST when two or more potential users try to simultaneously
> add a huge page to the mapping by forcing them to retry on failure
>
> v6: (Christoph)
> - Rename this API to memfd_pin_user_pages() to make it clear that it
> is intended for memfds
> - Move the memfd page allocation helper from gup.c to memfd.c
> - Fix indentation errors in memfd_pin_user_pages()
> - For contiguous ranges of folios, use a helper such as
> filemap_get_folios_contig() to lookup the page cache in batches
>
> v7:
> - Rename this API to memfd_pin_folios() and make it return folios
> and offsets instead of pages (David)
> - Don't continue processing the folios in the batch returned by
> filemap_get_folios_contig() if they do not have correct next_idx
> - Add the R-b tag from Christoph
>
Sorry, I'm still not happy about the current state, because (1) the
folio vs. pages handling is still mixed and (2) we're returning+pinning
a large folio multiple times.
See below for what might be an easy way to clean this up.
> @@ -5,6 +5,7 @@
> #include <linux/spinlock.h>
>
> #include <linux/mm.h>
> +#include <linux/memfd.h>
> #include <linux/memremap.h>
> #include <linux/pagemap.h>
> #include <linux/rmap.h>
> @@ -17,6 +18,7 @@
> #include <linux/hugetlb.h>
> #include <linux/migrate.h>
> #include <linux/mm_inline.h>
> +#include <linux/pagevec.h>
> #include <linux/sched/mm.h>
> #include <linux/shmem_fs.h>
>
> @@ -3410,3 +3412,156 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> &locked, gup_flags);
> }
> EXPORT_SYMBOL(pin_user_pages_unlocked);
> +
> +/**
> + * memfd_pin_folios() - pin folios associated with a memfd
> + * @memfd: the memfd whose folios are to be pinned
> + * @start: starting memfd offset
> + * @nr_pages: number of pages from start to pin
We're not pinning pages. An inclusive range [start, end] would be clearer.
> + * @folios: array that receives pointers to the folios pinned.
> + * Should be at-least nr_pages long.
> + * @offsets: array that receives offsets of pages in their folios.
> + * Should be at-least nr_pages long.
See below, I'm wondering if this is really required once we return each folio
only once.
> + *
> + * Attempt to pin folios associated with a memfd; given that a memfd is
> + * either backed by shmem or hugetlb, the folios can either be found in
> + * the page cache or need to be allocated if necessary. Once the folios
> + * are located, they are all pinned via FOLL_PIN and the @offsets array
> + * is populated with offsets of the pages in their respective folios.
> + * Therefore, for each page the caller requested, there will be a
> + * corresponding entry in both @folios and @offsets. And, eventually,
> + * these pinned folios need to be released either using unpin_user_pages()
> + * or unpin_user_page().
Oh, we don't have a folio function yet? Should be easy to add, and we
should really add them.
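For instance, a minimal sketch of such folio variants (assuming they
would simply wrap gup_put_folio(), the way unpin_user_page() does):

	void unpin_folio(struct folio *folio)
	{
		gup_put_folio(folio, 1, FOLL_PIN);
	}

	void unpin_folios(struct folio **folios, unsigned long nr_folios)
	{
		unsigned long i;

		for (i = 0; i < nr_folios; i++)
			if (folios[i])
				unpin_folio(folios[i]);
	}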
> + *
> + * It must be noted that the folios may be pinned for an indefinite amount
> + * of time. And, in most cases, the duration of time they may stay pinned
> + * would be controlled by the userspace. This behavior is effectively the
> + * same as using FOLL_LONGTERM with other GUP APIs.
> + *
> + * Returns number of folios pinned. This would be equal to the number of
> + * pages requested. If no folios were pinned, it returns -errno.
> + */
> +long memfd_pin_folios(struct file *memfd, unsigned long start,
> + unsigned long nr_pages, struct folio **folios,
> + pgoff_t *offsets)
> +{
> + unsigned long end = start + (nr_pages << PAGE_SHIFT) - 1;
> + unsigned int max_pgs, pgoff, pgshift = PAGE_SHIFT;
> + pgoff_t start_idx, end_idx, next_idx;
> + unsigned int flags, nr_folios, i, j;
> + struct folio *folio = NULL;
> + struct folio_batch fbatch;
> + struct page **pages;
> + struct hstate *h;
> + long ret;
> +
> + if (!nr_pages)
> + return -EINVAL;
> +
> + if (!memfd)
> + return -EINVAL;
> +
> + if (!shmem_file(memfd) && !is_file_hugepages(memfd))
> + return -EINVAL;
> +
> + pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
> + if (!pages)
> + return -ENOMEM;
> +
> + if (is_file_hugepages(memfd)) {
> + h = hstate_file(memfd);
> + pgshift = huge_page_shift(h);
> + }
> +
> + flags = memalloc_pin_save();
> + do {
> + i = 0;
> + start_idx = start >> pgshift;
> + end_idx = end >> pgshift;
> + if (is_file_hugepages(memfd)) {
> + start_idx <<= huge_page_order(h);
> + end_idx <<= huge_page_order(h);
> + }
> +
> + folio_batch_init(&fbatch);
> + while (start_idx <= end_idx) {
> + /*
> + * In most cases, we should be able to find the folios
> + * in the page cache. If we cannot find them for some
> + * reason, we try to allocate them and add them to the
> + * page cache.
> + */
> + nr_folios = filemap_get_folios_contig(memfd->f_mapping,
> + &start_idx,
> + end_idx,
> + &fbatch);
> + if (folio) {
> + folio_put(folio);
> + folio = NULL;
> + }
> +
> + next_idx = 0;
> + for (j = 0; j < nr_folios; j++) {
> + if (next_idx &&
> + next_idx != folio_index(fbatch.folios[j]))
> + continue;
> +
> + folio = try_grab_folio(&fbatch.folios[j]->page,
> + 1, FOLL_PIN);
> + if (!folio) {
> + folio_batch_release(&fbatch);
> + kfree(pages);
> + goto err;
> + }
> +
> + max_pgs = folio_nr_pages(folio);
> + if (i == 0) {
> + pgoff = offset_in_folio(folio, start);
> + pgoff >>= PAGE_SHIFT;
> + }
> +
> + do {
> + folios[i] = folio;
> + offsets[i] = pgoff << PAGE_SHIFT;
> + pages[i] = folio_page(folio, 0);
> + folio_add_pin(folio);
> +
> + pgoff++;
> + i++;
> + } while (pgoff < max_pgs && i < nr_pages);
> +
> + pgoff = 0;
> + next_idx = folio_next_index(folio);
> + gup_put_folio(folio, 1, FOLL_PIN);
> + }
> +
> + folio = NULL;
> + folio_batch_release(&fbatch);
> + if (!nr_folios) {
> + folio = memfd_alloc_folio(memfd, start_idx);
> + if (IS_ERR(folio)) {
> + ret = PTR_ERR(folio);
> + if (ret != -EEXIST) {
> + kfree(pages);
> + goto err;
> + }
> + }
> + }
> + }
> +
> + ret = check_and_migrate_movable_pages(nr_pages, pages);
Having a folio variant would avoid having to mess with pages here at all.
Further, we're now returning+pinning the same folio multiple times, instead of
just once like the folio batching variant would.
I'm wondering if the following wouldn't make more sense, assuming we add
check_and_migrate_movable_folios(), which should be pretty easy to add.
Obviously untested, just to express what I have in mind:
/**
 * memfd_pin_folios() - pin folios associated with a memfd
 * @memfd: the memfd whose folios are to be pinned
 * @start: the starting memfd offset
 * @end: the final memfd offset (inclusive)
 * @folios: array that receives pointers to the folios pinned
 * @max_folios: the number of entries in the array for folios
 * @offset: the offset into the first folio
 *
 * Attempt to pin folios associated with a memfd; given that a memfd is
 * either backed by shmem or hugetlb, the folios can either be found in
 * the page cache or need to be allocated if necessary. Once the folios
 * are located, they are all pinned via FOLL_PIN and @offset is populated
 * with the offset into the first folio.
 *
 * Pinned folios must be released using unpin_folio() or unpin_folios().
 *
 * It must be noted that the folios may be pinned for an indefinite amount
 * of time. And, in most cases, the duration of time they may stay pinned
 * would be controlled by the userspace. This behavior is effectively the
 * same as using FOLL_LONGTERM with other GUP APIs.
 *
 * Returns the number of folios pinned, which might be less than @max_folios
 * only if the whole range was pinned. If no folios were pinned, it returns
 * -errno.
 */
long memfd_pin_folios(struct file *memfd, unsigned long start,
                      unsigned long end, struct folio **folios,
                      unsigned int max_folios, unsigned long *offset)
{
        unsigned int pgshift = PAGE_SHIFT;
        unsigned int flags, nr_folios, cur_folios, i;
        pgoff_t start_idx, end_idx;
        struct folio_batch fbatch;
        struct folio *folio;
        struct hstate *h;
        long ret;

        if (start > end || !max_folios)
                return -EINVAL;

        if (!memfd)
                return -EINVAL;

        if (!shmem_file(memfd) && !is_file_hugepages(memfd))
                return -EINVAL;

        if (is_file_hugepages(memfd)) {
                h = hstate_file(memfd);
                pgshift = huge_page_shift(h);
        }

        flags = memalloc_pin_save();
        folio_batch_init(&fbatch);
        do {
                nr_folios = 0;
                start_idx = start >> pgshift;
                end_idx = end >> pgshift;
                if (is_file_hugepages(memfd)) {
                        start_idx <<= huge_page_order(h);
                        end_idx <<= huge_page_order(h);
                }

                while (start_idx <= end_idx) {
                        /*
                         * In most cases, we should be able to find the folios
                         * in the page cache. If we cannot find them for some
                         * reason, we try to allocate them and add them to the
                         * page cache.
                         */
                        folio_batch_release(&fbatch);
                        cur_folios = filemap_get_folios_contig(memfd->f_mapping,
                                                               &start_idx,
                                                               end_idx,
                                                               &fbatch);
                        if (!cur_folios) {
                                folio = memfd_alloc_folio(memfd, start_idx);
                                if (IS_ERR(folio)) {
                                        ret = PTR_ERR(folio);
                                        if (ret != -EEXIST)
                                                goto err;
                                } else {
                                        folio_put(folio);
                                }
                                continue;
                        }

                        /* Let's pin each folio, which shouldn't really fail. */
                        for (i = 0; i < cur_folios; i++) {
                                folio = try_grab_folio(&fbatch.folios[i]->page,
                                                       1, FOLL_PIN);
                                if (!folio) {
                                        ret = -EINVAL;
                                        goto err;
                                }

                                if (!nr_folios)
                                        *offset = offset_in_folio(folio, start);
                                folios[nr_folios++] = folio;

                                if (max_folios == nr_folios)
                                        break;
                        }
                        if (max_folios == nr_folios)
                                break;
                }
                folio_batch_release(&fbatch);

                ret = check_and_migrate_movable_folios(nr_folios, folios);
        } while (ret == -EAGAIN);

        memalloc_pin_restore(flags);
        return ret ? ret : nr_folios;
err:
        folio_batch_release(&fbatch);
        memalloc_pin_restore(flags);
        while (nr_folios-- > 0)
                if (folios[nr_folios])
                        gup_put_folio(folios[nr_folios], 1, FOLL_PIN);

        return ret;
}
EXPORT_SYMBOL_GPL(memfd_pin_folios);
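
For completeness: until a native folio version exists,
check_and_migrate_movable_folios() could simply bridge to the existing
pages variant. A minimal, equally untested sketch (the bridging approach
is only an assumption, not necessarily how a real version would be done):

static long check_and_migrate_movable_folios(unsigned long nr_folios,
                                             struct folio **folios)
{
        struct page **pages;
        unsigned long i;
        long ret;

        pages = kmalloc_array(nr_folios, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return -ENOMEM;

        /* one head page per folio; the pages variant does the checking */
        for (i = 0; i < nr_folios; i++)
                pages[i] = &folios[i]->page;

        ret = check_and_migrate_movable_pages(nr_folios, pages);
        kfree(pages);
        return ret;
}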
I'm still wondering about the offset handling, though. Could it happen,
while we are repeatedly calling filemap_get_folios_contig(), that we would
need offset != 0 on any of the folios besides the first one? My current
understanding (from looking at filemap_get_folios_contig()) is: no.
I'm primarily concerned about concurrent fallocate(PUNCH_HOLE) and THP collapse/splitting.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v7 0/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)
2023-12-12 7:37 [PATCH v7 0/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7) Vivek Kasireddy
` (5 preceding siblings ...)
2023-12-12 7:38 ` [PATCH v7 6/6] selftests/dma-buf/udmabuf: Add tests to verify data after page migration Vivek Kasireddy
@ 2023-12-12 12:15 ` David Hildenbrand
6 siblings, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2023-12-12 12:15 UTC (permalink / raw)
To: Vivek Kasireddy, dri-devel, linux-mm
Cc: Christoph Hellwig, Daniel Vetter, Mike Kravetz, Hugh Dickins,
Peter Xu, Jason Gunthorpe, Gerd Hoffmann, Dongwon Kim,
Junxiao Chang
On 12.12.23 08:37, Vivek Kasireddy wrote:
> The first two patches were previously reviewed but not yet merged.
> These ones need to be merged first as the fourth patch depends on
> the changes introduced in them and they also fix bugs seen in
> very specific scenarios (running Qemu with hugetlb=on, blob=true
> and rebooting guest VM).
>
> The third patch introduces memfd_pin_folios() API and the fourth
> patch converts udmabuf driver to use folios. The fifth patch shows
> how the udmabuf driver can make use of the new API to longterm-pin
> the folios. The last patch adds two new udmabuf selftests to verify
> data coherency after potential page migration.
>
We should probably get started with the first two patches; they are
independent of the remaining discussion regarding memfd_pin_folios().
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)
2023-12-12 7:38 ` [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7) Vivek Kasireddy
2023-12-12 12:13 ` David Hildenbrand
@ 2023-12-12 13:21 ` kernel test robot
2023-12-12 15:27 ` kernel test robot
2 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2023-12-12 13:21 UTC (permalink / raw)
To: Vivek Kasireddy, dri-devel, linux-mm
Cc: oe-kbuild-all, Dongwon Kim, David Hildenbrand, Daniel Vetter,
Hugh Dickins, Vivek Kasireddy, Peter Xu, Christoph Hellwig,
Gerd Hoffmann, Jason Gunthorpe, Junxiao Chang, Mike Kravetz
Hi Vivek,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Vivek-Kasireddy/udmabuf-Use-vmf_insert_pfn-and-VM_PFNMAP-for-handling-mmap/20231212-160312
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20231212073803.3233055-4-vivek.kasireddy%40intel.com
patch subject: [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)
config: x86_64-allnoconfig (https://download.01.org/0day-ci/archive/20231212/202312122109.SxDFnBaq-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231212/202312122109.SxDFnBaq-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312122109.SxDFnBaq-lkp@intel.com/
All errors (new ones prefixed by >>):
mm/gup.c: In function 'memfd_pin_folios':
>> mm/gup.c:3543:39: error: assignment to 'struct folio *' from incompatible pointer type 'struct page *' [-Werror=incompatible-pointer-types]
3543 | folio = memfd_alloc_folio(memfd, start_idx);
| ^
cc1: some warnings being treated as errors
vim +3543 mm/gup.c
3417
3418 /**
3419 * memfd_pin_folios() - pin folios associated with a memfd
3420 * @memfd: the memfd whose folios are to be pinned
3421 * @start: starting memfd offset
3422 * @nr_pages: number of pages from start to pin
3423 * @folios: array that receives pointers to the folios pinned.
3424 * Should be at-least nr_pages long.
3425 * @offsets: array that receives offsets of pages in their folios.
3426 * Should be at-least nr_pages long.
3427 *
3428 * Attempt to pin folios associated with a memfd; given that a memfd is
3429 * either backed by shmem or hugetlb, the folios can either be found in
3430 * the page cache or need to be allocated if necessary. Once the folios
3431 * are located, they are all pinned via FOLL_PIN and the @offsets array
3432 * is populated with offsets of the pages in their respective folios.
3433 * Therefore, for each page the caller requested, there will be a
3434 * corresponding entry in both @folios and @offsets. And, eventually,
3435 * these pinned folios need to be released either using unpin_user_pages()
3436 * or unpin_user_page().
3437 *
3438 * It must be noted that the folios may be pinned for an indefinite amount
3439 * of time. And, in most cases, the duration of time they may stay pinned
3440 * would be controlled by the userspace. This behavior is effectively the
3441 * same as using FOLL_LONGTERM with other GUP APIs.
3442 *
3443 * Returns number of folios pinned. This would be equal to the number of
3444 * pages requested. If no folios were pinned, it returns -errno.
3445 */
3446 long memfd_pin_folios(struct file *memfd, unsigned long start,
3447 unsigned long nr_pages, struct folio **folios,
3448 pgoff_t *offsets)
3449 {
3450 unsigned long end = start + (nr_pages << PAGE_SHIFT) - 1;
3451 unsigned int max_pgs, pgoff, pgshift = PAGE_SHIFT;
3452 pgoff_t start_idx, end_idx, next_idx;
3453 unsigned int flags, nr_folios, i, j;
3454 struct folio *folio = NULL;
3455 struct folio_batch fbatch;
3456 struct page **pages;
3457 struct hstate *h;
3458 long ret;
3459
3460 if (!nr_pages)
3461 return -EINVAL;
3462
3463 if (!memfd)
3464 return -EINVAL;
3465
3466 if (!shmem_file(memfd) && !is_file_hugepages(memfd))
3467 return -EINVAL;
3468
3469 pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
3470 if (!pages)
3471 return -ENOMEM;
3472
3473 if (is_file_hugepages(memfd)) {
3474 h = hstate_file(memfd);
3475 pgshift = huge_page_shift(h);
3476 }
3477
3478 flags = memalloc_pin_save();
3479 do {
3480 i = 0;
3481 start_idx = start >> pgshift;
3482 end_idx = end >> pgshift;
3483 if (is_file_hugepages(memfd)) {
3484 start_idx <<= huge_page_order(h);
3485 end_idx <<= huge_page_order(h);
3486 }
3487
3488 folio_batch_init(&fbatch);
3489 while (start_idx <= end_idx) {
3490 /*
3491 * In most cases, we should be able to find the folios
3492 * in the page cache. If we cannot find them for some
3493 * reason, we try to allocate them and add them to the
3494 * page cache.
3495 */
3496 nr_folios = filemap_get_folios_contig(memfd->f_mapping,
3497 &start_idx,
3498 end_idx,
3499 &fbatch);
3500 if (folio) {
3501 folio_put(folio);
3502 folio = NULL;
3503 }
3504
3505 next_idx = 0;
3506 for (j = 0; j < nr_folios; j++) {
3507 if (next_idx &&
3508 next_idx != folio_index(fbatch.folios[j]))
3509 continue;
3510
3511 folio = try_grab_folio(&fbatch.folios[j]->page,
3512 1, FOLL_PIN);
3513 if (!folio) {
3514 folio_batch_release(&fbatch);
3515 kfree(pages);
3516 goto err;
3517 }
3518
3519 max_pgs = folio_nr_pages(folio);
3520 if (i == 0) {
3521 pgoff = offset_in_folio(folio, start);
3522 pgoff >>= PAGE_SHIFT;
3523 }
3524
3525 do {
3526 folios[i] = folio;
3527 offsets[i] = pgoff << PAGE_SHIFT;
3528 pages[i] = folio_page(folio, 0);
3529 folio_add_pin(folio);
3530
3531 pgoff++;
3532 i++;
3533 } while (pgoff < max_pgs && i < nr_pages);
3534
3535 pgoff = 0;
3536 next_idx = folio_next_index(folio);
3537 gup_put_folio(folio, 1, FOLL_PIN);
3538 }
3539
3540 folio = NULL;
3541 folio_batch_release(&fbatch);
3542 if (!nr_folios) {
> 3543 folio = memfd_alloc_folio(memfd, start_idx);
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)
2023-12-12 7:38 ` [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7) Vivek Kasireddy
2023-12-12 12:13 ` David Hildenbrand
2023-12-12 13:21 ` kernel test robot
@ 2023-12-12 15:27 ` kernel test robot
2 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2023-12-12 15:27 UTC (permalink / raw)
To: Vivek Kasireddy, dri-devel, linux-mm
Cc: llvm, oe-kbuild-all, Dongwon Kim, David Hildenbrand,
Daniel Vetter, Hugh Dickins, Vivek Kasireddy, Peter Xu,
Christoph Hellwig, Gerd Hoffmann, Jason Gunthorpe, Junxiao Chang,
Mike Kravetz
Hi Vivek,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Vivek-Kasireddy/udmabuf-Use-vmf_insert_pfn-and-VM_PFNMAP-for-handling-mmap/20231212-160312
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20231212073803.3233055-4-vivek.kasireddy%40intel.com
patch subject: [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)
config: i386-allnoconfig (https://download.01.org/0day-ci/archive/20231212/202312122339.viQUjwIW-lkp@intel.com/config)
compiler: clang version 16.0.4 (https://github.com/llvm/llvm-project.git ae42196bc493ffe877a7e3dff8be32035dea4d07)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231212/202312122339.viQUjwIW-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202312122339.viQUjwIW-lkp@intel.com/
All errors (new ones prefixed by >>):
>> mm/gup.c:3543:11: error: incompatible pointer types assigning to 'struct folio *' from 'struct page *' [-Werror,-Wincompatible-pointer-types]
folio = memfd_alloc_folio(memfd, start_idx);
^ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
vim +3543 mm/gup.c
3417
3418 /**
3419 * memfd_pin_folios() - pin folios associated with a memfd
3420 * @memfd: the memfd whose folios are to be pinned
3421 * @start: starting memfd offset
3422 * @nr_pages: number of pages from start to pin
3423 * @folios: array that receives pointers to the folios pinned.
3424 * Should be at-least nr_pages long.
3425 * @offsets: array that receives offsets of pages in their folios.
3426 * Should be at-least nr_pages long.
3427 *
3428 * Attempt to pin folios associated with a memfd; given that a memfd is
3429 * either backed by shmem or hugetlb, the folios can either be found in
3430 * the page cache or need to be allocated if necessary. Once the folios
3431 * are located, they are all pinned via FOLL_PIN and the @offsets array
3432 * is populated with offsets of the pages in their respective folios.
3433 * Therefore, for each page the caller requested, there will be a
3434 * corresponding entry in both @folios and @offsets. And, eventually,
3435 * these pinned folios need to be released either using unpin_user_pages()
3436 * or unpin_user_page().
3437 *
3438 * It must be noted that the folios may be pinned for an indefinite amount
3439 * of time. And, in most cases, the duration of time they may stay pinned
3440 * would be controlled by the userspace. This behavior is effectively the
3441 * same as using FOLL_LONGTERM with other GUP APIs.
3442 *
3443 * Returns number of folios pinned. This would be equal to the number of
3444 * pages requested. If no folios were pinned, it returns -errno.
3445 */
3446 long memfd_pin_folios(struct file *memfd, unsigned long start,
3447 unsigned long nr_pages, struct folio **folios,
3448 pgoff_t *offsets)
3449 {
3450 unsigned long end = start + (nr_pages << PAGE_SHIFT) - 1;
3451 unsigned int max_pgs, pgoff, pgshift = PAGE_SHIFT;
3452 pgoff_t start_idx, end_idx, next_idx;
3453 unsigned int flags, nr_folios, i, j;
3454 struct folio *folio = NULL;
3455 struct folio_batch fbatch;
3456 struct page **pages;
3457 struct hstate *h;
3458 long ret;
3459
3460 if (!nr_pages)
3461 return -EINVAL;
3462
3463 if (!memfd)
3464 return -EINVAL;
3465
3466 if (!shmem_file(memfd) && !is_file_hugepages(memfd))
3467 return -EINVAL;
3468
3469 pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
3470 if (!pages)
3471 return -ENOMEM;
3472
3473 if (is_file_hugepages(memfd)) {
3474 h = hstate_file(memfd);
3475 pgshift = huge_page_shift(h);
3476 }
3477
3478 flags = memalloc_pin_save();
3479 do {
3480 i = 0;
3481 start_idx = start >> pgshift;
3482 end_idx = end >> pgshift;
3483 if (is_file_hugepages(memfd)) {
3484 start_idx <<= huge_page_order(h);
3485 end_idx <<= huge_page_order(h);
3486 }
3487
3488 folio_batch_init(&fbatch);
3489 while (start_idx <= end_idx) {
3490 /*
3491 * In most cases, we should be able to find the folios
3492 * in the page cache. If we cannot find them for some
3493 * reason, we try to allocate them and add them to the
3494 * page cache.
3495 */
3496 nr_folios = filemap_get_folios_contig(memfd->f_mapping,
3497 &start_idx,
3498 end_idx,
3499 &fbatch);
3500 if (folio) {
3501 folio_put(folio);
3502 folio = NULL;
3503 }
3504
3505 next_idx = 0;
3506 for (j = 0; j < nr_folios; j++) {
3507 if (next_idx &&
3508 next_idx != folio_index(fbatch.folios[j]))
3509 continue;
3510
3511 folio = try_grab_folio(&fbatch.folios[j]->page,
3512 1, FOLL_PIN);
3513 if (!folio) {
3514 folio_batch_release(&fbatch);
3515 kfree(pages);
3516 goto err;
3517 }
3518
3519 max_pgs = folio_nr_pages(folio);
3520 if (i == 0) {
3521 pgoff = offset_in_folio(folio, start);
3522 pgoff >>= PAGE_SHIFT;
3523 }
3524
3525 do {
3526 folios[i] = folio;
3527 offsets[i] = pgoff << PAGE_SHIFT;
3528 pages[i] = folio_page(folio, 0);
3529 folio_add_pin(folio);
3530
3531 pgoff++;
3532 i++;
3533 } while (pgoff < max_pgs && i < nr_pages);
3534
3535 pgoff = 0;
3536 next_idx = folio_next_index(folio);
3537 gup_put_folio(folio, 1, FOLL_PIN);
3538 }
3539
3540 folio = NULL;
3541 folio_batch_release(&fbatch);
3542 if (!nr_folios) {
> 3543 folio = memfd_alloc_folio(memfd, start_idx);
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)
2023-12-12 12:13 ` David Hildenbrand
@ 2023-12-13 8:44 ` Kasireddy, Vivek
2023-12-13 12:31 ` Jason Gunthorpe
2023-12-13 15:15 ` David Hildenbrand
0 siblings, 2 replies; 18+ messages in thread
From: Kasireddy, Vivek @ 2023-12-13 8:44 UTC (permalink / raw)
To: David Hildenbrand, dri-devel, linux-mm
Cc: Christoph Hellwig, Daniel Vetter, Mike Kravetz, Hugh Dickins,
Peter Xu, Gerd Hoffmann, Kim, Dongwon, Chang, Junxiao,
Jason Gunthorpe, Christoph Hellwig
Hi David,
>
> On 12.12.23 08:38, Vivek Kasireddy wrote:
> > For drivers that would like to longterm-pin the folios associated
> > with a memfd, the memfd_pin_folios() API provides an option to
> > not only pin the folios via FOLL_PIN but also to check and migrate
> > them if they reside in movable zone or CMA block. This API
> > currently works with memfds but it should work with any files
> > that belong to either shmemfs or hugetlbfs. Files belonging to
> > other filesystems are rejected for now.
> >
> > The folios need to be located first before pinning them via FOLL_PIN.
> > If they are found in the page cache, they can be immediately pinned.
> > Otherwise, they need to be allocated using the filesystem specific
> > APIs and then pinned.
> >
> > v2:
> > - Drop gup_flags and improve comments and commit message (David)
> > - Allocate a page if we cannot find in page cache for the hugetlbfs
> > case as well (David)
> > - Don't unpin pages if there is a migration related failure (David)
> > - Drop the unnecessary nr_pages <= 0 check (Jason)
> > - Have the caller of the API pass in file * instead of fd (Jason)
> >
> > v3: (David)
> > - Enclose the huge page allocation code with #ifdef
> CONFIG_HUGETLB_PAGE
> > (Build error reported by kernel test robot <lkp@intel.com>)
> > - Don't forget memalloc_pin_restore() on non-migration related errors
> > - Improve the readability of the cleanup code associated with
> > non-migration related errors
> > - Augment the comments by describing FOLL_LONGTERM like behavior
> > - Include the R-b tag from Jason
> >
> > v4:
> > - Remove the local variable "page" and instead use 3 return statements
> > in alloc_file_page() (David)
> > - Add the R-b tag from David
> >
> > v5: (David)
> > - For hugetlb case, ensure that we only obtain head pages from the
> > mapping by using __filemap_get_folio() instead of find_get_page_flags()
> > - Handle -EEXIST when two or more potential users try to simultaneously
> > add a huge page to the mapping by forcing them to retry on failure
> >
> > v6: (Christoph)
> > - Rename this API to memfd_pin_user_pages() to make it clear that it
> > is intended for memfds
> > - Move the memfd page allocation helper from gup.c to memfd.c
> > - Fix indentation errors in memfd_pin_user_pages()
> > - For contiguous ranges of folios, use a helper such as
> > filemap_get_folios_contig() to lookup the page cache in batches
> >
> > v7:
> > - Rename this API to memfd_pin_folios() and make it return folios
> > and offsets instead of pages (David)
> > - Don't continue processing the folios in the batch returned by
> > filemap_get_folios_contig() if they do not have correct next_idx
> > - Add the R-b tag from Christoph
> >
>
> Sorry, I'm still not happy about the current state, because (1) the
> folio vs. pages handling is still mixed and (2) we're returning+pinning a
> large folio multiple times.
I can address (1) in a follow-up series and as far as (2) is concerned, my
understanding is that we need to increase the folio's refcount as and
when the folio's tail pages are used. Is this not the case? It appears
this is what unpin_user_pages() expects as well. Do you see any
concern with this?
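
(For illustration, a minimal sketch of the balanced unpin path this
implies -- assuming a udmabuf-like ubuf that stores one folios[] entry
per subpage; the names here are illustrative, not the series' code:)

static void release_udmabuf_folios(struct udmabuf *ubuf)
{
        pgoff_t pg;

        /* one unpin per entry, matching the one pin taken per subpage */
        for (pg = 0; pg < ubuf->pagecount; pg++)
                unpin_user_page(&ubuf->folios[pg]->page);
}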
>
> See below if there is an easy way to clean this up.
>
> >> @@ -5,6 +5,7 @@
> > #include <linux/spinlock.h>
> >
> > #include <linux/mm.h>
> > +#include <linux/memfd.h>
> > #include <linux/memremap.h>
> > #include <linux/pagemap.h>
> > #include <linux/rmap.h>
> > @@ -17,6 +18,7 @@
> > #include <linux/hugetlb.h>
> > #include <linux/migrate.h>
> > #include <linux/mm_inline.h>
> > +#include <linux/pagevec.h>
> > #include <linux/sched/mm.h>
> > #include <linux/shmem_fs.h>
> >
> > @@ -3410,3 +3412,156 @@ long pin_user_pages_unlocked(unsigned long
> start, unsigned long nr_pages,
> > &locked, gup_flags);
> > }
> > EXPORT_SYMBOL(pin_user_pages_unlocked);
> > +
> > +/**
> > + * memfd_pin_folios() - pin folios associated with a memfd
> > + * @memfd: the memfd whose folios are to be pinned
> > + * @start: starting memfd offset
> > + * @nr_pages: number of pages from start to pin
>
> We're not pinning pages. An inclusive range [start, end] would be clearer.
Ok, I'll make this change in the next version.
>
> > + * @folios: array that receives pointers to the folios pinned.
> > + * Should be at-least nr_pages long.
> > + * @offsets: array that receives offsets of pages in their folios.
> > + * Should be at-least nr_pages long.
>
> See below, I'm wondering if this is really required once we return each folio
> only once.
The offsets can be calculated by the caller (udmabuf) as well but doing so
in this interface would prevent special handling in the caller for the hugetlb
case. Please look at patch 5 in this series (udmabuf: Pin the pages using
memfd_pin_folios() API (v5)) for more details as to what I mean.
>
> > + *
> > + * Attempt to pin folios associated with a memfd; given that a memfd is
> > + * either backed by shmem or hugetlb, the folios can either be found in
> > + * the page cache or need to be allocated if necessary. Once the folios
> > + * are located, they are all pinned via FOLL_PIN and the @offsets array
> > + * is populated with offsets of the pages in their respective folios.
> > + * Therefore, for each page the caller requested, there will be a
> > + * corresponding entry in both @folios and @offsets. And, eventually,
> > + * these pinned folios need to be released either using
> unpin_user_pages()
> > + * or unpin_user_page().
>
> Oh, we don't have a folio function yet? Should be easy to add, and we'd
> really add them.
Sure, I'll add them in a follow-up series.
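A minimal, untested sketch of what such helpers might look like,
mirroring the existing unpin_user_page()/unpin_user_pages() pair:

void unpin_folio(struct folio *folio)
{
        gup_put_folio(folio, 1, FOLL_PIN);
}

void unpin_folios(struct folio **folios, unsigned long nfolios)
{
        unsigned long i;

        for (i = 0; i < nfolios; i++)
                if (folios[i])
                        unpin_folio(folios[i]);
}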
>
> > + *
> > + * It must be noted that the folios may be pinned for an indefinite amount
> > + * of time. And, in most cases, the duration of time they may stay pinned
> > + * would be controlled by the userspace. This behavior is effectively the
> > + * same as using FOLL_LONGTERM with other GUP APIs.
> > + *
> > + * Returns number of folios pinned. This would be equal to the number of
> > + * pages requested. If no folios were pinned, it returns -errno.
> > + */
> > +long memfd_pin_folios(struct file *memfd, unsigned long start,
> > + unsigned long nr_pages, struct folio **folios,
> > + pgoff_t *offsets)
> > +{
> > + unsigned long end = start + (nr_pages << PAGE_SHIFT) - 1;
> > + unsigned int max_pgs, pgoff, pgshift = PAGE_SHIFT;
> > + pgoff_t start_idx, end_idx, next_idx;
> > + unsigned int flags, nr_folios, i, j;
> > + struct folio *folio = NULL;
> > + struct folio_batch fbatch;
> > + struct page **pages;
> > + struct hstate *h;
> > + long ret;
> > +
> > + if (!nr_pages)
> > + return -EINVAL;
> > +
> > + if (!memfd)
> > + return -EINVAL;
> > +
> > + if (!shmem_file(memfd) && !is_file_hugepages(memfd))
> > + return -EINVAL;
> > +
> > + pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
> > + if (!pages)
> > + return -ENOMEM;
> > +
> > + if (is_file_hugepages(memfd)) {
> > + h = hstate_file(memfd);
> > + pgshift = huge_page_shift(h);
> > + }
> > +
> > + flags = memalloc_pin_save();
> > + do {
> > + i = 0;
> > + start_idx = start >> pgshift;
> > + end_idx = end >> pgshift;
> > + if (is_file_hugepages(memfd)) {
> > + start_idx <<= huge_page_order(h);
> > + end_idx <<= huge_page_order(h);
> > + }
> > +
> > + folio_batch_init(&fbatch);
> > + while (start_idx <= end_idx) {
> > + /*
> > + * In most cases, we should be able to find the folios
> > + * in the page cache. If we cannot find them for some
> > + * reason, we try to allocate them and add them to the
> > + * page cache.
> > + */
> > + nr_folios = filemap_get_folios_contig(memfd->f_mapping,
> > + &start_idx,
> > + end_idx,
> > + &fbatch);
> > + if (folio) {
> > + folio_put(folio);
> > + folio = NULL;
> > + }
> > +
> > + next_idx = 0;
> > + for (j = 0; j < nr_folios; j++) {
> > + if (next_idx &&
> > + next_idx != folio_index(fbatch.folios[j]))
> > + continue;
> > +
> > + folio = try_grab_folio(&fbatch.folios[j]->page,
> > + 1, FOLL_PIN);
> > + if (!folio) {
> > + folio_batch_release(&fbatch);
> > + kfree(pages);
> > + goto err;
> > + }
> > +
> > + max_pgs = folio_nr_pages(folio);
> > + if (i == 0) {
> > + pgoff = offset_in_folio(folio, start);
> > + pgoff >>= PAGE_SHIFT;
> > + }
> > +
> > + do {
> > + folios[i] = folio;
> > + offsets[i] = pgoff << PAGE_SHIFT;
> > + pages[i] = folio_page(folio, 0);
> > + folio_add_pin(folio);
> > +
> > + pgoff++;
> > + i++;
> > + } while (pgoff < max_pgs && i < nr_pages);
> > +
> > + pgoff = 0;
> > + next_idx = folio_next_index(folio);
> > + gup_put_folio(folio, 1, FOLL_PIN);
> > + }
> > +
> > + folio = NULL;
> > + folio_batch_release(&fbatch);
> > + if (!nr_folios) {
> > + folio = memfd_alloc_folio(memfd, start_idx);
> > + if (IS_ERR(folio)) {
> > + ret = PTR_ERR(folio);
> > + if (ret != -EEXIST) {
> > + kfree(pages);
> > + goto err;
> > + }
> > + }
> > + }
> > + }
> > +
> > + ret = check_and_migrate_movable_pages(nr_pages, pages);
>
> Having a folio variant would avoid having to mess with pages here at all.
> Further, we're now returning+pinning the same folio multiple times, instead
> of
> just once like the folio batching variant would.
It should be possible to pin the folio only once, but I don't see any problem
with pinning it multiple times -- once per subpage used -- as long as it is
unpinned correctly the same number of times. Is this not ok?
>
> I'm wondering if the following wouldn't make more sense, assuming we add
> check_and_migrate_movable_folios(), which should be pretty easy to add.
>
> Obviously untested, just to express what I have in mind:
Thank you for taking the time to do this!
>
>
>
> /**
> * memfd_pin_folios() - pin folios associated with a memfd
> * @memfd: the memfd whose folios are to be pinned
> * @start: the starting memfd offset
> * @end: the final memfd offset (inclusive)
> * @folios: array that receives pointers to the folios pinned
> * @max_folios: the number of entries in the array for folios
> * @offsets: the offset into the first folio
Given that my goal is to do the following in the udmabuf driver:

        ret = sg_alloc_table(sg, ubuf->pagecount, GFP_KERNEL);
        for_each_sg(sg->sgl, sgl, ubuf->pagecount, i)
                sg_set_folio(sgl, ubuf->folios[i], PAGE_SIZE,
                             ubuf->offsets[i]);

        ret = dma_map_sgtable(dev, sg, direction, 0);
That is, populate a scatterlist with ubuf->pagecount entries, where each
segment is of size PAGE_SIZE, in order to be consistent and to support a
wide variety of DMA importers that may not be able to handle segments
larger than PAGE_SIZE.
Therefore, in the hugetlb case, there would be multiple entries pointing to
the same folio with different offsets. The question really is whether these
entries associated with @folios and @offsets would need to be populated
by the caller (udmabuf) or the API (memfd_pin_folios). I have tried both of
these approaches in earlier versions and both work fine, but I think
populating the entries in memfd_pin_folios() is cleaner, as the
caller does not need to do any special handling (hugetlb vs shmem).
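
(For illustration, a sketch of the special handling the caller would
otherwise need -- expanding one-entry-per-folio results into per-PAGE_SIZE
entries for the hugetlb case; all names here are hypothetical:)

static void expand_folios(struct udmabuf *ubuf, struct folio **folios,
                          long nr_folios, unsigned long first_off)
{
        pgoff_t k = 0, pg;
        long i;

        for (i = 0; i < nr_folios && k < ubuf->pagecount; i++) {
                /* skip the subpages before @first_off in the first folio */
                pg = i == 0 ? first_off >> PAGE_SHIFT : 0;

                for (; pg < folio_nr_pages(folios[i]) &&
                       k < ubuf->pagecount; pg++, k++) {
                        ubuf->folios[k] = folios[i];
                        ubuf->offsets[k] = pg << PAGE_SHIFT;
                }
        }
}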
> *
> * Attempt to pin folios associated with a memfd; given that a memfd is
> * either backed by shmem or hugetlb, the folios can either be found in
> * the page cache or need to be allocated if necessary. Once the folios
> * are located, they are all pinned via FOLL_PIN and @offset is populated
> * with the offset into the first folio.
> *
> * Pinned folios must be released using unpin_folio() or unpin_folios().
> *
> * It must be noted that the folios may be pinned for an indefinite amount
> * of time. And, in most cases, the duration of time they may stay pinned
> * would be controlled by the userspace. This behavior is effectively the
> * same as using FOLL_LONGTERM with other GUP APIs.
> *
> * Returns number of folios pinned, which might be less than @max_folios
> * only if the whole range was pinned. If no folios were pinned, it returns
> * -errno.
> */
> long memfd_pin_folios(struct file *memfd, unsigned long start,
> unsigned long end, struct folio **folios,
> unsigned int max_folios, unsigned long *offset)
> [... body of the untested sketch snipped; quoted in full earlier in the thread ...]
>
>
>
> I'm still wondering about the offset handling, though. Could it happen,
> while we are repeatedly calling filemap_get_folios_contig(), that we would
> need offset != 0 on any of the folios besides the first one? My current
> understanding (from looking at filemap_get_folios_contig()) is: no.
I am not entirely sure, but while testing this series with Qemu master + a
kernel snapshot of drm-tip (6.7 RC1), I noticed strange behavior in
filemap_get_folios_contig() and the batches it returns, particularly for
hugetlb folios. Assuming we have order-9 folios in the memfd (my test-case),
and if the range [start, end] cuts across more than one folio: let's say start is
at subpage 490 (in folio-0) and end is at subpage 520 (in folio-1), then start_idx
would be 0 and end_idx would be 512. In this case, I would have expected
filemap_get_folios_contig() to return two entries in the batch that included
folio-0 and folio-1. However, it returned a batch with 15 entries (max batch size)
with all the entries pointing to folio-0. This is why I added the check:
        if (next_idx &&
            next_idx != folio_index(fbatch.folios[j]))
                continue;
Anyway, based on the code you wrote, I have realized that we both have a
different view on how many entries need to be there in the @folios array
for a given range [start, end] in the hugetlb case.
I have assumed that it is highly desirable to have a segment length of
PAGE_SIZE for consistency and interoperability reasons but I guess it might
be ok to do:
        sg_set_folio(sgl, ubuf->folios[i], nr_tails * PAGE_SIZE,
                     ubuf->offsets[i]);
I'll run some experiments to see if this would work in most cases or not.
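
A rough sketch of that folio-granular population, assuming a hypothetical
per-folio lens[] array is kept alongside folios[] and offsets[]:

        ret = sg_alloc_table(sg, nr_folios, GFP_KERNEL);
        if (ret < 0)
                return ret;

        for_each_sg(sg->sgl, sgl, nr_folios, i)
                sg_set_folio(sgl, ubuf->folios[i], ubuf->lens[i],
                             ubuf->offsets[i]);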
>
> I'm primarily concerned about concurrent fallocate(PUNCH_HOLE) and THP
> collapse/splitting.
Could you please elaborate on what the issue would be in this case?
Thanks,
Vivek
>
> --
> Cheers,
>
> David / dhildenb
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)
2023-12-13 8:44 ` Kasireddy, Vivek
@ 2023-12-13 12:31 ` Jason Gunthorpe
2023-12-13 15:36 ` Christoph Hellwig
2023-12-13 15:15 ` David Hildenbrand
1 sibling, 1 reply; 18+ messages in thread
From: Jason Gunthorpe @ 2023-12-13 12:31 UTC (permalink / raw)
To: Kasireddy, Vivek
Cc: David Hildenbrand, dri-devel, linux-mm, Christoph Hellwig,
Daniel Vetter, Mike Kravetz, Hugh Dickins, Peter Xu,
Gerd Hoffmann, Kim, Dongwon, Chang, Junxiao, Christoph Hellwig
On Wed, Dec 13, 2023 at 08:44:51AM +0000, Kasireddy, Vivek wrote:
> That is, populate a scatterlist with ubuf->pagecount entries, where each
> segment is of size PAGE_SIZE, in order to be consistent and to support a
> wide variety of DMA importers that may not be able to handle segments
> larger than PAGE_SIZE.
No! This is totally wrong, sg lists must aggregate up to the limits
specified in the struct device. We have importer helpers that do this
aggregation.
If some driver is working with a sglist and can't handle this, it is
simply broken. Do not mess up core code to accommodate such things.
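
(One exporter-side helper that can do this kind of aggregation is
sg_alloc_table_from_pages_segment(), which coalesces contiguous pages up
to a caller-supplied segment limit; an illustrative sketch only, with the
udmabuf names assumed:)

static int udmabuf_build_sgt(struct device *dev, struct udmabuf *ubuf,
                             struct sg_table *sgt)
{
        /* coalesce contiguous pages up to the device's mapping limit */
        return sg_alloc_table_from_pages_segment(sgt, ubuf->pages,
                                                 ubuf->pagecount, 0,
                                                 ubuf->pagecount << PAGE_SHIFT,
                                                 dma_max_mapping_size(dev),
                                                 GFP_KERNEL);
}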
Jason
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)
2023-12-13 8:44 ` Kasireddy, Vivek
2023-12-13 12:31 ` Jason Gunthorpe
@ 2023-12-13 15:15 ` David Hildenbrand
1 sibling, 0 replies; 18+ messages in thread
From: David Hildenbrand @ 2023-12-13 15:15 UTC (permalink / raw)
To: Kasireddy, Vivek, dri-devel, linux-mm
Cc: Christoph Hellwig, Daniel Vetter, Mike Kravetz, Hugh Dickins,
Peter Xu, Gerd Hoffmann, Kim, Dongwon, Chang, Junxiao,
Jason Gunthorpe, Christoph Hellwig
Hi,
>> Sorry, I'm still not happy about the current state, because (1) the
>> folio vs. pages handling is still mixed and (2) we're returning+pinning a
>> large folio multiple times.
> I can address (1) in a follow-up series and as far as (2) is concerned, my
> understanding is that we need to increase the folio's refcount as and
> when the folio's tail pages are used. Is this not the case? It appears
> this is what unpin_user_pages() expects as well. Do you see any
> concern with this?
If you'd just pin the folio once, you'd also only have to unpin it once.
But that raises a question: is it a requirement for the user of this
interface to be able to unmap+unpin each individual page?
If you really want to handle each subpage possibly individually, then
subpage or folio+offset makes more sense, agreed.
>
>>
>> See below if there is an easy way to clean this up.
>>
>>>> @@ -5,6 +5,7 @@
>>> #include <linux/spinlock.h>
>>>
>>> #include <linux/mm.h>
>>> +#include <linux/memfd.h>
>>> #include <linux/memremap.h>
>>> #include <linux/pagemap.h>
>>> #include <linux/rmap.h>
>>> @@ -17,6 +18,7 @@
>>> #include <linux/hugetlb.h>
>>> #include <linux/migrate.h>
>>> #include <linux/mm_inline.h>
>>> +#include <linux/pagevec.h>
>>> #include <linux/sched/mm.h>
>>> #include <linux/shmem_fs.h>
>>>
>>> @@ -3410,3 +3412,156 @@ long pin_user_pages_unlocked(unsigned long
>> start, unsigned long nr_pages,
>>> &locked, gup_flags);
>>> }
>>> EXPORT_SYMBOL(pin_user_pages_unlocked);
>>> +
>>> +/**
>>> + * memfd_pin_folios() - pin folios associated with a memfd
>>> + * @memfd: the memfd whose folios are to be pinned
>>> + * @start: starting memfd offset
>>> + * @nr_pages: number of pages from start to pin
>>
>> We're not pinning pages. An inclusive range [start, end] would be clearer.
> Ok, I'll make this change in the next version.
>
>>
>>> + * @folios: array that receives pointers to the folios pinned.
>>> + * Should be at-least nr_pages long.
>>> + * @offsets: array that receives offsets of pages in their folios.
>>> + * Should be at-least nr_pages long.
>>
>> See below, I'm wondering if this is really required once we return each folio
>> only once.
> The offsets can be calculated by the caller (udmabuf) as well but doing so
> in this interface would prevent special handling in the caller for the hugetlb
> case. Please look at patch 5 in this series (udmabuf: Pin the pages using
> memfd_pin_folios() API (v5)) for more details as to what I mean.
>
I'll have a look later to be reminded about the target use case :)
>>
>>> + *
>>> + * It must be noted that the folios may be pinned for an indefinite amount
>>> + * of time. And, in most cases, the duration of time they may stay pinned
>>> + * would be controlled by the userspace. This behavior is effectively the
>>> + * same as using FOLL_LONGTERM with other GUP APIs.
>>> + *
>>> + * Returns number of folios pinned. This would be equal to the number of
>>> + * pages requested. If no folios were pinned, it returns -errno.
>>> + */
>>> +long memfd_pin_folios(struct file *memfd, unsigned long start,
>>> + unsigned long nr_pages, struct folio **folios,
>>> + pgoff_t *offsets)
>>> +{
>>> + unsigned long end = start + (nr_pages << PAGE_SHIFT) - 1;
>>> + unsigned int max_pgs, pgoff, pgshift = PAGE_SHIFT;
>>> + pgoff_t start_idx, end_idx, next_idx;
>>> + unsigned int flags, nr_folios, i, j;
>>> + struct folio *folio = NULL;
>>> + struct folio_batch fbatch;
>>> + struct page **pages;
>>> + struct hstate *h;
>>> + long ret;
>>> +
>>> + if (!nr_pages)
>>> + return -EINVAL;
>>> +
>>> + if (!memfd)
>>> + return -EINVAL;
>>> +
>>> + if (!shmem_file(memfd) && !is_file_hugepages(memfd))
>>> + return -EINVAL;
>>> +
>>> + pages = kmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
>>> + if (!pages)
>>> + return -ENOMEM;
>>> +
>>> + if (is_file_hugepages(memfd)) {
>>> + h = hstate_file(memfd);
>>> + pgshift = huge_page_shift(h);
>>> + }
>>> +
>>> + flags = memalloc_pin_save();
>>> + do {
>>> + i = 0;
>>> + start_idx = start >> pgshift;
>>> + end_idx = end >> pgshift;
>>> + if (is_file_hugepages(memfd)) {
>>> + start_idx <<= huge_page_order(h);
>>> + end_idx <<= huge_page_order(h);
>>> + }
>>> +
>>> + folio_batch_init(&fbatch);
>>> + while (start_idx <= end_idx) {
>>> + /*
>>> + * In most cases, we should be able to find the folios
>>> + * in the page cache. If we cannot find them for some
>>> + * reason, we try to allocate them and add them to the
>>> + * page cache.
>>> + */
>>> + nr_folios = filemap_get_folios_contig(memfd->f_mapping,
>>> + &start_idx,
>>> + end_idx,
>>> + &fbatch);
>>> + if (folio) {
>>> + folio_put(folio);
>>> + folio = NULL;
>>> + }
>>> +
>>> + next_idx = 0;
>>> + for (j = 0; j < nr_folios; j++) {
>>> + if (next_idx &&
>>> + next_idx != folio_index(fbatch.folios[j]))
>>> + continue;
>>> +
>>> + folio = try_grab_folio(&fbatch.folios[j]->page,
>>> + 1, FOLL_PIN);
>>> + if (!folio) {
>>> + folio_batch_release(&fbatch);
>>> + kfree(pages);
>>> + goto err;
>>> + }
>>> +
>>> + max_pgs = folio_nr_pages(folio);
>>> + if (i == 0) {
>>> + pgoff = offset_in_folio(folio, start);
>>> + pgoff >>= PAGE_SHIFT;
>>> + }
>>> +
>>> + do {
>>> + folios[i] = folio;
>>> + offsets[i] = pgoff << PAGE_SHIFT;
>>> + pages[i] = folio_page(folio, 0);
>>> + folio_add_pin(folio);
>>> +
>>> + pgoff++;
>>> + i++;
>>> + } while (pgoff < max_pgs && i < nr_pages);
>>> +
>>> + pgoff = 0;
>>> + next_idx = folio_next_index(folio);
>>> + gup_put_folio(folio, 1, FOLL_PIN);
>>> + }
>>> +
>>> + folio = NULL;
>>> + folio_batch_release(&fbatch);
>>> + if (!nr_folios) {
>>> + folio = memfd_alloc_folio(memfd, start_idx);
>>> + if (IS_ERR(folio)) {
>>> + ret = PTR_ERR(folio);
>>> + if (ret != -EEXIST) {
>>> + kfree(pages);
>>> + goto err;
>>> + }
>>> + }
>>> + }
>>> + }
>>> +
>>> + ret = check_and_migrate_movable_pages(nr_pages, pages);
>>
>> Having a folio variant would avoid having to mess with pages here at all.
>> Further, we're now returning+pinning the same folio multiple times, instead
>> of
>> just once like the folio batching variant would.
> It should be possible to pin the folio only once but I don't see any problem with
> pinning it multiple times -- once per each subpage used -- as long as it is unpinned
> correctly the same number of times. Is this not ok?
You can, but that partially defeats the benefit of using folios.
Instead of "large folio + offset" you have "folio+offset1, folio+offset2
..." essentially for each subpage. But again, maybe that really is
required for the target use case.
It's not necessarily wrong to do that, but staring just at the interface
it's the opposite of what other folio-handling functions like batching do.
>
>>
>> I'm wondering if the following wouldn't make more sense, assuming we add
>> check_and_migrate_movable_folios(), which should be pretty easy to add.
>>
>> Obviously untested, just to express what I have in mind:
> Thank you for taking the time to do this!
>
>>
>>
>>
>> /**
>> * memfd_pin_folios() - pin folios associated with a memfd
>> * @memfd: the memfd whose folios are to be pinned
>> * @start: the starting memfd offset
>> * @end: the final memfd offset (inclusive)
>> * @folios: array that receives pointers to the folios pinned
>> * @max_folios: the number of entries in the array for folios
>> * @offsets: the offset into the first folio
> Given that my goal is to do the following in udmabuf driver:
> ret = sg_alloc_table(sg, ubuf->pagecount, GFP_KERNEL);
> for_each_sg(sg->sgl, sgl, ubuf->pagecount, i)
> sg_set_folio(sgl, ubuf->folios[i], PAGE_SIZE, ubuf->offsets[i]);
>
> ret = dma_map_sgtable(dev, sg, direction, 0);
>
> That is, populate a scatterlist with ubuf->pagecount entries, where each
> segment is of size PAGE_SIZE, in order to be consistent and to support a
> wide variety of DMA importers that may not be able to handle segments
> larger than PAGE_SIZE.
>
> Therefore, in the hugetlb case, there would be multiple entries pointing to
> the same folio with different offsets. The question really is whether these
> entries associated with @folios and @offsets would need to be populated
> by the caller (udmabuf) or the API (memfd_pin_folios). I have tried both of
> these approaches in the earlier versions and they all work fine but I think
> populating the entries in memfd_pin_folios() seems to be cleaner as the
> caller does not need to do any special handling (hugetlb vs shmem).
>
>> *
>> * Attempt to pin folios associated with a memfd; given that a memfd is
>> * either backed by shmem or hugetlb, the folios can either be found in
>> * the page cache or need to be allocated if necessary. Once the folios
>> * are located, they are all pinned via FOLL_PIN and @offset is populated
>> * with the offset into the first folio.
>> *
>> * Pinned folios must be released using unpin_folio() or unpin_folios().
>> *
>> * It must be noted that the folios may be pinned for an indefinite amount
>> * of time. And, in most cases, the duration of time they may stay pinned
>> * would be controlled by the userspace. This behavior is effectively the
>> * same as using FOLL_LONGTERM with other GUP APIs.
>> *
>> * Returns number of folios pinned, which might be less than @max_folios
>> * only if the whole range was pinned. If no folios were pinned, it returns
>> * -errno.
>> */
>> long memfd_pin_folios(struct file *memfd, unsigned long start,
>> unsigned long end, struct folio **folios,
>> unsigned int max_folios, unsigned long *offset)
>> [... body of the untested sketch snipped; quoted in full earlier in the thread ...]
>>
>>
>>
>> I'm still wondering about the offset handling, though. Could it happen,
>> while we are repeatedly calling filemap_get_folios_contig(), that we would
>> need offset != 0 on any of the folios besides the first one? My current
>> understanding (from looking at filemap_get_folios_contig()) is: no.
> I am not entirely sure but while testing this series with Qemu master + kernel
> snapshot of drm-tip which is 6.7 RC1, I noticed strange behavior of
> filemap_get_folios_contig() and the batches it returns particularly for the
> hugetlb folios. Assuming we have order-9 folios in the memfd (my test-case),
> and if the range [start, end] cuts across more than one folio: let's say start is
> at subpage 490 (in folio-0) and end is at subpage 520 (in folio-1), then start_idx
> would be 0 and end_idx would be 512. In this case, I would have expected
That is weird. Shouldn't you get start_idx = 0 and end_idx = 1 with
hugetlb, where the idx differs? Maybe that's the problem.
> filemap_get_folios_contig() to return two entries in the batch that included
> folio-0 and folio-1. However, it returned a batch with 15 entries (max batch size)
> with all the entries pointing to folio-0. This is why I added the check:
>         if (next_idx &&
>             next_idx != folio_index(fbatch.folios[j]))
>                 continue;
>
> Anyway, based on the code you wrote, I have realized that we both have a
> different view on how many entries need to be there in the @folios array
> for a given range [start, end] in the hugetlb case.
Oh, yes, ideally the interface should behave the same for hugetlb and shmem.
>
> I have assumed that it is highly desirable to have a segment length of
> PAGE_SIZE for consistency and interoperability reasons but I guess it might
> be ok to do:
> sg_set_folio(sgl, ubuf->folios[i], nr_tails * PAGE_SIZE, ubuf->offsets[i]);
>
> I'll run some experiments to see if this would work in most cases or not.
>
>>
>> I'm primarily concerned about concurrent fallocate(PUNCH_HOLE) and THP
>> collapse/splitting.
> Could you please elaborate on what the issue would be in this case?
I'm not sure if this can happen, but assume the following (shouldn't
happen as long as shmem does not support 1m folios):
Assume the file looks like this:
[ 1m ][ 512k ]
^0 ^256 ^384
Assume we call filemap_get_folios_contig(), get back the first folio,
and end up with start_idx=256.
Then, someone punches a hole (fallocate(PUNCH_HOLE)) over the whole range
and re-populates it with a 2m folio.
[ 2m ]
^0 ^256 ^384
if we call filemap_get_folios_contig() with 256, we get another "large
folio with offset".
Of course, we can detect that, and simply fail/retry. Just wondering if
that could happen.
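
An untested sketch of such a recheck after pinning (@expected_idx being
whatever index the lookup was done with; the caller would retry the
lookup when this returns false):

static bool memfd_folio_is_stable(struct file *memfd, struct folio *folio,
                                  pgoff_t expected_idx)
{
        /* did PUNCH_HOLE + repopulation swap the folio under us? */
        return folio_mapping(folio) == memfd->f_mapping &&
               folio_index(folio) == expected_idx;
}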
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)
2023-12-13 12:31 ` Jason Gunthorpe
@ 2023-12-13 15:36 ` Christoph Hellwig
2023-12-13 17:06 ` Jason Gunthorpe
0 siblings, 1 reply; 18+ messages in thread
From: Christoph Hellwig @ 2023-12-13 15:36 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Kasireddy, Vivek, David Hildenbrand, dri-devel, linux-mm,
Christoph Hellwig, Daniel Vetter, Mike Kravetz, Hugh Dickins,
Peter Xu, Gerd Hoffmann, Kim, Dongwon, Chang, Junxiao,
Christoph Hellwig
On Wed, Dec 13, 2023 at 08:31:55AM -0400, Jason Gunthorpe wrote:
> > That is, populate a scatterlist with ubuf->pagecount entries, where each
> > segment is of size PAGE_SIZE, in order to be consistent and to support a
> > wide variety of DMA importers that may not be able to handle segments
> > larger than PAGE_SIZE.
>
> No! This is totally wrong, sg lists must aggregate up to the limits
> specified in the struct device. We have importer helpers that do this
> aggregation.
>
> If some driver is working with a sglist and can't handle this, it is
> simply broken. Do not mess up core code to accommodate such things.
Well... There's no single driver that is broken; it's more that the whole
dmabuf philosophy wants things to be mappable by multiple devices
without knowing their limits beforehand. So you'll get this cargo
culting.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7)
2023-12-13 15:36 ` Christoph Hellwig
@ 2023-12-13 17:06 ` Jason Gunthorpe
0 siblings, 0 replies; 18+ messages in thread
From: Jason Gunthorpe @ 2023-12-13 17:06 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Kasireddy, Vivek, David Hildenbrand, dri-devel, linux-mm,
Christoph Hellwig, Daniel Vetter, Mike Kravetz, Hugh Dickins,
Peter Xu, Gerd Hoffmann, Kim, Dongwon, Chang, Junxiao
On Wed, Dec 13, 2023 at 04:36:34PM +0100, Christoph Hellwig wrote:
> On Wed, Dec 13, 2023 at 08:31:55AM -0400, Jason Gunthorpe wrote:
> > > That is, populate a scatterlist with ubuf->pagecount entries, where each
> > > segment is of size PAGE_SIZE, in order to be consistent and to support a
> > > wide variety of DMA importers that may not be able to handle segments
> > > larger than PAGE_SIZE.
> >
> > No! This is totally wrong, sg lists must aggregate up to the limits
> > specified in the struct device. We have importer helpers that do this
> > aggregation.
> >
> > If some driver is working with a sglist and can't handle this, it is
> > simply broken. Do not mess up core code to accommodate such things.
>
> Well.. There's no single driver that is broken, it's more the whole
> dmabuf philosophy that wants things to be mappable by multiple devices
> without knowing their limits beforehand. So you'll get this cargo
> culting.
It is not so bad, the API has the importer pass a struct device to the
exporter that can be used in the usual way to shape the sg list.
But really, I think in most cases importers don't strictly need the sg
list to be a certain configuration; it is just a combination of lazy
driver writers and a lack of common helpers to iterate over the sg
list in the way they need.
RDMA has done this right, but for it to work efficiently the exporter
*must* aggregate all contiguous memory into a single sg element,
otherwise you lose the HW's large page support.
Jason
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v7 4/6] udmabuf: Convert udmabuf driver to use folios
2023-12-12 7:38 ` [PATCH v7 4/6] udmabuf: Convert udmabuf driver to use folios Vivek Kasireddy
@ 2023-12-13 18:00 ` Matthew Wilcox
0 siblings, 0 replies; 18+ messages in thread
From: Matthew Wilcox @ 2023-12-13 18:00 UTC (permalink / raw)
To: Vivek Kasireddy
Cc: dri-devel, linux-mm, David Hildenbrand, Daniel Vetter,
Mike Kravetz, Hugh Dickins, Peter Xu, Jason Gunthorpe,
Gerd Hoffmann, Dongwon Kim, Junxiao Chang
On Mon, Dec 11, 2023 at 11:38:01PM -0800, Vivek Kasireddy wrote:
> @@ -42,7 +42,7 @@ static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
> if (pgoff >= ubuf->pagecount)
> return VM_FAULT_SIGBUS;
>
> - pfn = page_to_pfn(ubuf->pages[pgoff]);
> + pfn = page_to_pfn(&ubuf->folios[pgoff]->page);
We have folio_pfn().
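
That is, the quoted assignment could become:

        pfn = folio_pfn(ubuf->folios[pgoff]);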
> static int vmap_udmabuf(struct dma_buf *buf, struct iosys_map *map)
> {
> struct udmabuf *ubuf = buf->priv;
> + struct page **pages;
> void *vaddr;
> + pgoff_t pg;
>
> dma_resv_assert_held(buf->resv);
>
> - vaddr = vm_map_ram(ubuf->pages, ubuf->pagecount, -1);
> + pages = kmalloc_array(ubuf->pagecount, sizeof(*pages), GFP_KERNEL);
> + if (!pages)
> + return -ENOMEM;
> +
> + for (pg = 0; pg < ubuf->pagecount; pg++)
> + pages[pg] = &ubuf->folios[pg]->page;
> +
> + vaddr = vm_map_ram(pages, ubuf->pagecount, -1);
> + kfree(pages);
We don't yet have a vm_map_ram() variant that takes an array of
folios. We probably should; there was _something_ I was looking at
recently that would have liked it ...
> @@ -254,31 +262,70 @@ static int handle_shmem_pages(struct udmabuf *ubuf, struct file *memfd,
> pgoff_t *pgbuf)
> {
> pgoff_t pgidx, pgoff = offset >> PAGE_SHIFT;
> - struct page *page;
> + struct folio *folio = NULL;
>
> for (pgidx = 0; pgidx < pgcnt; pgidx++) {
> - page = shmem_read_mapping_page(memfd->f_mapping,
> - pgoff + pgidx);
> - if (IS_ERR(page))
> - return PTR_ERR(page);
> + folio = shmem_read_folio(memfd->f_mapping,
> + pgoff + pgidx);
You could join these two lines.
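i.e.:

        folio = shmem_read_folio(memfd->f_mapping, pgoff + pgidx);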
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [PATCH v7 5/6] udmabuf: Pin the pages using memfd_pin_folios() API (v5)
2023-12-12 7:38 ` [PATCH v7 5/6] udmabuf: Pin the pages using memfd_pin_folios() API (v5) Vivek Kasireddy
@ 2023-12-13 18:03 ` Matthew Wilcox
0 siblings, 0 replies; 18+ messages in thread
From: Matthew Wilcox @ 2023-12-13 18:03 UTC (permalink / raw)
To: Vivek Kasireddy
Cc: dri-devel, linux-mm, David Hildenbrand, Daniel Vetter,
Mike Kravetz, Hugh Dickins, Peter Xu, Jason Gunthorpe,
Gerd Hoffmann, Dongwon Kim, Junxiao Chang
On Mon, Dec 11, 2023 at 11:38:02PM -0800, Vivek Kasireddy wrote:
> +++ b/drivers/dma-buf/udmabuf.c
> @@ -42,7 +42,7 @@ static vm_fault_t udmabuf_vm_fault(struct vm_fault *vmf)
> if (pgoff >= ubuf->pagecount)
> return VM_FAULT_SIGBUS;
>
> - pfn = page_to_pfn(&ubuf->folios[pgoff]->page);
> + pfn = page_to_pfn(folio_page(ubuf->folios[pgoff], 0));
Why are you changing from &folio->page to folio_page(folio, 0) in
this patch? Either that should have been done the other way in the
earlier patch, or it shouldn't be done at all.
My vote is that it shouldn't be done at all. These all seem like "I
have to convert back from folio to page because the APIs I want aren't
available", not "I want the first page from this folio".
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2023-12-13 18:04 UTC | newest]
Thread overview: 18+ messages
2023-12-12 7:37 [PATCH v7 0/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7) Vivek Kasireddy
2023-12-12 7:37 ` [PATCH v7 1/6] udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap Vivek Kasireddy
2023-12-12 7:37 ` [PATCH v7 2/6] udmabuf: Add back support for mapping hugetlb pages (v6) Vivek Kasireddy
2023-12-12 7:38 ` [PATCH v7 3/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7) Vivek Kasireddy
2023-12-12 12:13 ` David Hildenbrand
2023-12-13 8:44 ` Kasireddy, Vivek
2023-12-13 12:31 ` Jason Gunthorpe
2023-12-13 15:36 ` Christoph Hellwig
2023-12-13 17:06 ` Jason Gunthorpe
2023-12-13 15:15 ` David Hildenbrand
2023-12-12 13:21 ` kernel test robot
2023-12-12 15:27 ` kernel test robot
2023-12-12 7:38 ` [PATCH v7 4/6] udmabuf: Convert udmabuf driver to use folios Vivek Kasireddy
2023-12-13 18:00 ` Matthew Wilcox
2023-12-12 7:38 ` [PATCH v7 5/6] udmabuf: Pin the pages using memfd_pin_folios() API (v5) Vivek Kasireddy
2023-12-13 18:03 ` Matthew Wilcox
2023-12-12 7:38 ` [PATCH v7 6/6] selftests/dma-buf/udmabuf: Add tests to verify data after page migration Vivek Kasireddy
2023-12-12 12:15 ` [PATCH v7 0/6] mm/gup: Introduce memfd_pin_folios() for pinning memfd folios (v7) David Hildenbrand