linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
       [not found] <20230120175556.3556978-1-dhowells@redhat.com>
@ 2023-01-20 17:55 ` David Howells
  2023-01-21 13:01   ` Christoph Hellwig
                     ` (5 more replies)
  2023-01-20 17:55 ` [PATCH v7 3/8] mm: Provide a helper to drop a pin/ref on a page David Howells
  2023-01-20 17:55 ` [PATCH v7 8/8] mm: Renumber FOLL_GET and FOLL_PIN down David Howells
  2 siblings, 6 replies; 28+ messages in thread
From: David Howells @ 2023-01-20 17:55 UTC (permalink / raw)
  To: Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Christoph Hellwig, John Hubbard, linux-mm

Add a function, iov_iter_extract_pages(), to extract a list of pages from
an iterator.  The pages may be returned with a reference added or a pin
added or neither, depending on the type of iterator and the direction of
transfer.  The caller must pass FOLL_READ_FROM_MEM or FOLL_WRITE_TO_MEM
as part of gup_flags to indicate how the iterator contents are to be used.

Add a second function, iov_iter_extract_mode(), to determine how the
cleanup should be done.

There are three cases:

 (1) Transfer *into* an ITER_IOVEC or ITER_UBUF iterator.

     Extracted pages will have pins obtained on them (but not references)
     so that fork() doesn't CoW the pages incorrectly whilst the I/O is in
     progress.

     iov_iter_extract_mode() will return FOLL_PIN for this case.  The
     caller should use something like unpin_user_page() to dispose of the
     page.

 (2) Transfer is *out of* an ITER_IOVEC or ITER_UBUF iterator.

     Extracted pages will have references obtained on them, but not pins.

     iov_iter_extract_mode() will return FOLL_GET.  The caller should use
     something like put_page() for page disposal.

 (3) Any other sort of iterator.

     No refs or pins are obtained on the page, the assumption is made that
     the caller will manage page retention.  ITER_ALLOW_P2PDMA is not
     permitted.

     iov_iter_extract_mode() will return 0.  The pages don't need
     additional disposal.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Christoph Hellwig <hch@lst.de>
cc: John Hubbard <jhubbard@nvidia.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/166920903885.1461876.692029808682876184.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/166997421646.9475.14837976344157464997.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/167305163883.1521586.10777155475378874823.stgit@warthog.procyon.org.uk/ # v4
Link: https://lore.kernel.org/r/167344728530.2425628.9613910866466387722.stgit@warthog.procyon.org.uk/ # v5
Link: https://lore.kernel.org/r/167391053207.2311931.16398133457201442907.stgit@warthog.procyon.org.uk/ # v6
---

Notes:
    ver #7)
     - Switch to passing in iter-specific flags rather than FOLL_* flags.
     - Drop the direction flags for now.
     - Use ITER_ALLOW_P2PDMA to request FOLL_PCI_P2PDMA.
     - Disallow use of ITER_ALLOW_P2PDMA with non-user-backed iter.
     - Add support for extraction from KVEC-type iters.
     - Use iov_iter_advance() rather than open-coding it.
     - Make BVEC- and KVEC-type skip over initial empty vectors.
    
    ver #6)
     - Add back the function to indicate the cleanup mode.
     - Drop the cleanup_mode return arg to iov_iter_extract_pages().
     - Pass FOLL_SOURCE/DEST_BUF in gup_flags.  Check this against the iter
       data_source.
    
    ver #4)
     - Use ITER_SOURCE/DEST instead of WRITE/READ.
     - Allow additional FOLL_* flags, such as FOLL_PCI_P2PDMA to be passed in.
    
    ver #3)
     - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
       to get/pin_user_pages_fast()[1].

 include/linux/uio.h |  28 +++
 lib/iov_iter.c      | 424 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 452 insertions(+)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 46d5080314c6..a4233049ab7a 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -363,4 +363,32 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
 /* Flags for iov_iter_get/extract_pages*() */
 #define ITER_ALLOW_P2PDMA	0x01	/* Allow P2PDMA on the extracted pages */
 
+ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
+			       size_t maxsize, unsigned int maxpages,
+			       unsigned int extract_flags, size_t *offset0);
+
+/**
+ * iov_iter_extract_mode - Indicate how pages from the iterator will be retained
+ * @iter: The iterator
+ * @extract_flags: How the iterator is to be used
+ *
+ * Examine the iterator and @extract_flags and indicate by returning FOLL_PIN,
+ * FOLL_GET or 0 as to how, if at all, pages extracted from the iterator will
+ * be retained by the extraction function.
+ *
+ * FOLL_GET indicates that the pages will have a reference taken on them that
+ * the caller must put.  This can be done for DMA/async DIO write from a page.
+ *
+ * FOLL_PIN indicates that the pages will have a pin placed in them that the
+ * caller must unpin.  This is must be done for DMA/async DIO read to a page to
+ * avoid CoW problems in fork.
+ *
+ * 0 indicates that no measures are taken and that it's up to the caller to
+ * retain the pages.
+ */
+#define iov_iter_extract_mode(iter, extract_flags) \
+	(user_backed_iter(iter) ?				\
+	 (iter->data_source == ITER_SOURCE) ?			\
+	 FOLL_GET : FOLL_PIN : 0)
+
 #endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index fb04abe7d746..843abe566efb 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1916,3 +1916,427 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
 		i->iov -= state->nr_segs - i->nr_segs;
 	i->nr_segs = state->nr_segs;
 }
+
+/*
+ * Extract a list of contiguous pages from an ITER_PIPE iterator.  This does
+ * not get references of its own on the pages, nor does it get a pin on them.
+ * If there's a partial page, it adds that first and will then allocate and add
+ * pages into the pipe to make up the buffer space to the amount required.
+ *
+ * The caller must hold the pipe locked and only transferring into a pipe is
+ * supported.
+ */
+static ssize_t iov_iter_extract_pipe_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   unsigned int extract_flags,
+					   size_t *offset0)
+{
+	unsigned int nr, offset, chunk, j;
+	struct page **p;
+	size_t left;
+
+	if (!sanity(i))
+		return -EFAULT;
+
+	offset = pipe_npages(i, &nr);
+	if (!nr)
+		return -EFAULT;
+	*offset0 = offset;
+
+	maxpages = min_t(size_t, nr, maxpages);
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	left = maxsize;
+	for (j = 0; j < maxpages; j++) {
+		struct page *page = append_pipe(i, left, &offset);
+		if (!page)
+			break;
+		chunk = min_t(size_t, left, PAGE_SIZE - offset);
+		left -= chunk;
+		*p++ = page;
+	}
+	if (!j)
+		return -EFAULT;
+	return maxsize - left;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_XARRAY iterator.  This does not
+ * get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_xarray_pages(struct iov_iter *i,
+					     struct page ***pages, size_t maxsize,
+					     unsigned int maxpages,
+					     unsigned int extract_flags,
+					     size_t *offset0)
+{
+	struct page *page, **p;
+	unsigned int nr = 0, offset;
+	loff_t pos = i->xarray_start + i->iov_offset;
+	pgoff_t index = pos >> PAGE_SHIFT;
+	XA_STATE(xas, i->xarray, index);
+
+	offset = pos & ~PAGE_MASK;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	rcu_read_lock();
+	for (page = xas_load(&xas); page; page = xas_next(&xas)) {
+		if (xas_retry(&xas, page))
+			continue;
+
+		/* Has the page moved or been split? */
+		if (unlikely(page != xas_reload(&xas))) {
+			xas_reset(&xas);
+			continue;
+		}
+
+		p[nr++] = find_subpage(page, xas.xa_index);
+		if (nr == maxpages)
+			break;
+	}
+	rcu_read_unlock();
+
+	maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_BVEC iterator.  This does
+ * not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_bvec_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   unsigned int extract_flags,
+					   size_t *offset0)
+{
+	struct page **p, *page;
+	size_t skip = i->iov_offset, offset;
+	int k;
+
+	for (;;) {
+		if (i->nr_segs == 0)
+			return 0;
+		maxsize = min(maxsize, i->bvec->bv_len - skip);
+		if (maxsize)
+			break;
+		i->iov_offset = 0;
+		i->nr_segs--;
+		i->kvec++;
+		skip = 0;
+	}
+
+	skip += i->bvec->bv_offset;
+	page = i->bvec->bv_page + skip / PAGE_SIZE;
+	offset = skip % PAGE_SIZE;
+	*offset0 = offset;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+	for (k = 0; k < maxpages; k++)
+		p[k] = page + k;
+
+	maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of virtually contiguous pages from an ITER_KVEC iterator.
+ * This does not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_kvec_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   unsigned int extract_flags,
+					   size_t *offset0)
+{
+	struct page **p, *page;
+	const void *kaddr;
+	size_t skip = i->iov_offset, offset, len;
+	int k;
+
+	for (;;) {
+		if (i->nr_segs == 0)
+			return 0;
+		maxsize = min(maxsize, i->kvec->iov_len - skip);
+		if (maxsize)
+			break;
+		i->iov_offset = 0;
+		i->nr_segs--;
+		i->kvec++;
+		skip = 0;
+	}
+
+	offset = skip % PAGE_SIZE;
+	*offset0 = offset;
+	kaddr = i->kvec->iov_base;
+
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	p = *pages;
+
+	kaddr -= offset;
+	len = offset + maxsize;
+	for (k = 0; k < maxpages; k++) {
+		size_t seg = min_t(size_t, len, PAGE_SIZE);
+
+		if (is_vmalloc_or_module_addr(kaddr))
+			page = vmalloc_to_page(kaddr);
+		else
+			page = virt_to_page(kaddr);
+
+		p[k] = page;
+		len -= seg;
+		kaddr += PAGE_SIZE;
+	}
+
+	maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Get the first segment from an ITER_UBUF or ITER_IOVEC iterator.  The
+ * iterator must not be empty.
+ */
+static unsigned long iov_iter_extract_first_user_segment(const struct iov_iter *i,
+							 size_t *size)
+{
+	size_t skip;
+	long k;
+
+	if (iter_is_ubuf(i))
+		return (unsigned long)i->ubuf + i->iov_offset;
+
+	for (k = 0, skip = i->iov_offset; k < i->nr_segs; k++, skip = 0) {
+		size_t len = i->iov[k].iov_len - skip;
+
+		if (unlikely(!len))
+			continue;
+		if (*size > len)
+			*size = len;
+		return (unsigned long)i->iov[k].iov_base + skip;
+	}
+	BUG(); // if it had been empty, we wouldn't get called
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get references
+ * on them.  This should only be used iff the iterator is user-backed
+ * (IOBUF/UBUF) and data is being transferred out of the buffer described by
+ * the iterator (ie. this is the source).
+ *
+ * The pages are returned with incremented refcounts that the caller must undo
+ * once the transfer is complete, but no additional pins are obtained.
+ *
+ * This is only safe to be used where background IO/DMA is not going to be
+ * modifying the buffer, and so won't cause a problem with CoW on fork.
+ */
+static ssize_t iov_iter_extract_user_pages_and_get(struct iov_iter *i,
+						   struct page ***pages,
+						   size_t maxsize,
+						   unsigned int maxpages,
+						   unsigned int extract_flags,
+						   size_t *offset0)
+{
+	unsigned long addr;
+	unsigned int gup_flags = FOLL_GET;
+	size_t offset;
+	int res;
+
+	if (WARN_ON_ONCE(i->data_source != ITER_SOURCE))
+		return -EFAULT;
+
+	if (extract_flags & ITER_ALLOW_P2PDMA)
+		gup_flags |= FOLL_PCI_P2PDMA;
+	if (i->nofault)
+		gup_flags |= FOLL_NOFAULT;
+
+	addr = iov_iter_extract_first_user_segment(i, &maxsize);
+	*offset0 = offset = addr % PAGE_SIZE;
+	addr &= PAGE_MASK;
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	res = get_user_pages_fast(addr, maxpages, gup_flags, *pages);
+	if (unlikely(res <= 0))
+		return res;
+	maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get a pin on
+ * each of them.  This should only be used iff the iterator is user-backed
+ * (IOBUF/UBUF) and data is being transferred into the buffer described by the
+ * iterator (ie. this is the destination).
+ *
+ * It does not get refs on the pages, but the pages must be unpinned by the
+ * caller once the transfer is complete.
+ *
+ * This is safe to be used where background IO/DMA *is* going to be modifying
+ * the buffer; using a pin rather than a ref makes sure that CoW happens
+ * correctly in the parent during fork.
+ */
+static ssize_t iov_iter_extract_user_pages_and_pin(struct iov_iter *i,
+						   struct page ***pages,
+						   size_t maxsize,
+						   unsigned int maxpages,
+						   unsigned int extract_flags,
+						   size_t *offset0)
+{
+	unsigned long addr;
+	unsigned int gup_flags = FOLL_PIN | FOLL_WRITE;
+	size_t offset;
+	int res;
+
+	if (WARN_ON_ONCE(i->data_source != ITER_DEST))
+		return -EFAULT;
+
+	if (extract_flags & ITER_ALLOW_P2PDMA)
+		gup_flags |= FOLL_PCI_P2PDMA;
+	if (i->nofault)
+		gup_flags |= FOLL_NOFAULT;
+
+	addr = first_iovec_segment(i, &maxsize);
+	*offset0 = offset = addr % PAGE_SIZE;
+	addr &= PAGE_MASK;
+	maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+	if (!maxpages)
+		return -ENOMEM;
+	res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages);
+	if (unlikely(res <= 0))
+		return res;
+	maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+	iov_iter_advance(i, maxsize);
+	return maxsize;
+}
+
+static ssize_t iov_iter_extract_user_pages(struct iov_iter *i,
+					   struct page ***pages, size_t maxsize,
+					   unsigned int maxpages,
+					   unsigned int extract_flags,
+					   size_t *offset0)
+{
+	if (iov_iter_extract_mode(i, extract_flags) == FOLL_GET)
+		return iov_iter_extract_user_pages_and_get(i, pages, maxsize,
+							   maxpages, extract_flags,
+							   offset0);
+	else
+		return iov_iter_extract_user_pages_and_pin(i, pages, maxsize,
+							   maxpages, extract_flags,
+							   offset0);
+}
+
+/**
+ * iov_iter_extract_pages - Extract a list of contiguous pages from an iterator
+ * @i: The iterator to extract from
+ * @pages: Where to return the list of pages
+ * @maxsize: The maximum amount of iterator to extract
+ * @maxpages: The maximum size of the list of pages
+ * @extract_flags: Flags to qualify request
+ * @offset0: Where to return the starting offset into (*@pages)[0]
+ *
+ * Extract a list of contiguous pages from the current point of the iterator,
+ * advancing the iterator.  The maximum number of pages and the maximum amount
+ * of page contents can be set.
+ *
+ * If *@pages is NULL, a page list will be allocated to the required size and
+ * *@pages will be set to its base.  If *@pages is not NULL, it will be assumed
+ * that the caller allocated a page list at least @maxpages in size and this
+ * will be filled in.
+ *
+ * @extract_flags can have ITER_ALLOW_P2PDMA set to request peer-to-peer DMA be
+ * allowed on the pages extracted.
+ *
+ * The iov_iter_extract_mode() function can be used to query how cleanup should
+ * be performed.
+ *
+ * Extra refs or pins on the pages may be obtained as follows:
+ *
+ *  (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF) and data is to be
+ *      transferred /OUT OF/ the buffer (@i->data_source == ITER_SOURCE), refs
+ *      will be taken on the pages, but pins will not be added.  This can be
+ *      used for DMA from a page; it cannot be used for DMA to a page, as it
+ *      may cause page-COW problems in fork.  iov_iter_extract_mode() will
+ *      return FOLL_GET.
+ *
+ *  (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF) and data is to be
+ *      transferred /INTO/ the described buffer (@i->data_source |= ITER_DEST),
+ *      pins will be added to the pages, but refs will not be taken.  This must
+ *      be used for DMA to a page.  iov_iter_extract_mode() will return
+ *      FOLL_PIN.
+ *
+ *  (*) If the iterator is ITER_PIPE, this must describe a destination for the
+ *      data.  Additional pages may be allocated and added to the pipe (which
+ *      will hold the refs), but neither refs nor pins will be obtained for the
+ *      caller.  The caller must hold the pipe lock.  iov_iter_extract_mode()
+ *      will return 0.
+ *
+ *  (*) If the iterator is ITER_KVEC, ITER_BVEC or ITER_XARRAY, the pages are
+ *      merely listed; no extra refs or pins are obtained.
+ *      iov_iter_extract_mode() will return 0.
+ *
+ * Note also:
+ *
+ *  (*) Peer-to-peer DMA (ITER_ALLOW_P2PDMA) is only permitted with user-backed
+ *      iterators.
+ *
+ *  (*) Use with ITER_DISCARD is not supported as that has no content.
+ *
+ * On success, the function sets *@pages to the new pagelist, if allocated, and
+ * sets *offset0 to the offset into the first page..
+ *
+ * It may also return -ENOMEM and -EFAULT.
+ */
+ssize_t iov_iter_extract_pages(struct iov_iter *i,
+			       struct page ***pages,
+			       size_t maxsize,
+			       unsigned int maxpages,
+			       unsigned int extract_flags,
+			       size_t *offset0)
+{
+	maxsize = min_t(size_t, min_t(size_t, maxsize, i->count), MAX_RW_COUNT);
+	if (!maxsize)
+		return 0;
+
+	if (likely(user_backed_iter(i)))
+		return iov_iter_extract_user_pages(i, pages, maxsize,
+						   maxpages, extract_flags,
+						   offset0);
+	if (WARN_ON_ONCE(extract_flags & ITER_ALLOW_P2PDMA))
+		return -EIO;
+	if (iov_iter_is_kvec(i))
+		return iov_iter_extract_kvec_pages(i, pages, maxsize,
+						   maxpages, extract_flags,
+						   offset0);
+	if (iov_iter_is_bvec(i))
+		return iov_iter_extract_bvec_pages(i, pages, maxsize,
+						   maxpages, extract_flags,
+						   offset0);
+	if (iov_iter_is_pipe(i))
+		return iov_iter_extract_pipe_pages(i, pages, maxsize,
+						   maxpages, extract_flags,
+						   offset0);
+	if (iov_iter_is_xarray(i))
+		return iov_iter_extract_xarray_pages(i, pages, maxsize,
+						     maxpages, extract_flags,
+						     offset0);
+	return -EFAULT;
+}
+EXPORT_SYMBOL_GPL(iov_iter_extract_pages);



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v7 3/8] mm: Provide a helper to drop a pin/ref on a page
       [not found] <20230120175556.3556978-1-dhowells@redhat.com>
  2023-01-20 17:55 ` [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator David Howells
@ 2023-01-20 17:55 ` David Howells
  2023-01-20 17:55 ` [PATCH v7 8/8] mm: Renumber FOLL_GET and FOLL_PIN down David Howells
  2 siblings, 0 replies; 28+ messages in thread
From: David Howells @ 2023-01-20 17:55 UTC (permalink / raw)
  To: Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Christoph Hellwig, linux-mm

Provide a helper in the get_user_pages code to drop a pin or a ref on a
page based on being given FOLL_GET or FOLL_PIN in its flags argument or do
nothing if neither is set.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-block@vger.kernel.org
cc: linux-mm@kvack.org
---
 include/linux/mm.h |  3 +++
 mm/gup.c           | 22 ++++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f3f196e4d66d..f1cf8f4eb946 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1367,6 +1367,9 @@ static inline bool is_cow_mapping(vm_flags_t flags)
 #define SECTION_IN_PAGE_FLAGS
 #endif
 
+void folio_put_unpin(struct folio *folio, unsigned int flags);
+void page_put_unpin(struct page *page, unsigned int flags);
+
 /*
  * The identification function is mainly used by the buddy allocator for
  * determining if two pages could be buddies. We are not really identifying
diff --git a/mm/gup.c b/mm/gup.c
index f45a3a5be53a..3ee4b4c7e0cb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -191,6 +191,28 @@ static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
 		folio_put_refs(folio, refs);
 }
 
+/**
+ * folio_put_unpin - Unpin/put a folio as appropriate
+ * @folio: The folio to release
+ * @flags: gup flags indicating the mode of release (FOLL_*)
+ *
+ * Release a folio according to the flags.  If FOLL_GET is set, the folio has a
+ * ref dropped; if FOLL_PIN is set, it is unpinned; otherwise it is left
+ * unaltered.
+ */
+void folio_put_unpin(struct folio *folio, unsigned int flags)
+{
+	if (flags & (FOLL_GET | FOLL_PIN))
+		gup_put_folio(folio, 1, flags);
+}
+EXPORT_SYMBOL_GPL(folio_put_unpin);
+
+void page_put_unpin(struct page *page, unsigned int flags)
+{
+	folio_put_unpin(page_folio(page), flags);
+}
+EXPORT_SYMBOL_GPL(page_put_unpin);
+
 /**
  * try_grab_page() - elevate a page's refcount by a flag-dependent amount
  * @page:    pointer to page to be grabbed



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH v7 8/8] mm: Renumber FOLL_GET and FOLL_PIN down
       [not found] <20230120175556.3556978-1-dhowells@redhat.com>
  2023-01-20 17:55 ` [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator David Howells
  2023-01-20 17:55 ` [PATCH v7 3/8] mm: Provide a helper to drop a pin/ref on a page David Howells
@ 2023-01-20 17:55 ` David Howells
  2023-01-20 18:59   ` Matthew Wilcox
  2023-01-20 19:18   ` David Howells
  2 siblings, 2 replies; 28+ messages in thread
From: David Howells @ 2023-01-20 17:55 UTC (permalink / raw)
  To: Al Viro, Christoph Hellwig
  Cc: David Howells, Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Christoph Hellwig, linux-mm

Renumber FOLL_GET and FOLL_PIN down to bit 0 and 1 respectively so that
they are coincidentally the same as BIO_PAGE_REFFED and BIO_PAGE_PINNED and
also so that they can be stored in the bottom two bits of a page pointer
(something I'm looking at for zerocopy socket fragments).

Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Christoph Hellwig <hch@lst.de>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
---
 include/linux/mm.h | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f1cf8f4eb946..33c9eacd9548 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3074,12 +3074,13 @@ static inline vm_fault_t vmf_error(int err)
 struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 			 unsigned int foll_flags);
 
-#define FOLL_WRITE	0x01	/* check pte is writable */
-#define FOLL_TOUCH	0x02	/* mark page accessed */
-#define FOLL_GET	0x04	/* do get_page on page */
-#define FOLL_DUMP	0x08	/* give error on hole if it would be zero */
-#define FOLL_FORCE	0x10	/* get_user_pages read/write w/o permission */
-#define FOLL_NOWAIT	0x20	/* if a disk transfer is needed, start the IO
+#define FOLL_GET	0x01	/* do get_page on page (equivalent to BIO_FOLL_GET) */
+#define FOLL_PIN	0x02	/* pages must be released via unpin_user_page */
+#define FOLL_WRITE	0x04	/* check pte is writable */
+#define FOLL_TOUCH	0x08	/* mark page accessed */
+#define FOLL_DUMP	0x10	/* give error on hole if it would be zero */
+#define FOLL_FORCE	0x20	/* get_user_pages read/write w/o permission */
+#define FOLL_NOWAIT	0x40	/* if a disk transfer is needed, start the IO
 				 * and return without waiting upon it */
 #define FOLL_NOFAULT	0x80	/* do not fault in pages */
 #define FOLL_HWPOISON	0x100	/* check page is hwpoisoned */
@@ -3088,7 +3089,6 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 #define FOLL_ANON	0x8000	/* don't do file mappings */
 #define FOLL_LONGTERM	0x10000	/* mapping lifetime is indefinite: see below */
 #define FOLL_SPLIT_PMD	0x20000	/* split huge pmd before returning */
-#define FOLL_PIN	0x40000	/* pages must be released via unpin_user_page */
 #define FOLL_FAST_ONLY	0x80000	/* gup_fast: prevent fall-back to slow gup */
 #define FOLL_PCI_P2PDMA	0x100000 /* allow returning PCI P2PDMA pages */
 #define FOLL_INTERRUPTIBLE  0x200000 /* allow interrupts from generic signals */



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 8/8] mm: Renumber FOLL_GET and FOLL_PIN down
  2023-01-20 17:55 ` [PATCH v7 8/8] mm: Renumber FOLL_GET and FOLL_PIN down David Howells
@ 2023-01-20 18:59   ` Matthew Wilcox
  2023-01-20 19:18   ` David Howells
  1 sibling, 0 replies; 28+ messages in thread
From: Matthew Wilcox @ 2023-01-20 18:59 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Christoph Hellwig, linux-mm

On Fri, Jan 20, 2023 at 05:55:56PM +0000, David Howells wrote:
> Renumber FOLL_GET and FOLL_PIN down to bit 0 and 1 respectively so that
> they are coincidentally the same as BIO_PAGE_REFFED and BIO_PAGE_PINNED and
> also so that they can be stored in the bottom two bits of a page pointer
> (something I'm looking at for zerocopy socket fragments).
> 
> Signed-off-by: David Howells <dhowells@redhat.com>

Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>

Although I was hoping you'd renumber some of the others since we
currently have gaps at 0x200, 0x400, 0x1000, and 0x4000 as well
as the 0x40 you're using here.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 8/8] mm: Renumber FOLL_GET and FOLL_PIN down
  2023-01-20 17:55 ` [PATCH v7 8/8] mm: Renumber FOLL_GET and FOLL_PIN down David Howells
  2023-01-20 18:59   ` Matthew Wilcox
@ 2023-01-20 19:18   ` David Howells
  1 sibling, 0 replies; 28+ messages in thread
From: David Howells @ 2023-01-20 19:18 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: dhowells, Al Viro, Christoph Hellwig, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig, linux-mm

Matthew Wilcox <willy@infradead.org> wrote:

> Although I was hoping you'd renumber some of the others since we
> currently have gaps at 0x200, 0x400, 0x1000, and 0x4000 as well
> as the 0x40 you're using here.

Since the other patches don't require this one, I can renumber them all and
send it directly to Andrew.

David



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-20 17:55 ` [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator David Howells
@ 2023-01-21 13:01   ` Christoph Hellwig
  2023-01-21 13:10   ` Christoph Hellwig
                     ` (4 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2023-01-21 13:01 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig, John Hubbard, linux-mm

> +#define iov_iter_extract_mode(iter, extract_flags) \
> +	(user_backed_iter(iter) ?				\
> +	 (iter->data_source == ITER_SOURCE) ?			\
> +	 FOLL_GET : FOLL_PIN : 0)

The extract_flags argument is unused here.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-20 17:55 ` [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator David Howells
  2023-01-21 13:01   ` Christoph Hellwig
@ 2023-01-21 13:10   ` Christoph Hellwig
  2023-01-21 13:30   ` David Howells
                     ` (3 subsequent siblings)
  5 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2023-01-21 13:10 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig, John Hubbard, linux-mm

On Fri, Jan 20, 2023 at 05:55:50PM +0000, David Howells wrote:
>  (3) Any other sort of iterator.
> 
>      No refs or pins are obtained on the page, the assumption is made that
>      the caller will manage page retention.  ITER_ALLOW_P2PDMA is not
>      permitted.

The ITER_ALLOW_P2PDMA check here is fundamentally wrong and will
break things like io_uring to fixed buffers on NVMe.  ITER_ALLOW_P2PDMA
should strictly increase the group of acceptable iters, it does must
not restrict anything.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-20 17:55 ` [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator David Howells
  2023-01-21 13:01   ` Christoph Hellwig
  2023-01-21 13:10   ` Christoph Hellwig
@ 2023-01-21 13:30   ` David Howells
  2023-01-21 13:33     ` Christoph Hellwig
  2023-01-23 11:28   ` David Hildenbrand
                     ` (2 subsequent siblings)
  5 siblings, 1 reply; 28+ messages in thread
From: David Howells @ 2023-01-21 13:30 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dhowells, Al Viro, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig, John Hubbard, linux-mm

Christoph Hellwig <hch@infradead.org> wrote:

> On Fri, Jan 20, 2023 at 05:55:50PM +0000, David Howells wrote:
> >  (3) Any other sort of iterator.
> > 
> >      No refs or pins are obtained on the page, the assumption is made that
> >      the caller will manage page retention.  ITER_ALLOW_P2PDMA is not
> >      permitted.
> 
> The ITER_ALLOW_P2PDMA check here is fundamentally wrong and will
> break things like io_uring to fixed buffers on NVMe.  ITER_ALLOW_P2PDMA
> should strictly increase the group of acceptable iters, it does must
> not restrict anything.

So just drop the check?  Or do I actually need to do something to the pages to
make it work?  If I understand the code correctly, it's not actually operable
right now on any page that doesn't have an appropriately marked PTE.

David



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-21 13:30   ` David Howells
@ 2023-01-21 13:33     ` Christoph Hellwig
  0 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2023-01-21 13:33 UTC (permalink / raw)
  To: David Howells
  Cc: Christoph Hellwig, Al Viro, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig, John Hubbard, linux-mm

On Sat, Jan 21, 2023 at 01:30:47PM +0000, David Howells wrote:
> So just drop the check?  Or do I actually need to do something to the pages to
> make it work?  If I understand the code correctly, it's not actually operable
> right now on any page that doesn't have an appropriately marked PTE.

Yes, just drop the check. The flag just signals that the caller can
deal with p2p pages.  If it gets any or not is a totally different
story.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-20 17:55 ` [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator David Howells
                     ` (2 preceding siblings ...)
  2023-01-21 13:30   ` David Howells
@ 2023-01-23 11:28   ` David Hildenbrand
  2023-01-23 11:51   ` David Howells
  2023-01-23 12:00   ` David Howells
  5 siblings, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2023-01-23 11:28 UTC (permalink / raw)
  To: David Howells, Al Viro, Christoph Hellwig
  Cc: Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Christoph Hellwig, John Hubbard, linux-mm

On 20.01.23 18:55, David Howells wrote:
> Add a function, iov_iter_extract_pages(), to extract a list of pages from
> an iterator.  The pages may be returned with a reference added or a pin
> added or neither, depending on the type of iterator and the direction of
> transfer.  The caller must pass FOLL_READ_FROM_MEM or FOLL_WRITE_TO_MEM
> as part of gup_flags to indicate how the iterator contents are to be used.
> 
> Add a second function, iov_iter_extract_mode(), to determine how the
> cleanup should be done.
> 
> There are three cases:
> 
>   (1) Transfer *into* an ITER_IOVEC or ITER_UBUF iterator.
> 
>       Extracted pages will have pins obtained on them (but not references)
>       so that fork() doesn't CoW the pages incorrectly whilst the I/O is in
>       progress.
> 
>       iov_iter_extract_mode() will return FOLL_PIN for this case.  The
>       caller should use something like unpin_user_page() to dispose of the
>       page.
> 
>   (2) Transfer is *out of* an ITER_IOVEC or ITER_UBUF iterator.
> 
>       Extracted pages will have references obtained on them, but not pins.
> 
>       iov_iter_extract_mode() will return FOLL_GET.  The caller should use
>       something like put_page() for page disposal.
> 
>   (3) Any other sort of iterator.
> 
>       No refs or pins are obtained on the page, the assumption is made that
>       the caller will manage page retention.  ITER_ALLOW_P2PDMA is not
>       permitted.
> 
>       iov_iter_extract_mode() will return 0.  The pages don't need
>       additional disposal.
> 
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: Al Viro <viro@zeniv.linux.org.uk>
> cc: Christoph Hellwig <hch@lst.de>
> cc: John Hubbard <jhubbard@nvidia.com>
> cc: Matthew Wilcox <willy@infradead.org>
> cc: linux-fsdevel@vger.kernel.org
> cc: linux-mm@kvack.org
> Link: https://lore.kernel.org/r/166920903885.1461876.692029808682876184.stgit@warthog.procyon.org.uk/ # v2
> Link: https://lore.kernel.org/r/166997421646.9475.14837976344157464997.stgit@warthog.procyon.org.uk/ # v3
> Link: https://lore.kernel.org/r/167305163883.1521586.10777155475378874823.stgit@warthog.procyon.org.uk/ # v4
> Link: https://lore.kernel.org/r/167344728530.2425628.9613910866466387722.stgit@warthog.procyon.org.uk/ # v5
> Link: https://lore.kernel.org/r/167391053207.2311931.16398133457201442907.stgit@warthog.procyon.org.uk/ # v6
> ---
> 
> Notes:
>      ver #7)
>       - Switch to passing in iter-specific flags rather than FOLL_* flags.
>       - Drop the direction flags for now.
>       - Use ITER_ALLOW_P2PDMA to request FOLL_PCI_P2PDMA.
>       - Disallow use of ITER_ALLOW_P2PDMA with non-user-backed iter.
>       - Add support for extraction from KVEC-type iters.
>       - Use iov_iter_advance() rather than open-coding it.
>       - Make BVEC- and KVEC-type skip over initial empty vectors.
>      
>      ver #6)
>       - Add back the function to indicate the cleanup mode.
>       - Drop the cleanup_mode return arg to iov_iter_extract_pages().
>       - Pass FOLL_SOURCE/DEST_BUF in gup_flags.  Check this against the iter
>         data_source.
>      
>      ver #4)
>       - Use ITER_SOURCE/DEST instead of WRITE/READ.
>       - Allow additional FOLL_* flags, such as FOLL_PCI_P2PDMA to be passed in.
>      
>      ver #3)
>       - Switch to using EXPORT_SYMBOL_GPL to prevent indirect 3rd-party access
>         to get/pin_user_pages_fast()[1].
> 
>   include/linux/uio.h |  28 +++
>   lib/iov_iter.c      | 424 ++++++++++++++++++++++++++++++++++++++++++++
>   2 files changed, 452 insertions(+)
> 
> diff --git a/include/linux/uio.h b/include/linux/uio.h
> index 46d5080314c6..a4233049ab7a 100644
> --- a/include/linux/uio.h
> +++ b/include/linux/uio.h
> @@ -363,4 +363,32 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
>   /* Flags for iov_iter_get/extract_pages*() */
>   #define ITER_ALLOW_P2PDMA	0x01	/* Allow P2PDMA on the extracted pages */
>   
> +ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
> +			       size_t maxsize, unsigned int maxpages,
> +			       unsigned int extract_flags, size_t *offset0);
> +
> +/**
> + * iov_iter_extract_mode - Indicate how pages from the iterator will be retained
> + * @iter: The iterator
> + * @extract_flags: How the iterator is to be used
> + *
> + * Examine the iterator and @extract_flags and indicate by returning FOLL_PIN,
> + * FOLL_GET or 0 as to how, if at all, pages extracted from the iterator will
> + * be retained by the extraction function.
> + *
> + * FOLL_GET indicates that the pages will have a reference taken on them that
> + * the caller must put.  This can be done for DMA/async DIO write from a page.
> + *
> + * FOLL_PIN indicates that the pages will have a pin placed in them that the
> + * caller must unpin.  This is must be done for DMA/async DIO read to a page to
> + * avoid CoW problems in fork.
> + *
> + * 0 indicates that no measures are taken and that it's up to the caller to
> + * retain the pages.
> + */
> +#define iov_iter_extract_mode(iter, extract_flags) \
> +	(user_backed_iter(iter) ?				\
> +	 (iter->data_source == ITER_SOURCE) ?			\
> +	 FOLL_GET : FOLL_PIN : 0)
> +
>

How does this work align with the goal of no longer using FOLL_GET for 
O_DIRECT? We should get rid of any FOLL_GET usage for accessing page 
content.

@John, any comments?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-20 17:55 ` [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator David Howells
                     ` (3 preceding siblings ...)
  2023-01-23 11:28   ` David Hildenbrand
@ 2023-01-23 11:51   ` David Howells
  2023-01-23 13:11     ` David Hildenbrand
  2023-01-23 13:19     ` David Howells
  2023-01-23 12:00   ` David Howells
  5 siblings, 2 replies; 28+ messages in thread
From: David Howells @ 2023-01-23 11:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dhowells, Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel, Christoph Hellwig, John Hubbard,
	linux-mm

David Hildenbrand <david@redhat.com> wrote:

> How does this work align with the goal of no longer using FOLL_GET for
> O_DIRECT? We should get rid of any FOLL_GET usage for accessing page content.

Would that run the risk of changes being made by the child being visible to
the a DIO write if the parent changes the buffer first?


	PARENT			CHILD
	======			=====
	start-DIO-write
	fork() = pid		fork() = 0
	alter-buffer
	CoW happens
	page copied		original page retained
				alter-buffer
		<DMA-happens>

David



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-20 17:55 ` [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator David Howells
                     ` (4 preceding siblings ...)
  2023-01-23 11:51   ` David Howells
@ 2023-01-23 12:00   ` David Howells
  5 siblings, 0 replies; 28+ messages in thread
From: David Howells @ 2023-01-23 12:00 UTC (permalink / raw)
  Cc: dhowells, David Hildenbrand, Al Viro, Christoph Hellwig,
	Matthew Wilcox, Jens Axboe, Jan Kara, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Christoph Hellwig, John Hubbard, linux-mm

David Howells <dhowells@redhat.com> wrote:

> > How does this work align with the goal of no longer using FOLL_GET for
> > O_DIRECT? We should get rid of any FOLL_GET usage for accessing page content.
> 
> Would that run the risk of changes being made by the child being visible to
> the a DIO write if the parent changes the buffer first?
> 
> 
> 	PARENT			CHILD
> 	======			=====
> 	start-DIO-write
> 	fork() = pid		fork() = 0
> 	alter-buffer
> 	CoW happens
> 	page copied		original page retained
> 				alter-buffer
> 		<DMA-happens>

Ah, I think I might have got the wrong end of the stick.  A pinned page is
*always* copied on fork() if I understand copy_present_pte() correctly.

David



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 11:51   ` David Howells
@ 2023-01-23 13:11     ` David Hildenbrand
  2023-01-23 13:19     ` David Howells
  1 sibling, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2023-01-23 13:11 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig, John Hubbard, linux-mm

On 23.01.23 12:51, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
> 
>> How does this work align with the goal of no longer using FOLL_GET for
>> O_DIRECT? We should get rid of any FOLL_GET usage for accessing page content.
> 
> Would that run the risk of changes being made by the child being visible to
> the a DIO write if the parent changes the buffer first?
> 
> 
> 	PARENT			CHILD
> 	======			=====
> 	start-DIO-write
> 	fork() = pid		fork() = 0
> 	alter-buffer
> 	CoW happens
> 	page copied		original page retained
> 				alter-buffer
> 		<DMA-happens>

FOLL_PIN users are fine in that regard, because we properly detect 
"maybe pinned" during fork() and copy the page. See 
tools/testing/selftests/mm/cow.c (still called 
tools/testing/selftests/vm/cow.c upstream IIRC) for some test cases for 
that handling.

FOLL_GET does not work as expected in that regard: pages can't be 
detected as pinned and we won't be copying them during fork(). We'll end 
up COW-sharing them, which can result in trouble later.

Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure 
what the status is. Interestingly, 
Documentation/core-api/pin_user_pages.rst already documents that "CASE 
1: Direct IO (DIO)" uses FOLL_PIN ... which does, unfortunately, no 
reflect reality yet.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 11:51   ` David Howells
  2023-01-23 13:11     ` David Hildenbrand
@ 2023-01-23 13:19     ` David Howells
  2023-01-23 13:24       ` David Hildenbrand
  2023-01-23 13:38       ` David Howells
  1 sibling, 2 replies; 28+ messages in thread
From: David Howells @ 2023-01-23 13:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dhowells, Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel, Christoph Hellwig, John Hubbard,
	linux-mm

David Hildenbrand <david@redhat.com> wrote:

> Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure what
> the status is. Interestingly, Documentation/core-api/pin_user_pages.rst
> already documents that "CASE 1: Direct IO (DIO)" uses FOLL_PIN ... which does,
> unfortunately, no reflect reality yet.

Yeah - I just came across that.

Should iov_iter.c then switch entirely to using pin_user_pages(), rather than
get_user_pages()?  In which case my patches only need keep track of
pinned/not-pinned and never "got".

David



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 13:19     ` David Howells
@ 2023-01-23 13:24       ` David Hildenbrand
  2023-01-23 19:56         ` John Hubbard
  2023-01-26 22:15         ` Al Viro
  2023-01-23 13:38       ` David Howells
  1 sibling, 2 replies; 28+ messages in thread
From: David Hildenbrand @ 2023-01-23 13:24 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig, John Hubbard, linux-mm

On 23.01.23 14:19, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
> 
>> Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure what
>> the status is. Interestingly, Documentation/core-api/pin_user_pages.rst
>> already documents that "CASE 1: Direct IO (DIO)" uses FOLL_PIN ... which does,
>> unfortunately, no reflect reality yet.
> 
> Yeah - I just came across that.
> 
> Should iov_iter.c then switch entirely to using pin_user_pages(), rather than
> get_user_pages()?  In which case my patches only need keep track of
> pinned/not-pinned and never "got".

That would be the ideal case: whenever intending to access page content, 
use FOLL_PIN instead of FOLL_GET.

The issue that John was trying to sort out was that there are plenty of 
callsites that do a simple put_page() instead of calling 
unpin_user_page(). IIRC, handling that correctly in existing code -- 
what was pinned must be released via unpin_user_page() -- was the 
biggest workitem.

Not sure how that relates to your work here (that's why I was asking): 
if you could avoid FOLL_GET, that would be great :)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 13:19     ` David Howells
  2023-01-23 13:24       ` David Hildenbrand
@ 2023-01-23 13:38       ` David Howells
  2023-01-23 14:20         ` David Hildenbrand
  2023-01-23 16:11         ` Jan Kara
  1 sibling, 2 replies; 28+ messages in thread
From: David Howells @ 2023-01-23 13:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dhowells, Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel, Christoph Hellwig, John Hubbard,
	linux-mm

David Hildenbrand <david@redhat.com> wrote:

> That would be the ideal case: whenever intending to access page content, use
> FOLL_PIN instead of FOLL_GET.
> 
> The issue that John was trying to sort out was that there are plenty of
> callsites that do a simple put_page() instead of calling
> unpin_user_page(). IIRC, handling that correctly in existing code -- what was
> pinned must be released via unpin_user_page() -- was the biggest workitem.
> 
> Not sure how that relates to your work here (that's why I was asking): if you
> could avoid FOLL_GET, that would be great :)

Well, it simplifies things a bit.

I can make the new iov_iter_extract_pages() just do "pin" or "don't pin" and
do no ref-getting at all.  Things can be converted over to "unpin the pages or
doing nothing" as they're converted over to using iov_iter_extract_pages()
from iov_iter_get_pages*().

The block bio code then only needs a single bit of state: pinned or not
pinned.

For cifs RDMA, do I need to make it pass in FOLL_LONGTERM?  And does that need
a special cleanup?

sk_buff fragment handling could still be tricky.  I'm thinking that in that
code I'll need to store FOLL_GET/PIN in the bottom two bits of the frag page
pointer.  Sometimes it allocates a new page and attaches it (have ref);
sometimes it does zerocopy to/from a page (have pin) and sometimes it may be
pointing to a kernel buffer (don't pin or ref).

David



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 13:38       ` David Howells
@ 2023-01-23 14:20         ` David Hildenbrand
  2023-01-23 14:48           ` Christoph Hellwig
  2023-01-23 16:11         ` Jan Kara
  1 sibling, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2023-01-23 14:20 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig, John Hubbard, linux-mm

On 23.01.23 14:38, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
> 
>> That would be the ideal case: whenever intending to access page content, use
>> FOLL_PIN instead of FOLL_GET.
>>
>> The issue that John was trying to sort out was that there are plenty of
>> callsites that do a simple put_page() instead of calling
>> unpin_user_page(). IIRC, handling that correctly in existing code -- what was
>> pinned must be released via unpin_user_page() -- was the biggest workitem.
>>
>> Not sure how that relates to your work here (that's why I was asking): if you
>> could avoid FOLL_GET, that would be great :)
> 
> Well, it simplifies things a bit.
> 
> I can make the new iov_iter_extract_pages() just do "pin" or "don't pin" and
> do no ref-getting at all.  Things can be converted over to "unpin the pages or
> doing nothing" as they're converted over to using iov_iter_extract_pages()
> from iov_iter_get_pages*().
> 
> The block bio code then only needs a single bit of state: pinned or not
> pinned.

Unfortunately, I'll have to let BIO experts comment on that :) I only 
know the MM side of things here.

> 
> For cifs RDMA, do I need to make it pass in FOLL_LONGTERM?  And does that need
> a special cleanup?

Anything that holds pins "possibly forever" should that. vmsplice() is 
another example that should use it, once properly using FOLL_PIN. 
[FOLL_GET | FOLL_LONGTERM is not really used/defined with semantics]

> 
> sk_buff fragment handling could still be tricky.  I'm thinking that in that
> code I'll need to store FOLL_GET/PIN in the bottom two bits of the frag page
> pointer.  Sometimes it allocates a new page and attaches it (have ref);
> sometimes it does zerocopy to/from a page (have pin) and sometimes it may be
> pointing to a kernel buffer (don't pin or ref).
> 
> David
> 

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 14:20         ` David Hildenbrand
@ 2023-01-23 14:48           ` Christoph Hellwig
  0 siblings, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2023-01-23 14:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: David Howells, Al Viro, Christoph Hellwig, Matthew Wilcox,
	Jens Axboe, Jan Kara, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel, Christoph Hellwig,
	John Hubbard, linux-mm

On Mon, Jan 23, 2023 at 03:20:41PM +0100, David Hildenbrand wrote:
> > The block bio code then only needs a single bit of state: pinned or not
> > pinned.
> 
> Unfortunately, I'll have to let BIO experts comment on that :) I only know
> the MM side of things here.

It can work for bios.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 13:38       ` David Howells
  2023-01-23 14:20         ` David Hildenbrand
@ 2023-01-23 16:11         ` Jan Kara
  2023-01-23 16:17           ` Christoph Hellwig
  2023-01-23 23:07           ` John Hubbard
  1 sibling, 2 replies; 28+ messages in thread
From: Jan Kara @ 2023-01-23 16:11 UTC (permalink / raw)
  To: David Howells
  Cc: David Hildenbrand, Al Viro, Christoph Hellwig, Matthew Wilcox,
	Jens Axboe, Jan Kara, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel, Christoph Hellwig,
	John Hubbard, linux-mm

On Mon 23-01-23 13:38:31, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
> 
> > That would be the ideal case: whenever intending to access page content, use
> > FOLL_PIN instead of FOLL_GET.
> > 
> > The issue that John was trying to sort out was that there are plenty of
> > callsites that do a simple put_page() instead of calling
> > unpin_user_page(). IIRC, handling that correctly in existing code -- what was
> > pinned must be released via unpin_user_page() -- was the biggest workitem.
> > 
> > Not sure how that relates to your work here (that's why I was asking): if you
> > could avoid FOLL_GET, that would be great :)
> 
> Well, it simplifies things a bit.
> 
> I can make the new iov_iter_extract_pages() just do "pin" or "don't pin" and
> do no ref-getting at all.  Things can be converted over to "unpin the pages or
> doing nothing" as they're converted over to using iov_iter_extract_pages()
> from iov_iter_get_pages*().
> 
> The block bio code then only needs a single bit of state: pinned or not
> pinned.

I'm all for using only pin/unpin in the end. But you'd still need to handle
the 'put' for the intermediate time when there are still FOLL_GET users of
the bio infrastructure, wouldn't you?

> For cifs RDMA, do I need to make it pass in FOLL_LONGTERM?  And does that need
> a special cleanup?

FOLL_LONGTERM doesn't need a special cleanup AFAIK. It should be used
whenever there isn't reasonably bound time after which the page is
unpinned. So in case CIFS sets up RDMA and then it is up to userspace how
long the RDMA is going to be running it should be using FOLL_LONGTERM. The
thing is that pins can block e.g. truncate for DAX inodes and so longterm
pins are not supported for DAX backed pages.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 16:11         ` Jan Kara
@ 2023-01-23 16:17           ` Christoph Hellwig
  2023-01-23 23:07           ` John Hubbard
  1 sibling, 0 replies; 28+ messages in thread
From: Christoph Hellwig @ 2023-01-23 16:17 UTC (permalink / raw)
  To: Jan Kara
  Cc: David Howells, David Hildenbrand, Al Viro, Christoph Hellwig,
	Matthew Wilcox, Jens Axboe, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel, Christoph Hellwig,
	John Hubbard, linux-mm

On Mon, Jan 23, 2023 at 05:11:14PM +0100, Jan Kara wrote:
> I'm all for using only pin/unpin in the end. But you'd still need to handle
> the 'put' for the intermediate time when there are still FOLL_GET users of
> the bio infrastructure, wouldn't you?

Yes, we need it for the transition.  But Dave already has a patch to
convert the legacy direct-io code as well, at which point we can
the retire the get bit.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 13:24       ` David Hildenbrand
@ 2023-01-23 19:56         ` John Hubbard
  2023-01-26 22:15         ` Al Viro
  1 sibling, 0 replies; 28+ messages in thread
From: John Hubbard @ 2023-01-23 19:56 UTC (permalink / raw)
  To: David Hildenbrand, David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig, linux-mm

On 1/23/23 05:24, David Hildenbrand wrote:
> On 23.01.23 14:19, David Howells wrote:
>> David Hildenbrand <david@redhat.com> wrote:
>>
>>> Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure what
>>> the status is. Interestingly, Documentation/core-api/pin_user_pages.rst
>>> already documents that "CASE 1: Direct IO (DIO)" uses FOLL_PIN ... which does,
>>> unfortunately, no reflect reality yet.
>>

Yes, that part of the documentation is...aspirational. :) But this
series is taking us there, so good. Let me go review v8 of the series in
actual detail to verify but it sounds very promising.

>> Yeah - I just came across that.
>>
>> Should iov_iter.c then switch entirely to using pin_user_pages(), rather than
>> get_user_pages()?  In which case my patches only need keep track of
>> pinned/not-pinned and never "got".
> 
> That would be the ideal case: whenever intending to access page content, use FOLL_PIN instead of FOLL_GET.
> 
> The issue that John was trying to sort out was that there are plenty of callsites that do a simple put_page() instead of calling unpin_user_page(). IIRC, handling that correctly in existing code -- what was pinned must be released via unpin_user_page() -- was the biggest workitem.
> 
> Not sure how that relates to your work here (that's why I was asking): if you could avoid FOLL_GET, that would be great :)
> 

The largest part of this problem has been: __iov_iter_get_pages_alloc()
calls get_user_pages_fast() (so, FOLL_GET), as part of the Direct IO
path. And that __iov_iter_get_pages_alloc() is also called by a wide
variety of things that are not Direct IO: networking, crytpo, RDS.

So splitting out a variant that only Direct IO uses is a great move and
should allow conversion of Direct IO to FOLL_PIN. Again, let me go do an
actual review to check on that.

thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 16:11         ` Jan Kara
  2023-01-23 16:17           ` Christoph Hellwig
@ 2023-01-23 23:07           ` John Hubbard
  2023-01-24  5:57             ` Christoph Hellwig
  1 sibling, 1 reply; 28+ messages in thread
From: John Hubbard @ 2023-01-23 23:07 UTC (permalink / raw)
  To: Jan Kara, David Howells
  Cc: David Hildenbrand, Al Viro, Christoph Hellwig, Matthew Wilcox,
	Jens Axboe, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel, Christoph Hellwig, linux-mm

On 1/23/23 08:11, Jan Kara wrote:
>> For cifs RDMA, do I need to make it pass in FOLL_LONGTERM?  And does that need
>> a special cleanup?
> 
> FOLL_LONGTERM doesn't need a special cleanup AFAIK. It should be used
> whenever there isn't reasonably bound time after which the page is
> unpinned. So in case CIFS sets up RDMA and then it is up to userspace how
> long the RDMA is going to be running it should be using FOLL_LONGTERM. The

Yes, we have been pretty consistently deciding that RDMA generally
implies FOLL_LONGTERM. (And furthermore, FOLL_LONGTERM implies
FOLL_PIN--that one is actually enforced by the gup/pup APIs.)


thanks,
-- 
John Hubbard
NVIDIA


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 23:07           ` John Hubbard
@ 2023-01-24  5:57             ` Christoph Hellwig
  2023-01-24  6:55               ` John Hubbard
  0 siblings, 1 reply; 28+ messages in thread
From: Christoph Hellwig @ 2023-01-24  5:57 UTC (permalink / raw)
  To: John Hubbard
  Cc: Jan Kara, David Howells, David Hildenbrand, Al Viro,
	Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jeff Layton,
	Logan Gunthorpe, linux-fsdevel, linux-block, linux-kernel,
	Christoph Hellwig, linux-mm

On Mon, Jan 23, 2023 at 03:07:48PM -0800, John Hubbard wrote:
> On 1/23/23 08:11, Jan Kara wrote:
> > > For cifs RDMA, do I need to make it pass in FOLL_LONGTERM?  And does that need
> > > a special cleanup?
> > 
> > FOLL_LONGTERM doesn't need a special cleanup AFAIK. It should be used
> > whenever there isn't reasonably bound time after which the page is
> > unpinned. So in case CIFS sets up RDMA and then it is up to userspace how
> > long the RDMA is going to be running it should be using FOLL_LONGTERM. The
> 
> Yes, we have been pretty consistently deciding that RDMA generally
> implies FOLL_LONGTERM. (And furthermore, FOLL_LONGTERM implies
> FOLL_PIN--that one is actually enforced by the gup/pup APIs.)

That's weird.  For storage or file systems, pages are pinnen just as
long when using RDMA as when using local DMA, in fact if you do RDMA
to really fast remote media vs slow local media (e.g. SSD vs disk) you
might pin it shorter when using RDMA.

I think FOLL_LONGTERM makes sense for non-ODP user space memory
registrations for RDMA, which will last basically forever.  It does
not really make much sense at all for in-kernel memory registration for
RDMA that are very short term.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-24  5:57             ` Christoph Hellwig
@ 2023-01-24  6:55               ` John Hubbard
  0 siblings, 0 replies; 28+ messages in thread
From: John Hubbard @ 2023-01-24  6:55 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, David Howells, David Hildenbrand, Al Viro,
	Matthew Wilcox, Jens Axboe, Jeff Layton, Logan Gunthorpe,
	linux-fsdevel, linux-block, linux-kernel, Christoph Hellwig,
	linux-mm

On 1/23/23 21:57, Christoph Hellwig wrote:
> On Mon, Jan 23, 2023 at 03:07:48PM -0800, John Hubbard wrote:
>> On 1/23/23 08:11, Jan Kara wrote:
>>>> For cifs RDMA, do I need to make it pass in FOLL_LONGTERM?  And does that need
>>>> a special cleanup?
>>>
>>> FOLL_LONGTERM doesn't need a special cleanup AFAIK. It should be used
>>> whenever there isn't reasonably bound time after which the page is
>>> unpinned. So in case CIFS sets up RDMA and then it is up to userspace how
>>> long the RDMA is going to be running it should be using FOLL_LONGTERM. The
>>
>> Yes, we have been pretty consistently deciding that RDMA generally
>> implies FOLL_LONGTERM. (And furthermore, FOLL_LONGTERM implies
>> FOLL_PIN--that one is actually enforced by the gup/pup APIs.)
> 
> That's weird.  For storage or file systems, pages are pinnen just as
> long when using RDMA as when using local DMA, in fact if you do RDMA
> to really fast remote media vs slow local media (e.g. SSD vs disk) you
> might pin it shorter when using RDMA.

ah OK, it is possible that I've been thinking too much about the HPC
cases that set up RDMA and leave it there, and not enough about
storage.

> 
> I think FOLL_LONGTERM makes sense for non-ODP user space memory
> registrations for RDMA, which will last basically forever.  It does
> not really make much sense at all for in-kernel memory registration for
> RDMA that are very short term.

Agreed.


thanks,
-- 
John Hubbard
NVIDIA



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-23 13:24       ` David Hildenbrand
  2023-01-23 19:56         ` John Hubbard
@ 2023-01-26 22:15         ` Al Viro
  2023-01-26 23:41           ` David Hildenbrand
  2023-01-27  0:05           ` David Howells
  1 sibling, 2 replies; 28+ messages in thread
From: Al Viro @ 2023-01-26 22:15 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: David Howells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel, Christoph Hellwig, John Hubbard,
	linux-mm

On Mon, Jan 23, 2023 at 02:24:13PM +0100, David Hildenbrand wrote:
> On 23.01.23 14:19, David Howells wrote:
> > David Hildenbrand <david@redhat.com> wrote:
> > 
> > > Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure what
> > > the status is. Interestingly, Documentation/core-api/pin_user_pages.rst
> > > already documents that "CASE 1: Direct IO (DIO)" uses FOLL_PIN ... which does,
> > > unfortunately, no reflect reality yet.
> > 
> > Yeah - I just came across that.
> > 
> > Should iov_iter.c then switch entirely to using pin_user_pages(), rather than
> > get_user_pages()?  In which case my patches only need keep track of
> > pinned/not-pinned and never "got".
> 
> That would be the ideal case: whenever intending to access page content, use
> FOLL_PIN instead of FOLL_GET.
> 
> The issue that John was trying to sort out was that there are plenty of
> callsites that do a simple put_page() instead of calling unpin_user_page().
> IIRC, handling that correctly in existing code -- what was pinned must be
> released via unpin_user_page() -- was the biggest workitem.
> 
> Not sure how that relates to your work here (that's why I was asking): if
> you could avoid FOLL_GET, that would be great :)

Take a good look at iter_to_pipe().  It does *not* need to pin anything
(we have an ITER_SOURCE there); with this approach it will.  And it
will stuff those pinned references into a pipe, where they can sit
indefinitely.

IOW, I don't believe it's a usable approach.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-26 22:15         ` Al Viro
@ 2023-01-26 23:41           ` David Hildenbrand
  2023-01-27  0:05           ` David Howells
  1 sibling, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2023-01-26 23:41 UTC (permalink / raw)
  To: Al Viro
  Cc: David Howells, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel, Christoph Hellwig, John Hubbard,
	linux-mm

On 26.01.23 23:15, Al Viro wrote:
> On Mon, Jan 23, 2023 at 02:24:13PM +0100, David Hildenbrand wrote:
>> On 23.01.23 14:19, David Howells wrote:
>>> David Hildenbrand <david@redhat.com> wrote:
>>>
>>>> Switching from FOLL_GET to FOLL_PIN was in the works by John H. Not sure what
>>>> the status is. Interestingly, Documentation/core-api/pin_user_pages.rst
>>>> already documents that "CASE 1: Direct IO (DIO)" uses FOLL_PIN ... which does,
>>>> unfortunately, no reflect reality yet.
>>>
>>> Yeah - I just came across that.
>>>
>>> Should iov_iter.c then switch entirely to using pin_user_pages(), rather than
>>> get_user_pages()?  In which case my patches only need keep track of
>>> pinned/not-pinned and never "got".
>>
>> That would be the ideal case: whenever intending to access page content, use
>> FOLL_PIN instead of FOLL_GET.
>>
>> The issue that John was trying to sort out was that there are plenty of
>> callsites that do a simple put_page() instead of calling unpin_user_page().
>> IIRC, handling that correctly in existing code -- what was pinned must be
>> released via unpin_user_page() -- was the biggest workitem.
>>
>> Not sure how that relates to your work here (that's why I was asking): if
>> you could avoid FOLL_GET, that would be great :)
> 
> Take a good look at iter_to_pipe().  It does *not* need to pin anything
> (we have an ITER_SOURCE there); with this approach it will.  And it
> will stuff those pinned references into a pipe, where they can sit
> indefinitely.
> 
> IOW, I don't believe it's a usable approach.
> 

Not sure what makes you believe that FOLL_GET is any better for this 
long-term pinning, I'd like to learn about that.

As raised already somewhere in the whole discussion by me, the right way 
to take such a long-term ping as vmsplice() does is to use 
FOLL_PIN|FOLL_LONGTERM. As also raised, that will fix the last remaining 
vmsplice()+hugetlb COW issue as tested by the cow.c vm selftest and make 
sure to migrate that memory off of MIGRATE_MOVABLE/CMA memory where we 
cannot tolerate to have long-term unmovable memory sitting around.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-26 22:15         ` Al Viro
  2023-01-26 23:41           ` David Hildenbrand
@ 2023-01-27  0:05           ` David Howells
  2023-01-27  0:20             ` David Hildenbrand
  1 sibling, 1 reply; 28+ messages in thread
From: David Howells @ 2023-01-27  0:05 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: dhowells, Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe,
	Jan Kara, Jeff Layton, Logan Gunthorpe, linux-fsdevel,
	linux-block, linux-kernel, Christoph Hellwig, John Hubbard,
	linux-mm

David Hildenbrand <david@redhat.com> wrote:

> As raised already somewhere in the whole discussion by me, the right way to
> take such a long-term ping as vmsplice() does is to use
> FOLL_PIN|FOLL_LONGTERM.

So the pipe infrastructure would have to be able to pin pages instead of
carrying refs on them?  What about pages just allocated and added to the pipe
ring in normal pipe use?

David



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator
  2023-01-27  0:05           ` David Howells
@ 2023-01-27  0:20             ` David Hildenbrand
  0 siblings, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2023-01-27  0:20 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, Christoph Hellwig, Matthew Wilcox, Jens Axboe, Jan Kara,
	Jeff Layton, Logan Gunthorpe, linux-fsdevel, linux-block,
	linux-kernel, Christoph Hellwig, John Hubbard, linux-mm

On 27.01.23 01:05, David Howells wrote:
> David Hildenbrand <david@redhat.com> wrote:
> 
>> As raised already somewhere in the whole discussion by me, the right way to
>> take such a long-term ping as vmsplice() does is to use
>> FOLL_PIN|FOLL_LONGTERM.
> 
> So the pipe infrastructure would have to be able to pin pages instead of
> carrying refs on them?  What about pages just allocated and added to the pipe
> ring in normal pipe use?
Ordinary kernel allocations (alloc_page() ...) always have to be freed 
via put_page() and friends. Such allocations are unmovable as default 
and don't require any special care.

Pages mapped into user space are movable as default and might be placed 
on ZONE_MOVABLE/CMA memory (well, and might be swapped out). 
FOLL_LONGTERM makes sure to migrate these pages off of such problematic 
physical memory regions, such that we can safely pin them until eternity.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2023-01-27  0:20 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20230120175556.3556978-1-dhowells@redhat.com>
2023-01-20 17:55 ` [PATCH v7 2/8] iov_iter: Add a function to extract a page list from an iterator David Howells
2023-01-21 13:01   ` Christoph Hellwig
2023-01-21 13:10   ` Christoph Hellwig
2023-01-21 13:30   ` David Howells
2023-01-21 13:33     ` Christoph Hellwig
2023-01-23 11:28   ` David Hildenbrand
2023-01-23 11:51   ` David Howells
2023-01-23 13:11     ` David Hildenbrand
2023-01-23 13:19     ` David Howells
2023-01-23 13:24       ` David Hildenbrand
2023-01-23 19:56         ` John Hubbard
2023-01-26 22:15         ` Al Viro
2023-01-26 23:41           ` David Hildenbrand
2023-01-27  0:05           ` David Howells
2023-01-27  0:20             ` David Hildenbrand
2023-01-23 13:38       ` David Howells
2023-01-23 14:20         ` David Hildenbrand
2023-01-23 14:48           ` Christoph Hellwig
2023-01-23 16:11         ` Jan Kara
2023-01-23 16:17           ` Christoph Hellwig
2023-01-23 23:07           ` John Hubbard
2023-01-24  5:57             ` Christoph Hellwig
2023-01-24  6:55               ` John Hubbard
2023-01-23 12:00   ` David Howells
2023-01-20 17:55 ` [PATCH v7 3/8] mm: Provide a helper to drop a pin/ref on a page David Howells
2023-01-20 17:55 ` [PATCH v7 8/8] mm: Renumber FOLL_GET and FOLL_PIN down David Howells
2023-01-20 18:59   ` Matthew Wilcox
2023-01-20 19:18   ` David Howells

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox