* [RFC PATCH 0/4] iov_iter: Add extraction helpers
@ 2022-11-17 14:54 David Howells
2022-11-17 14:54 ` [RFC PATCH 1/4] mm: Move FOLL_* defs to mm_types.h David Howells
2022-11-17 14:54 ` [RFC PATCH 2/4] iov_iter: Add a function to extract a page list from an iterator David Howells
0 siblings, 2 replies; 7+ messages in thread
From: David Howells @ 2022-11-17 14:54 UTC (permalink / raw)
To: Al Viro
Cc: linux-cachefs, Matthew Wilcox, Jeff Layton, linux-mm,
linux-fsdevel, Steve French, Shyam Prasad N, linux-cifs,
Rohith Surabattula, John Hubbard, Christoph Hellwig, dhowells,
Christoph Hellwig, Matthew Wilcox, Jeff Layton, linux-fsdevel,
linux-kernel
Hi Al,
Here are four patches to provide support for extracting pages from an
iov_iter, where such a thing makes sense, if you could take a look?
The first couple of patches provide iov_iter general stuff:
(1) Move the FOLL_* flags to linux/mm_types.h so that linux/uio.h can make
use of them.
(2) Add a function to list-only, get or pin pages from an iterator as a
future replacement for iov_iter_get_pages*(). Pointers to the pages
are placed into an array (which will get allocated if not provided)
and, depending on the iterator type and direction, the pages will have
a ref or a pin taken on them, or be left untouched (on the assumption
that the caller manages their lifetime).
The determination is:
UBUF/IOVEC + READ -> pin
UBUF/IOVEC + WRITE -> get
PIPE + READ -> list-only
BVEC/XARRAY -> list-only
Anything else -> EFAULT
It also adds a function by which the caller can determine which of
"list only, get or pin" the extraction function will actually do to
aid in cleaning up (returning 0, FOLL_GET or FOLL_PIN as appropriate).
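As a rough illustration of the table above, here is a minimal userspace
sketch, not the kernel code. The model_* names and enums are hypothetical
stand-ins for the iterator types and directions; only the FOLL_GET and
FOLL_PIN values are taken from patch 1. It folds the -EFAULT row into the
same function for brevity, whereas the proposed iov_iter_extract_mode()
itself only returns FOLL_PIN, FOLL_GET or 0.

```c
#include <assert.h>
#include <errno.h>

/* Values as defined by the FOLL_* flags moved in patch 1. */
#define FOLL_GET 0x04
#define FOLL_PIN 0x40000

/* Hypothetical stand-ins for the kernel's iterator types. */
enum model_iter_type {
	MODEL_UBUF, MODEL_IOVEC, MODEL_PIPE, MODEL_BVEC,
	MODEL_XARRAY, MODEL_KVEC, MODEL_DISCARD,
};

/* READ: data goes into the buffer; WRITE: data comes out of it. */
enum model_dir { MODEL_READ, MODEL_WRITE };

/*
 * Return FOLL_PIN, FOLL_GET or 0 (list-only) according to the table in
 * the cover letter, or -EFAULT for unsupported combinations.
 */
int model_extract_mode(enum model_iter_type type, enum model_dir dir)
{
	assert(dir == MODEL_READ || dir == MODEL_WRITE);

	switch (type) {
	case MODEL_UBUF:
	case MODEL_IOVEC:
		/* User-backed: pin for a destination, ref for a source. */
		return dir == MODEL_READ ? FOLL_PIN : FOLL_GET;
	case MODEL_PIPE:
		/* Only transfer into a pipe is supported. */
		return dir == MODEL_READ ? 0 : -EFAULT;
	case MODEL_BVEC:
	case MODEL_XARRAY:
		/* The caller already manages the page lifetime. */
		return 0;
	default:
		return -EFAULT;
	}
}
```

For example, model_extract_mode(MODEL_IOVEC, MODEL_READ) gives FOLL_PIN,
matching the "UBUF/IOVEC + READ -> pin" row.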
Then there are a couple of patches that add stuff to netfslib that I want
to use there as well as in cifs:
(3) Add a netfslib function to use (2) to extract pages from an ITER_IOVEC
or ITER_UBUF iterator into an ITER_BVEC iterator. This will get or
pin the pages as appropriate.
(4) Add a netfslib function to extract pages from an iterator that's of
type ITER_UBUF/IOVEC/BVEC/KVEC/XARRAY and add them to a scatterlist.
The function in (2) is used for UBUF and IOVEC iterators, so those
need cleaning up afterwards. Pages from BVEC and XARRAY iterators are
neither got nor pinned, and may be rendered into elements that span
multiple pages, for example if large folios are present.
I've pushed the patches here also:
https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-extract
David
Link: https://lore.kernel.org/r/166697254399.61150.1256557652599252121.stgit@warthog.procyon.org.uk/
---
David Howells (4):
mm: Move FOLL_* defs to mm_types.h
iov_iter: Add a function to extract a page list from an iterator
netfs: Add a function to extract a UBUF or IOVEC into a BVEC iterator
netfs: Add a function to extract an iterator into a scatterlist
fs/netfs/Makefile | 1 +
fs/netfs/iterator.c | 346 +++++++++++++++++++++++++++++++++++++++
include/linux/mm.h | 74 ---------
include/linux/mm_types.h | 73 +++++++++
include/linux/netfs.h | 5 +
include/linux/uio.h | 29 ++++
lib/iov_iter.c | 333 +++++++++++++++++++++++++++++++++++++
mm/vmalloc.c | 1 +
8 files changed, 788 insertions(+), 74 deletions(-)
create mode 100644 fs/netfs/iterator.c
* [RFC PATCH 1/4] mm: Move FOLL_* defs to mm_types.h
2022-11-17 14:54 [RFC PATCH 0/4] iov_iter: Add extraction helpers David Howells
@ 2022-11-17 14:54 ` David Howells
2022-11-17 23:15 ` John Hubbard
2022-11-22 12:46 ` Christoph Hellwig
2022-11-17 14:54 ` [RFC PATCH 2/4] iov_iter: Add a function to extract a page list from an iterator David Howells
1 sibling, 2 replies; 7+ messages in thread
From: David Howells @ 2022-11-17 14:54 UTC (permalink / raw)
To: Al Viro
Cc: Matthew Wilcox, John Hubbard, linux-mm, linux-fsdevel, dhowells,
Christoph Hellwig, Matthew Wilcox, Jeff Layton, linux-fsdevel,
linux-kernel
Move FOLL_* definitions to linux/mm_types.h to make them more accessible
without having to drag in all of linux/mm.h and everything that drags in
too[1].
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: John Hubbard <jhubbard@nvidia.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: linux-mm@kvack.org
cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/linux-fsdevel/Y1%2FhSO+7kAJhGShG@casper.infradead.org/ [1]
---
include/linux/mm.h | 74 ----------------------------------------------
include/linux/mm_types.h | 73 +++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 73 insertions(+), 74 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8bbcccbc5565..7a7a287818ad 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2941,80 +2941,6 @@ static inline vm_fault_t vmf_error(int err)
struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
unsigned int foll_flags);
-#define FOLL_WRITE 0x01 /* check pte is writable */
-#define FOLL_TOUCH 0x02 /* mark page accessed */
-#define FOLL_GET 0x04 /* do get_page on page */
-#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */
-#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */
-#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO
- * and return without waiting upon it */
-#define FOLL_NOFAULT 0x80 /* do not fault in pages */
-#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
-#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
-#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
-#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
-#define FOLL_ANON 0x8000 /* don't do file mappings */
-#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
-#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
-#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
-#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
-
-/*
- * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
- * other. Here is what they mean, and how to use them:
- *
- * FOLL_LONGTERM indicates that the page will be held for an indefinite time
- * period _often_ under userspace control. This is in contrast to
- * iov_iter_get_pages(), whose usages are transient.
- *
- * FIXME: For pages which are part of a filesystem, mappings are subject to the
- * lifetime enforced by the filesystem and we need guarantees that longterm
- * users like RDMA and V4L2 only establish mappings which coordinate usage with
- * the filesystem. Ideas for this coordination include revoking the longterm
- * pin, delaying writeback, bounce buffer page writeback, etc. As FS DAX was
- * added after the problem with filesystems was found FS DAX VMAs are
- * specifically failed. Filesystem pages are still subject to bugs and use of
- * FOLL_LONGTERM should be avoided on those pages.
- *
- * FIXME: Also NOTE that FOLL_LONGTERM is not supported in every GUP call.
- * Currently only get_user_pages() and get_user_pages_fast() support this flag
- * and calls to get_user_pages_[un]locked are specifically not allowed. This
- * is due to an incompatibility with the FS DAX check and
- * FAULT_FLAG_ALLOW_RETRY.
- *
- * In the CMA case: long term pins in a CMA region would unnecessarily fragment
- * that region. And so, CMA attempts to migrate the page before pinning, when
- * FOLL_LONGTERM is specified.
- *
- * FOLL_PIN indicates that a special kind of tracking (not just page->_refcount,
- * but an additional pin counting system) will be invoked. This is intended for
- * anything that gets a page reference and then touches page data (for example,
- * Direct IO). This lets the filesystem know that some non-file-system entity is
- * potentially changing the pages' data. In contrast to FOLL_GET (whose pages
- * are released via put_page()), FOLL_PIN pages must be released, ultimately, by
- * a call to unpin_user_page().
- *
- * FOLL_PIN is similar to FOLL_GET: both of these pin pages. They use different
- * and separate refcounting mechanisms, however, and that means that each has
- * its own acquire and release mechanisms:
- *
- * FOLL_GET: get_user_pages*() to acquire, and put_page() to release.
- *
- * FOLL_PIN: pin_user_pages*() to acquire, and unpin_user_pages to release.
- *
- * FOLL_PIN and FOLL_GET are mutually exclusive for a given function call.
- * (The underlying pages may experience both FOLL_GET-based and FOLL_PIN-based
- * calls applied to them, and that's perfectly OK. This is a constraint on the
- * callers, not on the pages.)
- *
- * FOLL_PIN should be set internally by the pin_user_pages*() APIs, never
- * directly by the caller. That's in order to help avoid mismatches when
- * releasing pages: get_user_pages*() pages must be released via put_page(),
- * while pin_user_pages*() pages must be released via unpin_user_page().
- *
- * Please see Documentation/core-api/pin_user_pages.rst for more information.
- */
-
static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
{
if (vm_fault & VM_FAULT_OOM)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 500e536796ca..0c80a5ad6e6a 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1003,4 +1003,77 @@ enum fault_flag {
typedef unsigned int __bitwise zap_flags_t;
+/*
+ * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
+ * other. Here is what they mean, and how to use them:
+ *
+ * FOLL_LONGTERM indicates that the page will be held for an indefinite time
+ * period _often_ under userspace control. This is in contrast to
+ * iov_iter_get_pages(), whose usages are transient.
+ *
+ * FIXME: For pages which are part of a filesystem, mappings are subject to the
+ * lifetime enforced by the filesystem and we need guarantees that longterm
+ * users like RDMA and V4L2 only establish mappings which coordinate usage with
+ * the filesystem. Ideas for this coordination include revoking the longterm
+ * pin, delaying writeback, bounce buffer page writeback, etc. As FS DAX was
+ * added after the problem with filesystems was found FS DAX VMAs are
+ * specifically failed. Filesystem pages are still subject to bugs and use of
+ * FOLL_LONGTERM should be avoided on those pages.
+ *
+ * FIXME: Also NOTE that FOLL_LONGTERM is not supported in every GUP call.
+ * Currently only get_user_pages() and get_user_pages_fast() support this flag
+ * and calls to get_user_pages_[un]locked are specifically not allowed. This
+ * is due to an incompatibility with the FS DAX check and
+ * FAULT_FLAG_ALLOW_RETRY.
+ *
+ * In the CMA case: long term pins in a CMA region would unnecessarily fragment
+ * that region. And so, CMA attempts to migrate the page before pinning, when
+ * FOLL_LONGTERM is specified.
+ *
+ * FOLL_PIN indicates that a special kind of tracking (not just page->_refcount,
+ * but an additional pin counting system) will be invoked. This is intended for
+ * anything that gets a page reference and then touches page data (for example,
+ * Direct IO). This lets the filesystem know that some non-file-system entity is
+ * potentially changing the pages' data. In contrast to FOLL_GET (whose pages
+ * are released via put_page()), FOLL_PIN pages must be released, ultimately, by
+ * a call to unpin_user_page().
+ *
+ * FOLL_PIN is similar to FOLL_GET: both of these pin pages. They use different
+ * and separate refcounting mechanisms, however, and that means that each has
+ * its own acquire and release mechanisms:
+ *
+ * FOLL_GET: get_user_pages*() to acquire, and put_page() to release.
+ *
+ * FOLL_PIN: pin_user_pages*() to acquire, and unpin_user_pages to release.
+ *
+ * FOLL_PIN and FOLL_GET are mutually exclusive for a given function call.
+ * (The underlying pages may experience both FOLL_GET-based and FOLL_PIN-based
+ * calls applied to them, and that's perfectly OK. This is a constraint on the
+ * callers, not on the pages.)
+ *
+ * FOLL_PIN should be set internally by the pin_user_pages*() APIs, never
+ * directly by the caller. That's in order to help avoid mismatches when
+ * releasing pages: get_user_pages*() pages must be released via put_page(),
+ * while pin_user_pages*() pages must be released via unpin_user_page().
+ *
+ * Please see Documentation/core-api/pin_user_pages.rst for more information.
+ */
+#define FOLL_WRITE 0x01 /* check pte is writable */
+#define FOLL_TOUCH 0x02 /* mark page accessed */
+#define FOLL_GET 0x04 /* do get_page on page */
+#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */
+#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */
+#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO
+ * and return without waiting upon it */
+#define FOLL_NOFAULT 0x80 /* do not fault in pages */
+#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
+#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
+#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
+#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
+#define FOLL_ANON 0x8000 /* don't do file mappings */
+#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
+#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
+#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
+#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
+
#endif /* _LINUX_MM_TYPES_H */
* [RFC PATCH 2/4] iov_iter: Add a function to extract a page list from an iterator
2022-11-17 14:54 [RFC PATCH 0/4] iov_iter: Add extraction helpers David Howells
2022-11-17 14:54 ` [RFC PATCH 1/4] mm: Move FOLL_* defs to mm_types.h David Howells
@ 2022-11-17 14:54 ` David Howells
2022-11-22 12:51 ` Christoph Hellwig
2022-11-22 13:36 ` David Howells
1 sibling, 2 replies; 7+ messages in thread
From: David Howells @ 2022-11-17 14:54 UTC (permalink / raw)
To: Al Viro
Cc: Christoph Hellwig, John Hubbard, Matthew Wilcox, linux-fsdevel,
linux-mm, dhowells, Christoph Hellwig, Matthew Wilcox,
Jeff Layton, linux-fsdevel, linux-kernel
Add a function, iov_iter_extract_pages(), to extract a list of pages from
an iterator. The pages may be returned with a reference added or a pin
added or neither, depending on the type of iterator and the direction of
transfer.
An additional function, iov_iter_extract_mode() is also provided so that the
mode of retention that will be employed for an iterator can be queried - and
therefore how the caller should dispose of the pages later.
There are three cases:
(1) Transfer *into* an ITER_IOVEC or ITER_UBUF iterator.
Extracted pages will have pins obtained on them (but not references)
so that fork() doesn't CoW the pages incorrectly whilst the I/O is in
progress.
iov_iter_extract_mode() will return FOLL_PIN for this case. The caller
should use something like unpin_user_page() to dispose of the page.
(2) Transfer *out of* an ITER_IOVEC or ITER_UBUF iterator.
Extracted pages will have references obtained on them, but not pins.
iov_iter_extract_mode() will return FOLL_GET. The caller should use
something like put_page() for page disposal.
(3) Any other sort of iterator.
No refs or pins are obtained on the pages; the assumption is made that
the caller will manage page retention.
iov_iter_extract_mode() will return 0. The pages don't need additional
disposal.
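The three disposal cases might be handled by a caller along these lines.
This is only a userspace sketch under stated assumptions: struct model_page
and the model_* helpers are hypothetical stand-ins for struct page,
put_page() and unpin_user_page(), and the mode argument is assumed to be
the value returned by iov_iter_extract_mode() for the iterator the pages
came from.

```c
#include <assert.h>
#include <stddef.h>

/* Values as defined by the FOLL_* flags moved in patch 1. */
#define FOLL_GET 0x04
#define FOLL_PIN 0x40000

/* Hypothetical stand-in for struct page, tracking refs and pins held. */
struct model_page {
	int refs;
	int pins;
};

/* Stand-in for put_page(): drop one reference. */
static void model_put_page(struct model_page *page)
{
	assert(page->refs > 0);
	page->refs--;
}

/* Stand-in for unpin_user_page(): drop one pin. */
static void model_unpin_user_page(struct model_page *page)
{
	assert(page->pins > 0);
	page->pins--;
}

/*
 * Dispose of an extracted page list according to the retention mode:
 * FOLL_PIN -> unpin, FOLL_GET -> put, 0 -> nothing to undo.
 */
void model_release_pages(struct model_page **pages, size_t nr,
			 unsigned int mode)
{
	size_t k;

	for (k = 0; k < nr; k++) {
		switch (mode) {
		case FOLL_PIN:
			model_unpin_user_page(pages[k]);
			break;
		case FOLL_GET:
			model_put_page(pages[k]);
			break;
		default:
			/* List-only: the caller manages page lifetime. */
			break;
		}
	}
}
```

The point of the sketch is that the release path is chosen once per page
list from the mode value, rather than by inspecting the pages themselves.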
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Al Viro <viro@zeniv.linux.org.uk>
cc: Christoph Hellwig <hch@lst.de>
cc: John Hubbard <jhubbard@nvidia.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-fsdevel@vger.kernel.org
cc: linux-mm@kvack.org
Link: https://lore.kernel.org/r/166722777971.2555743.12953624861046741424.stgit@warthog.procyon.org.uk/
---
include/linux/uio.h | 29 ++++
lib/iov_iter.c | 333 +++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 362 insertions(+)
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 2e3134b14ffd..329e36d41f0a 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -351,4 +351,33 @@ static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
};
}
+ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
+ size_t maxsize, unsigned int maxpages,
+ size_t *offset0);
+
+/**
+ * iov_iter_extract_mode - Indicate how pages from the iterator will be retained
+ * @iter: The iterator
+ *
+ * Examine the iterator and indicate with FOLL_PIN, FOLL_GET or 0 as to how,
+ * if at all, pages extracted from the iterator will be retained by the
+ * extraction function.
+ *
+ * FOLL_GET indicates that the pages will have a reference taken on them that
+ * the caller must put. This can be done for DMA/async DIO write from a page.
+ *
+ * FOLL_PIN indicates that the pages will have a pin placed on them that the
+ * caller must unpin. This must be done for DMA/async DIO read to a page to
+ * avoid CoW problems in fork.
+ *
+ * 0 indicates that no measures are taken and that it's up to the caller to
+ * retain the pages.
+ */
+static inline unsigned int iov_iter_extract_mode(struct iov_iter *iter)
+{
+ if (user_backed_iter(iter))
+ return iter->data_source ? FOLL_GET : FOLL_PIN;
+ return 0;
+}
+
#endif
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index c3ca28ca68a6..17f63f4d499b 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -1892,3 +1892,336 @@ void iov_iter_restore(struct iov_iter *i, struct iov_iter_state *state)
i->iov -= state->nr_segs - i->nr_segs;
i->nr_segs = state->nr_segs;
}
+
+/*
+ * Extract a list of contiguous pages from an ITER_PIPE iterator. This does
+ * not get references of its own on the pages, nor does it get a pin on them.
+ * If there's a partial page, it adds that first and will then allocate and add
+ * pages into the pipe to make up the buffer space to the amount required.
+ *
+ * The caller must hold the pipe locked and only transferring into a pipe is
+ * supported.
+ */
+static ssize_t iov_iter_extract_pipe_pages(struct iov_iter *i,
+ struct page ***pages, size_t maxsize,
+ unsigned int maxpages,
+ size_t *offset0)
+{
+ unsigned int nr, offset, chunk, j;
+ struct page **p;
+ size_t left;
+
+ if (!sanity(i))
+ return -EFAULT;
+
+ offset = pipe_npages(i, &nr);
+ if (!nr)
+ return -EFAULT;
+ *offset0 = offset;
+
+ maxpages = min_t(size_t, nr, maxpages);
+ maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+ if (!maxpages)
+ return -ENOMEM;
+ p = *pages;
+
+ left = maxsize;
+ for (j = 0; j < maxpages; j++) {
+ struct page *page = append_pipe(i, left, &offset);
+ if (!page)
+ break;
+ chunk = min_t(size_t, left, PAGE_SIZE - offset);
+ left -= chunk;
+ *p++ = page;
+ }
+ if (!j)
+ return -EFAULT;
+ return maxsize - left;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_XARRAY iterator. This does not
+ * get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_xarray_pages(struct iov_iter *i,
+ struct page ***pages, size_t maxsize,
+ unsigned int maxpages,
+ size_t *offset0)
+{
+ struct page *page, **p;
+ unsigned int nr = 0, offset;
+ loff_t pos = i->xarray_start + i->iov_offset;
+ pgoff_t index = pos >> PAGE_SHIFT;
+ XA_STATE(xas, i->xarray, index);
+
+ offset = pos & ~PAGE_MASK;
+ *offset0 = offset;
+
+ maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+ if (!maxpages)
+ return -ENOMEM;
+ p = *pages;
+
+ rcu_read_lock();
+ for (page = xas_load(&xas); page; page = xas_next(&xas)) {
+ if (xas_retry(&xas, page))
+ continue;
+
+ /* Has the page moved or been split? */
+ if (unlikely(page != xas_reload(&xas))) {
+ xas_reset(&xas);
+ continue;
+ }
+
+ p[nr++] = find_subpage(page, xas.xa_index);
+ if (nr == maxpages)
+ break;
+ }
+ rcu_read_unlock();
+
+ maxsize = min_t(size_t, nr * PAGE_SIZE - offset, maxsize);
+ i->iov_offset += maxsize;
+ i->count -= maxsize;
+ return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from an ITER_BVEC iterator. This does
+ * not get references on the pages, nor does it get a pin on them.
+ */
+static ssize_t iov_iter_extract_bvec_pages(struct iov_iter *i,
+ struct page ***pages, size_t maxsize,
+ unsigned int maxpages,
+ size_t *offset0)
+{
+ struct page **p, *page;
+ size_t skip = i->iov_offset, offset;
+ int k;
+
+ maxsize = min(maxsize, i->bvec->bv_len - skip);
+ skip += i->bvec->bv_offset;
+ page = i->bvec->bv_page + skip / PAGE_SIZE;
+ offset = skip % PAGE_SIZE;
+ *offset0 = offset;
+
+ maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+ if (!maxpages)
+ return -ENOMEM;
+ p = *pages;
+ for (k = 0; k < maxpages; k++)
+ p[k] = page + k;
+
+ maxsize = min_t(size_t, maxsize, maxpages * PAGE_SIZE - offset);
+ i->count -= maxsize;
+ i->iov_offset += maxsize;
+ if (i->iov_offset == i->bvec->bv_len) {
+ i->iov_offset = 0;
+ i->bvec++;
+ i->nr_segs--;
+ }
+ return maxsize;
+}
+
+/*
+ * Get the first segment from an ITER_UBUF or ITER_IOVEC iterator. The
+ * iterator must not be empty.
+ */
+static unsigned long iov_iter_extract_first_user_segment(const struct iov_iter *i,
+ size_t *size)
+{
+ size_t skip;
+ long k;
+
+ if (iter_is_ubuf(i))
+ return (unsigned long)i->ubuf + i->iov_offset;
+
+ for (k = 0, skip = i->iov_offset; k < i->nr_segs; k++, skip = 0) {
+ size_t len = i->iov[k].iov_len - skip;
+
+ if (unlikely(!len))
+ continue;
+ if (*size > len)
+ *size = len;
+ return (unsigned long)i->iov[k].iov_base + skip;
+ }
+ BUG(); // if it had been empty, we wouldn't get called
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get references
+ * on them. This should only be used iff the iterator is user-backed
+ * (IOVEC/UBUF) and data is being transferred out of the buffer described by
+ * the iterator (ie. this is the source).
+ *
+ * The pages are returned with incremented refcounts that the caller must undo
+ * once the transfer is complete, but no additional pins are obtained.
+ *
+ * This is only safe to be used where background IO/DMA is not going to be
+ * modifying the buffer, and so won't cause a problem with CoW on fork.
+ */
+static ssize_t iov_iter_extract_user_pages_and_get(struct iov_iter *i,
+ struct page ***pages,
+ size_t maxsize,
+ unsigned int maxpages,
+ size_t *offset0)
+{
+ unsigned long addr;
+ unsigned int gup_flags = FOLL_GET;
+ size_t offset;
+ int res;
+
+ if (WARN_ON_ONCE(iov_iter_rw(i) != WRITE))
+ return -EFAULT;
+
+ if (i->nofault)
+ gup_flags |= FOLL_NOFAULT;
+
+ addr = iov_iter_extract_first_user_segment(i, &maxsize);
+ *offset0 = offset = addr % PAGE_SIZE;
+ addr &= PAGE_MASK;
+ maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+ if (!maxpages)
+ return -ENOMEM;
+ res = get_user_pages_fast(addr, maxpages, gup_flags, *pages);
+ if (unlikely(res <= 0))
+ return res;
+ maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+ iov_iter_advance(i, maxsize);
+ return maxsize;
+}
+
+/*
+ * Extract a list of contiguous pages from a user iterator and get a pin on
+ * each of them. This should only be used iff the iterator is user-backed
+ * (IOVEC/UBUF) and data is being transferred into the buffer described by the
+ * iterator (ie. this is the destination).
+ *
+ * It does not get refs on the pages, but the pages must be unpinned by the
+ * caller once the transfer is complete.
+ *
+ * This is safe to be used where background IO/DMA *is* going to be modifying
+ * the buffer; using a pin rather than a ref makes sure that CoW happens
+ * correctly in the parent during fork.
+ */
+static ssize_t iov_iter_extract_user_pages_and_pin(struct iov_iter *i,
+ struct page ***pages,
+ size_t maxsize,
+ unsigned int maxpages,
+ size_t *offset0)
+{
+ unsigned long addr;
+ unsigned int gup_flags = FOLL_PIN | FOLL_WRITE;
+ size_t offset;
+ int res;
+
+ if (WARN_ON_ONCE(iov_iter_rw(i) != READ))
+ return -EFAULT;
+
+ if (i->nofault)
+ gup_flags |= FOLL_NOFAULT;
+
+ addr = first_iovec_segment(i, &maxsize);
+ *offset0 = offset = addr % PAGE_SIZE;
+ addr &= PAGE_MASK;
+ maxpages = want_pages_array(pages, maxsize, offset, maxpages);
+ if (!maxpages)
+ return -ENOMEM;
+ res = pin_user_pages_fast(addr, maxpages, gup_flags, *pages);
+ if (unlikely(res <= 0))
+ return res;
+ maxsize = min_t(size_t, maxsize, res * PAGE_SIZE - offset);
+ iov_iter_advance(i, maxsize);
+ return maxsize;
+}
+
+static ssize_t iov_iter_extract_user_pages(struct iov_iter *i,
+ struct page ***pages, size_t maxsize,
+ unsigned int maxpages,
+ size_t *offset0)
+{
+ switch (iov_iter_extract_mode(i)) {
+ case FOLL_GET:
+ return iov_iter_extract_user_pages_and_get(i, pages, maxsize,
+ maxpages, offset0);
+ case FOLL_PIN:
+ return iov_iter_extract_user_pages_and_pin(i, pages, maxsize,
+ maxpages, offset0);
+ default:
+ BUG();
+ }
+}
+
+/**
+ * iov_iter_extract_pages - Extract a list of contiguous pages from an iterator
+ * @i: The iterator to extract from
+ * @pages: Where to return the list of pages
+ * @maxsize: The maximum amount of iterator to extract
+ * @maxpages: The maximum size of the list of pages
+ * @offset0: Where to return the starting offset into (*@pages)[0]
+ *
+ * Extract a list of contiguous pages from the current point of the iterator,
+ * advancing the iterator. The maximum number of pages and the maximum amount
+ * of page contents can be set.
+ *
+ * If *@pages is NULL, a page list will be allocated to the required size and
+ * *@pages will be set to its base. If *@pages is not NULL, it will be assumed
+ * that the caller allocated a page list at least @maxpages in size and this
+ * will be filled in.
+ *
+ * Extra refs or pins on the pages may be obtained as follows:
+ *
+ * (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF) and data is to be
+ * transferred /OUT OF/ the described buffer, refs will be taken on the
+ * pages, but pins will not be added. This can be used for DMA from a
+ * page; it cannot be used for DMA to a page, as it may cause page-COW
+ * problems in fork.
+ *
+ * (*) If the iterator is user-backed (ITER_IOVEC/ITER_UBUF) and data is to be
+ * transferred /INTO/ the described buffer, pins will be added to the
+ * pages, but refs will not be taken. This must be used for DMA to a
+ * page.
+ *
+ * (*) If the iterator is ITER_PIPE, this must describe a destination for the
+ * data. Additional pages may be allocated and added to the pipe (which
+ * will hold the refs), but neither refs nor pins will be obtained for the
+ * caller. The caller must hold the pipe lock.
+ *
+ * (*) If the iterator is ITER_BVEC or ITER_XARRAY, the pages are merely
+ * listed; no extra refs or pins are obtained.
+ *
+ * Note also:
+ *
+ * (*) Use with ITER_KVEC is not supported as that may refer to memory that
+ * doesn't have associated page structs.
+ *
+ * (*) Use with ITER_DISCARD is not supported as that has no content.
+ *
+ * On success, the function sets *@pages to the new pagelist, if allocated, and
+ * sets *offset0 to the offset into the first page and returns the amount of
+ * buffer space added represented by the page list.
+ *
+ * It may also return -ENOMEM and -EFAULT.
+ */
+ssize_t iov_iter_extract_pages(struct iov_iter *i, struct page ***pages,
+ size_t maxsize, unsigned int maxpages,
+ size_t *offset0)
+{
+ maxsize = min_t(size_t, min_t(size_t, maxsize, i->count), MAX_RW_COUNT);
+ if (!maxsize)
+ return 0;
+
+ if (likely(user_backed_iter(i)))
+ return iov_iter_extract_user_pages(i, pages, maxsize,
+ maxpages, offset0);
+ if (iov_iter_is_bvec(i))
+ return iov_iter_extract_bvec_pages(i, pages, maxsize,
+ maxpages, offset0);
+ if (iov_iter_is_pipe(i))
+ return iov_iter_extract_pipe_pages(i, pages, maxsize,
+ maxpages, offset0);
+ if (iov_iter_is_xarray(i))
+ return iov_iter_extract_xarray_pages(i, pages, maxsize,
+ maxpages, offset0);
+ return -EFAULT;
+}
+EXPORT_SYMBOL(iov_iter_extract_pages);
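
The relationship between @maxsize, @maxpages, *@offset0 and the returned
byte count can be sketched in userspace as below. This is an illustration
only: it assumes a 4096-byte page and assumes the page count is the
ceiling of (size + offset) / PAGE_SIZE clamped to @maxpages. The model_*
names are hypothetical. The second helper mirrors the
min(maxsize, npages * PAGE_SIZE - offset) expression used by the
extraction functions in this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed page size for the sketch; the kernel uses PAGE_SIZE. */
#define MODEL_PAGE_SIZE 4096UL

/*
 * Number of page slots needed to cover @size bytes starting @offset bytes
 * into the first page, clamped to @maxpages.
 */
unsigned int model_pages_needed(size_t size, size_t offset,
				unsigned int maxpages)
{
	size_t count;

	assert(offset < MODEL_PAGE_SIZE);
	count = (size + offset + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE;
	return count < maxpages ? (unsigned int)count : maxpages;
}

/*
 * Bytes actually represented by @npages pages when the data starts
 * @offset bytes into the first page, capped at @maxsize.
 */
size_t model_bytes_covered(size_t maxsize, size_t offset,
			   unsigned int npages)
{
	size_t span;

	assert(npages > 0 && offset < MODEL_PAGE_SIZE);
	span = (size_t)npages * MODEL_PAGE_SIZE - offset;
	return maxsize < span ? maxsize : span;
}
```

So a request for 8192 bytes starting 100 bytes into a page needs three
page slots; if only two are available, 8092 bytes are covered and the
caller has to call again for the remainder.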
* Re: [RFC PATCH 1/4] mm: Move FOLL_* defs to mm_types.h
2022-11-17 14:54 ` [RFC PATCH 1/4] mm: Move FOLL_* defs to mm_types.h David Howells
@ 2022-11-17 23:15 ` John Hubbard
2022-11-22 12:46 ` Christoph Hellwig
1 sibling, 0 replies; 7+ messages in thread
From: John Hubbard @ 2022-11-17 23:15 UTC (permalink / raw)
To: David Howells, Al Viro
Cc: Matthew Wilcox, linux-mm, linux-fsdevel, Christoph Hellwig,
Jeff Layton, linux-kernel
On 11/17/22 06:54, David Howells wrote:
> Move FOLL_* definitions to linux/mm_types.h to make them more accessible
> without having to drag in all of linux/mm.h and everything that drags in
> too[1].
>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Signed-off-by: David Howells <dhowells@redhat.com>
> cc: John Hubbard <jhubbard@nvidia.com>
> cc: Al Viro <viro@zeniv.linux.org.uk>
> cc: linux-mm@kvack.org
> cc: linux-fsdevel@vger.kernel.org
> Link: https://lore.kernel.org/linux-fsdevel/Y1%2FhSO+7kAJhGShG@casper.infradead.org/ [1]
> ---
>
> include/linux/mm.h | 74 ----------------------------------------------
> include/linux/mm_types.h | 73 +++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 73 insertions(+), 74 deletions(-)
OK, I've verified that this is a "mostly identical" movement: the only
thing that changes is that the comments now come before the defines.
And because mm.h includes mm_types.h, it is unlikely that moving a
define from mm.h to mm_types.h would cause build failures. It's not
completely impossible: ordering issues are sometimes involved in this
sort of change. But unlikely.
Anyway, this is a good move. The users of various mm APIs should not
have to pull in quite so much of the internals of mm, and this is a step
in that direction. FOLL_* items are used by filesystems and other
subsystems that definitely do not need all of mm.h.
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
thanks,
--
John Hubbard
NVIDIA
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 8bbcccbc5565..7a7a287818ad 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2941,80 +2941,6 @@ static inline vm_fault_t vmf_error(int err)
> struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
> unsigned int foll_flags);
>
> -#define FOLL_WRITE 0x01 /* check pte is writable */
> -#define FOLL_TOUCH 0x02 /* mark page accessed */
> -#define FOLL_GET 0x04 /* do get_page on page */
> -#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */
> -#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */
> -#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO
> - * and return without waiting upon it */
> -#define FOLL_NOFAULT 0x80 /* do not fault in pages */
> -#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
> -#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
> -#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
> -#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
> -#define FOLL_ANON 0x8000 /* don't do file mappings */
> -#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
> -#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
> -#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
> -#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
> -
> -/*
> - * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
> - * other. Here is what they mean, and how to use them:
> - *
> - * FOLL_LONGTERM indicates that the page will be held for an indefinite time
> - * period _often_ under userspace control. This is in contrast to
> - * iov_iter_get_pages(), whose usages are transient.
> - *
> - * FIXME: For pages which are part of a filesystem, mappings are subject to the
> - * lifetime enforced by the filesystem and we need guarantees that longterm
> - * users like RDMA and V4L2 only establish mappings which coordinate usage with
> - * the filesystem. Ideas for this coordination include revoking the longterm
> - * pin, delaying writeback, bounce buffer page writeback, etc. As FS DAX was
> - * added after the problem with filesystems was found FS DAX VMAs are
> - * specifically failed. Filesystem pages are still subject to bugs and use of
> - * FOLL_LONGTERM should be avoided on those pages.
> - *
> - * FIXME: Also NOTE that FOLL_LONGTERM is not supported in every GUP call.
> - * Currently only get_user_pages() and get_user_pages_fast() support this flag
> - * and calls to get_user_pages_[un]locked are specifically not allowed. This
> - * is due to an incompatibility with the FS DAX check and
> - * FAULT_FLAG_ALLOW_RETRY.
> - *
> - * In the CMA case: long term pins in a CMA region would unnecessarily fragment
> - * that region. And so, CMA attempts to migrate the page before pinning, when
> - * FOLL_LONGTERM is specified.
> - *
> - * FOLL_PIN indicates that a special kind of tracking (not just page->_refcount,
> - * but an additional pin counting system) will be invoked. This is intended for
> - * anything that gets a page reference and then touches page data (for example,
> - * Direct IO). This lets the filesystem know that some non-file-system entity is
> - * potentially changing the pages' data. In contrast to FOLL_GET (whose pages
> - * are released via put_page()), FOLL_PIN pages must be released, ultimately, by
> - * a call to unpin_user_page().
> - *
> - * FOLL_PIN is similar to FOLL_GET: both of these pin pages. They use different
> - * and separate refcounting mechanisms, however, and that means that each has
> - * its own acquire and release mechanisms:
> - *
> - * FOLL_GET: get_user_pages*() to acquire, and put_page() to release.
> - *
> - * FOLL_PIN: pin_user_pages*() to acquire, and unpin_user_pages to release.
> - *
> - * FOLL_PIN and FOLL_GET are mutually exclusive for a given function call.
> - * (The underlying pages may experience both FOLL_GET-based and FOLL_PIN-based
> - * calls applied to them, and that's perfectly OK. This is a constraint on the
> - * callers, not on the pages.)
> - *
> - * FOLL_PIN should be set internally by the pin_user_pages*() APIs, never
> - * directly by the caller. That's in order to help avoid mismatches when
> - * releasing pages: get_user_pages*() pages must be released via put_page(),
> - * while pin_user_pages*() pages must be released via unpin_user_page().
> - *
> - * Please see Documentation/core-api/pin_user_pages.rst for more information.
> - */
> -
> static inline int vm_fault_to_errno(vm_fault_t vm_fault, int foll_flags)
> {
> if (vm_fault & VM_FAULT_OOM)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 500e536796ca..0c80a5ad6e6a 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -1003,4 +1003,77 @@ enum fault_flag {
>
> typedef unsigned int __bitwise zap_flags_t;
>
> +/*
> + * FOLL_PIN and FOLL_LONGTERM may be used in various combinations with each
> + * other. Here is what they mean, and how to use them:
> + *
> + * FOLL_LONGTERM indicates that the page will be held for an indefinite time
> + * period _often_ under userspace control. This is in contrast to
> + * iov_iter_get_pages(), whose usages are transient.
> + *
> + * FIXME: For pages which are part of a filesystem, mappings are subject to the
> + * lifetime enforced by the filesystem and we need guarantees that longterm
> + * users like RDMA and V4L2 only establish mappings which coordinate usage with
> + * the filesystem. Ideas for this coordination include revoking the longterm
> + * pin, delaying writeback, bounce buffer page writeback, etc. As FS DAX was
> + * added after the problem with filesystems was found FS DAX VMAs are
> + * specifically failed. Filesystem pages are still subject to bugs and use of
> + * FOLL_LONGTERM should be avoided on those pages.
> + *
> + * FIXME: Also NOTE that FOLL_LONGTERM is not supported in every GUP call.
> + * Currently only get_user_pages() and get_user_pages_fast() support this flag
> + * and calls to get_user_pages_[un]locked are specifically not allowed. This
> + * is due to an incompatibility with the FS DAX check and
> + * FAULT_FLAG_ALLOW_RETRY.
> + *
> + * In the CMA case: long term pins in a CMA region would unnecessarily fragment
> + * that region. And so, CMA attempts to migrate the page before pinning, when
> + * FOLL_LONGTERM is specified.
> + *
> + * FOLL_PIN indicates that a special kind of tracking (not just page->_refcount,
> + * but an additional pin counting system) will be invoked. This is intended for
> + * anything that gets a page reference and then touches page data (for example,
> + * Direct IO). This lets the filesystem know that some non-file-system entity is
> + * potentially changing the pages' data. In contrast to FOLL_GET (whose pages
> + * are released via put_page()), FOLL_PIN pages must be released, ultimately, by
> + * a call to unpin_user_page().
> + *
> + * FOLL_PIN is similar to FOLL_GET: both of these pin pages. They use different
> + * and separate refcounting mechanisms, however, and that means that each has
> + * its own acquire and release mechanisms:
> + *
> + * FOLL_GET: get_user_pages*() to acquire, and put_page() to release.
> + *
> + * FOLL_PIN: pin_user_pages*() to acquire, and unpin_user_pages to release.
> + *
> + * FOLL_PIN and FOLL_GET are mutually exclusive for a given function call.
> + * (The underlying pages may experience both FOLL_GET-based and FOLL_PIN-based
> + * calls applied to them, and that's perfectly OK. This is a constraint on the
> + * callers, not on the pages.)
> + *
> + * FOLL_PIN should be set internally by the pin_user_pages*() APIs, never
> + * directly by the caller. That's in order to help avoid mismatches when
> + * releasing pages: get_user_pages*() pages must be released via put_page(),
> + * while pin_user_pages*() pages must be released via unpin_user_page().
> + *
> + * Please see Documentation/core-api/pin_user_pages.rst for more information.
> + */
> +#define FOLL_WRITE 0x01 /* check pte is writable */
> +#define FOLL_TOUCH 0x02 /* mark page accessed */
> +#define FOLL_GET 0x04 /* do get_page on page */
> +#define FOLL_DUMP 0x08 /* give error on hole if it would be zero */
> +#define FOLL_FORCE 0x10 /* get_user_pages read/write w/o permission */
> +#define FOLL_NOWAIT 0x20 /* if a disk transfer is needed, start the IO
> + * and return without waiting upon it */
> +#define FOLL_NOFAULT 0x80 /* do not fault in pages */
> +#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
> +#define FOLL_MIGRATION 0x400 /* wait for page to replace migration entry */
> +#define FOLL_TRIED 0x800 /* a retry, previous pass started an IO */
> +#define FOLL_REMOTE 0x2000 /* we are working on non-current tsk/mm */
> +#define FOLL_ANON 0x8000 /* don't do file mappings */
> +#define FOLL_LONGTERM 0x10000 /* mapping lifetime is indefinite: see below */
> +#define FOLL_SPLIT_PMD 0x20000 /* split huge pmd before returning */
> +#define FOLL_PIN 0x40000 /* pages must be released via unpin_user_page */
> +#define FOLL_FAST_ONLY 0x80000 /* gup_fast: prevent fall-back to slow gup */
> +
> #endif /* _LINUX_MM_TYPES_H */
>
>
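As an illustrative aside (not part of the patch): the comment being moved says that FOLL_GET and FOLL_PIN are mutually exclusive for a given call and that each has its own release function. A minimal userspace sketch of that rule follows. The flag values are the ones quoted in the patch; the sketch_* names are hypothetical and exist only for this example, they are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as quoted in the patch above. */
#define FOLL_GET 0x04
#define FOLL_PIN 0x40000

/* Hypothetical labels for the release call that pairs with each mode. */
enum sketch_release {
	SK_REL_NONE,		/* nothing was taken, nothing to drop */
	SK_REL_PUT_PAGE,	/* pairs with FOLL_GET */
	SK_REL_UNPIN_USER_PAGE,	/* pairs with FOLL_PIN */
};

/* True unless a single call asks for both FOLL_GET and FOLL_PIN. */
static bool sketch_foll_flags_ok(unsigned int flags)
{
	return (flags & (FOLL_GET | FOLL_PIN)) != (FOLL_GET | FOLL_PIN);
}

/* Pick the release call matching how the pages were acquired. */
static enum sketch_release sketch_release_for(unsigned int flags)
{
	assert(sketch_foll_flags_ok(flags));
	if (flags & FOLL_PIN)
		return SK_REL_UNPIN_USER_PAGE;
	if (flags & FOLL_GET)
		return SK_REL_PUT_PAGE;
	return SK_REL_NONE;
}
```

The constraint is on the flags passed to one call, not on the page: the same page may be subject to both kinds of reference from different callers.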
* Re: [RFC PATCH 1/4] mm: Move FOLL_* defs to mm_types.h
2022-11-17 14:54 ` [RFC PATCH 1/4] mm: Move FOLL_* defs to mm_types.h David Howells
2022-11-17 23:15 ` John Hubbard
@ 2022-11-22 12:46 ` Christoph Hellwig
1 sibling, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2022-11-22 12:46 UTC (permalink / raw)
To: David Howells
Cc: Al Viro, Matthew Wilcox, John Hubbard, linux-mm, linux-fsdevel,
Christoph Hellwig, Jeff Layton, linux-kernel
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [RFC PATCH 2/4] iov_iter: Add a function to extract a page list from an iterator
2022-11-17 14:54 ` [RFC PATCH 2/4] iov_iter: Add a function to extract a page list from an iterator David Howells
@ 2022-11-22 12:51 ` Christoph Hellwig
2022-11-22 13:36 ` David Howells
1 sibling, 0 replies; 7+ messages in thread
From: Christoph Hellwig @ 2022-11-22 12:51 UTC (permalink / raw)
To: David Howells
Cc: Al Viro, Christoph Hellwig, John Hubbard, Matthew Wilcox,
linux-fsdevel, linux-mm, Christoph Hellwig, Jeff Layton,
linux-kernel
On Thu, Nov 17, 2022 at 02:54:54PM +0000, David Howells wrote:
> An additional function, iov_iter_extract_mode() is also provided so that the
> mode of retention that will be employed for an iterator can be queried - and
> therefore how the caller should dispose of the pages later.
Any reason to not just add an out parameter to the main function and
return this directly instead of an extra helper?
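For illustration only, a rough userspace sketch of what such an out parameter could look like. It follows the determination table from the cover letter (UBUF/IOVEC + READ -> pin, UBUF/IOVEC + WRITE -> get, PIPE + READ and BVEC/XARRAY -> list-only, anything else -> EFAULT). Every sketch_*/SK_* name is hypothetical, the page extraction itself is left out, and this is not the signature proposed in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Flag values as quoted in patch 1/4. */
#define FOLL_GET 0x04
#define FOLL_PIN 0x40000

/* Hypothetical stand-ins for the kernel's iterator types and directions. */
enum sketch_iter_type { SK_UBUF, SK_IOVEC, SK_PIPE, SK_BVEC, SK_XARRAY, SK_KVEC };
enum sketch_dir { SK_READ, SK_WRITE };

/*
 * Hypothetical combined call: returns 0 on success or -EFAULT, and reports
 * through *cleanup_mode how the caller must dispose of the pages later
 * (0 = list-only, FOLL_GET = put_page(), FOLL_PIN = unpin_user_page()).
 * *cleanup_mode is left untouched on failure.
 */
static int sketch_extract_pages(enum sketch_iter_type type, enum sketch_dir dir,
				unsigned int *cleanup_mode)
{
	assert(cleanup_mode != NULL);

	switch (type) {
	case SK_UBUF:
	case SK_IOVEC:
		*cleanup_mode = (dir == SK_READ) ? FOLL_PIN : FOLL_GET;
		return 0;
	case SK_PIPE:
		if (dir != SK_READ)
			return -EFAULT;
		*cleanup_mode = 0;
		return 0;
	case SK_BVEC:
	case SK_XARRAY:
		*cleanup_mode = 0;
		return 0;
	default:
		return -EFAULT;
	}
}
```

With this shape the caller gets the cleanup mode from the same call that produced the page list, so there is no second query that could disagree with it.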
> +EXPORT_SYMBOL(iov_iter_extract_pages);
get_user_pages_fast, pin_user_pages_fast are very intentionally
EXPORT_SYMBOL_GPL, which should not be bypassed by an iov_* wrapper.
* Re: [RFC PATCH 2/4] iov_iter: Add a function to extract a page list from an iterator
2022-11-17 14:54 ` [RFC PATCH 2/4] iov_iter: Add a function to extract a page list from an iterator David Howells
2022-11-22 12:51 ` Christoph Hellwig
@ 2022-11-22 13:36 ` David Howells
1 sibling, 0 replies; 7+ messages in thread
From: David Howells @ 2022-11-22 13:36 UTC (permalink / raw)
To: Christoph Hellwig
Cc: dhowells, Al Viro, Christoph Hellwig, John Hubbard,
Matthew Wilcox, linux-fsdevel, linux-mm, Jeff Layton,
linux-kernel
Christoph Hellwig <hch@infradead.org> wrote:
> > +EXPORT_SYMBOL(iov_iter_extract_pages);
>
> get_user_pages_fast, pin_user_pages_fast are very intentionally
> EXPORT_SYMBOL_GPL, which should not be bypassed by an iov_* wrapper.
Ah, but I'm intending to replace:
EXPORT_SYMBOL(iov_iter_get_pages2);
EXPORT_SYMBOL(iov_iter_get_pages_alloc2);
which *aren't* marked _GPL, so you need to argue that one with Al.
David