From: "Kasireddy, Vivek" <vivek.kasireddy@intel.com>
To: David Hildenbrand <david@redhat.com>,
"dri-devel@lists.freedesktop.org"
<dri-devel@lists.freedesktop.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>
Cc: Christoph Hellwig <hch@infradead.org>,
Daniel Vetter <daniel.vetter@ffwll.ch>,
Mike Kravetz <mike.kravetz@oracle.com>,
"Hugh Dickins" <hughd@google.com>, Peter Xu <peterx@redhat.com>,
Gerd Hoffmann <kraxel@redhat.com>,
"Kim, Dongwon" <dongwon.kim@intel.com>,
"Chang, Junxiao" <junxiao.chang@intel.com>,
Jason Gunthorpe <jgg@nvidia.com>
Subject: RE: [PATCH v6 3/5] mm/gup: Introduce memfd_pin_user_pages() for pinning memfd pages (v6)
Date: Thu, 7 Dec 2023 05:09:20 +0000 [thread overview]
Message-ID: <IA0PR11MB718520192D5885CFCCEE320FF88BA@IA0PR11MB7185.namprd11.prod.outlook.com> (raw)
In-Reply-To: <5ffe2ea3-83da-4b5f-adc8-af9cd9a57cd2@redhat.com>
Hi David,
> On 05.12.23 06:35, Vivek Kasireddy wrote:
> > For drivers that would like to longterm-pin the pages associated
> > with a memfd, the pin_user_pages_fd() API provides an option to
> > not only pin the pages via FOLL_PIN but also to check and migrate
> > them if they reside in movable zone or CMA block. This API
> > currently works with memfds but it should work with any files
> > that belong to either shmemfs or hugetlbfs. Files belonging to
> > other filesystems are rejected for now.
> >
> > The pages need to be located first before pinning them via FOLL_PIN.
> > If they are found in the page cache, they can be immediately pinned.
> > Otherwise, they need to be allocated using the filesystem specific
> > APIs and then pinned.
> >
> > v2:
> > - Drop gup_flags and improve comments and commit message (David)
> > - Allocate a page if we cannot find in page cache for the hugetlbfs
> > case as well (David)
> > - Don't unpin pages if there is a migration related failure (David)
> > - Drop the unnecessary nr_pages <= 0 check (Jason)
> > - Have the caller of the API pass in file * instead of fd (Jason)
> >
> > v3: (David)
> > - Enclose the huge page allocation code with #ifdef CONFIG_HUGETLB_PAGE
> >   (Build error reported by kernel test robot <lkp@intel.com>)
> > - Don't forget memalloc_pin_restore() on non-migration related errors
> > - Improve the readability of the cleanup code associated with
> > non-migration related errors
> > - Augment the comments by describing FOLL_LONGTERM like behavior
> > - Include the R-b tag from Jason
> >
> > v4:
> > - Remove the local variable "page" and instead use 3 return statements
> > in alloc_file_page() (David)
> > - Add the R-b tag from David
> >
> > v5: (David)
> > - For hugetlb case, ensure that we only obtain head pages from the
> > mapping by using __filemap_get_folio() instead of find_get_page_flags()
> > - Handle -EEXIST when two or more potential users try to simultaneously
> > add a huge page to the mapping by forcing them to retry on failure
> >
> > v6: (Christoph)
> > - Rename this API to memfd_pin_user_pages() to make it clear that it
> > is intended for memfds
> > - Move the memfd page allocation helper from gup.c to memfd.c
> > - Fix indentation errors in memfd_pin_user_pages()
> > - For contiguous ranges of folios, use a helper such as
> > filemap_get_folios_contig() to lookup the page cache in batches
> >
> > Cc: David Hildenbrand <david@redhat.com>
> > Cc: Christoph Hellwig <hch@infradead.org>
> > Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> > Cc: Mike Kravetz <mike.kravetz@oracle.com>
> > Cc: Hugh Dickins <hughd@google.com>
> > Cc: Peter Xu <peterx@redhat.com>
> > Cc: Gerd Hoffmann <kraxel@redhat.com>
> > Cc: Dongwon Kim <dongwon.kim@intel.com>
> > Cc: Junxiao Chang <junxiao.chang@intel.com>
> > Suggested-by: Jason Gunthorpe <jgg@nvidia.com>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> (v2)
> > Reviewed-by: David Hildenbrand <david@redhat.com> (v3)
> > Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com>
> > ---
> > include/linux/memfd.h | 5 +++
> > include/linux/mm.h | 2 +
> > mm/gup.c | 102 ++++++++++++++++++++++++++++++++++++++++++
> > mm/memfd.c | 34 ++++++++++++++
> > 4 files changed, 143 insertions(+)
> >
> > diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> > index e7abf6fa4c52..6fc0d1282151 100644
> > --- a/include/linux/memfd.h
> > +++ b/include/linux/memfd.h
> > @@ -6,11 +6,16 @@
> >
> > #ifdef CONFIG_MEMFD_CREATE
> > extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg);
> > +extern struct page *memfd_alloc_page(struct file *memfd, pgoff_t idx);
> > #else
> > static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned int a)
> > {
> > return -EINVAL;
> > }
> > +static inline struct page *memfd_alloc_page(struct file *memfd, pgoff_t idx)
> > +{
> > + return ERR_PTR(-EINVAL);
> > +}
> > #endif
> >
> > #endif /* __LINUX_MEMFD_H */
> > diff --git a/include/linux/mm.h b/include/linux/mm.h
> > index 418d26608ece..ac69db45509f 100644
> > --- a/include/linux/mm.h
> > +++ b/include/linux/mm.h
> > @@ -2472,6 +2472,8 @@ long get_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> > struct page **pages, unsigned int gup_flags);
> > long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> > struct page **pages, unsigned int gup_flags);
> > +long memfd_pin_user_pages(struct file *file, pgoff_t start,
> > + unsigned long nr_pages, struct page **pages);
> >
> > int get_user_pages_fast(unsigned long start, int nr_pages,
> > unsigned int gup_flags, struct page **pages);
> > diff --git a/mm/gup.c b/mm/gup.c
> > index 231711efa390..eb93d1ec9dc6 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -5,6 +5,7 @@
> > #include <linux/spinlock.h>
> >
> > #include <linux/mm.h>
> > +#include <linux/memfd.h>
> > #include <linux/memremap.h>
> > #include <linux/pagemap.h>
> > #include <linux/rmap.h>
> > @@ -17,6 +18,7 @@
> > #include <linux/hugetlb.h>
> > #include <linux/migrate.h>
> > #include <linux/mm_inline.h>
> > +#include <linux/pagevec.h>
> > #include <linux/sched/mm.h>
> > #include <linux/shmem_fs.h>
> >
> > @@ -3410,3 +3412,103 @@ long pin_user_pages_unlocked(unsigned long start, unsigned long nr_pages,
> > &locked, gup_flags);
> > }
> > EXPORT_SYMBOL(pin_user_pages_unlocked);
> > +
> > +/**
> > + * memfd_pin_user_pages() - pin user pages associated with a memfd
> > + * @memfd: the memfd whose pages are to be pinned
> > + * @start: starting memfd offset
> > + * @nr_pages: number of pages from start to pin
> > + * @pages: array that receives pointers to the pages pinned.
> > + * Should be at-least nr_pages long.
> > + *
> > + * Attempt to pin pages associated with a memfd; given that a memfd is either
> > + * backed by shmem or hugetlb, the pages can either be found in the page cache
> > + * or need to be allocated if necessary. Once the pages are located, they are
> > + * all pinned via FOLL_PIN. And, these pinned pages need to be released either
> > + * using unpin_user_pages() or unpin_user_page().
> > + *
> > + * It must be noted that the pages may be pinned for an indefinite amount
> > + * of time. And, in most cases, the duration of time they may stay pinned
> > + * would be controlled by the userspace. This behavior is effectively the
> > + * same as using FOLL_LONGTERM with other GUP APIs.
> > + *
> > + * Returns number of pages pinned. This would be equal to the number of
> > + * pages requested. If no pages were pinned, it returns -errno.
> > + */
> > +long memfd_pin_user_pages(struct file *memfd, pgoff_t start,
> > + unsigned long nr_pages, struct page **pages)
> > +{
> > + pgoff_t start_idx, end_idx = start + nr_pages - 1;
> > + unsigned int flags, nr_folios, i, j;
> > + struct folio_batch fbatch;
> > + struct page *page = NULL;
> > + struct folio *folio;
> > + long ret;
> > +
> > + if (!nr_pages)
> > + return -EINVAL;
> > +
> > + if (!memfd)
> > + return -EINVAL;
> > +
> > + if (!shmem_file(memfd) && !is_file_hugepages(memfd))
> > + return -EINVAL;
> > +
> > + flags = memalloc_pin_save();
> > + do {
> > + folio_batch_init(&fbatch);
> > + start_idx = start;
> > + i = 0;
> > +
> > + while (start_idx <= end_idx) {
> > + /*
> > + * In most cases, we should be able to find the page
> > + * in the page cache. If we cannot find it for some
> > + * reason, we try to allocate one and add it to the
> > + * page cache.
> > + */
> > + nr_folios = filemap_get_folios_contig(memfd->f_mapping,
> > + &start_idx,
> > + end_idx,
> > + &fbatch);
> > + if (page) {
> > + put_page(page);
> > + page = NULL;
> > + }
> > + for (j = 0; j < nr_folios; j++) {
> > + folio = fbatch.folios[j];
> > + ret = try_grab_page(&folio->page, FOLL_PIN);
> > + if (unlikely(ret)) {
> > + folio_batch_release(&fbatch);
> > + goto err;
> > + }
> > +
> > + pages[i++] = &folio->page;
> > + }
>
> I might be wrong, but that interface is still inconsistent. I think your
> intention is to always return folios (head pages), but why are we
> returning pages from this interface then?
>
> It would be more consistent with the other GUP interfaces to return
> the actual tail pages that fit the given "pgoff_t start". So if you
> punch in "nr_pages" you expect to get "nr_pages" pages, and not some
> other number of folios.
>
> Otherwise, this interface is highly confusing.
>
> If you always want to return folios, then better name it
> "memfd_pin_user_folios" (or just "memfd_pin_folios") and pass in a range
> (instead of a nr_pages parameter), and somehow indicate to the caller
> how many folios were in that range, and if that range was fully covered.
I think it makes sense to return folios from this interface. And, considering my
use-case, I'd like to have this API return an error if it cannot pin (or allocate)
the exact number of folios the caller requested.
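
Just to make sure we are talking about the same thing, here is a rough sketch of
how I imagine a caller would use such a folio-based variant. This is only an
untested illustration; memfd_pin_folios() does not exist yet, and its name,
signature, MAX_FOLIOS and the unpin step below are assumptions on my part:

    /*
     * Hypothetical sketch: pin all folios backing the memfd range
     * [start, end]; fail outright if the range cannot be fully covered.
     */
    struct folio *folios[MAX_FOLIOS];
    long nr_folios, i;

    nr_folios = memfd_pin_folios(memfd, start, end, folios, MAX_FOLIOS);
    if (nr_folios < 0)
            return nr_folios;       /* could not pin the whole range */

    /* ... use folios[0..nr_folios - 1] ... */

    /* Release the pins; unpin_user_page() is given the head pages here. */
    for (i = 0; i < nr_folios; i++)
            unpin_user_page(&folios[i]->page);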
>
> Or am I missing something?
I can make the udmabuf driver use folios instead of pages too, but the function
check_and_migrate_movable_pages() in GUP still takes a list of pages. Do you
think it is ok to use a local variable to collect all the head pages for this?
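For instance, something along these lines (untested sketch; the surrounding
locals and error handling are simplified, and nr_folios/folios[] stand in for
whatever the final interface ends up collecting):

    /*
     * Gather the head page of each pinned folio into a local pages[]
     * array so that check_and_migrate_movable_pages(), which still
     * operates on struct page *, can be used unchanged.
     */
    struct page **pages;
    long i, ret;

    pages = kvmalloc_array(nr_folios, sizeof(*pages), GFP_KERNEL);
    if (!pages)
            return -ENOMEM;

    for (i = 0; i < nr_folios; i++)
            pages[i] = &folios[i]->page;

    ret = check_and_migrate_movable_pages(nr_folios, pages);
    kvfree(pages);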
Thanks,
Vivek
>
> --
> Cheers,
>
> David / dhildenb
>
Thread overview: 16+ messages
2023-12-05 5:35 [PATCH v6 0/5] " Vivek Kasireddy
2023-12-05 5:35 ` [PATCH v6 1/5] udmabuf: Use vmf_insert_pfn and VM_PFNMAP for handling mmap Vivek Kasireddy
2023-12-05 5:35 ` [PATCH v6 2/5] udmabuf: Add back support for mapping hugetlb pages (v5) Vivek Kasireddy
2023-12-06 9:16 ` Christoph Hellwig
2023-12-05 5:35 ` [PATCH v6 3/5] mm/gup: Introduce memfd_pin_user_pages() for pinning memfd pages (v6) Vivek Kasireddy
2023-12-06 9:18 ` Christoph Hellwig
2023-12-07 5:06 ` Kasireddy, Vivek
2023-12-06 11:19 ` David Hildenbrand
2023-12-07 5:09 ` Kasireddy, Vivek [this message]
2023-12-07 9:44 ` David Hildenbrand
2023-12-07 13:05 ` Jason Gunthorpe
2023-12-07 13:35 ` David Hildenbrand
2023-12-08 7:57 ` Kasireddy, Vivek
2023-12-08 9:53 ` David Hildenbrand
2023-12-05 5:35 ` [PATCH v6 4/5] udmabuf: Pin the pages using memfd_pin_user_pages() API (v4) Vivek Kasireddy
2023-12-05 5:35 ` [PATCH v6 5/5] selftests/dma-buf/udmabuf: Add tests to verify data after page migration Vivek Kasireddy