From: David Hildenbrand <david@redhat.com>
To: John Hubbard <jhubbard@nvidia.com>, Alistair Popple <apopple@nvidia.com>
Cc: Christoph Hellwig <hch@infradead.org>,
Andrew Morton <akpm@linux-foundation.org>,
LKML <linux-kernel@vger.kernel.org>,
linux-mm@kvack.org, linux-stable@vger.kernel.org,
Vivek Kasireddy <vivek.kasireddy@intel.com>,
Dave Airlie <airlied@redhat.com>,
Gerd Hoffmann <kraxel@redhat.com>,
Matthew Wilcox <willy@infradead.org>,
Jason Gunthorpe <jgg@nvidia.com>, Peter Xu <peterx@redhat.com>,
Arnd Bergmann <arnd@arndb.de>,
Daniel Vetter <daniel.vetter@ffwll.ch>,
Dongwon Kim <dongwon.kim@intel.com>,
Hugh Dickins <hughd@google.com>,
Junxiao Chang <junxiao.chang@intel.com>,
Mike Kravetz <mike.kravetz@oracle.com>,
Oscar Salvador <osalvador@suse.de>
Subject: Re: [PATCH] mm/gup: restore the ability to pin more than 2GB at a time
Date: Wed, 30 Oct 2024 10:01:37 +0100
Message-ID: <8d9dc103-47c5-4719-971a-31efb091432a@redhat.com>
In-Reply-To: <bfee966f-807d-4668-b353-159a6e8066f2@redhat.com>
On 30.10.24 09:34, David Hildenbrand wrote:
> On 30.10.24 07:50, John Hubbard wrote:
>> On 10/29/24 11:18 PM, Alistair Popple wrote:
>>> John Hubbard <jhubbard@nvidia.com> writes:
>>>> On 10/29/24 9:42 PM, Christoph Hellwig wrote:
>>>>> On Tue, Oct 29, 2024 at 09:39:15PM -0700, John Hubbard wrote:
>> ...
>>>>> Because pinning down these amounts of memory is completely insane.
>>>>> I don't mind the switch to kvmalloc, but we need to put in an upper
>>>>> bound of what can be pinned.
>>>>
>>>> I'm wondering, though: how do we decide how much of the user's system
>>>> we prevent them from using? :) Hardware accelerators do not always have
>>>> page-fault capability, and yet these troublesome users insist on
>>>> stacking their systems full of DRAM and then pointing the accelerator
>>>> at it.
>>>>
>>>> How would we choose a value? Memory sizes keep going up...
>>>
>>> The obvious answer is that you let users decide. I did have a patch series
>>> to do that via a cgroup[1]. However, I dropped that series mostly because I
>>> couldn't find any users of such a limit to provide feedback on how they
>>> would use it or how they wanted it to work.
>>>
>>
>> Trawling through the discussion there, I see that Jason Gunthorpe mentioned:
>>
>> "Things like VFIO & KVM use cases effectively pin 90% of all system memory"
>
> The unusual thing is not the amount of system memory we are pinning, but
> *how many* pages we try to pin in a single call.
>
> If you stare at vfio_pin_pages_remote, we seem to be batching it.
>
> long req_pages = min_t(long, npage, batch->capacity);
>
> Which is
>
> #define VFIO_BATCH_MAX_CAPACITY (PAGE_SIZE / sizeof(struct page *))
>
>
> So you can fix this in your driver ;)
>
>
> We should maybe try a similar limit internally: if you call
> pin_user_pages_remote() with a large number, we'll cap it at some magic
> value (similar to above). The caller will simply realize that not all
> pages were pinned and will retry.
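
To illustrate the capping idea (a hand-wavy sketch only, not what the
prototype further below does; GUP_PIN_BATCH_MAX is a made-up name), it could
be as simple as clamping the request near the top of the longterm pinning
path, which also bounds the temporary folio array to roughly one page:

#define GUP_PIN_BATCH_MAX	(PAGE_SIZE / sizeof(struct page *))

	/* e.g., early in __gup_longterm_locked() or a similar spot: */
	nr_pages = min_t(unsigned long, nr_pages, GUP_PIN_BATCH_MAX);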
>
> See get_user_pages_remote(): "Returns either number of pages pinned
> (which may be less than the number requested), or an error. Details
> about the return value:"
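
On the caller side that is just the usual retry loop (illustrative only;
mm, start, nr_pages, gup_flags and pages are whatever the caller already
has, and locking/error handling is elided):

	long pinned = 0;

	while (pinned < nr_pages) {
		long ret = pin_user_pages_remote(mm, start + pinned * PAGE_SIZE,
						 nr_pages - pinned, gup_flags,
						 pages + pinned, NULL);

		if (ret <= 0)
			break;	/* real code would unpin the @pinned pages already taken */
		pinned += ret;
	}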
>
>
> Alternatively, I recall there was a way to avoid the temporary
> allocation ... let me hack up a prototype real quick.
Completely untested (also note the interesting TODO):
From a23984b4f1a39ec984489fbe16708aedf4f9db95 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Wed, 30 Oct 2024 10:00:50 +0100
Subject: [PATCH] tmp
Signed-off-by: David Hildenbrand <david@redhat.com>
---
mm/gup.c | 120 ++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 78 insertions(+), 42 deletions(-)
diff --git a/mm/gup.c b/mm/gup.c
index a82890b46a36..8807b36c2363 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2273,20 +2273,56 @@ struct page *get_dump_page(unsigned long addr)
#endif /* CONFIG_ELF_CORE */
#ifdef CONFIG_MIGRATION
+
+/*
+ * An array of either pages or folios. While we could currently interpret a
+ * list of folios like a list of pages, it would only work as long as
+ * "struct folio" overlays "struct page" -- and it would not allow for
+ * avoiding page_folio() calls.
+ */
+struct pages_or_folios {
+ union {
+ struct page **pages;
+ struct folio **folios;
+ void **entries;
+ };
+ bool has_folios;
+ long nr_entries;
+};
+
+static struct folio *pofs_get_folio(struct pages_or_folios *pofs, long i)
+{
+ if (pofs->has_folios)
+ return pofs->folios[i];
+ return page_folio(pofs->pages[i]);
+}
+
+static void pofs_clear_entry(struct pages_or_folios *pofs, long i)
+{
+ pofs->entries[i] = NULL;
+}
+
+static void pofs_unpin(struct pages_or_folios *pofs)
+{
+ if (pofs->has_folios)
+ unpin_folios(pofs->folios, pofs->nr_entries);
+ else
+ unpin_user_pages(pofs->pages, pofs->nr_entries);
+}
+
/*
* Returns the number of collected folios. Return value is always >= 0.
*/
static unsigned long collect_longterm_unpinnable_folios(
- struct list_head *movable_folio_list,
- unsigned long nr_folios,
- struct folio **folios)
+ struct list_head *movable_folio_list,
+ struct pages_or_folios *pofs)
{
unsigned long i, collected = 0;
struct folio *prev_folio = NULL;
bool drain_allow = true;
- for (i = 0; i < nr_folios; i++) {
- struct folio *folio = folios[i];
+ for (i = 0; i < pofs->nr_entries; i++) {
+ struct folio *folio = pofs_get_folio(pofs, i);
if (folio == prev_folio)
continue;
@@ -2310,6 +2346,11 @@ static unsigned long collect_longterm_unpinnable_folios(
drain_allow = false;
}
+ /*
+ * TODO: if isolation fails we might already have it in the
+ * list, if pages of different folios are interleaved
+ * (e.g., COW). We might want to check all entries in the list.
+ */
if (!folio_isolate_lru(folio))
continue;
@@ -2328,15 +2369,14 @@ static unsigned long collect_longterm_unpinnable_folios(
* failure (or partial success).
*/
static int migrate_longterm_unpinnable_folios(
- struct list_head *movable_folio_list,
- unsigned long nr_folios,
- struct folio **folios)
+ struct list_head *movable_folio_list,
+ struct pages_or_folios *pofs)
{
int ret;
unsigned long i;
- for (i = 0; i < nr_folios; i++) {
- struct folio *folio = folios[i];
+ for (i = 0; i < pofs->nr_entries; i++) {
+ struct folio *folio = pofs_get_folio(pofs, i);
if (folio_is_device_coherent(folio)) {
/*
@@ -2344,7 +2384,7 @@ static int migrate_longterm_unpinnable_folios(
* convert the pin on the source folio to a normal
* reference.
*/
- folios[i] = NULL;
+ pofs_clear_entry(pofs, i);
folio_get(folio);
gup_put_folio(folio, 1, FOLL_PIN);
@@ -2363,8 +2403,8 @@ static int migrate_longterm_unpinnable_folios(
* calling folio_isolate_lru() which takes a reference so the
* folio won't be freed if it's migrating.
*/
- unpin_folio(folios[i]);
- folios[i] = NULL;
+ unpin_folio(pofs_get_folio(pofs, i));
+ pofs_clear_entry(pofs, i);
}
if (!list_empty(movable_folio_list)) {
@@ -2387,12 +2427,24 @@ static int migrate_longterm_unpinnable_folios(
return -EAGAIN;
err:
- unpin_folios(folios, nr_folios);
+ pofs_unpin(pofs);
putback_movable_pages(movable_folio_list);
return ret;
}
+static long check_and_migrate_movable_pages_or_folios(struct pages_or_folios *pofs)
+{
+ LIST_HEAD(movable_folio_list);
+ unsigned long collected;
+
+ collected = collect_longterm_unpinnable_folios(&movable_folio_list, pofs);
+ if (!collected)
+ return 0;
+
+ return migrate_longterm_unpinnable_folios(&movable_folio_list, pofs);
+}
+
/*
* Check whether all folios are *allowed* to be pinned indefinitely (longterm).
* Rather confusingly, all folios in the range are required to be pinned via
@@ -2412,41 +2464,25 @@ static int migrate_longterm_unpinnable_folios(
static long check_and_migrate_movable_folios(unsigned long nr_folios,
struct folio **folios)
{
- unsigned long collected;
- LIST_HEAD(movable_folio_list);
-
- collected = collect_longterm_unpinnable_folios(&movable_folio_list,
- nr_folios, folios);
- if (!collected)
- return 0;
+ struct pages_or_folios pofs = {
+ .folios = folios,
+ .has_folios = true,
+ .nr_entries = nr_folios,
+ };
- return migrate_longterm_unpinnable_folios(&movable_folio_list,
- nr_folios, folios);
+ return check_and_migrate_movable_pages_or_folios(&pofs);
}
-/*
- * This routine just converts all the pages in the @pages array to folios and
- * calls check_and_migrate_movable_folios() to do the heavy lifting.
- *
- * Please see the check_and_migrate_movable_folios() documentation for details.
- */
+/* See check_and_migrate_movable_folios(). */
static long check_and_migrate_movable_pages(unsigned long nr_pages,
struct page **pages)
{
- struct folio **folios;
- long i, ret;
+ struct pages_or_folios pofs = {
+ .pages = pages,
+ .nr_entries = nr_pages,
+ };
- folios = kmalloc_array(nr_pages, sizeof(*folios), GFP_KERNEL);
- if (!folios)
- return -ENOMEM;
-
- for (i = 0; i < nr_pages; i++)
- folios[i] = page_folio(pages[i]);
-
- ret = check_and_migrate_movable_folios(nr_pages, folios);
-
- kfree(folios);
- return ret;
+ return check_and_migrate_movable_pages_or_folios(&pofs);
}
#else
static long check_and_migrate_movable_pages(unsigned long nr_pages,
--
2.47.0
--
Cheers,
David / dhildenb