From: Will Deacon <will@kernel.org>
To: John Hubbard <jhubbard@nvidia.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Hugh Dickins <hughd@google.com>, Keir Fraser <keirf@google.com>,
Jason Gunthorpe <jgg@ziepe.ca>,
David Hildenbrand <david@redhat.com>,
Frederick Mayle <fmayle@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
Peter Xu <peterx@redhat.com>
Subject: Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
Date: Mon, 18 Aug 2025 14:38:46 +0100
Message-ID: <aKMs5t6oT6UxeGfF@willie-the-truck>
In-Reply-To: <ef85aa74-180c-4fbc-8af6-e6cca45eed43@nvidia.com>

On Fri, Aug 15, 2025 at 06:03:17PM -0700, John Hubbard wrote:
> On 8/15/25 3:18 AM, Will Deacon wrote:
> > When taking a longterm GUP pin via pin_user_pages(),
> > __gup_longterm_locked() tries to migrate target folios that should not
> > be longterm pinned, for example because they reside in a CMA region or
> > movable zone. This is done by first pinning all of the target folios
> > anyway, collecting all of the longterm-unpinnable target folios into a
> > list, dropping the pins that were just taken and finally handing the
> > list off to migrate_pages() for the actual migration.
> >
> > It is critically important that no unexpected references are held on the
> > folios being migrated, otherwise the migration will fail and
> > pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is
> > relatively easy to observe migration failures when running pKVM (which
> > uses pin_user_pages() on crosvm's virtual address space to resolve
> > stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and
> > this results in the VM terminating prematurely.
> >
> > In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its
> > mapping of guest memory prior to the pinning. Subsequently, when
> > pin_user_pages() walks the page-table, the relevant 'pte' is not
> > present and so the faulting logic allocates a new folio, mlocks it
> > with mlock_folio() and maps it in the page-table.
> >
> > Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
> > batch by pagevec"), mlock/munlock operations on a folio (formerly page),
> > are deferred. For example, mlock_folio() takes an additional reference
> > on the target folio before placing it into a per-cpu 'folio_batch' for
> > later processing by mlock_folio_batch(), which drops the refcount once
> > the operation is complete. Processing of the batches is coupled with
> > the LRU batch logic and can be forcefully drained with
> > lru_add_drain_all() but as long as a folio remains unprocessed on the
> > batch, its refcount will be elevated.
> >
> > This deferred batching therefore interacts poorly with the pKVM pinning
>
> I would go even a little broader (more general), and claim that this
> deferred batching interacts poorly with gup FOLL_LONGTERM when trying
> to pin folios in CMA or ZONE_MOVABLE, in fact.

That's much better, thanks.

> > diff --git a/mm/gup.c b/mm/gup.c
> > index adffe663594d..656835890f05 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
> >  			continue;
> >  		}
> >
> > -		if (!folio_test_lru(folio) && drain_allow) {
> > +		if (drain_allow &&
> > +		   (!folio_test_lru(folio) || folio_test_mlocked(folio))) {
>
> That should work, yes.
>
> Alternatively, after thinking about this a bit today, it seems to me that the
> mlock batching is a little too bold, given the presence of gup/pup. And so I'm
> tempted to fix the problem closer to the root cause, like this (below).
>
> But maybe this is actually *less* wise than what you have proposed...
>
> I'd like to hear other mm folks' opinion on this approach:
>
> diff --git a/mm/mlock.c b/mm/mlock.c
> index a1d93ad33c6d..edecdd32996e 100644
> --- a/mm/mlock.c
> +++ b/mm/mlock.c
> @@ -278,7 +278,15 @@ void mlock_new_folio(struct folio *folio)
>
>  	folio_get(folio);
>  	if (!folio_batch_add(fbatch, mlock_new(folio)) ||
> -	    folio_test_large(folio) || lru_cache_disabled())
> +	    folio_test_large(folio) || lru_cache_disabled() ||
> +	/*
> +	 * If this is being called as part of a gup FOLL_LONGTERM operation in
> +	 * CMA/MOVABLE zones with MLOCK_ONFAULT active, then the newly faulted
> +	 * in folio will need to immediately migrate to a pinnable zone.
> +	 * Allowing the mlock operation to batch would break the ability to
> +	 * migrate the folio. Instead, force immediate processing.
> +	 */
> +	    (current->flags & PF_MEMALLOC_PIN))
>  		mlock_folio_batch(fbatch);
>  	local_unlock(&mlock_fbatch.lock);
>  }

So after Hugh's eagle eyes spotted mlock_folio() in my description, it
turns out that the mlock happens on the user page fault path rather than
during the pin itself. I think that means that checking for
PF_MEMALLOC_PIN isn't going to work, as the pinning comes later. Hrm.

I posted some stacktraces in my reply to Hugh that might help (and boy
do I have plenty more of those).

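To make the ordering problem concrete, here is a toy user-space model of
the interaction discussed above. It is only a sketch and not kernel code:
every toy_* name is made up for illustration, 'memalloc_pin' merely stands
in for PF_MEMALLOC_PIN, and the batch size is arbitrary. The only
behaviour it assumes is what the thread describes: a deferred mlock holds
an extra folio reference until the batch is drained, and migration fails
while unexpected references exist.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BATCH_SIZE 4	/* arbitrary; not the kernel's batch size */

/* Hypothetical stand-in for the few bits of folio state that matter here. */
struct toy_folio {
	int refcount;
	bool mlocked;	/* set as soon as the mlock is requested */
	bool on_lru;
};

/* Hypothetical stand-in for the per-cpu mlock folio batch. */
struct toy_batch {
	struct toy_folio *slots[TOY_BATCH_SIZE];
	size_t nr;
};

/* Hypothetical stand-in for 'current'; models PF_MEMALLOC_PIN only. */
struct toy_task {
	bool memalloc_pin;
};

/* Process the batch: each queued folio loses the batch's reference. */
static inline void toy_drain(struct toy_batch *b)
{
	for (size_t i = 0; i < b->nr; i++)
		b->slots[i]->refcount--;
	b->nr = 0;
}

/*
 * Deferred mlock: take a reference and queue the folio. The batch is
 * processed immediately only when it is full or, as in the alternative
 * proposal, when the task is already inside a longterm pin.
 */
static inline void toy_mlock_folio(const struct toy_task *t,
				   struct toy_batch *b, struct toy_folio *f)
{
	assert(b->nr < TOY_BATCH_SIZE);
	f->mlocked = true;
	f->refcount++;
	b->slots[b->nr++] = f;
	if (b->nr == TOY_BATCH_SIZE || t->memalloc_pin)
		toy_drain(b);
}

/* Migration only succeeds when no unexpected references are held. */
static inline bool toy_can_migrate(const struct toy_folio *f, int expected_refs)
{
	return f->refcount == expected_refs;
}

/* Drain condition before the patch: only non-LRU folios trigger a drain. */
static inline bool toy_needs_drain_old(const struct toy_folio *f)
{
	return !f->on_lru;
}

/* Drain condition with the patch: mlocked folios trigger a drain too. */
static inline bool toy_needs_drain_new(const struct toy_folio *f)
{
	return !f->on_lru || f->mlocked;
}
```

In this model, a folio that is mlocked from the fault path (with
'memalloc_pin' clear) stays on the batch with its refcount elevated, so
toy_can_migrate() fails until toy_drain() runs. toy_needs_drain_old()
does not ask for a drain because the folio is on the LRU, whereas
toy_needs_drain_new() does. Setting 'memalloc_pin' only helps if the
mlock happens while the pin is in progress, which is not the ordering
seen here.
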
Will