Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration


From: Will Deacon <will@kernel.org>
To: Hugh Dickins <hughd@google.com>
Cc: David Hildenbrand <david@redhat.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	Keir Fraser <keirf@google.com>, Jason Gunthorpe <jgg@ziepe.ca>,
	John Hubbard <jhubbard@nvidia.com>,
	Frederick Mayle <fmayle@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Peter Xu <peterx@redhat.com>, Rik van Riel <riel@surriel.com>,
	Vlastimil Babka <vbabka@suse.cz>, Ge Yang <yangge1116@126.com>
Subject: Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
Date: Fri, 29 Aug 2025 12:57:37 +0100
Message-ID: <aLGVsXpyUx9-ZRIl@willie-the-truck>
In-Reply-To: <8376d8a3-cc36-ae70-0fa8-427e9ca17b9b@google.com>

Hi Hugh,

On Thu, Aug 28, 2025 at 01:47:14AM -0700, Hugh Dickins wrote:
> On Sun, 24 Aug 2025, Hugh Dickins wrote:
> > On Mon, 18 Aug 2025, Will Deacon wrote:
> > > On Mon, Aug 18, 2025 at 02:31:42PM +0100, Will Deacon wrote:
> > > > On Fri, Aug 15, 2025 at 09:14:48PM -0700, Hugh Dickins wrote:
> > > > > I think replace the folio_test_mlocked(folio) part of it by
> > > > > (folio_test_mlocked(folio) && !folio_test_unevictable(folio)).
> > > > > That should reduce the extra calls to a much more reasonable
> > > > > number, while still solving your issue.
> > > > 
> > > > Alas, I fear that the folio may be unevictable by this point (which
> > > > seems to coincide with the readahead fault adding it to the LRU above)
> > > > but I can try it out.
> > > 
> > > I gave this a spin but I still see failures with this change.
> > 
> > Many thanks, Will, for the precisely relevant traces (in which,
> > by the way, mapcount=0 really means _mapcount=0 hence mapcount=1).
> > 
> > Yes, those do indeed illustrate a case which my suggested
> > (folio_test_mlocked(folio) && !folio_test_unevictable(folio))
> > failed to cover.  Very helpful to have an example of that.
> > 
> > And many thanks, David, for your reminder of commit 33dfe9204f29
> > ("mm/gup: clear the LRU flag of a page before adding to LRU batch").
> > 
> > Yes, I strongly agree with your suggestion that the mlock batch
> > be brought into line with its change to the ordinary LRU batches,
> > and agree that doing so will be likely to solve Will's issue
> > (and similar cases elsewhere, without needing to modify them).
> > 
> > Now I just have to cool my head and get back down into those
> > mlock batches.  I am fearful that making a change there to suit
> > this case will turn out later to break another case (and I just
> > won't have time to redevelop as thorough a grasp of the races as
> > I had back then).  But if we're lucky, applying that "one batch
> > at a time" rule will actually make it all more comprehensible.
> > 
> > (I so wish we had spare room in struct page to keep the address
> > of that one batch entry, or the CPU to which that one batch
> > belongs: then, although that wouldn't eliminate all uses of
> > lru_add_drain_all(), it would allow us to efficiently extract
> > a target page from its LRU batch without a remote drain.)
> > 
> > I have not yet begun to write such a patch, and I'm not yet sure
> > that it's even feasible: this mail sent to get the polite thank
> > yous out of my mind, to help clear it for getting down to work.
> 
> It took several days in search of the least bad compromise, but
> in the end I concluded the opposite of what we'd intended above.
> 
> There is a fundamental incompatibility between my 5.18 2fbb0c10d1e8
> ("mm/munlock: mlock_page() munlock_page() batch by pagevec")
> and Ge Yang's 6.11 33dfe9204f29
> ("mm/gup: clear the LRU flag of a page before adding to LRU batch").

That's actually pretty good news, as I was initially worried that we'd
have to backport a fix all the way back to 6.1. From the above, the only
LTS affected is 6.12.y.

> It turns out that the mm/swap.c folio batches (apart from lru_add)
> are all for best-effort, doesn't matter if it's missed, operations;
> whereas mlock and munlock are more serious.  Probably mlock could
> be (not very satisfactorily) converted, but then munlock?  Because
> of failed folio_test_clear_lru()s, it would be far too likely to
> err on either side, munlocking too soon or too late.
> 
> I've concluded that one or the other has to go.  If we're having
> a beauty contest, there's no doubt that 33dfe9204f29 is much nicer
> than 2fbb0c10d1e8 (which is itself far from perfect).  But functionally,
> I'm afraid that removing the mlock/munlock batching will show up as a
> perceptible regression in realistic workloads; and on consideration,
> I've found no real justification for the LRU flag clearing change.
> 
> Unless I'm mistaken, collect_longterm_unpinnable_folios() should
> never have been relying on folio_test_lru(), and should simply be
> checking for expected ref_count instead.
> 
> Will, please give the portmanteau patch (combination of four)
> below a try: reversion of 33dfe9204f29 and a later MGLRU fixup,
> corrected test in collect...(), preparatory lru_add_drain() there.
> 
> I hope you won't be proving me wrong again, and I can move on to
> writing up those four patches (and adding probably three more that
> make sense in such a series, but should not affect your testing).
> 
> I've tested enough to know that it's not harmful, but am hoping
> to take advantage of your superior testing, particularly in the
> GUP pin area.  But if you're uneasy with the combination, and would
> prefer to check just the minimum, then ignore the reversions and try
> just the mm/gup.c part of it - that will probably be good enough for
> you even without the reversions.

Thanks, I'll try to test the whole lot. I was geographically separated
from my testing device yesterday but I should be able to give it a spin
later today. I'm _supposed_ to be writing my KVM Forum slides for next
week, so this offers a perfect opportunity to procrastinate.

> Patch is against 6.17-rc3; but if you'd prefer the patch against 6.12
> (or an intervening release), I already did the backport so please just
> ask.

We've got 6.15 working well at the moment, so I'll backport your diff
to that.

One question on the diff below:

> Thanks!
> 
>  mm/gup.c    |    5 ++++-
>  mm/swap.c   |   50 ++++++++++++++++++++++++++------------------------
>  mm/vmscan.c |    2 +-
>  3 files changed, 31 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index adffe663594d..9f7c87f504a9 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2291,6 +2291,8 @@ static unsigned long collect_longterm_unpinnable_folios(
>  	struct folio *folio;
>  	long i = 0;
>  
> +	lru_add_drain();
> +
>  	for (folio = pofs_get_folio(pofs, i); folio;
>  	     folio = pofs_next_folio(folio, pofs, &i)) {
>  
> @@ -2307,7 +2309,8 @@ static unsigned long collect_longterm_unpinnable_folios(
>  			continue;
>  		}
>  
> -		if (!folio_test_lru(folio) && drain_allow) {
> +		if (drain_allow && folio_ref_count(folio) !=
> +				   folio_expected_ref_count(folio) + 1) {
>  			lru_add_drain_all();

How does this synchronise with the folio being added to the mlock batch
on another CPU?

need_mlock_drain(), which is what I think lru_add_drain_all() ends up
using to figure out which CPU batches to process, just looks at the
'nr' field in the batch and I can't see anything in mlock_folio() to
ensure any ordering between adding the folio to the batch and
incrementing its refcount.

Then again, my hack to use folio_test_mlocked() would have a similar
issue because the flag is set (albeit with barrier semantics) before
adding the folio to the batch, meaning the drain could miss the folio.

I guess there's some higher-level synchronisation making this all work,
but it would be good to understand that as I can't see that
collect_longterm_unpinnable_folios() can rely on much other than the pin.
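
To make the window concrete, here is a minimal userspace sketch of the
interleaving I'm worried about. It is only a model: the toy_* names are
hypothetical, nothing here is kernel code, and it assumes that the batch
reference is taken before the entry becomes visible through 'nr' and that
the drain decision looks at 'nr' alone.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_BATCH_SIZE 8

/* Hypothetical stand-ins for a folio and a per-CPU batch; not kernel types. */
struct toy_folio { int refcount; };
struct toy_batch { int nr; struct toy_folio *slots[TOY_BATCH_SIZE]; };

/* Step 1 of the assumed add path: take the batch's reference. */
static void toy_get_ref(struct toy_folio *f)
{
	f->refcount++;
}

/* Step 2 of the assumed add path: make the entry visible through nr. */
static void toy_publish(struct toy_batch *b, struct toy_folio *f)
{
	assert(b->nr < TOY_BATCH_SIZE);
	b->slots[b->nr++] = f;
}

/* Drain decision that only inspects nr (the assumption being modelled). */
static bool toy_need_drain(const struct toy_batch *b)
{
	return b->nr != 0;
}

/* Drop the batch's references, but only if the nr check asks for it. */
static void toy_drain(struct toy_batch *b)
{
	if (!toy_need_drain(b))
		return;
	for (int i = 0; i < b->nr; i++)
		b->slots[i]->refcount--;
	b->nr = 0;
}

/*
 * Replay one interleaving. Returns the number of references beyond
 * "expected + 1 pin" that remain once the checker has tried to drain.
 */
int toy_extra_refs_after_drain(bool publish_before_drain)
{
	const int expected = 1;		/* e.g. a single mapping */
	struct toy_folio folio = { .refcount = expected + 1 };	/* plus the pin */
	struct toy_batch remote = { .nr = 0 };

	toy_get_ref(&folio);			/* remote CPU, step 1 */
	if (publish_before_drain)
		toy_publish(&remote, &folio);	/* remote CPU, step 2 */

	if (folio.refcount != expected + 1)	/* checker sees an extra ref */
		toy_drain(&remote);

	return folio.refcount - (expected + 1);
}
```

In this model, toy_extra_refs_after_drain(true) leaves no extra reference,
whereas toy_extra_refs_after_drain(false) leaves one: the checker observed
the raised refcount but the drain skipped the batch because 'nr' was still
zero. Whether the real code can hit that interleaving is exactly what I'm
asking.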

Will
