linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Hugh Dickins <hughd@google.com>
To: David Hildenbrand <david@redhat.com>
Cc: Hugh Dickins <hughd@google.com>, Will Deacon <will@kernel.org>,
	 linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	 Keir Fraser <keirf@google.com>, Jason Gunthorpe <jgg@ziepe.ca>,
	 John Hubbard <jhubbard@nvidia.com>,
	Frederick Mayle <fmayle@google.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Peter Xu <peterx@redhat.com>,  Rik van Riel <riel@surriel.com>,
	Vlastimil Babka <vbabka@suse.cz>,  Ge Yang <yangge1116@126.com>
Subject: Re: [PATCH] mm/gup: Drain batched mlock folio processing before attempting migration
Date: Thu, 28 Aug 2025 18:58:57 -0700 (PDT)	[thread overview]
Message-ID: <d8e8c6d9-2b58-a1e8-d3cb-3f578f3f5889@google.com> (raw)
In-Reply-To: <56819052-d3f5-4209-824d-5cfbf49ff6e9@redhat.com>

On Thu, 28 Aug 2025, David Hildenbrand wrote:
> On 28.08.25 18:12, Hugh Dickins wrote:
> > On Thu, 28 Aug 2025, David Hildenbrand wrote:
> >> On 28.08.25 10:47, Hugh Dickins wrote:
> > ...
> >>> It took several days in search of the least bad compromise, but
> >>> in the end I concluded the opposite of what we'd intended above.
> >>>
> >>> There is a fundamental incompatibility between my 5.18 2fbb0c10d1e8
> >>> ("mm/munlock: mlock_page() munlock_page() batch by pagevec")
> >>> and Ge Yang's 6.11 33dfe9204f29
> >>> ("mm/gup: clear the LRU flag of a page before adding to LRU batch").
> >>>
> >>> It turns out that the mm/swap.c folio batches (apart from lru_add)
> >>> are all for best-effort, doesn't matter if it's missed, operations;
> >>> whereas mlock and munlock are more serious.  Probably mlock could
> >>> be (not very satisfactorily) converted, but then munlock?  Because
> >>> of failed folio_test_clear_lru()s, it would be far too likely to
> >>> err on either side, munlocking too soon or too late.
> >>>
> >>> I've concluded that one or the other has to go.  If we're having
> >>> a beauty contest, there's no doubt that 33dfe9204f29 is much nicer
> >>> than 2fbb0c10d1e8 (which is itself far from perfect).  But functionally,
> >>> I'm afraid that removing the mlock/munlock batching will show up as a
> >>> perceptible regression in realistic workloadsg; and on consideration,
> >>> I've found no real justification for the LRU flag clearing change.
> >>
> >> Just to understand what you are saying: are you saying that we will go back
> >> to
> >> having a folio being part of multiple LRU caches?
> > 
> > Yes.  Well, if you count the mlock/munlock batches in as "LRU caches",
> > then that has been so all along.
> 
> Yes ...
> 
> > 
> >> :/ If so, I really rally
> >> hope that we can find another way and not go back to that old handling.
> > 
> > For what reason?  It sounded like a nice "invariant" to keep in mind,
> > but it's a limitation, and  what purpose was it actually serving?
> 
> I liked the semantics that if !lru, there could be at most one reference from
> the LRU caches.
> 
> That is, if there are two references, you don't even have to bother with
> flushing anything.

If that assumption is being put into practice anywhere (not that I know of),
then it's currently wrong and needs currecting.

It would be nice and simple if it worked, but I couldn't get it to work
with mlock/munlock batching, so it seemed better to give up on the
pretence.

And one of the points of using a pagevec used to be precisely that a
page could exist on more than one at a time (unlike threading via lru).

> 
> > 
> > If it's the "spare room in struct page to keep the address of that one
> > batch entry ... efficiently extract ..." that I was dreaming of: that
> > has to be a future thing, when perhaps memdescs will allow an extensible
> > structure to be attached, and extending it for an mlocked folio (to hold
> > the mlock_count instead of squeezing it into lru.prev) would not need
> > mlock/munlock batching at all (I guess: far from uppermost in my mind!),
> > and including a field for "efficiently extract" from LRU batch would be
> > nice.
> > 
> > But, memdescs or not, there will always be pressure to keep the common
> > struct as small as possible, so I don't know if we would actually go
> > that way.
> > 
> > But I suspect that was not your reason at all: please illuminate.
> 
> You are very right :)

OK, thanks, I'll stop reading further now :)

> 
> Regarding the issue at hand:
> 
> There were discussions at LSF/MM about also putting (some) large folios onto
> the LRU caches.
> 
> In that context, GUP could take multiple references on the same folio, and a
> simple folio_expected_ref_count() + 1 would no longer do the trick.
> 
> I thought about this today, and likely it could be handled by scanning the
> page array for same folios etc. A bit nasty when wanting to cover all corner
> cases (folio pages must not be consecutive in the passed array ) ...

I haven't thought about that problem at all (except when working around
a similar issue in mm/mempolicy.c's folio queueing), but can sympathize.

It had irritated me to notice how the flush-immediately code for 512-page
folios now extends to 2-page folios: you've enlightened me why that remains
so, I hadn't thought of the implications.  Perhaps even more reason to
allow for multiple references on the pagevecs/batches?

> 
> Apart from that issue, I liked the idea of a "single entry in the cache" for
> other reasons: it gives clear semantics. We cannot end up in a scenario where
> someone performs OPX and later someone OPY on a folio, but the way the lru
> caches are flushed we might end up processing OPX after OPY -- this should be
> able to happen in case of local or remote flushes IIRC.

It's been that way for many years, that's how they are.

> 
> Anyhow, I quickly scanned your code. The folio_expected_ref_count() should
> solve the issue for now. It's quite unfortunate that any raised reference will
> make us now flush all remote LRU caches.

There will be more false positives (drains which drain nothing relevant)
by relying on ref_count rather than test_lru, yes.  But certainly there
were already false positives by relying on test_lru.  And I expect the
preparatory local lru_add_drain() to cut out a lot of true positives.

> 
> Maybe we just want to limit it to !folio_test_large(), because flushing there
> really doesn't give us any chance in succeeding right now? Not sure if it
> makes any difference in practice, though, we'll likely be flushing remote LRU
> caches now more frequently either way.

Ah, good idea. with or without my changes.  Maybe material for a separate
patch.  I wonder if we would do better to add a folio_is_batchable() and
use that consistently in all of the places which are refusing to batch
when folio_test_large() - I wonder if a !folio_test_large() here will
get missed if there's a change there.

Thanks,
Hugh


  reply	other threads:[~2025-08-29  1:59 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-08-15 10:18 Will Deacon
2025-08-16  1:03 ` John Hubbard
2025-08-16  4:33   ` Hugh Dickins
2025-08-18 13:38   ` Will Deacon
2025-08-16  4:14 ` Hugh Dickins
2025-08-16  8:15   ` David Hildenbrand
2025-08-18 13:31   ` Will Deacon
2025-08-18 14:31     ` Will Deacon
2025-08-25  1:25       ` Hugh Dickins
2025-08-25 16:04         ` David Hildenbrand
2025-08-28  8:47         ` Hugh Dickins
2025-08-28  8:59           ` David Hildenbrand
2025-08-28 16:12             ` Hugh Dickins
2025-08-28 20:38               ` David Hildenbrand
2025-08-29  1:58                 ` Hugh Dickins [this message]
2025-08-29  8:56                   ` David Hildenbrand
2025-08-29 11:57           ` Will Deacon
2025-08-29 13:21             ` Will Deacon
2025-08-29 16:04               ` Hugh Dickins
2025-08-29 15:46             ` Hugh Dickins
2025-09-09 11:39               ` Will Deacon
2025-09-09 11:50                 ` David Hildenbrand
2025-09-10  0:24                   ` John Hubbard

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d8e8c6d9-2b58-a1e8-d3cb-3f578f3f5889@google.com \
    --to=hughd@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=fmayle@google.com \
    --cc=jgg@ziepe.ca \
    --cc=jhubbard@nvidia.com \
    --cc=keirf@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=peterx@redhat.com \
    --cc=riel@surriel.com \
    --cc=vbabka@suse.cz \
    --cc=will@kernel.org \
    --cc=yangge1116@126.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox