linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Matthew Wilcox <willy@infradead.org>
To: David Hildenbrand <david@redhat.com>
Cc: Yin Fengwei <fengwei.yin@intel.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	yuzhao@google.com, ryan.roberts@arm.com, shy828301@gmail.com,
	akpm@linux-foundation.org
Subject: Re: [RFC PATCH 0/3] support large folio for mlock
Date: Fri, 7 Jul 2023 20:06:00 +0100	[thread overview]
Message-ID: <ZKhiGLpIWi5Z2WnY@casper.infradead.org> (raw)
In-Reply-To: <4bb39d6e-a324-0d85-7d44-8e8a37a1cfec@redhat.com>

On Fri, Jul 07, 2023 at 08:54:33PM +0200, David Hildenbrand wrote:
> On 07.07.23 19:26, Matthew Wilcox wrote:
> > On Sat, Jul 08, 2023 at 12:52:18AM +0800, Yin Fengwei wrote:
> > > This series identified the large folio for mlock to two types:
> > >    - The large folio is in VM_LOCKED VMA range
> > >    - The large folio cross VM_LOCKED VMA boundary
> > 
> > This is somewhere that I think our fixation on MUST USE PMD ENTRIES
> > has led us astray.  Today when the arguments to mlock() cross a folio
> > boundary, we split the PMD entry but leave the folio intact.  That means
> > that we continue to manage the folio as a single entry on the LRU list.
> > But userspace may have no idea that we're doing this.  It may have made
> > several calls to mmap() 256kB at once, they've all been coalesced into
> > a single VMA and khugepaged has come along behind its back and created
> > a 2MB THP.  Now userspace calls mlock() and instead of treating that as
> > a hint that oops, maybe we shouldn't've done that, we do our utmost to
> > preserve the 2MB folio.
> > 
> > I think this whole approach needs rethinking.  IMO, anonymous folios
> > should not cross VMA boundaries.  Tell me why I'm wrong.
> 
> I think we touched upon that a couple of times already, and the main issue
> is that while it sounds nice in theory, it's impossible in practice.
> 
> THP are supposed to be transparent, that is, we should not let arbitrary
> operations fail.
> 
> But nothing stops user space from
> 
> (a) mmap'ing a 2 MiB region
> (b) GUP-pinning the whole range
> (c) GUP-pinning the first half
> (d) unpinning the whole range from (a)
> (e) munmap'ing the second half
> 
> 
> And that's just one out of many examples I can think of, not even
> considering temporary/speculative references that can prevent a split at
> random points in time -- especially when splitting a VMA.
> 
> Sure, any time we PTE-map a THP we might just say "let's put that on the
> deferred split queue" and cross fingers that we can eventually split it
> later. (I was recently thinking about that in the context of the mapcount
> ...)
> 
> It's all a big mess ...

Oh, I agree, there are always going to be circumstances where we realise
we've made a bad decision and can't (easily) undo it.  Unless we have a
per-page pincount, and I Would Rather Not Do That.  But we should _try_
to do that because it's the right model -- that's what I meant by "Tell
me why I'm wrong"; what scenarios do we have where a user temporarilly
mlocks (or mprotects or ...) a range of memory, but wants that memory
to be aged in the LRU exactly the same way as the adjacent memory that
wasn't mprotected?

GUP-pinning is different, and I don't think GUP-pinning should split
a folio.  That's a temporary use (not FOLL_LONGTERM), eg, we're doing
tcp zero-copy or it's the source/target of O_DIRECT.  That's not an
instruction that this memory is different from its neighbours.

Maybe we end up deciding to split folios on GUP-pin.  That would be
regrettable.


  reply	other threads:[~2023-07-07 19:06 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-07 16:52 Yin Fengwei
2023-07-07 16:52 ` [RFC PATCH 1/3] mm: add function folio_in_range() Yin Fengwei
2023-07-08  5:47   ` Yu Zhao
2023-07-08  6:44     ` Yin, Fengwei
2023-07-07 16:52 ` [RFC PATCH 2/3] mm: handle large folio when large folio in VM_LOCKED VMA range Yin Fengwei
2023-07-08  5:11   ` Yu Zhao
2023-07-08  5:33     ` Yin, Fengwei
2023-07-08  5:56       ` Yu Zhao
2023-07-07 16:52 ` [RFC PATCH 3/3] mm: mlock: update mlock_pte_range to handle large folio Yin Fengwei
2023-07-07 17:26 ` [RFC PATCH 0/3] support large folio for mlock Matthew Wilcox
2023-07-07 18:54   ` David Hildenbrand
2023-07-07 19:06     ` Matthew Wilcox [this message]
2023-07-07 19:15       ` David Hildenbrand
2023-07-07 19:26         ` Matthew Wilcox
2023-07-10 10:36           ` Ryan Roberts
2023-07-08  3:52       ` Yin, Fengwei
2023-07-08  4:02         ` Matthew Wilcox
2023-07-08  4:35           ` Yu Zhao
2023-07-08  4:40             ` Yin, Fengwei
2023-07-08  4:36           ` Yin, Fengwei
2023-07-09 13:25           ` Yin, Fengwei
2023-07-10  9:32             ` David Hildenbrand
2023-07-10  9:43               ` Yin, Fengwei
2023-07-10  9:57                 ` David Hildenbrand
2023-07-10 10:19                   ` Yin, Fengwei
2023-07-08  3:34     ` Yin, Fengwei
2023-07-08  3:31   ` Yin, Fengwei
2023-07-08  4:45 ` Yu Zhao
2023-07-08  5:01   ` Yin, Fengwei
2023-07-08  5:06     ` Yu Zhao
2023-07-08  5:35       ` Yin, Fengwei

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZKhiGLpIWi5Z2WnY@casper.infradead.org \
    --to=willy@infradead.org \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=fengwei.yin@intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ryan.roberts@arm.com \
    --cc=shy828301@gmail.com \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox