linux-mm.kvack.org archive mirror
From: David Hildenbrand <david@redhat.com>
To: Matthew Wilcox <willy@infradead.org>,
	Yin Fengwei <fengwei.yin@intel.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	yuzhao@google.com, ryan.roberts@arm.com, shy828301@gmail.com,
	akpm@linux-foundation.org
Subject: Re: [RFC PATCH 0/3] support large folio for mlock
Date: Fri, 7 Jul 2023 20:54:33 +0200
Message-ID: <4bb39d6e-a324-0d85-7d44-8e8a37a1cfec@redhat.com>
In-Reply-To: <ZKhK1Ic1KCdOLRYm@casper.infradead.org>

On 07.07.23 19:26, Matthew Wilcox wrote:
> On Sat, Jul 08, 2023 at 12:52:18AM +0800, Yin Fengwei wrote:
>> This series classifies large folios for mlock into two types:
>>    - The large folio lies entirely within a VM_LOCKED VMA range
>>    - The large folio crosses a VM_LOCKED VMA boundary
> 
> This is somewhere that I think our fixation on MUST USE PMD ENTRIES
> has led us astray.  Today when the arguments to mlock() cross a folio
> boundary, we split the PMD entry but leave the folio intact.  That means
> that we continue to manage the folio as a single entry on the LRU list.
> But userspace may have no idea that we're doing this.  It may have made
> several calls to mmap(), 256kB at a time; they've all been coalesced into
> a single VMA, and khugepaged has come along behind its back and created
> a 2MB THP.  Now userspace calls mlock() and instead of treating that as
> a hint that oops, maybe we shouldn't've done that, we do our utmost to
> preserve the 2MB folio.
> 
> I think this whole approach needs rethinking.  IMO, anonymous folios
> should not cross VMA boundaries.  Tell me why I'm wrong.

I think we touched upon that a couple of times already, and the main 
issue is that while it sounds nice in theory, it's impossible in practice.

THPs are supposed to be transparent; that is, we must not let arbitrary 
operations fail because of them.

But nothing stops user space from

(a) mmap'ing a 2 MiB region
(b) GUP-pinning the whole range
(c) GUP-pinning the first half
(d) unpinning the whole range from (a)
(e) munmap'ing the second half
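
To make the sequence above concrete, here is a toy model in Python. It is 
only a sketch under loud assumptions: ToyFolio, pin(), unpin(), can_split() 
and run_scenario() are hypothetical names, not kernel APIs, and "any 
outstanding pin blocks the split" is a simplification of the real 
reference-count check. It assumes 4 KiB base pages, so a 2 MiB THP has 512 
pages.

```python
from dataclasses import dataclass, field

PAGES_PER_THP = 512  # 2 MiB / 4 KiB base pages (assumption: x86-64 defaults)


@dataclass
class ToyFolio:
    """Hypothetical stand-in for a 2 MiB anonymous folio; not a kernel structure."""
    nr_pages: int = PAGES_PER_THP
    mapped: set = field(default_factory=lambda: set(range(PAGES_PER_THP)))
    pins: list = field(default_factory=list)

    def pin(self, start, end):
        """Record a GUP-style pin on pages [start, end) and return a handle."""
        handle = (start, end)
        self.pins.append(handle)
        return handle

    def unpin(self, handle):
        """Drop a previously recorded pin."""
        self.pins.remove(handle)

    def can_split(self):
        # Simplification: any outstanding pin anywhere in the folio blocks a split.
        return not self.pins

    def munmap(self, start, end):
        """Unmap pages [start, end). Never fails; returns whether the split worked."""
        self.mapped -= set(range(start, end))
        if self.can_split():
            self.nr_pages = len(self.mapped)  # pretend the unmapped tail was split off
            return True
        return False


def run_scenario():
    folio = ToyFolio()                        # (a) mmap a 2 MiB region
    whole = folio.pin(0, PAGES_PER_THP)       # (b) pin the whole range
    folio.pin(0, PAGES_PER_THP // 2)          # (c) pin the first half
    folio.unpin(whole)                        # (d) unpin the whole range
    split_ok = folio.munmap(PAGES_PER_THP // 2, PAGES_PER_THP)  # (e)
    return folio, split_ok


folio, split_ok = run_scenario()
print(split_ok, folio.nr_pages, len(folio.mapped))  # prints: False 512 256
```

In this model the munmap() in (e) succeeds, as it must, but the pin from (c) 
is still outstanding, so the folio keeps all 512 pages while only 256 remain 
mapped: it now extends past the end of the VMA.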


And that's just one out of many examples I can think of, not even 
considering temporary/speculative references that can prevent a split at 
random points in time -- especially when splitting a VMA.

Sure, any time we PTE-map a THP we might just say "let's put that on the 
deferred split queue" and cross fingers that we can eventually split it 
later. (I was recently thinking about that in the context of the 
mapcount ...)

It's all a big mess ...

-- 
Cheers,

David / dhildenb



