Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs

From: David Hildenbrand <david@redhat.com>
To: David Rientjes <rientjes@google.com>,
	Mike Kravetz <mike.kravetz@oracle.com>
Cc: Yosry Ahmed <yosryahmed@google.com>,
	James Houghton <jthoughton@google.com>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
	Peter Xu <peterx@redhat.com>, Michal Hocko <mhocko@suse.com>,
	Matthew Wilcox <willy@infradead.org>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Jiaqi Yan <jiaqiyan@google.com>
Subject: Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
Date: Thu, 8 Jun 2023 08:34:10 +0200
Message-ID: <cfc580f1-e1de-dca8-8549-324a35e21a12@redhat.com>
In-Reply-To: <686e3e61-704e-1258-8a8b-f18399b41668@google.com>

On 08.06.23 02:02, David Rientjes wrote:
> On Wed, 7 Jun 2023, Mike Kravetz wrote:
> 
>>>>>> Are there strong objections to extending hugetlb for this support?
>>>>>
>>>>> I don't want to get too involved in this discussion (busy), but I
>>>>> absolutely agree on the points that were raised at LSF/MM that
>>>>>
>>>>> (A) hugetlb is complicated and very special (many things not integrated
>>>>> with core-mm, so we need special-casing all over the place). [example:
>>>>> what is a pte?]
>>>>>
>>>>> (B) We added a bunch of complexity in the past that some people
>>>>> considered very important (and it was not feature frozen, right? ;) ).
>>>>> Looking back, we might just not have done some of that, or done it
>>>>> differently/cleaner -- better integrated in the core. (PMD sharing,
>>>>> MAP_PRIVATE, a reservation mechanism that still requires preallocation
>>>>> because it fails with NUMA/fork, ...)
>>>>>
>>>>> (C) Unifying hugetlb and the core looks like it's getting more and more
>>>>> out of reach, maybe even impossible with all the complexity we added
>>>>> over the years (well, and keep adding).
>>>>>
>>>>> Sure, HGM for the purpose of better hwpoison handling makes sense. But
>>>>> hugetlb is probably 20 years old and hwpoison handling probably 13 years
>>>>> old. So we managed to get quite far without that optimization.
>>>>>
> 
> Sane handling for memory poisoning and optimizations for live migration
> are both much more important for the real-world 1GB hugetlb user, so it
> doesn't quite have that lengthy of a history.
> 
> Unfortunately, cloud providers receive complaints about both of these from
> customers.  They are one of the most significant causes for poor customer
> experience.
> 
> While people have proposed 1GB THP support in the past, it was nacked, in
> part, because of the suggestion to just use existing 1GB support in
> hugetlb instead :)

Yes, because I still think that the use case for "transparent" (for the 
user) nowadays is very limited and not worth the complexity.

IMHO, what you really want is a pool of large pages (with guarantees 
about availability and nodes) and fine-grained control over who gets 
these pages. That's what hugetlb provides.

In contrast to THP, you don't want to allow for
* Partially mmap, mremap, munmap, mprotect them
* Partially sharing them / COW'ing them
* Partially mixing them with other anon pages (MADV_DONTNEED + refault)
* Exclude them from some features KSM/swap
* (swap them out and eventually split them for that)

Because you don't want to get these pages PTE-mapped by the system 
*unless* there is a real reason (HGM, hwpoison) -- you want guarantees. 
Once such a page is PTE-mapped, you only want to collapse in place.

But you don't want special-HGM, you simply want the core to PTE-map them 
like a (file) THP.

IMHO, realizing that would be much easier if we didn't have to 
care about some of the hugetlb complexity I raised (MAP_PRIVATE, PMD 
sharing), but maybe there is a way ...

> 
>>>>> Absolutely, HGM for better postcopy live migration also makes sense, I
>>>>> guess nobody disagrees on that.
>>>>>
>>>>>
>>>>> But as discussed in that session, maybe we should just start anew and
>>>>> implement something that integrates nicely with the core, instead of
>>>>> making hugetlb more complicated and even more special.
>>>>>
> 
> Certainly an ideal would be where we could support everybody's use cases
> in a much more cohesive way with the rest of the core MM.  I'm
> particularly concerned about how long it will take to get to that state
> even if we had kernel developers committed to doing the work.  Even if we
> had a design for this new subsystem that was more tightly coupled with the
> core MM, it would take O(years) to implement, test, extend for other
> architectures, and that's before any existing users of hugetlb could
> make the changes in the rest of their software stack to support it.

One interesting experiment would be to just take hugetlb and remove all 
complexity (strip it to its core: a pool of large pages without 
special MAP_PRIVATE support, PMD sharing, reservations, ...). Then, see 
how to get core-mm to just treat them like PUD/PMD-mapped folios that 
can get PTE-mapped -- just like we have with FS-level THP.

Maybe we could then factor out what's shared with the old hugetlb 
implementations (e.g., pooling) and have both co-exist (e.g., configured 
at runtime).

The user-space interface for hugetlb would not change (well, except for 
failing MAP_PRIVATE for now)

(especially, no messing with anon hugetlb pages)


Again, the spirit would be "teach the core to just treat them like 
folios that can get PTE-mapped" instead of "add HGM to hugetlb". If we 
can achieve that without a hugetlb v2, great. But I think that will be 
harder ... but I might just be wrong.

-- 
Cheers,

David / dhildenb

