linux-mm.kvack.org archive mirror
* [LSF/MM/BPF TOPIC] Hugetlb Unifications
@ 2024-02-22  8:50 Peter Xu
  2024-02-22 20:36 ` Frank van der Linden
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Peter Xu @ 2024-02-22  8:50 UTC (permalink / raw)
  To: lsf-pc; +Cc: linux-mm, James Houghton, Muchun Song

I want to propose a session to discuss how we should unify hugetlb into
core mm.

Due to legacy reasons, hugetlb has plenty of its own code paths plugged
into core mm, making it even more special than shmem.  While it is a
pretty decent and useful file system, efficient at supporting large &
statically allocated chunks of memory, it also adds a maintenance burden
due to its own specific code paths being spread all over the place.

It has grown into a bit of a mess, and it is messy enough to have become
a reason not to accept major new features, like last year's proposal to
map hugetlb pages in smaller sizes [1].

We all seem to agree that something needs to be done about hugetlb, but
it is still not clear what exactly; people forget about it and move on,
until they hit it again.  The problem doesn't go away by itself even if
nobody asks.

Is it worthwhile to spend time on such work?  Do we really need a fresh
new hugetlb-v2 just to accept new features?  What exactly needs to be
generalized for hugetlb?  Is huge_pte_offset() the culprit, or is it
something else?  To what extent is hugetlb free to accept new features?

The goal of such a session is to get clearer answers to the above
questions.

[1] https://lore.kernel.org/r/20230306191944.GA15773@monkey

Thanks,

-- 
Peter Xu




* Re: [LSF/MM/BPF TOPIC] Hugetlb Unifications
  2024-02-22  8:50 [LSF/MM/BPF TOPIC] Hugetlb Unifications Peter Xu
@ 2024-02-22 20:36 ` Frank van der Linden
  2024-02-22 22:21   ` Matthew Wilcox
  2024-02-22 22:16 ` Pasha Tatashin
  2024-03-01  1:37 ` James Houghton
  2 siblings, 1 reply; 13+ messages in thread
From: Frank van der Linden @ 2024-02-22 20:36 UTC (permalink / raw)
  To: Peter Xu; +Cc: lsf-pc, linux-mm, James Houghton, Muchun Song

+1 on this topic, thanks for bringing it up. This needs to be
resolved, or it'll keep hanging over any future hugetlb feature
changes.

To me, it makes sense to have hugetlb pages themselves just be
large folios as much as possible. On top of that, there could be a
notion of physical memory pools with certain properties. The
properties can be things like: size, evictability, migratability,
possibly persistence across reboots, maybe "should not be in the
direct map", like memfd_secret. hugetlbfs then could be expressed as a
filesystem on top of a pool of, for example, 1G non-evictable pages.
The pools themselves could have a memfd-like interface (or use memfd
itself), and could also be used to hook in to things like KVM
guestmemfd.
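
As a purely hypothetical illustration of that direction (none of the
names or flags below exist today; this is only meant to show the rough
shape such a pool interface could take):

/* Hypothetical API -- nothing below exists in the kernel today. */
#define MEMPOOL_HUGE_1G		(1UL << 0)
#define MEMPOOL_NONEVICTABLE	(1UL << 1)
#define MEMPOOL_NO_DIRECT_MAP	(1UL << 2)	/* like memfd_secret */
#define MEMPOOL_PERSISTENT	(1UL << 3)	/* survives reboot/kexec */

/* Create a 64G pool of non-evictable 1G pages and get an fd back;
 * hugetlbfs (v2), guest_memfd, etc. would then allocate from it. */
int pool = memfd_pool_create("vm-pool", 64UL << 30,
			     MEMPOOL_HUGE_1G | MEMPOOL_NONEVICTABLE);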

So yes, that would be a hugetlb v2, but mainly as a backward
compatible layer on top of something more generic.

- Frank

On Thu, Feb 22, 2024 at 12:50 AM Peter Xu <peterx@redhat.com> wrote:
>
> I want to propose a session to discuss how we should unify hugetlb into
> core mm.
>
> Due to legacy reasons, hugetlb has plenty of its own code paths plugged
> into core mm, making it even more special than shmem.  While it is a
> pretty decent and useful file system, efficient at supporting large &
> statically allocated chunks of memory, it also adds a maintenance burden
> due to its own specific code paths being spread all over the place.
>
> It has grown into a bit of a mess, and it is messy enough to have become
> a reason not to accept major new features, like last year's proposal to
> map hugetlb pages in smaller sizes [1].
>
> We all seem to agree that something needs to be done about hugetlb, but
> it is still not clear what exactly; people forget about it and move on,
> until they hit it again.  The problem doesn't go away by itself even if
> nobody asks.
>
> Is it worthwhile to spend time on such work?  Do we really need a fresh
> new hugetlb-v2 just to accept new features?  What exactly needs to be
> generalized for hugetlb?  Is huge_pte_offset() the culprit, or is it
> something else?  To what extent is hugetlb free to accept new features?
>
> The goal of such a session is to get clearer answers to the above
> questions.
>
> [1] https://lore.kernel.org/r/20230306191944.GA15773@monkey
>
> Thanks,
>
> --
> Peter Xu
>
>



* Re: [LSF/MM/BPF TOPIC] Hugetlb Unifications
  2024-02-22  8:50 [LSF/MM/BPF TOPIC] Hugetlb Unifications Peter Xu
  2024-02-22 20:36 ` Frank van der Linden
@ 2024-02-22 22:16 ` Pasha Tatashin
  2024-02-22 22:31   ` Matthew Wilcox
  2024-03-01  1:37 ` James Houghton
  2 siblings, 1 reply; 13+ messages in thread
From: Pasha Tatashin @ 2024-02-22 22:16 UTC (permalink / raw)
  To: Peter Xu; +Cc: lsf-pc, linux-mm, James Houghton, Muchun Song

On Thu, Feb 22, 2024 at 3:50 AM Peter Xu <peterx@redhat.com> wrote:
>
> I want to propose a session to discuss how we should unify hugetlb into
> core mm.

Or moving the interface part into a driver?

>
> Due to legacy reasons, hugetlb has plenty of its own code paths plugged
> into core mm, making it even more special than shmem.  While it is a
> pretty decent and useful file system, efficient at supporting large &
> statically allocated chunks of memory, it also adds a maintenance burden
> due to its own specific code paths being spread all over the place.
>
> It has grown into a bit of a mess, and it is messy enough to have become
> a reason not to accept major new features, like last year's proposal to
> map hugetlb pages in smaller sizes [1].
>
> We all seem to agree that something needs to be done about hugetlb, but
> it is still not clear what exactly; people forget about it and move on,
> until they hit it again.  The problem doesn't go away by itself even if
> nobody asks.
>
> Is it worthwhile to spend time on such work?  Do we really need a fresh
> new hugetlb-v2 just to accept new features?  What exactly needs to be
> generalized for hugetlb?  Is huge_pte_offset() the culprit, or is it
> something else?  To what extent is hugetlb free to accept new features?
>
> The goal of such a session is to get clearer answers to the above
> questions.
>
> [1] https://lore.kernel.org/r/20230306191944.GA15773@monkey

+1 I am interested in this discussion. Unification is indeed useful,
and the core MM should ideally support only one kind of huge page.
However, we must also ensure compatibility with the interfaces and
features unique to hugetlb, such as boot-time reservation and vmemmap
optimizations. Generalizing these features could potentially lead to
memory savings in THP as well.
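
For reference, the userspace contract that any generalization would have
to preserve looks roughly like this (a minimal sketch; the pool must
already have been populated via hugepagesz=1G hugepages=N at boot or the
nr_hugepages sysctl/sysfs knobs, and error handling is kept minimal):

#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB	(30 << 26)	/* 30 << MAP_HUGE_SHIFT */
#endif

int main(void)
{
	size_t len = 1UL << 30;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
		       -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");		/* typically: no 1G pages reserved */
		return 1;
	}
	((volatile char *)p)[0] = 1;	/* fault in one pooled 1G page */
	munmap(p, len);
	return 0;
}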

Pasha



* Re: [LSF/MM/BPF TOPIC] Hugetlb Unifications
  2024-02-22 20:36 ` Frank van der Linden
@ 2024-02-22 22:21   ` Matthew Wilcox
  0 siblings, 0 replies; 13+ messages in thread
From: Matthew Wilcox @ 2024-02-22 22:21 UTC (permalink / raw)
  To: Frank van der Linden
  Cc: Peter Xu, lsf-pc, linux-mm, James Houghton, Muchun Song

On Thu, Feb 22, 2024 at 12:36:40PM -0800, Frank van der Linden wrote:
> To me, it makes sense to have hugetlb pages themselves just be
> large folios as much as possible.

Well, they are.  That isn't the concern:

$ git grep -c 'struct page' mm/hugetlb.c fs/hugetlbfs
fs/hugetlbfs/inode.c:5
mm/hugetlb.c:30
$ git grep -c 'struct folio' mm/hugetlb.c fs/hugetlbfs
fs/hugetlbfs/inode.c:10
mm/hugetlb.c:97

(further patches to convert pages to folios are welcome)

> On top of that, there could be a
> notion of physical memory pools with certain properties. The
> properties can be things like: size, evictability, migratability,
> possibly persistence across reboots, maybe "should not be in the
> direct map", like memfd_secret. hugetlbfs then could be expressed as a
> filesystem on top of a pool of, for example, 1G non-evictable pages.
> The pools themselves could have a memfd-like interface (or use memfd
> itself), and could also be used to hook in to things like KVM
> guestmemfd.
> 
> So yes, that would be a hugetlb v2, but mainly as a backward
> compatible layer on top of something more generic.

"Those who do not understand what hugetlbfs provides are condemned to
reinvent it badly".  We're really going to miss Mike this year.

The most important thing (to my mind) that it provides is shared page
tables.  It does it badly, hence the mshare proposal.  I don't think
we can make progress on a hugetlbfs2 until we have some mechanism in
the MM to share page tables (as using hugetlbfs2 will regress Certain
Important Workloads that pay my salary).

Another piece of the puzzle is reserved pages.  I have not investigated
this area at all, and so I don't know if the current mechanism in
hugetlbfs is a good one, how it could be hoisted to the MM layer, or
reimplemented in a hugetlbfs2.  Maybe this is where your mshare idea
comes in.

Peter's patches have been focused on removing some of the special casing
of hugetlbfs in the generic MM.  I think this is a great idea!  While I
haven't been actively reviewing those patches as they often touch areas
I'm not an expert in, I'm in favour of them going in.



* Re: [LSF/MM/BPF TOPIC] Hugetlb Unifications
  2024-02-22 22:16 ` Pasha Tatashin
@ 2024-02-22 22:31   ` Matthew Wilcox
  2024-02-22 22:58     ` Pasha Tatashin
  0 siblings, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2024-02-22 22:31 UTC (permalink / raw)
  To: Pasha Tatashin; +Cc: Peter Xu, lsf-pc, linux-mm, James Houghton, Muchun Song

On Thu, Feb 22, 2024 at 05:16:44PM -0500, Pasha Tatashin wrote:
> However, we must also ensure compatibility with the interfaces and
> features unique to hugetlb, such as boot-time reservation and vmemmap
> optimizations. Generalizing these features could potentially lead to
> memory savings in THP as well.

In a memdesc world, how much value is there in vmemmap?  At 8 bytes per
page, memmap occupies 4kB for 2MB and 2MB for 1GB.  So there's no way
to save memory for a 2MB allocation, and saving 2MB per 1GB page is
... not a huge win any more.  Let's say you have a 64GB machine with
50GB tied up in 1GB pages, we'll end up saving 100MB on a 64GB machine
which doesn't seem all that compelling?

I do have a proposal for further compressing memmap, but it requires
doing memdesc first, so I'm reluctant to discuss it before we've done
memdescs.  I have to have something to talk about at LSFMM'26 after all.



* Re: [LSF/MM/BPF TOPIC] Hugetlb Unifications
  2024-02-22 22:31   ` Matthew Wilcox
@ 2024-02-22 22:58     ` Pasha Tatashin
  0 siblings, 0 replies; 13+ messages in thread
From: Pasha Tatashin @ 2024-02-22 22:58 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Peter Xu, lsf-pc, linux-mm, James Houghton, Muchun Song

On Thu, Feb 22, 2024 at 5:32 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Thu, Feb 22, 2024 at 05:16:44PM -0500, Pasha Tatashin wrote:
> > However, we must also ensure compatibility with the interfaces and
> > features unique to hugetlb, such as boot-time reservation and vmemmap
> > optimizations. Generalizing these features could potentially lead to
> > memory savings in THP as well.
>
> In a memdesc world, how much value is there in vmemmap?  At 8 bytes per
> page, memmap occupies 4kB for 2MB and 2MB for 1GB.  So there's no way
> to save memory for a 2MB allocation, and saving 2MB per 1GB page is
> ... not a huge win any more.  Let's say you have a 64GB machine with
> 50GB tied up in 1GB pages, we'll end up saving 100MB on a 64GB machine
> which doesn't seem all that compelling?

I agree that memdesc may eliminate the need for vmemmap optimizations.
However, do we want to introduce regressions before transitioning to a
memdesc world? Companies are currently saving petabytes of memory with
vmemmap optimizations.

> I do have a proposal for further compressing memmap, but it requires
> doing memdesc first, so I'm reluctant to discuss it before we've done
> memdescs.  I have to have something to talk about at LSFMM'26 after all.



* Re: [LSF/MM/BPF TOPIC] Hugetlb Unifications
  2024-02-22  8:50 [LSF/MM/BPF TOPIC] Hugetlb Unifications Peter Xu
  2024-02-22 20:36 ` Frank van der Linden
  2024-02-22 22:16 ` Pasha Tatashin
@ 2024-03-01  1:37 ` James Houghton
  2024-03-01  3:11   ` Peter Xu
  2024-03-01  4:29   ` Matthew Wilcox
  2 siblings, 2 replies; 13+ messages in thread
From: James Houghton @ 2024-03-01  1:37 UTC (permalink / raw)
  To: Peter Xu; +Cc: lsf-pc, linux-mm, Muchun Song

On Thu, Feb 22, 2024 at 12:50 AM Peter Xu <peterx@redhat.com> wrote:
>
> I want to propose a session to discuss how we should unify hugetlb into
> core mm.
>
> Due to legacy reasons, hugetlb has plenty of its own code paths plugged
> into core mm, making it even more special than shmem.  While it is a
> pretty decent and useful file system, efficient at supporting large &
> statically allocated chunks of memory, it also adds a maintenance burden
> due to its own specific code paths being spread all over the place.

Thank you for proposing this topic. HugeTLB is very useful (1G
mappings, guaranteed hugepages, saving struct page overhead, shared
page tables), but it is special in ways that make it a headache to
modify (and making it harder to work on other mm features).

I haven't been able to spend much time with HugeTLB since the LSFMM
talk last year, so I'm not much of an expert anymore. But I'll give my
two cents anyway.

> It has grown into a bit of a mess, and it is messy enough to have become
> a reason not to accept major new features, like last year's proposal to
> map hugetlb pages in smaller sizes [1].
>
> We all seem to agree that something needs to be done about hugetlb, but
> it is still not clear what exactly; people forget about it and move on,
> until they hit it again.  The problem doesn't go away by itself even if
> nobody asks.
>
> Is it worthwhile to spend time on such work?  Do we really need a fresh
> new hugetlb-v2 just to accept new features?  What exactly needs to be
> generalized for hugetlb?  Is huge_pte_offset() the culprit, or is it
> something else?  To what extent is hugetlb free to accept new features?

I think the smaller unification that has been done so far is great
(thank you!!), but at some point additional unification will require a
pretty heavy lift. Trying to enumerate some possible challenges:

What does HugeTLB do differently than main mm?
- Page table walking, huge_pte_offset/etc., of course.
- "huge_pte" as a concept (type-erased p?d_t), though it shares its
type with pte_t.
- Completely different page fault path (hugetlbfs doesn't implement
vm_ops->{huge_,}fault).
- mapcount
- Reservation/MAP_NORESERVE
- HWPoison handling
- Synchronization (hugetlb_fault_mutex_table, VMA lock for PMD sharing)
- more...

What does HugeTLB do that main mm doesn't do?
- It keeps pools of hugepages that cannot be used for anything else.
- It has PMD sharing (which can hopefully be replaced with mshare())
- It has HVO (which can hopefully be dropped in a memdesc world)
- more...?

Page table sharing and HVO are both important, but they're not
fundamental to HugeTLB, so it's not impossible to make progress on
drastic cleanup without them.
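
To make the page-table-walking difference above concrete, this is roughly
how the two lookups compare today (simplified; locking and the various
CONFIG_* combinations are omitted):

	/* Generic mm: walk the full hierarchy one level at a time. */
	pgd_t *pgd = pgd_offset(mm, addr);
	p4d_t *p4d = p4d_offset(pgd, addr);
	pud_t *pud = pud_offset(p4d, addr);
	pmd_t *pmd = pmd_offset(pud, addr);
	pte_t *pte = pte_offset_map(pmd, addr);

	/* hugetlb: a single type-erased lookup keyed on the VMA's huge page
	 * size; the returned pte_t may really be a PMD or PUD entry. */
	pte_t *hpte = huge_pte_offset(mm, addr,
				      huge_page_size(hstate_vma(vma)));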

No matter what, we'll need to add (more) PUD support into the main mm,
so we could start with that, though it won't be easy. Then we would
need at least...

  (1) ...a filesystem that implements huge_fault for PUDs

It's not inconceivable to add support for this in shmem (where 1G
pages are allocated -- perhaps ahead of time -- with CMA, maybe?).
This could be done in hugetlbfs, but then you'd have to make sure that
the huge_fault implementation stays compatible with everything else in
hugetlb/hugetlbfs, perhaps making incremental progress difficult. Or
you could create hugetlbfs-v2. I'm honestly not sure which of these is
the least difficult -- probably the shmem route?

  (2) ...a mapcount (+refcount) system that works for PUD mappings.

This discussion has progressed a lot since I last thought about it;
I'll let the experts figure this one out[1].
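
(To sketch what (1) might look like: the htlb2_*() names below are made
up, and the ->huge_fault prototype has changed across kernel versions, so
treat this as pseudo-code for the shape of the hook rather than a real
patch.)

/* A PUD-sized ->huge_fault handler for a hypothetical hugetlbfs-v2 or
 * shmem-1G backend.  Returning VM_FAULT_FALLBACK tells core mm to retry
 * the fault at the next smaller level, the way DAX does today. */
static vm_fault_t htlb2_huge_fault(struct vm_fault *vmf, unsigned int order)
{
	if (order != PUD_SHIFT - PAGE_SHIFT)
		return VM_FAULT_FALLBACK;

	/* Hypothetical helper: find or reserve a 1G folio in the backing
	 * pool and install it as a single PUD mapping. */
	return htlb2_install_pud_folio(vmf);
}

static const struct vm_operations_struct htlb2_vm_ops = {
	.fault		= htlb2_fault,		/* base-page path, not shown */
	.huge_fault	= htlb2_huge_fault,
};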


Anyway, I'm oversimplifying things, and it's been a while since I've
thought hard about this, so please take this all with a grain of salt.
The main motivating use-case for HGM (to allow for post-copy live
migration of HugeTLB-1G-backed VMs with userfaultfd) can be solved in
other ways[2].

> The goal of such a session is to get clearer answers to the above
> questions.

I hope we can land on a clear answer this year. :)

- James

[1]: https://lore.kernel.org/linux-mm/049e4674-44b6-4675-b53b-62e11481a7ce@redhat.com/
[2]: https://lore.kernel.org/kvm/CALzav=d23P5uE=oYqMpjFohvn0CASMJxXB_XEOEi-jtqWcFTDA@mail.gmail.com/



* Re: [LSF/MM/BPF TOPIC] Hugetlb Unifications
  2024-03-01  1:37 ` James Houghton
@ 2024-03-01  3:11   ` Peter Xu
  2024-03-06 23:24     ` James Houghton
  2024-03-01  4:29   ` Matthew Wilcox
  1 sibling, 1 reply; 13+ messages in thread
From: Peter Xu @ 2024-03-01  3:11 UTC (permalink / raw)
  To: James Houghton; +Cc: lsf-pc, linux-mm, Muchun Song

Hey, James,

On Thu, Feb 29, 2024 at 05:37:23PM -0800, James Houghton wrote:
> On Thu, Feb 22, 2024 at 12:50 AM Peter Xu <peterx@redhat.com> wrote:
> >
> > I want to propose a session to discuss how we should unify hugetlb into
> > core mm.
> >
> > Due to legacy reasons, hugetlb has plenty of its own code paths plugged
> > into core mm, making it even more special than shmem.  While it is a
> > pretty decent and useful file system, efficient at supporting large &
> > statically allocated chunks of memory, it also adds a maintenance burden
> > due to its own specific code paths being spread all over the place.
> 
> Thank you for proposing this topic. HugeTLB is very useful (1G
> mappings, guaranteed hugepages, saving struct page overhead, shared
> page tables), but it is special in ways that make it a headache to
> modify (and making it harder to work on other mm features).
> 
> I haven't been able to spend much time with HugeTLB since the LSFMM
> talk last year, so I'm not much of an expert anymore. But I'll give my
> two cents anyway.
> 
> > It has grown into a bit of a mess, and it is messy enough to have become
> > a reason not to accept major new features, like last year's proposal to
> > map hugetlb pages in smaller sizes [1].
> >
> > We all seem to agree that something needs to be done about hugetlb, but
> > it is still not clear what exactly; people forget about it and move on,
> > until they hit it again.  The problem doesn't go away by itself even if
> > nobody asks.
> >
> > Is it worthwhile to spend time on such work?  Do we really need a fresh
> > new hugetlb-v2 just to accept new features?  What exactly needs to be
> > generalized for hugetlb?  Is huge_pte_offset() the culprit, or is it
> > something else?  To what extent is hugetlb free to accept new features?
> 
> I think the smaller unification that has been done so far is great
> (thank you!!), but at some point additional unification will require a
> pretty heavy lift. Trying to enumerate some possible challenges:
> 
> What does HugeTLB do differently than main mm?
> - Page table walking, huge_pte_offset/etc., of course.
> - "huge_pte" as a concept (type-erased p?d_t), though it shares its
> type with pte_t.
> - Completely different page fault path (hugetlbfs doesn't implement
> vm_ops->{huge_,}fault).
> - mapcount
> - Reservation/MAP_NORESERVE
> - HWPoison handling
> - Synchronization (hugetlb_fault_mutex_table, VMA lock for PMD sharing)
> - more...
> 
> What does HugeTLB do that main mm doesn't do?
> - It keeps pools of hugepages that cannot be used for anything else.
> - It has PMD sharing (which can hopefully be replaced with mshare())
> - It has HVO (which can hopefully be dropped in a memdesc world)
> - more...?
> 
> Page table sharing and HVO are both important, but they're not
> fundamental to HugeTLB, so it's not impossible to make progress on
> drastic cleanup without them.
> 
> No matter what, we'll need to add (more) PUD support into the main mm,
> so we could start with that, though it won't be easy. Then we would
> need at least...
> 
>   (1) ...a filesystem that implements huge_fault for PUDs
> 
> It's not inconceivable to add support for this in shmem (where 1G
> pages are allocated -- perhaps ahead of time -- with CMA, maybe?).
> This could be done in hugetlbfs, but then you'd have to make sure that
> the huge_fault implementation stays compatible with everything else in
> hugetlb/hugetlbfs, perhaps making incremental progress difficult. Or
> you could create hugetlbfs-v2. I'm honestly not sure which of these is
> the least difficult -- probably the shmem route?

IMHO the hugetlb fault path can be the last thing to tackle; there seem to
be other lower hanging fruits that are good candidates for such
unification work.

For example, what if we can reduce the customized hugetlb paths from 20 ->
2, where the customized fault() is 1 of the 2?  To further reduce those 2
paths we may need a new file system, but if it's good enough maybe we
don't need a v2, at least not for someone looking for a cleanup: a v2 is
better suited to someone who can properly define the new interface first,
and it would be much more work than a unification effort, and somewhat
orthogonal to it.

> 
>   (2) ...a mapcount (+refcount) system that works for PUD mappings.
> 
> This discussion has progressed a lot since I last thought about it;
> I'll let the experts figure this one out[1].

I hope there will be a solid answer there.

Otherwise IIRC the last plan was to use 1 mapcount for anything mapped
underneath.  I still think it's a good plan, which may not apply to mTHP
but could be perfectly efficient & simple for hugetlb.  The complexity
lies elsewhere, not in the counting itself, but I have a feeling it's
still a workable solution.

> 
> Anyway, I'm oversimplifying things, and it's been a while since I've
> thought hard about this, so please take this all with a grain of salt.
> The main motivating use-case for HGM (to allow for post-copy live
> migration of HugeTLB-1G-backed VMs with userfaultfd) can be solved in
> other ways[2].

Do you know how far David went in that direction?  When will there be a
prototype?  Would it easily work with MISSING faults (not MINOR)?

I will be more than happy to see whatever solution comes up from the
kernel that resolves that pain for VMs first.  It's unfortunate that KVM
will have its own solution for hugetlb small mappings, but I also
understand there's more than one demand for it besides hugetlb on 1G (even
though I'm not 100% sure of that demand when I think about it again today:
is the worry that the pgtable pages will take a lot of space when trapping
minor faults?  I haven't yet had time to revisit David's proposal in the
past two months; nor do I think I fully digested the details back then).

The answer to the above could also help me prioritize my work; e.g.,
hugetlb unification is probably something we should do regardless, at
least for the sake of a healthy mm code base.  I plan to move HGM, or
whatever it will be called, upstream if necessary, but that also depends
on how fast the other project goes, as personally I don't worry about
hugetlb hwpoison yet (at least QEMU's hwpoison handling is still pretty
much broken.. which is pretty unfortunate), but maybe any serious cloud
provider should still care.

> 
> > The goal of such a session is to get clearer answers to the above
> > questions.
> 
> I hope we can land on a clear answer this year. :)

Yes. :)  Thanks for the write-up and summary.

> 
> - James
> 
> [1]: https://lore.kernel.org/linux-mm/049e4674-44b6-4675-b53b-62e11481a7ce@redhat.com/
> [2]: https://lore.kernel.org/kvm/CALzav=d23P5uE=oYqMpjFohvn0CASMJxXB_XEOEi-jtqWcFTDA@mail.gmail.com/
> 

-- 
Peter Xu




* Re: [LSF/MM/BPF TOPIC] Hugetlb Unifications
  2024-03-01  1:37 ` James Houghton
  2024-03-01  3:11   ` Peter Xu
@ 2024-03-01  4:29   ` Matthew Wilcox
  2024-03-01  6:51     ` Muchun Song
  1 sibling, 1 reply; 13+ messages in thread
From: Matthew Wilcox @ 2024-03-01  4:29 UTC (permalink / raw)
  To: James Houghton; +Cc: Peter Xu, lsf-pc, linux-mm, Muchun Song

On Thu, Feb 29, 2024 at 05:37:23PM -0800, James Houghton wrote:
> - It has HVO (which can hopefully be dropped in a memdesc world)

I've spent a bit of time thinking about this.  I'll keep this x86-64
specific just to have concrete numbers.

Currently a 2MB htlb page without HVO occupies 64 * 512 = 32kB.  With HVO,
it's reduced to 8kB.  A 1GB htlb page occupies 64 * 256k = 16MB, with HVO,
it's still 8kB (right?)

In a memdesc world, a 2MB page without HVO consumes 8 * 512 = 4kB.
There's no room for savings here.  But a 1GB page takes 8 * 256k = 2MB.
There's still almost 2MB of savings to be had here, so I suspect some
people will still want it.
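
For anyone who wants to play with these numbers, a back-of-the-envelope
helper (assuming 4k base pages, 64 bytes per struct page today, and an
8-byte memdesc pointer per page):

#include <stdio.h>

int main(void)
{
	const unsigned long base = 4096, spage = 64, memdesc = 8;
	const unsigned long sizes[] = { 2UL << 20, 1UL << 30 };

	for (int i = 0; i < 2; i++) {
		unsigned long n = sizes[i] / base;	/* base pages per huge page */

		printf("%5luMB huge page: %8lu kB of struct page, %6lu kB of memdescs\n",
		       sizes[i] >> 20, n * spage >> 10, n * memdesc >> 10);
	}
	/* Prints 32 kB vs 4 kB for 2MB, and 16384 kB vs 2048 kB for 1GB. */
	return 0;
}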

Hopefully Yu Zhao's zone proposal lets us enable HVO for THP.  At least
1GB ones.

I do have a proposal to turn memmap into a much more dynamic data structure
where we'd go from a fixed 8 bytes per page to around 16 bytes per
allocation.  But it depends on memdescs working first, and we haven't
demonstrated that yet, so it's not worth talking about.  It's much more
complicated than 8 bytes per page, so it may not be worth doing.



* Re: [LSF/MM/BPF TOPIC] Hugetlb Unifications
  2024-03-01  4:29   ` Matthew Wilcox
@ 2024-03-01  6:51     ` Muchun Song
  2024-03-01 16:44       ` David Hildenbrand
  0 siblings, 1 reply; 13+ messages in thread
From: Muchun Song @ 2024-03-01  6:51 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: James Houghton, Peter Xu, lsf-pc, linux-mm



> On Mar 1, 2024, at 12:29, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Thu, Feb 29, 2024 at 05:37:23PM -0800, James Houghton wrote:
>> - It has HVO (which can hopefully be dropped in a memdesc world)
> 
> I've spent a bit of time thinking about this.  I'll keep this x86-64
> specific just to have concrete numbers.
> 
> Currently a 2MB htlb page without HVO occupies 64 * 512 = 32kB.  With HVO,
> it's reduced to 8kB.  A 1GB htlb page occupies 64 * 256k = 16MB, with HVO,
> it's still 8kB (right?)

Correct in the past. In the first version, HVO needed 2 pages (8k) for the
vmemmap; now it needs only one page (4k), whatever the huge page size
(2MB or 1GB).

> 
> In a memdesc world, a 2MB page without HVO consumes 8 * 512 = 4kB.
> There's no room for savings here.  But a 1GB page takes 8 * 256k = 2MB.
> There's still almost 2MB of savings to be had here, so I suspect some
> people will still want it.

Agree. With 2MB pages, there is no savings with HVO, but it saves a lot
for 1GB huge pages.

> 
> Hopefully Yu Zhao's zone proposal lets us enable HVO for THP.  At least
> 1GB ones.

Hopefully see it.

Thanks.

> 
> I do have a proposal to turn memmap into a much more dynamic data structure
> where we'd go from a fixed 8 bytes per page to around 16 bytes per
> allocation.  But it depends on memdescs working first, and we haven't
> demonstrated that yet, so it's not worth talking about.  It's much more
> complicated than 8 bytes per page, so it may not be worth doing.




* Re: [LSF/MM/BPF TOPIC] Hugetlb Unifications
  2024-03-01  6:51     ` Muchun Song
@ 2024-03-01 16:44       ` David Hildenbrand
  0 siblings, 0 replies; 13+ messages in thread
From: David Hildenbrand @ 2024-03-01 16:44 UTC (permalink / raw)
  To: Muchun Song, Matthew Wilcox; +Cc: James Houghton, Peter Xu, lsf-pc, linux-mm

On 01.03.24 07:51, Muchun Song wrote:
> 
> 
>> On Mar 1, 2024, at 12:29, Matthew Wilcox <willy@infradead.org> wrote:
>>
>> On Thu, Feb 29, 2024 at 05:37:23PM -0800, James Houghton wrote:
>>> - It has HVO (which can hopefully be dropped in a memdesc world)
>>
>> I've spent a bit of time thinking about this.  I'll keep this x86-64
>> specific just to have concrete numbers.
>>
>> Currently a 2MB htlb page without HVO occupies 64 * 512 = 32kB.  With HVO,
>> it's reduced to 8kB.  A 1GB htlb page occupies 64 * 256k = 16MB, with HVO,
>> it's still 8kB (right?)
> 
> Correct in the past. In the first version, HVO needed 2 pages (8k) for the
> vmemmap; now it needs only one page (4k), whatever the huge page size
> (2MB or 1GB).
> 
>>
>> In a memdesc world, a 2MB page without HVO consumes 8 * 512 = 4kB.
>> There's no room for savings here.  But a 1GB page takes 8 * 256k = 2MB.
>> There's still almost 2MB of savings to be had here, so I suspect some
>> people will still want it.
> 
> Agree. With 2MB pages, there is no savings with HVO, but it saves a lot
> for 1GB huge pages.
> 
>>
>> Hopefully Yu Zhao's zone proposal lets us enable HVO for THP.  At least
>> 1GB ones.
> 
> Hopefully see it.

What's the biggest blocker regarding HVO+THP?

I can imagine the following two:

1) PMD->PTE remapping currently always has to work. Once we have PTE 
mappings we would try writing per-page subpage + PAE, which we can't.

2) THP split + freeing would require allocating memory to remap the 
vmemmap. Split can fail for other reasons already, but the freeing side 
is nasty. But, if everything fails, we could use memory from the THP 
itself when handing it back to the buddy (suboptimal, but removes that 
corner-case concern).

Likely there are other page flags (MCE) that also need care, but at 
least for hugetlb we seem to have figured that out.

-- 
Cheers,

David / dhildenb




* Re: [LSF/MM/BPF TOPIC] Hugetlb Unifications
  2024-03-01  3:11   ` Peter Xu
@ 2024-03-06 23:24     ` James Houghton
  2024-03-07 10:06       ` Peter Xu
  0 siblings, 1 reply; 13+ messages in thread
From: James Houghton @ 2024-03-06 23:24 UTC (permalink / raw)
  To: Peter Xu; +Cc: lsf-pc, linux-mm, Muchun Song

On Thu, Feb 29, 2024 at 7:11 PM Peter Xu <peterx@redhat.com> wrote:
>
> Hey, James,
>
> On Thu, Feb 29, 2024 at 05:37:23PM -0800, James Houghton wrote:
> > No matter what, we'll need to add (more) PUD support into the main mm,
> > so we could start with that, though it won't be easy. Then we would
> > need at least...
> >
> >   (1) ...a filesystem that implements huge_fault for PUDs
> >
> > It's not inconceivable to add support for this in shmem (where 1G
> > pages are allocated -- perhaps ahead of time -- with CMA, maybe?).
> > This could be done in hugetlbfs, but then you'd have to make sure that
> > the huge_fault implementation stays compatible with everything else in
> > hugetlb/hugetlbfs, perhaps making incremental progress difficult. Or
> > you could create hugetlbfs-v2. I'm honestly not sure which of these is
> > the least difficult -- probably the shmem route?
>
> IMHO the hugetlb fault path can be the last thing to tackle; there seem to
> be other lower hanging fruits that are good candidates for such
> unification work.
>
> For example, what if we can reduce the customized hugetlb paths from 20 ->
> 2, where the customized fault() is 1 of the 2?  To further reduce those 2
> paths we may need a new file system, but if it's good enough maybe we
> don't need a v2, at least not for someone looking for a cleanup: a v2 is
> better suited to someone who can properly define the new interface first,
> and it would be much more work than a unification effort, and somewhat
> orthogonal to it.

This is a fine approach to take. At the same time, I think the
separate fault path is the most important difference between hugetlb
and main mm, so if we're doing a bunch of work to unify hugetlb with
mm (like, 20 -> 2 special paths), it'd be kind of a shame not to go
all the way. But I'm not exactly doing the work here. :)

(The other huge piece that I'd want unified is the huge_pte
architecture-specific functions, that's probably #2 on my list.)

> >
> >   (2) ...a mapcount (+refcount) system that works for PUD mappings.
> >
> > This discussion has progressed a lot since I last thought about it;
> > I'll let the experts figure this one out[1].
>
> I hope there will be a solid answer there.
>
> Otherwise IIRC the last plan was to use 1 mapcount for anything mapped
> underneath.  I still think it's a good plan, which may not apply to mTHP
> but could be perfectly efficient & simple for hugetlb.  The complexity
> lies elsewhere, not in the counting itself, but I have a feeling it's
> still a workable solution.
>
> >
> > Anyway, I'm oversimplifying things, and it's been a while since I've
> > thought hard about this, so please take this all with a grain of salt.
> > The main motivating use-case for HGM (to allow for post-copy live
> > migration of HugeTLB-1G-backed VMs with userfaultfd) can be solved in
> > other ways[2].
>
> Do you know how far David went in that direction?  When will there be a
> prototype?  Would it easily work with MISSING faults (not MINOR)?

A prototype will come eventually. :)

It's valid for a user to use KVM-based demand paging with userfaultfd,
MISSING or MINOR. For MISSING, you could do:
- Upon getting a KVM fault, KVM_RUN will exit to userspace.
- Fetch the page, install it with UFFDIO_COPY, then mark the page as
present with KVM.

KVM-based demand paging is redundant with userfaultfd in this case though.

With minor faults, the equivalent approach would be:
- Map memory twice. Register one with userfaultfd. The other ("alias
mapping") will be used to install memory.
- Use the userfaultfd-registered mapping to build the KVM memslots.
- Upon getting a KVM fault, KVM_RUN will exit.
- Fetch the page, install it by copying it into the alias mapping,
then UFFDIO_CONTINUE the KVM mapping, then mark the page as present
with KVM.

We can be a little more efficient with MINOR faults, provided we're
confident that KVM-based demand paging works properly:
- Map memory twice. Register one with userfaultfd.
- Give KVM the alias mapping, so we won't get userfaults on it. All
other components get the userfaultfd-registered mapping.
- KVM_RUN exits to userspace.
- Fetch the page, install it in the pagecache. Mark it as present with KVM.
- If other components get userfaults, fetch the page (if it needs to
be), then UFFDIO_CONTINUE to unblock it.

Now userfaultfd and KVM-based demand paging are no longer redundant.
Furthermore, if a user can guarantee that all other components are
able to properly participate in migration without userfaultfd (i.e.,
they are explicitly aware of demand paging), then the need for
userfaultfd is removed.

This is just like KVM's own dirty logging vs. userfaultfd-wp.
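
For concreteness, the UFFDIO_CONTINUE half of the minor-fault flow above
looks roughly like this (a minimal sketch: uffd setup/registration,
threading and most error handling are omitted, and fetch_page_from_source()
is a made-up stand-in for the migration transport):

#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/userfaultfd.h>

/* Hypothetical transport hook: fill "len" bytes at "dst" from the source. */
extern void fetch_page_from_source(void *dst, unsigned long len);

/*
 * guest: the mapping registered with UFFDIO_REGISTER_MODE_MINOR;
 * alias: a second mapping of the same file, used to populate the page
 *        cache without tripping minor faults.
 */
static void handle_minor_fault(int uffd, char *guest, char *alias,
			       unsigned long pgsz)
{
	struct uffd_msg msg;
	struct uffdio_continue cont = { 0 };
	unsigned long addr;

	if (read(uffd, &msg, sizeof(msg)) < (ssize_t)sizeof(msg))
		return;
	if (msg.event != UFFD_EVENT_PAGEFAULT ||
	    !(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR))
		return;

	addr = msg.arg.pagefault.address & ~(pgsz - 1);

	/* Populate the page cache through the alias mapping... */
	fetch_page_from_source(alias + (addr - (unsigned long)guest), pgsz);

	/* ...then tell the kernel the page is present so it can map it. */
	cont.range.start = addr;
	cont.range.len = pgsz;
	if (ioctl(uffd, UFFDIO_CONTINUE, &cont))
		perror("UFFDIO_CONTINUE");
}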

>
> I will be more than happy to see whatever solution comes up from the
> kernel that resolves that pain for VMs first.  It's unfortunate that KVM
> will have its own solution for hugetlb small mappings, but I also
> understand there's more than one demand for it besides hugetlb on 1G (even
> though I'm not 100% sure of that demand when I think about it again today:
> is the worry that the pgtable pages will take a lot of space when trapping
> minor faults?  I haven't yet had time to revisit David's proposal in the
> past two months; nor do I think I fully digested the details back then).

In my view, the main motivating factor is that userfaultfd is
inherently incompatible with guest_memfd. We talked a bit about the
potential to do a file-based userfaultfd, but it's very unclear how
that would work.

But a KVM-based demand paging system would be able to help with:
- post-copy for HugeTLB pages
- reduce unnecessary work/overhead in mm (both minor faults and missing faults).

The "unnecessary" work/overhead:
- shattered mm page tables as well as shattered EPT, whereas with a
KVM-based solution, only the EPT is shattered.
- must collapse both mm page tables and EPT at the end of post-copy,
instead of only the EPT
- mm page tables are mapped during post-copy, when they could be
completely present to begin with

You could make collapsing as efficient as possible (like, if possible,
have an mmu_notifier_collapse() instead of using invalidate_start/end,
so that KVM can do the fastest possible invalidations), but we're
fundamentally doing more work with userfaultfd.

> The answer to the above could also help me prioritize my work; e.g.,
> hugetlb unification is probably something we should do regardless, at
> least for the sake of a healthy mm code base.  I plan to move HGM, or
> whatever it will be called, upstream if necessary, but that also depends
> on how fast the other project goes, as personally I don't worry about
> hugetlb hwpoison yet (at least QEMU's hwpoison handling is still pretty
> much broken.. which is pretty unfortunate), but maybe any serious cloud
> provider should still care.

My hope with the unification is that HGM almost becomes a byproduct of
that effort. :)

The hwpoison case (in my case) is also solved with a KVM-based demand
paging system: we can use it to prevent access to the page, but
instead of demand-fetching, we inject poison. (We need HugeTLB to keep
mapping the page though.)



* Re: [LSF/MM/BPF TOPIC] Hugetlb Unifications
  2024-03-06 23:24     ` James Houghton
@ 2024-03-07 10:06       ` Peter Xu
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Xu @ 2024-03-07 10:06 UTC (permalink / raw)
  To: James Houghton; +Cc: lsf-pc, linux-mm, Muchun Song

On Wed, Mar 06, 2024 at 03:24:04PM -0800, James Houghton wrote:
> On Thu, Feb 29, 2024 at 7:11 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > Hey, James,
> >
> > On Thu, Feb 29, 2024 at 05:37:23PM -0800, James Houghton wrote:
> > > No matter what, we'll need to add (more) PUD support into the main mm,
> > > so we could start with that, though it won't be easy. Then we would
> > > need at least...
> > >
> > >   (1) ...a filesystem that implements huge_fault for PUDs
> > >
> > > It's not inconceivable to add support for this in shmem (where 1G
> > > pages are allocated -- perhaps ahead of time -- with CMA, maybe?).
> > > This could be done in hugetlbfs, but then you'd have to make sure that
> > > the huge_fault implementation stays compatible with everything else in
> > > hugetlb/hugetlbfs, perhaps making incremental progress difficult. Or
> > > you could create hugetlbfs-v2. I'm honestly not sure which of these is
> > > the least difficult -- probably the shmem route?
> >
> > IMHO the hugetlb fault path can be the last thing to tackle; there seem to
> > be other lower hanging fruits that are good candidates for such
> > unification work.
> >
> > For example, what if we can reduce the customized hugetlb paths from 20 ->
> > 2, where the customized fault() is 1 of the 2?  To further reduce those 2
> > paths we may need a new file system, but if it's good enough maybe we
> > don't need a v2, at least not for someone looking for a cleanup: a v2 is
> > better suited to someone who can properly define the new interface first,
> > and it would be much more work than a unification effort, and somewhat
> > orthogonal to it.
> 
> This is a fine approach to take. At the same time, I think the
> separate fault path is the most important difference between hugetlb
> and main mm, so if we're doing a bunch of work to unify hugetlb with
> mm (like, 20 -> 2 special paths), it'd be kind of a shame not to go
> all the way. But I'm not exactly doing the work here. :)

My goal was never to merge everything, but to make hugetlb more
maintainable, so that at some point people stop worrying about how it
evolves.  I don't think I've thought everything through, but I hope more
things will become clear in the next few months.

> 
> (The other huge piece that I'd want unified is the huge_pte
> architecture-specific functions, that's probably #2 on my list.)
> 
> > >
> > >   (2) ...a mapcount (+refcount) system that works for PUD mappings.
> > >
> > > This discussion has progressed a lot since I last thought about it;
> > > I'll let the experts figure this one out[1].
> >
> > I hope there will be a solid answer there.
> >
> > Otherwise IIRC the last plan was to use 1 mapcount for anything mapped
> > underneath.  I still think it's a good plan, which may not apply to mTHP
> > but could be perfectly efficient & simple for hugetlb.  The complexity
> > lies elsewhere, not in the counting itself, but I have a feeling it's
> > still a workable solution.
> >
> > >
> > > Anyway, I'm oversimplifying things, and it's been a while since I've
> > > thought hard about this, so please take this all with a grain of salt.
> > > The main motivating use-case for HGM (to allow for post-copy live
> > > migration of HugeTLB-1G-backed VMs with userfaultfd) can be solved in
> > > other ways[2].
> >
> > Do you know how far David went in that direction?  When will there be a
> > prototype?  Would it easily work with MISSING faults (not MINOR)?
> 
> A prototype will come eventually. :)
> 
> It's valid for a user to use KVM-based demand paging with userfaultfd,
> MISSING or MINOR. For MISSING, you could do:
> - Upon getting a KVM fault, KVM_RUN will exit to userspace.
> - Fetch the page, install it with UFFDIO_COPY, then mark the page as
> present with KVM.
> 
> KVM-based demand paging is redundant with userfaultfd in this case though.
> 
> With minor faults, the equivalent approach would be:
> - Map memory twice. Register one with userfaultfd. The other ("alias
> mapping") will be used to install memory.
> - Use the userfaultfd-registered mapping to build the KVM memslots.
> - Upon getting a KVM fault, KVM_RUN will exit.
> - Fetch the page, install it by copying it into the alias mapping,
> then UFFDIO_CONTINUE the KVM mapping, then mark the page as present
> with KVM.
> 
> We can be a little more efficient with MINOR faults, provided we're
> confident that KVM-based demand paging works properly:
> - Map memory twice. Register one with userfaultfd.
> - Give KVM the alias mapping, so we won't get userfaults on it. All
> other components get the userfaultfd-registered mapping.
> - KVM_RUN exits to userspace.
> - Fetch the page, install it in the pagecache. Mark it as present with KVM.
> - If other components get userfaults, fetch the page (if it needs to
> be), then UFFDIO_CONTINUE to unblock it.
> 
> Now userfaultfd and KVM-based demand paging are no longer redundant.
> Furthermore, if a user can guarantee that all other components are
> able to properly participate in migration without userfaultfd (i.e.,
> they are explicitly aware of demand paging), then the need for
> userfaultfd is removed.
> 
> This is just like KVM's own dirty logging vs. userfaultfd-wp.

The "register one for each" idea is interesting, but it's pretty sad to
know that it still requires userspace's awareness to support demand paging.

Supporting dirty logging is IMHO a pain already. It required so many
customized interfaces including KMV's, most of which may not be necessary
at all if mm can provide a generic async tracking API like soft-dirty or
uffd-wp at that time.  I guess soft-dirty didn't exist when GET_DIRTY_LOG
was proposed?

I think it means the proposal decided to ignore my previous questions on
things like "how do we support vhost with the new demand paging" in the
initial thread.

> 
> >
> > I will be more than happy to see whatever solution comes up from the
> > kernel that resolves that pain for VMs first.  It's unfortunate that KVM
> > will have its own solution for hugetlb small mappings, but I also
> > understand there's more than one demand for it besides hugetlb on 1G (even
> > though I'm not 100% sure of that demand when I think about it again today:
> > is the worry that the pgtable pages will take a lot of space when trapping
> > minor faults?  I haven't yet had time to revisit David's proposal in the
> > past two months; nor do I think I fully digested the details back then).
> 
> In my view, the main motivating factor is that userfaultfd is
> inherently incompatible with guest_memfd. We talked a bit about the
> potential to do a file-based userfaultfd, but it's very unclear how
> that would work.
> 
> But a KVM-based demand paging system would be able to help with:
> - post-copy for HugeTLB pages
> - reduce unnecessary work/overhead in mm (both minor faults and missing faults).
> 
> The "unnecessary" work/overhead:
> - shattered mm page tables as well as shattered EPT, whereas with a
> KVM-based solution, only the EPT is shattered.
> - must collapse both mm page tables and EPT at the end of post-copy,
> instead of only the EPT
> - mm page tables are mapped during post-copy, when they could be
> completely present to begin with

IMHO this is the trade-off of providing a generic solution.  Now afaict
we're pushing the complexity to outside KVM.  IIRC I left similar comments
before.

Obviously that's also one reason why I started working on something in
this area (even if I don't know how far I'll go yet), as I don't want to
see a KVM-specific solution proposed only because mm rejected some generic
solution and there was no other option.  I want to provide that option and
make a fair comparison between the two.

So far I still think guest_memfd should implement its own demand paging
(e.g. there's no worry about "how to support vhost" in that case, because
vhost doesn't even have a mapping if the memory is encrypted), leaving
generic guest memory types to mm like before.  But I'll stop here and
leave my other comments for when the proposal is sent.

> 
> You could make collapsing as efficient as possible (like, if possible,
> have an mmu_notifier_collapse() instead of using invalidate_start/end,
> so that KVM can do the fastest possible invalidations), but we're
> fundamentally doing more work with userfaultfd.
> 
> > The answer to the above could also help me prioritize my work; e.g.,
> > hugetlb unification is probably something we should do regardless, at
> > least for the sake of a healthy mm code base.  I plan to move HGM, or
> > whatever it will be called, upstream if necessary, but that also depends
> > on how fast the other project goes, as personally I don't worry about
> > hugetlb hwpoison yet (at least QEMU's hwpoison handling is still pretty
> > much broken.. which is pretty unfortunate), but maybe any serious cloud
> > provider should still care.
> 
> My hope with the unification is that HGM almost becomes a byproduct of
> that effort. :)

I think it'll be a separate project.  The unification effort seems to
always be wanted, while for the next step I'll need to evaluate how hard
it is to support the new interface in QEMU (regardless of my own
preference on the approach..).

I have a feeling that the new KVM demand paging proposal may have so many
limitations that I'll have no choice but to keep pursuing HGM (just
consider that I'd need to implement a demand paging scheme for all virtio
devices like vhost; that can be N times the work for me).

> 
> The hwpoison case (in my case) is also solved with a KVM-based demand
> paging system: we can use it to prevent access to the page, but
> instead of demand-fetching, we inject poison. (We need HugeTLB to keep
> mapping the page though.)

Hmm.. I'm curious how it keeps mapping the huge page if part of it is
poisoned with the current mm code?

Thanks,

-- 
Peter Xu



