* Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday
[not found] ` <20230614230458.GB3559@monkey>
@ 2023-06-15 1:12 ` David Rientjes
2023-06-15 8:04 ` Michal Hocko
1 sibling, 0 replies; 10+ messages in thread
From: David Rientjes @ 2023-06-15 1:12 UTC
To: Mike Kravetz
Cc: linux-mm, David Hildenbrand, James Houghton, John Hubbard,
Matthew Wilcox, Michal Hocko, Peter Xu, Vlastimil Babka, Zi Yan
On Wed, 14 Jun 2023, Mike Kravetz wrote:
> On 06/12/23 18:59, David Rientjes wrote:
> > This week's topic will be a technical brainstorming session on HugeTLB
> > convergence with the core MM. This has been discussed most recently in
> > this thread:
> > https://lore.kernel.org/linux-mm/ZIOEDTUBrBg6tepk@casper.infradead.org/T/
>
> Thank you David for putting this session together! And, thanks to everyone
> who participated.
>
> Following up on linux-mm with most active participants on Cc (sorry if I
> missed someone). If it makes more sense to continue the above thread,
> please move there.
>
Thank *you* for keeping the conversation going. And, yes, thanks to
everybody that participated today and especially Matthew for preparing
content on his thoughts and ideas on convergence opportunities.
> Even though everyone knows that hugetlb is special cased throughout the
> core mm, it came to a head with the proposed introduction of HGM. TBH,
> few people in the core mm community paid much attention to HGM when first
> introduced. A LSF/MM session was then dedicated to the discussion of
> HGM with the outcome being the suggestion to create a new filesystem/driver
> (hugetlb2 if you will) that would satisfy the use cases requiring HGM.
> One thing that was not emphasized at LSF/MM is that there are existing
> hugetlb users experiencing major issues that could be addressed with HGM:
> specifically the issues of memory errors and live migration. That was
> the starting point for recent discussion in the above thread.
>
> I may be wrong, but it appeared the direction of that thread was to
> first try and unify some of the hugetlb and core mm code. Eliminate
> some of the special casing. If hugetlb was less of a special case, then
> perhaps HGM would be more acceptable. That is the impression I (perhaps
> incorrectly) had going into today's session.
>
This matches my understanding as well. The above thread surfaced some
great areas of improvement for hugetlb and the big idea was to surface
those, do some technical brainstorming, and think about both short-term
and long-term goals. There *are* complexities for some of the convergence
opportunities that were discussed, but all of them would improve (imo)
both maintainability and reliability.
> During today's session, we often discussed what would/could be introduced
> in a hugetlb v2. The idea is that this would be the ideal place for HGM.
> However, people also made the comparisons to cgroup v1 - v2. Such a
> redesign provides the needed 'clean slate' to do things right, but it
> does little for existing users who would be unwilling to quickly move off
> existing hugetlb.
>
> We did spend a good chunk of time on hugetlb/core mm unification and
> removing special casing. In some (most) of these cases, the benefit of
> removing special cases from core mm would result in adding more code to
> hugetlb. For example: proper type'ing so that hugetlb does not treat
> all page table entries as PTEs. Again, I may be wrong but I think
> people were OK with adding more code (and even complexity) to hugetlb
> if it eliminated special casing in the core mm. But, there did not
> seem to be a clear consensus especially with the thought that we may
> need to double hugetlb code to get types right.
>
> Unless I missed something, there was no clear direction at the end of this
> session. I was hoping that we could come up with a plan to address the
> issues facing today's hugetlb users. IMO, there seems to be two options:
> 1) Start work on hugetlb v2 with the intention that customers will need
> to move to this to address their issues.
> 2) Incorporate functionality like HGM into existing hugetlb.
>
To address existing customer pain for 1GB memory poisoning and post-copy
live migration, yeah, I think these are the only two possible paths
forward.
> My opinion is that adding HGM to existing hugetlb is the only way we
> will be able to address issues for current users in a timely manner.
> The session today (and email thread) point out the ugliness and
> difficulty with hugetlb special casing in the core mm. Therefore,
> adding HGM (or any new code) to hugetlb should not introduce new special
> cases to core mm. I know the latest version of HGM does introduce new
> special cases. I am not sure if those can be reduced or eliminated.
> My suggestion for a direction forward would be to add HGM to existing
> hugetlb with no or minimal new special casing. In parallel work could
> begin on hugetlb v2.
I'd agree, and I think the complexities of HGM are largely constrained to
hugetlb. Reducing the special casing could have a concrete path forward
that can be iterated on. That very specific and concrete feedback would
be extremely valuable.
* Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday
[not found] ` <20230614230458.GB3559@monkey>
2023-06-15 1:12 ` [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday David Rientjes
@ 2023-06-15 8:04 ` Michal Hocko
2023-06-15 8:29 ` David Hildenbrand
2023-06-15 17:00 ` James Houghton
1 sibling, 2 replies; 10+ messages in thread
From: Michal Hocko @ 2023-06-15 8:04 UTC
To: Mike Kravetz
Cc: David Rientjes, linux-mm, David Hildenbrand, James Houghton,
John Hubbard, Matthew Wilcox, Peter Xu, Vlastimil Babka, Zi Yan
On Wed 14-06-23 16:04:58, Mike Kravetz wrote:
> On 06/12/23 18:59, David Rientjes wrote:
> > This week's topic will be a technical brainstorming session on HugeTLB
> > convergence with the core MM. This has been discussed most recently in
> > this thread:
> > https://lore.kernel.org/linux-mm/ZIOEDTUBrBg6tepk@casper.infradead.org/T/
>
> Thank you David for putting this session together! And, thanks to everyone
> who participated.
>
> Following up on linux-mm with most active participants on Cc (sorry if I
> missed someone). If it makes more sense to continue the above thread,
> please move there.
>
> Even though everyone knows that hugetlb is special cased throughout the
> core mm, it came to a head with the proposed introduction of HGM. TBH,
> few people in the core mm community paid much attention to HGM when first
> introduced. A LSF/MM session was then dedicated to the discussion of
> HGM with the outcome being the suggestion to create a new filesystem/driver
> (hugetlb2 if you will) that would satisfy the use cases requiring HGM.
> One thing that was not emphasized at LSF/MM is that there are existing
> hugetlb users experiencing major issues that could be addressed with HGM:
> specifically the issues of memory errors and live migration. That was
> the starting point for recent discussion in the above thread.
>
> I may be wrong, but it appeared the direction of that thread was to
> first try and unify some of the hugetlb and core mm code. Eliminate
> some of the special casing. If hugetlb was less of a special case, then
> perhaps HGM would be more acceptable. That is the impression I (perhaps
> incorrectly) had going into today's session.
My impression from the discussion yesterday was that the level of
unification would need to be really large and time consuming in order to
be useful for the HGM patchset to be in a more maintainable form. The
final outcome is quite hard to predict at this stage.
> During today's session, we often discussed what would/could be introduced
> in a hugetlb v2. The idea is that this would be the ideal place for HGM.
> However, people also made the comparisons to cgroup v1 - v2. Such a
> redesign provides the needed 'clean slate' to do things right, but it
> does little for existing users who would be unwilling to quickly move off
> existing hugetlb.
>
> We did spend a good chunk of time on hugetlb/core mm unification and
> removing special casing. In some (most) of these cases, the benefit of
> removing special cases from core mm would result in adding more code to
> hugetlb. For example: proper type'ing so that hugetlb does not treat
> all page table entries as PTEs. Again, I may be wrong but I think
> people were OK with adding more code (and even complexity) to hugetlb
> if it eliminated special casing in the core mm. But, there did not
> seem to be a clear consensus especially with the thought that we may
> need to double hugetlb code to get types right.
This is primarily your call as a maintainer. If you ask me, hugetlb is
overcomplicated in its current form already. Regressions are not really
rare when code is added, which is a signal we are hitting maintenance
cost walls. This doesn't mean further development is impossible, of
course, but it is increasingly costly AFAICS.
> Unless I missed something, there was no clear direction at the end of this
> session. I was hoping that we could come up with a plan to address the
> issues facing today's hugetlb users. IMO, there seems to be two options:
> 1) Start work on hugetlb v2 with the intention that customers will need
> to move to this to address their issues.
> 2) Incorporate functionality like HGM into existing hugetlb.
From the memcg experience I can tell that cleaning up interfaces and
supported scenarios helped a lot. We still have to maintain v1, and will
have to for the foreseeable future and beyond, but it is much easier
to build new functionality on top of v2 than to struggle with hammering
it into v1, where corner cases are much easier to generate.

I fully understand that this is not really a great approach for users
from the short term POV because they have to adapt, but I am pretty sure
that they would also appreciate long term stable and regression free
code.

It is my understanding that most HGM users do not really benefit from
most of hugetlb features (shared page tables, reservation code etc) and
from that POV it makes some sense to start from a simpler code base.

I do recognize there are other users who would like their existing
hugetlb setups to keep working and to benefit from better support. Page
poisoning comes to mind, but is this really that easy without
considerable changes on the user space side?

Memory error recovery problems are really tough to deal with
AFAICS. Do we have any references to an existing userspace that would
be able to deal with holes in their hugetlb pages? Or is this a chicken&egg
problem? Have we exhausted potential (even if coarse) solutions that
wouldn't require breaking hugetlb page tables?
--
Michal Hocko
SUSE Labs
* Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday
2023-06-15 8:04 ` Michal Hocko
@ 2023-06-15 8:29 ` David Hildenbrand
2023-06-15 17:24 ` James Houghton
2023-06-15 18:31 ` Mike Kravetz
2023-06-15 17:00 ` James Houghton
1 sibling, 2 replies; 10+ messages in thread
From: David Hildenbrand @ 2023-06-15 8:29 UTC
To: Michal Hocko, Mike Kravetz
Cc: David Rientjes, linux-mm, James Houghton, John Hubbard,
Matthew Wilcox, Peter Xu, Vlastimil Babka, Zi Yan
On 15.06.23 10:04, Michal Hocko wrote:
> On Wed 14-06-23 16:04:58, Mike Kravetz wrote:
>> On 06/12/23 18:59, David Rientjes wrote:
>>> This week's topic will be a technical brainstorming session on HugeTLB
>>> convergence with the core MM. This has been discussed most recently in
>>> this thread:
>>> https://lore.kernel.org/linux-mm/ZIOEDTUBrBg6tepk@casper.infradead.org/T/
>>
>> Thank you David for putting this session together! And, thanks to everyone
>> who participated.
>>
>> Following up on linux-mm with most active participants on Cc (sorry if I
>> missed someone). If it makes more sense to continue the above thread,
>> please move there.
>>
>> Even though everyone knows that hugetlb is special cased throughout the
>> core mm, it came to a head with the proposed introduction of HGM. TBH,
>> few people in the core mm community paid much attention to HGM when first
>> introduced. A LSF/MM session was then dedicated to the discussion of
>> HGM with the outcome being the suggestion to create a new filesystem/driver
>> (hugetlb2 if you will) that would satisfy the use cases requiring HGM.
>> One thing that was not emphasized at LSF/MM is that there are existing
>> hugetlb users experiencing major issues that could be addressed with HGM:
>> specifically the issues of memory errors and live migration. That was
>> the starting point for recent discussion in the above thread.
>>
>> I may be wrong, but it appeared the direction of that thread was to
>> first try and unify some of the hugetlb and core mm code. Eliminate
>> some of the special casing. If hugetlb was less of a special case, then
>> perhaps HGM would be more acceptable. That is the impression I (perhaps
>> incorrectly) had going into today's session.
>
> My impression from the discussion yesterday was that the level of
> unification would need to be really large and time consuming in order to
> be useful for the HGM patchset to be in a more maintainable form. The
> final outcome is quite hard to predict at this stage.
>
>> During today's session, we often discussed what would/could be introduced
>> in a hugetlb v2. The idea is that this would be the ideal place for HGM.
>> However, people also made the comparisons to cgroup v1 - v2. Such a
>> redesign provides the needed 'clean slate' to do things right, but it
>> does little for existing users who would be unwilling to quickly move off
>> existing hugetlb.
>>
>> We did spend a good chunk of time on hugetlb/core mm unification and
>> removing special casing. In some (most) of these cases, the benefit of
>> removing special cases from core mm would result in adding more code to
>> hugetlb. For example: proper type'ing so that hugetlb does not treat
>> all page table entries as PTEs. Again, I may be wrong but I think
>> people were OK with adding more code (and even complexity) to hugetlb
>> if it eliminated special casing in the core mm. But, there did not
>> seem to be a clear consensus especially with the thought that we may
>> need to double hugetlb code to get types right.
>
> This is primarily your call as a maintainer. If you ask me, hugetlb is
> overcomplicated in its current form already. Regressions are not really
> rare when code is added, which is a signal we are hitting maintenance
> cost walls. This doesn't mean further development is impossible, of
> course, but it is increasingly costly AFAICS.
>
>> Unless I missed something, there was no clear direction at the end of this
>> session. I was hoping that we could come up with a plan to address the
>> issues facing today's hugetlb users. IMO, there seems to be two options:
>> 1) Start work on hugetlb v2 with the intention that customers will need
>> to move to this to address their issues.
>> 2) Incorporate functionality like HGM into existing hugetlb.
>
I fully agree with all that Michal said.

I'm just going to add that I don't see why anyone would look into a
hugetlbv2 if we're going to use the motivation of "help existing users"
to make hugetlb ever-more complicated and special. "existing users" here
even meaning "people use hugetlb for backing VMs. Now they want to get
postcopy working with less latency." -- which I consider partially a new
use case.

So working on adding HGM and concurrently starting a hugetlbv2? I don't
think that will happen if we decide on adding HGM and proceeding with
that reasoning about existing users.

As expressed yesterday, I don't see a fast and clean way to make hugetlb
significantly less special (thanks Willy for the list of odd cases).

Sure, we can talk about adding pte_t safety, but I don't really see a
way forward to unify page table walking code that way -- there are still
the (PT) locking, PMD sharing, PTE-cont special cases ... but sure, if
anybody wants to work on that, why not.
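
For readers who have not followed the pte_t safety idea, here is a minimal userspace sketch of what typed page table entries buy. It is not kernel code: the `ex_*` names, the bit layout, and the 4K/2M sizes are all invented for illustration. It only shows that wrapping each level in its own struct type makes the compiler reject mixing levels, at the cost of one helper per level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical per-level entry types. Each wraps the same 64-bit value,
 * but the compiler treats them as unrelated types, so a PMD cannot be
 * passed where a PTE is expected.
 */
typedef struct { uint64_t val; } ex_pte_t;
typedef struct { uint64_t val; } ex_pmd_t;

#define EX_PRESENT (UINT64_C(1) << 0) /* made-up bit layout */
#define EX_LEAF    (UINT64_C(1) << 7) /* made-up "maps a huge page" bit */

bool ex_pte_present(ex_pte_t pte)
{
	return (pte.val & EX_PRESENT) != 0;
}

bool ex_pmd_present(ex_pmd_t pmd)
{
	return (pmd.val & EX_PRESENT) != 0;
}

/* Only a PMD-level entry can be a PMD-sized leaf. */
bool ex_pmd_leaf(ex_pmd_t pmd)
{
	return ex_pmd_present(pmd) && (pmd.val & EX_LEAF) != 0;
}

/* Bytes mapped by one leaf entry, assuming 4K base pages. */
uint64_t ex_pte_leaf_size(ex_pte_t pte)
{
	assert(ex_pte_present(pte));
	return UINT64_C(4096);
}

/* Bytes mapped by one PMD leaf, assuming a 2M PMD size. */
uint64_t ex_pmd_leaf_size(ex_pmd_t pmd)
{
	assert(ex_pmd_leaf(pmd));
	return UINT64_C(2) << 20;
}

/*
 * ex_pte_present(pmd) would be rejected at compile time. The price is
 * one helper per level, i.e. the code duplication raised in this thread.
 */
```

This says nothing about locking, PMD sharing, or PTE-cont, which are separate special cases.
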
That said, like Michal, I acknowledge that it is Mike's call
regarding the hugetlb code. I, for my part, will push back on any added
core-mm complexity that adds more special casing for hugetlb. Maybe
there are easy ways to integrate it nicely and that is not really a concern.

Note that while we've been discussing how HGM would already interfere
with core-mm, we've not even started discussing how actual
MADV_SPLIT/MADV_COLLAPSE/page poisoning ... would affect core-mm and
require special-casing for hugetlb.

I, for my part, will explore a bit the mapcount topic (as time permits)
and see if we can come up at least with a unified mapcount approach
(e.g., sub-page mapcount?). But I suspect even figuring that out will
take quite a while already ...
--
Cheers,
David / dhildenb
* Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday
2023-06-15 8:04 ` Michal Hocko
2023-06-15 8:29 ` David Hildenbrand
@ 2023-06-15 17:00 ` James Houghton
2023-06-15 17:18 ` Matthew Wilcox
1 sibling, 1 reply; 10+ messages in thread
From: James Houghton @ 2023-06-15 17:00 UTC
To: Michal Hocko
Cc: Mike Kravetz, David Rientjes, linux-mm, David Hildenbrand,
John Hubbard, Matthew Wilcox, Peter Xu, Vlastimil Babka, Zi Yan
On Thu, Jun 15, 2023 at 1:04 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 14-06-23 16:04:58, Mike Kravetz wrote:
> > On 06/12/23 18:59, David Rientjes wrote:
> > > This week's topic will be a technical brainstorming session on HugeTLB
> > > convergence with the core MM. This has been discussed most recently in
> > > this thread:
> > > https://lore.kernel.org/linux-mm/ZIOEDTUBrBg6tepk@casper.infradead.org/T/
> >
> > Thank you David for putting this session together! And, thanks to everyone
> > who participated.
> >
> > Following up on linux-mm with most active participants on Cc (sorry if I
> > missed someone). If it makes more sense to continue the above thread,
> > please move there.
> >
> > Even though everyone knows that hugetlb is special cased throughout the
> > core mm, it came to a head with the proposed introduction of HGM. TBH,
> > few people in the core mm community paid much attention to HGM when first
> > introduced. A LSF/MM session was then dedicated to the discussion of
> > HGM with the outcome being the suggestion to create a new filesystem/driver
> > (hugetlb2 if you will) that would satisfy the use cases requiring HGM.
> > One thing that was not emphasized at LSF/MM is that there are existing
> > hugetlb users experiencing major issues that could be addressed with HGM:
> > specifically the issues of memory errors and live migration. That was
> > the starting point for recent discussion in the above thread.
> >
> > I may be wrong, but it appeared the direction of that thread was to
> > first try and unify some of the hugetlb and core mm code. Eliminate
> > some of the special casing. If hugetlb was less of a special case, then
> > perhaps HGM would be more acceptable. That is the impression I (perhaps
> > incorrectly) had going into today's session.
>
> My impression from the discussion yesterday was that the level of
> unification would need to be really large and time consuming in order to
> be useful for the HGM patchset to be in a more maintainable form. The
> final outcome is quite hard to predict at this stage.
I also had this impression, but some of the unification efforts are
pretty independent of HGM (like the PTE/PMD/PUD typing idea). Those
don't really change HGM all that much. My understanding is: do
some general unification first, and then we could take HGM. HGM is still
going to look mostly the same.
>
> > During today's session, we often discussed what would/could be introduced
> > in a hugetlb v2. The idea is that this would be the ideal place for HGM.
> > However, people also made the comparisons to cgroup v1 - v2. Such a
> > redesign provides the needed 'clean slate' to do things right, but it
> > does little for existing users who would be unwilling to quickly move off
> > existing hugetlb.
> >
> > We did spend a good chunk of time on hugetlb/core mm unification and
> > removing special casing. In some (most) of these cases, the benefit of
> > removing special cases from core mm would result in adding more code to
> > hugetlb. For example: proper type'ing so that hugetlb does not treat
> > all page table entries as PTEs. Again, I may be wrong but I think
> > people were OK with adding more code (and even complexity) to hugetlb
> > if it eliminated special casing in the core mm. But, there did not
> > seem to be a clear consensus especially with the thought that we may
> > need to double hugetlb code to get types right.
>
> This is primarily your call as a maintainer. If you ask me, hugetlb is
> overcomplicated in its current form already. Regressions are not really
> rare when code is added, which is a signal we are hitting maintenance
> cost walls. This doesn't mean further development is impossible, of
> course, but it is increasingly costly AFAICS.
>
> > Unless I missed something, there was no clear direction at the end of this
> > session. I was hoping that we could come up with a plan to address the
> > issues facing today's hugetlb users. IMO, there seems to be two options:
> > 1) Start work on hugetlb v2 with the intention that customers will need
> > to move to this to address their issues.
> > 2) Incorporate functionality like HGM into existing hugetlb.
>
> From the memcg experience I can tell that cleaning up interfaces and
> supported scenarios helped a lot. We still have to maintain v1, and will
> have to for the foreseeable future and beyond, but it is much easier
> to build new functionality on top of v2 than to struggle with hammering
> it into v1, where corner cases are much easier to generate.
>
> I fully understand that this is not really a great approach for users
> from the short term POV because they have to adapt but I am pretty sure
> that they would also appreciate long term stable and regression free
> code.
>
> It is my understanding that most HGM users do not really benefit from
> most of hugetlb features (shared page tables, reservation code etc) and
> from that POV it makes some sense to start from a simpler code base.
>
> I do recognize there are other users who would like their existing
> hugetlb setups to keep working and to benefit from better support. Page
> poisoning comes to mind, but is this really that easy without
> considerable changes on the user space side?
>
> Memory error recovery problems are really tough to deal with
> AFAICS. Do we have any references to an existing userspace that would
> be able to deal with holes in their hugetlb pages? Or is this a chicken&egg
> problem? Have we exhausted potential (even if coarse) solutions that
> wouldn't require breaking hugetlb page tables?
For VMs, having a 4K hole in the hugetlb page is how Google emulates
memory poison. (We sorta get what we need with a terrible hack to KVM.
HGM is better in every way.)

There aren't a lot of userspace changes to make this work. It comes down to:

1. Enlighten the vCPU threads (the ones that run KVM_RUN) to handle
   the MCEERR SIGBUSes, and inject MCEs into the guest when this happens.
2. Optionally enlighten any other routines that have a substantial
   likelihood to read poisoned memory (for example, live migration
   routines), and skip the poisoned pieces (by adjusting the program
   counter in the SIGBUS handler).

This isn't theoretical and is implemented today, but I can't easily
show you how Google does it. QEMU does #1 like this[1]. Unmapping in
this case is important; we can't allow the guest to be able to trigger
MCEs at will, as they could easily force the host to crash.
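
As a rough illustration of the two steps above (and explicitly not how Google or QEMU implement them): the kernel reports a memory error to userspace as a SIGBUS whose si_code is BUS_MCEERR_AR or BUS_MCEERR_AO, with si_addr and si_addr_lsb describing the poisoned range. The sketch below only decodes that information. `struct poison_range`, `decode_mce_sigbus`, `skip_poison`, and `pending_mce` are hypothetical names, the copy-loop helper stands in for the program-counter adjustment, and injecting the MCE into the guest is left out entirely.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one poisoned range reported through SIGBUS. */
struct poison_range {
	uintptr_t start; /* reported address rounded down to the range size */
	size_t len;      /* 1 << si_addr_lsb, e.g. 4096 for a 4K hole */
	bool must_act;   /* BUS_MCEERR_AR: the interrupted access hit the poison */
};

/* Decode a SIGBUS siginfo; returns false if it is not a memory error. */
bool decode_mce_sigbus(const siginfo_t *info, struct poison_range *out)
{
	if (info->si_signo != SIGBUS)
		return false;
	if (info->si_code != BUS_MCEERR_AR && info->si_code != BUS_MCEERR_AO)
		return false;
	out->len = (size_t)1 << info->si_addr_lsb;
	out->start = (uintptr_t)info->si_addr & ~((uintptr_t)out->len - 1);
	out->must_act = (info->si_code == BUS_MCEERR_AR);
	return true;
}

/* Step 2 stand-in: where a copy loop should resume if addr is poisoned. */
uintptr_t skip_poison(const struct poison_range *r, uintptr_t addr)
{
	assert(r != NULL && r->len != 0);
	if (addr >= r->start && addr - r->start < r->len)
		return r->start + r->len;
	return addr;
}

/* Illustrative hand-off from the handler to the vCPU loop. */
static struct poison_range last_error;
static volatile sig_atomic_t have_error;

static void sigbus_handler(int signo, siginfo_t *info, void *ucontext)
{
	(void)signo;
	(void)ucontext;
	/*
	 * Step 1: record the error so the vCPU loop can inject an MCE.
	 * A real handler must also deal with other SIGBUS causes and must
	 * not simply return into the faulting access.
	 */
	if (decode_mce_sigbus(info, &last_error))
		have_error = 1;
}

/* Returns the recorded error, or NULL if none has been seen. */
const struct poison_range *pending_mce(void)
{
	return have_error ? &last_error : NULL;
}

/* Install the handler with SA_SIGINFO so si_code and si_addr are delivered. */
int install_sigbus_handler(void)
{
	struct sigaction sa = {0};

	sa.sa_sigaction = sigbus_handler;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGBUS, &sa, NULL);
}
```

With a 4K hole, si_addr_lsb would be 12 and the decoded range covers exactly one base page, which is the granularity being argued for here.
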
khugepaged was recently updated[2] to better recover from poison, so
VMs that consumed an error because khugepaged (in the guest) was
collapsing memory will be able to legitimately recover.

I can't speak in detail about how databases recover from memory
poison. Mike, maybe you can share more details?

[1]: https://github.com/qemu/qemu/blob/master/target/i386/kvm/kvm.c#L649
[2]: https://lore.kernel.org/linux-mm/20230329151121.949896-1-jiaqiyan@google.com/
* Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday
2023-06-15 17:00 ` James Houghton
@ 2023-06-15 17:18 ` Matthew Wilcox
2023-06-15 17:59 ` Mike Kravetz
0 siblings, 1 reply; 10+ messages in thread
From: Matthew Wilcox @ 2023-06-15 17:18 UTC
To: James Houghton
Cc: Michal Hocko, Mike Kravetz, David Rientjes, linux-mm,
David Hildenbrand, John Hubbard, Peter Xu, Vlastimil Babka,
Zi Yan
On Thu, Jun 15, 2023 at 10:00:41AM -0700, James Houghton wrote:
> I can't speak in detail about how databases recover from memory
> poison. Mike, maybe you can share more details?
Speaking generally (this would describe MySQL as well as any number of
proprietary databases), hugetlbfs is used by databases as a supplier
of memory to their userspace buffer cache. As database blocks are
needed, they're read from storage using O_DIRECT. Depending on the
database, they may or may not be updated in place.

Just as in our page cache, if the hwpoison hits in a clean block, it
can discard the block and re-read it. If it hits in a dirty block, it's
Game Over. Most blocks are clean. A 1GB page contains so many blocks
that taking out the entire 1GB is guaranteed to take out a dirty block.
Taking out a single 4kB page is likely to take out only clean blocks.
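
To put rough numbers on that argument, here is a small sketch under stated assumptions: each block is dirty independently with the same probability, so a poisoned region holding n blocks contains at least one dirty block with probability 1 - (1 - p)^n. The function name, the independence assumption, and any block size or dirty fraction you plug in are illustrative, not taken from a real database.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Probability that a poisoned region holding `blocks` database blocks
 * contains at least one dirty block, if each block is independently
 * dirty with probability `dirty_fraction`: 1 - (1 - p)^n.
 */
double prob_dirty_hit(size_t blocks, double dirty_fraction)
{
	double all_clean = 1.0;

	assert(dirty_fraction >= 0.0 && dirty_fraction <= 1.0);
	for (size_t i = 0; i < blocks; i++)
		all_clean *= 1.0 - dirty_fraction;
	return 1.0 - all_clean;
}
```

For example, with a made-up 8 KiB block size and 1% of blocks dirty, a 1GB page holds 131072 blocks and the result is indistinguishable from 1, while a 4kB page touches a single block and the result is about 0.01.
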
* Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday
2023-06-15 8:29 ` David Hildenbrand
@ 2023-06-15 17:24 ` James Houghton
2023-06-15 18:58 ` Peter Xu
2023-06-15 18:31 ` Mike Kravetz
1 sibling, 1 reply; 10+ messages in thread
From: James Houghton @ 2023-06-15 17:24 UTC
To: David Hildenbrand
Cc: Michal Hocko, Mike Kravetz, David Rientjes, linux-mm,
John Hubbard, Matthew Wilcox, Peter Xu, Vlastimil Babka, Zi Yan
On Thu, Jun 15, 2023 at 1:30 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 15.06.23 10:04, Michal Hocko wrote:
> > On Wed 14-06-23 16:04:58, Mike Kravetz wrote:
> >> On 06/12/23 18:59, David Rientjes wrote:
> >>> This week's topic will be a technical brainstorming session on HugeTLB
> >>> convergence with the core MM. This has been discussed most recently in
> >>> this thread:
> >>> https://lore.kernel.org/linux-mm/ZIOEDTUBrBg6tepk@casper.infradead.org/T/
> >>
> >> Thank you David for putting this session together! And, thanks to everyone
> >> who participated.
> >>
> >> Following up on linux-mm with most active participants on Cc (sorry if I
> >> missed someone). If it makes more sense to continue the above thread,
> >> please move there.
> >>
> >> Even though everyone knows that hugetlb is special cased throughout the
> >> core mm, it came to a head with the proposed introduction of HGM. TBH,
> >> few people in the core mm community paid much attention to HGM when first
> >> introduced. A LSF/MM session was then dedicated to the discussion of
> >> HGM with the outcome being the suggestion to create a new filesystem/driver
> >> (hugetlb2 if you will) that would satisfy the use cases requiring HGM.
> >> One thing that was not emphasized at LSF/MM is that there are existing
> >> hugetlb users experiencing major issues that could be addressed with HGM:
> >> specifically the issues of memory errors and live migration. That was
> >> the starting point for recent discussion in the above thread.
> >>
> >> I may be wrong, but it appeared the direction of that thread was to
> >> first try and unify some of the hugetlb and core mm code. Eliminate
> >> some of the special casing. If hugetlb was less of a special case, then
> >> perhaps HGM would be more acceptable. That is the impression I (perhaps
> >> incorrectly) had going into today's session.
> >
> > My impression from the discussion yesterday was that the level of
> > unification would need to be really large and time consuming in order to
> > be useful for the HGM patchset to be in a more maintainable form. The
> > final outcome is quite hard to predict at this stage.
> >
> >> During today's session, we often discussed what would/could be introduced
> >> in a hugetlb v2. The idea is that this would be the ideal place for HGM.
> >> However, people also made the comparisons to cgroup v1 - v2. Such a
> >> redesign provides the needed 'clean slate' to do things right, but it
> >> does little for existing users who would be unwilling to quickly move off
> >> existing hugetlb.
> >>
> >> We did spend a good chunk of time on hugetlb/core mm unification and
> >> removing special casing. In some (most) of these cases, the benefit of
> >> removing special cases from core mm would result in adding more code to
> >> hugetlb. For example: proper type'ing so that hugetlb does not treat
> >> all page table entries as PTEs. Again, I may be wrong but I think
> >> people were OK with adding more code (and even complexity) to hugetlb
> >> if it eliminated special casing in the core mm. But, there did not
> >> seem to be a clear consensus especially with the thought that we may
> >> need to double hugetlb code to get types right.
> >
> > This is primarily your call as a maintainer. If you ask me, hugetlb is
> > overcomplicated in its current form already. Regressions are not really
> > rare when code is added, which is a signal we are hitting maintenance
> > cost walls. This doesn't mean further development is impossible, of
> > course, but it is increasingly costly AFAICS.
> >
> >> Unless I missed something, there was no clear direction at the end of this
> >> session. I was hoping that we could come up with a plan to address the
> >> issues facing today's hugetlb users. IMO, there seems to be two options:
> >> 1) Start work on hugetlb v2 with the intention that customers will need
> >> to move to this to address their issues.
> >> 2) Incorporate functionality like HGM into existing hugetlb.
> >
>
> I fully agree with all that Michal said.
>
> I'm just going to add that I don't see why anyone would look into a
> hugetlbv2 if we're going to use the motivation of "help existing users"
> to make hugetlb ever-more complicated and special. "existing users" here
> even meaning "people use hugetlb for backing VMs. Now they want to get
> postcopy working with less latency." -- which I consider partially a new
> use case.
>
> So working on adding HGM and concurrently starting a hugetlbv2? I don't
> think that will happen if we decide on adding HGM and proceeding with
> that reasoning about existing users.
>
> As expressed yesterday, I don't see a fast and clean way to make hugetlb
> significantly less special (thanks Willy for the list of odd cases).
>
> Sure, we can talk about adding pte_t safety, but I don't really see a
> way forward to unify page table walking code that way -- there are still
> the (PT) locking, PMD sharing, PTE-cont special cases ... but sure, if
> anybody wants to work on that, why not.
>
> That said, like Michal, I acknowledge that it is Mike's call
> regarding the hugetlb code. I, for my part, will push back on any added
> core-mm complexity that adds more special casing for hugetlb. Maybe
> there are easy ways to integrate it nicely and that is not really a concern.
HGM is mostly contained in the already-existing HugeTLB special cases.
HGM doesn't really *add* special cases, it just makes the HugeTLB
special cases more complicated.
There are a few small ways that HGM touches non-hugetlb code:
1. Mapcount (to make hugetlb use the THP scheme) [1], newer version here[2]
2. madvise (to add MADV_SPLIT and update MADV_COLLAPSE) [3] and [4]
3. A small non-hugetlb change to page_vma_mapped_walk (provide pte_order)[5]
4. A small special case in try_to_unmap_one and try_to_migrate_one (to
check the head page for page flags)[6]
5. smaps stats[7]
[1]: https://lore.kernel.org/linux-mm/20230218002819.1486479-6-jthoughton@google.com/
[2]: https://lore.kernel.org/linux-mm/20230306230004.1387007-1-jthoughton@google.com/
[3]: https://lore.kernel.org/linux-mm/20230218002819.1486479-10-jthoughton@google.com/
[4]: https://lore.kernel.org/linux-mm/20230218002819.1486479-35-jthoughton@google.com/
[5]: https://lore.kernel.org/linux-mm/20230218002819.1486479-27-jthoughton@google.com/
[6]: https://lore.kernel.org/linux-mm/20230218002819.1486479-29-jthoughton@google.com/
[7]: https://lore.kernel.org/linux-mm/20230218002819.1486479-39-jthoughton@google.com/
>
> Note that while we've been discussing how HGM would already interfere
> with core-mm, we've not even started discussing how actual
> MADV_SPLIT/MADV_COLLAPSE/page poisoning ... would affect core-mm and
> require special-casing for hugetlb.
>
> I, for my part, will explore a bit the mapcount topic (as time permits)
> and see if we can come up at least with a unified mapcount approach
> (e.g., sub-page mapcount?). But I suspect even figuring that out will
> take quite a while already ...
Thanks! Simply using the current THP mapcount scheme with HGM isn't
great (but IIUC this isn't blocking HGM). By using this scheme,
HugeTLB loses the vmemmap optimization / page struct freeing when HGM
is in use, and, of course, this scheme gets slow with very large
folios.
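To put rough numbers on that, here is a back-of-the-envelope sketch. It
assumes 4 KiB base pages and a 64-byte struct page (typical for x86-64,
but an assumption here), and the helper names are made up for
illustration. It only counts how many per-subpage entries a THP-style
scheme has to cover and how much vmemmap describes them; it says nothing
about how much of that the vmemmap optimization would actually free.

```python
# Illustrative arithmetic only. The constants are assumptions, not values
# taken from the HGM series.
BASE_PAGE = 4 * 1024   # assumed base page size (4 KiB)
STRUCT_PAGE = 64       # assumed sizeof(struct page) in bytes


def subpages(folio_bytes):
    """Number of base pages (hence per-subpage mapcounts) in a folio."""
    return folio_bytes // BASE_PAGE


def vmemmap_bytes(folio_bytes):
    """Bytes of struct page metadata describing the folio."""
    return subpages(folio_bytes) * STRUCT_PAGE


for name, size in (("2 MiB", 2 << 20), ("1 GiB", 1 << 30)):
    print(f"{name}: {subpages(size)} subpages, "
          f"{vmemmap_bytes(size) // 1024} KiB of struct pages")
```

Under these assumptions a 1 GiB folio has 262144 subpages described by
16 MiB of struct pages, versus 512 subpages and 32 KiB for a 2 MiB folio.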
* Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday
2023-06-15 17:18 ` Matthew Wilcox
@ 2023-06-15 17:59 ` Mike Kravetz
0 siblings, 0 replies; 10+ messages in thread
From: Mike Kravetz @ 2023-06-15 17:59 UTC (permalink / raw)
To: Matthew Wilcox
Cc: James Houghton, Michal Hocko, David Rientjes, linux-mm,
David Hildenbrand, John Hubbard, Peter Xu, Vlastimil Babka,
Zi Yan
On 06/15/23 18:18, Matthew Wilcox wrote:
> On Thu, Jun 15, 2023 at 10:00:41AM -0700, James Houghton wrote:
> > I can't speak in detail about how databases recover from memory
> > poison. Mike, maybe you can share more details?
>
> Speaking generally (this would describe MySQL as well as any number of
> proprietary databases), hugetlbfs is used by databases as a supplier
> of memory to their userspace buffer cache. As database blocks are
> needed, they're read from storage using O_DIRECT. Depending on the
> database, they may or may not be updated in place.
>
> Just as in our page cache, if the hwpoison hits in a clean block, it
> can discard the block and re-read it. If it hits in a dirty block, it's
> Game Over. Most blocks are clean. A 1GB page contains so many blocks
> that taking out the entire 1GB is guaranteed to take out a dirty block.
> Taking out a single 4kB page is likely to take out only clean blocks.
Thanks Matthew.
I do not have details of exactly how this is done. However, this does fit
in with discussions I have had with our DB team. They can possibly recover
using another (older) copy of the data. The smaller the size of data, the
greater the possibility the older data can be used.
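A rough way to see the size effect, as a sketch under stated assumptions
rather than a description of any real database: assume an 8 KiB block
size and assume each block is dirty independently with probability p.
Both are assumptions made only for illustration, and the function name is
made up. The chance that a poisoned region holds only clean blocks is
then (1 - p)^n for the n blocks it covers.

```python
# Toy model: independent dirty blocks, assumed 8 KiB block size.
BLOCK = 8 * 1024


def p_all_clean(region_bytes, p_dirty, block_bytes=BLOCK):
    """Probability that every block touched by the region is clean."""
    # A region smaller than a block still takes out (at least) one block.
    n = max(1, region_bytes // block_bytes)
    return (1.0 - p_dirty) ** n


for name, size in (("4 KiB", 4 << 10), ("2 MiB", 2 << 20), ("1 GiB", 1 << 30)):
    print(name, p_all_clean(size, 0.01))
```

With even 1% of blocks dirty, a 4 KiB loss is recoverable about 99% of
the time in this model, while the 131072 blocks in a 1 GiB page make an
all-clean outcome vanishingly unlikely.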
--
Mike Kravetz
* Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday
2023-06-15 8:29 ` David Hildenbrand
2023-06-15 17:24 ` James Houghton
@ 2023-06-15 18:31 ` Mike Kravetz
1 sibling, 0 replies; 10+ messages in thread
From: Mike Kravetz @ 2023-06-15 18:31 UTC (permalink / raw)
To: David Hildenbrand
Cc: Michal Hocko, David Rientjes, linux-mm, James Houghton,
John Hubbard, Matthew Wilcox, Peter Xu, Vlastimil Babka, Zi Yan
On 06/15/23 10:29, David Hildenbrand wrote:
> On 15.06.23 10:04, Michal Hocko wrote:
> > On Wed 14-06-23 16:04:58, Mike Kravetz wrote:
> > > On 06/12/23 18:59, David Rientjes wrote:
> > >
> > > We did spend a good chunk of time on hugetlb/core mm unification and
> > > removing special casing. In some (most) of these cases, the benefit of
> > > removing special cases from core mm would result in adding more code to
> > > hugetlb. For example: proper type'ing so that hugetlb does not treat
> > > all page table entries as PTEs. Again, I may be wrong but I think
> > > people were OK with adding more code (and even complexity) to hugetlb
> > > if it eliminated special casing in the core mm. But, there did not
> > > seem to be a clear consensus especially with the thought that we may
> > > need to double hugetlb code to get types right.
> >
> > This is primarily your call as a maintainer. If you ask me, hugetlb is
> > over complicated in its current form already. Regressions are not really
> > seldom when code is added which is a signal we are hitting maintenance
> > cost walls. This doesn't mean further development is impossible of
> > course but it is increasingly more costly AFAICS.
> >
> > > Unless I missed something, there was no clear direction at the end of this
> > > session. I was hoping that we could come up with a plan to address the
> > > issues facing today's hugetlb users. IMO, there seems to be two options:
> > > 1) Start work on hugetlb v2 with the intention that customers will need
> > > to move to this to address their issues.
> > > 2) Incorporate functionality like HGM into existing hugetlb.
> >
>
> I fully agree with all that Michal said.
>
> I'm just going to add that I don't see why anyone would look into a
> hugetlbv2 if we're going to use the motivation of "help existing users" to
> make hugetlb ever-more complicated and special. "existing users" here even
> meaning "people use hugetlb for backing VMs. Now they want to get postcopy
> working with less latency." -- which I consider partially a new use case.
>
> So working on adding HGM and concurrently starting a hugetlbv2? I don't
> think that will happen if we decide on adding HGM and proceeding with that
> reasoning about existing users.
I agree that doing both in parallel is not going to happen.
> As expressed yesterday, I don't see a fast and clean way to make hugetlb
> significantly less special (thanks Willy for the list of odd cases).
>
> Sure, we can talk about adding pte_t safety, but I don't really see a way
> forward to unify page table walking code that way -- there are still the
> (PT) locking, PMD sharing, PTE-cont special cases ... but sure, if anybody
> wants to work on that, why not.
>
> Having that said, like Michal, I acknowledge that it is Mike's call regarding
> the hugetlb code. I, for my part, will push back on any added core-mm
> complexity that adds more special casing for hugetlb. Maybe there are easy
> ways to integrate it nicely and that is not really a concern.
And if the call on how to move forward was easy, I would have already made
a decision. :) I really do appreciate all the input.
It is pretty clear that adding more complex special cases to core mm for
hugetlb is going to be a non-starter. James has talked about any such special
cases for HGM in another reply.
I previously said that I am leaning toward trying to add HGM to existing
hugetlb. This is on the condition that any addition of special cases to
the core mm would be minimal and trivial. In addition, the added complexity
to hugetlb has to be manageable.
--
Mike Kravetz
> Note that while we've been discussing how HGM would already interfere with
> core-mm, we've not even started discussing how actual
> MADV_SPLIT/MADV_COLLAPSE/page poisoning ... would affect core-mm and
> require special-casing for hugetlb.
>
> I, for my part, will explore a bit the mapcount topic (as time permits) and
> see if we can come up at least with a unified mapcount approach (e.g.,
> sub-page mapcount?). But I suspect even figuring that out will take quite a
> while already ...
>
> --
> Cheers,
>
> David / dhildenb
* Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday
2023-06-15 17:24 ` James Houghton
@ 2023-06-15 18:58 ` Peter Xu
0 siblings, 0 replies; 10+ messages in thread
From: Peter Xu @ 2023-06-15 18:58 UTC (permalink / raw)
To: James Houghton
Cc: David Hildenbrand, Michal Hocko, Mike Kravetz, David Rientjes,
linux-mm, John Hubbard, Matthew Wilcox, Vlastimil Babka, Zi Yan
On Thu, Jun 15, 2023 at 10:24:24AM -0700, James Houghton wrote:
> On Thu, Jun 15, 2023 at 1:30 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 15.06.23 10:04, Michal Hocko wrote:
> > > On Wed 14-06-23 16:04:58, Mike Kravetz wrote:
> > >> On 06/12/23 18:59, David Rientjes wrote:
> > >>> This week's topic will be a technical brainstorming session on HugeTLB
> > >>> convergence with the core MM. This has been discussed most recently in
> > >>> this thread:
> > >>> https://lore.kernel.org/linux-mm/ZIOEDTUBrBg6tepk@casper.infradead.org/T/
> > >>
> > >> Thank you David for putting this session together! And, thanks to everyone
> > >> who participated.
> > >>
> > >> Following up on linux-mm with most active participants on Cc (sorry if I
> > >> missed someone). If it makes more sense to continue the above thread,
> > >> please move there.
> > >>
> > >> Even though everyone knows that hugetlb is special cased throughout the
> > >> core mm, it came to a head with the proposed introduction of HGM. TBH,
> > >> few people in the core mm community paid much attention to HGM when first
> > >> introduced. A LSF/MM session was then dedicated to the discussion of
> > >> HGM with the outcome being the suggestion to create a new filesystem/driver
> > >> (hugetlb2 if you will) that would satisfy the use cases requiring HGM.
> > >> One thing that was not emphasized at LSF/MM is that there are existing
> > >> hugetlb users experiencing major issues that could be addressed with HGM:
> > >> specifically the issues of memory errors and live migration. That was
> > >> the starting point for recent discussion in the above thread.
> > >>
> > >> I may be wrong, but it appeared the direction of that thread was to
> > >> first try and unify some of the hugetlb and core mm code. Eliminate
> > >> some of the special casing. If hugetlb was less of a special case, then
> > >> perhaps HGM would be more acceptable. That is the impression I (perhaps
> > >> incorrectly) had going into today's session.
> > >
> > > My impression from the discussion yesterday was that the level of
> > > unification would need to be really large and time consuming in order to
> > > be useful for the HGM patchset to be in a more maintainable form. The
> > > final outcome is quite hard to predict at this stage.
> > >
> > >> During today's session, we often discussed what would/could be introduced
> > >> in a hugetlb v2. The idea is that this would be the ideal place for HGM.
> > >> However, people also made the comparisons to cgroup v1 - v2. Such a
> > >> redesign provides the needed 'clean slate' to do things right, but it
> > >> does little for existing users who would be unwilling to quickly move off
> > >> existing hugetlb.
> > >>
> > >> We did spend a good chunk of time on hugetlb/core mm unification and
> > >> removing special casing. In some (most) of these cases, the benefit of
> > >> removing special cases from core mm would result in adding more code to
> > >> hugetlb. For example: proper type'ing so that hugetlb does not treat
> > >> all page table entries as PTEs. Again, I may be wrong but I think
> > >> people were OK with adding more code (and even complexity) to hugetlb
> > >> if it eliminated special casing in the core mm. But, there did not
> > > >> seem to be a clear consensus especially with the thought that we may
> > >> need to double hugetlb code to get types right.
> > >
> > > This is primarily your call as a maintainer. If you ask me, hugetlb is
> > > > over complicated in its current form already. Regressions are not really
> > > seldom when code is added which is a signal we are hitting maintenance
> > > cost walls. This doesn't mean further development is impossible of
> > > course but it is increasingly more costly AFAICS.
> > >
> > >> Unless I missed something, there was no clear direction at the end of this
> > >> session. I was hoping that we could come up with a plan to address the
> > >> issues facing today's hugetlb users. IMO, there seems to be two options:
> > >> 1) Start work on hugetlb v2 with the intention that customers will need
> > >> to move to this to address their issues.
> > >> 2) Incorporate functionality like HGM into existing hugetlb.
> > >
> >
> > I fully agree with all that Michal said.
> >
> > I'm just going to add that I don't see why anyone would look into a
> > hugetlbv2 if we're going to use the motivation of "help existing users"
> > > to make hugetlb ever-more complicated and special. "existing users" here
> > even meaning "people use hugetlb for backing VMs. Now they want to get
> > postcopy working with less latency." -- which I consider partially a new
> > use case.
> >
> > So working on adding HGM and concurrently starting a hugetlbv2? I don't
> > think that will happen if we decide on adding HGM and proceeding with
> > that reasoning about existing users.
> >
> > > As expressed yesterday, I don't see a fast and clean way to make hugetlb
> > significantly less special (thanks Willy for the list of odd cases).
> >
> > Sure, we can talk about adding pte_t safety, but I don't really see a
> > way forward to unify page table walking code that way -- there are still
> > the (PT) locking, PMD sharing, PTE-cont special cases ... but sure, if
> > anybody wants to work on that, why not.
> >
> > > Having that said, like Michal, I acknowledge that it is Mike's call
> > regarding the hugetlb code. I, for my part, will push back on any added
> > core-mm complexity that adds more special casing for hugetlb. Maybe
> > there are easy ways to integrate it nicely and that is not really a concern.
>
> HGM is mostly contained in the already-existing HugeTLB special cases.
> HGM doesn't really *add* special cases, it just makes the HugeTLB
> special cases more complicated.
Maybe we indeed shouldn't count all the changes in the HGM series as "adding
complexity to hugetlb". Some of the changes may still be needed even if / when
there is a hugetlbv2.
IMHO the goal should be trying our best to reduce the ones that are still
counted due to hugetlb's specialties within v1. Personally, if anyone can
show that he/she tried their best at that and made progress on getting us
closer to a "converged" state, I'll be fine with merging HGM within v1,
provided the special code is also reduced to a minimum.
>
> There are a few small ways that HGM touches non-hugetlb code:
> 1. Mapcount (to make hugetlb use the THP scheme) [1], newer version here[2]
This seems totally benign to me, as this switches hugetlb to use thp
mapcounts. I'd say this does not make hugetlb special but instead moves
it slightly toward convergence.
If we want v2, we can design whatever better way to do mapcount, maybe not
only for hugetlb but also for thp. But it seems that can always be done on
top, after first merging the two major kinds of large folios.
> 2. madvise (to add MADV_SPLIT and update MADV_COLLAPSE) [3] and [4]
This is needed no matter what; I'd say it shouldn't be counted as
"over-complicating" hugetlb.
> 3. A small non-hugetlb change to page_vma_mapped_walk (provide pte_order)[5]
A few lines of complexity, maybe not a big issue.
> 4. A small special case in try_to_unmap_one and try_to_migrate_one (to
> check the head page for page flags)[6]
Seems to be an extra special case due to the different handling of
hwpoisoned large pages. Not sure whether it can be worked out from the
memory-failure side so that the behavior is merged with thp.
> 5. smaps stats[7]
Also seems benign. Even if hugetlb merges with generic mm, someone could
propose statistics for hugetlb mappings smaller than the huge page size,
and that would be it. Not directly relevant to core mm, IMHO.
>
> [1]: https://lore.kernel.org/linux-mm/20230218002819.1486479-6-jthoughton@google.com/
> [2]: https://lore.kernel.org/linux-mm/20230306230004.1387007-1-jthoughton@google.com/
> [3]: https://lore.kernel.org/linux-mm/20230218002819.1486479-10-jthoughton@google.com/
> [4]: https://lore.kernel.org/linux-mm/20230218002819.1486479-35-jthoughton@google.com/
> [5]: https://lore.kernel.org/linux-mm/20230218002819.1486479-27-jthoughton@google.com/
> [6]: https://lore.kernel.org/linux-mm/20230218002819.1486479-29-jthoughton@google.com/
> [7]: https://lore.kernel.org/linux-mm/20230218002819.1486479-39-jthoughton@google.com/
I haven't read HGM for a long time, but afair a few hundred LOC lie in
the pgtable walking changes, which should probably be counted as "adding
complexity" if we say hugetlb will converge with core mm one day. That's
one (out of many issues) that Matthew listed in his slides yesterday as
something hugetlb may need to look at for convergence.
Does it mean that this might be a good spot to pay some more attention? I
know this goes back to the very early stage where we were discussing what
would be the best way to walk hugetlb pgtables knowing that we can map 4K
over a 2M, but I think it may be slightly different now: at least we're
clearer that we want to merge that with core mm.
I think it means the possibility to mostly deprecate huge_pte_offset().
James/Mike/anyone, have any of you looked into that area? Would above make
any sense at all?
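To make the idea a bit more concrete, here is a toy user-space model. It
is not kernel code and every name in it is made up for illustration; the
level names and shifts merely mimic x86-64 4-level paging. A single walker
descends level by level and reports the level at which it found a leaf, so
the caller learns the mapping size from the walk itself rather than from a
hugetlb-only lookup helper.

```python
# Toy model of a unified page table walk. Hypothetical names throughout.
LEVELS = [("PGD", 39), ("PUD", 30), ("PMD", 21), ("PTE", 12)]
INDEX_MASK = 0x1FF  # 512 entries per table


class Leaf:
    """A present mapping found at some level of the toy page table."""
    def __init__(self, pfn):
        self.pfn = pfn


def map_at(root, addr, level_name, pfn):
    """Install a leaf for addr at the given level, creating tables on the way."""
    table = root
    for name, shift in LEVELS:
        idx = (addr >> shift) & INDEX_MASK
        if name == level_name:
            table[idx] = Leaf(pfn)
            return
        table = table.setdefault(idx, {})


def walk(root, addr):
    """Return (level_name, shift, leaf) for addr, or None if unmapped."""
    table = root
    for name, shift in LEVELS:
        entry = table.get((addr >> shift) & INDEX_MASK)
        if entry is None:
            return None
        if isinstance(entry, Leaf):
            return name, shift, entry
        table = entry  # descend into the next-level table
    return None


root = {}
map_at(root, 0x200000, "PMD", 0x1000)  # one 2M leaf
map_at(root, 0x400000, "PTE", 0x2000)  # one 4K leaf inside the next 2M range

print(walk(root, 0x205000)[:2])
print(walk(root, 0x400000)[:2])
print(walk(root, 0x401000))
```

The same walk() serves both the 2M leaf and the 4K leaf mapped inside a
2M-aligned range; the returned shift tells the caller how large the
mapping is.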
Thanks,
--
Peter Xu
* [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday
@ 2023-06-13 2:01 David Rientjes
0 siblings, 0 replies; 10+ messages in thread
From: David Rientjes @ 2023-06-13 2:01 UTC (permalink / raw)
To: Andrew Morton, Dave Hansen, David Hildenbrand, Hugh Dickins,
James Houghton, Johannes Weiner, John Hubbard, linux-mm,
Matthew Wilcox, Mel Gorman, Michal Hocko, Mike Kravetz,
Mike Rapoport, Pasha Tatashin, Peter Xu, Rao, Bharata Bhasker,
Rik van Riel, Roman Gushchin, Shakeel Butt, Shutemov, Kirill,
Tejun Heo, Vlastimil Babka, Yang Shi, Zi Yan
[-- Attachment #1: Type: text/plain, Size: 1683 bytes --]
Hi everybody,
We host a biweekly series, the Linux MM Alignment Session, on Wednesdays.
We'd like to invite MM developers to attend and will announce the topic
for the next instance on the Monday prior to the next meeting.
Our next Linux MM Alignment Session is scheduled for Wednesday. The
details:
Wednesday, June 14 · 9:00 – 10:00am PDT (GMT-7)
https://meet.google.com/csb-wcds-xya
backup: (US) +1 347-682-5874 PIN: 356 962 072#
international: https://tel.meet/csb-wcds-xya?pin=1301132214803
This week's topic will be a technical brainstorming session on HugeTLB
convergence with the core MM. This has been discussed most recently in
this thread:
https://lore.kernel.org/linux-mm/ZIOEDTUBrBg6tepk@casper.infradead.org/T/
As discussed there, it would be useful to discuss areas where the HugeTLB
subsystem could be unified with the core MM. This includes both
short-term and long-term objectives to improve reliability and
maintainability:
- Concrete areas where hugetlb and core MM can converge (pagewalks, for
example)
- Incremental progress on the above
- Long-term plans for HugeTLB reservations
- Any other ideas?
This is intended to be for technical brainstorming, i.e. surface ideas
where the HugeTLB subsystem and core MM could share a common framework and
reduce HugeTLB's "special casing."
It would be great to align on common goals and discuss division of work as
well as timelines.
Also: if anybody has ideas for future topics, please let me know and I'll
try to organize them. We'd love to have volunteers to lead future topics
as well as requests for MM topics to be presented.
Looking forward to seeing all of you on Wednesday!
Thread overview: 10+ messages
[not found] <c5afdf35-a5fa-03e2-348d-cf1d990fc389@google.com>
[not found] ` <20230614230458.GB3559@monkey>
2023-06-15 1:12 ` [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday David Rientjes
2023-06-15 8:04 ` Michal Hocko
2023-06-15 8:29 ` David Hildenbrand
2023-06-15 17:24 ` James Houghton
2023-06-15 18:58 ` Peter Xu
2023-06-15 18:31 ` Mike Kravetz
2023-06-15 17:00 ` James Houghton
2023-06-15 17:18 ` Matthew Wilcox
2023-06-15 17:59 ` Mike Kravetz
2023-06-13 2:01 David Rientjes