* Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday
       [not found] ` <20230614230458.GB3559@monkey>
@ 2023-06-15  1:12   ` David Rientjes
  2023-06-15  8:04   ` Michal Hocko
  1 sibling, 0 replies; 10+ messages in thread
From: David Rientjes @ 2023-06-15  1:12 UTC
  To: Mike Kravetz
  Cc: linux-mm, David Hildenbrand, James Houghton, John Hubbard,
      Matthew Wilcox, Michal Hocko, Peter Xu, Vlastimil Babka, Zi Yan

On Wed, 14 Jun 2023, Mike Kravetz wrote:

> On 06/12/23 18:59, David Rientjes wrote:
> > This week's topic will be a technical brainstorming session on HugeTLB
> > convergence with the core MM. This has been discussed most recently in
> > this thread:
> > https://lore.kernel.org/linux-mm/ZIOEDTUBrBg6tepk@casper.infradead.org/T/
>
> Thank you David for putting this session together! And, thanks to everyone
> who participated.
>
> Following up on linux-mm with most active participants on Cc (sorry if I
> missed someone). If it makes more sense to continue the above thread,
> please move there.
>

Thank *you* for keeping the conversation going. And, yes, thanks to
everybody that participated today and especially Matthew for preparing
content on his thoughts and ideas on convergence opportunities.

> Even though everyone knows that hugetlb is special cased throughout the
> core mm, it came to a head with the proposed introduction of HGM. TBH,
> few people in the core mm community paid much attention to HGM when first
> introduced. A LSF/MM session was then dedicated to the discussion of
> HGM with the outcome being the suggestion to create a new filesystem/driver
> (hugetlb2 if you will) that would satisfy the use cases requiring HGM.
> One thing that was not emphasized at LSF/MM is that there are existing
> hugetlb users experiencing major issues that could be addressed with HGM:
> specifically the issues of memory errors and live migration. That was
> the starting point for recent discussion in the above thread.
>
> I may be wrong, but it appeared the direction of that thread was to
> first try and unify some of the hugetlb and core mm code. Eliminate
> some of the special casing. If hugetlb was less of a special case, then
> perhaps HGM would be more acceptable. That is the impression I (perhaps
> incorrectly) had going into today's session.
>

This matches my understanding as well. The above thread surfaced some
great areas of improvement for hugetlb and the big idea was to surface
those, do some technical brainstorming, and think about both short-term
and long-term goals.

There *are* complexities for some of the convergence opportunities that
were discussed, but all of them would improve (imo) both maintainability
and reliability.

> During today's session, we often discussed what would/could be introduced
> in a hugetlb v2. The idea is that this would be the ideal place for HGM.
> However, people also made the comparisons to cgroup v1 - v2. Such a
> redesign provides the needed 'clean slate' to do things right, but it
> does little for existing users who would be unwilling to quickly move off
> existing hugetlb.
>
> We did spend a good chunk of time on hugetlb/core mm unification and
> removing special casing. In some (most) of these cases, the benefit of
> removing special cases from core mm would result in adding more code to
> hugetlb. For example: proper typing so that hugetlb does not treat
> all page table entries as PTEs. Again, I may be wrong but I think
> people were OK with adding more code (and even complexity) to hugetlb
> if it eliminated special casing in the core mm. But, there did not
> seem to be a clear consensus especially with the thought that we may
> need to double hugetlb code to get types right.
>
> Unless I missed something, there was no clear direction at the end of this
> session. I was hoping that we could come up with a plan to address the
> issues facing today's hugetlb users.
> IMO, there seem to be two options:
> 1) Start work on hugetlb v2 with the intention that customers will need
>    to move to this to address their issues.
> 2) Incorporate functionality like HGM into existing hugetlb.
>

To address existing customer pain for 1GB memory poisoning and post-copy
live migration, yeah, I think these are the only two possible paths
forward.

> My opinion is that adding HGM to existing hugetlb is the only way we
> will be able to address issues for current users in a timely manner.
> The session today (and email thread) point out the ugliness and
> difficulty with hugetlb special casing in the core mm. Therefore,
> adding HGM (or any new code) to hugetlb should not introduce new special
> cases to core mm. I know the latest version of HGM does introduce new
> special cases. I am not sure if those can be reduced or eliminated.
> My suggestion for a direction forward would be to add HGM to existing
> hugetlb with no or minimal new special casing. In parallel work could
> begin on hugetlb v2.

I'd agree, and I think the complexities of HGM are largely constrained to
hugetlb. Reducing the special casing could have a concrete path forward
that can be iterated on. That very specific and concrete feedback would
be extremely valuable.
* Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday
       [not found] ` <20230614230458.GB3559@monkey>
  2023-06-15  1:12   ` [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday David Rientjes
@ 2023-06-15  8:04   ` Michal Hocko
  2023-06-15  8:29     ` David Hildenbrand
  2023-06-15 17:00     ` James Houghton
  1 sibling, 2 replies; 10+ messages in thread
From: Michal Hocko @ 2023-06-15  8:04 UTC
  To: Mike Kravetz
  Cc: David Rientjes, linux-mm, David Hildenbrand, James Houghton,
      John Hubbard, Matthew Wilcox, Peter Xu, Vlastimil Babka, Zi Yan

On Wed 14-06-23 16:04:58, Mike Kravetz wrote:
> On 06/12/23 18:59, David Rientjes wrote:
> > This week's topic will be a technical brainstorming session on HugeTLB
> > convergence with the core MM. This has been discussed most recently in
> > this thread:
> > https://lore.kernel.org/linux-mm/ZIOEDTUBrBg6tepk@casper.infradead.org/T/
>
> Thank you David for putting this session together! And, thanks to everyone
> who participated.
>
> Following up on linux-mm with most active participants on Cc (sorry if I
> missed someone). If it makes more sense to continue the above thread,
> please move there.
>
> Even though everyone knows that hugetlb is special cased throughout the
> core mm, it came to a head with the proposed introduction of HGM. TBH,
> few people in the core mm community paid much attention to HGM when first
> introduced. A LSF/MM session was then dedicated to the discussion of
> HGM with the outcome being the suggestion to create a new filesystem/driver
> (hugetlb2 if you will) that would satisfy the use cases requiring HGM.
> One thing that was not emphasized at LSF/MM is that there are existing
> hugetlb users experiencing major issues that could be addressed with HGM:
> specifically the issues of memory errors and live migration. That was
> the starting point for recent discussion in the above thread.
>
> I may be wrong, but it appeared the direction of that thread was to
> first try and unify some of the hugetlb and core mm code. Eliminate
> some of the special casing. If hugetlb was less of a special case, then
> perhaps HGM would be more acceptable. That is the impression I (perhaps
> incorrectly) had going into today's session.

My impression from the discussion yesterday was that the level of
unification would need to be really large and time-consuming in order to
be useful for the HGM patchset to be in a more maintainable form. The
final outcome is quite hard to predict at this stage.

> During today's session, we often discussed what would/could be introduced
> in a hugetlb v2. The idea is that this would be the ideal place for HGM.
> However, people also made the comparisons to cgroup v1 - v2. Such a
> redesign provides the needed 'clean slate' to do things right, but it
> does little for existing users who would be unwilling to quickly move off
> existing hugetlb.
>
> We did spend a good chunk of time on hugetlb/core mm unification and
> removing special casing. In some (most) of these cases, the benefit of
> removing special cases from core mm would result in adding more code to
> hugetlb. For example: proper typing so that hugetlb does not treat
> all page table entries as PTEs. Again, I may be wrong but I think
> people were OK with adding more code (and even complexity) to hugetlb
> if it eliminated special casing in the core mm. But, there did not
> seem to be a clear consensus especially with the thought that we may
> need to double hugetlb code to get types right.

This is primarily your call as a maintainer. If you ask me, hugetlb is
overcomplicated in its current form already. Regressions are not rare
when code is added, which is a signal we are hitting maintenance cost
walls. This doesn't mean further development is impossible of course but
it is increasingly costly AFAICS.
> Unless I missed something, there was no clear direction at the end of this
> session. I was hoping that we could come up with a plan to address the
> issues facing today's hugetlb users. IMO, there seem to be two options:
> 1) Start work on hugetlb v2 with the intention that customers will need
>    to move to this to address their issues.
> 2) Incorporate functionality like HGM into existing hugetlb.

From the memcg experience I can tell that cleaning up interfaces and
supported scenarios helped a lot. We still have to maintain v1 and will
have to for the foreseeable future and beyond, but it is much easier to
build new functionality on top of v2 without struggling over how to
hammer it into v1, where it is much easier to generate corner cases.

I fully understand that this is not really a great approach for users
from the short-term POV because they have to adapt, but I am pretty sure
that they would also appreciate long-term stable and regression-free
code.

It is my understanding that most HGM users do not really benefit from
most hugetlb features (shared page tables, reservation code etc.) and
from that POV it makes some sense to start from a simpler code base.

I do recognize there are other users who would like their existing
hugetlb setups to keep working and to benefit from better support. Page
poisoning comes to mind, but is this really that easy without
considerable changes on the user space side? Memory error recovery
problems are really tough to deal with AFAICS. Do we have any references
to an existing userspace that would be able to deal with holes in its
hugetlb pages? Or is this a chicken&egg problem? Have we exhausted
potential (even if coarse) solutions that wouldn't require breaking
hugetlb page tables?
-- 
Michal Hocko
SUSE Labs
* Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday
  2023-06-15  8:04   ` Michal Hocko
@ 2023-06-15  8:29     ` David Hildenbrand
  2023-06-15 17:24       ` James Houghton
  2023-06-15 18:31       ` Mike Kravetz
  2023-06-15 17:00     ` James Houghton
  1 sibling, 2 replies; 10+ messages in thread
From: David Hildenbrand @ 2023-06-15  8:29 UTC
  To: Michal Hocko, Mike Kravetz
  Cc: David Rientjes, linux-mm, James Houghton, John Hubbard,
      Matthew Wilcox, Peter Xu, Vlastimil Babka, Zi Yan

On 15.06.23 10:04, Michal Hocko wrote:
> On Wed 14-06-23 16:04:58, Mike Kravetz wrote:
>> On 06/12/23 18:59, David Rientjes wrote:
>>> This week's topic will be a technical brainstorming session on HugeTLB
>>> convergence with the core MM. This has been discussed most recently in
>>> this thread:
>>> https://lore.kernel.org/linux-mm/ZIOEDTUBrBg6tepk@casper.infradead.org/T/
>>
>> Thank you David for putting this session together! And, thanks to everyone
>> who participated.
>>
>> Following up on linux-mm with most active participants on Cc (sorry if I
>> missed someone). If it makes more sense to continue the above thread,
>> please move there.
>>
>> Even though everyone knows that hugetlb is special cased throughout the
>> core mm, it came to a head with the proposed introduction of HGM. TBH,
>> few people in the core mm community paid much attention to HGM when first
>> introduced. A LSF/MM session was then dedicated to the discussion of
>> HGM with the outcome being the suggestion to create a new filesystem/driver
>> (hugetlb2 if you will) that would satisfy the use cases requiring HGM.
>> One thing that was not emphasized at LSF/MM is that there are existing
>> hugetlb users experiencing major issues that could be addressed with HGM:
>> specifically the issues of memory errors and live migration. That was
>> the starting point for recent discussion in the above thread.
>>
>> I may be wrong, but it appeared the direction of that thread was to
>> first try and unify some of the hugetlb and core mm code. Eliminate
>> some of the special casing. If hugetlb was less of a special case, then
>> perhaps HGM would be more acceptable. That is the impression I (perhaps
>> incorrectly) had going into today's session.
>
> My impression from the discussion yesterday was that the level of
> unification would need to be really large and time-consuming in order to
> be useful for the HGM patchset to be in a more maintainable form. The
> final outcome is quite hard to predict at this stage.
>
>> During today's session, we often discussed what would/could be introduced
>> in a hugetlb v2. The idea is that this would be the ideal place for HGM.
>> However, people also made the comparisons to cgroup v1 - v2. Such a
>> redesign provides the needed 'clean slate' to do things right, but it
>> does little for existing users who would be unwilling to quickly move off
>> existing hugetlb.
>>
>> We did spend a good chunk of time on hugetlb/core mm unification and
>> removing special casing. In some (most) of these cases, the benefit of
>> removing special cases from core mm would result in adding more code to
>> hugetlb. For example: proper typing so that hugetlb does not treat
>> all page table entries as PTEs. Again, I may be wrong but I think
>> people were OK with adding more code (and even complexity) to hugetlb
>> if it eliminated special casing in the core mm. But, there did not
>> seem to be a clear consensus especially with the thought that we may
>> need to double hugetlb code to get types right.
>
> This is primarily your call as a maintainer. If you ask me, hugetlb is
> overcomplicated in its current form already. Regressions are not rare
> when code is added, which is a signal we are hitting maintenance cost
> walls. This doesn't mean further development is impossible of course but
> it is increasingly costly AFAICS.
>
>> Unless I missed something, there was no clear direction at the end of this
>> session. I was hoping that we could come up with a plan to address the
>> issues facing today's hugetlb users. IMO, there seem to be two options:
>> 1) Start work on hugetlb v2 with the intention that customers will need
>>    to move to this to address their issues.
>> 2) Incorporate functionality like HGM into existing hugetlb.
>

I fully agree with all that Michal said.

I'm just going to add that I don't see why anyone would look into a
hugetlbv2 if we're going to use the motivation of "help existing users"
to make hugetlb ever-more complicated and special. "existing users" here
even meaning "people use hugetlb for backing VMs. Now they want to get
postcopy working with less latency." -- which I consider partially a new
use case.

So working on adding HGM and concurrently starting a hugetlbv2? I don't
think that will happen if we decide on adding HGM and proceeding with
that reasoning about existing users.

As expressed yesterday, I don't see a fast and clean way to make hugetlb
significantly less special (thanks Willy for the list of odd cases).

Sure, we can talk about adding pte_t safety, but I don't really see a
way forward to unify page table walking code that way -- there are still
the (PT) locking, PMD sharing, PTE-cont special cases ... but sure, if
anybody wants to work on that, why not.

Having said that, like Michal, I acknowledge that it is Mike's call
regarding the hugetlb code. I, for my part, will push back on any added
core-mm complexity that adds more special casing for hugetlb. Maybe
there are easy ways to integrate it nicely and that is not really a
concern.

Note that while we've been discussing how HGM would already interfere
with core-mm, we've not even started discussing how actual
MADV_SPLIT/MADV_COLLAPSE/page poisoning ... would affect core-mm and
require special-casing for hugetlb.

I, for my part, will explore the mapcount topic a bit (as time permits)
and see if we can at least come up with a unified mapcount approach
(e.g., sub-page mapcount?). But I suspect even figuring that out will
take quite a while already ...

-- 
Cheers,

David / dhildenb
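
[Editorial illustration, not part of the original message.] The "pte_t
safety" and "proper typing" points above can be pictured with a small
userspace sketch. As far as I know the kernel already wraps page table
values in single-member structs (pte_t, pmd_t and so on), while hugetlb
code passes around pte_t pointers for every level. The names toy_pte_t,
toy_pmd_t and the helpers below are hypothetical and exist only for this
sketch; 4 KiB base pages and 2 MiB PMD mappings are assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wrapper types: identical layout, but distinct to the compiler. */
typedef struct { uint64_t val; } toy_pte_t;
typedef struct { uint64_t val; } toy_pmd_t;

#define TOY_PAGE_SHIFT 12u /* assumed 4 KiB base page */
#define TOY_PMD_SHIFT  21u /* assumed 2 MiB PMD-level mapping */

/* Page frame number stored in a PTE-level entry (flag bits ignored here). */
static inline uint64_t toy_pte_pfn(toy_pte_t pte)
{
	return pte.val >> TOY_PAGE_SHIFT;
}

/* Size mapped by one entry; the answer follows from the type, not a flag. */
static inline uint64_t toy_pte_map_size(toy_pte_t pte)
{
	(void)pte;
	return 1ULL << TOY_PAGE_SHIFT;
}

static inline uint64_t toy_pmd_map_size(toy_pmd_t pmd)
{
	(void)pmd;
	return 1ULL << TOY_PMD_SHIFT;
}

/*
 * toy_pte_map_size(some_pmd) does not compile, because toy_pmd_t and
 * toy_pte_t are different struct types. Treating every level as a PTE
 * gives up exactly this check.
 */
```

The cost mentioned in the discussion is visible even here: each level
needs its own set of helpers, which is where the "double hugetlb code to
get types right" concern comes from.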
* Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday
  2023-06-15  8:29     ` David Hildenbrand
@ 2023-06-15 17:24       ` James Houghton
  2023-06-15 18:58         ` Peter Xu
  2023-06-15 18:31       ` Mike Kravetz
  1 sibling, 1 reply; 10+ messages in thread
From: James Houghton @ 2023-06-15 17:24 UTC
  To: David Hildenbrand
  Cc: Michal Hocko, Mike Kravetz, David Rientjes, linux-mm, John Hubbard,
      Matthew Wilcox, Peter Xu, Vlastimil Babka, Zi Yan

On Thu, Jun 15, 2023 at 1:30 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 15.06.23 10:04, Michal Hocko wrote:
> > On Wed 14-06-23 16:04:58, Mike Kravetz wrote:
> >> On 06/12/23 18:59, David Rientjes wrote:
> >>> This week's topic will be a technical brainstorming session on HugeTLB
> >>> convergence with the core MM. This has been discussed most recently in
> >>> this thread:
> >>> https://lore.kernel.org/linux-mm/ZIOEDTUBrBg6tepk@casper.infradead.org/T/
> >>
> >> Thank you David for putting this session together! And, thanks to everyone
> >> who participated.
> >>
> >> Following up on linux-mm with most active participants on Cc (sorry if I
> >> missed someone). If it makes more sense to continue the above thread,
> >> please move there.
> >>
> >> Even though everyone knows that hugetlb is special cased throughout the
> >> core mm, it came to a head with the proposed introduction of HGM. TBH,
> >> few people in the core mm community paid much attention to HGM when first
> >> introduced. A LSF/MM session was then dedicated to the discussion of
> >> HGM with the outcome being the suggestion to create a new filesystem/driver
> >> (hugetlb2 if you will) that would satisfy the use cases requiring HGM.
> >> One thing that was not emphasized at LSF/MM is that there are existing
> >> hugetlb users experiencing major issues that could be addressed with HGM:
> >> specifically the issues of memory errors and live migration. That was
> >> the starting point for recent discussion in the above thread.
> >>
> >> I may be wrong, but it appeared the direction of that thread was to
> >> first try and unify some of the hugetlb and core mm code. Eliminate
> >> some of the special casing. If hugetlb was less of a special case, then
> >> perhaps HGM would be more acceptable. That is the impression I (perhaps
> >> incorrectly) had going into today's session.
> >
> > My impression from the discussion yesterday was that the level of
> > unification would need to be really large and time-consuming in order to
> > be useful for the HGM patchset to be in a more maintainable form. The
> > final outcome is quite hard to predict at this stage.
> >
> >> During today's session, we often discussed what would/could be introduced
> >> in a hugetlb v2. The idea is that this would be the ideal place for HGM.
> >> However, people also made the comparisons to cgroup v1 - v2. Such a
> >> redesign provides the needed 'clean slate' to do things right, but it
> >> does little for existing users who would be unwilling to quickly move off
> >> existing hugetlb.
> >>
> >> We did spend a good chunk of time on hugetlb/core mm unification and
> >> removing special casing. In some (most) of these cases, the benefit of
> >> removing special cases from core mm would result in adding more code to
> >> hugetlb. For example: proper typing so that hugetlb does not treat
> >> all page table entries as PTEs. Again, I may be wrong but I think
> >> people were OK with adding more code (and even complexity) to hugetlb
> >> if it eliminated special casing in the core mm. But, there did not
> >> seem to be a clear consensus especially with the thought that we may
> >> need to double hugetlb code to get types right.
> >
> > This is primarily your call as a maintainer. If you ask me, hugetlb is
> > overcomplicated in its current form already. Regressions are not rare
> > when code is added, which is a signal we are hitting maintenance
> > cost walls.
> > This doesn't mean further development is impossible of
> > course but it is increasingly costly AFAICS.
> >
> >> Unless I missed something, there was no clear direction at the end of this
> >> session. I was hoping that we could come up with a plan to address the
> >> issues facing today's hugetlb users. IMO, there seem to be two options:
> >> 1) Start work on hugetlb v2 with the intention that customers will need
> >>    to move to this to address their issues.
> >> 2) Incorporate functionality like HGM into existing hugetlb.
> >
>
> I fully agree with all that Michal said.
>
> I'm just going to add that I don't see why anyone would look into a
> hugetlbv2 if we're going to use the motivation of "help existing users"
> to make hugetlb ever-more complicated and special. "existing users" here
> even meaning "people use hugetlb for backing VMs. Now they want to get
> postcopy working with less latency." -- which I consider partially a new
> use case.
>
> So working on adding HGM and concurrently starting a hugetlbv2? I don't
> think that will happen if we decide on adding HGM and proceeding with
> that reasoning about existing users.
>
> As expressed yesterday, I don't see a fast and clean way to make hugetlb
> significantly less special (thanks Willy for the list of odd cases).
>
> Sure, we can talk about adding pte_t safety, but I don't really see a
> way forward to unify page table walking code that way -- there are still
> the (PT) locking, PMD sharing, PTE-cont special cases ... but sure, if
> anybody wants to work on that, why not.
>
> Having said that, like Michal, I acknowledge that it is Mike's call
> regarding the hugetlb code. I, for my part, will push back on any added
> core-mm complexity that adds more special casing for hugetlb. Maybe
> there are easy ways to integrate it nicely and that is not really a concern.

HGM is mostly contained in the already-existing HugeTLB special cases.
HGM doesn't really *add* special cases; it just makes the HugeTLB
special cases more complicated.

There are a few small ways that HGM touches non-hugetlb code:
1. Mapcount (to make hugetlb use the THP scheme) [1], newer version here [2]
2. madvise (to add MADV_SPLIT and update MADV_COLLAPSE) [3] and [4]
3. A small non-hugetlb change to page_vma_mapped_walk (provide pte_order) [5]
4. A small special case in try_to_unmap_one and try_to_migrate_one (to
   check the head page for page flags) [6]
5. smaps stats [7]

[1]: https://lore.kernel.org/linux-mm/20230218002819.1486479-6-jthoughton@google.com/
[2]: https://lore.kernel.org/linux-mm/20230306230004.1387007-1-jthoughton@google.com/
[3]: https://lore.kernel.org/linux-mm/20230218002819.1486479-10-jthoughton@google.com/
[4]: https://lore.kernel.org/linux-mm/20230218002819.1486479-35-jthoughton@google.com/
[5]: https://lore.kernel.org/linux-mm/20230218002819.1486479-27-jthoughton@google.com/
[6]: https://lore.kernel.org/linux-mm/20230218002819.1486479-29-jthoughton@google.com/
[7]: https://lore.kernel.org/linux-mm/20230218002819.1486479-39-jthoughton@google.com/

>
> Note that while we've been discussing how HGM would already interfere
> with core-mm, we've not even started discussing how actual
> MADV_SPLIT/MADV_COLLAPSE/page poisoning ... would affect core-mm and
> require special-casing for hugetlb.
>
> I, for my part, will explore a bit the mapcount topic (as time permits)
> and see if we can come up at least with a unified mapcount approach
> (e.g., sub-page mapcount?). But I suspect even figuring that out will
> take quite a while already ...

Thanks! Simply using the current THP mapcount scheme with HGM isn't
great (but IIUC this isn't blocking HGM). By using this scheme, HugeTLB
loses the vmemmap optimization / page struct freeing when HGM is in use,
and, of course, this scheme gets slow with very large folios.
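
[Editorial illustration, not part of the original message.] To give a
feel for the two costs mentioned at the end of the message above (losing
the vmemmap optimization, and a per-subpage mapcount scheme getting slow
for very large folios), here is a rough back-of-the-envelope sketch in C.
It assumes 4 KiB base pages and a 64-byte struct page, which is common on
x86-64 but configuration dependent. The function names are made up for
this example, and total_mapcount() is only a toy model of a per-subpage
scheme, not the kernel's actual accounting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumptions for illustration only, see the text above. */
#define BASE_PAGE_SIZE   4096ULL
#define STRUCT_PAGE_SIZE 64ULL

/* Number of base pages, and so per-subpage counters, in one huge page. */
static uint64_t subpages_per_hugepage(uint64_t hugepage_size)
{
	assert(hugepage_size % BASE_PAGE_SIZE == 0);
	return hugepage_size / BASE_PAGE_SIZE;
}

/* Bytes of struct page metadata if none of it can be freed. */
static uint64_t vmemmap_bytes(uint64_t hugepage_size)
{
	return subpages_per_hugepage(hugepage_size) * STRUCT_PAGE_SIZE;
}

/* Toy per-subpage scheme: the total is a linear scan over all subpages. */
static uint64_t total_mapcount(const uint32_t *subpage_mapcount, size_t n)
{
	uint64_t total = 0;

	for (size_t i = 0; i < n; i++)
		total += subpage_mapcount[i];
	return total;
}
```

Under these assumptions a 1 GiB page has 262144 subpages and needs
16 MiB of struct page metadata if nothing can be freed, and any operation
that has to visit every subpage counter scales with that number.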
* Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday
  2023-06-15 17:24       ` James Houghton
@ 2023-06-15 18:58         ` Peter Xu
  0 siblings, 0 replies; 10+ messages in thread
From: Peter Xu @ 2023-06-15 18:58 UTC
  To: James Houghton
  Cc: David Hildenbrand, Michal Hocko, Mike Kravetz, David Rientjes,
      linux-mm, John Hubbard, Matthew Wilcox, Vlastimil Babka, Zi Yan

On Thu, Jun 15, 2023 at 10:24:24AM -0700, James Houghton wrote:
> On Thu, Jun 15, 2023 at 1:30 AM David Hildenbrand <david@redhat.com> wrote:
> >
> > On 15.06.23 10:04, Michal Hocko wrote:
> > > On Wed 14-06-23 16:04:58, Mike Kravetz wrote:
> > >> On 06/12/23 18:59, David Rientjes wrote:
> > >>> This week's topic will be a technical brainstorming session on HugeTLB
> > >>> convergence with the core MM. This has been discussed most recently in
> > >>> this thread:
> > >>> https://lore.kernel.org/linux-mm/ZIOEDTUBrBg6tepk@casper.infradead.org/T/
> > >>
> > >> Thank you David for putting this session together! And, thanks to everyone
> > >> who participated.
> > >>
> > >> Following up on linux-mm with most active participants on Cc (sorry if I
> > >> missed someone). If it makes more sense to continue the above thread,
> > >> please move there.
> > >>
> > >> Even though everyone knows that hugetlb is special cased throughout the
> > >> core mm, it came to a head with the proposed introduction of HGM. TBH,
> > >> few people in the core mm community paid much attention to HGM when first
> > >> introduced. A LSF/MM session was then dedicated to the discussion of
> > >> HGM with the outcome being the suggestion to create a new filesystem/driver
> > >> (hugetlb2 if you will) that would satisfy the use cases requiring HGM.
> > >> One thing that was not emphasized at LSF/MM is that there are existing
> > >> hugetlb users experiencing major issues that could be addressed with HGM:
> > >> specifically the issues of memory errors and live migration.
> > >> That was the starting point for recent discussion in the above thread.
> > >>
> > >> I may be wrong, but it appeared the direction of that thread was to
> > >> first try and unify some of the hugetlb and core mm code. Eliminate
> > >> some of the special casing. If hugetlb was less of a special case, then
> > >> perhaps HGM would be more acceptable. That is the impression I (perhaps
> > >> incorrectly) had going into today's session.
> > >
> > > My impression from the discussion yesterday was that the level of
> > > unification would need to be really large and time-consuming in order to
> > > be useful for the HGM patchset to be in a more maintainable form. The
> > > final outcome is quite hard to predict at this stage.
> > >
> > >> During today's session, we often discussed what would/could be introduced
> > >> in a hugetlb v2. The idea is that this would be the ideal place for HGM.
> > >> However, people also made the comparisons to cgroup v1 - v2. Such a
> > >> redesign provides the needed 'clean slate' to do things right, but it
> > >> does little for existing users who would be unwilling to quickly move off
> > >> existing hugetlb.
> > >>
> > >> We did spend a good chunk of time on hugetlb/core mm unification and
> > >> removing special casing. In some (most) of these cases, the benefit of
> > >> removing special cases from core mm would result in adding more code to
> > >> hugetlb. For example: proper typing so that hugetlb does not treat
> > >> all page table entries as PTEs. Again, I may be wrong but I think
> > >> people were OK with adding more code (and even complexity) to hugetlb
> > >> if it eliminated special casing in the core mm. But, there did not
> > >> seem to be a clear consensus especially with the thought that we may
> > >> need to double hugetlb code to get types right.
> > >
> > > This is primarily your call as a maintainer. If you ask me, hugetlb is
> > > overcomplicated in its current form already.
> > > Regressions are not rare when code is added, which is a signal we are
> > > hitting maintenance cost walls. This doesn't mean further development
> > > is impossible of course but it is increasingly costly AFAICS.
> > >
> > >> Unless I missed something, there was no clear direction at the end of this
> > >> session. I was hoping that we could come up with a plan to address the
> > >> issues facing today's hugetlb users. IMO, there seem to be two options:
> > >> 1) Start work on hugetlb v2 with the intention that customers will need
> > >>    to move to this to address their issues.
> > >> 2) Incorporate functionality like HGM into existing hugetlb.
> > >
> >
> > I fully agree with all that Michal said.
> >
> > I'm just going to add that I don't see why anyone would look into a
> > hugetlbv2 if we're going to use the motivation of "help existing users"
> > to make hugetlb ever-more complicated and special. "existing users" here
> > even meaning "people use hugetlb for backing VMs. Now they want to get
> > postcopy working with less latency." -- which I consider partially a new
> > use case.
> >
> > So working on adding HGM and concurrently starting a hugetlbv2? I don't
> > think that will happen if we decide on adding HGM and proceeding with
> > that reasoning about existing users.
> >
> > As expressed yesterday, I don't see a fast and clean way to make hugetlb
> > significantly less special (thanks Willy for the list of odd cases).
> >
> > Sure, we can talk about adding pte_t safety, but I don't really see a
> > way forward to unify page table walking code that way -- there are still
> > the (PT) locking, PMD sharing, PTE-cont special cases ... but sure, if
> > anybody wants to work on that, why not.
> >
> > Having said that, like Michal, I acknowledge that it is Mike's call
> > regarding the hugetlb code. I, for my part, will push back on any added
> > core-mm complexity that adds more special casing for hugetlb.
> > Maybe there are easy ways to integrate it nicely and that is not really
> > a concern.
>
> HGM is mostly contained in the already-existing HugeTLB special cases.
> HGM doesn't really *add* special cases, it just makes the HugeTLB
> special cases more complicated.

Maybe we shouldn't count all the changes in the HGM series as "adding
complexity to hugetlb". Some of the changes may still be needed even
if / when there is a hugetlbv2.

IMHO the goal should be to try our best to reduce the ones that are
still needed only because of hugetlb's specialties within v1.
Personally, if anyone can show that they tried their best on that and
made progress getting us closer to a "converged" state, I'll be fine
with merging HGM within v1, provided the special code can also be
reduced to a minimum.

>
> There are a few small ways that HGM touches non-hugetlb code:
> 1. Mapcount (to make hugetlb use the THP scheme) [1], newer version here[2]

This seems totally benign to me, as this switches hugetlb to use thp
mapcounts. I'd say this does not make hugetlb special; instead it moves
hugetlb slightly toward convergence.

If we want v2, we can design whatever better way to do mapcount, maybe
not only for hugetlb but also for thp. But that seems like something
that can always be done on top, after first merging the two major kinds
of large folios.

> 2. madvise (to add MADV_SPLIT and update MADV_COLLAPSE) [3] and [4]

This is needed no matter what; I'd say it does not count as
"over-complicating" hugetlb.

> 3. A small non-hugetlb change to page_vma_mapped_walk (provide pte_order)[5]

A few lines of complexity, maybe not a big issue.

> 4. A small special case in try_to_unmap_one and try_to_migrate_one (to
> check the head page for page flags)[6]

Seems to be an extra special case due to the different handling of
hwpoisoned large pages. Not sure whether it can be worked out from the
memory failure side to merge the behavior with thp.

> 5. smaps stats[7]

Seems also benign.
Even if hugetlb merges with generic mm, someone could propose some statistics over less-than-huge sized hugetlb mapping statistics, then that'll be it. Not directly relevant to core mm, IMHO. > > [1]: https://lore.kernel.org/linux-mm/20230218002819.1486479-6-jthoughton@google.com/ > [2]: https://lore.kernel.org/linux-mm/20230306230004.1387007-1-jthoughton@google.com/ > [3]: https://lore.kernel.org/linux-mm/20230218002819.1486479-10-jthoughton@google.com/ > [4]: https://lore.kernel.org/linux-mm/20230218002819.1486479-35-jthoughton@google.com/ > [5]: https://lore.kernel.org/linux-mm/20230218002819.1486479-27-jthoughton@google.com/ > [6]: https://lore.kernel.org/linux-mm/20230218002819.1486479-29-jthoughton@google.com/ > [7]: https://lore.kernel.org/linux-mm/20230218002819.1486479-39-jthoughton@google.com/ I didn't read HGM for a long time, but afair a few hundreds of LOCs lie in the pgtable walking changes which should probably be accounted into "adding complexity" if we say hugetlb will converge one day with core mm. That's one (out of many issues) that Matthew listed in his slides yesterday that hugetlb may need an eye looking at for convergence. Does it mean that this might be a good spot to pay some more attention? I know this goes back to the very early stage where we were discussing what would be the best way to walk hugetlb pgtable knowing that we can map 4K over a 2M, but I think it may be slightly different: at least we're clearer that now we want to merge that with core mm. I think it means the possibility to mostly deprecate huge_pte_offset(). James/Mike/anyone, have any of you looked into that area? Would above make any sense at all? Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday
  2023-06-15 8:29 ` David Hildenbrand
  2023-06-15 17:24 ` James Houghton
@ 2023-06-15 18:31 ` Mike Kravetz
  1 sibling, 0 replies; 10+ messages in thread
From: Mike Kravetz @ 2023-06-15 18:31 UTC
To: David Hildenbrand
Cc: Michal Hocko, David Rientjes, linux-mm, James Houghton, John Hubbard,
    Matthew Wilcox, Peter Xu, Vlastimil Babka, Zi Yan

On 06/15/23 10:29, David Hildenbrand wrote:
> On 15.06.23 10:04, Michal Hocko wrote:
> > On Wed 14-06-23 16:04:58, Mike Kravetz wrote:
> > > On 06/12/23 18:59, David Rientjes wrote:
> > >
> > > We did spend a good chunk of time on hugetlb/core mm unification and
> > > removing special casing. In some (most) of these cases, the benefit of
> > > removing special cases from core mm would result in adding more code to
> > > hugetlb. For example: proper type'ing so that hugetlb does not treat
> > > all page table entries as PTEs. Again, I may be wrong but I think
> > > people were OK with adding more code (and even complexity) to hugetlb
> > > if it eliminated special casing in the core mm. But, there did not
> > > seem to be a clear consensus especially with the thought that we may
> > > need to double hugetlb code to get types right.
> >
> > This is primarily your call as a maintainer. If you ask me, hugetlb is
> > over complicated in its current form already. Regressions are not really
> > seldom when code is added which is a signal we are hitting maintenance
> > cost walls. This doesn't mean further development is impossible of
> > course but it is increasingly more costly AFAICS.
> >
> > > Unless I missed something, there was no clear direction at the end of this
> > > session. I was hoping that we could come up with a plan to address the
> > > issues facing today's hugetlb users. IMO, there seem to be two options:
> > > 1) Start work on hugetlb v2 with the intention that customers will need
> > >    to move to this to address their issues.
> > > 2) Incorporate functionality like HGM into existing hugetlb.
> >
>
> I fully agree with all that Michal said.
>
> I'm just going to add that I don't see why anyone would look into a
> hugetlbv2 if we're going to use the motivation of "help existing users" to
> make hugetlb ever-more complicated and special. "existing users" here even
> meaning "people use hugetlb for backing VMs. Now they want to get postcopy
> working with less latency." -- which I consider partially a new use case.
>
> So working on adding HGM and concurrently starting a hugetlbv2? I don't
> think that will happen if we decide on adding HGM and proceeding with that
> reasoning about existing users.

I agree that doing both in parallel is not going to happen.

> As expressed yesterday, I don't see a fast and clean way to make hugetlb
> significantly less special (thanks Willy for the list of odd cases).
>
> Sure, we can talk about adding pte_t safety, but I don't really see a way
> forward to unify page table walking code that way -- there are still the
> (PT) locking, PMD sharing, PTE-cont special cases ... but sure, if anybody
> wants to work on that, why not.
>
> Having that said, like Michal, I acknowledge that it is Mike's call regarding
> the hugetlb code. I, for my part, will push back on any added core-mm
> complexity that adds more special casing for hugetlb. Maybe there are easy
> ways to integrate it nicely and that is not really a concern.

And if the call on how to move forward was easy, I would have already made
a decision. :)

I really do appreciate all the input.

It is pretty clear that adding more complex special cases to core mm for
hugetlb is going to be a non-starter. James has talked about any such
special cases for HGM in another reply.

I previously said that I am leaning toward trying to add HGM to existing
hugetlb. This is on the condition that any addition of special cases to
the core mm would be minimal and trivial. In addition, the added complexity
to hugetlb has to be manageable.
--
Mike Kravetz

> Note that while we've been discussing how HGM would already interfere with
> core-mm, we've not even started discussing how actual
> MADV_SPLIT/MADV_COLLAPSE/page poisoning ... would affect core-mm and
> require special-casing for hugetlb.
>
> I, for my part, will explore a bit the mapcount topic (as time permits) and
> see if we can come up at least with a unified mapcount approach (e.g.,
> sub-page mapcount?). But I suspect even figuring that out will take quite a
> while already ...
>
> --
> Cheers,
>
> David / dhildenb
* Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday
  2023-06-15 8:04 ` Michal Hocko
  2023-06-15 8:29 ` David Hildenbrand
@ 2023-06-15 17:00 ` James Houghton
  2023-06-15 17:18 ` Matthew Wilcox
  1 sibling, 1 reply; 10+ messages in thread
From: James Houghton @ 2023-06-15 17:00 UTC
To: Michal Hocko
Cc: Mike Kravetz, David Rientjes, linux-mm, David Hildenbrand, John Hubbard,
    Matthew Wilcox, Peter Xu, Vlastimil Babka, Zi Yan

On Thu, Jun 15, 2023 at 1:04 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 14-06-23 16:04:58, Mike Kravetz wrote:
> > On 06/12/23 18:59, David Rientjes wrote:
> > > This week's topic will be a technical brainstorming session on HugeTLB
> > > convergence with the core MM. This has been discussed most recently in
> > > this thread:
> > > https://lore.kernel.org/linux-mm/ZIOEDTUBrBg6tepk@casper.infradead.org/T/
> >
> > Thank you David for putting this session together! And, thanks to everyone
> > who participated.
> >
> > Following up on linux-mm with most active participants on Cc (sorry if I
> > missed someone). If it makes more sense to continue the above thread,
> > please move there.
> >
> > Even though everyone knows that hugetlb is special cased throughout the
> > core mm, it came to a head with the proposed introduction of HGM. TBH,
> > few people in the core mm community paid much attention to HGM when first
> > introduced. A LSF/MM session was then dedicated to the discussion of
> > HGM with the outcome being the suggestion to create a new filesystem/driver
> > (hugetlb2 if you will) that would satisfy the use cases requiring HGM.
> > One thing that was not emphasized at LSF/MM is that there are existing
> > hugetlb users experiencing major issues that could be addressed with HGM:
> > specifically the issues of memory errors and live migration. That was
> > the starting point for recent discussion in the above thread.
> >
> > I may be wrong, but it appeared the direction of that thread was to
> > first try and unify some of the hugetlb and core mm code. Eliminate
> > some of the special casing. If hugetlb was less of a special case, then
> > perhaps HGM would be more acceptable. That is the impression I (perhaps
> > incorrectly) had going into today's session.
>
> My impression from the discussion yesterday was that the level of
> unification would need to be really large and time consuming in order to
> be useful for the HGM patchset to be in a more maintainable form. The
> final outcome is quite hard to predict at this stage.

I also had this impression, but some of the unification efforts are
pretty independent of HGM (like the PTE/PMD/PUD typing idea). It
doesn't really change HGM all that much.

My understanding is like: do some general unification first, then we
could take HGM. HGM is still going to mostly look the same.

>
> > During today's session, we often discussed what would/could be introduced
> > in a hugetlb v2. The idea is that this would be the ideal place for HGM.
> > However, people also made the comparisons to cgroup v1 - v2. Such a
> > redesign provides the needed 'clean slate' to do things right, but it
> > does little for existing users who would be unwilling to quickly move off
> > existing hugetlb.
> >
> > We did spend a good chunk of time on hugetlb/core mm unification and
> > removing special casing. In some (most) of these cases, the benefit of
> > removing special cases from core mm would result in adding more code to
> > hugetlb. For example: proper type'ing so that hugetlb does not treat
> > all page table entries as PTEs. Again, I may be wrong but I think
> > people were OK with adding more code (and even complexity) to hugetlb
> > if it eliminated special casing in the core mm. But, there did not
> > seem to be a clear consensus especially with the thought that we may
> > need to double hugetlb code to get types right.
>
> This is primarily your call as a maintainer. If you ask me, hugetlb is
> over complicated in its current form already. Regressions are not really
> seldom when code is added which is a signal we are hitting maintenance
> cost walls. This doesn't mean further development is impossible of
> course but it is increasingly more costly AFAICS.
>
> > Unless I missed something, there was no clear direction at the end of this
> > session. I was hoping that we could come up with a plan to address the
> > issues facing today's hugetlb users. IMO, there seem to be two options:
> > 1) Start work on hugetlb v2 with the intention that customers will need
> >    to move to this to address their issues.
> > 2) Incorporate functionality like HGM into existing hugetlb.
>
> From the memcg experience I can tell that cleaning up interfaces and
> supported scenarios helped a lot. We still have to maintain v1 and will
> have to for a foreseeable future and beyond but it is much easier
> to build a new functionality on top of v2 without struggling how to hammer
> that into v1 where it is much easier to generate corner cases.
>
> I fully understand that this is not really a great approach for users
> from the short term POV because they have to adapt but I am pretty sure
> that they would also appreciate long term stable and regression free
> code.
>
> It is my understanding that most HGM users do not really benefit from
> most of hugetlb features (shared page tables, reservation code etc) and
> from that POV it makes some sense to start from a simpler code base.
>
> I do recognize there are other users who would like their existing
> hugetlb setups keep working and benefit from a better support - page
> poisoning comes to mind but is this really that easy without
> considerable changes on the user space side?
>
> Memory error recovery problems are really tough to deal with
> AFAICS. Do we have any references to an existing userspace that would be
> able to deal with holes in their hugetlb pages? Or is this a chicken&egg
> problem? Have we exhausted potential (even if coarse) solutions that
> wouldn't require breaking hugetlb page tables?

For VMs, having a 4K hole in the hugetlb page is how Google emulates
memory poison. (We sorta get what we need with a terrible hack to KVM.
HGM is better in every way.)

There aren't a lot of userspace changes to make this work. It comes down to:
1. Enlighten the vCPU threads (the ones that run KVM_RUN) to handle
   the MCEERR SIGBUSes, and inject MCEs into the guest when this happens.
2. Optionally enlighten any other routines that have a substantial
   likelihood to read poisoned memory (for example, live migration
   routines), and skip the poisoned pieces (by adjusting the program
   counter in the SIGBUS handler).

This isn't theoretical and is implemented today, but I can't easily
show you how Google does it. QEMU does #1 like this[1]. Unmapping in
this case is important; we can't allow the guest to be able to trigger
MCEs at will, as they could easily force the host to crash.

khugepaged was recently updated[2] to better recover from poison, so
VMs that consumed an error because khugepaged (in the guest) was
collapsing memory will be able to legitimately recover.

I can't speak in detail about how databases recover from memory
poison. Mike, maybe you can share more details?

[1]: https://github.com/qemu/qemu/blob/master/target/i386/kvm/kvm.c#L649
[2]: https://lore.kernel.org/linux-mm/20230329151121.949896-1-jiaqiyan@google.com/
* Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday
  2023-06-15 17:00 ` James Houghton
@ 2023-06-15 17:18 ` Matthew Wilcox
  2023-06-15 17:59 ` Mike Kravetz
  0 siblings, 1 reply; 10+ messages in thread
From: Matthew Wilcox @ 2023-06-15 17:18 UTC
To: James Houghton
Cc: Michal Hocko, Mike Kravetz, David Rientjes, linux-mm, David Hildenbrand,
    John Hubbard, Peter Xu, Vlastimil Babka, Zi Yan

On Thu, Jun 15, 2023 at 10:00:41AM -0700, James Houghton wrote:
> I can't speak in detail about how databases recover from memory
> poison. Mike, maybe you can share more details?

Speaking generally (this would describe MySQL as well as any number of
proprietary databases), hugetlbfs is used by databases as a supplier
of memory to their userspace buffer cache. As database blocks are
needed, they're read from storage using O_DIRECT. Depending on the
database, they may or may not be updated in place.

Just as in our page cache, if the hwpoison hits in a clean block, it
can discard the block and re-read it. If it hits in a dirty block, it's
Game Over. Most blocks are clean. A 1GB page contains so many blocks
that taking out the entire 1GB is guaranteed to take out a dirty block.
Taking out a single 4kB page is likely to take out only clean blocks.
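[Editorial illustration, not part of the thread.] The size argument above can be made concrete with a rough back-of-the-envelope sketch. It assumes, purely for illustration, that each cached block is dirty independently with the same probability; the 8 KiB block size and the 1% dirty fraction are hypothetical values, not figures from any database mentioned in the thread. Under that assumption the chance that a poisoned range hits only clean blocks is (1 - p)^n for n blocks touched.

```python
def p_all_clean(poisoned_bytes: int, block_bytes: int, dirty_fraction: float) -> float:
    """Probability that every block overlapping the poisoned range is clean.

    Assumes blocks are dirty independently with probability dirty_fraction
    (an illustrative simplification, not a claim about real workloads).
    """
    # Ceiling division: a range smaller than one block still touches one block.
    blocks = max(1, -(-poisoned_bytes // block_bytes))
    return (1.0 - dirty_fraction) ** blocks


KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
BLOCK = 8 * KIB   # hypothetical database block size
DIRTY = 0.01      # hypothetical: 1% of cached blocks are dirty

for label, size in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB), ("1 GiB", GIB)):
    print(f"{label:>6}: P(all clean) = {p_all_clean(size, BLOCK, DIRTY):.3g}")
```

With these made-up numbers, losing a single 4 KiB page almost always leaves only clean data to re-read, while losing a whole 1 GiB page essentially never does, which is the asymmetry the message describes.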
* Re: [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday
  2023-06-15 17:18 ` Matthew Wilcox
@ 2023-06-15 17:59 ` Mike Kravetz
  0 siblings, 0 replies; 10+ messages in thread
From: Mike Kravetz @ 2023-06-15 17:59 UTC
To: Matthew Wilcox
Cc: James Houghton, Michal Hocko, David Rientjes, linux-mm, David Hildenbrand,
    John Hubbard, Peter Xu, Vlastimil Babka, Zi Yan

On 06/15/23 18:18, Matthew Wilcox wrote:
> On Thu, Jun 15, 2023 at 10:00:41AM -0700, James Houghton wrote:
> > I can't speak in detail about how databases recover from memory
> > poison. Mike, maybe you can share more details?
>
> Speaking generally (this would describe MySQL as well as any number of
> proprietary databases), hugetlbfs is used by databases as a supplier
> of memory to their userspace buffer cache. As database blocks are
> needed, they're read from storage using O_DIRECT. Depending on the
> database, they may or may not be updated in place.
>
> Just as in our page cache, if the hwpoison hits in a clean block, it
> can discard the block and re-read it. If it hits in a dirty block, it's
> Game Over. Most blocks are clean. A 1GB page contains so many blocks
> that taking out the entire 1GB is guaranteed to take out a dirty block.
> Taking out a single 4kB page is likely to take out only clean blocks.

Thanks Matthew.

I do not have details of exactly how this is done. However, this does fit
in with discussions I have had with our DB team. They can possibly
recover using another (older) copy of the data. The smaller the size of
data, the greater the possibility the older data can be used.
--
Mike Kravetz
* [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday
@ 2023-06-13 2:01 David Rientjes
  0 siblings, 0 replies; 10+ messages in thread
From: David Rientjes @ 2023-06-13 2:01 UTC
To: Andrew Morton, Dave Hansen, David Hildenbrand, Hugh Dickins,
    James Houghton, Johannes Weiner, John Hubbard, linux-mm, Matthew Wilcox,
    Mel Gorman, Michal Hocko, Mike Kravetz, Mike Rapoport, Pasha Tatashin,
    Peter Xu, Rao, Bharata Bhasker, Rik van Riel, Roman Gushchin,
    Shakeel Butt, Shutemov, Kirill, Tejun Heo, Vlastimil Babka, Yang Shi,
    Zi Yan

Hi everybody,

We host a biweekly series, the Linux MM Alignment Session, on Wednesdays.
We'd like to invite MM developers to attend and will announce the topic
for the next instance on the Monday prior to the next meeting.

Our next Linux MM Alignment Session is scheduled for Wednesday.

The details:

	Wednesday, June 14 · 9:00 – 10:00am PDT (GMT-7)
	https://meet.google.com/csb-wcds-xya

	backup: (US) +1 347-682-5874 PIN: 356 962 072#
	international: https://tel.meet/csb-wcds-xya?pin=1301132214803

This week's topic will be a technical brainstorming session on HugeTLB
convergence with the core MM. This has been discussed most recently in
this thread:
https://lore.kernel.org/linux-mm/ZIOEDTUBrBg6tepk@casper.infradead.org/T/

As discussed there, it would be useful to discuss areas where the HugeTLB
subsystem could be unified with the core MM. This includes both
short-term and long-term objectives to improve reliability and
maintainability:

 - Concrete areas where hugetlb and core MM can converge (pagewalks, for
   example)
 - Incremental progress on the above
 - Long-term plans for HugeTLB reservations
 - Any other ideas?

This is intended to be for technical brainstorming, i.e. surface ideas
where the HugeTLB subsystem and core MM could share a common framework and
reduce HugeTLB's "special casing." It would be great to align on common
goals and discuss division of work as well as timelines.

Also: if anybody has ideas for future topics, please let me know and I'll
try to organize them. We'd love to have volunteers to lead future topics
as well as requests for MM topics to be presented.

Looking forward to seeing all of you on Wednesday!
Thread overview: 10+ messages
[not found] <c5afdf35-a5fa-03e2-348d-cf1d990fc389@google.com>
[not found] ` <20230614230458.GB3559@monkey>
2023-06-15 1:12 ` [Invitation] Linux MM Alignment Session on HugeTLB Core MM Convergence on Wednesday David Rientjes
2023-06-15 8:04 ` Michal Hocko
2023-06-15 8:29 ` David Hildenbrand
2023-06-15 17:24 ` James Houghton
2023-06-15 18:58 ` Peter Xu
2023-06-15 18:31 ` Mike Kravetz
2023-06-15 17:00 ` James Houghton
2023-06-15 17:18 ` Matthew Wilcox
2023-06-15 17:59 ` Mike Kravetz
2023-06-13 2:01 David Rientjes