From: Mike Kravetz <mike.kravetz@oracle.com>
To: David Hildenbrand <david@redhat.com>
Cc: Yosry Ahmed <yosryahmed@google.com>,
David Rientjes <rientjes@google.com>,
James Houghton <jthoughton@google.com>,
Naoya Horiguchi <naoya.horiguchi@nec.com>,
Miaohe Lin <linmiaohe@huawei.com>,
lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
Peter Xu <peterx@redhat.com>, Michal Hocko <mhocko@suse.com>,
Matthew Wilcox <willy@infradead.org>,
Axel Rasmussen <axelrasmussen@google.com>,
Jiaqi Yan <jiaqiyan@google.com>
Subject: Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
Date: Wed, 7 Jun 2023 15:06:51 -0700
Message-ID: <20230607220651.GC4122@monkey>
In-Reply-To: <75d5662a-a901-1e02-4706-66545ad53c5c@redhat.com>

On 06/07/23 10:13, David Hildenbrand wrote:
> On 07.06.23 09:51, Yosry Ahmed wrote:
> > On Wed, Jun 7, 2023 at 12:38 AM David Hildenbrand <david@redhat.com> wrote:
> > >
> > > On 07.06.23 00:40, David Rientjes wrote:
> > > > On Fri, 2 Jun 2023, Mike Kravetz wrote:
> > > >
> > > > > The benefit of HGM in the case of memory errors is fairly obvious. As
> > > > > mentioned above, when a memory error is encountered on a hugetlb page,
> > > > > that entire hugetlb page becomes inaccessible to the application. Losing
> > > > > 1G or even 2M of data is often catastrophic for an application. There
> > > > > is often no way to recover. It just makes sense that recovering from
> > > > > the loss of 4K of data would generally be easier and more likely to be
> > > > > possible. Today, when Oracle DB encounters a hard memory error on a
> > > > > hugetlb page, it will shut down. Plans are currently in place to repair
> > > > > and recover from such errors if possible. Isolating the area of data
> > > > > loss to a single 4K page significantly increases the likelihood of
> > > > > repair and recovery.
> > > > >
> > > > > Today, when a memory error is encountered on a hugetlb page an
> > > > > application is 'notified' of the error by a SIGBUS, as well as the
> > > > > virtual address of the hugetlb page and its size. This makes sense, as
> > > > > hugetlb pages are mapped by a single page table entry, so you get all
> > > > > or nothing. As mentioned by James above, this is catastrophic for VMs
> > > > > as the hypervisor has just been told that 2M or 1G is now inaccessible.
> > > > > With HGM, we can isolate such errors to 4K.
> > > > >
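
To make that notification concrete: below is a minimal sketch, in plain C, of
what an application can see today. This is illustrative only (fprintf()/exit()
are not async-signal-safe, and real recovery code would do far more than log):

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static void sigbus_handler(int sig, siginfo_t *info, void *uc)
{
        (void)sig;
        (void)uc;

        if (info->si_code == BUS_MCEERR_AR || info->si_code == BUS_MCEERR_AO) {
                /* Bytes lost: 1G (lsb 30) or 2M (lsb 21) today;
                 * 4K (lsb 12) is what HGM would allow. */
                size_t lost = (size_t)1 << info->si_addr_lsb;

                fprintf(stderr, "memory error at %p, %zu bytes inaccessible\n",
                        info->si_addr, lost);
        }
        exit(1);
}

int main(void)
{
        struct sigaction sa = {
                .sa_sigaction = sigbus_handler,
                .sa_flags = SA_SIGINFO,
        };

        sigaction(SIGBUS, &sa, NULL);
        /* ... run against hugetlb-backed memory; recovery would go here ... */
        return 0;
}
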
> > > > > Backing VMs with hugetlb pages is a real use case today. We are seeing
> > > > > memory errors on such hugetlb pages, resulting in VM failures.
> > > > > One of the advantages of backing VMs with THPs is that they are split in
> > > > > the case of memory errors. HGM would allow similar functionality.
> > > >
> > > > Thanks for this context, Mike, it's very useful.
> > > >
> > > > I think everybody is aligned on the desire to map memory at smaller
> > > > granularities for multiple use cases and it's fairly clear that these use
> > > > cases are critically important to multiple stakeholders.
> > > >
> > > > I think the open question is whether this functionality is supported in
> > > > hugetlbfs (like with HGM) or whether there is a hard requirement that we must
> > > > use THP for this support.
> > > >
> > > > I don't think that hugetlbfs is feature frozen, but if there's a strong
> > > > bias toward not merging additional complexity into the subsystem that
> > > > would be useful to know. I personally think the critical use cases described
> > >
> > > At least I, attending that session, thought it was clear that the
> > > majority of the people speaking up expressed "no more added
> > > complexity". So I think there is a strong bias, at least from the
> > > people attending that session.
> > >
> > >
> > > > above justify the added complexity of HGM to hugetlb and we wouldn't be
> > > > blocked by the long-standing (15+ years) desire to mesh hugetlb into the
> > > > core MM subsystem before we can stop the pain associated with memory
> > > > poisoning and live migration.
> > > >
> > > > Are there strong objections to extending hugetlb for this support?
> > >
> > > I don't want to get too involved in this discussion (busy), but I
> > > absolutely agree on the points that were raised at LSF/MM that
> > >
> > > (A) hugetlb is complicated and very special (many things not integrated
> > > with core-mm, so we need special-casing all over the place). [example:
> > > what is a pte?]
> > >
> > > (B) We added a bunch of complexity in the past that some people
> > > considered very important (and it was not feature frozen, right? ;) ).
> > > Looking back, we might just not have done some of that, or done it
> > > differently/cleaner -- better integrated in the core. (PMD sharing,
> > > MAP_PRIVATE, a reservation mechanism that still requires preallocation
> > > because it fails with NUMA/fork, ...)
> > >
> > > (C) Unifying hugetlb and the core looks like it's getting more and more
> > > out of reach, maybe even impossible with all the complexity we added
> > > over the years (well, and keep adding).
> > >
> > > Sure, HGM for the purpose of better hwpoison handling makes sense. But
> > > hugetlb is probably 20 years old and hwpoison handling probably 13 years
> > > old. So we managed to get quite far without that optimization.
> > >
> > > Absolutely, HGM for better postcopy live migration also makes sense, I
> > > guess nobody disagrees on that.
> > >
> > >
> > > But as discussed in that session, maybe we should just start anew and
> > > implement something that integrates nicely with the core, instead of
> > > making hugetlb more complicated and even more special.
> > >
> > >
> > > Now, we all know nobody wants to do the heavy lifting for that; that's
> > > why we're discussing how to get in yet another complicated feature.
> >
> > If nobody wants to do the heavy lifting and unifying hugetlb with core
> > MM is becoming impossible as you state, then does adding another
> > feature to hugetlb (that we all agree is useful for multiple
> > use cases) really make things worse? In other words, if someone
>
> Well, if we (as a community) reject more complexity and outline an
> alternative of what would be acceptable (rewrite), people that really want
> these new features will *have to* do the heavy lifting.
>
> [and I see many people from employers that might have the capacity to do the
> heavy lifting if really required being involved in the discussion around HGM
> :P ]
>
> > decides tomorrow to do the heavy lifting, how much harder does this
> > become because of HGM, if any?
> >
> > I am the farthest from being an expert here, just an
> > observer, but if the answer to the above question is "HGM doesn't
> > actually make it worse" or "HGM only slightly makes things harder",
> > then I naively think that it's something that we should do, from a
> > pure cost-benefit analysis.
>
> Well, there is always the "maintainability" aspect, because upstream has to
> maintain whatever complexity gets merged. No matter what, we'll have to keep
> maintaining the current set of hugetlb features until we can eventually
> deprecate some or all of them in the far, far future.
>
> I, for my part, am happy as long as I can stay as far away as possible from
> hugetlb code. Again, Mike is the maintainer.

Thanks for the reminder :)

Maintainability is my primary concern with HGM. That is one of the reasons
I proposed that James pitch the topic at LSF/MM. Even though I am the
'maintainer', changes introduced by HGM will impact others working in mm.

> What I saw so far regarding HGM does not count as "slightly makes things
> harder".
>
> > Again, I don't have a lot of context here, and I understand everyone's
> > frustration with the current state of hugetlb. Just my 2 cents.
>
> The thing is, we all agree that something that hugetlb provides is valuable
> (i.e., a pool of huge/large pages that we can map large), just that after 20
> years there might be better ways of doing it and integrating it better with
> core-mm.

I am struggling with how to support existing hugetlb users that are running
into issues like memory errors on hugetlb pages today. And, yes, that is a
source of real customer issues. They are not really happy with the current
behavior, where a single error takes out an entire 1G page, and with it their
VM or application. Moving to THP is not likely, as they really want a
pre-allocated pool of 1G pages. I just don't have a good answer for them.
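
For reference, the usage pattern those customers rely on looks roughly like
the sketch below. It is illustrative only and assumes the 1G pool was reserved
up front (e.g. hugepagesz=1G hugepages=N on the kernel command line); THP has
no equivalent of this guaranteed, pre-allocated 1G backing:

#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)     /* log2(1G) in the flag bits */
#endif

int main(void)
{
        size_t len = 1UL << 30;                 /* one gigantic page */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                         -1, 0);

        if (buf == MAP_FAILED) {
                perror("mmap");                 /* no 1G pages reserved? */
                return 1;
        }

        /* Today a single hard memory error anywhere in buf poisons the
         * entire 1G mapping; with HGM the loss could be limited to 4K. */
        munmap(buf, len);
        return 0;
}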
--
Mike Kravetz