From: Mike Kravetz <mike.kravetz@oracle.com>
To: David Hildenbrand <david@redhat.com>
Cc: Yosry Ahmed <yosryahmed@google.com>,
David Rientjes <rientjes@google.com>,
James Houghton <jthoughton@google.com>,
Naoya Horiguchi <naoya.horiguchi@nec.com>,
Miaohe Lin <linmiaohe@huawei.com>,
lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
Peter Xu <peterx@redhat.com>, Michal Hocko <mhocko@suse.com>,
Matthew Wilcox <willy@infradead.org>,
Axel Rasmussen <axelrasmussen@google.com>,
Jiaqi Yan <jiaqiyan@google.com>
Subject: Re: [LSF/MM/BPF TOPIC] HGM for hugetlbfs
Date: Wed, 7 Jun 2023 15:06:51 -0700
Message-ID: <20230607220651.GC4122@monkey>
In-Reply-To: <75d5662a-a901-1e02-4706-66545ad53c5c@redhat.com>

On 06/07/23 10:13, David Hildenbrand wrote:
> On 07.06.23 09:51, Yosry Ahmed wrote:
> > On Wed, Jun 7, 2023 at 12:38 AM David Hildenbrand <david@redhat.com> wrote:
> > >
> > > On 07.06.23 00:40, David Rientjes wrote:
> > > > On Fri, 2 Jun 2023, Mike Kravetz wrote:
> > > >
> > > > > The benefit of HGM in the case of memory errors is fairly obvious. As
> > > > > mentioned above, when a memory error is encountered on a hugetlb page,
> > > > > that entire hugetlb page becomes inaccessible to the application. Losing
> > > > > 1G or even 2M of data is often catastrophic for an application. There
> > > > > is often no way to recover. It just makes sense that recovering from
> > > > > the loss of 4K of data would generally be easier and more likely to be
> > > > > possible. Today, when Oracle DB encounters a hard memory error on a
> > > > > hugetlb page, it will shut down. Plans are currently in place to repair
> > > > > and recover from such errors if possible. Isolating the area of data
> > > > > loss to a single 4K page significantly increases the likelihood of
> > > > > repair and recovery.
> > > > >
> > > > > Today, when a memory error is encountered on a hugetlb page an
> > > > > application is 'notified' of the error by a SIGBUS, as well as the
> > > > > virtual address of the hugetlb page and its size. This makes sense, as
> > > > > hugetlb pages are mapped by a single page table entry, so you get all
> > > > > or nothing. As mentioned by James above, this is catastrophic for VMs
> > > > > as the hypervisor has just been told that 2M or 1G is now inaccessible.
> > > > > With HGM, we can isolate such errors to 4K.
> > > > >
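
To make that notification concrete: below is a minimal sketch, in plain C, of
what an application can see today. This is illustrative only (fprintf()/exit()
are not async-signal-safe, and real recovery code would do far more than log):

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static void sigbus_handler(int sig, siginfo_t *info, void *uc)
{
        (void)sig;
        (void)uc;

        if (info->si_code == BUS_MCEERR_AR || info->si_code == BUS_MCEERR_AO) {
                /* Bytes lost: 1G (lsb 30) or 2M (lsb 21) today;
                 * 4K (lsb 12) is what HGM would allow. */
                size_t lost = (size_t)1 << info->si_addr_lsb;

                fprintf(stderr, "memory error at %p, %zu bytes inaccessible\n",
                        info->si_addr, lost);
        }
        exit(1);
}

int main(void)
{
        struct sigaction sa = {
                .sa_sigaction = sigbus_handler,
                .sa_flags = SA_SIGINFO,
        };

        sigaction(SIGBUS, &sa, NULL);
        /* ... run against hugetlb-backed memory; recovery would go here ... */
        return 0;
}
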
> > > > > Backing VMs with hugetlb pages is a real use case today. We are seeing
> > > > > memory errors on such hugetlb pages, resulting in VM failures.
> > > > > One of the advantages of backing VMs with THPs is that they are split in
> > > > > the case of memory errors. HGM would allow similar functionality.
> > > >
> > > > Thanks for this context, Mike, it's very useful.
> > > >
> > > > I think everybody is aligned on the desire to map memory at smaller
> > > > granularities for multiple use cases and it's fairly clear that these use
> > > > cases are critically important to multiple stakeholders.
> > > >
> > > > I think the open question is whether this functionality is supported in
> > > > hugetlbfs (like with HGM) or whether there is a hard requirement that we must
> > > > use THP for this support.
> > > >
> > > > I don't think that hugetlbfs is feature frozen, but if there's a strong
> > > > bias toward not merging additional complexity into the subsystem that
> > > > would be useful to know. I personally think the critical use cases described
> > >
> > > At least I, attending that session, thought it was clear that the
> > > majority of the people speaking up expressed "no more added
> > > complexity". So I think there is a strong bias, at least from the
> > > people attending that session.
> > >
> > >
> > > > above justify the added complexity of HGM to hugetlb and we wouldn't be
> > > > blocked by the long-standing (15+ years) desire to mesh hugetlb into the
> > > > core MM subsystem before we can stop the pain associated with memory
> > > > poisoning and live migration.
> > > >
> > > > Are there strong objections to extending hugetlb for this support?
> > >
> > > I don't want to get too involved in this discussion (busy), but I
> > > absolutely agree on the points that were raised at LSF/MM that
> > >
> > > (A) hugetlb is complicated and very special (many things not integrated
> > > with core-mm, so we need special-casing all over the place). [example:
> > > what is a pte?]
> > >
> > > (B) We added a bunch of complexity in the past that some people
> > > considered very important (and it was not feature frozen, right? ;) ).
> > > Looking back, we might just not have done some of that, or done it
> > > differently/cleaner -- better integrated in the core. (PMD sharing,
> > > MAP_PRIVATE, a reservation mechanism that still requires preallocation
> > > because it fails with NUMA/fork, ...)
> > >
> > > (C) Unifying hugetlb and the core looks like it's getting more and more
> > > out of reach, maybe even impossible with all the complexity we added
> > > over the years (well, and keep adding).
> > >
> > > Sure, HGM for the purpose of better hwpoison handling makes sense. But
> > > hugetlb is probably 20 years old and hwpoison handling probably 13 years
> > > old. So we managed to get quite far without that optimization.
> > >
> > > Absolutely, HGM for better postcopy live migration also makes sense, I
> > > guess nobody disagrees on that.
> > >
> > >
> > > But as discussed in that session, maybe we should just start anew and
> > > implement something that integrates nicely with the core, instead of
> > > making hugetlb more complicated and even more special.
> > >
> > >
> > > Now, we all know nobody wants to do the heavy lifting for that; that's
> > > why we're discussing how to get in yet another complicated feature.
> >
> > If nobody wants to do the heavy lifting and unifying hugetlb with core
> > MM is becoming impossible as you state, then does adding another
> > feature to hugetlb (that we all agree is useful for multiple
> > use cases) really make things worse? In other words, if someone
>
> Well, if we (as a community) reject more complexity and outline an
> alternative of what would be acceptable (rewrite), people that really want
> these new features will *have to* do the heavy lifting.
>
> [and I see many people from employers that might have the capacity to do the
> heavy lifting if really required being involved in the discussion around HGM
> :P ]
>
> > decides tomorrow to do the heavy lifting, how much harder does this
> > become because of HGM, if any?
> >
> > I am the farthest from being an expert here, just an
> > observer, but if the answer to the above question is "HGM doesn't
> > actually make it worse" or "HGM only slightly makes things harder",
> > then I naively think that it's something that we should do, from a
> > pure cost-benefit analysis.
>
> Well, there is always the "maintainability" aspect, because upstream has to
> maintain whatever complexity gets merged. No matter what, we'll have to keep
> maintaining the current set of hugetlb features until we can eventually
> deprecate some or all of them in the far, far future.
>
> I, for my part, am happy as long as I can stay as far away as possible from
> hugetlb code. Again, Mike is the maintainer.

Thanks for the reminder :)

Maintainability is my primary concern with HGM. That is one of the reasons
I proposed that James pitch the topic at LSF/MM. Even though I am the
'maintainer', changes introduced by HGM will impact others working in mm.

> What I saw so far regarding HGM does not count as "slightly makes things
> harder".
>
> > Again, I don't have a lot of context here, and I understand everyone's
> > frustration with the current state of hugetlb. Just my 2 cents.
>
> The thing is, we all agree that something that hugetlb provides is valuable
> (i.e., a pool of huge/large pages that we can map large), just that after 20
> years there might be better ways of doing it and integrating it better with
> core-mm.

I am struggling with how to support existing hugetlb users that are running
into issues like memory errors on hugetlb pages today. And, yes, that is a
source of real customer issues. They are not really happy with the current
behavior, where a single error takes out an entire 1G page, and with it their
VM or application. Moving to THP is not likely, as they really want a
pre-allocated pool of 1G pages. I just don't have a good answer for them.
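
For reference, the usage pattern those customers rely on looks roughly like
the sketch below. It is illustrative only and assumes the 1G pool was reserved
up front (e.g. hugepagesz=1G hugepages=N on the kernel command line); THP has
no equivalent of this guaranteed, pre-allocated 1G backing:

#include <stdio.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << MAP_HUGE_SHIFT)     /* log2(1G) in the flag bits */
#endif

int main(void)
{
        size_t len = 1UL << 30;                 /* one gigantic page */
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                         -1, 0);

        if (buf == MAP_FAILED) {
                perror("mmap");                 /* no 1G pages reserved? */
                return 1;
        }

        /* Today a single hard memory error anywhere in buf poisons the
         * entire 1G mapping; with HGM the loss could be limited to 4K. */
        munmap(buf, len);
        return 0;
}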
--
Mike Kravetz