linux-mm.kvack.org archive mirror
From: Jason Gunthorpe <jgg@ziepe.ca>
To: James Houghton <jthoughton@google.com>
Cc: Dan Williams <dan.j.williams@intel.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	David Hildenbrand <david@redhat.com>,
	Miaohe Lin <linmiaohe@huawei.com>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	Peter Xu <peterx@redhat.com>, Yosry Ahmed <yosryahmed@google.com>,
	linux-mm@kvack.org, Michal Hocko <mhocko@suse.com>,
	Matthew Wilcox <willy@infradead.org>,
	David Rientjes <rientjes@google.com>,
	Axel Rasmussen <axelrasmussen@google.com>,
	lsf-pc@lists.linux-foundation.org,
	Jiaqi Yan <jiaqiyan@google.com>,
	jane.chu@oracle.com
Subject: Re: [Lsf-pc] [LSF/MM/BPF TOPIC] HGM for hugetlbfs
Date: Tue, 13 Jun 2023 12:17:23 -0300
Message-ID: <ZIiIg7i+7r47h17S@ziepe.ca>
In-Reply-To: <CADrL8HXy3OnAqd4Y6FHZLMkxXsCE54UH=PQVCTUvNhX9yWCacw@mail.gmail.com>

On Fri, Jun 09, 2023 at 01:20:19PM -0700, James Houghton wrote:

> So, we could:
> 1. Do what HGM does and have the kernel unmap the 4K page in the
> userspace page tables.
> 2. On-the-fly change the VMA for our hugepage to not be HugeTLB
> anymore, and re-map all the good 4K pages.
> 3. Tell userspace that it must change its mapping from HugeTLB to
> something else, and move the good 4K pages into the new mapping.
 
> (2) feels like more complexity than (1). If a user created a
> MAP_HUGETLB mapping and now it isn't HugeTLB, that feels wrong.
> 
> (3) today isn't possible, but with Jiaqi's improvement to hugetlbfs
> read() it becomes possible. We'll need to have an extra 1G of memory
> while we are doing this copying/recovery, and it isn't transparent at
> all.

It is transparent to the VM; the VM just sees a longer EPT fault
response time if it touches that range.

> (3) is additionally painful when considering live migration. We have
> to keep the 4K page unmapped after the migration (to keep it poisoned
> from the guest's perspective), but the page is no longer *actually*
> poisoned on the host. To get the memory we need to back our
> fake-poisoned pages with tmpfs, we would need to free our 1G page.
> Getting that page back later isn't trivial.

Why does this change with #1?

As David says, you can't transparently "fix" the page, so when you
migrate a VM with unavailable pages it must migrate those unavailable
pages too, regardless of whether the kernel or userspace made them
unavailable.

So, regardless, you end up with a VM that has holes in its address
map.

I guess if the hole is created from a PTE map of a 1G hugetlbfs it is
easier to "heal" back to a full 1G map, but this healing could also be
done by copying.

It seems to me the main value of the kernel-side approach is that it
eliminates the copies and makes the time the 1G page would be
unavailable to the guest shorter.

> So (1) still seems like the most natural solution, so the question
> becomes: how exactly do we implement 4K unmapping? And that brings us
> back to the main question about how HGM should be implemented in
> general.

IMHO if you can do it in userspace with a copy you can solve your
urgent customer need and then have more time to do the big kernel
rework required to optimize it with kernel support.
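
To make the userspace copy concrete, here is a minimal sketch of the
recovery loop in option (3). It assumes that a read covering a poisoned
4K page fails with EIO while the good pages remain readable, which is
my reading of what Jiaqi's hugetlbfs read() improvement provides. The
names salvage_pages, fd_reader and read_page are hypothetical, and the
bytearray only stands in for the new non-HugeTLB mapping.

```python
import errno
import os

PAGE_SIZE = 4096  # base page size assumed for this sketch


def salvage_pages(read_page, num_pages, page_size=PAGE_SIZE):
    """Copy every readable page into a new buffer.

    read_page(offset, length) is a hypothetical callable that returns
    bytes or raises OSError(EIO) for a poisoned page.
    Returns (buffer, list of page indexes that could not be read).
    """
    dest = bytearray(num_pages * page_size)  # stand-in for the new mapping
    bad_pages = []
    for idx in range(num_pages):
        offset = idx * page_size
        try:
            data = read_page(offset, page_size)
        except OSError as exc:
            if exc.errno != errno.EIO:
                raise  # only EIO is treated as "poisoned" here
            bad_pages.append(idx)  # leave a hole in the new mapping
            continue
        dest[offset:offset + len(data)] = data
    return dest, bad_pages


def fd_reader(fd):
    """Build a read_page callable backed by os.pread on an open fd."""
    return lambda offset, length: os.pread(fd, length, offset)
```

With a hugetlbfs file open as fd, salvage_pages(fd_reader(fd), 262144)
would walk a 1G page in 4K steps. The pages listed in bad_pages are the
ones that have to stay unavailable to the guest, including after a
migration.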

Jason


Thread overview: 29+ messages
2023-03-06 19:19 Mike Kravetz
2023-03-14 15:37 ` James Houghton
2023-04-12  1:44   ` David Rientjes
2023-05-24 20:26 ` James Houghton
2023-05-26  3:00   ` David Rientjes
     [not found]     ` <20230602172723.GA3941@monkey>
2023-06-06 22:40       ` David Rientjes
2023-06-07  7:38         ` David Hildenbrand
2023-06-07  7:51           ` Yosry Ahmed
2023-06-07  8:13             ` David Hildenbrand
2023-06-07 22:06               ` Mike Kravetz
2023-06-08  0:02                 ` David Rientjes
2023-06-08  6:34                   ` David Hildenbrand
2023-06-08 18:50                     ` Yang Shi
2023-06-08 21:23                       ` Mike Kravetz
2023-06-09  1:57                         ` Zi Yan
2023-06-09 15:17                           ` Pasha Tatashin
2023-06-09 19:04                             ` Ankur Arora
2023-06-09 19:57                           ` Matthew Wilcox
2023-06-08 20:10                     ` Matthew Wilcox
2023-06-09  2:59                       ` David Rientjes
2023-06-13 14:59                       ` Jason Gunthorpe
2023-06-13 15:15                         ` David Hildenbrand
2023-06-13 15:45                           ` Peter Xu
2023-06-08 21:54                 ` [Lsf-pc] " Dan Williams
2023-06-08 22:35                   ` Mike Kravetz
2023-06-09  3:36                     ` Dan Williams
2023-06-09 20:20                       ` James Houghton
2023-06-13 15:17                         ` Jason Gunthorpe [this message]
2023-06-07 14:40           ` Matthew Wilcox
