From: Muchun Song <songmuchun@bytedance.com>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joao Martins <joao.m.martins@oracle.com>,
linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH v1] mm/hugetlb_vmemmap: remap head page to newly allocated page
Date: Thu, 4 Aug 2022 16:05:00 +0800 [thread overview]
Message-ID: <Yut9rBWto31i3z6W@FVFYT0MHHV2J.usts.net> (raw)
In-Reply-To: <Yur5x6lORlPOZ1KY@monkey>
On Wed, Aug 03, 2022 at 03:42:15PM -0700, Mike Kravetz wrote:
> On 08/03/22 18:44, Muchun Song wrote:
> > On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote:
> > > On 8/3/22 05:11, Muchun Song wrote:
> > > > On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote:
> > > >> Today with `hugetlb_free_vmemmap=on` the struct page memory that is
> > > >> freed back to the page allocator is as follows: for a 2M hugetlb page it
> > > >> will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap
> > > >> pages, and for a 1G hugetlb page it will remap the remaining 4095 vmemmap
> > > >> pages. Essentially, that means that it breaks the first 4K of a
> > > >> potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or
> > > >> 16M (for 1G hugetlb pages). For this reason the memory that is freed
> > > >> back to the page allocator cannot be used by hugetlb to allocate huge
> > > >> pages of the same size, but rather only of a smaller huge page size:
> > > >>
> > > >
> > > > Hi Joao,
> > > >
> > > > Thanks for your work on this. I admit you are right. The current mechanism
> > > > prevents the freed vmemmap pages from being merged into a potentially
> > > > contiguous page. Allocating a new head page is a straightforward approach;
> > > > however, it is very dangerous at runtime after the system has booted. Why
> > > > dangerous? Because you must first 1) copy the content from the head vmemmap
> > > > page to the target (newly allocated) page, and then 2) change the PTE
> > > > entry to the new page. However, the content (especially the refcount) of the
> > > > old head vmemmap page could be changed elsewhere (e.g. by other modules)
> > > > between steps 1) and 2). Eventually, the newly allocated vmemmap page is
> > > > corrupted. Unfortunately, we don't have an easy way to prevent it.
> > > >
> > > OK, I see what I missed. You mean the refcount (or any other data) on the
> > > struct pages preceding this head struct page, unrelated to the hugetlb
> > > page being remapped. Meaning when the first struct page in the old vmemmap
> > > page is *not* the head page we are trying to remap, right?
> > >
> > > See further below in your patch, but I wonder if we could actually check
> > > this from the hugetlb head VA being aligned to PAGE_SIZE. Meaning that we
> > > would be checking that this head page is the first struct page in the
> > > vmemmap page, and that we are still safe to remap? If so, the patch could
> > > be simpler, more like mine, without the special freeing path you added below.
> > >
> > > If I'm right, see at the end.
> >
> >
> > I am not sure we are on the same page (it seems that we are not, after I saw
> > your patch below). So let me clarify.
>
> Thanks Muchun!
> I told Joao that you would be the expert in this area, and I was correct.
>
Thanks for your trust.
> >
> > CPU0: CPU1:
> >
> > vmemmap_remap_free(start, end, reuse)
> > // alloc a new page used to be the head vmemmap page
> > page = alloc_pages_node();
> >
> > memcpy(page_address(page), reuse, PAGE_SIZE);
> > // Now the @reuse address is mapped to the original
> > // page frame. So the change will be reflected on the
> > // original page frame.
> > get_page(reuse);
>
> Here is a thought.
>
> This code gets called early after allocating a new hugetlb page. This new
> compound page has a ref count of 1 on the head page and 0 on all the tail
> pages. If the ref count was 0 on the head page, get_page() would not succeed.
>
> I cannot think of a reason why we NEED to have a ref count of 1 on the
> head page. It does make it more convenient to simply call put_page() on
> the newly initialized hugetlb page and have the page be added to the hugetlb
> free lists, but this could easily be modified.
>
> Matthew Wilcox recently pointed me at this:
> https://lore.kernel.org/linux-mm/20220531150611.1303156-1-willy@infradead.org/T/#m98fb9f9bd476155b4951339da51a0887b2377476
>
> That would only work for hugetlb pages allocated from buddy. For gigantic
> pages, we manually 'freeze' (zero the ref count of) the tail pages and check
> for an unexpected increased ref count. We could do the same with the head page.
>
> Would having zero ref count on the head page eliminate this race?
I think most races that try to grab an extra refcount could be avoided
in this case.

However, I can think of a race related to memory failure: memory_failure()
poisons a page by setting PageHWPoison without checking whether the
refcount is zero. If we can fix this easily, I think this patch is a good
direction.
Thanks Mike.
> --
> Mike Kravetz
>
> > vmemmap_remap_pte();
> > // remap to the above new allocated page
> > set_pte_at();
> >
> > flush_tlb_kernel_range();
> > // Now the @reuse address is mapped to the new allocated
> > // page frame. So the change will be reflected on the
> > // new page frame and it is corrupted.
> > put_page(reuse);
> >
> > So we should make 1) the memcpy, 2) the remap and 3) the TLB flush atomic
> > on CPU0; however, that is not easy.
>