From: Muchun Song <songmuchun@bytedance.com>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Joao Martins <joao.m.martins@oracle.com>,
linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH v1] mm/hugetlb_vmemmap: remap head page to newly allocated page
Date: Thu, 4 Aug 2022 16:05:00 +0800 [thread overview]
Message-ID: <Yut9rBWto31i3z6W@FVFYT0MHHV2J.usts.net> (raw)
In-Reply-To: <Yur5x6lORlPOZ1KY@monkey>
On Wed, Aug 03, 2022 at 03:42:15PM -0700, Mike Kravetz wrote:
> On 08/03/22 18:44, Muchun Song wrote:
> > On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote:
> > > On 8/3/22 05:11, Muchun Song wrote:
> > > > On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote:
> > > >> Today with `hugetlb_free_vmemmap=on` the struct page memory that is
> > > >> freed back to the page allocator is as follows: for a 2M hugetlb page it
> > > >> will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap
> > > >> pages, and for a 1G hugetlb page it will remap the remaining 4095 vmemmap
> > > >> pages. Essentially, that means that it breaks the first 4K of a
> > > >> potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or
> > > >> 16M (for 1G hugetlb pages). For this reason the memory that is freed
> > > >> back to the page allocator cannot be used by hugetlb to allocate huge
> > > >> pages of the same size, but rather only of a smaller huge page size:
> > > >>
> > > >
> > > > Hi Joao,
> > > >
> > > > Thanks for your work on this. I admit you are right. The current mechanism
> > > > prevents the freed vmemmap pages from being merged into a potentially
> > > > contiguous page. Allocating a new head page is a straightforward approach;
> > > > however, it is very dangerous at runtime after the system has booted. Why
> > > > dangerous? Because you must first 1) copy the content from the head vmemmap
> > > > page to the target (newly allocated) page, and then 2) change the PTE
> > > > entry to the new page. However, the content (especially the refcount) of the
> > > > old head vmemmap page could be changed elsewhere (e.g. by other modules)
> > > > between steps 1) and 2). Eventually, the newly allocated vmemmap page is
> > > > corrupted. Unfortunately, we don't have an easy way to prevent it.
> > > >
> > > OK, I see what I missed. You mean the refcount (or any other data) on the
> > > struct pages preceding this head struct page, unrelated to the hugetlb
> > > page being remapped. Meaning when the first struct page in the old vmemmap
> > > page is *not* the head page we are trying to remap, right?
> > >
> > > See further below in your patch, but I wonder if we could actually check
> > > this from the hugetlb head VA being aligned to PAGE_SIZE. Meaning that we
> > > would be checking that this head page is the first struct page in the
> > > vmemmap page, and that we are still safe to remap? If so, the patch could
> > > be simpler, more like mine, without the special freeing path you added below.
> > >
> > > If I'm right, see at the end.
> >
> >
> > I am not sure we are on the same page (it seems that we are not, after I saw
> > your patch below). So let me clarify.
>
> Thanks Muchun!
> I told Joao that you would be the expert in this area, and I was correct.
>
Thanks for your trust.
> >
> > CPU0: CPU1:
> >
> > vmemmap_remap_free(start, end, reuse)
> > // alloc a new page used to be the head vmemmap page
> > page = alloc_pages_node();
> >
> > memcpy(page_address(page), reuse, PAGE_SIZE);
> > // Now the @reuse address is mapped to the original
> > // page frame. So the change will be reflected on the
> > // original page frame.
> > get_page(reuse);
>
> Here is a thought.
>
> This code gets called early after allocating a new hugetlb page. This new
> compound page has a ref count of 1 on the head page and 0 on all the tail
> pages. If the ref count was 0 on the head page, get_page() would not succeed.
>
> I cannot think of a reason why we NEED to have a ref count of 1 on the
> head page. It does make it more convenient to simply call put_page() on
> the newly initialized hugetlb page and have the page be added to the hugetlb
> free lists, but this could easily be modified.
>
> Matthew Wilcox recently pointed me at this:
> https://lore.kernel.org/linux-mm/20220531150611.1303156-1-willy@infradead.org/T/#m98fb9f9bd476155b4951339da51a0887b2377476
>
> That would only work for hugetlb pages allocated from buddy. For gigantic
> pages, we manually 'freeze' (zero the ref count of) the tail pages and check
> for an unexpected increased ref count. We could do the same with the head page.
>
> Would having zero ref count on the head page eliminate this race?
I think most races that try to grab an extra refcount could be avoided
in this case.

However, I can think of a race related to memory failure: memory_failure()
poisons a page by setting PageHWPoison without checking whether the
refcount is zero. If we can fix this easily, I think this patch is a good
direction.
Thanks Mike.
> --
> Mike Kravetz
>
> > vmemmap_remap_pte();
> > // remap to the above new allocated page
> > set_pte_at();
> >
> > flush_tlb_kernel_range();
> > // Now the @reuse address is mapped to the new allocated
> > // page frame. So the change will be reflected on the
> > // new page frame and it is corrupted.
> > put_page(reuse);
> >
> > So we should make 1) the memcpy, 2) the remap and 3) the TLB flush atomic
> > on CPU0; however, that is not easy.
>