linux-mm.kvack.org archive mirror
From: Muchun Song <songmuchun@bytedance.com>
To: Joao Martins <joao.m.martins@oracle.com>
Cc: linux-mm@kvack.org, Mike Kravetz <mike.kravetz@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH v1] mm/hugetlb_vmemmap: remap head page to newly allocated page
Date: Thu, 4 Aug 2022 19:53:31 +0800
Message-ID: <YuuzOyvJnzbgUhzN@FVFYT0MHHV2J.usts.net>
In-Reply-To: <1579c011-eb6f-09ad-bf9b-5b3df3e1b182@oracle.com>

On Thu, Aug 04, 2022 at 10:23:48AM +0100, Joao Martins wrote:
> On 8/4/22 08:17, Muchun Song wrote:
> > On Wed, Aug 03, 2022 at 01:22:21PM +0100, Joao Martins wrote:
> >>
> >>
> >> On 8/3/22 11:44, Muchun Song wrote:
> >>> On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote:
> >>>> On 8/3/22 05:11, Muchun Song wrote:
> >>>>> On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote:
> >>>>>> Today with `hugetlb_free_vmemmap=on` the struct page memory that is
> >>>>>> freed back to page allocator is as following: for a 2M hugetlb page it
> >>>>>> will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap
> >>>>>> pages, and for a 1G hugetlb it will remap the remaining 4095 vmemmap
> >>>>>> pages. Essentially, that means that it breaks the first 4K of a
> >>>>>> potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or
> >>>>>> 16M (for 1G hugetlb pages). For this reason the memory that is freed
> >>>>>> back to the page allocator cannot be used by hugetlb to allocate huge
> >>>>>> pages of the same size, but only huge pages of a smaller size:
> >>>>>>
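The arithmetic above can be sketched in userspace C (assuming 4K base pages and a 64-byte struct page; PAGE_SIZE, STRUCT_PAGE_SIZE, and both helpers are illustrative names, not kernel API):

```c
/* Userspace model of the vmemmap page counts in the patch description.
 * Assumes 4K base pages and a 64-byte struct page; not kernel code. */
#include <assert.h>

#define PAGE_SIZE        4096UL
#define STRUCT_PAGE_SIZE 64UL

/* Number of vmemmap pages backing the struct pages of one hugetlb page. */
static unsigned long vmemmap_pages(unsigned long hugepage_size)
{
	unsigned long nr_struct_pages = hugepage_size / PAGE_SIZE;

	return nr_struct_pages * STRUCT_PAGE_SIZE / PAGE_SIZE;
}

/* Pages handed back to the buddy allocator: all but the one reused page. */
static unsigned long vmemmap_pages_freed(unsigned long hugepage_size)
{
	return vmemmap_pages(hugepage_size) - 1;
}
```

For a 2M hugetlb page this gives 8 vmemmap pages (1 reused, 7 freed); for a 1G page, 4096 (1 reused, 4095 freed), matching the figures quoted above.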
> >>>>>
> >>>>> Hi Joao,
> >>>>>
> >>>>> Thanks for your work on this. I admit you are right. The current mechanism
> >>>>> prevents the freed vmemmap pages from being merged into a potentially
> >>>>> contiguous page. Allocating a new head page is a straightforward approach;
> >>>>> however, it is very dangerous at runtime once the system has booted. Why
> >>>>> dangerous? Because you must first 1) copy the content from the head vmemmap
> >>>>> page to the target (newly allocated) page, and then 2) change the PTE
> >>>>> entry to point to the new page. However, the content (especially the
> >>>>> refcount) of the old head vmemmap page could be changed elsewhere (e.g. by
> >>>>> other modules) between steps 1) and 2). Eventually, the newly allocated
> >>>>> vmemmap page is corrupted. Unfortunately, we don't have an easy way to
> >>>>> prevent it.
> >>>>>
> >>>> OK, I see what I missed. You mean the refcount (or any other data) on the
> >>>> preceding struct pages to this head struct page, unrelated to the hugetlb
> >>>> page being remapped. Meaning when the first struct page in the old vmemmap
> >>>> page is *not* the head page we are trying to remap, right?
> >>>>
> >>>> See further below in your patch, but I wonder if we could actually check this
> >>>> from the hugetlb head VA being aligned to PAGE_SIZE. That is, we would be
> >>>> checking that this head page is the first struct page in the vmemmap page,
> >>>> meaning we are still safe to remap? If so, the patch could be simpler, more
> >>>> like mine, without the special freeing path you added below.
> >>>>
> >>>> If I'm right, see at the end.
> >>>
> >>> I am not sure we are on the same page (after seeing your patch below, it
> >>> seems we are not).
> >>
> >> Even though I misunderstood you it might still look like a possible scenario.
> >>
> >>> So let me clarify.
> >>>
> >> Thanks
> >>
> >>> CPU0:                                            CPU1:
> >>>
> >>> vmemmap_remap_free(start, end, reuse)
> >>>   // alloc a new page used to be the head vmemmap page
> >>>   page = alloc_pages_node();
> >>>
> >>>   memcpy(page_address(page), reuse, PAGE_SIZE);
> >>>                                                  // Now the @reuse address is mapped to the original
> >>>                                                  // page frame. So the change will be reflected on the
> >>>                                                  // original page frame.
> >>>                                                  get_page(reuse);
> >>>   vmemmap_remap_pte();
> >>>    // remap to the above new allocated page 
> >>>    set_pte_at();
> >>>
> >>>   flush_tlb_kernel_range();
> >>
> >> note-to-self: I totally missed changing flush_tlb_kernel_range() to cover the full range.
> >>
> > 
> > Right. I have noticed that as well.
> > 
> >>>                                                  // Now the @reuse address is mapped to the new allocated
> >>>                                                  // page frame. So the change will be reflected on the
> >>>                                                  // new page frame and it is corrupted.
> >>>                                                  put_page(reuse);
> >>>
> >>> So we should make 1) memcpy, 2) remap and 3) TLB flush atomic on CPU0, however
> >>> it is not easy.
> >>>
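The interleaving above can be modeled deterministically in userspace C. The helper below replays the four steps in the problematic order (frame, reuse, and race() are illustrative names, not kernel code); CPU1's increment lands on the old frame and its decrement on the new one, so the copied refcount ends up wrong:

```c
/* Userspace model of the CPU0/CPU1 race above. The "vmemmap page" is a
 * struct with a refcount; CPU1's get_page()/put_page() hit whichever
 * frame the reuse address maps to at that instant. Not kernel code. */
#include <assert.h>
#include <string.h>

struct frame { int refcount; };

static struct frame old_head = { .refcount = 1 };
static struct frame new_head;
static struct frame *reuse = &old_head;	/* what the reuse VA maps to */

static void race(void)
{
	/* CPU0 step 1: copy the head vmemmap page to the new page. */
	memcpy(&new_head, reuse, sizeof(new_head));

	/* CPU1: a transient get_page() hits the *old* frame... */
	reuse->refcount++;

	/* CPU0 step 2: remap the reuse address to the new page. */
	reuse = &new_head;

	/* CPU1: ...and the matching put_page() hits the *new* frame,
	 * whose copied refcount never saw the increment. */
	reuse->refcount--;
}
```

After race(), the old frame holds refcount 2 and the new frame refcount 0, where a consistent view would leave 1; this is the corruption Muchun describes.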
> >> OK, I understand what you mean now. However, I am trying to understand
> >> whether this race is actually possible. Note that, given your previous answer
> >> above, I am assuming in your race scenario that the vmemmap page only ever
> >> stores metadata (struct page) related to the hugetlb page currently being
> >> remapped. If this assumption is wrong, then your race would be possible
> >> (though it wouldn't come from a get_page() on the reuse_addr).
> >>
> >> So, how would we end up doing a get_page() on the head page that we are
> >> remapping (and its put_page(), for that matter) from somewhere else?
> >> Considering we are at prep_new_huge_page() when we call vmemmap_remap_free(),
> >> we already got the page from the page allocator, but hugetlb hasn't yet
> >> finished initializing the head page metadata. Hence it isn't yet accounted,
> >> so how could someone grab it, e.g. via demote/page-fault/migration/etc.?
> >>
> > 
> > As far as I know, there are at least two places that could grab the refcount:
> > 1) GUP and 2) memory failure.
> > 
> So I am aware of the means to grab the page refcount. It's how we get into such
> a situation in the first place that I wasn't sure about:
> 
> > For 1), you can refer to the commit 7118fc2906e2925d7edb5ed9c8a57f2a5f23b849.
> 
> Good pointer. This one sheds some light too:
> 
> https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/
> 
> I wonder how we get into a situation of doing GUP on a user VA referring to a
> just-allocated *hugetlb* page from buddy (but not yet in the hugetlb free
> lists), even if only temporarily. Unless the page was still sitting in page
> tables that were about to be torn down. Maybe it is specific to gigantic
> pages. Hmm
>

It is not specific to gigantic pages. I think your above link has a good
description to show how this could happen to a non-gigantic page.

Thanks.
 
> > For 2), I think you can refer to the function of __get_unpoison_page().
> > 
> Ah, yes. Good point.
> 
> > Both places could grab an extra refcount on the struct page of the HugeTLB
> > page being processed.
> > 
> Thanks for the pointers ;)
> 
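For illustration, the transient refcount grab that both GUP and memory failure rely on boils down to a get-unless-zero pattern on the page refcount. A hedged userspace sketch (the struct and function names mirror the kernel idiom but are not the kernel implementation):

```c
/* Userspace sketch of the speculative refcount grab: succeed only if the
 * page still looks live (refcount > 0), using a CAS loop so a concurrent
 * free cannot be resurrected. Illustrative, not kernel code. */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct page { atomic_int refcount; };

static bool get_page_unless_zero(struct page *p)
{
	int old = atomic_load(&p->refcount);

	while (old > 0) {
		/* On failure, old is reloaded and the liveness check reruns. */
		if (atomic_compare_exchange_weak(&p->refcount, &old, old + 1))
			return true;
	}
	return false;	/* page was already freed; don't touch it */
}
```

The point of the thread is that a racing path can win this grab on a page hugetlb has taken from buddy but not yet initialized, which is exactly the window the remap of the head vmemmap page has to worry about.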


Thread overview: 12+ messages
2022-08-02 18:03 Joao Martins
2022-08-03  4:11 ` Muchun Song
2022-08-03  9:52   ` Joao Martins
2022-08-03 10:44     ` Muchun Song
2022-08-03 12:22       ` Joao Martins
2022-08-04  7:17         ` Muchun Song
2022-08-04  9:23           ` Joao Martins
2022-08-04 11:53             ` Muchun Song [this message]
2022-08-03 22:42       ` Mike Kravetz
2022-08-04  8:05         ` Muchun Song
2022-08-04 16:56           ` Mike Kravetz
2022-08-05  2:54             ` Muchun Song
