From: Muchun Song <songmuchun@bytedance.com>
To: Joao Martins <joao.m.martins@oracle.com>
Cc: linux-mm@kvack.org, Mike Kravetz <mike.kravetz@oracle.com>,
Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH v1] mm/hugetlb_vmemmap: remap head page to newly allocated page
Date: Wed, 3 Aug 2022 18:44:29 +0800
Message-ID: <YupRjWRiz4lPo+y7@FVFYT0MHHV2J>
In-Reply-To: <0b085bb1-b5f7-dfc2-588a-880de0d77ea2@oracle.com>
On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote:
> On 8/3/22 05:11, Muchun Song wrote:
> > On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote:
> >> Today with `hugetlb_free_vmemmap=on` the struct page memory that is
> >> freed back to the page allocator is as follows: for a 2M hugetlb page it
> >> will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap
> >> pages, and for a 1G hugetlb page it will remap the remaining 4095 vmemmap
> >> pages. Essentially, that means it breaks off the first 4K of a
> >> potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or
> >> 16M (for 1G hugetlb pages). For this reason the memory that is freed
> >> back to the page allocator cannot be used by hugetlb to allocate huge pages
> >> of the same size, but only huge pages of a smaller size:
> >>
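(Not part of either patch -- a minimal userspace sketch of the arithmetic behind
the numbers quoted above, assuming 4K base pages and a 64-byte struct page as on
a typical x86-64 config:)

#include <stdio.h>

int main(void)
{
	unsigned long page_size = 4096, sp_size = 64;		/* sizeof(struct page) */
	unsigned long hp_sizes[] = { 2UL << 20, 1UL << 30 };	/* 2M and 1G hugetlb */

	for (int i = 0; i < 2; i++) {
		unsigned long nr_struct_pages = hp_sizes[i] / page_size;
		unsigned long vmemmap_pages = nr_struct_pages * sp_size / page_size;

		/* one vmemmap page is reused, the remaining ones are remapped and freed */
		printf("%4luM hugetlb: %4lu vmemmap pages (1 reused, %4lu freed)\n",
		       hp_sizes[i] >> 20, vmemmap_pages, vmemmap_pages - 1);
	}
	return 0;
}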
> >
> > Hi Joao,
> >
> > Thanks for your work on this. I admit you are right. The current mechanism
> > prevents the freed vmemmap pages from being merged into a potentially
> > contiguous chunk. Allocating a new head page is a straightforward approach;
> > however, it is very dangerous at runtime after the system has booted. Why
> > dangerous? Because you first have to 1) copy the content of the head vmemmap
> > page to the target (newly allocated) page, and then 2) change the PTE entry
> > to point to the new page. However, the content (especially the refcount) of
> > the old head vmemmap page could be changed elsewhere (e.g. by other modules)
> > between steps 1) and 2). Eventually, the newly allocated vmemmap page is
> > corrupted. Unfortunately, we don't have an easy way to prevent it.
> >
> OK, I see what I missed. You mean the refcount (or any other data) in the
> struct pages preceding this head struct page, unrelated to the hugetlb
> page being remapped. Meaning the case where the first struct page in the old
> vmemmap page is *not* the head page we are trying to remap, right?
>
> See further below in your patch, but I wonder if we could actually check this
> from the hugetlb head VA being aligned to PAGE_SIZE. Meaning that we would be
> checking that this head page is the first struct page in the vmemmap page,
> and that we are still safe to remap it? If so, the patch could be simpler,
> more like mine, without the special freeing path you added below.
>
> If I'm right, see at the end.
I am not sure we are on the same page (it seems we are not, judging by your
patch below). So let me clarify with a timeline:

CPU0:                                   CPU1:

vmemmap_remap_free(start, end, reuse)
  // alloc a new page to be used as the head vmemmap page
  page = alloc_pages_node();
  memcpy(page_address(page), reuse, PAGE_SIZE);
                                        // Now the @reuse address is mapped to the
                                        // original page frame, so the change will be
                                        // reflected on the original page frame.
                                        get_page(reuse);
  vmemmap_remap_pte();
    // remap to the new page allocated above
    set_pte_at();
  flush_tlb_kernel_range();
                                        // Now the @reuse address is mapped to the newly
                                        // allocated page frame, so the change will be
                                        // reflected on the new page frame and it is
                                        // corrupted.
                                        put_page(reuse);
So we would have to make 1) the memcpy, 2) the remap and 3) the TLB flush on
CPU0 appear atomic with respect to CPU1; however, that is not easy.
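(To make the window concrete: an illustrative, hypothetical helper -- not code
from either patch -- with the two points where CPU1 can still dirty the old page
frame marked. Even a heavyweight approach such as stop_machine() does not
obviously close it, since flush_tlb_kernel_range() wants to IPI the very CPUs
that would be spinning with interrupts disabled.)

static void remap_head_page_racy(struct page *new_head, void *reuse,
				 pte_t *ptep, pte_t entry)
{
	/* step 1: snapshot the old head vmemmap page */
	copy_page(page_address(new_head), reuse);

	/* <-- CPU1 can still modify the old page frame through @reuse here */

	/* step 2: switch the PTE over to the new page frame */
	set_pte_at(&init_mm, (unsigned long)reuse, ptep, entry);

	/* <-- CPUs with a stale TLB entry still write to the old frame here */

	/* step 3: make the new mapping visible everywhere */
	flush_tlb_kernel_range((unsigned long)reuse,
			       (unsigned long)reuse + PAGE_SIZE);
}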
Thanks.
>
> > I also thought about solving this issue, but I didn't find any easy way to
> > solve it after the system has booted. However, it is possible at system boot
> > time, because if the system is in a very early initialization stage, nobody
> > should be accessing struct pages yet. I implemented it a month ago.
> > I didn't send it out since some additional preparation work is required.
> > The main preparation is to move the HugeTLB allocation to a very
> > early initialization stage (I want to put it into pure_initcall). Because
> > the HugeTLB allocation is parsed from the cmdline, whose logic is very
> > complex, I have sent a patch to clean it up [1]. After this cleanup, it
> > will be easy to move the allocation to an early stage. Sorry, I have been
> > busy lately and didn't have time to send a new version. But I will
> > send it ASAP.
> >
> > [1] https://lore.kernel.org/all/20220616071827.3480-1-songmuchun@bytedance.com/
> >
> > The following diff is the main work needed to remap the head vmemmap page to
> > a newly allocated page.
> >
> I like how you use walk::reuse_addr to tell it's a head page.
> I didn't find my version, which passes a boolean to remap_pte(), as clean.
>
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index 20f414c0379f..71f2d7335e6f 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -15,6 +15,7 @@
> > #include <asm/pgalloc.h>
> > #include <asm/tlbflush.h>
> > #include "hugetlb_vmemmap.h"
> > +#include "internal.h"
> >
> > /**
> > * struct vmemmap_remap_walk - walk vmemmap page table
> > @@ -227,14 +228,37 @@ static inline void free_vmemmap_page(struct page *page)
> > }
> >
> > /* Free a list of the vmemmap pages */
> > -static void free_vmemmap_page_list(struct list_head *list)
> > +static void free_vmemmap_pages(struct list_head *list)
> > {
> > struct page *page, *next;
> >
> > + list_for_each_entry_safe(page, next, list, lru)
> > + free_vmemmap_page(page);
> > +}
> > +
> > +/*
> > + * Free a list of vmemmap pages but bypass the per-cpu free lists of
> > + * the buddy allocator.
> > + */
> > +static void free_vmemmap_pages_nonpcp(struct list_head *list)
> > +{
> > + struct zone *zone;
> > + struct page *page, *next;
> > +
> > + if (list_empty(list))
> > + return;
> > +
> > + zone = page_zone(list_first_entry(list, struct page, lru));
> > + zone_pcp_disable(zone);
> > list_for_each_entry_safe(page, next, list, lru) {
> > - list_del(&page->lru);
> > + if (zone != page_zone(page)) {
> > + zone_pcp_enable(zone);
> > + zone = page_zone(page);
> > + zone_pcp_disable(zone);
> > + }
> > free_vmemmap_page(page);
> > }
> > + zone_pcp_enable(zone);
> > }
> >
> > static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> > @@ -244,12 +268,28 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> > * Remap the tail pages as read-only to catch illegal write operation
> > * to the tail pages.
> > */
> > - pgprot_t pgprot = PAGE_KERNEL_RO;
> > + pgprot_t pgprot = addr == walk->reuse_addr ? PAGE_KERNEL : PAGE_KERNEL_RO;
> > pte_t entry = mk_pte(walk->reuse_page, pgprot);
> > struct page *page = pte_page(*pte);
> >
> > list_add_tail(&page->lru, walk->vmemmap_pages);
> > set_pte_at(&init_mm, addr, pte, entry);
> > +
> > + if (unlikely(addr == walk->reuse_addr)) {
> > + void *old = page_to_virt(page);
> > +
> > + /* Remove it from the vmemmap_pages list to avoid being freed. */
> > + list_del(&walk->reuse_page->lru);
> > + flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> > + /*
> > + * If we reach here, the system is in a very early initialization
> > + * stage and nobody should be accessing struct pages. However, if
> > + * something unexpected happens, the head struct page is most likely
> > + * to be written (usually ->_refcount). Use BUG_ON() to catch this
> > + * unexpected case.
> > + */
> > + BUG_ON(memcmp(old, (void *)addr, sizeof(struct page)));
> > + }
> > }
> >
> > /*
> > @@ -298,7 +338,10 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> > * to remap.
> > * @end: end address of the vmemmap virtual address range that we want to
> > * remap.
> > - * @reuse: reuse address.
> > + * @reuse: reuse address. If @reuse is equal to @start, the page
> > + * frame which the @reuse address is mapped to will be replaced
> > + * with a new page frame and the previous page frame will be
> > + * freed; this is to reduce memory fragmentation.
> > *
> > * Return: %0 on success, negative error code otherwise.
> > */
> > @@ -319,14 +362,26 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> > * (see more details from the vmemmap_pte_range()):
> > *
> > * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
> > - * should be continuous.
> > + * should be continuous or @start is equal to @reuse.
> > * - The @reuse address is part of the range [@reuse, @end) that we are
> > * walking which is passed to vmemmap_remap_range().
> > * - The @reuse address is the first in the complete range.
> > *
> > * So we need to make sure that @start and @reuse meet the above rules.
> > */
> > - BUG_ON(start - reuse != PAGE_SIZE);
> > + BUG_ON(start - reuse != PAGE_SIZE && start != reuse);
> > +
> > + if (unlikely(reuse == start)) {
> > + int nid = page_to_nid((struct page *)start);
> > + gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY |
> > + __GFP_NOWARN;
> > +
> > + walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0);
> > + if (walk.reuse_page) {
> > + copy_page(page_to_virt(walk.reuse_page), (void *)reuse);
> > + list_add(&walk.reuse_page->lru, &vmemmap_pages);
> > + }
> > + }
> >
> > mmap_read_lock(&init_mm);
> > ret = vmemmap_remap_range(reuse, end, &walk);
> > @@ -348,7 +403,10 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> > }
> > mmap_read_unlock(&init_mm);
> >
> > - free_vmemmap_page_list(&vmemmap_pages);
> > + if (unlikely(reuse == start))
> > + free_vmemmap_pages_nonpcp(&vmemmap_pages);
> > + else
> > + free_vmemmap_pages(&vmemmap_pages);
> >
> > return ret;
> > }
> > @@ -512,6 +570,22 @@ static bool vmemmap_should_optimize(const struct hstate *h, const struct page *h
> > return true;
> > }
> >
> > +/*
> > + * Control whether the page frame mapped at the head vmemmap address of a
> > + * HugeTLB page should be replaced with a new page. The vmemmap pages are
> > + * usually mapped with a huge PMD mapping, so the head vmemmap page frames
> > + * are best freed to the buddy allocator once, at an early stage of system
> > + * boot, to reduce memory fragmentation.
> > + */
> > +static bool vmemmap_remap_head __ro_after_init = true;
> > +
> > +static int __init vmemmap_remap_head_init(void)
> > +{
> > + vmemmap_remap_head = false;
> > + return 0;
> > +}
> > +core_initcall(vmemmap_remap_head_init);
> > +
> > /**
> > * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
> > * @h: struct hstate.
> > @@ -537,6 +611,19 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> > vmemmap_start += HUGETLB_VMEMMAP_RESERVE_SIZE;
> >
> > /*
> > + * The vmemmap pages are usually mapped with a huge PMD mapping. If the
> > + * head vmemmap page is not freed to the buddy allocator, then the
> > + * freed tail vmemmap pages cannot be merged into a higher-order chunk.
> > + * The head vmemmap page frame can instead be replaced with a newly
> > + * allocated page and freed to the buddy allocator, so the freed vmemmap
> > + * pages have the opportunity to be merged into larger contiguous pages
> > + * to reduce memory fragmentation. vmemmap_remap_free() will do this if
> > + * @vmemmap_start is equal to @vmemmap_reuse.
> > + */
> > + if (unlikely(vmemmap_remap_head))
> > + vmemmap_start = vmemmap_reuse;
> > +
>
> Maybe it doesn't strictly need to be early init vs late init.
>
> I wonder if we can still apply this trick at a late stage, but only when the
> head struct page is aligned to PAGE_SIZE / sizeof(struct page)? When it is,
> it means we are safe to replace the head vmemmap page, provided we would only
> be covering this hugetlb page's data. Unless I am missing something and the
> check wouldn't be enough?
>
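(For reference, assuming 4K base pages, a 64-byte struct page and a page-aligned
vmemmap base: the vmemmap address of a pfn is vmemmap_base + pfn * sizeof(struct
page), so IS_ALIGNED(start, PAGE_SIZE) on the head's vmemmap address is equivalent
to the head pfn being a multiple of PAGE_SIZE / sizeof(struct page) == 64, i.e.
the head struct page is the first of the 64 struct pages backed by that vmemmap
page.)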
> A quick check with this snip added:
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 2b97df8115fe..06e028734b1e 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -331,7 +331,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> };
> gfp_t gfp_mask = GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
> int nid = page_to_nid((struct page *)start);
> - struct page *page;
> + struct page *page = NULL;
>
> /*
> * Allocate a new head vmemmap page to avoid breaking a contiguous
> @@ -341,7 +341,8 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> * more allocations of hugepages. Fallback to the currently
> * mapped head page in case should it fail to allocate.
> */
> - page = alloc_pages_node(nid, gfp_mask, 0);
> + if (IS_ALIGNED(start, PAGE_SIZE))
> + page = alloc_pages_node(nid, gfp_mask, 0);
> walk.head_page = page;
>
> /*
>
>
> ... and it looks like it can still be effective, just not as much; more or less as expected?
>
> # with 2M hugepages
>
> Before:
>
> Node 0, zone Normal, type Movable     76     28     10      4      1      0      0      0      1      1  15568
>
> After (allocated 32400):
>
> Node 0, zone Normal, type Movable    135    328    219    198    155    106     72     41     23      0      0
>
> Compared to my original patch where there weren't any pages left in the order-0..order-2:
>
> Node 0, zone Normal, type Movable      0      1      0     70    106     91     78     48     17      0      0
>
> But still much better than without any of this:
>
> Node 0, zone Normal, type Movable  32174  31999  31642    104     58     24     16      4      2      0      0
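(Presumably /proc/pagetypeinfo output, i.e. per-order free block counts for
orders 0 through 10 -- 4K up to 4M blocks with 4K base pages -- for the Movable
migrate type of zone Normal.)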
>