From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 3 Aug 2022 18:44:29 +0800
From: Muchun Song
To: Joao Martins
Cc: linux-mm@kvack.org, Mike Kravetz, Andrew Morton
Subject: Re: [PATCH v1] mm/hugetlb_vmemmap: remap head page to newly allocated page
References: <20220802180309.19340-1-joao.m.martins@oracle.com> <0b085bb1-b5f7-dfc2-588a-880de0d77ea2@oracle.com>
In-Reply-To: <0b085bb1-b5f7-dfc2-588a-880de0d77ea2@oracle.com>

On Wed, Aug 03, 2022 at 10:52:13AM +0100, Joao Martins wrote:
> On 8/3/22 05:11, Muchun Song wrote:
> > On Tue, Aug 02, 2022 at 07:03:09PM +0100, Joao Martins wrote:
> >> Today with `hugetlb_free_vmemmap=on` the struct page memory that is
> >> freed back to the page allocator is as follows: for a 2M hugetlb page it
> >> will reuse the first 4K vmemmap page to remap the remaining 7 vmemmap
> >> pages, and for a 1G hugetlb page it will remap the remaining 4095 vmemmap
> >> pages. Essentially, that means that it breaks the first 4K of a
> >> potentially contiguous chunk of memory of 32K (for 2M hugetlb pages) or
> >> 16M (for 1G hugetlb pages). For this reason the memory that is freed
> >> back to the page allocator cannot be used for hugetlb to allocate huge
> >> pages of the same size, but rather only of a smaller huge page size:
> >>
> >
> > Hi Joao,
> >
> > Thanks for your work on this. I admit you are right. The current mechanism
> > prevents the freed vmemmap pages from being merged into a potentially
> > contiguous page. Allocating a new head page is a straightforward approach;
> > however, it is very dangerous at runtime after the system has booted up.
> > Why dangerous? Because you should first 1) copy the content from the head
> > vmemmap page to the target (newly allocated) page, and then 2) change the
> > PTE entry to the new page. However, the content (especially the refcount)
> > of the old head vmemmap page could be changed elsewhere (e.g. other modules)
> > between steps 1) and 2). Eventually, the newly allocated vmemmap page is
> > corrupted. Unluckily, we don't have an easy approach to prevent it.
> >
> OK, I see what I missed. You mean the refcount (or any other data) on the
> preceding struct pages to this head struct page, unrelated to the hugetlb
> page being remapped. Meaning when the first struct page in the old vmemmap
> page is *not* the head page we are trying to remap, right?
>
> See further below in your patch, but I wonder if we could actually check this
> from the hugetlb head va being aligned to PAGE_SIZE. Meaning that we would be
> checking that this head page is the first struct page in the vmemmap page,
> and that we are still safe to remap it? If so, the patch could be simpler,
> more like mine, without the special freeing path you added below.
>
> If I'm right, see at the end.

I am not sure we are on the same page (it seems that we are not, after I saw
your patch below). So let me try to clarify with the following interleaving:

CPU0:                                   CPU1:

vmemmap_remap_free(start, end, reuse)
  // alloc a new page to replace the head vmemmap page
  page = alloc_pages_node();
  memcpy(page_address(page), reuse, PAGE_SIZE);
                                        // The @reuse address is still mapped to the
                                        // original page frame, so the change will be
                                        // reflected on the original page frame.
                                        get_page(reuse);
  vmemmap_remap_pte();
    // remap to the above newly allocated page
    set_pte_at();

  flush_tlb_kernel_range();
                                        // Now the @reuse address is mapped to the newly
                                        // allocated page frame, so the change will be
                                        // reflected on the new page frame, which is now
                                        // stale (corrupted).
                                        put_page(reuse);

So we would have to make 1) the memcpy, 2) the remap and 3) the TLB flush
atomic on CPU0, and that is not easy.

Thanks.

> > I also thought of solving this issue, but I didn't find any easy way to
> > solve it after system booting up. However, it is possible at system boot
> > time, because if the system is in a very early initialization stage,
> > no one should be accessing struct page yet. I implemented it a month ago.
> > I didn't send it out since some additional preparation work is required.
> > The main preparation is to move the allocation of HugeTLB to a very
> > early initialization stage (I want to put it into pure_initcall). Because
> > the allocation of HugeTLB is parsed from the cmdline, whose logic is very
> > complex, I have sent a patch to clean it up [1]. After this cleanup, it
> > will be easy to move the allocation to an early stage. Sorry that I have
> > been busy lately and didn't have time to send a new version, but I will
> > send it ASAP.
> >
> > [1] https://lore.kernel.org/all/20220616071827.3480-1-songmuchun@bytedance.com/
> >
> > The following diff is the main work to remap the head vmemmap page to
> > a newly allocated page.
> >
> I like how you use walk::reuse_addr to tell it's a head page.
> I didn't find my version as clean with passing a boolean to the remap_pte().
>
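For readers skimming the thread, the diff quoted below essentially does the
following for the head vmemmap page. This is a rough, illustrative
condensation only, not the actual patch: the helper name is made up, and the
list bookkeeping, the BUG_ON sanity check and the boot-time-only gating of the
real patch are omitted.

/* Illustrative sketch only; see the quoted diff for the real code. */
static void replace_head_vmemmap_page(unsigned long reuse, pte_t *ptep)
{
	int nid = page_to_nid((struct page *)reuse);
	gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY | __GFP_NOWARN;
	struct page *new_head;

	/* Allocate a replacement page frame on the same node as the old head. */
	new_head = alloc_pages_node(nid, gfp_mask, 0);
	if (!new_head)
		return;		/* fall back to reusing the existing head page */

	/* Copy the live head vmemmap contents into the new frame. */
	copy_page(page_to_virt(new_head), (void *)reuse);
	/* Retarget the head PTE at the new frame (writable, unlike the RO tails). */
	set_pte_at(&init_mm, reuse, ptep, mk_pte(new_head, PAGE_KERNEL));
	/*
	 * Flush so later accesses hit the new frame. The old frame can then be
	 * freed to the buddy allocator together with the tail vmemmap pages,
	 * leaving a fully contiguous run of free pages.
	 */
	flush_tlb_kernel_range(reuse, reuse + PAGE_SIZE);
}

The window between the copy and the TLB flush is exactly the race described
above, which is why the real patch only allows this while the system is still
booting.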
> > diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> > index 20f414c0379f..71f2d7335e6f 100644
> > --- a/mm/hugetlb_vmemmap.c
> > +++ b/mm/hugetlb_vmemmap.c
> > @@ -15,6 +15,7 @@
> >  #include
> >  #include
> >  #include "hugetlb_vmemmap.h"
> > +#include "internal.h"
> >
> >  /**
> >   * struct vmemmap_remap_walk - walk vmemmap page table
> > @@ -227,14 +228,37 @@ static inline void free_vmemmap_page(struct page *page)
> >  }
> >
> >  /* Free a list of the vmemmap pages */
> > -static void free_vmemmap_page_list(struct list_head *list)
> > +static void free_vmemmap_pages(struct list_head *list)
> >  {
> >  	struct page *page, *next;
> >
> > +	list_for_each_entry_safe(page, next, list, lru)
> > +		free_vmemmap_page(page);
> > +}
> > +
> > +/*
> > + * Free a list of vmemmap pages but skip per-cpu free list of buddy
> > + * allocator.
> > + */
> > +static void free_vmemmap_pages_nonpcp(struct list_head *list)
> > +{
> > +	struct zone *zone;
> > +	struct page *page, *next;
> > +
> > +	if (list_empty(list))
> > +		return;
> > +
> > +	zone = page_zone(list_first_entry(list, struct page, lru));
> > +	zone_pcp_disable(zone);
> >  	list_for_each_entry_safe(page, next, list, lru) {
> > -		list_del(&page->lru);
> > +		if (zone != page_zone(page)) {
> > +			zone_pcp_enable(zone);
> > +			zone = page_zone(page);
> > +			zone_pcp_disable(zone);
> > +		}
> >  		free_vmemmap_page(page);
> >  	}
> > +	zone_pcp_enable(zone);
> >  }
> >
> >  static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> > @@ -244,12 +268,28 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
> >  	 * Remap the tail pages as read-only to catch illegal write operation
> >  	 * to the tail pages.
> >  	 */
> > -	pgprot_t pgprot = PAGE_KERNEL_RO;
> > +	pgprot_t pgprot = addr == walk->reuse_addr ? PAGE_KERNEL : PAGE_KERNEL_RO;
> >  	pte_t entry = mk_pte(walk->reuse_page, pgprot);
> >  	struct page *page = pte_page(*pte);
> >
> >  	list_add_tail(&page->lru, walk->vmemmap_pages);
> >  	set_pte_at(&init_mm, addr, pte, entry);
> > +
> > +	if (unlikely(addr == walk->reuse_addr)) {
> > +		void *old = page_to_virt(page);
> > +
> > +		/* Remove it from the vmemmap_pages list to avoid being freed. */
> > +		list_del(&walk->reuse_page->lru);
> > +		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> > +		/*
> > +		 * If we reach here meaning the system is in a very early
> > +		 * initialization stage, anyone should not access struct page.
> > +		 * However, if there is something unexpected, the head struct
> > +		 * page most likely to be written (usually ->_refcount). Using
> > +		 * BUG_ON() to catch this unexpected case.
> > +		 */
> > +		BUG_ON(memcmp(old, (void *)addr, sizeof(struct page)));
> > +	}
> >  }
> >
> >  /*
> > @@ -298,7 +338,10 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> >   *		to remap.
> >   * @end:	end address of the vmemmap virtual address range that we want to
> >   *		remap.
> > - * @reuse:	reuse address.
> > + * @reuse:	reuse address. If @reuse is equal to @start, it means the page
> > + *		frame which @reuse address is mapped to will be replaced with a
> > + *		new page frame and the previous page frame will be freed, this
> > + *		is to reduce memory fragment.
> >   *
> >   * Return: %0 on success, negative error code otherwise.
> >   */
> > @@ -319,14 +362,26 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> >  	 * (see more details from the vmemmap_pte_range()):
> >  	 *
> >  	 * - The range [@start, @end) and the range [@reuse, @reuse + PAGE_SIZE)
> > -	 *   should be continuous.
> > +	 *   should be continuous or @start is equal to @reuse.
> >  	 * - The @reuse address is part of the range [@reuse, @end) that we are
> >  	 *   walking which is passed to vmemmap_remap_range().
> >  	 * - The @reuse address is the first in the complete range.
> >  	 *
> >  	 * So we need to make sure that @start and @reuse meet the above rules.
> >  	 */
> > -	BUG_ON(start - reuse != PAGE_SIZE);
> > +	BUG_ON(start - reuse != PAGE_SIZE && start != reuse);
> > +
> > +	if (unlikely(reuse == start)) {
> > +		int nid = page_to_nid((struct page *)start);
> > +		gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY |
> > +				 __GFP_NOWARN;
> > +
> > +		walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0);
> > +		if (walk.reuse_page) {
> > +			copy_page(page_to_virt(walk.reuse_page), (void *)reuse);
> > +			list_add(&walk.reuse_page->lru, &vmemmap_pages);
> > +		}
> > +	}
> >
> >  	mmap_read_lock(&init_mm);
> >  	ret = vmemmap_remap_range(reuse, end, &walk);
> > @@ -348,7 +403,10 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
> >  	}
> >  	mmap_read_unlock(&init_mm);
> >
> > -	free_vmemmap_page_list(&vmemmap_pages);
> > +	if (unlikely(reuse == start))
> > +		free_vmemmap_pages_nonpcp(&vmemmap_pages);
> > +	else
> > +		free_vmemmap_pages(&vmemmap_pages);
> >
> >  	return ret;
> >  }
> > @@ -512,6 +570,22 @@ static bool vmemmap_should_optimize(const struct hstate *h, const struct page *h
> >  	return true;
> >  }
> >
> > +/*
> > + * Control if the page frame which the address of the head vmemmap associated
> > + * with a HugeTLB page is mapped to should be replaced with a new page. The
> > + * vmemmap pages are usually mapped with huge PMD mapping, the head vmemmap
> > + * page frames is best freed to the buddy allocator once at an initial stage
> > + * of system booting to reduce memory fragment.
> > + */
> > +static bool vmemmap_remap_head __ro_after_init = true;
> > +
> > +static int __init vmemmap_remap_head_init(void)
> > +{
> > +	vmemmap_remap_head = false;
> > +	return 0;
> > +}
> > +core_initcall(vmemmap_remap_head_init);
> > +
> >  /**
> >   * hugetlb_vmemmap_optimize - optimize @head page's vmemmap pages.
> >   * @h:		struct hstate.
> > @@ -537,6 +611,19 @@ void hugetlb_vmemmap_optimize(const struct hstate *h, struct page *head)
> >  	vmemmap_start	+= HUGETLB_VMEMMAP_RESERVE_SIZE;
> >
> >  	/*
> > +	 * The vmemmap pages are usually mapped with huge PMD mapping. If the
> > +	 * head vmemmap page is not freed to the buddy allocator, then those
> > +	 * freed tail vmemmap pages cannot be merged into a big order chunk.
> > +	 * The head vmemmap page frame can be replaced with a new allocated
> > +	 * page and be freed to the buddy allocator, then those freed vmemmmap
> > +	 * pages have the opportunity to be merged into larger contiguous pages
> > +	 * to reduce memory fragment. vmemmap_remap_free() will do this if
> > +	 * @vmemmap_remap_free is equal to @vmemmap_reuse.
> > +	 */
> > +	if (unlikely(vmemmap_remap_head))
> > +		vmemmap_start = vmemmap_reuse;
> > +
>
> Maybe it doesn't strictly need to be early init vs. late init.
>
> I wonder if we can still make this trick work late-stage, but only when the
> head struct page is aligned to PAGE_SIZE / sizeof(struct page): when it is,
> it means we are safe to replace the head vmemmap page, provided we would only
> be covering this hugetlb page's data? Unless I am missing something and the
> check wouldn't be enough?
>
> A quick check with this snip added:
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 2b97df8115fe..06e028734b1e 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -331,7 +331,7 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
>  	};
>  	gfp_t gfp_mask = GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
>  	int nid = page_to_nid((struct page *)start);
> -	struct page *page;
> +	struct page *page = NULL;
>
>  	/*
>  	 * Allocate a new head vmemmap page to avoid breaking a contiguous
> @@ -341,7 +341,8 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
>  	 * more allocations of hugepages. Fallback to the currently
>  	 * mapped head page in case should it fail to allocate.
>  	 */
> -	page = alloc_pages_node(nid, gfp_mask, 0);
> +	if (IS_ALIGNED(start, PAGE_SIZE))
> +		page = alloc_pages_node(nid, gfp_mask, 0);
>  	walk.head_page = page;
>
>  	/*
>
> ... and it looks like it can still be effective, just not as much, more or
> less as expected?
>
> # with 2M hugepages
>
> Before:
>
> Node 0, zone Normal, type Movable  76  28  10  4  1  0  0  0  1  1  15568
>
> After (allocated 32400):
>
> Node 0, zone Normal, type Movable  135  328  219  198  155  106  72  41  23  0  0
>
> Compared to my original patch, where there weren't any pages left in
> order-0..order-2:
>
> Node 0, zone Normal, type Movable  0  1  0  70  106  91  78  48  17  0  0
>
> But still much better than without any of this:
>
> Node 0, zone Normal, type Movable  32174  31999  31642  104  58  24  16  4  2  0  0
>
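As an aside for anyone checking the numbers in this thread: the 7-of-8 (2M)
and 4095-of-4096 (1G) vmemmap pages, and the 32K/16M figures, follow from a
64-byte struct page on 4K base pages, and freeing the whole run including the
head page is what lets the buddy allocator merge it back into higher orders
(hence why the order-0..order-2 counts shrink in the numbers above). A
standalone back-of-the-envelope check, assuming 4K base pages and
sizeof(struct page) == 64; purely illustrative, not part of the patches:

#include <stdio.h>

int main(void)
{
	const unsigned long page_size = 4096;		/* 4K base pages */
	const unsigned long struct_page_size = 64;	/* typical sizeof(struct page) */
	const unsigned long hugepage_sizes[] = { 2UL << 20, 1UL << 30 };

	for (int i = 0; i < 2; i++) {
		unsigned long nr_base_pages = hugepage_sizes[i] / page_size;
		unsigned long vmemmap_bytes = nr_base_pages * struct_page_size;
		unsigned long vmemmap_pages = vmemmap_bytes / page_size;

		printf("%4luM hugepage: %6lu struct pages = %5luK of vmemmap = %4lu pages "
		       "(%lu freed when only the head page is reused)\n",
		       hugepage_sizes[i] >> 20, nr_base_pages, vmemmap_bytes >> 10,
		       vmemmap_pages, vmemmap_pages - 1);
	}
	return 0;
}

This prints 8 vmemmap pages (7 freed) for 2M and 4096 vmemmap pages (4095
freed) for 1G, matching the figures quoted at the top of the thread.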