Subject: Re: [PATCH v2] mm/hugetlb_vmemmap: remap head page to newly allocated page
From: Muchun Song
To: Joao Martins
Cc: Linux Memory Management List, Muchun Song, Mike Kravetz, Andrew Morton
Date: Tue, 8 Nov 2022 17:13:47 +0800
Message-Id: <28F33F85-1DCC-49B5-95A5-B005A7B11E57@linux.dev>
In-Reply-To: <20221107153922.77094-1-joao.m.martins@oracle.com>
References: <20221107153922.77094-1-joao.m.martins@oracle.com>

> On Nov 7, 2022, at 23:39, Joao Martins wrote:
> 
> Today with `hugetlb_free_vmemmap=on` the struct page memory that is freed
> back to the page allocator is as follows: for a 2M hugetlb page it will reuse
> the first 4K vmemmap page to remap the remaining 7 vmemmap pages, and for a
> 1G hugetlb it will remap the remaining 4095 vmemmap pages. Essentially,
> that means that it breaks the first 4K of a potentially contiguous chunk of
> memory of 32K (for 2M hugetlb pages) or 16M (for 1G hugetlb pages). For
> this reason the memory that is freed back to the page allocator cannot be
> used by hugetlb to allocate huge pages of the same size, but only of a
> smaller huge page size:
> 
> Trying to assign a 64G node to hugetlb (on a 128G 2-node guest, each node
> having 64G):
> 
> * Before allocation:
> Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
> ...
> Node    0, zone   Normal, type    Movable    340    100     32     15      1      2      0      0      0      1  15558
> 
> $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> 31987
> 
> * After:
> 
> Node    0, zone   Normal, type    Movable  30893  32006  31515      7      0      0      0      0      0      0      0
> 
> Notice how the memory freed back is put back into the 4K / 8K / 16K page
> pools.
> And it allocates a total of 31974 pages (63948M).
> 
> To fix this behaviour, rather than remapping one page (thus breaking the
> contiguous block of memory backing the struct pages), repopulate with a new
> page for the head vmemmap page. It copies the data from the currently
> mapped vmemmap page, and then remaps it to this new page. Additionally,
> change the remap_pte callback to look at the newly added walk::head_page,
> which needs to be mapped as r/w, compared to the tail page vmemmap reuse
> that uses r/o.
> 
> The new head page is allocated by the caller of vmemmap_remap_free() given
> that on restore it should still be using the same code path as before. Note
> that, because right now one hugepage is remapped at a time, only one
> free 4K page at a time is needed to remap the head page. Should it fail to
> allocate said new page, it reuses the one that's already mapped just like
> before. As a result, for every 64G of contiguous hugepages it can give back
> 1G more of contiguous memory, while needing in total 128M of new 4K
> pages (for 2M hugetlb) or 256K (for 1G hugetlb).
> 
> After the changes, try to assign a 64G node to hugetlb (on a 128G 2-node
> guest, each node with 64G):
> 
> * Before allocation:
> Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
> ...
> Node    0, zone   Normal, type    Movable      1      1      1      0      0      1      0      0      1      1  15564
> 
> $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> 32394
> 
> * After:
> 
> Node    0, zone   Normal, type    Movable      0     50     97    108     96     81     70     46     18      0      0

Thanks for your work.

> 
> In the example above, 407 more hugetlb 2M pages are allocated, i.e. 814M out
> of the 32394 (64796M) allocated. So the memory freed back is indeed being
> used back in hugetlb and there is no massive amount of order-0..order-2 pages
> accumulated unused.
> 
> Signed-off-by: Joao Martins
> ---
> Changes since v1[0]:
> * Drop the rw argument and check walk::head_page directly when there's
>   no reuse_page set (similar suggestion by Muchun Song to adjust
>   inside the remap_pte callback)
> * Adjust the TLB flush to cover the head page vaddr too (Muchun Song)
> * Simplify the remap of the head page in vmemmap_pte_range()
> * Check that start is aligned to PAGE_SIZE in vmemmap_remap_free()
> 
> I've kept the same structure as in v1 compared to the chunk Muchun
> pasted in the v1 thread[1], and thus I am not altering the calling
> convention of vmemmap_remap_free()/vmemmap_restore_pte().
> The remapping of the head page is not exactly a page that is reused,
> compared to the r/o tail vmemmap pages remapping. So it is a tiny semantic
> change, albeit with the same outcome in practice of changing the PTE and
> freeing the page, just with different permissions. It also made it simpler
> to fail gracefully in case of page allocation failure, and the logic simpler
> to follow IMHO.
> 
> Let me know otherwise if I followed the wrong thinking.
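
Just as a cross-check of the sizes quoted in the changelog, the arithmetic can
be worked out with the small standalone program below. It assumes 4K base pages
and a 64-byte struct page (the usual x86-64 layout); the program and its names
are only an illustration, not kernel code.

/*
 * Standalone cross-check of the changelog arithmetic.  Assumes 4K base
 * pages and a 64-byte struct page (typical x86-64); not kernel code.
 */
#include <stdio.h>

int main(void)
{
	const unsigned long long K = 1024, M = K * K, G = M * K;
	const unsigned long long page_size = 4 * K, struct_page = 64;

	/* vmemmap bytes backing one 2M and one 1G hugetlb page */
	unsigned long long vmemmap_2m = (2 * M / page_size) * struct_page; /* 32K */
	unsigned long long vmemmap_1g = (1 * G / page_size) * struct_page; /* 16M */

	printf("2M hugepage: %llu vmemmap pages, %llu freed today\n",
	       vmemmap_2m / page_size, vmemmap_2m / page_size - 1);  /* 8, 7 */
	printf("1G hugepage: %llu vmemmap pages, %llu freed today\n",
	       vmemmap_1g / page_size, vmemmap_1g / page_size - 1);  /* 4096, 4095 */

	/* cost of one extra head page per hugepage, for 64G worth of hugepages */
	printf("64G of 2M hugepages: %lluM of new head pages\n",
	       64 * G / (2 * M) * page_size / M);                     /* 128M */
	printf("64G of 1G hugepages: %lluK of new head pages\n",
	       64 * G / G * page_size / K);                           /* 256K */
	return 0;
}
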
> 
> [0] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/
> [1] https://lore.kernel.org/linux-mm/Yun1bJsnK%2F6MFc0b@FVFYT0MHHV2J/
> 
> ---
> mm/hugetlb_vmemmap.c | 59 ++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 52 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 7898c2c75e35..4298c44578e3 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -22,6 +22,7 @@
>   *
>   * @remap_pte:		called for each lowest-level entry (PTE).
>   * @nr_walked:		the number of walked pte.
> + * @head_page:		the page which replaces the head vmemmap page.
>   * @reuse_page:		the page which is reused for the tail vmemmap pages.
>   * @reuse_addr:		the virtual address of the @reuse_page page.
>   * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
> @@ -31,6 +32,7 @@ struct vmemmap_remap_walk {
>  	void (*remap_pte)(pte_t *pte, unsigned long addr,
>  			  struct vmemmap_remap_walk *walk);
>  	unsigned long nr_walked;
> +	struct page *head_page;

This field is unnecessary. We can reuse ->reuse_page to implement the same
functionality. I'll explain the reason later.

>  	struct page *reuse_page;
>  	unsigned long reuse_addr;
>  	struct list_head *vmemmap_pages;
> @@ -105,10 +107,26 @@ static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
>  	 * remapping (which is calling @walk->remap_pte).
>  	 */
>  	if (!walk->reuse_page) {
> -		walk->reuse_page = pte_page(*pte);
> +		struct page *page = pte_page(*pte);
> +
> +		/*
> +		 * Copy the data from the original head, and remap to
> +		 * the newly allocated page.
> +		 */
> +		if (walk->head_page) {
> +			memcpy(page_address(walk->head_page),
> +			       page_address(page), PAGE_SIZE);
> +			walk->remap_pte(pte, addr, walk);
> +			page = walk->head_page;
> +		}
> +
> +		walk->reuse_page = page;
> +
>  		/*
> -		 * Because the reuse address is part of the range that we are
> -		 * walking, skip the reuse address range.
> +		 * Because the reuse address is part of the range that
> +		 * we are walking or the head page was remapped to a
> +		 * new page, skip the reuse address range.
> +		 * .
> +		 */
>  		addr += PAGE_SIZE;
>  		pte++;
> @@ -204,11 +222,11 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>  	} while (pgd++, addr = next, addr != end);
> 
>  	/*
> -	 * We only change the mapping of the vmemmap virtual address range
> -	 * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
> +	 * We change the mapping of the vmemmap virtual address range
> +	 * [@start, end], so we only need to flush the TLB which
>  	 * belongs to the range.
>  	 */

This comment could go away. The reason I added it here is that the flush range
was a bit special: I wanted to tell others why we didn't flush the full range
from @start to @end. Now that the full range is flushed, I think it can be
removed.

> -	flush_tlb_kernel_range(start + PAGE_SIZE, end);
> +	flush_tlb_kernel_range(start, end);
> 
>  	return 0;
>  }
> @@ -244,9 +262,21 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>  	 * to the tail pages.
>  	 */
>  	pgprot_t pgprot = PAGE_KERNEL_RO;
> -	pte_t entry = mk_pte(walk->reuse_page, pgprot);
> +	struct page *reuse = walk->reuse_page;
>  	struct page *page = pte_page(*pte);
> +	pte_t entry;
> 
> +	/*
> +	 * When there's no walk::reuse_page, it means we allocated a new head
> +	 * page (stored in walk::head_page) and copied from the old head page.
> +	 * In that case use the walk::head_page as the page to remap.
> + */ > + if (!reuse) { > + pgprot =3D PAGE_KERNEL; > + reuse =3D walk->head_page; > + } > + > + entry =3D mk_pte(reuse, pgprot); > list_add_tail(&page->lru, walk->vmemmap_pages); > set_pte_at(&init_mm, addr, pte, entry); > } > @@ -315,6 +345,21 @@ static int vmemmap_remap_free(unsigned long = start, unsigned long end, > .reuse_addr =3D reuse, > .vmemmap_pages =3D &vmemmap_pages, > }; > + gfp_t gfp_mask =3D GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN; It is better to add __GFP_THISNODE here for performance. And replace __GFP_RETRY_MAYFAIL to __GFP_NORETRY to keep consistent with hugetlb_vmemmap_restore(). > + int nid =3D page_to_nid((struct page *)start); > + struct page *page =3D NULL; > + > + /* > + * Allocate a new head vmemmap page to avoid breaking a contiguous > + * block of struct page memory when freeing it back to page allocator > + * in free_vmemmap_page_list(). This will allow the likely contiguous > + * struct page backing memory to be kept contiguous and allowing for > + * more allocations of hugepages. Fallback to the currently > + * mapped head page in case should it fail to allocate. > + */ > + if (IS_ALIGNED((unsigned long)start, PAGE_SIZE)) I'm curious why we need this check. IIUC, this is unnecessary. > + page =3D alloc_pages_node(nid, gfp_mask, 0); > + walk.head_page =3D page; >=20 > /* > * In order to make remapping routine most efficient for the huge = pages, > --=20 > 2.17.2 >=20 I have implemented a version based on yours, which does not introduce ->head_page field (Not test if it works). Seems to be simple. Thanks. diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c index c98805d5b815..8ee94f6a6697 100644 --- a/mm/hugetlb_vmemmap.c +++ b/mm/hugetlb_vmemmap.c @@ -202,12 +202,7 @@ static int vmemmap_remap_range(unsigned long start, = unsigned long end, return ret; } while (pgd++, addr =3D next, addr !=3D end); - /* - * We only change the mapping of the vmemmap virtual address = range - * [@start + PAGE_SIZE, end), so we only need to flush the TLB = which - * belongs to the range. - */ - flush_tlb_kernel_range(start + PAGE_SIZE, end); + flush_tlb_kernel_range(start, end); return 0; } @@ -246,6 +241,12 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned = long addr, pte_t entry =3D mk_pte(walk->reuse_page, pgprot); struct page *page =3D pte_page(*pte); + /* The case of remapping the head vmemmap page. */ + if (unlikely(addr =3D=3D walk->reuse_addr)) { + list_del(&walk->reuse_page->lru); + entry =3D mk_pte(walk->reuse_page, PAGE_KERNEL); + } + list_add_tail(&page->lru, walk->vmemmap_pages); set_pte_at(&init_mm, addr, pte, entry); } @@ -310,6 +311,8 @@ static int vmemmap_remap_free(unsigned long start, = unsigned long end, .reuse_addr =3D reuse, .vmemmap_pages =3D &vmemmap_pages, }; + int nid =3D page_to_nid((struct page *)start); + gfp_t gfp_mask =3D GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY | = __GFP_NOWARN; /* * In order to make remapping routine most efficient for the = huge pages, @@ -326,6 +329,20 @@ static int vmemmap_remap_free(unsigned long start, = unsigned long end, */ BUG_ON(start - reuse !=3D PAGE_SIZE); + /* + * Allocate a new head vmemmap page to avoid breaking a = contiguous + * block of struct page memory when freeing it back to page = allocator + * in free_vmemmap_page_list(). This will allow the likely = contiguous + * struct page backing memory to be kept contiguous and allowing = for + * more allocations of hugepages. Fallback to the currently + * mapped head page in case should it fail to allocate. 
+	 */
+	walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0);
+	if (walk.reuse_page) {
+		copy_page(page_to_virt(walk.reuse_page), walk.reuse_addr);
+		list_add(&walk.reuse_page->lru, &vmemmap_pages);
+	}
+
 	mmap_read_lock(&init_mm);
 	ret = vmemmap_remap_range(reuse, end, &walk);
 	if (ret && walk.nr_walked) {
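
For completeness, the head-page trick that both versions implement can be
sketched in plain userspace C as below. struct mock_pte and remap_head() are
made-up illustrative names, not the kernel's API; this is only a sketch of
"copy the old head into a freshly allocated page, otherwise fall back to
reusing it", not the actual implementation.

/*
 * Userspace mock of the head-page remapping idea; not kernel code.  Copy the
 * old head into a freshly allocated page so the whole original vmemmap block
 * can be handed back; if the allocation fails, keep the old head mapped,
 * which matches today's behaviour.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096UL

struct mock_pte {
	void *page;	/* what the head vmemmap address points at */
	int writable;	/* head stays r/w, unlike the r/o tail remapping */
};

/* Returns the old head page if it can now be freed, or NULL on fallback. */
static void *remap_head(struct mock_pte *head)
{
	void *old = head->page;
	void *new_head = malloc(PAGE_SIZE);

	if (!new_head)
		return NULL;			/* keep the old mapping */

	memcpy(new_head, old, PAGE_SIZE);	/* preserve the struct page data */
	head->page = new_head;
	head->writable = 1;
	return old;				/* old head can go back to the allocator */
}

int main(void)
{
	struct mock_pte pte = { .page = calloc(1, PAGE_SIZE), .writable = 1 };
	void *old_head = remap_head(&pte);

	printf("old head %s\n", old_head ? "can be freed" : "kept (fallback)");
	free(old_head);
	free(pte.page);
	return 0;
}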