Subject: Re: [PATCH v2] mm/hugetlb_vmemmap: remap head page to newly allocated page
From: Muchun Song
Date: Wed, 9 Nov 2022 10:42:16 +0800
To: Joao Martins
Cc: Linux Memory Management List, Muchun Song, Mike Kravetz, Andrew Morton
In-Reply-To: <8482a2e8-89fb-bb20-6b83-81c548bff639@oracle.com>
References: <20221107153922.77094-1-joao.m.martins@oracle.com> <28F33F85-1DCC-49B5-95A5-B005A7B11E57@linux.dev> <8482a2e8-89fb-bb20-6b83-81c548bff639@oracle.com>

> On Nov 8, 2022, at 18:38, Joao Martins wrote:
>
> On 08/11/2022 09:13, Muchun Song wrote:
>>> On Nov 7, 2022, at 23:39, Joao Martins wrote:
>>>
>>> Today with `hugetlb_free_vmemmap=on` the struct page memory that is freed
>>> back to the page allocator is as follows: for a 2M hugetlb page it will reuse
>>> the first 4K vmemmap page to remap the remaining 7 vmemmap pages, and for a
>>> 1G hugetlb page it will remap the remaining 4095 vmemmap pages. Essentially,
>>> that means it breaks the first 4K of a potentially contiguous chunk of
>>> memory of 32K (for 2M hugetlb pages) or 16M (for 1G hugetlb pages). For
>>> this reason the memory that is freed back to the page allocator cannot be
>>> used by hugetlb to allocate huge pages of the same size, but only huge
>>> pages of a smaller size:
>>>
>>> Trying to assign a 64G node to hugetlb (on a 128G 2-node guest, each node
>>> having 64G):
>>>
>>> * Before allocation:
>>> Free pages count per migrate type at order      0      1      2      3      4      5      6      7      8      9     10
>>> ...
>>> Node    0, zone   Normal, type      Movable    340    100     32     15      1      2      0      0      0      1  15558
>>>
>>> $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>> $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>> 31987
>>>
>>> * After:
>>>
>>> Node    0, zone   Normal, type      Movable  30893  32006  31515      7      0      0      0      0      0      0      0
>>>
>>> Notice how the memory freed back is put into the 4K / 8K / 16K page pools.
>>> And it allocates a total of 31974 pages (63948M).
>>>
>>> To fix this behaviour, rather than remapping to one of the existing pages
>>> (thus breaking the contiguous block of memory backing the struct pages),
>>> repopulate with a new page for the head vmemmap page. It copies the data
>>> from the currently mapped vmemmap page, and then remaps it to this new
>>> page. Additionally, change the remap_pte callback to look at the newly
>>> added walk::head_page, which needs to be mapped r/w, compared to the tail
>>> page vmemmap reuse which uses r/o.
>>>
>>> The new head page is allocated by the caller of vmemmap_remap_free() given
>>> that on restore it should still use the same code path as before. Note
>>> that, because right now one hugepage is remapped at a time, only one
>>> free 4K page at a time is needed to remap the head page. Should it fail to
>>> allocate said new page, it reuses the one that's already mapped, just like
>>> before. As a result, for every 64G of contiguous hugepages it can give back
>>> 1G more of contiguous memory per 64G, while needing in total 128M of new 4K
>>> pages (for 2M hugetlb) or 256K (for 1G hugetlb).
>>>
>>> After the changes, try to assign a 64G node to hugetlb (on a 128G 2-node
>>> guest, each node with 64G):
>>>
>>> * Before allocation
>>> Free pages count per migrate type at order      0      1      2      3      4      5      6      7      8      9     10
>>> ...
>>> Node    0, zone   Normal, type      Movable      1      1      1      0      0      1      0      0      1      1  15564
>>>
>>> $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>> $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>> 32394
>>>
>>> * After:
>>>
>>> Node    0, zone   Normal, type      Movable      0     50     97    108     96     81     70     46     18      0      0
>>
>> Thanks for your work.
>>
> Thanks for the comments!
>
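As a side note for anyone double-checking the sizes quoted above, they work
out as in the quick userspace calculation below (plain C, not kernel code;
the 4K base page and 64-byte struct page sizes are assumptions, matching the
usual x86-64 configuration):

#include <stdio.h>

int main(void)
{
	const unsigned long long page_sz     = 4096;     /* 4K base pages (assumed)       */
	const unsigned long long struct_page = 64;       /* sizeof(struct page) (assumed) */
	const unsigned long long hp_2m = 2ULL << 20;
	const unsigned long long hp_1g = 1ULL << 30;
	const unsigned long long node  = 64ULL << 30;    /* one 64G node */

	/* vmemmap backing one hugepage: one struct page per base page */
	unsigned long long vmemmap_2m = hp_2m / page_sz * struct_page;  /* 32K -> 8 pages    */
	unsigned long long vmemmap_1g = hp_1g / page_sz * struct_page;  /* 16M -> 4096 pages */

	/* one freshly allocated 4K head page per hugepage, over a full 64G node */
	printf("2M: %llu vmemmap pages each, %lluM of new head pages, %lluG of vmemmap per node\n",
	       vmemmap_2m / page_sz,
	       node / hp_2m * page_sz >> 20,             /* 128M */
	       node / hp_2m * vmemmap_2m >> 30);         /* 1G   */
	printf("1G: %llu vmemmap pages each, %lluK of new head pages\n",
	       vmemmap_1g / page_sz,
	       node / hp_1g * page_sz >> 10);            /* 256K */
	return 0;
}
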
>>>
>>>
>>> In the example above, 407 more hugetlb 2M pages are allocated, i.e. 814M out
>>> of the 32394 (64796M) allocated. So the memory freed back is indeed being
>>> used back in hugetlb and there is no massive pile of order-0..order-2 pages
>>> accumulating unused.
>>>
>>> Signed-off-by: Joao Martins
>>> ---
>>> Changes since v1[0]:
>>> * Drop the rw argument and check walk::head_page directly when there's
>>>   no reuse_page set (similar to Muchun Song's suggestion to adjust this
>>>   inside the remap_pte callback)
>>> * Adjust the TLB flush to cover the head page vaddr too (Muchun Song)
>>> * Simplify the remap of the head page in vmemmap_pte_range()
>>> * Check that start is aligned to PAGE_SIZE in vmemmap_remap_free()
>>>
>>> I've kept the same structure as in v1, compared to the chunk Muchun
>>> pasted in the v1 thread[1], and thus I am not altering the calling
>>> convention of vmemmap_remap_free()/vmemmap_restore_pte().
>>> The remapping of the head page is not exactly a page that is reused,
>>> compared to the r/o tail vmemmap pages remapping. So it is a tiny semantic
>>> change, albeit with the same outcome in practice of changing the PTE and
>>> freeing the page, with different permissions. It also made it simpler to
>>> fail gracefully in case of page allocation failure, and the logic simpler
>>> to follow IMHO.
>>>
>>> Let me know otherwise if I followed the wrong thinking.
>>>
>>> [0] https://lore.kernel.org/linux-mm/20220802180309.19340-1-joao.m.martins@oracle.com/
>>> [1] https://lore.kernel.org/linux-mm/Yun1bJsnK%2F6MFc0b@FVFYT0MHHV2J/
>>>
>>> ---
>>> mm/hugetlb_vmemmap.c | 59 ++++++++++++++++++++++++++++++++++++++------
>>> 1 file changed, 52 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index 7898c2c75e35..4298c44578e3 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -22,6 +22,7 @@
>>>  *
>>>  * @remap_pte:		called for each lowest-level entry (PTE).
>>>  * @nr_walked:		the number of walked pte.
>>> + * @head_page:		the page which replaces the head vmemmap page.
>>>  * @reuse_page:		the page which is reused for the tail vmemmap pages.
>>>  * @reuse_addr:		the virtual address of the @reuse_page page.
>>>  * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
>>> @@ -31,6 +32,7 @@ struct vmemmap_remap_walk {
>>>  	void (*remap_pte)(pte_t *pte, unsigned long addr,
>>>  			  struct vmemmap_remap_walk *walk);
>>>  	unsigned long nr_walked;
>>> +	struct page *head_page;
>>
>> This field is unnecessary. We can reuse ->reuse_page to implement the same
>> functionality. I'll explain the reason later.
>>
>
> OK, I'll comment below
>
>>>  	struct page *reuse_page;
>>>  	unsigned long reuse_addr;
>>>  	struct list_head *vmemmap_pages;
>>> @@ -105,10 +107,26 @@ static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
>>>  	 * remapping (which is calling @walk->remap_pte).
>>>  	 */
>>>  	if (!walk->reuse_page) {
>>> -		walk->reuse_page = pte_page(*pte);
>>> +		struct page *page = pte_page(*pte);
>>> +
>>> +		/*
>>> +		 * Copy the data from the original head, and remap to
>>> +		 * the newly allocated page.
>>> +		 */
>>> +		if (walk->head_page) {
>>> +			memcpy(page_address(walk->head_page),
>>> +			       page_address(page), PAGE_SIZE);
>>> +			walk->remap_pte(pte, addr, walk);
>>> +			page = walk->head_page;
>>> +		}
>>> +
>>> +		walk->reuse_page = page;
>>> +
>>>  		/*
>>> -		 * Because the reuse address is part of the range that we are
>>> -		 * walking, skip the reuse address range.
>>> +		 * Because the reuse address is part of the range that
>>> +		 * we are walking or the head page was remapped to a
>>> +		 * new page, skip the reuse address range.
>>> +		 * .
>>>  		 */
>>>  		addr += PAGE_SIZE;
>>>  		pte++;
>>> @@ -204,11 +222,11 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>>>  	} while (pgd++, addr = next, addr != end);
>>>
>>>  	/*
>>> -	 * We only change the mapping of the vmemmap virtual address range
>>> -	 * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
>>> +	 * We change the mapping of the vmemmap virtual address range
>>> +	 * [@start, end], so we only need to flush the TLB which
>>>  	 * belongs to the range.
>>>  	 */
>>
>> This comment could go away. The reason I added it here is that this spot is
>> a bit special: I wanted to tell others why we don't flush the full range
>> from @start to @end. Now I think it can go away.
>>
>
> OK
>
>>> -	flush_tlb_kernel_range(start + PAGE_SIZE, end);
>>> +	flush_tlb_kernel_range(start, end);
>>>
>>>  	return 0;
>>>  }
>>> @@ -244,9 +262,21 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>>>  	 * to the tail pages.
>>>  	 */
>>>  	pgprot_t pgprot = PAGE_KERNEL_RO;
>>> -	pte_t entry = mk_pte(walk->reuse_page, pgprot);
>>> +	struct page *reuse = walk->reuse_page;
>>>  	struct page *page = pte_page(*pte);
>>> +	pte_t entry;
>>>
>>> +	/*
>>> +	 * When there's no walk::reuse_page, it means we allocated a new head
>>> +	 * page (stored in walk::head_page) and copied from the old head page.
>>> +	 * In that case use the walk::head_page as the page to remap.
>>> +	 */
>>> +	if (!reuse) {
>>> +		pgprot = PAGE_KERNEL;
>>> +		reuse = walk->head_page;
>>> +	}
>>> +
>>> +	entry = mk_pte(reuse, pgprot);
>>>  	list_add_tail(&page->lru, walk->vmemmap_pages);
>>>  	set_pte_at(&init_mm, addr, pte, entry);
>>>  }
>>> @@ -315,6 +345,21 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
>>>  		.reuse_addr	= reuse,
>>>  		.vmemmap_pages	= &vmemmap_pages,
>>>  	};
>>> +	gfp_t gfp_mask = GFP_KERNEL|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
>>
>> It is better to add __GFP_THISNODE here for performance. And replace
>> __GFP_RETRY_MAYFAIL with __GFP_NORETRY to keep it consistent with
>> hugetlb_vmemmap_restore().
>>
>
> OK
>
>>> +	int nid = page_to_nid((struct page *)start);
>>> +	struct page *page = NULL;
>>> +
>>> +	/*
>>> +	 * Allocate a new head vmemmap page to avoid breaking a contiguous
>>> +	 * block of struct page memory when freeing it back to page allocator
>>> +	 * in free_vmemmap_page_list(). This will allow the likely contiguous
>>> +	 * struct page backing memory to be kept contiguous and allowing for
>>> +	 * more allocations of hugepages. Fallback to the currently
>>> +	 * mapped head page in case should it fail to allocate.
>>> +	 */
>>> +	if (IS_ALIGNED((unsigned long)start, PAGE_SIZE))
>>
>> I'm curious why we need this check. IIUC, this is unnecessary.
>>
>
> So if the start of the vmemmap range (the head page) we will remap isn't the
> first struct page, then we would corrupt the other struct pages in
> that vmemmap page unrelated to hugetlb? That was my thinking.

Actually, the @start address should always be aligned to PAGE_SIZE. If not,
vmemmap_remap_range() will complain. So the check can be removed.
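For completeness, the alignment can also be seen from the layout: the head
struct page of a hugepage sits at offset pfn * sizeof(struct page) inside the
vmemmap array, the head pfn is aligned to the hugepage size, and with the
usual power-of-two 64-byte struct page (an assumption here, as is the 4K base
page) that offset is always a multiple of PAGE_SIZE. A throwaway userspace
check, not kernel code:

#include <stdio.h>

int main(void)
{
	const unsigned long page_sz = 4096, struct_page = 64;   /* assumed sizes */
	const unsigned long pfns_per_2m = (2UL << 20) / page_sz;     /* 512    */
	const unsigned long pfns_per_1g = (1UL << 30) / page_sz;     /* 262144 */

	/*
	 * A hugepage-aligned pfn is a multiple of pfns_per_*, so its head
	 * struct page lands at a multiple of this stride inside the vmemmap
	 * array; both strides are multiples of the 4K page size.
	 */
	printf("2M stride %lu bytes, %% PAGE_SIZE = %lu\n",
	       pfns_per_2m * struct_page, pfns_per_2m * struct_page % page_sz);
	printf("1G stride %lu bytes, %% PAGE_SIZE = %lu\n",
	       pfns_per_1g * struct_page, pfns_per_1g * struct_page % page_sz);
	return 0;
}
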
>
>>> +		page = alloc_pages_node(nid, gfp_mask, 0);
>>> +	walk.head_page = page;
>>>
>>>  	/*
>>>  	 * In order to make remapping routine most efficient for the huge pages,
>>> --
>>> 2.17.2
>>>
>>
>> I have implemented a version based on yours, which does not introduce
>> the ->head_page field (not tested whether it works). It seems to be simple.
>>
>
> Let me try out with the adjustment below
>
>> Thanks.
>>
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index c98805d5b815..8ee94f6a6697 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -202,12 +202,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>>  		return ret;
>>  	} while (pgd++, addr = next, addr != end);
>>
>> -	/*
>> -	 * We only change the mapping of the vmemmap virtual address range
>> -	 * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
>> -	 * belongs to the range.
>> -	 */
>> -	flush_tlb_kernel_range(start + PAGE_SIZE, end);
>> +	flush_tlb_kernel_range(start, end);
>>
>>  	return 0;
>>  }
>> @@ -246,6 +241,12 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>>  	pte_t entry = mk_pte(walk->reuse_page, pgprot);
>>  	struct page *page = pte_page(*pte);
>>
>> +	/* The case of remapping the head vmemmap page. */
>> +	if (unlikely(addr == walk->reuse_addr)) {
>
> You replaced the head_page with a check on reuse_addr, but that is
> set on the tail page. So if we want to rely on reuse_addr, perhaps it is
> best if we do:
>
> if (unlikely(addr == (walk->reuse_addr - PAGE_SIZE))) {
> 	...
> }

I don't think so. The @addr here should be equal to @walk->reuse_addr when
vmemmap_remap_pte() is first called, since @addr is not updated in
vmemmap_pte_range(). Right?

Thanks.
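To make the ordering concrete, below is a toy model of the walk in plain
userspace C (not the kernel code; the names are only illustrative). Once
walk->reuse_page has been pre-allocated in vmemmap_remap_free(), the "skip
the reuse address" branch in vmemmap_pte_range() is never taken, so the very
first remap_pte() call happens for addr == reuse_addr, i.e. the head vmemmap
page:

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

struct toy_walk {
	bool reuse_page_set;		/* models walk->reuse_page != NULL */
	unsigned long reuse_addr;
};

static void remap_pte(unsigned long addr, struct toy_walk *w)
{
	if (addr == w->reuse_addr)
		printf("0x%lx: head page (addr == reuse_addr), remap RW\n", addr);
	else
		printf("0x%lx: tail page, remap RO to the reuse page\n", addr);
}

static void walk_range(unsigned long start, unsigned long end, struct toy_walk *w)
{
	for (unsigned long addr = start; addr != end; addr += PAGE_SIZE) {
		if (!w->reuse_page_set) {	/* only if the allocation failed */
			w->reuse_page_set = true;
			continue;		/* old behaviour: skip the head */
		}
		remap_pte(addr, w);
	}
}

int main(void)
{
	/* vmemmap_remap_free() walks [reuse, end) with reuse_page pre-allocated */
	struct toy_walk w = { .reuse_page_set = true, .reuse_addr = 0x1000 };

	walk_range(0x1000, 0x5000, &w);
	return 0;
}
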
>
>> +		list_del(&walk->reuse_page->lru);
>> +		entry = mk_pte(walk->reuse_page, PAGE_KERNEL);
>> +	}
>> +
>>  	list_add_tail(&page->lru, walk->vmemmap_pages);
>>  	set_pte_at(&init_mm, addr, pte, entry);
>>  }
>> @@ -310,6 +311,8 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
>>  		.reuse_addr	= reuse,
>>  		.vmemmap_pages	= &vmemmap_pages,
>>  	};
>> +	int nid = page_to_nid((struct page *)start);
>> +	gfp_t gfp_mask = GFP_KERNEL | __GFP_THISNODE | __GFP_NORETRY | __GFP_NOWARN;
>>
>>  	/*
>>  	 * In order to make remapping routine most efficient for the huge pages,
>> @@ -326,6 +329,20 @@ static int vmemmap_remap_free(unsigned long start, unsigned long end,
>>  	 */
>>  	BUG_ON(start - reuse != PAGE_SIZE);
>>
>> +	/*
>> +	 * Allocate a new head vmemmap page to avoid breaking a contiguous
>> +	 * block of struct page memory when freeing it back to page allocator
>> +	 * in free_vmemmap_page_list(). This will allow the likely contiguous
>> +	 * struct page backing memory to be kept contiguous and allowing for
>> +	 * more allocations of hugepages. Fallback to the currently
>> +	 * mapped head page in case should it fail to allocate.
>> +	 */
>> +	walk.reuse_page = alloc_pages_node(nid, gfp_mask, 0);
>> +	if (walk.reuse_page) {
>> +		copy_page(page_to_virt(walk.reuse_page), walk.reuse_addr);
>> +		list_add(&walk.reuse_page->lru, &vmemmap_pages);
>> +	}
>> +
>>  	mmap_read_lock(&init_mm);
>>  	ret = vmemmap_remap_range(reuse, end, &walk);
>>  	if (ret && walk.nr_walked) {