From: Muchun Song <muchun.song@linux.dev>
To: Joao Martins <joao.m.martins@oracle.com>
Cc: Linux Memory Management List, Muchun Song, Mike Kravetz, Andrew Morton
Subject: Re: [PATCH v3] mm/hugetlb_vmemmap: remap head page to newly allocated page
Date: Thu, 10 Nov 2022 11:28:23 +0800
Message-Id: <903F9F8D-98A0-4114-8BC2-9738B98C8F23@linux.dev>
In-Reply-To: <20221109200623.96867-1-joao.m.martins@oracle.com>
References: <20221109200623.96867-1-joao.m.martins@oracle.com>
> On Nov 10, 2022, at 04:06, Joao Martins wrote:
> 
> Today with `hugetlb_free_vmemmap=on` the struct page memory that is freed
> back to the page allocator is as follows: for a 2M hugetlb page it will reuse
> the first 4K vmemmap page to remap the remaining 7 vmemmap pages, and for a
> 1G hugetlb page it will remap the remaining 4095 vmemmap pages. Essentially,
> that means that it breaks the first 4K of a potentially contiguous chunk of
> memory of 32K (for 2M hugetlb pages) or 16M (for 1G hugetlb pages). For
> this reason the memory that is freed back to the page allocator cannot be
> used by hugetlb to allocate huge pages of the same size, but only of a
> smaller huge page size:
> 
> Trying to assign a 64G node to hugetlb (on a 128G 2-node guest, each node
> having 64G):
> 
> * Before allocation:
> Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
> ...
> Node    0, zone   Normal, type      Movable    340    100     32     15      1      2      0      0      0      1  15558
> 
> $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> 31987
> 
> * After:
> 
> Node    0, zone   Normal, type      Movable  30893  32006  31515      7      0      0      0      0      0      0      0
> 
> Notice how the memory freed back is put back into the 4K / 8K / 16K page
> pools. And it allocates a total of only 31987 pages (63974M).
> 
> To fix this behaviour, rather than remapping the second vmemmap page (thus
> breaking the contiguous block of memory backing the struct pages),
> repopulate the first vmemmap page with a new one.
> We allocate and copy
> from the currently mapped vmemmap page, and then remap it later on.
> The same algorithm works if there's a pre-initialized walk::reuse_page,
> in which case the head page doesn't need to be skipped; instead we remap
> it when the @addr being changed is the @reuse_addr.
> 
> The new head page is allocated in vmemmap_remap_free(), given that on
> restore there's no need for a functional change. Note that, because right
> now one hugepage is remapped at a time, only one free 4K page at a
> time is needed to remap the head page. Should it fail to allocate said
> new page, it reuses the one that's already mapped, just like before. As a
> result, for every 64G of contiguous hugepages it can give back 1G more
> of contiguous memory, while needing in total 128M of new 4K pages
> (for 2M hugetlb) or 256K (for 1G hugetlb).
> 
> After the changes, try to assign a 64G node to hugetlb (on a 128G 2-node
> guest, each node with 64G):
> 
> * Before allocation:
> Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
> ...
> Node    0, zone   Normal, type      Movable      1      1      1      0      0      1      0      0      1      1  15564
> 
> $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
> 32394
> 
> * After:
> 
> Node    0, zone   Normal, type      Movable      0     50     97    108     96     81     70     46     18      0      0
> 
> In the example above, 407 more hugetlb 2M pages are allocated, i.e. 814M
> out of the 32394 (64788M) allocated. So the memory freed back is indeed
> being used by hugetlb again, and there is no massive pile of
> order-0..order-2 pages accumulating unused.
> 
> Signed-off-by: Joao Martins <joao.m.martins@oracle.com>

Thanks.

Reviewed-by: Muchun Song

A nit below.
> ---
> Changes since v2:
> Comments from Muchun:
> * Delete the comment above the tlb flush
> * Move the head vmemmap page copy into vmemmap_remap_free()
> * Add and del the new head page to the vmemmap_pages (to be freed
>   in case of error)
> * Move the remap of the head like the tail pages in vmemmap_remap_pte(),
>   but special-casing only when addr == reuse_addr
> * Remove the PAGE_SIZE alignment check, as the code has the assumption
>   that start/end are page-aligned (and VM_BUG_ON otherwise)
> * Adjust commit message taking the above changes into account
> ---
>  mm/hugetlb_vmemmap.c | 34 +++++++++++++++++++++++++++-------
>  1 file changed, 27 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 7898c2c75e35..f562b3f46410 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -203,12 +203,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>  			return ret;
>  	} while (pgd++, addr = next, addr != end);
> 
> -	/*
> -	 * We only change the mapping of the vmemmap virtual address range
> -	 * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
> -	 * belongs to the range.
> -	 */
> -	flush_tlb_kernel_range(start + PAGE_SIZE, end);
> +	flush_tlb_kernel_range(start, end);
> 
>  	return 0;
>  }
> @@ -244,9 +239,16 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>  	 * to the tail pages.
>  	 */
>  	pgprot_t pgprot = PAGE_KERNEL_RO;
> -	pte_t entry = mk_pte(walk->reuse_page, pgprot);
>  	struct page *page = pte_page(*pte);
> +	pte_t entry;
> 
> +	/* Remapping the head page requires r/w */
> +	if (unlikely(addr == walk->reuse_addr)) {
> +		pgprot = PAGE_KERNEL;
> +		list_del(&walk->reuse_page->lru);

Maybe an smp_wmb() should be inserted here to make sure the copied data is
visible before the set_pte_at(), like commit 939de63d35dde45 does.
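Something along these lines, i.e. an untested sketch on top of this hunk, just to illustrate the placement I have in mind:

```diff
 	/* Remapping the head page requires r/w */
 	if (unlikely(addr == walk->reuse_addr)) {
 		pgprot = PAGE_KERNEL;
 		list_del(&walk->reuse_page->lru);
+
+		/*
+		 * Make sure the copied contents of the head vmemmap page
+		 * are visible before the new writable PTE is installed.
+		 */
+		smp_wmb();
 	}
```

The barrier would pair with the page-contents copy done earlier, ordering it before the set_pte_at() that publishes the new mapping.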