Subject: Re: [PATCH v3] mm/hugetlb_vmemmap: remap head page to newly allocated page
From: Muchun Song
To: Joao Martins
Cc: Linux Memory Management List, Muchun Song, Mike Kravetz, Andrew Morton
Date: Thu, 10 Nov 2022 19:25:15 +0800
Message-Id: <4D316F96-2865-45E1-8413-ECE5EF14488C@linux.dev>
In-Reply-To: <8ab4a36f-b1c2-549a-0a11-693b3e66c5a9@oracle.com>
References: <20221109200623.96867-1-joao.m.martins@oracle.com>
 <903F9F8D-98A0-4114-8BC2-9738B98C8F23@linux.dev>
 <8ab4a36f-b1c2-549a-0a11-693b3e66c5a9@oracle.com>

> On Nov 10, 2022, at 18:10, Joao Martins wrote:
> 
> On 10/11/2022 03:28, Muchun Song wrote:
>>> On Nov 10, 2022, at 04:06, Joao Martins wrote:
>>> 
>>> Today with `hugetlb_free_vmemmap=on` the struct page memory that is freed
>>> back to the page allocator works as follows: for a 2M hugetlb page, the
>>> first 4K vmemmap page is reused to remap the remaining 7 vmemmap pages,
>>> and for a 1G hugetlb page it remaps the remaining 4095 vmemmap pages.
>>> Essentially, that means it breaks the first 4K of a potentially
>>> contiguous chunk of memory of 32K (for 2M hugetlb pages) or 16M (for
>>> 1G hugetlb pages). For this reason, the memory that is freed back to
>>> the page allocator cannot be used by hugetlb to allocate huge pages of
>>> the same size, but only of a smaller huge page size:
>>> 
>>> Trying to assign a 64G node to hugetlb (on a 128G two-node guest, each
>>> node having 64G):
>>> 
>>> * Before allocation:
>>> Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
>>> ...
>>> Node    0, zone   Normal, type      Movable    340    100     32     15      1      2      0      0      0      1  15558
>>> 
>>> $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>> $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>> 31987
>>> 
>>> * After:
>>> 
>>> Node    0, zone   Normal, type      Movable  30893  32006  31515      7      0      0      0      0      0      0      0
>>> 
>>> Notice how the freed memory is put back into the 4K / 8K / 16K page
>>> pools, and only 31987 of the requested 32768 hugepages (63974M) are
>>> allocated.
>>> 
>>> To fix this behaviour, rather than remapping the second vmemmap page
>>> (thus breaking the contiguous block of memory backing the struct pages),
>>> repopulate the first vmemmap page with a new one. We allocate and copy
>>> from the currently mapped vmemmap page, and then remap it later on.
>>> The same algorithm works if there's a pre-initialized walk::reuse_page;
>>> then the head page doesn't need to be skipped, and instead we remap it
>>> when the @addr being changed is the @reuse_addr.
>>> 
>>> The new head page is allocated in vmemmap_remap_free(), given that on
>>> restore there's no need for a functional change. Note that, because one
>>> hugepage is remapped at a time, only one free 4K page at a time is
>>> needed to remap the head page. Should it fail to allocate said new
>>> page, it reuses the one that's already mapped, just like before. As a
>>> result, for every 64G of contiguous hugepages it can give back 1G more
>>> of contiguous memory, while needing in total 128M of new 4K pages
>>> (for 2M hugetlb) or 256K (for 1G hugetlb).
>>> 
>>> After the changes, try to assign a 64G node to hugetlb (on a 128G
>>> two-node guest, each node with 64G):
>>> 
>>> * Before allocation:
>>> Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
>>> ...
>>> Node    0, zone   Normal, type      Movable      1      1      1      0      0      1      0      0      1      1  15564
>>> 
>>> $ echo 32768 > /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>> $ cat /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages
>>> 32394
>>> 
>>> * After:
>>> 
>>> Node    0, zone   Normal, type      Movable      0     50     97    108     96     81     70     46     18      0      0
>>> 
>>> In the example above, 407 more 2M hugetlb pages are allocated, i.e. 814M
>>> out of the 32394 (64788M) allocated. So the memory freed back is indeed
>>> being reused by hugetlb, and no mass of unused order-0..order-2 pages
>>> accumulates.
>>> 
>>> Signed-off-by: Joao Martins
>> 
>> Thanks.
>> 
>> Reviewed-by: Muchun Song
>> 
>> A nit below.
>> 
> Thanks
> 
>>> ---
>>> Changes since v2:
>>> Comments from Muchun:
>>> * Delete the comment above the TLB flush
>>> * Move the head vmemmap page copy into vmemmap_remap_free()
>>> * Add and del the new head page to/from vmemmap_pages (to be freed
>>>   in case of error)
>>> * Move the remap of the head like the tail pages in vmemmap_remap_pte(),
>>>   special-casing only when addr == reuse_addr
>>> * Remove the PAGE_SIZE alignment check, as the code assumes that
>>>   start/end are page-aligned (and VM_BUG_ONs otherwise)
>>> * Adjust the commit message to take the above changes into account.
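To make the arithmetic behind the 7-of-8 and 4095-of-4096 vmemmap figures
above easy to check, here is a minimal standalone sketch; the 4K base page
and 64-byte struct page sizes are assumptions implied by the 32K/16M numbers
in the commit message, not values read from the kernel:

#include <stdio.h>

#define BASE_PAGE_SIZE		4096UL	/* assumed 4K base pages */
#define STRUCT_PAGE_SIZE	64UL	/* assumed sizeof(struct page) */

static void tally(const char *name, unsigned long hugepage_size)
{
	unsigned long nr_struct_pages = hugepage_size / BASE_PAGE_SIZE;
	unsigned long vmemmap_bytes = nr_struct_pages * STRUCT_PAGE_SIZE;
	unsigned long vmemmap_pages = vmemmap_bytes / BASE_PAGE_SIZE;

	/* Pre-patch: the head vmemmap page stays mapped, the tails are freed. */
	printf("%s: %lu vmemmap pages, %lu freed, head breaks a %luK run\n",
	       name, vmemmap_pages, vmemmap_pages - 1, vmemmap_bytes >> 10);
}

int main(void)
{
	tally("2M hugetlb", 2UL << 20);	/* 8 pages: reuse 1, remap/free 7 */
	tally("1G hugetlb", 1UL << 30);	/* 4096 pages: reuse 1, remap 4095 */
	return 0;
}

With the patch, the head mapping is moved to a freshly allocated page as
well, so the full 32K (or 16M) vmemmap run can be handed back to the page
allocator contiguously.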
>>> ---
>>>  mm/hugetlb_vmemmap.c | 34 +++++++++++++++++++++++++++-------
>>>  1 file changed, 27 insertions(+), 7 deletions(-)
>>> 
>>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>>> index 7898c2c75e35..f562b3f46410 100644
>>> --- a/mm/hugetlb_vmemmap.c
>>> +++ b/mm/hugetlb_vmemmap.c
>>> @@ -203,12 +203,7 @@ static int vmemmap_remap_range(unsigned long start, unsigned long end,
>>>  			return ret;
>>>  	} while (pgd++, addr = next, addr != end);
>>> 
>>> -	/*
>>> -	 * We only change the mapping of the vmemmap virtual address range
>>> -	 * [@start + PAGE_SIZE, end), so we only need to flush the TLB which
>>> -	 * belongs to the range.
>>> -	 */
>>> -	flush_tlb_kernel_range(start + PAGE_SIZE, end);
>>> +	flush_tlb_kernel_range(start, end);
>>> 
>>>  	return 0;
>>>  }
>>> @@ -244,9 +239,16 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>>>  	 * to the tail pages.
>>>  	 */
>>>  	pgprot_t pgprot = PAGE_KERNEL_RO;
>>> -	pte_t entry = mk_pte(walk->reuse_page, pgprot);
>>>  	struct page *page = pte_page(*pte);
>>> +	pte_t entry;
>>> 
>>> +	/* Remapping the head page requires r/w */
>>> +	if (unlikely(addr == walk->reuse_addr)) {
>>> +		pgprot = PAGE_KERNEL;
>>> +		list_del(&walk->reuse_page->lru);
>> 
>> Maybe an smp_wmb() should be inserted here to make sure the copied data
>> is visible before the set_pte_at(), like commit 939de63d35dde45 does.
>> 
> 
> I've added the barrier, with a comment above it, since it is not
> immediately obvious where the copy takes place. See the snip below for
> what I added in v4:
> 
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index f562b3f46410..45e93a545dd7 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -246,6 +246,13 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>  	if (unlikely(addr == walk->reuse_addr)) {
>  		pgprot = PAGE_KERNEL;
>  		list_del(&walk->reuse_page->lru);
> +
> +		/*
> +		 * Makes sure that preceding stores to the page contents from
> +		 * vmemmap_remap_free() become visible before the set_pte_at()
> +		 * write.
> +		 */
> +		smp_wmb();
>  	}
> 
>  	entry = mk_pte(walk->reuse_page, pgprot);

Makes sense to me.
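As a condensed view of the ordering the barrier enforces, the sketch below
collapses into one function what the patch really does in two places (the
copy in vmemmap_remap_free(), the barrier and PTE write in
vmemmap_remap_pte()); it is a paraphrase of the v4 flow above, not the
kernel code itself:

/*
 * Sketch only: in the real patch the copy happens earlier, in
 * vmemmap_remap_free(), and the barrier + PTE write later, in
 * vmemmap_remap_pte(); they are shown together to make the required
 * store ordering visible in one place.
 */
static void remap_head_page_sketch(struct vmemmap_remap_walk *walk,
				   unsigned long addr, pte_t *pte)
{
	/* 1. Fill the replacement head page with plain stores. */
	copy_page(page_to_virt(walk->reuse_page), (void *)walk->reuse_addr);

	/*
	 * 2. Order the stores above before the PTE store that publishes
	 *    the new page, so no CPU can observe the new mapping while
	 *    the page contents are still stale.
	 */
	smp_wmb();
	set_pte_at(&init_mm, addr, pte, mk_pte(walk->reuse_page, PAGE_KERNEL));
}

This is the same store-then-publish pairing the thread cites from commit
939de63d35dde45.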