From: Dan Williams
Date: Wed, 5 May 2021 18:18:18 -0700
Subject: Re: [PATCH v1 07/11] mm/sparse-vmemmap: populate compound pagemaps
To: Joao Martins
Cc: Linux MM, Ira Weiny, linux-nvdimm, Matthew Wilcox, Jason Gunthorpe,
 Jane Chu, Muchun Song, Mike Kravetz, Andrew Morton
References: <20210325230938.30752-1-joao.m.martins@oracle.com>
 <20210325230938.30752-8-joao.m.martins@oracle.com>
In-Reply-To: <20210325230938.30752-8-joao.m.martins@oracle.com>

On Thu, Mar 25, 2021 at 4:10 PM Joao Martins wrote:
>
> A compound pagemap is a dev_pagemap with @align > PAGE_SIZE and it
> means that pages are mapped at a given huge page alignment and utilize
> uses compound pages as opposed to order-0 pages.
>
> To minimize struct page overhead we take advantage of the fact that

I'm not sure if Andrew is a member of the "we" police, but I am on
team "imperative tense".

"Take advantage of the fact that most tail pages look the same
(except the first two) to minimize struct page overhead."

...I think all the other "we"s below can be dropped without losing
any meaning.

> most tail pages look the same (except the first two). We allocate a
> separate page for the vmemmap area which contains the head page and
> separate for the next 64 pages. The rest of the subsections then reuse
> this tail vmemmap page to initialize the rest of the tail pages.
>
> Sections are arch-dependent (e.g. on x86 it's 64M, 128M or 512M) and
> when initializing compound pagemap with big enough @align (e.g. 1G
> PUD) it will cross various sections. To be able to reuse tail pages
> across sections belonging to the same gigantic page we fetch the
> @range being mapped (nr_ranges + 1). If the section being mapped is
> not offset 0 of the @align, then lookup the PFN of the struct page
> address that preceeds it and use that to populate the entire

s/preceeds/precedes/

> section.
>
> On compound pagemaps with 2M align, this lets mechanism saves 6 pages

s/lets mechanism saves 6 pages/this mechanism lets 6 pages be saved/

> out of the 8 necessary PFNs necessary to set the subsection's 512
> struct pages being mapped. On a 1G compound pagemap it saves
> 4094 pages.
>
> Altmap isn't supported yet, given various restrictions in altmap pfn
> allocator, thus fallback to the already in use vmemmap_populate().

Perhaps also note that altmap for devmap mappings was there to relieve
the pressure of inordinate amounts of memmap space to map terabytes of
PMEM. With compound pages the motivation for altmaps for PMEM is
reduced.
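For what it's worth, the savings numbers in the changelog check out
for me, assuming the usual 64-byte struct page; might be worth
spelling the arithmetic out in the changelog as well:

    2M @align: (2M / 4K) = 512 struct pages * 64B = 32K
               => 8 vmemmap pages, kept: 1 head page + 1 shared tail
               page => 6 pages saved
    1G @align: (1G / 4K) = 262144 struct pages * 64B = 16M
               => 4096 vmemmap pages, kept: 1 head page + 1 shared
               tail page => 4094 pages saved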
>
> Signed-off-by: Joao Martins
> ---
>  include/linux/mm.h  |   2 +-
>  mm/memremap.c       |   1 +
>  mm/sparse-vmemmap.c | 139 ++++++++++++++++++++++++++++++++++++++++----
>  3 files changed, 131 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 61474602c2b1..49d717ae40ae 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3040,7 +3040,7 @@ p4d_t *vmemmap_p4d_populate(pgd_t *pgd, unsigned long addr, int node);
>  pud_t *vmemmap_pud_populate(p4d_t *p4d, unsigned long addr, int node);
>  pmd_t *vmemmap_pmd_populate(pud_t *pud, unsigned long addr, int node);
>  pte_t *vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
> -                           struct vmem_altmap *altmap);
> +                           struct vmem_altmap *altmap, void *block);
>  void *vmemmap_alloc_block(unsigned long size, int node);
>  struct vmem_altmap;
>  void *vmemmap_alloc_block_buf(unsigned long size, int node,
> diff --git a/mm/memremap.c b/mm/memremap.c
> index d160853670c4..2e6bc0b1ff00 100644
> --- a/mm/memremap.c
> +++ b/mm/memremap.c
> @@ -345,6 +345,7 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid)
>  {
>         struct mhp_params params = {
>                 .altmap = pgmap_altmap(pgmap),
> +               .pgmap = pgmap,
>                 .pgprot = PAGE_KERNEL,
>         };
>         const int nr_range = pgmap->nr_range;
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index 8814876edcfa..f57c5eada099 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -141,16 +141,20 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
>  }
>
>  pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
> -                                      struct vmem_altmap *altmap)
> +                                      struct vmem_altmap *altmap, void *block)

For type-safety I would make this argument a 'struct page *' and drop
the virt_to_page().

>  {
>         pte_t *pte = pte_offset_kernel(pmd, addr);
>         if (pte_none(*pte)) {
>                 pte_t entry;
> -               void *p;
> -
> -               p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> -               if (!p)
> -                       return NULL;
> +               void *p = block;
> +
> +               if (!block) {
> +                       p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> +                       if (!p)
> +                               return NULL;
> +               } else if (!altmap) {
> +                       get_page(virt_to_page(block));

This is either super subtle, or there is a missing put_page(). I'm
assuming that in the shutdown path the same page will be freed
multiple times because the same page is mapped multiple times. Have
you validated that the accounting is correct and all allocated pages
get freed? I feel this needs more than a comment, it needs a test to
validate that the accounting continues to happen correctly as future
vmemmap changes land.
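To tie the two comments above together, something along these lines
is roughly what I'm picturing (untested sketch, keeps your !altmap
special case; 'reuse' is just an illustrative name):

pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
                                       struct vmem_altmap *altmap,
                                       struct page *reuse)
{
        pte_t *pte = pte_offset_kernel(pmd, addr);
        if (pte_none(*pte)) {
                pte_t entry;
                void *p;

                if (!reuse) {
                        p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
                        if (!p)
                                return NULL;
                } else {
                        /*
                         * Every pte mapping the shared tail page takes an
                         * extra reference; the teardown path would need a
                         * matching put_page() per mapping for the
                         * accounting to balance.
                         */
                        if (!altmap)
                                get_page(reuse);
                        p = page_to_virt(reuse);
                }
                entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
                set_pte_at(&init_mm, addr, pte, entry);
        }
        return pte;
}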
> +               }
>                 entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
>                 set_pte_at(&init_mm, addr, pte, entry);
>         }
> @@ -217,7 +221,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>  }
>
>  static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> -                                             struct vmem_altmap *altmap)
> +                                             struct vmem_altmap *altmap,
> +                                             void *page, void **ptr)
>  {
>         pgd_t *pgd;
>         p4d_t *p4d;
> @@ -237,10 +242,14 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>         pmd = vmemmap_pmd_populate(pud, addr, node);
>         if (!pmd)
>                 return -ENOMEM;
> -       pte = vmemmap_pte_populate(pmd, addr, node, altmap);
> +       pte = vmemmap_pte_populate(pmd, addr, node, altmap, page);
>         if (!pte)
>                 return -ENOMEM;
>         vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
> +
> +       if (ptr)
> +               *ptr = __va(__pfn_to_phys(pte_pfn(*pte)));
> +
>         return 0;
>  }
>
>  int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
> @@ -249,7 +258,110 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>         unsigned long addr = start;
>
>         for (; addr < end; addr += PAGE_SIZE) {
> -               if (vmemmap_populate_address(addr, node, altmap))
> +               if (vmemmap_populate_address(addr, node, altmap, NULL, NULL))
> +                       return -ENOMEM;
> +       }
> +
> +       return 0;
> +}
> +
> +static int __meminit vmemmap_populate_range(unsigned long start,
> +                                           unsigned long end,
> +                                           int node, void *page)
> +{
> +       unsigned long addr = start;
> +
> +       for (; addr < end; addr += PAGE_SIZE) {
> +               if (vmemmap_populate_address(addr, node, NULL, page, NULL))
> +                       return -ENOMEM;
> +       }
> +
> +       return 0;
> +}
> +
> +static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
> +                                                 void **ptr)
> +{
> +       return vmemmap_populate_address(addr, node, NULL, NULL, ptr);
> +}
> +
> +static pte_t * __meminit vmemmap_lookup_address(unsigned long addr)

I think this can be replaced with a call to follow_pte(&init_mm...).

> +{
> +       pgd_t *pgd;
> +       p4d_t *p4d;
> +       pud_t *pud;
> +       pmd_t *pmd;
> +       pte_t *pte;
> +
> +       pgd = pgd_offset_k(addr);
> +       if (pgd_none(*pgd))
> +               return NULL;
> +
> +       p4d = p4d_offset(pgd, addr);
> +       if (p4d_none(*p4d))
> +               return NULL;
> +
> +       pud = pud_offset(p4d, addr);
> +       if (pud_none(*pud))
> +               return NULL;
> +
> +       pmd = pmd_offset(pud, addr);
> +       if (pmd_none(*pmd))
> +               return NULL;
> +
> +       pte = pte_offset_kernel(pmd, addr);
> +       if (pte_none(*pte))
> +               return NULL;
> +
> +       return pte;
> +}
> +
> +static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
> +                                                     unsigned long start,
> +                                                     unsigned long end, int node,
> +                                                     struct dev_pagemap *pgmap)
> +{
> +       unsigned long offset, size, addr;
> +
> +       /*
> +        * For compound pages bigger than section size (e.g. 1G) fill the rest

Oh, I had to read this a couple times because I thought the "e.g."
was referring to section size. How about:

s/(e.g. 1G)/(e.g. x86 1G compound pages with 2M subsection size)/

> +        * of sections as tail pages.
> +        *
> +        * Note that memremap_pages() resets @nr_range value and will increment
> +        * it after each range successful onlining. Thus the value or @nr_range
> +        * at section memmap populate corresponds to the in-progress range
> +        * being onlined that we care about.
> +        */
> +       offset = PFN_PHYS(start_pfn) - pgmap->ranges[pgmap->nr_range].start;
> +       if (!IS_ALIGNED(offset, pgmap_align(pgmap)) &&
> +           pgmap_align(pgmap) > SUBSECTION_SIZE) {
> +               pte_t *ptep = vmemmap_lookup_address(start - PAGE_SIZE);
> +
> +               if (!ptep)
> +                       return -ENOMEM;

Side note: I had been resistant to adding 'struct page' for altmap
pages, but that would be a requirement to enable compound pages +
altmap.

> +

Perhaps a comment here to the effect "recall the page that was
allocated in a prior iteration and fill it in with tail pages".

> +               return vmemmap_populate_range(start, end, node,
> +                                             page_to_virt(pte_page(*ptep)));

If the @block parameter changes to a 'struct page *' it also saves
some typing here.

> +       }
> +
> +       size = min(end - start, pgmap_pfn_align(pgmap) * sizeof(struct page));
> +       for (addr = start; addr < end; addr += size) {
> +               unsigned long next = addr, last = addr + size;
> +               void *block;
> +
> +               /* Populate the head page vmemmap page */
> +               if (vmemmap_populate_page(addr, node, NULL))
> +                       return -ENOMEM;
> +
> +               /* Populate the tail pages vmemmap page */
> +               block = NULL;
> +               next = addr + PAGE_SIZE;
> +               if (vmemmap_populate_page(next, node, &block))
> +                       return -ENOMEM;
> +
> +               /* Reuse the previous page for the rest of tail pages */
> +               next += PAGE_SIZE;
> +               if (vmemmap_populate_range(next, last, node, block))
>                         return -ENOMEM;
>         }
>
> @@ -262,12 +374,19 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>  {
>         unsigned long start = (unsigned long) pfn_to_page(pfn);
>         unsigned long end = start + nr_pages * sizeof(struct page);
> +       unsigned int align = pgmap_align(pgmap);
> +       int r;
>
>         if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
>                          !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>                 return NULL;
>
> -       if (vmemmap_populate(start, end, nid, altmap))
> +       if (align > PAGE_SIZE && !altmap)

I also think this code will read better with the devmap_geometry enum.

> +               r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
> +       else
> +               r = vmemmap_populate(start, end, nid, altmap);
> +
> +       if (r < 0)
>                 return NULL;
>
>         return pfn_to_page(pfn);
> --
> 2.17.1
>
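Expanding on the follow_pte() comment above, the open-coded
vmemmap_lookup_address() could potentially collapse to something like
the below (untested, and assuming follow_pte(&init_mm, ...) is happy
walking kernel mappings; the locking and error handling may need
tweaking):

        pte_t *ptep;
        spinlock_t *ptl;
        struct page *reuse;

        if (follow_pte(&init_mm, start - PAGE_SIZE, &ptep, &ptl))
                return -ENOMEM;

        /* recall the tail page populated for the preceding section */
        reuse = pte_page(*ptep);
        pte_unmap_unlock(ptep, ptl);

        return vmemmap_populate_range(start, end, node, page_to_virt(reuse));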