From mboxrd@z Thu Jan 1 00:00:00 1970
From: Muchun Song
Date: Fri, 11 Feb 2022 15:54:36 +0800
Subject: Re: [PATCH v5 4/5] mm/sparse-vmemmap: improve memory savings for compound devmaps
To: Joao Martins
Cc: Linux Memory Management List, Dan Williams, Vishal Verma, Matthew Wilcox,
 Jason Gunthorpe, Jane Chu, Mike Kravetz, Andrew Morton, Jonathan Corbet,
 Christoph Hellwig, nvdimm@lists.linux.dev, Linux Doc Mailing List
In-Reply-To: <20220210193345.23628-5-joao.m.martins@oracle.com>
References: <20220210193345.23628-1-joao.m.martins@oracle.com> <20220210193345.23628-5-joao.m.martins@oracle.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Fri, Feb 11, 2022 at 3:34 AM Joao Martins wrote:
[...]
>  pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
> -                                       struct vmem_altmap *altmap)
> +                                       struct vmem_altmap *altmap,
> +                                       struct page *block)

Why not use the name "reuse" instead of "block"? "reuse" seems clearer.

>  {
>          pte_t *pte = pte_offset_kernel(pmd, addr);
>          if (pte_none(*pte)) {
>                  pte_t entry;
>                  void *p;
>
> -                p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> -                if (!p)
> -                        return NULL;
> +                if (!block) {
> +                        p = vmemmap_alloc_block_buf(PAGE_SIZE, node, altmap);
> +                        if (!p)
> +                                return NULL;
> +                } else {
> +                        /*
> +                         * When a PTE/PMD entry is freed from the init_mm
> +                         * there's a a free_pages() call to this page allocated
> +                         * above. Thus this get_page() is paired with the
> +                         * put_page_testzero() on the freeing path.
> +                         * This can only called by certain ZONE_DEVICE path,
> +                         * and through vmemmap_populate_compound_pages() when
> +                         * slab is available.
> +                         */
> +                        get_page(block);
> +                        p = page_to_virt(block);
> +                }
>                  entry = pfn_pte(__pa(p) >> PAGE_SHIFT, PAGE_KERNEL);
>                  set_pte_at(&init_mm, addr, pte, entry);
>          }
> @@ -609,7 +624,8 @@ pgd_t * __meminit vmemmap_pgd_populate(unsigned long addr, int node)
>  }
>
>  static int __meminit vmemmap_populate_address(unsigned long addr, int node,
> -                                              struct vmem_altmap *altmap)
> +                                              struct vmem_altmap *altmap,
> +                                              struct page *reuse, struct page **page)

We could remove the last argument (struct page **page) by changing the
return type to "pte_t *". Simpler, don't you think?
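
To make that concrete, here is a rough, untested sketch of what I mean
(assuming the pgd/p4d/pud steps keep using the existing vmemmap_*_populate()
helpers in mm/sparse-vmemmap.c, and that a NULL return simply replaces
today's -ENOMEM, since that is the only error this path can see):

static pte_t * __meminit vmemmap_populate_address(unsigned long addr, int node,
                                                  struct vmem_altmap *altmap,
                                                  struct page *reuse)
{
        pgd_t *pgd;
        p4d_t *p4d;
        pud_t *pud;
        pmd_t *pmd;
        pte_t *pte;

        pgd = vmemmap_pgd_populate(addr, node);
        if (!pgd)
                return NULL;
        p4d = vmemmap_p4d_populate(pgd, addr, node);
        if (!p4d)
                return NULL;
        pud = vmemmap_pud_populate(p4d, addr, node);
        if (!pud)
                return NULL;
        pmd = vmemmap_pmd_populate(pud, addr, node);
        if (!pmd)
                return NULL;
        pte = vmemmap_pte_populate(pmd, addr, node, altmap, reuse);
        if (!pte)
                return NULL;
        vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);

        /* Callers that need the backing page can use pte_page(*pte). */
        return pte;
}

Then vmemmap_populate_basepages() and vmemmap_populate_range() only need to
check for NULL, and the one caller that wants the page can derive it with
pte_page() itself.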
>  {
>          pgd_t *pgd;
>          p4d_t *p4d;
> @@ -629,11 +645,13 @@ static int __meminit vmemmap_populate_address(unsigned long addr, int node,
>          pmd = vmemmap_pmd_populate(pud, addr, node);
>          if (!pmd)
>                  return -ENOMEM;
> -        pte = vmemmap_pte_populate(pmd, addr, node, altmap);
> +        pte = vmemmap_pte_populate(pmd, addr, node, altmap, reuse);
>          if (!pte)
>                  return -ENOMEM;
>          vmemmap_verify(pte, node, addr, addr + PAGE_SIZE);
>
> +        if (page)
> +                *page = pte_page(*pte);
>          return 0;
>  }
>
> @@ -644,10 +662,120 @@ int __meminit vmemmap_populate_basepages(unsigned long start, unsigned long end,
>          int rc;
>
>          for (; addr < end; addr += PAGE_SIZE) {
> -                rc = vmemmap_populate_address(addr, node, altmap);
> +                rc = vmemmap_populate_address(addr, node, altmap, NULL, NULL);
>                  if (rc)
>                          return rc;
> +        }
> +
> +        return 0;
> +}
> +
> +static int __meminit vmemmap_populate_range(unsigned long start,
> +                                            unsigned long end,
> +                                            int node, struct page *page)
> +{
> +        unsigned long addr = start;
> +        int rc;
>
> +        for (; addr < end; addr += PAGE_SIZE) {
> +                rc = vmemmap_populate_address(addr, node, NULL, page, NULL);
> +                if (rc)
> +                        return rc;
> +        }
> +
> +        return 0;
> +}
> +
> +static inline int __meminit vmemmap_populate_page(unsigned long addr, int node,
> +                                                  struct page **page)
> +{
> +        return vmemmap_populate_address(addr, node, NULL, NULL, page);
> +}
> +
> +/*
> + * For compound pages bigger than section size (e.g. x86 1G compound
> + * pages with 2M subsection size) fill the rest of sections as tail
> + * pages.
> + *
> + * Note that memremap_pages() resets @nr_range value and will increment
> + * it after each range successful onlining. Thus the value or @nr_range
> + * at section memmap populate corresponds to the in-progress range
> + * being onlined here.
> + */
> +static bool __meminit reuse_compound_section(unsigned long start_pfn,
> +                                             struct dev_pagemap *pgmap)
> +{
> +        unsigned long nr_pages = pgmap_vmemmap_nr(pgmap);
> +        unsigned long offset = start_pfn -
> +                PHYS_PFN(pgmap->ranges[pgmap->nr_range].start);
>
> +        return !IS_ALIGNED(offset, nr_pages) && nr_pages > PAGES_PER_SUBSECTION;
> +}
> +
> +static struct page * __meminit compound_section_tail_page(unsigned long addr)
> +{
> +        pte_t *ptep;
>
> +        addr -= PAGE_SIZE;
>
> +        /*
> +         * Assuming sections are populated sequentially, the previous section's
> +         * page data can be reused.
> +         */
> +        ptep = pte_offset_kernel(pmd_off_k(addr), addr);
> +        if (!ptep)
> +                return NULL;
> +
> +        return pte_page(*ptep);
> +}
> +
> +static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
> +                                                     unsigned long start,
> +                                                     unsigned long end, int node,
> +                                                     struct dev_pagemap *pgmap)
> +{
> +        unsigned long size, addr;
> +
> +        if (reuse_compound_section(start_pfn, pgmap)) {
> +                struct page *page;
> +
> +                page = compound_section_tail_page(start);
> +                if (!page)
> +                        return -ENOMEM;
> +
> +                /*
> +                 * Reuse the page that was populated in the prior iteration
> +                 * with just tail struct pages.
> +                 */
> +                return vmemmap_populate_range(start, end, node, page);
> +        }
> +
> +        size = min(end - start, pgmap_vmemmap_nr(pgmap) * sizeof(struct page));
> +        for (addr = start; addr < end; addr += size) {
> +                unsigned long next = addr, last = addr + size;
> +                struct page *block;
> +                int rc;
> +
> +                /* Populate the head page vmemmap page */
> +                rc = vmemmap_populate_page(addr, node, NULL);
> +                if (rc)
> +                        return rc;
> +
> +                /* Populate the tail pages vmemmap page */
> +                block = NULL;
> +                next = addr + PAGE_SIZE;
> +                rc = vmemmap_populate_page(next, node, &block);
> +                if (rc)
> +                        return rc;
> +
> +                /*
> +                 * Reuse the previous page for the rest of tail pages
> +                 * See layout diagram in Documentation/vm/vmemmap_dedup.rst
> +                 */
> +                next += PAGE_SIZE;
> +                rc = vmemmap_populate_range(next, last, node, block);
> +                if (rc)
> +                        return rc;
>          }
>
>          return 0;
> @@ -659,12 +787,18 @@ struct page * __meminit __populate_section_memmap(unsigned long pfn,
>  {
>          unsigned long start = (unsigned long) pfn_to_page(pfn);
>          unsigned long end = start + nr_pages * sizeof(struct page);
> +        int r;
>
>          if (WARN_ON_ONCE(!IS_ALIGNED(pfn, PAGES_PER_SUBSECTION) ||
>                  !IS_ALIGNED(nr_pages, PAGES_PER_SUBSECTION)))
>                  return NULL;
>
> -        if (vmemmap_populate(start, end, nid, altmap))
> +        if (pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)

Should we add a check like "is_power_of_2(sizeof(struct page))", since this
optimization only applies when struct page entries do not cross page
boundaries?

Thanks.
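
P.S. For concreteness, the kind of check I have in mind would look something
like the below. This is only a rough sketch: I am assuming the new branch
ends up calling vmemmap_populate_compound_pages() and otherwise falls back to
vmemmap_populate(), as the rest of the patch suggests, and that
is_power_of_2() from <linux/log2.h> is used for the test.

        /*
         * Only take the deduplication path when a struct page never
         * straddles a vmemmap page: since PAGE_SIZE is a power of two,
         * is_power_of_2(sizeof(struct page)) guarantees that PAGE_SIZE
         * is an exact multiple of sizeof(struct page).
         */
        if (is_power_of_2(sizeof(struct page)) &&
            pgmap && pgmap_vmemmap_nr(pgmap) > 1 && !altmap)
                r = vmemmap_populate_compound_pages(pfn, start, end, nid, pgmap);
        else
                r = vmemmap_populate(start, end, nid, altmap);

        if (r < 0)
                return NULL;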