From: Muchun Song
Date: Mon, 25 Jan 2021 14:40:27 +0800
Subject: Re: [External] Re: [PATCH v13 05/12] mm: hugetlb: allocate the vmemmap pages associated with each HugeTLB page
To: David Rientjes
Cc: Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo@redhat.com,
	bp@alien8.de, x86@kernel.org, hpa@zytor.com,
	dave.hansen@linux.intel.com, luto@kernel.org, Peter Zijlstra,
	viro@zeniv.linux.org.uk, Andrew Morton, paulmck@kernel.org,
	mchehab+huawei@kernel.org, pawan.kumar.gupta@linux.intel.com,
	Randy Dunlap, oneukum@suse.com, anshuman.khandual@arm.com,
	jroedel@suse.de, Mina Almasry, Matthew Wilcox, Oscar Salvador,
	Michal Hocko, "Song Bao Hua (Barry Song)", David Hildenbrand,
	HORIGUCHI NAOYA(堀口 直也), Xiongchun duan,
	linux-doc@vger.kernel.org, LKML, Linux Memory Management List,
	linux-fsdevel
In-Reply-To: <6a68fde-583d-b8bb-a2c8-fbe32e03b@google.com>
References: <20210117151053.24600-1-songmuchun@bytedance.com>
	<20210117151053.24600-6-songmuchun@bytedance.com>
	<6a68fde-583d-b8bb-a2c8-fbe32e03b@google.com>

On Mon, Jan 25, 2021 at 8:05 AM David Rientjes wrote:
>
>
> On Sun, 17 Jan 2021, Muchun Song wrote:
>
> > diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> > index ce4be1fa93c2..3b146d5949f3 100644
> > --- a/mm/sparse-vmemmap.c
> > +++ b/mm/sparse-vmemmap.c
> > @@ -29,6 +29,7 @@
> >  #include <linux/sched.h>
> >  #include <linux/pgtable.h>
> >  #include <linux/bootmem_info.h>
> > +#include <linux/delay.h>
> >
> >  #include <asm/dma.h>
> >  #include <asm/pgalloc.h>
> > @@ -40,7 +41,8 @@
> >   * @remap_pte:		called for each non-empty PTE (lowest-level) entry.
> >   * @reuse_page:		the page which is reused for the tail vmemmap pages.
> >   * @reuse_addr:		the virtual address of the @reuse_page page.
> > - * @vmemmap_pages:	the list head of the vmemmap pages that can be freed.
> > + * @vmemmap_pages:	the list head of the vmemmap pages that can be freed
> > + *			or is mapped from.
> >   */
> >  struct vmemmap_remap_walk {
> >  	void (*remap_pte)(pte_t *pte, unsigned long addr,
> > @@ -50,6 +52,10 @@ struct vmemmap_remap_walk {
> >  	struct list_head *vmemmap_pages;
> >  };
> >
> > +/* The gfp mask of allocating vmemmap page */
> > +#define GFP_VMEMMAP_PAGE	\
> > +	(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN | __GFP_THISNODE)
> > +
>
> This is unnecessary, just use the gfp mask directly in the allocator call.

Will do. Thanks.
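Just to confirm my understanding, the call site would then open-code the
mask, something like:

	page = alloc_pages_node(nid, GFP_KERNEL | __GFP_RETRY_MAYFAIL |
				     __GFP_NOWARN | __GFP_THISNODE, 0);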
>
> >  static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> >  			      unsigned long end,
> >  			      struct vmemmap_remap_walk *walk)
> > @@ -228,6 +234,75 @@ void vmemmap_remap_free(unsigned long start, unsigned long end,
> >  	free_vmemmap_page_list(&vmemmap_pages);
> >  }
> >
> > +static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> > +				struct vmemmap_remap_walk *walk)
> > +{
> > +	pgprot_t pgprot = PAGE_KERNEL;
> > +	struct page *page;
> > +	void *to;
> > +
> > +	BUG_ON(pte_page(*pte) != walk->reuse_page);
> > +
> > +	page = list_first_entry(walk->vmemmap_pages, struct page, lru);
> > +	list_del(&page->lru);
> > +	to = page_to_virt(page);
> > +	copy_page(to, (void *)walk->reuse_addr);
> > +
> > +	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> > +}
> > +
> > +static void alloc_vmemmap_page_list(struct list_head *list,
> > +				    unsigned long start, unsigned long end)
> > +{
> > +	unsigned long addr;
> > +
> > +	for (addr = start; addr < end; addr += PAGE_SIZE) {
> > +		struct page *page;
> > +		int nid = page_to_nid((const void *)addr);
> > +
> > +retry:
> > +		page = alloc_pages_node(nid, GFP_VMEMMAP_PAGE, 0);
> > +		if (unlikely(!page)) {
> > +			msleep(100);
> > +			/*
> > +			 * We should retry infinitely, because we cannot
> > +			 * handle allocation failures. Once we allocate
> > +			 * vmemmap pages successfully, then we can free
> > +			 * a HugeTLB page.
> > +			 */
> > +			goto retry;
>
> Ugh, I don't think this will work: there's no guarantee that we'll ever
> succeed, and now we can't free a 2MB hugepage because we cannot allocate a
> 4KB page. We absolutely have to ensure we make forward progress here.

This allocation can trigger the OOM killer when there is no memory left,
which kills some process to release memory. Right?

>
> We're going to be freeing the hugetlb page after this succeeds, can we
> not use part of the hugetlb page that we're freeing for this memory
> instead?

That seems like a good idea. We can try to allocate the memory first; if the
allocation succeeds, we just use the new pages for the remapping (which can
reduce memory fragmentation). If it fails, we can fall back to using part of
the HugeTLB page itself for the remapping (a rough sketch follows at the end
of this mail). What's your opinion on this?

>
> > +		}
> > +		list_add_tail(&page->lru, list);
> > +	}
> > +}
> > +
> > +/**
> > + * vmemmap_remap_alloc - remap the vmemmap virtual address range [@start, end)
> > + *			 to the page which is from the @vmemmap_pages
> > + *			 respectively.
> > + * @start:	start address of the vmemmap virtual address range.
> > + * @end:	end address of the vmemmap virtual address range.
> > + * @reuse:	reuse address.
> > + */
> > +void vmemmap_remap_alloc(unsigned long start, unsigned long end,
> > +			 unsigned long reuse)
> > +{
> > +	LIST_HEAD(vmemmap_pages);
> > +	struct vmemmap_remap_walk walk = {
> > +		.remap_pte	= vmemmap_restore_pte,
> > +		.reuse_addr	= reuse,
> > +		.vmemmap_pages	= &vmemmap_pages,
> > +	};
> > +
> > +	might_sleep();
> > +
> > +	/* See the comment in the vmemmap_remap_free(). */
> > +	BUG_ON(start - reuse != PAGE_SIZE);
> > +
> > +	alloc_vmemmap_page_list(&vmemmap_pages, start, end);
> > +	vmemmap_remap_range(reuse, end, &walk);
> > +}
> > +
> >  /*
> >   * Allocate a block of memory to be used to back the virtual memory map
> >   * or to back the page tables that are used to create the mapping.
> > --
> > 2.11.0
> >
> >
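As for the fallback to the HugeTLB page itself, below is a rough and
untested sketch of what I have in mind, not a real patch: the @head
parameter and the way the tail pages are handed out are only illustrative,
and it glosses over clearing the compound metadata of the tail pages before
they are remapped:

/*
 * Sketch only: take the vmemmap pages from the buddy allocator when
 * possible, otherwise carve them out of the HugeTLB page @head that is
 * about to be freed. A 2MB HugeTLB page has 511 tail pages, far more
 * than the handful of vmemmap pages we need, so "head + tail" stays in
 * range.
 */
static void alloc_vmemmap_page_list(struct list_head *list,
				    struct page *head,
				    unsigned long start, unsigned long end)
{
	unsigned long addr;
	unsigned long tail = 1;		/* next unused tail page of @head */

	for (addr = start; addr < end; addr += PAGE_SIZE) {
		struct page *page;
		int nid = page_to_nid((const void *)addr);

		/* Try the buddy allocator first to reduce fragmentation. */
		page = alloc_pages_node(nid, GFP_KERNEL | __GFP_RETRY_MAYFAIL |
					     __GFP_NOWARN | __GFP_THISNODE, 0);
		if (unlikely(!page))
			/* Fall back to a tail page of the huge page. */
			page = head + tail++;

		list_add_tail(&page->lru, list);
	}
}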