From: Muchun Song
Date: Mon, 25 Jan 2021 17:34:09 +0800
Subject: Re: [External] Re: [PATCH v13 05/12] mm: hugetlb: allocate the vmemmap pages associated with each HugeTLB page
To: David Hildenbrand
Cc: David Rientjes, Jonathan Corbet, Mike Kravetz, Thomas Gleixner, mingo@redhat.com,
 bp@alien8.de, x86@kernel.org, hpa@zytor.com, dave.hansen@linux.intel.com,
 luto@kernel.org, Peter Zijlstra, viro@zeniv.linux.org.uk, Andrew Morton,
 paulmck@kernel.org, mchehab+huawei@kernel.org, pawan.kumar.gupta@linux.intel.com,
 Randy Dunlap, oneukum@suse.com, anshuman.khandual@arm.com, jroedel@suse.de,
 Mina Almasry, Matthew Wilcox, Oscar Salvador, Michal Hocko,
 "Song Bao Hua (Barry Song)", HORIGUCHI NAOYA (堀口 直也), Xiongchun duan,
 linux-doc@vger.kernel.org, LKML, Linux Memory Management List, linux-fsdevel
In-Reply-To: <552e8214-bc6f-8d90-0ed8-b3aff75d0e47@redhat.com>
References: <20210117151053.24600-1-songmuchun@bytedance.com>
 <20210117151053.24600-6-songmuchun@bytedance.com>
 <6a68fde-583d-b8bb-a2c8-fbe32e03b@google.com>
 <552e8214-bc6f-8d90-0ed8-b3aff75d0e47@redhat.com>

On Mon, Jan 25, 2021 at 5:15 PM David Hildenbrand wrote:
>
> On 25.01.21 08:41, Muchun Song wrote:
> > On Mon, Jan 25, 2021 at 2:40 PM Muchun Song wrote:
> >>
> >> On Mon, Jan 25, 2021 at 8:05 AM David Rientjes wrote:
> >>>
> >>> On Sun, 17 Jan 2021, Muchun Song wrote:
> >>>
> >>>> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> >>>> index ce4be1fa93c2..3b146d5949f3 100644
> >>>> --- a/mm/sparse-vmemmap.c
> >>>> +++ b/mm/sparse-vmemmap.c
> >>>> @@ -29,6 +29,7 @@
> >>>>  #include
> >>>>  #include
> >>>>  #include
> >>>> +#include
> >>>>
> >>>>  #include
> >>>>  #include
> >>>> @@ -40,7 +41,8 @@
> >>>>   * @remap_pte: called for each non-empty PTE (lowest-level) entry.
> >>>>   * @reuse_page: the page which is reused for the tail vmemmap pages.
> >>>>   * @reuse_addr: the virtual address of the @reuse_page page.
> >>>> - * @vmemmap_pages: the list head of the vmemmap pages that can be freed.
> >>>> + * @vmemmap_pages: the list head of the vmemmap pages that can be freed
> >>>> + *                 or is mapped from.
> >>>>   */
> >>>>  struct vmemmap_remap_walk {
> >>>>  	void (*remap_pte)(pte_t *pte, unsigned long addr,
> >>>> @@ -50,6 +52,10 @@ struct vmemmap_remap_walk {
> >>>>  	struct list_head *vmemmap_pages;
> >>>>  };
> >>>>
> >>>> +/* The gfp mask of allocating vmemmap page */
> >>>> +#define GFP_VMEMMAP_PAGE	\
> >>>> +	(GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN | __GFP_THISNODE)
> >>>> +
> >>>
> >>> This is unnecessary, just use the gfp mask directly in the allocator.
> >>
> >> Will do. Thanks.
> >>
> >>>
> >>>>  static void vmemmap_pte_range(pmd_t *pmd, unsigned long addr,
> >>>>  			      unsigned long end,
> >>>>  			      struct vmemmap_remap_walk *walk)
> >>>> @@ -228,6 +234,75 @@ void vmemmap_remap_free(unsigned long start, unsigned long end,
> >>>>  	free_vmemmap_page_list(&vmemmap_pages);
> >>>>  }
> >>>>
> >>>> +static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
> >>>> +				struct vmemmap_remap_walk *walk)
> >>>> +{
> >>>> +	pgprot_t pgprot = PAGE_KERNEL;
> >>>> +	struct page *page;
> >>>> +	void *to;
> >>>> +
> >>>> +	BUG_ON(pte_page(*pte) != walk->reuse_page);
> >>>> +
> >>>> +	page = list_first_entry(walk->vmemmap_pages, struct page, lru);
> >>>> +	list_del(&page->lru);
> >>>> +	to = page_to_virt(page);
> >>>> +	copy_page(to, (void *)walk->reuse_addr);
> >>>> +
> >>>> +	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
> >>>> +}
> >>>> +
> >>>> +static void alloc_vmemmap_page_list(struct list_head *list,
> >>>> +				    unsigned long start, unsigned long end)
> >>>> +{
> >>>> +	unsigned long addr;
> >>>> +
> >>>> +	for (addr = start; addr < end; addr += PAGE_SIZE) {
> >>>> +		struct page *page;
> >>>> +		int nid = page_to_nid((const void *)addr);
> >>>> +
> >>>> +retry:
> >>>> +		page = alloc_pages_node(nid, GFP_VMEMMAP_PAGE, 0);
> >>>> +		if (unlikely(!page)) {
> >>>> +			msleep(100);
> >>>> +			/*
> >>>> +			 * We should retry infinitely, because we cannot
> >>>> +			 * handle allocation failures. Once we allocate
> >>>> +			 * vmemmap pages successfully, then we can free
> >>>> +			 * a HugeTLB page.
> >>>> +			 */
> >>>> +			goto retry;
> >>>
> >>> Ugh, I don't think this will work, there's no guarantee that we'll ever
> >>> succeed and now we can't free a 2MB hugepage because we cannot allocate a
> >>> 4KB page. We absolutely have to ensure we make forward progress here.
> >>
> >> This can trigger an OOM when there is no memory and kill someone to release
> >> some memory. Right?
> >>
> >>>
> >>> We're going to be freeing the hugetlb page after this succeeds, can we
> >>> not use part of the hugetlb page that we're freeing for this memory
> >>> instead?
> >>
> >> It seems a good idea. We can try to allocate memory first; if successful,
> >> just use the new page to remap (it can reduce memory fragmentation).
> >> If not, we can use part of the hugetlb page to remap. What's your opinion
> >> about this?
> >
> > If the HugeTLB page is a gigantic page which is allocated from
> > CMA, in this case we cannot use part of the hugetlb page to remap.
> > Right?
>
> Right; and I don't think the "reuse part of a huge page as vmemmap while
> freeing, while that part itself might not have a proper vmemmap yet (or
> might cover itself now)" is particularly straightforward. Maybe I'm
> wrong :)
>
> Also, watch out for huge pages on ZONE_MOVABLE; in that case you also
> shouldn't allocate the vmemmap from there ...

Yeah, you are right. So when the allocation fails, I tend to trigger an OOM
to kill other processes and reclaim some memory.

>
> --
> Thanks,
>
> David / dhildenb
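For what it's worth, the "allocate first, fall back to pages carved out of the HugeTLB page being freed" scheme discussed above can be sketched as a userspace model. This is only an illustration, not kernel code: `model_alloc_page()`, `alloc_vmemmap_or_fallback()`, `MODEL_PAGE_SIZE`, `MAX_RETRIES`, and `fail_first` are all made-up names, `malloc()` stands in for `alloc_pages_node()`, and the bounded retry count replaces the unbounded `goto retry` loop that the review objects to (and sidesteps the CMA/ZONE_MOVABLE caveats, which the model does not represent).

```c
/*
 * Userspace model of the fallback allocation scheme discussed in the
 * thread. All names are hypothetical; malloc() stands in for
 * alloc_pages_node(), and the huge page is modeled as a flat buffer.
 */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096
#define MAX_RETRIES 3

/* Stand-in for alloc_pages_node(); fails 'fail_first' times in a row. */
static int fail_first;

static void *model_alloc_page(void)
{
	if (fail_first > 0) {
		fail_first--;
		return NULL;
	}
	return malloc(MODEL_PAGE_SIZE);
}

/*
 * Allocate 'n' vmemmap pages, retrying each a bounded number of times;
 * when retries are exhausted, hand out the next chunk of the huge page
 * itself (hugepage_base), so forward progress is always made.
 * Returns how many pages came from the fallback region.
 */
static int alloc_vmemmap_or_fallback(void **pages, int n, char *hugepage_base)
{
	int from_fallback = 0;

	for (int i = 0; i < n; i++) {
		void *p = NULL;

		for (int r = 0; r < MAX_RETRIES && !p; r++)
			p = model_alloc_page();
		if (!p) {
			/* Fallback: carve a page out of the huge page. */
			p = hugepage_base +
			    (size_t)from_fallback * MODEL_PAGE_SIZE;
			from_fallback++;
		}
		pages[i] = p;
	}
	return from_fallback;
}
```

With `fail_first = 5` and `MAX_RETRIES = 3`, the first page exhausts its three retries and falls back to the huge page, after which fresh allocations succeed; with `fail_first = 0`, no fallback pages are used at all.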
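The copy-and-remap step that vmemmap_restore_pte() performs in the patch (take a pre-allocated page off the list, copy_page() the shared reuse page into it, then set_pte_at() the private copy) can likewise be modeled in userspace. Again purely illustrative: `restore_one()` and `MPAGE` are invented names, a plain pointer slot stands in for the PTE, and `malloc()` stands in for the page taken off `walk->vmemmap_pages`.

```c
/*
 * Userspace model of vmemmap_restore_pte(): give each "pte" (modeled
 * as a pointer slot) its own page, initialized from the shared reuse
 * page. Names are illustrative, not kernel API.
 */
#include <stdlib.h>
#include <string.h>

#define MPAGE 4096

static void restore_one(char **pte, const char *reuse_page)
{
	/* Stand-in for list_first_entry() + list_del() on the page list. */
	char *page = malloc(MPAGE);

	/* copy_page(to, (void *)walk->reuse_addr) */
	memcpy(page, reuse_page, MPAGE);

	/* set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot)) */
	*pte = page;
}
```

After restoring, each former alias of the reuse page points at a distinct private copy with identical contents, which is exactly the invariant the real remap relies on before the HugeTLB page can be freed.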