From: Barry Song <21cnbao@gmail.com>
Date: Fri, 24 Sep 2021 10:20:54 +1200
Subject: Re: [PATCH v3 1/4] mm: hugetlb: free the 2nd vmemmap page associated with each HugeTLB page
To: Muchun Song
Cc: Mike Kravetz, Andrew Morton, Oscar Salvador, Michal Hocko, Barry Song,
	David Hildenbrand, Chen Huang, "Bodeddula, Balasubramaniam",
	Jonathan Corbet, Matthew Wilcox, Xiongchun duan, fam.zheng@bytedance.com,
	Muchun Song, Qi Zheng, linux-doc@vger.kernel.org, LKML, Linux-MM
In-Reply-To: <20210922102411.34494-2-songmuchun@bytedance.com>
References: <20210922102411.34494-1-songmuchun@bytedance.com> <20210922102411.34494-2-songmuchun@bytedance.com>

On Wed, Sep 22, 2021 at 10:27 PM Muchun Song wrote:
>
> Currently, we only free 6 vmemmap pages associated with a 2MB HugeTLB
> page. However, we can remap all tail vmemmap pages to the page frame
> mapped to with the head vmemmap page. Finally, we can free 7 vmemmap
> pages for a 2MB HugeTLB page. It is a fine gain (e.g. we can save
> extra 2GB memory when there is 1TB HugeTLB pages in the system
> compared with the current implementation).
>
> But the head vmemmap page is not freed to the buddy allocator and all
> tail vmemmap pages are mapped to the head vmemmap page frame. So we
> can see more than one struct page struct with PG_head (e.g. 8 per 2 MB
> HugeTLB page) associated with each HugeTLB page. We should adjust
> compound_head() to make it return the real head struct page when the
> parameter is the tail struct page but with PG_head flag.
>
> Signed-off-by: Muchun Song
> ---
>  Documentation/admin-guide/kernel-parameters.txt |  2 +-
>  include/linux/page-flags.h                      | 77 +++++++++++++++++++++++--
>  mm/hugetlb_vmemmap.c                            | 60 ++++++++++---------
>  mm/sparse-vmemmap.c                             | 21 +++++++
>  4 files changed, 128 insertions(+), 32 deletions(-)
>
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 91ba391f9b32..5aaf2f271980 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -1617,7 +1617,7 @@
>  			[KNL] Requires CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
>  			enabled.
>  			Allows heavy hugetlb users to free up some more
> -			memory (6 * PAGE_SIZE for each 2MB hugetlb page).
> +			memory (7 * PAGE_SIZE for each 2MB hugetlb page).
>  			Format: { on | off (default) }
>
>  			on:  enable the feature
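The claimed saving checks out for me. A quick sanity check of the
"extra 2GB per 1TB" figure (my own numbers below; I am assuming 4KiB
base pages and a 64-byte struct page, neither of which is spelled out
in the patch):

#include <stdio.h>

int main(void)
{
	unsigned long long hugetlb_bytes = 1ULL << 40;             /* 1TiB of 2MiB HugeTLB pages */
	unsigned long long nr_huge = hugetlb_bytes / (2ULL << 20); /* 524288 huge pages */
	/* v3 frees one extra vmemmap page (the 2nd) per huge page
	 * compared with the old 6-page scheme, i.e. 4KiB more each. */
	unsigned long long extra = nr_huge * 4096;

	printf("extra saving: %lluGiB\n", extra >> 30);            /* prints 2 */
	return 0;
}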
>
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index a68af80649a4..b47a7f51d2c3 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -184,13 +184,68 @@ enum pageflags {
>
>  #ifndef __GENERATING_BOUNDS_H
>
> +#ifdef CONFIG_HUGETLB_PAGE_FREE_VMEMMAP
> +extern bool hugetlb_free_vmemmap_enabled;
> +
> +/*
> + * If the feature of freeing some vmemmap pages associated with each HugeTLB
> + * page is enabled, the head vmemmap page frame is reused and all of the tail
> + * vmemmap addresses map to the head vmemmap page frame (further details can
> + * be found in the figure at the head of mm/hugetlb_vmemmap.c). In other
> + * words, there is more than one page struct with PG_head associated with each
> + * HugeTLB page. We __know__ that there is only one head page struct, the tail
> + * page structs with PG_head are fake head page structs. We need an approach
> + * to distinguish between those two different types of page structs so that
> + * compound_head() can return the real head page struct when the parameter is
> + * the tail page struct but with PG_head.
> + *
> + * page_fixed_fake_head() returns the real head page struct if the @page is a
> + * fake page head; otherwise it returns @page, which can be either a true page
> + * head or a tail.
> + */
> +static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
> +{
> +	if (!hugetlb_free_vmemmap_enabled)
> +		return page;
> +	/*
> +	 * Only addresses aligned with PAGE_SIZE of struct page may be fake head
> +	 * struct page. The alignment check aims to avoid accessing the fields
> +	 * (e.g. compound_head) of the @page[1]. It can avoid touching a
> +	 * (possibly) cold cacheline in some cases.
> +	 */
> +	if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
> +	    test_bit(PG_head, &page->flags)) {
> +		/*
> +		 * We can safely access the field of the @page[1] with PG_head
> +		 * because the @page is a compound page composed of at least
> +		 * two contiguous pages.
> +		 */
> +		unsigned long head = READ_ONCE(page[1].compound_head);
> +
> +		if (likely(head & 1))
> +			return (const struct page *)(head - 1);
> +	}
> +	return page;
> +}
> +#else
> +static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
> +{
> +	return page;
> +}
> +#endif
> +
> +static __always_inline int page_is_fake_head(struct page *page)
> +{
> +	return page_fixed_fake_head(page) != page;
> +}
> +
>  static inline unsigned long _compound_head(const struct page *page)
>  {
>  	unsigned long head = READ_ONCE(page->compound_head);
>
>  	if (unlikely(head & 1))
>  		return head - 1;
> -	return (unsigned long)page;
> +	return (unsigned long)page_fixed_fake_head(page);
>  }
>
>  #define compound_head(page)	((typeof(page))_compound_head(page))
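The trick of reading page[1].compound_head from a (possibly fake) head
looks correct to me. To convince myself, here is a tiny userspace model
I put together (toy code with my own names, not the kernel's structures,
and it skips the PAGE_SIZE alignment fast path):

#include <stdio.h>
#include <stdbool.h>

/* Toy model: tail pages store (head | 1) in compound_head, as the kernel does. */
struct toy_page {
	bool pg_head;
	unsigned long compound_head;
};

/* Mirrors page_fixed_fake_head(): a fake head aliases the head page
 * frame, so its page[1] is always a genuine tail whose compound_head
 * can be trusted to locate the real head. */
static const struct toy_page *fixed_fake_head(const struct toy_page *page)
{
	if (page->pg_head) {
		unsigned long head = page[1].compound_head;

		if (head & 1)
			return (const struct toy_page *)(head - 1);
	}
	return page;
}

int main(void)
{
	struct toy_page p[8] = { { .pg_head = true } };
	int i;

	for (i = 1; i < 8; i++)
		p[i].compound_head = (unsigned long)&p[0] | 1;

	/* The real head resolves to itself via p[1]; a fake head aliases
	 * the very same bytes, so it resolves the same way. */
	printf("%d\n", fixed_fake_head(&p[0]) == &p[0]); /* prints 1 */
	return 0;
}

The point being: whether @page is the one real head or one of the
aliased fake heads, its page[1] bytes belong to a genuine tail, so the
lookup always lands on the real head.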
>
> @@ -225,12 +280,13 @@ static inline unsigned long _compound_head(const struct page *page)
>
>  static __always_inline int PageTail(struct page *page)
>  {
> -	return READ_ONCE(page->compound_head) & 1;
> +	return READ_ONCE(page->compound_head) & 1 || page_is_fake_head(page);
>  }
>
>  static __always_inline int PageCompound(struct page *page)
>  {
> -	return test_bit(PG_head, &page->flags) || PageTail(page);
> +	return test_bit(PG_head, &page->flags) ||
> +	       READ_ONCE(page->compound_head) & 1;
>  }
>
>  #define PAGE_POISON_PATTERN	-1l
> @@ -675,7 +731,20 @@ static inline bool test_set_page_writeback(struct page *page)
>  	return set_page_writeback(page);
>  }
>
> -__PAGEFLAG(Head, head, PF_ANY) CLEARPAGEFLAG(Head, head, PF_ANY)
> +static __always_inline bool folio_test_head(struct folio *folio)
> +{
> +	return test_bit(PG_head, folio_flags(folio, FOLIO_PF_ANY));
> +}
> +
> +static __always_inline int PageHead(struct page *page)
> +{
> +	PF_POISONED_CHECK(page);
> +	return test_bit(PG_head, &page->flags) && !page_is_fake_head(page);
> +}
> +
> +__SETPAGEFLAG(Head, head, PF_ANY)
> +__CLEARPAGEFLAG(Head, head, PF_ANY)
> +CLEARPAGEFLAG(Head, head, PF_ANY)
>
>  /* Whether there are one or multiple pages in a folio */
>  static inline bool folio_test_single(struct folio *folio)
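Not an objection, just writing down the semantics I end up with after
this hunk, to make sure I read it right (my own summary table, not from
the patch):

                           PageHead()  PageTail()  PageCompound()
  real head                    1           0            1
  fake head (aliased tail)     0           1            1
  real tail                    0           1            1

So fake heads are now reported as tails, PageHead() fires only on the
one real head, and PageCompound() re-checks the compound_head bit
directly instead of calling PageTail(), presumably because the PG_head
test already covers the fake heads.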
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index c540c21e26f5..527bcaa44a48 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -124,9 +124,9 @@
>   * page of page structs (page 0) associated with the HugeTLB page contains the 4
>   * page structs necessary to describe the HugeTLB. The only use of the remaining
>   * pages of page structs (page 1 to page 7) is to point to page->compound_head.
> - * Therefore, we can remap pages 2 to 7 to page 1. Only 2 pages of page structs
> + * Therefore, we can remap pages 1 to 7 to page 0. Only 1 pages of page structs

s/1 pages/1 page/

>   * will be used for each HugeTLB page. This will allow us to free the remaining
> - * 6 pages to the buddy allocator.
> + * 7 pages to the buddy allocator.
>   *
>   * Here is how things look after remapping.
>   *
> @@ -134,30 +134,30 @@
>   * +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
>   * |           |                     |     0     | -------------> |     0     |
>   * |           |                     +-----------+                +-----------+
> - * |           |                     |     1     | -------------> |     1     |
> - * |           |                     +-----------+                +-----------+
> - * |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
> - * |           |                     +-----------+                   | | | | |
> - * |           |                     |     3     | ------------------+ | | | |
> - * |           |                     +-----------+                     | | | |
> - * |           |                     |     4     | --------------------+ | | |
> - * |    PMD    |                     +-----------+                       | | |
> - * |   level   |                     |     5     | ----------------------+ | |
> - * |  mapping  |                     +-----------+                         | |
> - * |           |                     |     6     | ------------------------+ |
> - * |           |                     +-----------+                           |
> - * |           |                     |     7     | --------------------------+
> + * |           |                     |     1     | ---------------^ ^ ^ ^ ^ ^ ^
> + * |           |                     +-----------+                  | | | | | |
> + * |           |                     |     2     | -----------------+ | | | | |
> + * |           |                     +-----------+                    | | | | |
> + * |           |                     |     3     | -------------------+ | | | |
> + * |           |                     +-----------+                      | | | |
> + * |           |                     |     4     | ---------------------+ | | |
> + * |    PMD    |                     +-----------+                        | | |
> + * |   level   |                     |     5     | -----------------------+ | |
> + * |  mapping  |                     +-----------+                          | |
> + * |           |                     |     6     | -------------------------+ |
> + * |           |                     +-----------+                            |
> + * |           |                     |     7     | ---------------------------+
>   * |           |                     +-----------+
>   * |           |
>   * |           |
>   * |           |
>   * +-----------+
>   *
> - * When a HugeTLB is freed to the buddy system, we should allocate 6 pages for
> + * When a HugeTLB is freed to the buddy system, we should allocate 7 pages for
>   * vmemmap pages and restore the previous mapping relationship.
>   *
>   * For the HugeTLB page of the pud level mapping. It is similar to the former.
> - * We also can use this approach to free (PAGE_SIZE - 2) vmemmap pages.
> + * We also can use this approach to free (PAGE_SIZE - 1) vmemmap pages.
>   *
>   * Apart from the HugeTLB page of the pmd/pud level mapping, some architectures
>   * (e.g. aarch64) provides a contiguous bit in the translation table entries
> @@ -166,7 +166,13 @@
>   *
>   * The contiguous bit is used to increase the mapping size at the pmd and pte
>   * (last) level. So this type of HugeTLB page can be optimized only when its
> - * size of the struct page structs is greater than 2 pages.
> + * size of the struct page structs is greater than 1 pages.

s/1 pages/1 page/

> + *
> + * Notice: The head vmemmap page is not freed to the buddy allocator and all
> + * tail vmemmap pages are mapped to the head vmemmap page frame. So we can see
> + * more than one struct page struct with PG_head (e.g. 8 per 2 MB HugeTLB page)
> + * associated with each HugeTLB page. The compound_head() can handle this
> + * correctly (more details refer to the comment above compound_head()).
>   */
>  #define pr_fmt(fmt)	"HugeTLB: " fmt
>
> @@ -175,14 +181,16 @@
>  /*
>   * There are a lot of struct page structures associated with each HugeTLB page.
>   * For tail pages, the value of compound_head is the same. So we can reuse first
> - * page of tail page structures. We map the virtual addresses of the remaining
> - * pages of tail page structures to the first tail page struct, and then free
> - * these page frames. Therefore, we need to reserve two pages as vmemmap areas.
> + * page of head page structures. We map the virtual addresses of all the pages
> + * of tail page structures to the head page struct, and then free these page
> + * frames. Therefore, we need to reserve one pages as vmemmap areas.
>   */
> -#define RESERVE_VMEMMAP_NR		2U
> +#define RESERVE_VMEMMAP_NR		1U
>  #define RESERVE_VMEMMAP_SIZE		(RESERVE_VMEMMAP_NR << PAGE_SHIFT)
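For the record, the new constant matches the arithmetic in the comments
above; a rough cross-check (again my own assumptions: 4KiB base pages
and a 64-byte struct page):

#include <stdio.h>

#define PAGE_SHIFT		12
#define PAGE_SIZE		(1UL << PAGE_SHIFT)
#define RESERVE_VMEMMAP_NR	1UL	/* the new value from this patch */

/* Mirrors the accounting in hugetlb_vmemmap_init(): total vmemmap
 * pages for one huge page, minus the single reused head page. */
static unsigned long freed_vmemmap_pages(unsigned long huge_size)
{
	unsigned long nr_pages = huge_size / PAGE_SIZE;
	unsigned long vmemmap_pages = (nr_pages * 64) >> PAGE_SHIFT;

	return vmemmap_pages - RESERVE_VMEMMAP_NR;
}

int main(void)
{
	printf("%lu\n", freed_vmemmap_pages(2UL << 20)); /* pmd, 2MiB: 7 */
	printf("%lu\n", freed_vmemmap_pages(1UL << 30)); /* pud, 1GiB: 4095, i.e. PAGE_SIZE - 1 */
	return 0;
}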
>
> -bool hugetlb_free_vmemmap_enabled = IS_ENABLED(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON);
> +bool hugetlb_free_vmemmap_enabled __read_mostly =
> +	IS_ENABLED(CONFIG_HUGETLB_PAGE_FREE_VMEMMAP_DEFAULT_ON);
> +EXPORT_SYMBOL(hugetlb_free_vmemmap_enabled);
>
>  static int __init early_hugetlb_free_vmemmap_param(char *buf)
>  {
> @@ -236,7 +244,6 @@ int alloc_huge_page_vmemmap(struct hstate *h, struct page *head)
>  	 */
>  	ret = vmemmap_remap_alloc(vmemmap_addr, vmemmap_end, vmemmap_reuse,
>  				  GFP_KERNEL | __GFP_NORETRY | __GFP_THISNODE);
> -
>  	if (!ret)
>  		ClearHPageVmemmapOptimized(head);
>
> @@ -282,9 +289,8 @@ void __init hugetlb_vmemmap_init(struct hstate *h)
>
>  	vmemmap_pages = (nr_pages * sizeof(struct page)) >> PAGE_SHIFT;
>  	/*
> -	 * The head page and the first tail page are not to be freed to buddy
> -	 * allocator, the other pages will map to the first tail page, so they
> -	 * can be freed.
> +	 * The head page is not to be freed to buddy allocator, the other tail
> +	 * pages will map to the head page, so they can be freed.
>  	 *
>  	 * Could RESERVE_VMEMMAP_NR be greater than @vmemmap_pages? It is true
>  	 * on some architectures (e.g. aarch64). See Documentation/arm64/
> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
> index bdce883f9286..62e3d20648ce 100644
> --- a/mm/sparse-vmemmap.c
> +++ b/mm/sparse-vmemmap.c
> @@ -53,6 +53,17 @@ struct vmemmap_remap_walk {
>  	struct list_head *vmemmap_pages;
>  };
>
> +/*
> + * How many struct page structs need to be reset. When we reuse the head
> + * struct page, the special metadata (e.g. page->flags or page->mapping)
> + * cannot be copied to the tail struct page structs. The invalid value will
> + * be checked in free_tail_pages_check(). In order to avoid the message of
> + * "corrupted mapping in tail page", we need to reset at least 3 (one head
> + * struct page struct and two tail struct page structs) struct page structs.
> + */
> +#define NR_RESET_STRUCT_PAGE	3
> +
>  static int split_vmemmap_huge_pmd(pmd_t *pmd, unsigned long start,
>  				  struct vmemmap_remap_walk *walk)
>  {
> @@ -245,6 +256,15 @@ static void vmemmap_remap_pte(pte_t *pte, unsigned long addr,
>  	set_pte_at(&init_mm, addr, pte, entry);
>  }
>
> +static inline void reset_struct_pages(struct page *start)
> +{
> +	int i;
> +	struct page *from = start + NR_RESET_STRUCT_PAGE;
> +
> +	for (i = 0; i < NR_RESET_STRUCT_PAGE; i++)
> +		memcpy(start + i, from, sizeof(*from));
> +}
> +
>  static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
>  				struct vmemmap_remap_walk *walk)
>  {
> @@ -258,6 +278,7 @@ static void vmemmap_restore_pte(pte_t *pte, unsigned long addr,
>  	list_del(&page->lru);
>  	to = page_to_virt(page);
>  	copy_page(to, (void *)walk->reuse_addr);
> +	reset_struct_pages(to);

Is this done unconditionally? Do we get here even when
hugetlb_free_vmemmap_enabled is false?

>
>  	set_pte_at(&init_mm, addr, pte, mk_pte(page, pgprot));
>  }
> --
> 2.11.0
>

Thanks
barry