From: Christophe Leroy <christophe.leroy@csgroup.eu>
To: Frank van der Linden <fvdl@google.com>, akpm@linux-foundation.org, muchun.song@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: yuzhao@google.com, usama.arif@bytedance.com, joao.m.martins@oracle.com, roman.gushchin@linux.dev, Madhavan Srinivasan, Michael Ellerman, linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH 27/27] mm/hugetlb: enable bootmem allocation from CMA areas
Date: Tue, 28 Jan 2025 09:55:37 +0100
In-Reply-To: <20250127232207.3888640-28-fvdl@google.com>
References: <20250127232207.3888640-1-fvdl@google.com> <20250127232207.3888640-28-fvdl@google.com>

On 28/01/2025 at 00:22, Frank van der Linden wrote:
> If hugetlb_cma_only is enabled, we know that hugetlb pages
> can only be allocated from CMA. Now that there is an interface
> to do early reservations from a CMA area (returning memblock
> memory), it can be used to allocate hugetlb pages from CMA.
>
> This also allows for doing pre-HVO on these pages (if enabled).
>
> Make sure to initialize the page structures and associated data
> correctly. Create a flag to signal that a hugetlb page has been
> allocated from CMA to make things a little easier.
>
> Some configurations of powerpc have a special hugetlb bootmem
> allocator, so introduce a boolean arch_specific_huge_bootmem_alloc
> that returns true if such an allocator is present. In that case,
> CMA bootmem allocations can't be used, so check that function
> before trying.
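(Side note for context: if I follow the series correctly, this early CMA
path only triggers when CMA is both reserved for hugetlb and made
mandatory on the kernel command line, e.g. something like:

	hugepagesz=1G hugepages=16 hugetlb_cma=16G hugetlb_cma_only

where hugetlb_cma_only is the option introduced earlier in this series.)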
>
> Cc: Madhavan Srinivasan
> Cc: Michael Ellerman
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Frank van der Linden
> ---
>  arch/powerpc/mm/hugetlbpage.c |   5 ++
>  include/linux/hugetlb.h       |   7 ++
>  mm/hugetlb.c                  | 135 +++++++++++++++++++++++++---------
>  3 files changed, 114 insertions(+), 33 deletions(-)
>
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index d3c1b749dcfc..e53e4b4c8ef6 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -121,6 +121,11 @@ bool __init hugetlb_node_alloc_supported(void)
>  {
>  	return false;
>  }
> +
> +bool __init arch_specific_huge_bootmem_alloc(struct hstate *h)
> +{
> +	return (firmware_has_feature(FW_FEATURE_LPAR) && !radix_enabled());
> +}
>  #endif
>
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 2512463bca49..bca3052fb175 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -591,6 +591,7 @@ enum hugetlb_page_flags {
>  	HPG_freed,
>  	HPG_vmemmap_optimized,
>  	HPG_raw_hwp_unreliable,
> +	HPG_cma,
>  	__NR_HPAGEFLAGS,
>  };
>
> @@ -650,6 +651,7 @@ HPAGEFLAG(Temporary, temporary)
>  HPAGEFLAG(Freed, freed)
>  HPAGEFLAG(VmemmapOptimized, vmemmap_optimized)
>  HPAGEFLAG(RawHwpUnreliable, raw_hwp_unreliable)
> +HPAGEFLAG(Cma, cma)
>
>  #ifdef CONFIG_HUGETLB_PAGE
>
> @@ -678,14 +680,18 @@ struct hstate {
>  	char name[HSTATE_NAME_LEN];
>  };
>
> +struct cma;
> +
>  struct huge_bootmem_page {
>  	struct list_head list;
>  	struct hstate *hstate;
>  	unsigned long flags;
> +	struct cma *cma;
>  };
>
>  #define HUGE_BOOTMEM_HVO		0x0001
>  #define HUGE_BOOTMEM_ZONES_VALID	0x0002
> +#define HUGE_BOOTMEM_CMA		0x0004
>
>  bool hugetlb_bootmem_page_zones_valid(int nid, struct huge_bootmem_page *m);
>
> @@ -711,6 +717,7 @@ bool __init hugetlb_node_alloc_supported(void);
>
>  void __init hugetlb_add_hstate(unsigned order);
>  bool __init arch_hugetlb_valid_size(unsigned long size);
> +bool __init arch_specific_huge_bootmem_alloc(struct hstate *h);

Why add 'specific' to the name? Prefixing a function name with arch_ is
usually enough to denote an architecture-specific function.

>  struct hstate *size_to_hstate(unsigned long size);
>
>  #ifndef HUGE_MAX_HSTATE
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 32ebde9039e2..183e8d0c2fb4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -61,7 +61,7 @@ static struct cma *hugetlb_cma[MAX_NUMNODES];
>  static unsigned long hugetlb_cma_size_in_node[MAX_NUMNODES] __initdata;
>  #endif
>  static bool hugetlb_cma_only;
> -static unsigned long hugetlb_cma_size __initdata;
> +static unsigned long hugetlb_cma_size;

Why remove __initdata? As far as I can see it is only used in
hugetlb_early_cma() and hugetlb_hstate_alloc_pages(), which are __init
functions.

>
>  __initdata struct list_head huge_boot_pages[MAX_NUMNODES];
>  __initdata unsigned long hstate_boot_nrinvalid[HUGE_MAX_HSTATE];
> @@ -132,8 +132,10 @@ static void hugetlb_free_folio(struct folio *folio)
>  #ifdef CONFIG_CMA
>  	int nid = folio_nid(folio);
>
> -	if (cma_free_folio(hugetlb_cma[nid], folio))
> +	if (folio_test_hugetlb_cma(folio)) {
> +		WARN_ON(!cma_free_folio(hugetlb_cma[nid], folio));

Is that WARN_ON() needed? See
https://docs.kernel.org/process/coding-style.html#do-not-crash-the-kernel
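If the goal is only to flag an unexpected free failure, something
gentler might do, for instance (rough, untested sketch):

	if (folio_test_hugetlb_cma(folio)) {
		/* A failed CMA free here is unexpected but not fatal. */
		if (!cma_free_folio(hugetlb_cma[nid], folio))
			pr_warn_once("HugeTLB: CMA free failed for pfn %lx\n",
				     folio_pfn(folio));
		return;
	}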
>  		return;
> +	}
>  #endif
>  	folio_put(folio);
>  }
> @@ -1509,6 +1511,9 @@ static struct folio *alloc_gigantic_folio(struct hstate *h, gfp_t gfp_mask,
>  				break;
>  			}
>  		}
> +
> +		if (folio)
> +			folio_set_hugetlb_cma(folio);
>  	}
>  #endif
>  	if (!folio) {
> @@ -3175,6 +3180,63 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  	return ERR_PTR(-ENOSPC);
>  }
>
> +/*
> + * Some architectures do their own bootmem allocation, so they can't use
> + * early CMA allocation. So, allow for this function to be redefined.
> + */
> +bool __init __attribute((weak))

Can't you use __weak?

By the way, do we really need a weak function here? Can't it be a
static inline helper that gets overridden by a macro defined by the
arch, something like:

#ifndef arch_huge_bootmem_alloc
static inline bool arch_huge_bootmem_alloc(struct hstate *h)
{
	return false;
}
#endif

Then powerpc does:

#define arch_huge_bootmem_alloc arch_huge_bootmem_alloc
static inline bool arch_huge_bootmem_alloc(struct hstate *h)
{
	return (firmware_has_feature(FW_FEATURE_LPAR) && !radix_enabled());
}

But why is the struct hstate *h parameter needed? It seems no one uses
it.

> +arch_specific_huge_bootmem_alloc(struct hstate *h)
> +{
> +	return false;
> +}
> +
> +static bool __init hugetlb_early_cma(struct hstate *h)
> +{
> +	if (arch_specific_huge_bootmem_alloc(h))
> +		return false;
> +
> +	return (hstate_is_gigantic(h) && hugetlb_cma_size && hugetlb_cma_only);
> +}
> +
> +static __init void *alloc_bootmem(struct hstate *h, int nid)
> +{
> +	struct huge_bootmem_page *m;
> +	unsigned long flags;
> +	struct cma *cma;
> +
> +#ifdef CONFIG_CMA

#ifdefs in C files should be avoided, see
https://docs.kernel.org/process/coding-style.html#conditional-compilation

> +	if (hugetlb_early_cma(h)) {
> +		flags = HUGE_BOOTMEM_CMA;
> +		cma = hugetlb_cma[nid];
> +		m = cma_reserve_early(cma, huge_page_size(h));
> +	} else
> +#endif

This kind of if/else construct is ugly and should be avoided.
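Maybe something like the following instead (rough, untested sketch,
assuming hugetlb_early_cma() compiles to false and hugetlb_cma[] /
cma_reserve_early() get stubs when CONFIG_CMA is not set):

	if (hugetlb_early_cma(h)) {
		flags = HUGE_BOOTMEM_CMA;
		cma = hugetlb_cma[nid];
		m = cma_reserve_early(cma, huge_page_size(h));
	} else {
		flags = 0;
		cma = NULL;
		m = memblock_alloc_try_nid_raw(huge_page_size(h),
				huge_page_size(h), 0,
				MEMBLOCK_ALLOC_ACCESSIBLE, nid);
	}

That way the compiler still sees both branches and the dead CMA branch
is optimised out, without any #ifdef in the function body.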
> +	{
> +		flags = 0;
> +		cma = NULL;
> +		m = memblock_alloc_try_nid_raw(huge_page_size(h),
> +			huge_page_size(h), 0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
> +	}
> +
> +	if (m) {
> +		/*
> +		 * Use the beginning of the huge page to store the
> +		 * huge_bootmem_page struct (until gather_bootmem
> +		 * puts them into the mem_map).
> +		 *
> +		 * Put them into a private list first because mem_map
> +		 * is not up yet.
> +		 */
> +		INIT_LIST_HEAD(&m->list);
> +		list_add(&m->list, &huge_boot_pages[nid]);
> +		m->hstate = h;
> +		m->flags = flags;
> +		m->cma = cma;
> +	}
> +
> +	return m;
> +}
> +
>  int alloc_bootmem_huge_page(struct hstate *h, int nid)
>  	__attribute__ ((weak, alias("__alloc_bootmem_huge_page")));
>  int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> @@ -3184,17 +3246,14 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>
>  	/* do node specific alloc */
>  	if (nid != NUMA_NO_NODE) {
> -		m = memblock_alloc_try_nid_raw(huge_page_size(h), huge_page_size(h),
> -				0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
> +		m = alloc_bootmem(h, node);
>  		if (!m)
>  			return 0;
>  		goto found;
>  	}
>  	/* allocate from next node when distributing huge pages */
>  	for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_ONLINE]) {
> -		m = memblock_alloc_try_nid_raw(
> -				huge_page_size(h), huge_page_size(h),
> -				0, MEMBLOCK_ALLOC_ACCESSIBLE, node);
> +		m = alloc_bootmem(h, node);
>  		if (m)
>  			break;
>  	}
> @@ -3203,7 +3262,6 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>  		return 0;
>
>  found:
> -
>  	/*
>  	 * Only initialize the head struct page in memmap_init_reserved_pages,
>  	 * rest of the struct pages will be initialized by the HugeTLB
> @@ -3213,18 +3271,6 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
>  	 */
>  	memblock_reserved_mark_noinit(virt_to_phys((void *)m + PAGE_SIZE),
>  			huge_page_size(h) - PAGE_SIZE);
> -	/*
> -	 * Use the beginning of the huge page to store the
> -	 * huge_bootmem_page struct (until gather_bootmem
> -	 * puts them into the mem_map).
> -	 *
> -	 * Put them into a private list first because mem_map
> -	 * is not up yet.
> -	 */
> -	INIT_LIST_HEAD(&m->list);
> -	list_add(&m->list, &huge_boot_pages[node]);
> -	m->hstate = h;
> -	m->flags = 0;
>  	return 1;
>  }
>
> @@ -3265,13 +3311,25 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
>  	prep_compound_head((struct page *)folio, huge_page_order(h));
>  }
>
> +static bool __init hugetlb_bootmem_page_prehvo(struct huge_bootmem_page *m)
> +{
> +	return (m->flags & HUGE_BOOTMEM_HVO);

Parentheses are superfluous.

> +}
> +
> +static bool __init hugetlb_bootmem_page_earlycma(struct huge_bootmem_page *m)
> +{
> +	return (m->flags & HUGE_BOOTMEM_CMA);

Parentheses are superfluous.

> +}
> +
>  /*
>   * memblock-allocated pageblocks might not have the migrate type set
>   * if marked with the 'noinit' flag. Set it to the default (MIGRATE_MOVABLE)
> - * here.
> + * here, or MIGRATE_CMA if this was a page allocated through an early CMA
> + * reservation.
>   *
> - * Note that this will not write the page struct, it is ok (and necessary)
> - * to do this on vmemmap optimized folios.
> + * In case of vmemmap optimized folios, the tail vmemmap pages are mapped
> + * read-only, but that's ok - for sparse vmemmap this does not write to
> + * the page structure.
>   */
>  static void __init hugetlb_bootmem_init_migratetype(struct folio *folio,
>  		struct hstate *h)
> @@ -3280,9 +3338,13 @@ static void __init hugetlb_bootmem_init_migratetype(struct folio *folio,
>
>  	WARN_ON_ONCE(!pageblock_aligned(folio_pfn(folio)));
>
> -	for (i = 0; i < nr_pages; i += pageblock_nr_pages)
> -		set_pageblock_migratetype(folio_page(folio, i),
> +	for (i = 0; i < nr_pages; i += pageblock_nr_pages) {
> +		if (folio_test_hugetlb_cma(folio))
> +			init_cma_pageblock(folio_page(folio, i));
> +		else
> +			set_pageblock_migratetype(folio_page(folio, i),
>  				MIGRATE_MOVABLE);
> +	}
>  }
>
>  static void __init prep_and_add_bootmem_folios(struct hstate *h,
> @@ -3319,7 +3381,7 @@ bool __init hugetlb_bootmem_page_zones_valid(int nid,
>  		struct huge_bootmem_page *m)
>  {
>  	unsigned long start_pfn;
> -	bool valid;
> +	bool valid = false;

Why do you need that? I can't see any path that reaches 'out:' without
setting 'valid'.

>
>  	if (m->flags & HUGE_BOOTMEM_ZONES_VALID) {
>  		/*
> @@ -3328,10 +3390,16 @@ bool __init hugetlb_bootmem_page_zones_valid(int nid,
>  		return true;
>  	}
>
> +	if (hugetlb_bootmem_page_earlycma(m)) {
> +		valid = cma_validate_zones(m->cma);
> +		goto out;
> +	}
> +
>  	start_pfn = virt_to_phys(m) >> PAGE_SHIFT;
>
>  	valid = !pfn_range_intersects_zones(nid, start_pfn,
>  			pages_per_huge_page(m->hstate));
> +out:
>  	if (!valid)
>  		hstate_boot_nrinvalid[hstate_index(m->hstate)]++;
>
> @@ -3360,11 +3428,6 @@ static void __init hugetlb_bootmem_free_invalid_page(int nid, struct page *page,
>  	}
>  }
>
> -static bool __init hugetlb_bootmem_page_prehvo(struct huge_bootmem_page *m)
> -{
> -	return (m->flags & HUGE_BOOTMEM_HVO);
> -}
> -
>  /*
>   * Put bootmem huge pages into the standard lists after mem_map is up.
>   * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
> @@ -3414,6 +3477,9 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
>  	 */
>  	folio_set_hugetlb_vmemmap_optimized(folio);
>
> +	if (hugetlb_bootmem_page_earlycma(m))
> +		folio_set_hugetlb_cma(folio);
> +
>  	list_add(&folio->lru, &folio_list);
>
>  	/*
> @@ -3606,8 +3672,11 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
>  {
>  	unsigned long allocated;
>
> -	/* skip gigantic hugepages allocation if hugetlb_cma enabled */
> -	if (hstate_is_gigantic(h) && hugetlb_cma_size) {
> +	/*
> +	 * Skip gigantic hugepages allocation if early CMA
> +	 * reservations are not available.
> +	 */
> +	if (hstate_is_gigantic(h) && hugetlb_cma_size && !hugetlb_early_cma(h)) {
>  		pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
>  		return;
>  	}