From: Frank van der Linden <fvdl@google.com>
To: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: akpm@linux-foundation.org, muchun.song@linux.dev,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
yuzhao@google.com, joao.m.martins@oracle.com,
roman.gushchin@linux.dev,
Madhavan Srinivasan <maddy@linux.ibm.com>,
Michael Ellerman <mpe@ellerman.id.au>,
linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH 27/27] mm/hugetlb: enable bootmem allocation from CMA areas
Date: Tue, 28 Jan 2025 09:52:30 -0800
Message-ID: <CAPTztWa8bL06fVDP-N3s1yqMLxFZfvDvVRpn5B5YZBJ3idru9Q@mail.gmail.com>
In-Reply-To: <c961cc1e-897d-4b86-b123-b12a0c27f91a@csgroup.eu>
Hi Christophe, thanks for your comments. Replies inline below.
On Tue, Jan 28, 2025 at 12:55 AM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
>
>
> On 28/01/2025 00:22, Frank van der Linden wrote:
> > If hugetlb_cma_only is enabled, we know that hugetlb pages
> > can only be allocated from CMA. Now that there is an interface
> > to do early reservations from a CMA area (returning memblock
> > memory), it can be used to allocate hugetlb pages from CMA.
> >
> > This also allows for doing pre-HVO on these pages (if enabled).
> >
> > Make sure to initialize the page structures and associated data
> > correctly. Create a flag to signal that a hugetlb page has been
> > allocated from CMA to make things a little easier.
> >
> > Some configurations of powerpc have a special hugetlb bootmem
> > allocator, so introduce a boolean arch_specific_huge_bootmem_alloc
> > that returns true if such an allocator is present. In that case,
> > CMA bootmem allocations can't be used, so check that function
> > before trying.
> >
> > Cc: Madhavan Srinivasan <maddy@linux.ibm.com>
> > Cc: Michael Ellerman <mpe@ellerman.id.au>
> > Cc: linuxppc-dev@lists.ozlabs.org
> > Signed-off-by: Frank van der Linden <fvdl@google.com>
> > ---
> > arch/powerpc/mm/hugetlbpage.c | 5 ++
> > include/linux/hugetlb.h | 7 ++
> > mm/hugetlb.c | 135 +++++++++++++++++++++++++---------
> > 3 files changed, 114 insertions(+), 33 deletions(-)
> >
> > diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> > index d3c1b749dcfc..e53e4b4c8ef6 100644
> > --- a/arch/powerpc/mm/hugetlbpage.c
> > +++ b/arch/powerpc/mm/hugetlbpage.c
> > @@ -121,6 +121,11 @@ bool __init hugetlb_node_alloc_supported(void)
> > {
> > return false;
> > }
> > +
> > +bool __init arch_specific_huge_bootmem_alloc(struct hstate *h)
> > +{
> > + return (firmware_has_feature(FW_FEATURE_LPAR) && !radix_enabled());
> > +}
> > #endif
> >
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 2512463bca49..bca3052fb175 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -591,6 +591,7 @@ enum hugetlb_page_flags {
> > HPG_freed,
> > HPG_vmemmap_optimized,
> > HPG_raw_hwp_unreliable,
> > + HPG_cma,
> > __NR_HPAGEFLAGS,
> > };
> >
> > @@ -650,6 +651,7 @@ HPAGEFLAG(Temporary, temporary)
> > HPAGEFLAG(Freed, freed)
> > HPAGEFLAG(VmemmapOptimized, vmemmap_optimized)
> > HPAGEFLAG(RawHwpUnreliable, raw_hwp_unreliable)
> > +HPAGEFLAG(Cma, cma)
> >
> > #ifdef CONFIG_HUGETLB_PAGE
> >
> > @@ -678,14 +680,18 @@ struct hstate {
> > char name[HSTATE_NAME_LEN];
> > };
> >
> > +struct cma;
> > +
> > struct huge_bootmem_page {
> > struct list_head list;
> > struct hstate *hstate;
> > unsigned long flags;
> > + struct cma *cma;
> > };
> >
> > #define HUGE_BOOTMEM_HVO 0x0001
> > #define HUGE_BOOTMEM_ZONES_VALID 0x0002
> > +#define HUGE_BOOTMEM_CMA 0x0004
> >
> > bool hugetlb_bootmem_page_zones_valid(int nid, struct huge_bootmem_page *m);
> >
> > @@ -711,6 +717,7 @@ bool __init hugetlb_node_alloc_supported(void);
> >
> > void __init hugetlb_add_hstate(unsigned order);
> > bool __init arch_hugetlb_valid_size(unsigned long size);
> > +bool __init arch_specific_huge_bootmem_alloc(struct hstate *h);
>
> Why add 'specific' to the name? Prefixing a function name with arch_
> is usually enough to denote an architecture-specific function.
True, yes. That should probably be arch_has_huge_bootmem_alloc; I will
change that.
>
> > struct hstate *size_to_hstate(unsigned long size);
> >
> > #ifndef HUGE_MAX_HSTATE
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 32ebde9039e2..183e8d0c2fb4 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -61,7 +61,7 @@ static struct cma *hugetlb_cma[MAX_NUMNODES];
> > static unsigned long hugetlb_cma_size_in_node[MAX_NUMNODES] __initdata;
> > #endif
> > static bool hugetlb_cma_only;
> > -static unsigned long hugetlb_cma_size __initdata;
> > +static unsigned long hugetlb_cma_size;
>
> Why remove __initdata? As far as I can see it is used only in
> hugetlb_early_cma() and hugetlb_hstate_alloc_pages(), which are __init
> functions.
hugetlb_cma_size is now used in alloc_gigantic_folio(), which is not
an __init function. However, you got me thinking: since
hugetlb_cma_only is only effective when hugetlb_cma_size != 0, I can
just reset hugetlb_cma_only to false after command line parsing if
hugetlb_cma_size == 0. That lets hugetlb_cma_size stay __initdata and
simplifies things a bit. I'll make that change in v2.
>
> >
> > __initdata struct list_head huge_boot_pages[MAX_NUMNODES];
> > __initdata unsigned long hstate_boot_nrinvalid[HUGE_MAX_HSTATE];
> > @@ -132,8 +132,10 @@ static void hugetlb_free_folio(struct folio *folio)
> > #ifdef CONFIG_CMA
> > int nid = folio_nid(folio);
> >
> > - if (cma_free_folio(hugetlb_cma[nid], folio))
> > + if (folio_test_hugetlb_cma(folio)) {
> > + WARN_ON(!cma_free_folio(hugetlb_cma[nid], folio));
>
> Is that WARN_ON() needed? See
> https://docs.kernel.org/process/coding-style.html#do-not-crash-the-kernel
Not strictly, I suppose, but if cma_free_folio() fails for a
CMA-allocated hugetlb folio, that is a condition worth warning about:
either the flag somehow got corrupted, or there is an internal CMA
error. How about WARN_ON_ONCE?
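I.e., the hunk above would become:

	if (folio_test_hugetlb_cma(folio)) {
		WARN_ON_ONCE(!cma_free_folio(hugetlb_cma[nid], folio));
		return;
	}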
>
>
> > return;
> > + }
> > #endif
> > folio_put(folio);
> > }
> > @@ -1509,6 +1511,9 @@ static struct folio *alloc_gigantic_folio(struct hstate *h, gfp_t gfp_mask,
> > break;
> > }
> > }
> > +
> > + if (folio)
> > + folio_set_hugetlb_cma(folio);
> > }
> > #endif
> > if (!folio) {
> > @@ -3175,6 +3180,63 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> > return ERR_PTR(-ENOSPC);
> > }
> >
> > +/*
> > + * Some architectures do their own bootmem allocation, so they can't use
> > + * early CMA allocation. So, allow for this function to be redefined.
> > + */
> > +bool __init __attribute((weak))
>
> Can't you use __weak?
>
> By the way, do we really need a weak function here? Can't it be a
> static inline helper that gets overridden by a macro defined by the
> arch, something like:
>
> #ifndef arch_huge_bootmem_alloc
> static inline bool arch_huge_bootmem_alloc(struct hstate *h)
> {
> return false;
> }
> #endif
>
> Then powerpc does:
>
> #define arch_huge_bootmem_alloc arch_huge_bootmem_alloc
> static inline bool arch_huge_bootmem_alloc(struct hstate *h)
> {
> return (firmware_has_feature(FW_FEATURE_LPAR) && !radix_enabled());
> }
>
Fair enough, yeah. I used a weak symbol because one was already used
for powerpc with alloc_bootmem_huge_page(), but I can change this one.
>
> But why is the struct hstate *h parameter needed? Seems like no one uses it.
Correct - I merely extrapolated a bit and thought "well, architectures
might have bootmem hugetlb allocators that only deal with certain
sizes". But then again, like you say, there is currently no need for
it. I'll remove the argument.
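Putting those three changes together (rename, macro override instead
of a weak function, no hstate argument), v2 would look roughly like
this - a sketch, exact header placement still to be decided. Generic
code:

#ifndef arch_has_huge_bootmem_alloc
/*
 * Some architectures do their own bootmem allocation, so they can't
 * use early CMA allocation.
 */
static inline bool arch_has_huge_bootmem_alloc(void)
{
	return false;
}
#endif

And for powerpc:

#define arch_has_huge_bootmem_alloc arch_has_huge_bootmem_alloc
static inline bool arch_has_huge_bootmem_alloc(void)
{
	return (firmware_has_feature(FW_FEATURE_LPAR) && !radix_enabled());
}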
>
> > +arch_specific_huge_bootmem_alloc(struct hstate *h)
> > +{
> > + return false;
> > +}
> > +
> > +static bool __init hugetlb_early_cma(struct hstate *h)
> > +{
> > + if (arch_specific_huge_bootmem_alloc(h))
> > + return false;
> > +
> > + return (hstate_is_gigantic(h) && hugetlb_cma_size && hugetlb_cma_only);
> > +}
> > +
> > +static __init void *alloc_bootmem(struct hstate *h, int nid)
> > +{
> > + struct huge_bootmem_page *m;
> > + unsigned long flags;
> > + struct cma *cma;
> > +
> > +#ifdef CONFIG_CMA
>
> #ifdefs in C files should be avoided; see
> https://docs.kernel.org/process/coding-style.html#conditional-compilation
>
> > + if (hugetlb_early_cma(h)) {
> > + flags = HUGE_BOOTMEM_CMA;
> > + cma = hugetlb_cma[nid];
> > + m = cma_reserve_early(cma, huge_page_size(h));
> > + } else
> > +#endif
>
> This kind of if/else construct is ugly and should be avoided.
>
I found this ifdef hard to avoid, sadly. I tried various ways around
it (if (IS_ENABLED(CONFIG_CMA)), etc.), but came up short. I'll have
another look for v2, but short of splitting all CMA-related code off
into a different file, which would definitely be out of scope here, it
might not end up being better.
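One shape I tried (untested sketch, helper name made up) is to move
the CMA path into a helper that compiles to a stub without CONFIG_CMA,
which at least keeps the conditional out of the function body:

#ifdef CONFIG_CMA
static __init void *hugetlb_alloc_bootmem_cma(struct hstate *h, int nid,
					      struct cma **cmap)
{
	if (!hugetlb_early_cma(h))
		return NULL;

	*cmap = hugetlb_cma[nid];
	return cma_reserve_early(*cmap, huge_page_size(h));
}
#else
static inline void *hugetlb_alloc_bootmem_cma(struct hstate *h, int nid,
					      struct cma **cmap)
{
	return NULL;
}
#endif

alloc_bootmem() would then be:

	cma = NULL;
	m = hugetlb_alloc_bootmem_cma(h, nid, &cma);
	if (m) {
		flags = HUGE_BOOTMEM_CMA;
	} else {
		flags = 0;
		m = memblock_alloc_try_nid_raw(huge_page_size(h),
				huge_page_size(h), 0,
				MEMBLOCK_ALLOC_ACCESSIBLE, nid);
	}

That still has an #ifdef, just not wrapped around half an if/else, so
I'm not sure it's a clear win.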
> > + {
> > + flags = 0;
> > + cma = NULL;
> > + m = memblock_alloc_try_nid_raw(huge_page_size(h),
> > + huge_page_size(h), 0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
> > + }
> > +
> > + if (m) {
> > + /*
> > + * Use the beginning of the huge page to store the
> > + * huge_bootmem_page struct (until gather_bootmem
> > + * puts them into the mem_map).
> > + *
> > + * Put them into a private list first because mem_map
> > + * is not up yet.
> > + */
> > + INIT_LIST_HEAD(&m->list);
> > + list_add(&m->list, &huge_boot_pages[nid]);
> > + m->hstate = h;
> > + m->flags = flags;
> > + m->cma = cma;
> > + }
> > +
> > + return m;
> > +}
> > +
> > int alloc_bootmem_huge_page(struct hstate *h, int nid)
> > __attribute__ ((weak, alias("__alloc_bootmem_huge_page")));
> > int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> > @@ -3184,17 +3246,14 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> >
> > /* do node specific alloc */
> > if (nid != NUMA_NO_NODE) {
> > - m = memblock_alloc_try_nid_raw(huge_page_size(h), huge_page_size(h),
> > - 0, MEMBLOCK_ALLOC_ACCESSIBLE, nid);
> > + m = alloc_bootmem(h, nid);
> > if (!m)
> > return 0;
> > goto found;
> > }
> > /* allocate from next node when distributing huge pages */
> > for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_ONLINE]) {
> > - m = memblock_alloc_try_nid_raw(
> > - huge_page_size(h), huge_page_size(h),
> > - 0, MEMBLOCK_ALLOC_ACCESSIBLE, node);
> > + m = alloc_bootmem(h, node);
> > if (m)
> > break;
> > }
> > @@ -3203,7 +3262,6 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> > return 0;
> >
> > found:
> > -
> > /*
> > * Only initialize the head struct page in memmap_init_reserved_pages,
> > * rest of the struct pages will be initialized by the HugeTLB
> > @@ -3213,18 +3271,6 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
> > */
> > memblock_reserved_mark_noinit(virt_to_phys((void *)m + PAGE_SIZE),
> > huge_page_size(h) - PAGE_SIZE);
> > - /*
> > - * Use the beginning of the huge page to store the
> > - * huge_bootmem_page struct (until gather_bootmem
> > - * puts them into the mem_map).
> > - *
> > - * Put them into a private list first because mem_map
> > - * is not up yet.
> > - */
> > - INIT_LIST_HEAD(&m->list);
> > - list_add(&m->list, &huge_boot_pages[node]);
> > - m->hstate = h;
> > - m->flags = 0;
> > return 1;
> > }
> >
> > @@ -3265,13 +3311,25 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio,
> > prep_compound_head((struct page *)folio, huge_page_order(h));
> > }
> >
> > +static bool __init hugetlb_bootmem_page_prehvo(struct huge_bootmem_page *m)
> > +{
> > + return (m->flags & HUGE_BOOTMEM_HVO);
>
> Parentheses are superfluous
>
> > +}
> > +
> > +static bool __init hugetlb_bootmem_page_earlycma(struct huge_bootmem_page *m)
> > +{
> > + return (m->flags & HUGE_BOOTMEM_CMA);
>
> Parentheses are superfluous
>
Sure, will remove them.
> > +}
> > +
> > /*
> > * memblock-allocated pageblocks might not have the migrate type set
> > * if marked with the 'noinit' flag. Set it to the default (MIGRATE_MOVABLE)
> > - * here.
> > + * here, or MIGRATE_CMA if this was a page allocated through an early CMA
> > + * reservation.
> > *
> > - * Note that this will not write the page struct, it is ok (and necessary)
> > - * to do this on vmemmap optimized folios.
> > + * In case of vmemmap optimized folios, the tail vmemmap pages are mapped
> > + * read-only, but that's ok - for sparse vmemmap this does not write to
> > + * the page structure.
> > */
> > static void __init hugetlb_bootmem_init_migratetype(struct folio *folio,
> > struct hstate *h)
> > @@ -3280,9 +3338,13 @@ static void __init hugetlb_bootmem_init_migratetype(struct folio *folio,
> >
> > WARN_ON_ONCE(!pageblock_aligned(folio_pfn(folio)));
> >
> > - for (i = 0; i < nr_pages; i += pageblock_nr_pages)
> > - set_pageblock_migratetype(folio_page(folio, i),
> > + for (i = 0; i < nr_pages; i += pageblock_nr_pages) {
> > + if (folio_test_hugetlb_cma(folio))
> > + init_cma_pageblock(folio_page(folio, i));
> > + else
> > + set_pageblock_migratetype(folio_page(folio, i),
> > MIGRATE_MOVABLE);
> > + }
> > }
> >
> > static void __init prep_and_add_bootmem_folios(struct hstate *h,
> > @@ -3319,7 +3381,7 @@ bool __init hugetlb_bootmem_page_zones_valid(int nid,
> > struct huge_bootmem_page *m)
> > {
> > unsigned long start_pfn;
> > - bool valid;
> > + bool valid = false;
>
> Why do you need that? I can't see any path that reaches 'out:'
> without setting 'valid'.
True - probably a leftover from an earlier iteration, I can remove that.
>
> >
> > if (m->flags & HUGE_BOOTMEM_ZONES_VALID) {
> > /*
> > @@ -3328,10 +3390,16 @@ bool __init hugetlb_bootmem_page_zones_valid(int nid,
> > return true;
> > }
> >
> > + if (hugetlb_bootmem_page_earlycma(m)) {
> > + valid = cma_validate_zones(m->cma);
> > + goto out;
> > + }
> > +
> > start_pfn = virt_to_phys(m) >> PAGE_SHIFT;
> >
> > valid = !pfn_range_intersects_zones(nid, start_pfn,
> > pages_per_huge_page(m->hstate));
> > +out:
> > if (!valid)
> > hstate_boot_nrinvalid[hstate_index(m->hstate)]++;
> >
> > @@ -3360,11 +3428,6 @@ static void __init hugetlb_bootmem_free_invalid_page(int nid, struct page *page,
> > }
> > }
> >
> > -static bool __init hugetlb_bootmem_page_prehvo(struct huge_bootmem_page *m)
> > -{
> > - return (m->flags & HUGE_BOOTMEM_HVO);
> > -}
> > -
> > /*
> > * Put bootmem huge pages into the standard lists after mem_map is up.
> > * Note: This only applies to gigantic (order > MAX_PAGE_ORDER) pages.
> > @@ -3414,6 +3477,9 @@ static void __init gather_bootmem_prealloc_node(unsigned long nid)
> > */
> > folio_set_hugetlb_vmemmap_optimized(folio);
> >
> > + if (hugetlb_bootmem_page_earlycma(m))
> > + folio_set_hugetlb_cma(folio);
> > +
> > list_add(&folio->lru, &folio_list);
> >
> > /*
> > @@ -3606,8 +3672,11 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
> > {
> > unsigned long allocated;
> >
> > - /* skip gigantic hugepages allocation if hugetlb_cma enabled */
> > - if (hstate_is_gigantic(h) && hugetlb_cma_size) {
> > + /*
> > + * Skip gigantic hugepages allocation if early CMA
> > + * reservations are not available.
> > + */
> > + if (hstate_is_gigantic(h) && hugetlb_cma_size && !hugetlb_early_cma(h)) {
> > pr_warn_once("HugeTLB: hugetlb_cma is enabled, skip boot time allocation\n");
> > return;
> > }
>