From: Suren Baghdasaryan <surenb@google.com>
To: Hao Ge <hao.ge@linux.dev>
Cc: Kent Overstreet <kent.overstreet@linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2] mm/alloc_tag: replace fixed-size early PFN array with dynamic linked list
Date: Wed, 22 Apr 2026 21:10:20 -0700 [thread overview]
Message-ID: <CAJuCfpGOd=p4UOkn1UF=KmAFL6Xxm4FGomZt83eRj_cjeUxAjw@mail.gmail.com> (raw)
In-Reply-To: <a4d4ac18-fbae-4ace-8f1c-651c8fefcf38@linux.dev>
On Wed, Apr 22, 2026 at 7:10 PM Hao Ge <hao.ge@linux.dev> wrote:
>
>
> On 2026/4/23 01:32, Suren Baghdasaryan wrote:
> > On Mon, Apr 20, 2026 at 8:15 PM Hao Ge <hao.ge@linux.dev> wrote:
> >> Pages allocated before page_ext is available have their codetag left
> >> uninitialized. Track these early PFNs and clear their codetag in
> >> clear_early_alloc_pfn_tag_refs() to avoid "alloc_tag was not set"
> >> warnings when they are freed later.
> >>
> >> Currently a fixed-size array of 8192 entries is used, with a warning if
> >> the limit is exceeded. However, the number of early allocations depends
> >> on the number of CPUs and can be larger than 8192.
> >>
> >> Replace the fixed-size array with a dynamically allocated linked list.
> >> Each page is carved into early_pfn_node entries and the remainder is
> >> kept as a freelist for subsequent allocations.
> >>
> >> The list nodes themselves are allocated via alloc_page(), which would
> >> trigger __pgalloc_tag_add() -> alloc_tag_add_early_pfn() ->
> >> alloc_early_pfn_node() and recurse indefinitely. Introduce
> >> __GFP_NO_CODETAG (reuses the %__GFP_NO_OBJ_EXT bit) and pass
> >> gfp_flags through pgalloc_tag_add() so that the early path can skip
> >> recording allocations that carry this flag.
> Hi Suren
> > Hi Hao,
> > Thanks for following up on this!
>
> Happy to help further develop this feature.
>
> Feel free to reach out if there's anything else I can do.
>
>
> >
> >> Signed-off-by: Hao Ge <hao.ge@linux.dev>
> >> ---
> >> v2:
> >> - Use cmpxchg to atomically update early_pfn_pages, preventing page leak under concurrent allocation
> >> - Pass gfp_flags through the full call chain and use gfpflags_allow_blocking()
> >> to select GFP_KERNEL vs GFP_ATOMIC, avoiding unnecessary GFP_ATOMIC in process context
> >> ---
> >> include/linux/alloc_tag.h | 22 +++++++-
> >> lib/alloc_tag.c | 102 ++++++++++++++++++++++++++------------
> >> mm/page_alloc.c | 29 +++++++----
> >> 3 files changed, 108 insertions(+), 45 deletions(-)
> >>
> >> diff --git a/include/linux/alloc_tag.h b/include/linux/alloc_tag.h
> >> index 02de2ede560f..2fa695bd3c53 100644
> >> --- a/include/linux/alloc_tag.h
> >> +++ b/include/linux/alloc_tag.h
> >> @@ -150,6 +150,23 @@ static inline struct alloc_tag_counters alloc_tag_read(struct alloc_tag *tag)
> >> }
> >>
> >> #ifdef CONFIG_MEM_ALLOC_PROFILING_DEBUG
> >> +/*
> >> + * Skip early PFN recording for a page allocation. Reuses the
> >> + * %__GFP_NO_OBJ_EXT bit. Used by alloc_early_pfn_node() to avoid
> >> + * recursion when allocating pages for the early PFN tracking list
> >> + * itself.
> >> + *
> >> + * Callers must set the codetag to CODETAG_EMPTY (via
> >> + * clear_page_tag_ref()) before freeing pages allocated with this
> >> + * flag once page_ext becomes available, otherwise
> >> + * alloc_tag_sub_check() will trigger a warning.
> >> + */
> >> +#define __GFP_NO_CODETAG __GFP_NO_OBJ_EXT
> >> +
> >> +static inline bool should_record_early_pfn(gfp_t gfp_flags)
> >> +{
> >> + return !(gfp_flags & __GFP_NO_CODETAG);
> >> +}
> >> static inline void alloc_tag_add_check(union codetag_ref *ref, struct alloc_tag *tag)
> >> {
> >> WARN_ONCE(ref && ref->ct && !is_codetag_empty(ref),
> >> @@ -163,11 +180,12 @@ static inline void alloc_tag_sub_check(union codetag_ref *ref)
> >> {
> >> WARN_ONCE(ref && !ref->ct, "alloc_tag was not set\n");
> >> }
> >> -void alloc_tag_add_early_pfn(unsigned long pfn);
> >> +void alloc_tag_add_early_pfn(unsigned long pfn, gfp_t gfp_flags);
> >> #else
> >> static inline void alloc_tag_add_check(union codetag_ref *ref, struct alloc_tag *tag) {}
> >> static inline void alloc_tag_sub_check(union codetag_ref *ref) {}
> >> -static inline void alloc_tag_add_early_pfn(unsigned long pfn) {}
> >> +static inline void alloc_tag_add_early_pfn(unsigned long pfn, gfp_t gfp_flags) {}
> >> +static inline bool should_record_early_pfn(gfp_t gfp_flags) { return true; }
> >
> > If CONFIG_MEM_ALLOC_PROFILING_DEBUG=n why should we record early pfns?
> Good point! I'll address this in the next patch.
> >> #endif
> >>
> >> /* Caller should verify both ref and tag to be valid */
> >> diff --git a/lib/alloc_tag.c b/lib/alloc_tag.c
> >> index ed1bdcf1f8ab..cfc68e397eba 100644
> >> --- a/lib/alloc_tag.c
> >> +++ b/lib/alloc_tag.c
> >> @@ -766,45 +766,75 @@ static __init bool need_page_alloc_tagging(void)
> >> * Some pages are allocated before page_ext becomes available, leaving
> >> * their codetag uninitialized. Track these early PFNs so we can clear
> >> * their codetag refs later to avoid warnings when they are freed.
> >> - *
> >> - * Early allocations include:
> >> - * - Base allocations independent of CPU count
> >> - * - Per-CPU allocations (e.g., CPU hotplug callbacks during smp_init,
> >> - * such as trace ring buffers, scheduler per-cpu data)
> >> - *
> >> - * For simplicity, we fix the size to 8192.
> >> - * If insufficient, a warning will be triggered to alert the user.
> >> - *
> >> - * TODO: Replace fixed-size array with dynamic allocation using
> >> - * a GFP flag similar to ___GFP_NO_OBJ_EXT to avoid recursion.
> >> */
> >> -#define EARLY_ALLOC_PFN_MAX 8192
> >> +struct early_pfn_node {
> >> + struct early_pfn_node *next;
> >> + unsigned long pfn;
> >> +};
> >> +
> >> +#define NODES_PER_PAGE (PAGE_SIZE / sizeof(struct early_pfn_node))
> >>
> >> -static unsigned long early_pfns[EARLY_ALLOC_PFN_MAX] __initdata;
> >> -static atomic_t early_pfn_count __initdata = ATOMIC_INIT(0);
> >> +static struct early_pfn_node *early_pfn_list __initdata;
> >> +static struct early_pfn_node *early_pfn_freelist __initdata;
> >> +static struct page *early_pfn_pages __initdata;
> > This early_pfn_node linked list seems overly complex. Why not just
> > allocate a page and use page->lru to place it into a linked list? I
> > think the code will end up much simpler.
>
> Ah, yes! You mentioned this before, but I misunderstood your point.
>
> I apologize for the confusion. I'll optimize this in the next revision.
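To make sure we're talking about the same thing, here is roughly the shape I
had in mind, as a userspace sketch. list_node/fake_page are illustrative
stand-ins, not kernel APIs; in the kernel the hook is simply the existing
page->lru field, so no separate tracking structures are needed at all:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a minimal list hook; plays the role of page->lru. */
struct list_node {
	struct list_node *next;
};

struct fake_page {
	struct list_node lru;	/* embedded hook, as in struct page */
	unsigned long pfn;	/* payload recorded for this page */
};

static struct list_node early_pages = { NULL };

/* Chain the page itself into the list via its embedded hook; no extra
 * early_pfn_node allocations are required. */
static void record_early_page(struct fake_page *page)
{
	page->lru.next = early_pages.next;
	early_pages.next = &page->lru;
}
```

The point is that the tracking storage comes for free with the page being
tracked.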
>
> >
> >> -static void __init __alloc_tag_add_early_pfn(unsigned long pfn)
> >> +static struct early_pfn_node *__init alloc_early_pfn_node(gfp_t gfp_flags)
> >> {
> >> - int old_idx, new_idx;
> >> + struct early_pfn_node *ep, *new;
> >> + struct page *page, *old_page;
> >> + gfp_t gfp = gfpflags_allow_blocking(gfp_flags) ? GFP_KERNEL : GFP_ATOMIC;
> >> + int i;
> >> +
> >> +retry:
> >> + ep = READ_ONCE(early_pfn_freelist);
> >> + if (ep) {
> >> + struct early_pfn_node *next = READ_ONCE(ep->next);
> >> +
> >> + if (try_cmpxchg(&early_pfn_freelist, &ep, next))
> >> + return ep;
> >> + goto retry;
> >> + }
> >> +
> >> + page = alloc_page(gfp | __GFP_NO_CODETAG | __GFP_ZERO);
>
> One more question: since this is called in an RCU context, should we use
> GFP_ATOMIC by default?
You mean it might be called in an RCU context, right? If so, then I
think we should not use GFP_ATOMIC when GFP_KERNEL can be used.
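For what it's worth, the selection v2 already does reduces to a single bit
test. A quick userspace model of the decision (the bit values here are made
up for illustration; the real definitions live in include/linux/gfp_types.h):

```c
#include <stdbool.h>

typedef unsigned int gfp_t;

/* Illustrative bit values, not the kernel's actual layout. */
#define ___GFP_DIRECT_RECLAIM	0x400u
#define ___GFP_KSWAPD_RECLAIM	0x800u
#define GFP_KERNEL	(___GFP_DIRECT_RECLAIM | ___GFP_KSWAPD_RECLAIM)
#define GFP_ATOMIC	(___GFP_KSWAPD_RECLAIM)

/* Mirrors the kernel's gfpflags_allow_blocking(): blocking is allowed
 * only when the caller passed __GFP_DIRECT_RECLAIM. */
static bool gfpflags_allow_blocking(gfp_t gfp_flags)
{
	return !!(gfp_flags & ___GFP_DIRECT_RECLAIM);
}
```

So hardcoding GFP_ATOMIC would throw away the ability to block for callers
that did pass __GFP_DIRECT_RECLAIM.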
>
> Also, should we drop __GFP_KSWAPD_RECLAIM here? I'm not entirely sure,
> but I recall Sashiko mentioned a warning about this before:
>
> ---
>
> page = alloc_page(GFP_ATOMIC | __GFP_NO_CODETAG | __GFP_ZERO);
>
> Sashiko's concerns:
>
> Can this lead to a deadlock by introducing lock recursion?
> alloc_early_pfn_node() is invoked as a post-allocation hook for early boot
> pages via pgalloc_tag_add(). GFP_ATOMIC includes __GFP_KSWAPD_RECLAIM,
> which triggers wakeup_kswapd() and acquires scheduler locks.
> If the original allocation was made under scheduler locks and intentionally
> stripped __GFP_KSWAPD_RECLAIM to prevent recursion, does this hardcoded
> GFP_ATOMIC force it back on? Should the hook inherit or constrain its flags
> based on the caller's gfp_flags instead?
>
> ---
>
> Even though this only happens during early boot, should we handle it
> more safely?
That seems reasonable. Any reason you are not simply doing:

page = alloc_page(gfp_flags | __GFP_NO_CODETAG | __GFP_ZERO);

IOW, why aren't you simply inheriting the flags?
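To be concrete about what inheriting means: OR the extra bits into whatever
the caller passed, so a caller that deliberately stripped
__GFP_KSWAPD_RECLAIM keeps it stripped. A sketch with illustrative bit
values (early_node_gfp is a hypothetical helper name, not something in the
patch):

```c
typedef unsigned int gfp_t;

/* Illustrative bit values only. */
#define ___GFP_KSWAPD_RECLAIM	0x800u
#define __GFP_ZERO		0x100u
#define __GFP_NO_CODETAG	0x40000u /* reuses __GFP_NO_OBJ_EXT in the patch */

/* Hypothetical helper: inherit the caller's policy bits verbatim and add
 * only what the tracking allocation itself needs. */
static gfp_t early_node_gfp(gfp_t caller_flags)
{
	return caller_flags | __GFP_NO_CODETAG | __GFP_ZERO;
}
```

That way the hook never widens the caller's reclaim policy, which is exactly
the concern Sashiko raised.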
>
> >> + if (!page)
> >> + return NULL;
> >> +
> >> + new = page_address(page);
> >> + for (i = 0; i < NODES_PER_PAGE - 1; i++)
> >> + new[i].next = &new[i + 1];
> >> + new[NODES_PER_PAGE - 1].next = NULL;
> >> +
> >> + if (cmpxchg(&early_pfn_freelist, NULL, new + 1)) {
> >> + __free_page(page);
> >> + goto retry;
> >> + }
> >>
> >> do {
> >> - old_idx = atomic_read(&early_pfn_count);
> >> - if (old_idx >= EARLY_ALLOC_PFN_MAX) {
> >> - pr_warn_once("Early page allocations before page_ext init exceeded EARLY_ALLOC_PFN_MAX (%d)\n",
> >> - EARLY_ALLOC_PFN_MAX);
> >> - return;
> >> - }
> >> - new_idx = old_idx + 1;
> >> - } while (!atomic_try_cmpxchg(&early_pfn_count, &old_idx, new_idx));
> >> + old_page = READ_ONCE(early_pfn_pages);
> >> + page->private = (unsigned long)old_page;
> >> + } while (cmpxchg(&early_pfn_pages, old_page, page) != old_page);
> > I don't think this whole lockless schema is worth the complexity.
> > alloc_early_pfn_node() is called only during early init and is called
> > perhaps a few hundred times in total. Why not use a simple spinlock to
> > synchronize this operation and be done with it?
>
> I initially used a simple spinlock, but Sashiko raised a good point in
> his review:
>
> https://sashiko.dev/#/patchset/20260319083153.2488005-1-hao.ge%40linux.dev
>
> Since alloc_early_pfn_node() is called during early init, in the early
> page allocation path, from an unknown context, a lockless approach is
> safer.
>
> However, you're right that if we use page->lru as a linked list (which
> you suggested earlier), the code becomes much simpler. I plan to
> simplify the code and continue using the lockless approach in the next
> version.
Ok, let's see the result and decide then. Allocating a page from NMI
context sounds extreme to me. I would expect NMI context to use only
preallocated memory.
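For the record, the freelist pop/publish loops in v2 are the classic
Treiber-stack pattern; in userspace C11 atomics the same shape looks like
this (a single-threaded sketch just to show the structure — the general
pattern also has to worry about ABA when nodes can be freed and reused,
which the early-boot list sidesteps by only freeing its pages at the final
teardown):

```c
#include <stdatomic.h>
#include <stddef.h>

struct early_pfn_node {
	struct early_pfn_node *next;
	unsigned long pfn;
};

static _Atomic(struct early_pfn_node *) freelist;

/* Mirrors the try_cmpxchg() pop loop in alloc_early_pfn_node(). */
static struct early_pfn_node *freelist_pop(void)
{
	struct early_pfn_node *ep = atomic_load(&freelist);

	while (ep) {
		struct early_pfn_node *next = ep->next;

		/* On failure, ep is refreshed with the current head. */
		if (atomic_compare_exchange_weak(&freelist, &ep, next))
			return ep;
	}
	return NULL;
}

/* Mirrors the try_cmpxchg() publish loop in __alloc_tag_add_early_pfn(). */
static void freelist_push(struct early_pfn_node *ep)
{
	ep->next = atomic_load(&freelist);
	while (!atomic_compare_exchange_weak(&freelist, &ep->next, ep))
		;
}
```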
>
>
> >> +
> >> + return new;
> >> +}
> >> +
> >> +static void __init __alloc_tag_add_early_pfn(unsigned long pfn, gfp_t gfp_flags)
> >> +{
> >> + struct early_pfn_node *ep = alloc_early_pfn_node(gfp_flags);
> >>
> >> - early_pfns[old_idx] = pfn;
> >> + if (!ep)
> >> + return;
> >> +
> >> + ep->pfn = pfn;
> >> + do {
> >> + ep->next = READ_ONCE(early_pfn_list);
> >> + } while (!try_cmpxchg(&early_pfn_list, &ep->next, ep));
> >> }
> >>
> >> -typedef void alloc_tag_add_func(unsigned long pfn);
> >> +typedef void alloc_tag_add_func(unsigned long pfn, gfp_t gfp_flags);
> >> static alloc_tag_add_func __rcu *alloc_tag_add_early_pfn_ptr __refdata =
> >> RCU_INITIALIZER(__alloc_tag_add_early_pfn);
> >>
> >> -void alloc_tag_add_early_pfn(unsigned long pfn)
> >> +void alloc_tag_add_early_pfn(unsigned long pfn, gfp_t gfp_flags)
> >> {
> >> alloc_tag_add_func *alloc_tag_add;
> >>
> >> @@ -814,13 +844,14 @@ void alloc_tag_add_early_pfn(unsigned long pfn)
> >> rcu_read_lock();
> >> alloc_tag_add = rcu_dereference(alloc_tag_add_early_pfn_ptr);
> >> if (alloc_tag_add)
> >> - alloc_tag_add(pfn);
> >> + alloc_tag_add(pfn, gfp_flags);
> >> rcu_read_unlock();
> >> }
> >>
> >> static void __init clear_early_alloc_pfn_tag_refs(void)
> >> {
> >> - unsigned int i;
> >> + struct early_pfn_node *ep;
> >> + struct page *page, *next;
> >>
> >> if (static_key_enabled(&mem_profiling_compressed))
> >> return;
> >> @@ -829,14 +860,13 @@ static void __init clear_early_alloc_pfn_tag_refs(void)
> >> /* Make sure we are not racing with __alloc_tag_add_early_pfn() */
> >> synchronize_rcu();
> >>
> >> - for (i = 0; i < atomic_read(&early_pfn_count); i++) {
> >> - unsigned long pfn = early_pfns[i];
> >> + for (ep = early_pfn_list; ep; ep = ep->next) {
> >>
> >> - if (pfn_valid(pfn)) {
> >> - struct page *page = pfn_to_page(pfn);
> >> + if (pfn_valid(ep->pfn)) {
> >> union pgtag_ref_handle handle;
> >> union codetag_ref ref;
> >>
> >> + page = pfn_to_page(ep->pfn);
> >> if (get_page_tag_ref(page, &ref, &handle)) {
> >> /*
> >> * An early-allocated page could be freed and reallocated
> >> @@ -861,6 +891,12 @@ static void __init clear_early_alloc_pfn_tag_refs(void)
> >> }
> >>
> >> }
> >> +
> >> + for (page = early_pfn_pages; page; page = next) {
> >> + next = (struct page *)page->private;
> >> + clear_page_tag_ref(page);
> >> + __free_page(page);
> >> + }
> >> }
> >> #else /* !CONFIG_MEM_ALLOC_PROFILING_DEBUG */
> >> static inline void __init clear_early_alloc_pfn_tag_refs(void) {}
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index 04494bc2e46f..4e2bfb3714e1 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -1284,7 +1284,7 @@ void __clear_page_tag_ref(struct page *page)
> >> /* Should be called only if mem_alloc_profiling_enabled() */
> >> static noinline
> >> void __pgalloc_tag_add(struct page *page, struct task_struct *task,
> >> - unsigned int nr)
> >> + unsigned int nr, gfp_t gfp_flags)
> >> {
> >> union pgtag_ref_handle handle;
> >> union codetag_ref ref;
> >> @@ -1294,21 +1294,30 @@ void __pgalloc_tag_add(struct page *page, struct task_struct *task,
> >> update_page_tag_ref(handle, &ref);
> >> put_page_tag_ref(handle);
> >> } else {
> >> - /*
> >> - * page_ext is not available yet, record the pfn so we can
> >> - * clear the tag ref later when page_ext is initialized.
> >> - */
> >> - alloc_tag_add_early_pfn(page_to_pfn(page));
> >> +
> >> if (task->alloc_tag)
> >> alloc_tag_set_inaccurate(task->alloc_tag);
> >> +
> >> + /*
> >> + * page_ext is not available yet, skip if this allocation
> >> + * doesn't need early PFN recording.
> >> + */
> >> + if (unlikely(!should_record_early_pfn(gfp_flags)))
> >> + return;
> >> +
> >> + /*
> >> + * Record the pfn so the tag ref can be cleared later
> >> + * when page_ext is initialized.
> >> + */
> >> + alloc_tag_add_early_pfn(page_to_pfn(page), gfp_flags);
> > nit: This seems shorter and more readable:
> >
> > if (unlikely(should_record_early_pfn(gfp_flags)))
> > alloc_tag_add_early_pfn(page_to_pfn(page),
> > gfp_flags);
>
> OK, will improve it in the next version.
Thanks,
Suren.
>
>
> Thanks for taking the time to review and provide these suggestions!
>
> Thanks
>
> Best Regards
>
> Hao
>
> >> }
> >> }
> >>
> >> static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
> >> - unsigned int nr)
> >> + unsigned int nr, gfp_t gfp_flags)
> >> {
> >> if (mem_alloc_profiling_enabled())
> >> - __pgalloc_tag_add(page, task, nr);
> >> + __pgalloc_tag_add(page, task, nr, gfp_flags);
> >> }
> >>
> >> /* Should be called only if mem_alloc_profiling_enabled() */
> >> @@ -1341,7 +1350,7 @@ static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr)
> >> #else /* CONFIG_MEM_ALLOC_PROFILING */
> >>
> >> static inline void pgalloc_tag_add(struct page *page, struct task_struct *task,
> >> - unsigned int nr) {}
> >> + unsigned int nr, gfp_t gfp_flags) {}
> >> static inline void pgalloc_tag_sub(struct page *page, unsigned int nr) {}
> >> static inline void pgalloc_tag_sub_pages(struct alloc_tag *tag, unsigned int nr) {}
> >>
> >> @@ -1896,7 +1905,7 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> >>
> >> set_page_owner(page, order, gfp_flags);
> >> page_table_check_alloc(page, order);
> >> - pgalloc_tag_add(page, current, 1 << order);
> >> + pgalloc_tag_add(page, current, 1 << order, gfp_flags);
> >> }
> >>
> >> static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
> >> --
> >> 2.25.1
> >>
Thread overview: 4+ messages (latest: 2026-04-23 4:10 UTC)
2026-04-21 3:14 Hao Ge
2026-04-22 17:32 ` Suren Baghdasaryan
2026-04-23 2:09 ` Hao Ge
2026-04-23 4:10 ` Suren Baghdasaryan [this message]