Re: [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Jiaqi Yan <jiaqiyan@google.com>
To: Harry Yoo <harry.yoo@oracle.com>
Cc: jackmanb@google.com, hannes@cmpxchg.org, linmiaohe@huawei.com,
	 ziy@nvidia.com, willy@infradead.org, nao.horiguchi@gmail.com,
	 david@redhat.com, lorenzo.stoakes@oracle.com,
	william.roche@oracle.com,  tony.luck@intel.com,
	wangkefeng.wang@huawei.com, jane.chu@oracle.com,
	 akpm@linux-foundation.org, osalvador@suse.de,
	muchun.song@linux.dev,  rientjes@google.com, duenwen@google.com,
	jthoughton@google.com,  linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Liam.Howlett@oracle.com,
	 vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com
Subject: Re: [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio
Date: Mon, 5 Jan 2026 19:40:00 -0800	[thread overview]
Message-ID: <CACw3F51ZnjaHtsorT0ao8S5K8Tvzgqb6fT9hr-WBavkqivz+BQ@mail.gmail.com> (raw)
In-Reply-To: <aVSojEWdFPsb1fg7@hyeyoo>

On Tue, Dec 30, 2025 at 8:37 PM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Tue, Dec 30, 2025 at 04:19:50PM -0800, Jiaqi Yan wrote:
> > On Sun, Dec 28, 2025 at 5:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > On Fri, Dec 26, 2025 at 05:50:59PM -0800, Jiaqi Yan wrote:
> > > > On Mon, Dec 22, 2025 at 9:14 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > > > On Fri, Dec 19, 2025 at 06:33:45PM +0000, Jiaqi Yan wrote:
> > > > > > ---
> > > > > >  mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
> > > > > >  1 file changed, 101 insertions(+)
> > > > > >
> > > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > > index 822e05f1a9646..20c8862ce594e 100644
> > > > > > --- a/mm/page_alloc.c
> > > > > > +++ b/mm/page_alloc.c
> > > > > > +/*
> > > > > > + * Given a high-order compound page containing certain number of HWPoison
> > > > > > + * pages, free only the healthy ones to buddy allocator.
> > > > > > + *
> > > > > > + * It calls __free_frozen_pages O(2^order) times and cause nontrivial
> > > > > > + * overhead. So only use this when compound page really contains HWPoison.
> > > > > > + *
> > > > > > + * This implementation doesn't work in memdesc world.
> > > > > > + */
> > > > > > +static void free_has_hwpoison_pages(struct page *page, unsigned int order)
> > > > > > +{
> > > > > > +     struct page *curr = page;
> > > > > > +     struct page *end = page + (1 << order);
> > > > > > +     struct page *next;
> > > > > > +     unsigned long flags = page->flags.f;
> > > > > > +     unsigned long nr_pages;
> > > > > > +     unsigned long total_freed = 0;
> > > > > > +     unsigned long total_hwp = 0;
> > > > > > +
> > > > > > +     VM_WARN_ON(flags & PAGE_FLAGS_CHECK_AT_FREE);
> > > > > > +
> > > > > > +     while (curr < end) {
> > > > > > +             next = curr;
> > > > > > +             nr_pages = 0;
> > > > > > +
> > > > > > +             while (next < end && !PageHWPoison(next)) {
> > > > > > +                     ++next;
> > > > > > +                     ++nr_pages;
> > > > > > +             }
> > > > > > +
> > > > > > +             if (PageHWPoison(next))
> > > > > > +                     ++total_hwp;
> > > > > > +
> > > > > > +             free_contiguous_pages(curr, next, flags);
> > > > >
> > > > > page_owner, memory profiling (anything else?) will be confused
> > > > > because it was allocated as a larger size, but we're freeing only
> > > > > some portion of it.
> > > >
> > > > I am not sure, but looking at __split_unmapped_folio, it calls
> > > > pgalloc_tag_split(folio, old_order, split_order) when splitting an
> > > > old_order-order folio into a new split_order.
> > > >
> > > > Maybe prepare_compound_page_to_free really should
> > > > update_page_tag_ref(), I need to take a closer look at this with
> > > > CONFIG_MEM_ALLOC_PROFILING (not something I usually enable).
> > > >
> > > > > Perhaps we need to run some portion of this code snippet
> > > > > (from free_pages_prepare()), before freeing portions of it:
> > > > >
> > > > >         page_cpupid_reset_last(page);
> > > > >         page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > > > >         reset_page_owner(page, order);
> > > > >         page_table_check_free(page, order);
> > > > >         pgalloc_tag_sub(page, 1 << order);
> > > >
> > > > Since they come from free_pages_prepare, I believe these lines are
> > > > already executed via free_contiguous_pages()=> __free_frozen_pages()=>
> > > > free_pages_prepare(), right? Or am I missing something?
> > >
> > > But they're called with order that is smaller than the original order.
> > > That's could be problematic; for example, memory profiling stores metadata
> > > only on the first page. If you pass anything other than the first page
> > > to free_pages_prepare(), it will not recognize that metadata was stored
> > > during allocation.
> > >
> >
> > Right, with MEM_ALLOC_PROFILING enabled, I ran into the following
> > WARNING when freeing all blocks except the 1st one (which contains the
> > original head page):
> >
> > [ 2101.713669] ------------[ cut here ]------------
> > [ 2101.713670] alloc_tag was not set
> > [ 2101.713671] WARNING: ./include/linux/alloc_tag.h:164 at
> > __pgalloc_tag_sub+0xdf/0x160, CPU#18: hugetlb-mfr-3pa/33675
> > [ 2101.713693] CPU: 18 UID: 0 PID: 33675 Comm: hugetlb-mfr-3pa
> > Tainted: G S W O 6.19.0-smp-DEV #2 NONE
> > [ 2101.713698] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN, [O]=OOT_MODULE
> > [ 2101.713702] RIP: 0010:__pgalloc_tag_sub+0xdf/0x160
> > ...
> > [ 2101.713723] Call Trace:
> > [ 2101.713725] <TASK>
> > [ 2101.713727] free_has_hwpoison_pages+0xbc/0x370
> > [ 2101.713731] free_frozen_pages+0xb3/0x100
> > [ 2101.713733] __folio_put+0xd5/0x100
> > [ 2101.713739] dissolve_free_hugetlb_folio+0x17f/0x1a0
> > [ 2101.713743] filemap_offline_hwpoison_folio+0x193/0x4c0
> > [ 2101.713747] ? __pfx_workingset_update_node+0x10/0x10
> > [ 2101.713751] remove_inode_hugepages+0x209/0x690
> > [ 2101.713757] ? on_each_cpu_cond_mask+0x1a/0x20
> > [ 2101.713760] ? __cond_resched+0x23/0x60
> > [ 2101.713768] ? n_tty_write+0x4c7/0x500
> > [ 2101.713773] hugetlbfs_setattr+0x127/0x170
> > [ 2101.713776] notify_change+0x32e/0x390
> > [ 2101.713781] do_ftruncate+0x12c/0x1a0
> > [ 2101.713786] __x64_sys_ftruncate+0x3e/0x70
> > [ 2101.713789] do_syscall_64+0x6f/0x890
> > [ 2101.713792] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 2101.713811] </TASK>
> > [ 2101.713812] ---[ end trace 0000000000000000 ]---
> >
> > This is because in free_pages_prepare(), pgalloc_tag_sub() found there
> > is no alloc tag on the compound page being freeing.
> >
> > > In general, I think they're not designed to handle cases where
> > > the allocation order and the free order differ (unless we split
> > > metadata like __split_unmapped_folio() does).
> >
> > I believe the proper way to fix this is to do something similar to
> > pgalloc_tag_split(), used by __split_unmapped_folio().
>
> Yeah, that will work.
>
> The difference is that pgalloc_tag_split(), split_page_owner(), and
> split_page_memcg() only support splitting the whole page into many (> 1)
> smaller pieces, but we're trying to create only one smaller page from the
> original page.
>
> > When we split a
> > new block from the original folio, we create a compound page from the
> > block (just the way prep_compound_page_to_free does) and link the
> > alloc tag of the original head page to the head of the new compound
> > page.
> >
> > Something like copy_alloc_tag (to be added in v3) below to demo
> > my idea, assuming tag=pgalloc_tag_get(original head page):

Actually I had another idea for page metadata like page alloc tag and
page owner - how about we do the special handling AFTER
free_pages_prepare()?

Looking at free_pages_prepare, it actually breaks down compound page
(if it is compound, using free_tail_page_prepare), adjust page flags,
and deals page metadata:

for (i = 1; i < (1 << order); i++) {
    if (compound)
        bad += free_tail_page_prepare(page, page + i);
...
page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
reset_page_owner(page, order);
page_table_check_free(page, order);
pgalloc_tag_sub(page, 1 << order);

Even the original compound page has some HWPoison pages, these
operations should be done in the same way when no HWPoison page exist.
The compound structure will be gone after free, the alloc tag will be
reduced after free, the page owner will be reset after free. So
instead of splitting these stuff into temporary compound pages, why
not getting rid of them in the first place with free_pages_prepare()?
This is unique to the freeing path, so we don't need to mimic
everything __split_unmapped_folio() does.

At high level, applying the idea to __free_pages_ok (or
__free_frozen_pages) looks like this:

+static bool compound_has_hwpoisoned(struct page *page, unsigned int order)
 {
+       if (order == 0 || !PageCompound(page))
+               return false;
+
+       return folio_test_has_hwpoisoned(page_folio(page));
+}

static void __free_pages_ok(struct page *page, unsigned int order,
 {
        unsigned long pfn = page_to_pfn(page);
        struct zone *zone = page_zone(page);
+       bool should_fhh = compound_has_hwpoisoned(page, order);

-       if (free_pages_prepare(page, order))
-               free_one_page(zone, page, pfn, order, fpi_flags);
+       if (!free_pages_prepare(page, order))
+               return;
+
+       if (should_fhh) {
+               free_has_hwpoisoned(page, order, fpi_flags);
+               return;
+       }
+
+       free_one_page(zone, page, pfn, order, fpi_flags);
 }

Under this idea, free_has_hwpoisoned() doesn't need
prepare_compound_page_to_free() anymore. It can just directly call
free_one_page() when it selects the right order.

The only imperfect part is compound_has_hwpoisoned(), that we need to
make a call to whether free_has_hwpoisoned() or not ealier than
free_pages_prepare(). free_pages_prepare() clears PAGE_FLAGS_SECOND
flags on the first tail page of a compound, which clears
PG_has_hwpoisoned. On the other hand, we can't easily exclude
PG_has_hwpoisoned from PAGE_FLAGS_SECOND. Because PG_has_hwpoisoned ==
PG_active, free_page_is_bad() will confuse and complaint that the
first tail page is still active.

I have implemented a working prototype and tested just fine. I can
send it out as v3 after clean it up.

> >
> > +/*
> > + * Point page's alloc tag to an existing one.
> > + */
> > +static void copy_alloc_tag(struct page *page, struct alloc_tag *tag)
>
> This should be defined in lib/alloc_tag.c
>
> > +{
> > +       union pgtag_ref_handle handle;
> > +       union codetag_ref ref;
> > +       unsigned long pfn = page_to_pfn(page);
> > +
> > +       if (!mem_alloc_profiling_enabled())
> > +               return;
> > +
> > +       /* tag is NULL if HugeTLB page is allocated in boot process. */
> > +       if (!tag)
> > +               return;
> > +
> > +       if (!get_page_tag_ref(page, &ref, &handle))
> > +               return;
> > +
> > +       /* Avoid overriding existing alloc tag from page. */
> > +       if (!ref.ct || is_codetag_empty(&ref)) {
>
> Is it an error if the page already has a valid tag?

Probably not important anymore, but I don't think it is an error. The
split block having the original folio's head page will have a valid
tag. We don't need to set + update for this block; otherwise
alloc_tag_add_check will WARN().

>
> > +               alloc_tag_ref_set(&ref, tag);
> > +               update_page_tag_ref(handle, &ref);
> > +       }
> > +       put_page_tag_ref(handle);
> > +}
> > +
> > +static void prep_compound_page_to_free(struct page *head, unsigned int order,
> > +                                      unsigned long flags, struct
> > alloc_tag *tag)
> > +{
> > +       head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> > +       head->mapping = NULL;
> > +       head->private = 0;
> > +
> > +       clear_compound_head(head);
> > +       if (order)
> > +               prep_compound_page(head, order);
> > +
> > +       copy_alloc_tag(head, tag);
>
> Do we need a similar thing for memcg? Probably not, since it should have

Yes, I think we don't need to do anything for memgc, it is already
uncharged in __folio_put().

> been uncharged before freeing (as long as we're not using it for kmem pages)

You mean the folio was charged to kernel memory, right? From my
reading of uncharge_folio(), it covers the kmem case.

> Perhaps a comment on why we don't need to split memcg will be enough.
>
> Do we need a similar thing for page_owner? I think yes, although it won't
> crash or cause a warning, it will be inconsistent if we split page_owner in
> split_page()/__split_unmapped_folio()/etc but not in
> prep_compound_page_to_free().
>
> > +}
> >
> > I tested now the WARNING from include/linux/alloc_tag.h:164 is gone
> > for both 2M and 1G pages. BTW we also need to copy_alloc_tag() for
> > HWPoison pages before pgalloc_tag_sub().
> >
> > > > > > +             total_freed += nr_pages;
> > > > > > +             curr = PageHWPoison(next) ? next + 1 : next;
> > > > > > +     }
> > > > > > +
> > > > > > +     pr_info("Excluded %lu hwpoison pages from folio\n", total_hwp);
> > > > > > +     pr_info("Freed %#lx pages from folio\n", total_freed);
> > > > > > +}
> > > > > > +
> > > > > >  void free_frozen_pages(struct page *page, unsigned int order)
> > > > > >  {
> > > > > > +     struct folio *folio = page_folio(page);
> > > > > > +
> > > > > > +     if (order > 0 && unlikely(folio_test_has_hwpoisoned(folio))) {
> > > > > > +             folio_clear_has_hwpoisoned(folio);
> > > > > > +             free_has_hwpoison_pages(page, order);
> > > > > > +             return;
> > > > > > +     }
> > > > > > +
> > > > >
> > > > > It feels like it's a bit random place to do has_hwpoisoned check.
> > > > > Can we move this to free_pages_prepare() where we have some
> > > > > sanity checks (and also order-0 hwpoison page handling)?
> > > >
> > > > While free_pages_prepare() seems to be a better place to do the
> > > > has_hwpoisoned check, it is not a good place to do
> > > > free_has_hwpoison_pages().
> > >
> > > Why is it not a good place for free_has_hwpoison_pages()?
> > >
> > > Callers of free_pages_prepare() are supposed to avoid freeing it back to
> > > the buddy or using the page when it returns false.
> >
> > What I mean is, callers of free_pages_prepare() wouldn't know from the
> > single false return value whether 1) they should completely bail out
> > or 2) they should retry with free_has_hwpoison_pages.
>
> I was thinking that once free_pages_prepare() returns false, it already
> has done all the work to isolate HWPoison pages and freed healthy portions,
> so the caller doesn't have to do anything and can bail out completely.

Does this change the meaning of free_pages_prepare()'s return value?
My first take is it returns if the preparation+check works are good. I
don't know if doing the freeing work in free_pages_prepare() is a good
idea. But if it is fine, we can actually hide the ugly
compound_has_hwpoisoned() entirely inside free_pages_prepare():
- true means preparation + check work are all good but caller needs to
free prepared+checked pages itself
- false means one of the two:
  - some check failed and it is impossible to safely free pages for callers
  - this is a compound page with some HWPoison pages, and healthy
pages are already freed safely

Doesn't seem very clean with a boolean return type...

>
> > So for now I
> > think it'd better that free_frozen_pages does the check and act upon
> > the check result. Not sure if there is a better place, or worthy to
> > change free_pages_prepare()'s return type?
> >
> > > ...except compaction_free(), which I don't have much idea what it's
> > > doing.
> > >
> > > > > >       __free_frozen_pages(page, order, FPI_NONE);
> > > > > >  }
>
> --
> Cheers,
> Harry / Hyeonggon

next prev parent reply	other threads:[~2026-01-06  3:40 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-19 18:33 [PATCH v2 0/3] Only " Jiaqi Yan
2025-12-19 18:33 ` [PATCH v2 1/3] mm/memory-failure: set has_hwpoisoned flags on HugeTLB folio Jiaqi Yan
2025-12-19 18:33 ` [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio Jiaqi Yan
2025-12-23  5:14   ` Harry Yoo
2025-12-27  1:50     ` Jiaqi Yan
2025-12-29  1:15       ` Harry Yoo
2025-12-31  0:19         ` Jiaqi Yan
2025-12-31  4:37           ` Harry Yoo
2026-01-06  3:40             ` Jiaqi Yan [this message]
2026-01-06  7:35               ` Harry Yoo
2025-12-23  7:45   ` Miaohe Lin
2025-12-27  1:50     ` Jiaqi Yan
2025-12-19 18:33 ` [PATCH v2 3/3] mm/memory-failure: simplify __page_handle_poison Jiaqi Yan
2025-12-22 22:12   ` Matthew Wilcox
2025-12-22 23:13     ` Jiaqi Yan
2025-12-22 22:06 ` [PATCH v2 0/3] Only free healthy pages in high-order HWPoison folio Jiaqi Yan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CACw3F51ZnjaHtsorT0ao8S5K8Tvzgqb6fT9hr-WBavkqivz+BQ@mail.gmail.com \
    --to=jiaqiyan@google.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=duenwen@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=harry.yoo@oracle.com \
    --cc=jackmanb@google.com \
    --cc=jane.chu@oracle.com \
    --cc=jthoughton@google.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mhocko@suse.com \
    --cc=muchun.song@linux.dev \
    --cc=nao.horiguchi@gmail.com \
    --cc=osalvador@suse.de \
    --cc=rientjes@google.com \
    --cc=rppt@kernel.org \
    --cc=surenb@google.com \
    --cc=tony.luck@intel.com \
    --cc=vbabka@suse.cz \
    --cc=wangkefeng.wang@huawei.com \
    --cc=william.roche@oracle.com \
    --cc=willy@infradead.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox