linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Harry Yoo <harry.yoo@oracle.com>
To: Jiaqi Yan <jiaqiyan@google.com>
Cc: jackmanb@google.com, hannes@cmpxchg.org, linmiaohe@huawei.com,
	ziy@nvidia.com, willy@infradead.org, nao.horiguchi@gmail.com,
	david@redhat.com, lorenzo.stoakes@oracle.com,
	william.roche@oracle.com, tony.luck@intel.com,
	wangkefeng.wang@huawei.com, jane.chu@oracle.com,
	akpm@linux-foundation.org, osalvador@suse.de,
	muchun.song@linux.dev, rientjes@google.com, duenwen@google.com,
	jthoughton@google.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Liam.Howlett@oracle.com,
	vbabka@suse.cz, rppt@kernel.org, surenb@google.com,
	mhocko@suse.com
Subject: Re: [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio
Date: Tue, 6 Jan 2026 16:35:11 +0900	[thread overview]
Message-ID: <aVy7L-3pc4JUYBEn@hyeyoo> (raw)
In-Reply-To: <CACw3F51ZnjaHtsorT0ao8S5K8Tvzgqb6fT9hr-WBavkqivz+BQ@mail.gmail.com>

On Mon, Jan 05, 2026 at 07:40:00PM -0800, Jiaqi Yan wrote:
> On Tue, Dec 30, 2025 at 8:37 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > On Tue, Dec 30, 2025 at 04:19:50PM -0800, Jiaqi Yan wrote:
> > > On Sun, Dec 28, 2025 at 5:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > > On Fri, Dec 26, 2025 at 05:50:59PM -0800, Jiaqi Yan wrote:
> > > > > On Mon, Dec 22, 2025 at 9:14 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > > > > On Fri, Dec 19, 2025 at 06:33:45PM +0000, Jiaqi Yan wrote:
> > > > In general, I think they're not designed to handle cases where
> > > > the allocation order and the free order differ (unless we split
> > > > metadata like __split_unmapped_folio() does).
> > >
> > > I believe the proper way to fix this is to do something similar to
> > > pgalloc_tag_split(), used by __split_unmapped_folio().
> >
> > Yeah, that will work.
> >
> > The difference is that pgalloc_tag_split(), split_page_owner(), and
> > split_page_memcg() only support splitting the whole page into many (> 1)
> > smaller pieces, but we're trying to create only one smaller page from the
> > original page.
> >
> > > When we split a
> > > new block from the original folio, we create a compound page from the
> > > block (just the way prep_compound_page_to_free does) and link the
> > > alloc tag of the original head page to the head of the new compound
> > > page.
> > >
> > > Something like copy_alloc_tag (to be added in v3) below to demo
> > > my idea, assuming tag=pgalloc_tag_get(original head page):
> 
> Actually I had another idea for page metadata like page alloc tag and
> page owner - how about we do the special handling AFTER
> free_pages_prepare()?

Sounds good to me.

> Looking at free_pages_prepare, it actually breaks down compound page
> (if it is compound, using free_tail_page_prepare), adjust page flags,
> and deals page metadata:
> 
> for (i = 1; i < (1 << order); i++) {
>     if (compound)
>         bad += free_tail_page_prepare(page, page + i);
> ...
> page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> reset_page_owner(page, order);
> page_table_check_free(page, order);
> pgalloc_tag_sub(page, 1 << order);
> 
> Even the original compound page has some HWPoison pages, these
> operations should be done in the same way when no HWPoison page exist.

and we're not losing hwpoison flag after free_pages_prepare()...
that actually sounds fine.

> The compound structure will be gone after free, the alloc tag will be
> reduced after free, the page owner will be reset after free.

Right.

> So instead of splitting these stuff into temporary compound pages, why
> not getting rid of them in the first place with free_pages_prepare()?
>
> This is unique to the freeing path, so we don't need to mimic
> everything __split_unmapped_folio() does.

Not 100% sure how it'd work in the memdesc world,
but for today, right.

> At high level, applying the idea to __free_pages_ok (or
> __free_frozen_pages) looks like this:
> 
> +static bool compound_has_hwpoisoned(struct page *page, unsigned int order)
>  {
> +       if (order == 0 || !PageCompound(page))
> +               return false;
> +
> +       return folio_test_has_hwpoisoned(page_folio(page));
> +}
> 
> static void __free_pages_ok(struct page *page, unsigned int order,
>  {
>         unsigned long pfn = page_to_pfn(page);
>         struct zone *zone = page_zone(page);
> +       bool should_fhh = compound_has_hwpoisoned(page, order);
> 
> -       if (free_pages_prepare(page, order))
> -               free_one_page(zone, page, pfn, order, fpi_flags);
> +       if (!free_pages_prepare(page, order))
> +               return;
> +
> +       if (should_fhh) {
> +               free_has_hwpoisoned(page, order, fpi_flags);
> +               return;
> +       }
> +
> +       free_one_page(zone, page, pfn, order, fpi_flags);
>  }
> 
> Under this idea, free_has_hwpoisoned() doesn't need
> prepare_compound_page_to_free() anymore. It can just directly call
> free_one_page() when it selects the right order.

Right.

> The only imperfect part is compound_has_hwpoisoned(), that we need to
> make a call to whether free_has_hwpoisoned() or not ealier than
> free_pages_prepare().

Right.

> free_pages_prepare() clears PAGE_FLAGS_SECOND
> flags on the first tail page of a compound, which clears
> PG_has_hwpoisoned. On the other hand, we can't easily exclude
> PG_has_hwpoisoned from PAGE_FLAGS_SECOND. Because PG_has_hwpoisoned ==
> PG_active, free_page_is_bad() will confuse and complaint that the
> first tail page is still active.

Adding a comment like this wouldn't hurt:

/*
 * Should be called before decomposing compound page in
 * free_pages_prepare().
 */
bool should_fhh = compound_has_hwpoisoned(page, order);

> I have implemented a working prototype and tested just fine. I can
> send it out as v3 after clean it up.
> 
> > > +static void prep_compound_page_to_free(struct page *head, unsigned int order,
> > > +                                      unsigned long flags, struct
> > > alloc_tag *tag)
> > > +{
> > > +       head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> > > +       head->mapping = NULL;
> > > +       head->private = 0;
> > > +
> > > +       clear_compound_head(head);
> > > +       if (order)
> > > +               prep_compound_page(head, order);
> > > +
> > > +       copy_alloc_tag(head, tag);
> >
> > Do we need a similar thing for memcg? Probably not, since it should have
> 
> Yes, I think we don't need to do anything for memgc, it is already
> uncharged in __folio_put().
> 
> > been uncharged before freeing (as long as we're not using it for kmem pages)
> 
> You mean the folio was charged to kernel memory, right? From my
> reading of uncharge_folio(), it covers the kmem case.

Yes but I think not all kmem pages go through uncharge_folio() given
free_pages_prepare() handles uncharging kmem pages.

> > Perhaps a comment on why we don't need to split memcg will be enough.
> >
> > Do we need a similar thing for page_owner? I think yes, although it won't
> > crash or cause a warning, it will be inconsistent if we split page_owner in
> > split_page()/__split_unmapped_folio()/etc but not in
> > prep_compound_page_to_free().
> >
> > > +}
> > >
> > > I tested now the WARNING from include/linux/alloc_tag.h:164 is gone
> > > for both 2M and 1G pages. BTW we also need to copy_alloc_tag() for
> > > HWPoison pages before pgalloc_tag_sub().
> > >
> > > > > > > +             total_freed += nr_pages;
> > > > > > > +             curr = PageHWPoison(next) ? next + 1 : next;
> > > > > > > +     }
> > > > > > > +
> > > > > > > +     pr_info("Excluded %lu hwpoison pages from folio\n", total_hwp);
> > > > > > > +     pr_info("Freed %#lx pages from folio\n", total_freed);
> > > > > > > +}
> > > > > > > +
> > > > > > >  void free_frozen_pages(struct page *page, unsigned int order)
> > > > > > >  {
> > > > > > > +     struct folio *folio = page_folio(page);
> > > > > > > +
> > > > > > > +     if (order > 0 && unlikely(folio_test_has_hwpoisoned(folio))) {
> > > > > > > +             folio_clear_has_hwpoisoned(folio);
> > > > > > > +             free_has_hwpoison_pages(page, order);
> > > > > > > +             return;
> > > > > > > +     }
> > > > > > > +
> > > > > >
> > > > > > It feels like it's a bit random place to do has_hwpoisoned check.
> > > > > > Can we move this to free_pages_prepare() where we have some
> > > > > > sanity checks (and also order-0 hwpoison page handling)?
> > > > >
> > > > > While free_pages_prepare() seems to be a better place to do the
> > > > > has_hwpoisoned check, it is not a good place to do
> > > > > free_has_hwpoison_pages().
> > > >
> > > > Why is it not a good place for free_has_hwpoison_pages()?
> > > >
> > > > Callers of free_pages_prepare() are supposed to avoid freeing it back to
> > > > the buddy or using the page when it returns false.
> > >
> > > What I mean is, callers of free_pages_prepare() wouldn't know from the
> > > single false return value whether 1) they should completely bail out
> > > or 2) they should retry with free_has_hwpoison_pages.
> >
> > I was thinking that once free_pages_prepare() returns false, it already
> > has done all the work to isolate HWPoison pages and freed healthy portions,
> > so the caller doesn't have to do anything and can bail out completely.
> 
> Does this change the meaning of free_pages_prepare()'s return value?
> My first take is it returns if the preparation+check works are good.
>
> I don't know if doing the freeing work in free_pages_prepare() is a good
> idea.

As long as it's documented in a comment, why not.

It doesn't make callers' code less readable. The semantic is the same;
callers should free the page only if free_pages_prepare() returns true.

They don't have to care why they should not free it.
(either it should not be freed back to buddy because it's bad page,
 or it had hwpoison page(s) and healthy portions are already freed)

> But if it is fine, we can actually hide the ugly
> compound_has_hwpoisoned() entirely inside free_pages_prepare():
> - true means preparation + check work are all good but caller needs to
> free prepared+checked pages itself
> - false means one of the two:
>   - some check failed and it is impossible to safely free pages for callers
>   - this is a compound page with some HWPoison pages, and healthy
>     pages are already freed safely

Yes, something like this could be documented in free_pages_prepare().

> Doesn't seem very clean with a boolean return type...

The return type could be an enum type, but that wouldn't affect
the caller's behavior (bailing out) anyway?

No strong opinion from me on the type.

-- 
Cheers,
Harry / Hyeonggon


  reply	other threads:[~2026-01-06  7:36 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-12-19 18:33 [PATCH v2 0/3] Only " Jiaqi Yan
2025-12-19 18:33 ` [PATCH v2 1/3] mm/memory-failure: set has_hwpoisoned flags on HugeTLB folio Jiaqi Yan
2025-12-19 18:33 ` [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio Jiaqi Yan
2025-12-23  5:14   ` Harry Yoo
2025-12-27  1:50     ` Jiaqi Yan
2025-12-29  1:15       ` Harry Yoo
2025-12-31  0:19         ` Jiaqi Yan
2025-12-31  4:37           ` Harry Yoo
2026-01-06  3:40             ` Jiaqi Yan
2026-01-06  7:35               ` Harry Yoo [this message]
2025-12-23  7:45   ` Miaohe Lin
2025-12-27  1:50     ` Jiaqi Yan
2025-12-19 18:33 ` [PATCH v2 3/3] mm/memory-failure: simplify __page_handle_poison Jiaqi Yan
2025-12-22 22:12   ` Matthew Wilcox
2025-12-22 23:13     ` Jiaqi Yan
2025-12-22 22:06 ` [PATCH v2 0/3] Only free healthy pages in high-order HWPoison folio Jiaqi Yan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aVy7L-3pc4JUYBEn@hyeyoo \
    --to=harry.yoo@oracle.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=david@redhat.com \
    --cc=duenwen@google.com \
    --cc=hannes@cmpxchg.org \
    --cc=jackmanb@google.com \
    --cc=jane.chu@oracle.com \
    --cc=jiaqiyan@google.com \
    --cc=jthoughton@google.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mhocko@suse.com \
    --cc=muchun.song@linux.dev \
    --cc=nao.horiguchi@gmail.com \
    --cc=osalvador@suse.de \
    --cc=rientjes@google.com \
    --cc=rppt@kernel.org \
    --cc=surenb@google.com \
    --cc=tony.luck@intel.com \
    --cc=vbabka@suse.cz \
    --cc=wangkefeng.wang@huawei.com \
    --cc=william.roche@oracle.com \
    --cc=willy@infradead.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox