[PATCH v2 0/3] Only free healthy pages in high-order HWPoison folio

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/3] Only free healthy pages in high-order HWPoison folio
@ 2025-12-19 18:33 Jiaqi Yan
  2025-12-19 18:33 ` [PATCH v2 1/3] mm/memory-failure: set has_hwpoisoned flags on HugeTLB folio Jiaqi Yan
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Jiaqi Yan @ 2025-12-19 18:33 UTC (permalink / raw)
  To: jackmanb, hannes, linmiaohe, ziy, harry.yoo, willy
  Cc: nao.horiguchi, david, lorenzo.stoakes, william.roche, tony.luck,
	wangkefeng.wang, jane.chu, akpm, osalvador, muchun.song,
	rientjes, duenwen, jthoughton, linux-mm, linux-kernel,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, Jiaqi Yan

At the end of dissolve_free_hugetlb_folio that a free HugeTLB
folio becomes non-HugeTLB, it is released to buddy allocator
as a high-order folio, e.g. a folio that contains 262144 pages
if the folio was a 1G HugeTLB hugepage.

This is problematic if the HugeTLB hugepage contained HWPoison
subpages. In that case, since buddy allocator does not check
HWPoison for non-zero-order folio, the raw HWPoison page can
be given out with its buddy page and be re-used by either
kernel or userspace.

Memory failure recovery (MFR) in kernel does attempt to take
raw HWPoison page off buddy allocator after
dissolve_free_hugetlb_folio. However, there is always a time
window between dissolve_free_hugetlb_folio frees a HWPoison
high-order folio to buddy allocator and MFR takes HWPoison
raw page off buddy allocator.

One obvious way to avoid this problem is to add page sanity
checks in page allocate or free path. However, it is against
the past efforts to reduce sanity check overhead [1,2,3].

Introduce free_has_hwpoison_pages to only free the healthy
pages and excludes the HWPoison ones in the high-order folio.
The idea is to iterate through the sub-pages of the folio to
identify contiguous ranges of healthy pages. Instead of freeing
pages one by one, decompose healthy ranges into the largest
possible blocks. Each block meets the requirements to be freed
to buddy allocator by calling __free_frozen_pages directly.

free_has_hwpoison_pages has linear time complexity O(N) wrt the
number of pages in the folio. While the power-of-two decomposition
ensures that the number of calls to the buddy allocator is
logarithmic for each contiguous healthy range, the mandatory
linear scan of pages to identify PageHWPoison defines the
overall time complexity.

I tested with some test-only code [4] and hugetlb-mfr [5], by
checking the status of pcplist and freelist immediately after
dissolve_free_hugetlb_folio a free hugetlb page that contains
3 HWPoison raw pages:

* HWPoison pages are excluded by free_has_hwpoison_pages.

* Some healthy pages can be in zone->per_cpu_pageset (pcplist)
  because pcp_count is not high enough.

* Many healthy pages are already in some order's
  zone->free_area[order].free_list (freelist).

* In rare cases, some healthy pages are in neither pcplist
  nor freelist. My best guest is they are allocated before
  the test checks.

[1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
[2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
[3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
[4] https://drive.google.com/file/d/1CzJn1Cc4wCCm183Y77h244fyZIkTLzCt/view?usp=sharing
[5] https://lore.kernel.org/linux-mm/20251116013223.1557158-3-jiaqiyan@google.com

Jiaqi Yan (3):
  mm/memory-failure: set has_hwpoisoned flags on HugeTLB folio
  mm/page_alloc: only free healthy pages in high-order HWPoison folio
  mm/memory-failure: simplify __page_handle_poison

 include/linux/page-flags.h |   2 +-
 mm/memory-failure.c        |  32 +++---------
 mm/page_alloc.c            | 101 +++++++++++++++++++++++++++++++++++++
 3 files changed, 108 insertions(+), 27 deletions(-)

-- 
2.52.0.322.g1dd061c0dc-goog

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 1/3] mm/memory-failure: set has_hwpoisoned flags on HugeTLB folio
  2025-12-19 18:33 [PATCH v2 0/3] Only free healthy pages in high-order HWPoison folio Jiaqi Yan
@ 2025-12-19 18:33 ` Jiaqi Yan
  2025-12-19 18:33 ` [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio Jiaqi Yan
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 14+ messages in thread
From: Jiaqi Yan @ 2025-12-19 18:33 UTC (permalink / raw)
  To: jackmanb, hannes, linmiaohe, ziy, harry.yoo, willy
  Cc: nao.horiguchi, david, lorenzo.stoakes, william.roche, tony.luck,
	wangkefeng.wang, jane.chu, akpm, osalvador, muchun.song,
	rientjes, duenwen, jthoughton, linux-mm, linux-kernel,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, Jiaqi Yan

When a free HWPoison HugeTLB folio is dissolved, it becomes
non-HugeTLB and is released to buddy allocator as a high-order
folio.

Set has_hwpoisoned flags on the high-order folio so that buddy
allocator can tell that it contains certain HWPoison page(s).
This is a prepare change for buddy allocator to handle only the
high-order HWPoison folio differently.

This cannot be done with hwpoison flag because users cannot tell
from the case that the page with hwpoison is hardware corrupted.

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 include/linux/page-flags.h | 2 +-
 mm/memory-failure.c        | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f7a0e4af0c734..d13835e265952 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -904,7 +904,7 @@ static inline int PageTransCompound(const struct page *page)
 TESTPAGEFLAG_FALSE(TransCompound, transcompound)
 #endif
 
-#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
+#if defined(CONFIG_MEMORY_FAILURE) && (defined(CONFIG_TRANSPARENT_HUGEPAGE) || defined(CONFIG_HUGETLB_PAGE))
 /*
  * PageHasHWPoisoned indicates that at least one subpage is hwpoisoned in the
  * compound page.
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fbc5a01260c89..d204de6c9792a 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1952,6 +1952,7 @@ void folio_clear_hugetlb_hwpoison(struct folio *folio)
 	if (folio_test_hugetlb_vmemmap_optimized(folio))
 		return;
 	folio_clear_hwpoison(folio);
+	folio_set_has_hwpoisoned(folio);
 	folio_free_raw_hwp(folio, true);
 }
 
-- 
2.52.0.322.g1dd061c0dc-goog



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio
  2025-12-19 18:33 [PATCH v2 0/3] Only free healthy pages in high-order HWPoison folio Jiaqi Yan
  2025-12-19 18:33 ` [PATCH v2 1/3] mm/memory-failure: set has_hwpoisoned flags on HugeTLB folio Jiaqi Yan
@ 2025-12-19 18:33 ` Jiaqi Yan
  2025-12-23  5:14   ` Harry Yoo
  2025-12-23  7:45   ` Miaohe Lin
  2025-12-19 18:33 ` [PATCH v2 3/3] mm/memory-failure: simplify __page_handle_poison Jiaqi Yan
  2025-12-22 22:06 ` [PATCH v2 0/3] Only free healthy pages in high-order HWPoison folio Jiaqi Yan
  3 siblings, 2 replies; 14+ messages in thread
From: Jiaqi Yan @ 2025-12-19 18:33 UTC (permalink / raw)
  To: jackmanb, hannes, linmiaohe, ziy, harry.yoo, willy
  Cc: nao.horiguchi, david, lorenzo.stoakes, william.roche, tony.luck,
	wangkefeng.wang, jane.chu, akpm, osalvador, muchun.song,
	rientjes, duenwen, jthoughton, linux-mm, linux-kernel,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, Jiaqi Yan

At the end of dissolve_free_hugetlb_folio that a free HugeTLB
folio becomes non-HugeTLB, it is released to buddy allocator
as a high-order folio, e.g. a folio that contains 262144 pages
if the folio was a 1G HugeTLB hugepage.

This is problematic if the HugeTLB hugepage contained HWPoison
subpages. In that case, since buddy allocator does not check
HWPoison for non-zero-order folio, the raw HWPoison page can
be given out with its buddy page and be re-used by either
kernel or userspace.

Memory failure recovery (MFR) in kernel does attempt to take
raw HWPoison page off buddy allocator after
dissolve_free_hugetlb_folio. However, there is always a time
window between dissolve_free_hugetlb_folio frees a HWPoison
high-order folio to buddy allocator and MFR takes HWPoison
raw page off buddy allocator.

One obvious way to avoid this problem is to add page sanity
checks in page allocate or free path. However, it is against
the past efforts to reduce sanity check overhead [1,2,3].

Introduce free_has_hwpoison_pages to only free the healthy
pages and excludes the HWPoison ones in the high-order folio.
The idea is to iterate through the sub-pages of the folio to
identify contiguous ranges of healthy pages. Instead of freeing
pages one by one, decompose healthy ranges into the largest
possible blocks. Each block meets the requirements to be freed
to buddy allocator (__free_frozen_pages).

free_has_hwpoison_pages has linear time complexity O(N) wrt the
number of pages in the folio. While the power-of-two decomposition
ensures that the number of calls to the buddy allocator is
logarithmic for each contiguous healthy range, the mandatory
linear scan of pages to identify PageHWPoison defines the
overall time complexity.

[1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
[2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
[3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 101 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 822e05f1a9646..20c8862ce594e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2976,8 +2976,109 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
 	}
 }
 
+static void prepare_compound_page_to_free(struct page *new_head,
+					  unsigned int order,
+					  unsigned long flags)
+{
+	new_head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
+	new_head->mapping = NULL;
+	new_head->private = 0;
+
+	clear_compound_head(new_head);
+	if (order)
+		prep_compound_page(new_head, order);
+}
+
+/*
+ * Given a range of pages physically contiguous physical, efficiently
+ * free them in blocks that meet __free_frozen_pages's requirements.
+ */
+static void free_contiguous_pages(struct page *curr, struct page *next,
+				  unsigned long flags)
+{
+	unsigned int order;
+	unsigned int align_order;
+	unsigned int size_order;
+	unsigned long pfn;
+	unsigned long end_pfn = page_to_pfn(next);
+	unsigned long remaining;
+
+	/*
+	 * This decomposition algorithm at every iteration chooses the
+	 * order to be the minimum of two constraints:
+	 * - Alignment: the largest power-of-two that divides current pfn.
+	 * - Size: the largest power-of-two that fits in the
+	 *   current remaining number of pages.
+	 */
+	while (curr < next) {
+		pfn = page_to_pfn(curr);
+		remaining = end_pfn - pfn;
+
+		align_order = ffs(pfn) - 1;
+		size_order = fls_long(remaining) - 1;
+		order = min(align_order, size_order);
+
+		prepare_compound_page_to_free(curr, order, flags);
+		__free_frozen_pages(curr, order, FPI_NONE);
+		curr += (1UL << order);
+	}
+
+	VM_WARN_ON(curr != next);
+}
+
+/*
+ * Given a high-order compound page containing certain number of HWPoison
+ * pages, free only the healthy ones to buddy allocator.
+ *
+ * It calls __free_frozen_pages O(2^order) times and cause nontrivial
+ * overhead. So only use this when compound page really contains HWPoison.
+ *
+ * This implementation doesn't work in memdesc world.
+ */
+static void free_has_hwpoison_pages(struct page *page, unsigned int order)
+{
+	struct page *curr = page;
+	struct page *end = page + (1 << order);
+	struct page *next;
+	unsigned long flags = page->flags.f;
+	unsigned long nr_pages;
+	unsigned long total_freed = 0;
+	unsigned long total_hwp = 0;
+
+	VM_WARN_ON(flags & PAGE_FLAGS_CHECK_AT_FREE);
+
+	while (curr < end) {
+		next = curr;
+		nr_pages = 0;
+
+		while (next < end && !PageHWPoison(next)) {
+			++next;
+			++nr_pages;
+		}
+
+		if (PageHWPoison(next))
+			++total_hwp;
+
+		free_contiguous_pages(curr, next, flags);
+
+		total_freed += nr_pages;
+		curr = PageHWPoison(next) ? next + 1 : next;
+	}
+
+	pr_info("Excluded %lu hwpoison pages from folio\n", total_hwp);
+	pr_info("Freed %#lx pages from folio\n", total_freed);
+}
+
 void free_frozen_pages(struct page *page, unsigned int order)
 {
+	struct folio *folio = page_folio(page);
+
+	if (order > 0 && unlikely(folio_test_has_hwpoisoned(folio))) {
+		folio_clear_has_hwpoisoned(folio);
+		free_has_hwpoison_pages(page, order);
+		return;
+	}
+
 	__free_frozen_pages(page, order, FPI_NONE);
 }
 
-- 
2.52.0.322.g1dd061c0dc-goog



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio
  2025-12-19 18:33 ` [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio Jiaqi Yan
@ 2025-12-23  5:14   ` Harry Yoo
  2025-12-27  1:50     ` Jiaqi Yan
  2025-12-23  7:45   ` Miaohe Lin
  1 sibling, 1 reply; 14+ messages in thread
From: Harry Yoo @ 2025-12-23  5:14 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: jackmanb, hannes, linmiaohe, ziy, willy, nao.horiguchi, david,
	lorenzo.stoakes, william.roche, tony.luck, wangkefeng.wang,
	jane.chu, akpm, osalvador, muchun.song, rientjes, duenwen,
	jthoughton, linux-mm, linux-kernel, Liam.Howlett, vbabka, rppt,
	surenb, mhocko

On Fri, Dec 19, 2025 at 06:33:45PM +0000, Jiaqi Yan wrote:
> At the end of dissolve_free_hugetlb_folio that a free HugeTLB
> folio becomes non-HugeTLB, it is released to buddy allocator
> as a high-order folio, e.g. a folio that contains 262144 pages
> if the folio was a 1G HugeTLB hugepage.
> 
> This is problematic if the HugeTLB hugepage contained HWPoison
> subpages. In that case, since buddy allocator does not check
> HWPoison for non-zero-order folio, the raw HWPoison page can
> be given out with its buddy page and be re-used by either
> kernel or userspace.
> 
> Memory failure recovery (MFR) in kernel does attempt to take
> raw HWPoison page off buddy allocator after
> dissolve_free_hugetlb_folio. However, there is always a time
> window between dissolve_free_hugetlb_folio frees a HWPoison
> high-order folio to buddy allocator and MFR takes HWPoison
> raw page off buddy allocator.
> 
> One obvious way to avoid this problem is to add page sanity
> checks in page allocate or free path. However, it is against
> the past efforts to reduce sanity check overhead [1,2,3].
> 
> Introduce free_has_hwpoison_pages to only free the healthy
> pages and excludes the HWPoison ones in the high-order folio.
> The idea is to iterate through the sub-pages of the folio to
> identify contiguous ranges of healthy pages. Instead of freeing
> pages one by one, decompose healthy ranges into the largest
> possible blocks. Each block meets the requirements to be freed
> to buddy allocator (__free_frozen_pages).
> 
> free_has_hwpoison_pages has linear time complexity O(N) wrt the
> number of pages in the folio. While the power-of-two decomposition
> ensures that the number of calls to the buddy allocator is
> logarithmic for each contiguous healthy range, the mandatory
> linear scan of pages to identify PageHWPoison defines the
> overall time complexity.

Hi Jiaqi, thanks for the patch!

Have you tried measuring the latency of free_has_hwpoison_pages() when
a few pages in a 1GB folio are hwpoisoned?

Just wanted to make sure we don't introduce a possible soft lockup...
Or am I worrying too much?

> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> 
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> ---
>  mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 101 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 822e05f1a9646..20c8862ce594e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2976,8 +2976,109 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
>  	}
>  }
>  
> +static void prepare_compound_page_to_free(struct page *new_head,
> +					  unsigned int order,
> +					  unsigned long flags)
> +{
> +	new_head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> +	new_head->mapping = NULL;
> +	new_head->private = 0;
> +
> +	clear_compound_head(new_head);
> +	if (order)
> +		prep_compound_page(new_head, order);
> +}

Not sure why it's building compound pages, just to decompose them
when freeing via __free_frozen_pages()?

If you intended to reset compound head & tails, I think it's more
readable to decompose the whole compound page at once and not build
compound pages when freeing it?

> +/*
> + * Given a range of pages physically contiguous physical, efficiently
> + * free them in blocks that meet __free_frozen_pages's requirements.
> + */
> +static void free_contiguous_pages(struct page *curr, struct page *next,
> +				  unsigned long flags)
> +{
> +	unsigned int order;
> +	unsigned int align_order;
> +	unsigned int size_order;
> +	unsigned long pfn;
> +	unsigned long end_pfn = page_to_pfn(next);
> +	unsigned long remaining;
> +
> +	/*
> +	 * This decomposition algorithm at every iteration chooses the
> +	 * order to be the minimum of two constraints:
> +	 * - Alignment: the largest power-of-two that divides current pfn.
> +	 * - Size: the largest power-of-two that fits in the
> +	 *   current remaining number of pages.
> +	 */
> +	while (curr < next) {
> +		pfn = page_to_pfn(curr);
> +		remaining = end_pfn - pfn;
> +
> +		align_order = ffs(pfn) - 1;
> +		size_order = fls_long(remaining) - 1;
> +		order = min(align_order, size_order);
> +
> +		prepare_compound_page_to_free(curr, order, flags);
> +		__free_frozen_pages(curr, order, FPI_NONE);
> +		curr += (1UL << order);
> +	}
> +
> +	VM_WARN_ON(curr != next);
> +}
> +
> +/*
> + * Given a high-order compound page containing certain number of HWPoison
> + * pages, free only the healthy ones to buddy allocator.
> + *
> + * It calls __free_frozen_pages O(2^order) times and cause nontrivial
> + * overhead. So only use this when compound page really contains HWPoison.
> + *
> + * This implementation doesn't work in memdesc world.
> + */
> +static void free_has_hwpoison_pages(struct page *page, unsigned int order)
> +{
> +	struct page *curr = page;
> +	struct page *end = page + (1 << order);
> +	struct page *next;
> +	unsigned long flags = page->flags.f;
> +	unsigned long nr_pages;
> +	unsigned long total_freed = 0;
> +	unsigned long total_hwp = 0;
> +
> +	VM_WARN_ON(flags & PAGE_FLAGS_CHECK_AT_FREE);
> +
> +	while (curr < end) {
> +		next = curr;
> +		nr_pages = 0;
> +
> +		while (next < end && !PageHWPoison(next)) {
> +			++next;
> +			++nr_pages;
> +		}
> +
> +		if (PageHWPoison(next))
> +			++total_hwp;
> +
> +		free_contiguous_pages(curr, next, flags);

page_owner, memory profiling (anything else?) will be confused
because it was allocated as a larger size, but we're freeing only
some portion of it.

Perhaps we need to run some portion of this code snippet
(from free_pages_prepare()), before freeing portions of it:

        page_cpupid_reset_last(page);
        page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
        reset_page_owner(page, order);
        page_table_check_free(page, order);
        pgalloc_tag_sub(page, 1 << order);

> +		total_freed += nr_pages;
> +		curr = PageHWPoison(next) ? next + 1 : next;
> +	}
> +
> +	pr_info("Excluded %lu hwpoison pages from folio\n", total_hwp);
> +	pr_info("Freed %#lx pages from folio\n", total_freed);
> +}
> +
>  void free_frozen_pages(struct page *page, unsigned int order)
>  {
> +	struct folio *folio = page_folio(page);
> +
> +	if (order > 0 && unlikely(folio_test_has_hwpoisoned(folio))) {
> +		folio_clear_has_hwpoisoned(folio);
> +		free_has_hwpoison_pages(page, order);
> +		return;
> +	}
> +

It feels like it's a bit random place to do has_hwpoisoned check.
Can we move this to free_pages_prepare() where we have some
sanity checks (and also order-0 hwpoison page handling)?

>  	__free_frozen_pages(page, order, FPI_NONE);
>  }
>  
> -- 
> 2.52.0.322.g1dd061c0dc-goog

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio
  2025-12-23  5:14   ` Harry Yoo
@ 2025-12-27  1:50     ` Jiaqi Yan
  2025-12-29  1:15       ` Harry Yoo
  0 siblings, 1 reply; 14+ messages in thread
From: Jiaqi Yan @ 2025-12-27  1:50 UTC (permalink / raw)
  To: Harry Yoo
  Cc: jackmanb, hannes, linmiaohe, ziy, willy, nao.horiguchi, david,
	lorenzo.stoakes, william.roche, tony.luck, wangkefeng.wang,
	jane.chu, akpm, osalvador, muchun.song, rientjes, duenwen,
	jthoughton, linux-mm, linux-kernel, Liam.Howlett, vbabka, rppt,
	surenb, mhocko

On Mon, Dec 22, 2025 at 9:14 PM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Fri, Dec 19, 2025 at 06:33:45PM +0000, Jiaqi Yan wrote:
> > At the end of dissolve_free_hugetlb_folio that a free HugeTLB
> > folio becomes non-HugeTLB, it is released to buddy allocator
> > as a high-order folio, e.g. a folio that contains 262144 pages
> > if the folio was a 1G HugeTLB hugepage.
> >
> > This is problematic if the HugeTLB hugepage contained HWPoison
> > subpages. In that case, since buddy allocator does not check
> > HWPoison for non-zero-order folio, the raw HWPoison page can
> > be given out with its buddy page and be re-used by either
> > kernel or userspace.
> >
> > Memory failure recovery (MFR) in kernel does attempt to take
> > raw HWPoison page off buddy allocator after
> > dissolve_free_hugetlb_folio. However, there is always a time
> > window between dissolve_free_hugetlb_folio frees a HWPoison
> > high-order folio to buddy allocator and MFR takes HWPoison
> > raw page off buddy allocator.
> >
> > One obvious way to avoid this problem is to add page sanity
> > checks in page allocate or free path. However, it is against
> > the past efforts to reduce sanity check overhead [1,2,3].
> >
> > Introduce free_has_hwpoison_pages to only free the healthy
> > pages and excludes the HWPoison ones in the high-order folio.
> > The idea is to iterate through the sub-pages of the folio to
> > identify contiguous ranges of healthy pages. Instead of freeing
> > pages one by one, decompose healthy ranges into the largest
> > possible blocks. Each block meets the requirements to be freed
> > to buddy allocator (__free_frozen_pages).
> >
> > free_has_hwpoison_pages has linear time complexity O(N) wrt the
> > number of pages in the folio. While the power-of-two decomposition
> > ensures that the number of calls to the buddy allocator is
> > logarithmic for each contiguous healthy range, the mandatory
> > linear scan of pages to identify PageHWPoison defines the
> > overall time complexity.
>
> Hi Jiaqi, thanks for the patch!

Thanks for your review/comments!

>
> Have you tried measuring the latency of free_has_hwpoison_pages() when
> a few pages in a 1GB folio are hwpoisoned?
>
> Just wanted to make sure we don't introduce a possible soft lockup...
> Or am I worrying too much?

In my local tests, freeing a 1GB folio with 1 / 3 / 8 HWPoison pages,
I never run into a soft lockup. The 8-HWPoison-page case takes more
time than other cases, meaning that handling the additional HWPoison
page adds to the time cost.

After adding some instrument code, 10 sample runs of
free_has_hwpoison_pages with 8 HWPoison pages:
- observed mean is 7.03 ms (5.97 ms when 3 HWPoison pages)
- observed standard deviation is 0.76 ms (0.18 ms when 3 HWPoison pages)

In comparison, freeing a 1G folio without any HWPoison pages 10 times
(with same kernel config):
- observed mean is 3.39 ms
- observed standard deviation is 0.16ms

So it's around twice the baseline. It should be far from triggering a
soft lockup, and the cost seems fair for handling exceptional hardware
memory errors.

I can add these measurements in future revisions.

>
> > [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> > [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> > [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> >
> > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > ---
> >  mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 101 insertions(+)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 822e05f1a9646..20c8862ce594e 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2976,8 +2976,109 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
> >       }
> >  }
> >
> > +static void prepare_compound_page_to_free(struct page *new_head,
> > +                                       unsigned int order,
> > +                                       unsigned long flags)
> > +{
> > +     new_head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> > +     new_head->mapping = NULL;
> > +     new_head->private = 0;
> > +
> > +     clear_compound_head(new_head);
> > +     if (order)
> > +             prep_compound_page(new_head, order);
> > +}
>
> Not sure why it's building compound pages, just to decompose them
> when freeing via __free_frozen_pages()?

prepare_compound_page_to_free() borrowed the idea from
__split_folio_to_order(). Conceptually the original folio is split
into new compound pages with different orders; here this is done on
the fly in free_contiguous_pages() when order is decided.
__free_frozen_pages() is also happy with a compound page when order >
0, as I tested with free_pages_prepare before calling
__free_frozen_pages().

>
> If you intended to reset compound head & tails, I think it's more
> readable to decompose the whole compound page at once and not build
> compound pages when freeing it?

I don't think prepare_compound_page_to_free() is that hard to read,
but I'm open to more opinions.

>
> > +/*
> > + * Given a range of pages physically contiguous physical, efficiently
> > + * free them in blocks that meet __free_frozen_pages's requirements.
> > + */
> > +static void free_contiguous_pages(struct page *curr, struct page *next,
> > +                               unsigned long flags)
> > +{
> > +     unsigned int order;
> > +     unsigned int align_order;
> > +     unsigned int size_order;
> > +     unsigned long pfn;
> > +     unsigned long end_pfn = page_to_pfn(next);
> > +     unsigned long remaining;
> > +
> > +     /*
> > +      * This decomposition algorithm at every iteration chooses the
> > +      * order to be the minimum of two constraints:
> > +      * - Alignment: the largest power-of-two that divides current pfn.
> > +      * - Size: the largest power-of-two that fits in the
> > +      *   current remaining number of pages.
> > +      */
> > +     while (curr < next) {
> > +             pfn = page_to_pfn(curr);
> > +             remaining = end_pfn - pfn;
> > +
> > +             align_order = ffs(pfn) - 1;
> > +             size_order = fls_long(remaining) - 1;
> > +             order = min(align_order, size_order);
> > +
> > +             prepare_compound_page_to_free(curr, order, flags);
> > +             __free_frozen_pages(curr, order, FPI_NONE);
> > +             curr += (1UL << order);
> > +     }
> > +
> > +     VM_WARN_ON(curr != next);
> > +}
> > +
> > +/*
> > + * Given a high-order compound page containing certain number of HWPoison
> > + * pages, free only the healthy ones to buddy allocator.
> > + *
> > + * It calls __free_frozen_pages O(2^order) times and cause nontrivial
> > + * overhead. So only use this when compound page really contains HWPoison.
> > + *
> > + * This implementation doesn't work in memdesc world.
> > + */
> > +static void free_has_hwpoison_pages(struct page *page, unsigned int order)
> > +{
> > +     struct page *curr = page;
> > +     struct page *end = page + (1 << order);
> > +     struct page *next;
> > +     unsigned long flags = page->flags.f;
> > +     unsigned long nr_pages;
> > +     unsigned long total_freed = 0;
> > +     unsigned long total_hwp = 0;
> > +
> > +     VM_WARN_ON(flags & PAGE_FLAGS_CHECK_AT_FREE);
> > +
> > +     while (curr < end) {
> > +             next = curr;
> > +             nr_pages = 0;
> > +
> > +             while (next < end && !PageHWPoison(next)) {
> > +                     ++next;
> > +                     ++nr_pages;
> > +             }
> > +
> > +             if (PageHWPoison(next))
> > +                     ++total_hwp;
> > +
> > +             free_contiguous_pages(curr, next, flags);
>
> page_owner, memory profiling (anything else?) will be confused
> because it was allocated as a larger size, but we're freeing only
> some portion of it.

I am not sure, but looking at __split_unmapped_folio, it calls
pgalloc_tag_split(folio, old_order, split_order) when splitting an
old_order-order folio into a new split_order.

Maybe prepare_compound_page_to_free really should
update_page_tag_ref(), I need to take a closer look at this with
CONFIG_MEM_ALLOC_PROFILING (not something I usually enable).

>
> Perhaps we need to run some portion of this code snippet
> (from free_pages_prepare()), before freeing portions of it:
>
>         page_cpupid_reset_last(page);
>         page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
>         reset_page_owner(page, order);
>         page_table_check_free(page, order);
>         pgalloc_tag_sub(page, 1 << order);

Since they come from free_pages_prepare, I believe these lines are
already executed via free_contiguous_pages()=> __free_frozen_pages()=>
free_pages_prepare(), right? Or am I missing something?

>
> > +             total_freed += nr_pages;
> > +             curr = PageHWPoison(next) ? next + 1 : next;
> > +     }
> > +
> > +     pr_info("Excluded %lu hwpoison pages from folio\n", total_hwp);
> > +     pr_info("Freed %#lx pages from folio\n", total_freed);
> > +}
> > +
> >  void free_frozen_pages(struct page *page, unsigned int order)
> >  {
> > +     struct folio *folio = page_folio(page);
> > +
> > +     if (order > 0 && unlikely(folio_test_has_hwpoisoned(folio))) {
> > +             folio_clear_has_hwpoisoned(folio);
> > +             free_has_hwpoison_pages(page, order);
> > +             return;
> > +     }
> > +
>
> It feels like it's a bit random place to do has_hwpoisoned check.
> Can we move this to free_pages_prepare() where we have some
> sanity checks (and also order-0 hwpoison page handling)?

While free_pages_prepare() seems to be a better place to do the
has_hwpoisoned check, it is not a good place to do
free_has_hwpoison_pages().


>
> >       __free_frozen_pages(page, order, FPI_NONE);
> >  }
> >
> > --
> > 2.52.0.322.g1dd061c0dc-goog
>
> --
> Cheers,
> Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio
  2025-12-27  1:50     ` Jiaqi Yan
@ 2025-12-29  1:15       ` Harry Yoo
  2025-12-31  0:19         ` Jiaqi Yan
  0 siblings, 1 reply; 14+ messages in thread
From: Harry Yoo @ 2025-12-29  1:15 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: jackmanb, hannes, linmiaohe, ziy, willy, nao.horiguchi, david,
	lorenzo.stoakes, william.roche, tony.luck, wangkefeng.wang,
	jane.chu, akpm, osalvador, muchun.song, rientjes, duenwen,
	jthoughton, linux-mm, linux-kernel, Liam.Howlett, vbabka, rppt,
	surenb, mhocko

On Fri, Dec 26, 2025 at 05:50:59PM -0800, Jiaqi Yan wrote:
> On Mon, Dec 22, 2025 at 9:14 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> >
> > On Fri, Dec 19, 2025 at 06:33:45PM +0000, Jiaqi Yan wrote:
> > > At the end of dissolve_free_hugetlb_folio that a free HugeTLB
> > > folio becomes non-HugeTLB, it is released to buddy allocator
> > > as a high-order folio, e.g. a folio that contains 262144 pages
> > > if the folio was a 1G HugeTLB hugepage.
> > >
> > > This is problematic if the HugeTLB hugepage contained HWPoison
> > > subpages. In that case, since buddy allocator does not check
> > > HWPoison for non-zero-order folio, the raw HWPoison page can
> > > be given out with its buddy page and be re-used by either
> > > kernel or userspace.
> > >
> > > Memory failure recovery (MFR) in kernel does attempt to take
> > > raw HWPoison page off buddy allocator after
> > > dissolve_free_hugetlb_folio. However, there is always a time
> > > window between dissolve_free_hugetlb_folio frees a HWPoison
> > > high-order folio to buddy allocator and MFR takes HWPoison
> > > raw page off buddy allocator.
> > >
> > > One obvious way to avoid this problem is to add page sanity
> > > checks in page allocate or free path. However, it is against
> > > the past efforts to reduce sanity check overhead [1,2,3].
> > >
> > > Introduce free_has_hwpoison_pages to only free the healthy
> > > pages and excludes the HWPoison ones in the high-order folio.
> > > The idea is to iterate through the sub-pages of the folio to
> > > identify contiguous ranges of healthy pages. Instead of freeing
> > > pages one by one, decompose healthy ranges into the largest
> > > possible blocks. Each block meets the requirements to be freed
> > > to buddy allocator (__free_frozen_pages).
> > >
> > > free_has_hwpoison_pages has linear time complexity O(N) wrt the
> > > number of pages in the folio. While the power-of-two decomposition
> > > ensures that the number of calls to the buddy allocator is
> > > logarithmic for each contiguous healthy range, the mandatory
> > > linear scan of pages to identify PageHWPoison defines the
> > > overall time complexity.
> >
> > Hi Jiaqi, thanks for the patch!
> 
> Thanks for your review/comments!
> 
> >
> > Have you tried measuring the latency of free_has_hwpoison_pages() when
> > a few pages in a 1GB folio are hwpoisoned?
> >
> > Just wanted to make sure we don't introduce a possible soft lockup...
> > Or am I worrying too much?
> 
> In my local tests, freeing a 1GB folio with 1 / 3 / 8 HWPoison pages,
> I never run into a soft lockup. The 8-HWPoison-page case takes more
> time than other cases, meaning that handling the additional HWPoison
> page adds to the time cost.
> 
> After adding some instrument code, 10 sample runs of
> free_has_hwpoison_pages with 8 HWPoison pages:
> - observed mean is 7.03 ms (5.97 ms when 3 HWPoison pages)
> - observed standard deviation is 0.76 ms (0.18 ms when 3 HWPoison pages)
> 
> In comparison, freeing a 1G folio without any HWPoison pages 10 times
> (with same kernel config):
> - observed mean is 3.39 ms
> - observed standard deviation is 0.16ms

Thanks for the measurement!
 
> So it's around twice the baseline. It should be far from triggering a
> soft lockup, and the cost seems fair for handling exceptional hardware
> memory errors.

Yeah it looks fine to me.

> I can add these measurements in future revisions.

That would be nice, thanks.

> > > [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> > > [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> > > [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> > >
> > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > > ---
> > >  mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 101 insertions(+)
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index 822e05f1a9646..20c8862ce594e 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -2976,8 +2976,109 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
> > >       }
> > >  }
> > >
> > > +static void prepare_compound_page_to_free(struct page *new_head,
> > > +                                       unsigned int order,
> > > +                                       unsigned long flags)
> > > +{
> > > +     new_head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> > > +     new_head->mapping = NULL;
> > > +     new_head->private = 0;
> > > +
> > > +     clear_compound_head(new_head);
> > > +     if (order)
> > > +             prep_compound_page(new_head, order);
> > > +}
> >
> > Not sure why it's building compound pages, just to decompose them
> > when freeing via __free_frozen_pages()?
> 
> prepare_compound_page_to_free() borrowed the idea from
> __split_folio_to_order(). Conceptually the original folio is split
> into new compound pages with different orders;

I see, and per the previous discussion we don't want to split it
to 262,144 4K pages in the future, anyway...

> here this is done on
> the fly in free_contiguous_pages() when order is decided.
>
> > > +/*
> > > + * Given a range of pages physically contiguous physical, efficiently
> > > + * free them in blocks that meet __free_frozen_pages's requirements.
> > > + */
> > > +static void free_contiguous_pages(struct page *curr, struct page *next,
> > > +                               unsigned long flags)
> > > +{
> > > +     unsigned int order;
> > > +     unsigned int align_order;
> > > +     unsigned int size_order;
> > > +     unsigned long pfn;
> > > +     unsigned long end_pfn = page_to_pfn(next);
> > > +     unsigned long remaining;
> > > +
> > > +     /*
> > > +      * This decomposition algorithm at every iteration chooses the
> > > +      * order to be the minimum of two constraints:
> > > +      * - Alignment: the largest power-of-two that divides current pfn.
> > > +      * - Size: the largest power-of-two that fits in the
> > > +      *   current remaining number of pages.
> > > +      */
> > > +     while (curr < next) {
> > > +             pfn = page_to_pfn(curr);
> > > +             remaining = end_pfn - pfn;
> > > +
> > > +             align_order = ffs(pfn) - 1;
> > > +             size_order = fls_long(remaining) - 1;
> > > +             order = min(align_order, size_order);
> > > +
> > > +             prepare_compound_page_to_free(curr, order, flags);
> > > +             __free_frozen_pages(curr, order, FPI_NONE);
> > > +             curr += (1UL << order);
> > > +     }
> > > +
> > > +     VM_WARN_ON(curr != next);
> > > +}
> > > +
> > > +/*
> > > + * Given a high-order compound page containing certain number of HWPoison
> > > + * pages, free only the healthy ones to buddy allocator.
> > > + *
> > > + * It calls __free_frozen_pages O(2^order) times and cause nontrivial
> > > + * overhead. So only use this when compound page really contains HWPoison.
> > > + *
> > > + * This implementation doesn't work in memdesc world.
> > > + */
> > > +static void free_has_hwpoison_pages(struct page *page, unsigned int order)
> > > +{
> > > +     struct page *curr = page;
> > > +     struct page *end = page + (1 << order);
> > > +     struct page *next;
> > > +     unsigned long flags = page->flags.f;
> > > +     unsigned long nr_pages;
> > > +     unsigned long total_freed = 0;
> > > +     unsigned long total_hwp = 0;
> > > +
> > > +     VM_WARN_ON(flags & PAGE_FLAGS_CHECK_AT_FREE);
> > > +
> > > +     while (curr < end) {
> > > +             next = curr;
> > > +             nr_pages = 0;
> > > +
> > > +             while (next < end && !PageHWPoison(next)) {
> > > +                     ++next;
> > > +                     ++nr_pages;
> > > +             }
> > > +
> > > +             if (PageHWPoison(next))
> > > +                     ++total_hwp;
> > > +
> > > +             free_contiguous_pages(curr, next, flags);
> >
> > page_owner, memory profiling (anything else?) will be confused
> > because it was allocated as a larger size, but we're freeing only
> > some portion of it.
> 
> I am not sure, but looking at __split_unmapped_folio, it calls
> pgalloc_tag_split(folio, old_order, split_order) when splitting an
> old_order-order folio into a new split_order.
> 
> Maybe prepare_compound_page_to_free really should
> update_page_tag_ref(), I need to take a closer look at this with
> CONFIG_MEM_ALLOC_PROFILING (not something I usually enable).
> 
> > Perhaps we need to run some portion of this code snippet
> > (from free_pages_prepare()), before freeing portions of it:
> >
> >         page_cpupid_reset_last(page);
> >         page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> >         reset_page_owner(page, order);
> >         page_table_check_free(page, order);
> >         pgalloc_tag_sub(page, 1 << order);
> 
> Since they come from free_pages_prepare, I believe these lines are
> already executed via free_contiguous_pages()=> __free_frozen_pages()=>
> free_pages_prepare(), right? Or am I missing something?

But they're called with order that is smaller than the original order.
That's could be problematic; for example, memory profiling stores metadata
only on the first page. If you pass anything other than the first page
to free_pages_prepare(), it will not recognize that metadata was stored
during allocation.

In general, I think they're not designed to handle cases where
the allocation order and the free order differ (unless we split
metadata like __split_unmapped_folio() does).

> > > +             total_freed += nr_pages;
> > > +             curr = PageHWPoison(next) ? next + 1 : next;
> > > +     }
> > > +
> > > +     pr_info("Excluded %lu hwpoison pages from folio\n", total_hwp);
> > > +     pr_info("Freed %#lx pages from folio\n", total_freed);
> > > +}
> > > +
> > >  void free_frozen_pages(struct page *page, unsigned int order)
> > >  {
> > > +     struct folio *folio = page_folio(page);
> > > +
> > > +     if (order > 0 && unlikely(folio_test_has_hwpoisoned(folio))) {
> > > +             folio_clear_has_hwpoisoned(folio);
> > > +             free_has_hwpoison_pages(page, order);
> > > +             return;
> > > +     }
> > > +
> >
> > It feels like it's a bit random place to do has_hwpoisoned check.
> > Can we move this to free_pages_prepare() where we have some
> > sanity checks (and also order-0 hwpoison page handling)?
> 
> While free_pages_prepare() seems to be a better place to do the
> has_hwpoisoned check, it is not a good place to do
> free_has_hwpoison_pages().

Why is it not a good place for free_has_hwpoison_pages()?

Callers of free_pages_prepare() are supposed to avoid freeing it back to
the buddy or using the page when it returns false.

...except compaction_free(), which I don't have much idea what it's
doing.

> > >       __free_frozen_pages(page, order, FPI_NONE);
> > >  }

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio
  2025-12-29  1:15       ` Harry Yoo
@ 2025-12-31  0:19         ` Jiaqi Yan
  2025-12-31  4:37           ` Harry Yoo
  0 siblings, 1 reply; 14+ messages in thread
From: Jiaqi Yan @ 2025-12-31  0:19 UTC (permalink / raw)
  To: Harry Yoo
  Cc: jackmanb, hannes, linmiaohe, ziy, willy, nao.horiguchi, david,
	lorenzo.stoakes, william.roche, tony.luck, wangkefeng.wang,
	jane.chu, akpm, osalvador, muchun.song, rientjes, duenwen,
	jthoughton, linux-mm, linux-kernel, Liam.Howlett, vbabka, rppt,
	surenb, mhocko

On Sun, Dec 28, 2025 at 5:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Fri, Dec 26, 2025 at 05:50:59PM -0800, Jiaqi Yan wrote:
> > On Mon, Dec 22, 2025 at 9:14 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > >
> > > On Fri, Dec 19, 2025 at 06:33:45PM +0000, Jiaqi Yan wrote:
> > > > At the end of dissolve_free_hugetlb_folio that a free HugeTLB
> > > > folio becomes non-HugeTLB, it is released to buddy allocator
> > > > as a high-order folio, e.g. a folio that contains 262144 pages
> > > > if the folio was a 1G HugeTLB hugepage.
> > > >
> > > > This is problematic if the HugeTLB hugepage contained HWPoison
> > > > subpages. In that case, since buddy allocator does not check
> > > > HWPoison for non-zero-order folio, the raw HWPoison page can
> > > > be given out with its buddy page and be re-used by either
> > > > kernel or userspace.
> > > >
> > > > Memory failure recovery (MFR) in kernel does attempt to take
> > > > raw HWPoison page off buddy allocator after
> > > > dissolve_free_hugetlb_folio. However, there is always a time
> > > > window between dissolve_free_hugetlb_folio frees a HWPoison
> > > > high-order folio to buddy allocator and MFR takes HWPoison
> > > > raw page off buddy allocator.
> > > >
> > > > One obvious way to avoid this problem is to add page sanity
> > > > checks in page allocate or free path. However, it is against
> > > > the past efforts to reduce sanity check overhead [1,2,3].
> > > >
> > > > Introduce free_has_hwpoison_pages to only free the healthy
> > > > pages and excludes the HWPoison ones in the high-order folio.
> > > > The idea is to iterate through the sub-pages of the folio to
> > > > identify contiguous ranges of healthy pages. Instead of freeing
> > > > pages one by one, decompose healthy ranges into the largest
> > > > possible blocks. Each block meets the requirements to be freed
> > > > to buddy allocator (__free_frozen_pages).
> > > >
> > > > free_has_hwpoison_pages has linear time complexity O(N) wrt the
> > > > number of pages in the folio. While the power-of-two decomposition
> > > > ensures that the number of calls to the buddy allocator is
> > > > logarithmic for each contiguous healthy range, the mandatory
> > > > linear scan of pages to identify PageHWPoison defines the
> > > > overall time complexity.
> > >
> > > Hi Jiaqi, thanks for the patch!
> >
> > Thanks for your review/comments!
> >
> > >
> > > Have you tried measuring the latency of free_has_hwpoison_pages() when
> > > a few pages in a 1GB folio are hwpoisoned?
> > >
> > > Just wanted to make sure we don't introduce a possible soft lockup...
> > > Or am I worrying too much?
> >
> > In my local tests, freeing a 1GB folio with 1 / 3 / 8 HWPoison pages,
> > I never run into a soft lockup. The 8-HWPoison-page case takes more
> > time than other cases, meaning that handling the additional HWPoison
> > page adds to the time cost.
> >
> > After adding some instrument code, 10 sample runs of
> > free_has_hwpoison_pages with 8 HWPoison pages:
> > - observed mean is 7.03 ms (5.97 ms when 3 HWPoison pages)
> > - observed standard deviation is 0.76 ms (0.18 ms when 3 HWPoison pages)
> >
> > In comparison, freeing a 1G folio without any HWPoison pages 10 times
> > (with same kernel config):
> > - observed mean is 3.39 ms
> > - observed standard deviation is 0.16ms
>
> Thanks for the measurement!
>
> > So it's around twice the baseline. It should be far from triggering a
> > soft lockup, and the cost seems fair for handling exceptional hardware
> > memory errors.
>
> Yeah it looks fine to me.
>
> > I can add these measurements in future revisions.
>
> That would be nice, thanks.
>
> > > > [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> > > > [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> > > > [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> > > >
> > > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > > > ---
> > > >  mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 101 insertions(+)
> > > >
> > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > index 822e05f1a9646..20c8862ce594e 100644
> > > > --- a/mm/page_alloc.c
> > > > +++ b/mm/page_alloc.c
> > > > @@ -2976,8 +2976,109 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
> > > >       }
> > > >  }
> > > >
> > > > +static void prepare_compound_page_to_free(struct page *new_head,
> > > > +                                       unsigned int order,
> > > > +                                       unsigned long flags)
> > > > +{
> > > > +     new_head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> > > > +     new_head->mapping = NULL;
> > > > +     new_head->private = 0;
> > > > +
> > > > +     clear_compound_head(new_head);
> > > > +     if (order)
> > > > +             prep_compound_page(new_head, order);
> > > > +}
> > >
> > > Not sure why it's building compound pages, just to decompose them
> > > when freeing via __free_frozen_pages()?
> >
> > prepare_compound_page_to_free() borrowed the idea from
> > __split_folio_to_order(). Conceptually the original folio is split
> > into new compound pages with different orders;
>
> I see, and per the previous discussion we don't want to split it
> to 262,144 4K pages in the future, anyway...
>
> > here this is done on
> > the fly in free_contiguous_pages() when order is decided.
> >
> > > > +/*
> > > > + * Given a range of pages physically contiguous physical, efficiently
> > > > + * free them in blocks that meet __free_frozen_pages's requirements.
> > > > + */
> > > > +static void free_contiguous_pages(struct page *curr, struct page *next,
> > > > +                               unsigned long flags)
> > > > +{
> > > > +     unsigned int order;
> > > > +     unsigned int align_order;
> > > > +     unsigned int size_order;
> > > > +     unsigned long pfn;
> > > > +     unsigned long end_pfn = page_to_pfn(next);
> > > > +     unsigned long remaining;
> > > > +
> > > > +     /*
> > > > +      * This decomposition algorithm at every iteration chooses the
> > > > +      * order to be the minimum of two constraints:
> > > > +      * - Alignment: the largest power-of-two that divides current pfn.
> > > > +      * - Size: the largest power-of-two that fits in the
> > > > +      *   current remaining number of pages.
> > > > +      */
> > > > +     while (curr < next) {
> > > > +             pfn = page_to_pfn(curr);
> > > > +             remaining = end_pfn - pfn;
> > > > +
> > > > +             align_order = ffs(pfn) - 1;
> > > > +             size_order = fls_long(remaining) - 1;
> > > > +             order = min(align_order, size_order);
> > > > +
> > > > +             prepare_compound_page_to_free(curr, order, flags);
> > > > +             __free_frozen_pages(curr, order, FPI_NONE);
> > > > +             curr += (1UL << order);
> > > > +     }
> > > > +
> > > > +     VM_WARN_ON(curr != next);
> > > > +}
> > > > +
> > > > +/*
> > > > + * Given a high-order compound page containing certain number of HWPoison
> > > > + * pages, free only the healthy ones to buddy allocator.
> > > > + *
> > > > + * It calls __free_frozen_pages O(2^order) times and cause nontrivial
> > > > + * overhead. So only use this when compound page really contains HWPoison.
> > > > + *
> > > > + * This implementation doesn't work in memdesc world.
> > > > + */
> > > > +static void free_has_hwpoison_pages(struct page *page, unsigned int order)
> > > > +{
> > > > +     struct page *curr = page;
> > > > +     struct page *end = page + (1 << order);
> > > > +     struct page *next;
> > > > +     unsigned long flags = page->flags.f;
> > > > +     unsigned long nr_pages;
> > > > +     unsigned long total_freed = 0;
> > > > +     unsigned long total_hwp = 0;
> > > > +
> > > > +     VM_WARN_ON(flags & PAGE_FLAGS_CHECK_AT_FREE);
> > > > +
> > > > +     while (curr < end) {
> > > > +             next = curr;
> > > > +             nr_pages = 0;
> > > > +
> > > > +             while (next < end && !PageHWPoison(next)) {
> > > > +                     ++next;
> > > > +                     ++nr_pages;
> > > > +             }
> > > > +
> > > > +             if (PageHWPoison(next))
> > > > +                     ++total_hwp;
> > > > +
> > > > +             free_contiguous_pages(curr, next, flags);
> > >
> > > page_owner, memory profiling (anything else?) will be confused
> > > because it was allocated as a larger size, but we're freeing only
> > > some portion of it.
> >
> > I am not sure, but looking at __split_unmapped_folio, it calls
> > pgalloc_tag_split(folio, old_order, split_order) when splitting an
> > old_order-order folio into a new split_order.
> >
> > Maybe prepare_compound_page_to_free really should
> > update_page_tag_ref(), I need to take a closer look at this with
> > CONFIG_MEM_ALLOC_PROFILING (not something I usually enable).
> >
> > > Perhaps we need to run some portion of this code snippet
> > > (from free_pages_prepare()), before freeing portions of it:
> > >
> > >         page_cpupid_reset_last(page);
> > >         page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > >         reset_page_owner(page, order);
> > >         page_table_check_free(page, order);
> > >         pgalloc_tag_sub(page, 1 << order);
> >
> > Since they come from free_pages_prepare, I believe these lines are
> > already executed via free_contiguous_pages()=> __free_frozen_pages()=>
> > free_pages_prepare(), right? Or am I missing something?
>
> But they're called with order that is smaller than the original order.
> That's could be problematic; for example, memory profiling stores metadata
> only on the first page. If you pass anything other than the first page
> to free_pages_prepare(), it will not recognize that metadata was stored
> during allocation.
>

Right, with MEM_ALLOC_PROFILING enabled, I ran into the following
WARNING when freeing all blocks except the 1st one (which contains the
original head page):

[ 2101.713669] ------------[ cut here ]------------
[ 2101.713670] alloc_tag was not set
[ 2101.713671] WARNING: ./include/linux/alloc_tag.h:164 at
__pgalloc_tag_sub+0xdf/0x160, CPU#18: hugetlb-mfr-3pa/33675
[ 2101.713693] CPU: 18 UID: 0 PID: 33675 Comm: hugetlb-mfr-3pa
Tainted: G S W O 6.19.0-smp-DEV #2 NONE
[ 2101.713698] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN, [O]=OOT_MODULE
[ 2101.713702] RIP: 0010:__pgalloc_tag_sub+0xdf/0x160
...
[ 2101.713723] Call Trace:
[ 2101.713725] <TASK>
[ 2101.713727] free_has_hwpoison_pages+0xbc/0x370
[ 2101.713731] free_frozen_pages+0xb3/0x100
[ 2101.713733] __folio_put+0xd5/0x100
[ 2101.713739] dissolve_free_hugetlb_folio+0x17f/0x1a0
[ 2101.713743] filemap_offline_hwpoison_folio+0x193/0x4c0
[ 2101.713747] ? __pfx_workingset_update_node+0x10/0x10
[ 2101.713751] remove_inode_hugepages+0x209/0x690
[ 2101.713757] ? on_each_cpu_cond_mask+0x1a/0x20
[ 2101.713760] ? __cond_resched+0x23/0x60
[ 2101.713768] ? n_tty_write+0x4c7/0x500
[ 2101.713773] hugetlbfs_setattr+0x127/0x170
[ 2101.713776] notify_change+0x32e/0x390
[ 2101.713781] do_ftruncate+0x12c/0x1a0
[ 2101.713786] __x64_sys_ftruncate+0x3e/0x70
[ 2101.713789] do_syscall_64+0x6f/0x890
[ 2101.713792] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 2101.713811] </TASK>
[ 2101.713812] ---[ end trace 0000000000000000 ]---

This is because in free_pages_prepare(), pgalloc_tag_sub() found there
is no alloc tag on the compound page being freeing.

> In general, I think they're not designed to handle cases where
> the allocation order and the free order differ (unless we split
> metadata like __split_unmapped_folio() does).

I believe the proper way to fix this is to do something similar to
pgalloc_tag_split(), used by __split_unmapped_folio(). When we split a
new block from the original folio, we create a compound page from the
block (just the way prep_compound_page_to_free does) and link the
alloc tag of the original head page to the head of the new compound
page. Something like copy_alloc_tag (to be added in v3) below to demo
my idea, assuming tag=pgalloc_tag_get(original head page):

+/*
+ * Point page's alloc tag to an existing one.
+ */
+static void copy_alloc_tag(struct page *page, struct alloc_tag *tag)
+{
+       union pgtag_ref_handle handle;
+       union codetag_ref ref;
+       unsigned long pfn = page_to_pfn(page);
+
+       if (!mem_alloc_profiling_enabled())
+               return;
+
+       /* tag is NULL if HugeTLB page is allocated in boot process. */
+       if (!tag)
+               return;
+
+       if (!get_page_tag_ref(page, &ref, &handle))
+               return;
+
+       /* Avoid overriding existing alloc tag from page. */
+       if (!ref.ct || is_codetag_empty(&ref)) {
+               alloc_tag_ref_set(&ref, tag);
+               update_page_tag_ref(handle, &ref);
+       }
+       put_page_tag_ref(handle);
+}
+
+static void prep_compound_page_to_free(struct page *head, unsigned int order,
+                                      unsigned long flags, struct
alloc_tag *tag)
+{
+       head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
+       head->mapping = NULL;
+       head->private = 0;
+
+       clear_compound_head(head);
+       if (order)
+               prep_compound_page(head, order);
+
+       copy_alloc_tag(head, tag);
+}

I tested now the WARNING from include/linux/alloc_tag.h:164 is gone
for both 2M and 1G pages. BTW we also need to copy_alloc_tag() for
HWPoison pages before pgalloc_tag_sub().

>
> > > > +             total_freed += nr_pages;
> > > > +             curr = PageHWPoison(next) ? next + 1 : next;
> > > > +     }
> > > > +
> > > > +     pr_info("Excluded %lu hwpoison pages from folio\n", total_hwp);
> > > > +     pr_info("Freed %#lx pages from folio\n", total_freed);
> > > > +}
> > > > +
> > > >  void free_frozen_pages(struct page *page, unsigned int order)
> > > >  {
> > > > +     struct folio *folio = page_folio(page);
> > > > +
> > > > +     if (order > 0 && unlikely(folio_test_has_hwpoisoned(folio))) {
> > > > +             folio_clear_has_hwpoisoned(folio);
> > > > +             free_has_hwpoison_pages(page, order);
> > > > +             return;
> > > > +     }
> > > > +
> > >
> > > It feels like it's a bit random place to do has_hwpoisoned check.
> > > Can we move this to free_pages_prepare() where we have some
> > > sanity checks (and also order-0 hwpoison page handling)?
> >
> > While free_pages_prepare() seems to be a better place to do the
> > has_hwpoisoned check, it is not a good place to do
> > free_has_hwpoison_pages().
>
> Why is it not a good place for free_has_hwpoison_pages()?
>
> Callers of free_pages_prepare() are supposed to avoid freeing it back to
> the buddy or using the page when it returns false.

What I mean is, callers of free_pages_prepare() wouldn't know from the
single false return value whether 1) they should completely bail out
or 2) they should retry with free_has_hwpoison_pages. So for now I
think it'd better that free_frozen_pages does the check and act upon
the check result. Not sure if there is a better place, or worthy to
change free_pages_prepare()'s return type?

>
> ...except compaction_free(), which I don't have much idea what it's
> doing.
>
> > > >       __free_frozen_pages(page, order, FPI_NONE);
> > > >  }
>
> --
> Cheers,
> Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio
  2025-12-31  0:19         ` Jiaqi Yan
@ 2025-12-31  4:37           ` Harry Yoo
  0 siblings, 0 replies; 14+ messages in thread
From: Harry Yoo @ 2025-12-31  4:37 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: jackmanb, hannes, linmiaohe, ziy, willy, nao.horiguchi, david,
	lorenzo.stoakes, william.roche, tony.luck, wangkefeng.wang,
	jane.chu, akpm, osalvador, muchun.song, rientjes, duenwen,
	jthoughton, linux-mm, linux-kernel, Liam.Howlett, vbabka, rppt,
	surenb, mhocko

On Tue, Dec 30, 2025 at 04:19:50PM -0800, Jiaqi Yan wrote:
> On Sun, Dec 28, 2025 at 5:15 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > On Fri, Dec 26, 2025 at 05:50:59PM -0800, Jiaqi Yan wrote:
> > > On Mon, Dec 22, 2025 at 9:14 PM Harry Yoo <harry.yoo@oracle.com> wrote:
> > > > On Fri, Dec 19, 2025 at 06:33:45PM +0000, Jiaqi Yan wrote:
> > > > > ---
> > > > >  mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
> > > > >  1 file changed, 101 insertions(+)
> > > > >
> > > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > > > index 822e05f1a9646..20c8862ce594e 100644
> > > > > --- a/mm/page_alloc.c
> > > > > +++ b/mm/page_alloc.c
> > > > > +/*
> > > > > + * Given a high-order compound page containing certain number of HWPoison
> > > > > + * pages, free only the healthy ones to buddy allocator.
> > > > > + *
> > > > > + * It calls __free_frozen_pages O(2^order) times and cause nontrivial
> > > > > + * overhead. So only use this when compound page really contains HWPoison.
> > > > > + *
> > > > > + * This implementation doesn't work in memdesc world.
> > > > > + */
> > > > > +static void free_has_hwpoison_pages(struct page *page, unsigned int order)
> > > > > +{
> > > > > +     struct page *curr = page;
> > > > > +     struct page *end = page + (1 << order);
> > > > > +     struct page *next;
> > > > > +     unsigned long flags = page->flags.f;
> > > > > +     unsigned long nr_pages;
> > > > > +     unsigned long total_freed = 0;
> > > > > +     unsigned long total_hwp = 0;
> > > > > +
> > > > > +     VM_WARN_ON(flags & PAGE_FLAGS_CHECK_AT_FREE);
> > > > > +
> > > > > +     while (curr < end) {
> > > > > +             next = curr;
> > > > > +             nr_pages = 0;
> > > > > +
> > > > > +             while (next < end && !PageHWPoison(next)) {
> > > > > +                     ++next;
> > > > > +                     ++nr_pages;
> > > > > +             }
> > > > > +
> > > > > +             if (PageHWPoison(next))
> > > > > +                     ++total_hwp;
> > > > > +
> > > > > +             free_contiguous_pages(curr, next, flags);
> > > >
> > > > page_owner, memory profiling (anything else?) will be confused
> > > > because it was allocated as a larger size, but we're freeing only
> > > > some portion of it.
> > >
> > > I am not sure, but looking at __split_unmapped_folio, it calls
> > > pgalloc_tag_split(folio, old_order, split_order) when splitting an
> > > old_order-order folio into a new split_order.
> > >
> > > Maybe prepare_compound_page_to_free really should
> > > update_page_tag_ref(), I need to take a closer look at this with
> > > CONFIG_MEM_ALLOC_PROFILING (not something I usually enable).
> > >
> > > > Perhaps we need to run some portion of this code snippet
> > > > (from free_pages_prepare()), before freeing portions of it:
> > > >
> > > >         page_cpupid_reset_last(page);
> > > >         page->flags.f &= ~PAGE_FLAGS_CHECK_AT_PREP;
> > > >         reset_page_owner(page, order);
> > > >         page_table_check_free(page, order);
> > > >         pgalloc_tag_sub(page, 1 << order);
> > >
> > > Since they come from free_pages_prepare, I believe these lines are
> > > already executed via free_contiguous_pages()=> __free_frozen_pages()=>
> > > free_pages_prepare(), right? Or am I missing something?
> >
> > But they're called with order that is smaller than the original order.
> > That's could be problematic; for example, memory profiling stores metadata
> > only on the first page. If you pass anything other than the first page
> > to free_pages_prepare(), it will not recognize that metadata was stored
> > during allocation.
> >
> 
> Right, with MEM_ALLOC_PROFILING enabled, I ran into the following
> WARNING when freeing all blocks except the 1st one (which contains the
> original head page):
> 
> [ 2101.713669] ------------[ cut here ]------------
> [ 2101.713670] alloc_tag was not set
> [ 2101.713671] WARNING: ./include/linux/alloc_tag.h:164 at
> __pgalloc_tag_sub+0xdf/0x160, CPU#18: hugetlb-mfr-3pa/33675
> [ 2101.713693] CPU: 18 UID: 0 PID: 33675 Comm: hugetlb-mfr-3pa
> Tainted: G S W O 6.19.0-smp-DEV #2 NONE
> [ 2101.713698] Tainted: [S]=CPU_OUT_OF_SPEC, [W]=WARN, [O]=OOT_MODULE
> [ 2101.713702] RIP: 0010:__pgalloc_tag_sub+0xdf/0x160
> ...
> [ 2101.713723] Call Trace:
> [ 2101.713725] <TASK>
> [ 2101.713727] free_has_hwpoison_pages+0xbc/0x370
> [ 2101.713731] free_frozen_pages+0xb3/0x100
> [ 2101.713733] __folio_put+0xd5/0x100
> [ 2101.713739] dissolve_free_hugetlb_folio+0x17f/0x1a0
> [ 2101.713743] filemap_offline_hwpoison_folio+0x193/0x4c0
> [ 2101.713747] ? __pfx_workingset_update_node+0x10/0x10
> [ 2101.713751] remove_inode_hugepages+0x209/0x690
> [ 2101.713757] ? on_each_cpu_cond_mask+0x1a/0x20
> [ 2101.713760] ? __cond_resched+0x23/0x60
> [ 2101.713768] ? n_tty_write+0x4c7/0x500
> [ 2101.713773] hugetlbfs_setattr+0x127/0x170
> [ 2101.713776] notify_change+0x32e/0x390
> [ 2101.713781] do_ftruncate+0x12c/0x1a0
> [ 2101.713786] __x64_sys_ftruncate+0x3e/0x70
> [ 2101.713789] do_syscall_64+0x6f/0x890
> [ 2101.713792] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [ 2101.713811] </TASK>
> [ 2101.713812] ---[ end trace 0000000000000000 ]---
> 
> This is because in free_pages_prepare(), pgalloc_tag_sub() found there
> is no alloc tag on the compound page being freeing.
> 
> > In general, I think they're not designed to handle cases where
> > the allocation order and the free order differ (unless we split
> > metadata like __split_unmapped_folio() does).
> 
> I believe the proper way to fix this is to do something similar to
> pgalloc_tag_split(), used by __split_unmapped_folio().

Yeah, that will work.

The difference is that pgalloc_tag_split(), split_page_owner(), and
split_page_memcg() only support splitting the whole page into many (> 1)
smaller pieces, but we're trying to create only one smaller page from the
original page.

> When we split a
> new block from the original folio, we create a compound page from the
> block (just the way prep_compound_page_to_free does) and link the
> alloc tag of the original head page to the head of the new compound
> page.
>
> Something like copy_alloc_tag (to be added in v3) below to demo
> my idea, assuming tag=pgalloc_tag_get(original head page):
> 
> +/*
> + * Point page's alloc tag to an existing one.
> + */
> +static void copy_alloc_tag(struct page *page, struct alloc_tag *tag)

This should be defined in lib/alloc_tag.c

> +{
> +       union pgtag_ref_handle handle;
> +       union codetag_ref ref;
> +       unsigned long pfn = page_to_pfn(page);
> +
> +       if (!mem_alloc_profiling_enabled())
> +               return;
> +
> +       /* tag is NULL if HugeTLB page is allocated in boot process. */
> +       if (!tag)
> +               return;
> +
> +       if (!get_page_tag_ref(page, &ref, &handle))
> +               return;
> +
> +       /* Avoid overriding existing alloc tag from page. */
> +       if (!ref.ct || is_codetag_empty(&ref)) {

Is it an error if the page already has a valid tag?

> +               alloc_tag_ref_set(&ref, tag);
> +               update_page_tag_ref(handle, &ref);
> +       }
> +       put_page_tag_ref(handle);
> +}
> +
> +static void prep_compound_page_to_free(struct page *head, unsigned int order,
> +                                      unsigned long flags, struct
> alloc_tag *tag)
> +{
> +       head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> +       head->mapping = NULL;
> +       head->private = 0;
> +
> +       clear_compound_head(head);
> +       if (order)
> +               prep_compound_page(head, order);
> +
> +       copy_alloc_tag(head, tag);

Do we need a similar thing for memcg? Probably not, since it should have
been uncharged before freeing (as long as we're not using it for kmem pages)
Perhaps a comment on why we don't need to split memcg will be enough.

Do we need a similar thing for page_owner? I think yes, although it won't
crash or cause a warning, it will be inconsistent if we split page_owner in
split_page()/__split_unmapped_folio()/etc but not in
prep_compound_page_to_free(). 

> +}
> 
> I tested now the WARNING from include/linux/alloc_tag.h:164 is gone
> for both 2M and 1G pages. BTW we also need to copy_alloc_tag() for
> HWPoison pages before pgalloc_tag_sub().
> 
> > > > > +             total_freed += nr_pages;
> > > > > +             curr = PageHWPoison(next) ? next + 1 : next;
> > > > > +     }
> > > > > +
> > > > > +     pr_info("Excluded %lu hwpoison pages from folio\n", total_hwp);
> > > > > +     pr_info("Freed %#lx pages from folio\n", total_freed);
> > > > > +}
> > > > > +
> > > > >  void free_frozen_pages(struct page *page, unsigned int order)
> > > > >  {
> > > > > +     struct folio *folio = page_folio(page);
> > > > > +
> > > > > +     if (order > 0 && unlikely(folio_test_has_hwpoisoned(folio))) {
> > > > > +             folio_clear_has_hwpoisoned(folio);
> > > > > +             free_has_hwpoison_pages(page, order);
> > > > > +             return;
> > > > > +     }
> > > > > +
> > > >
> > > > It feels like it's a bit random place to do has_hwpoisoned check.
> > > > Can we move this to free_pages_prepare() where we have some
> > > > sanity checks (and also order-0 hwpoison page handling)?
> > >
> > > While free_pages_prepare() seems to be a better place to do the
> > > has_hwpoisoned check, it is not a good place to do
> > > free_has_hwpoison_pages().
> >
> > Why is it not a good place for free_has_hwpoison_pages()?
> >
> > Callers of free_pages_prepare() are supposed to avoid freeing it back to
> > the buddy or using the page when it returns false.
> 
> What I mean is, callers of free_pages_prepare() wouldn't know from the
> single false return value whether 1) they should completely bail out
> or 2) they should retry with free_has_hwpoison_pages.

I was thinking that once free_pages_prepare() returns false, it already
has done all the work to isolate HWPoison pages and freed healthy portions,
so the caller doesn't have to do anything and can bail out completely.

> So for now I
> think it'd better that free_frozen_pages does the check and act upon
> the check result. Not sure if there is a better place, or worthy to
> change free_pages_prepare()'s return type?
> 
> > ...except compaction_free(), which I don't have much idea what it's
> > doing.
> >
> > > > >       __free_frozen_pages(page, order, FPI_NONE);
> > > > >  }

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio
  2025-12-19 18:33 ` [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio Jiaqi Yan
  2025-12-23  5:14   ` Harry Yoo
@ 2025-12-23  7:45   ` Miaohe Lin
  2025-12-27  1:50     ` Jiaqi Yan
  1 sibling, 1 reply; 14+ messages in thread
From: Miaohe Lin @ 2025-12-23  7:45 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: nao.horiguchi, david, lorenzo.stoakes, william.roche, tony.luck,
	wangkefeng.wang, jane.chu, akpm, osalvador, muchun.song,
	rientjes, duenwen, jthoughton, linux-mm, linux-kernel,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, jackmanb, hannes,
	ziy, harry.yoo, willy

On 2025/12/20 2:33, Jiaqi Yan wrote:
> At the end of dissolve_free_hugetlb_folio that a free HugeTLB
> folio becomes non-HugeTLB, it is released to buddy allocator
> as a high-order folio, e.g. a folio that contains 262144 pages
> if the folio was a 1G HugeTLB hugepage.
> 
> This is problematic if the HugeTLB hugepage contained HWPoison
> subpages. In that case, since buddy allocator does not check
> HWPoison for non-zero-order folio, the raw HWPoison page can
> be given out with its buddy page and be re-used by either
> kernel or userspace.
> 
> Memory failure recovery (MFR) in kernel does attempt to take
> raw HWPoison page off buddy allocator after
> dissolve_free_hugetlb_folio. However, there is always a time
> window between dissolve_free_hugetlb_folio frees a HWPoison
> high-order folio to buddy allocator and MFR takes HWPoison
> raw page off buddy allocator.
> 
> One obvious way to avoid this problem is to add page sanity
> checks in page allocate or free path. However, it is against
> the past efforts to reduce sanity check overhead [1,2,3].
> 
> Introduce free_has_hwpoison_pages to only free the healthy
> pages and excludes the HWPoison ones in the high-order folio.
> The idea is to iterate through the sub-pages of the folio to
> identify contiguous ranges of healthy pages. Instead of freeing
> pages one by one, decompose healthy ranges into the largest
> possible blocks. Each block meets the requirements to be freed
> to buddy allocator (__free_frozen_pages).
> 
> free_has_hwpoison_pages has linear time complexity O(N) wrt the
> number of pages in the folio. While the power-of-two decomposition
> ensures that the number of calls to the buddy allocator is
> logarithmic for each contiguous healthy range, the mandatory
> linear scan of pages to identify PageHWPoison defines the
> overall time complexity.
> 

Thanks for your patch.

> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> 
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> ---
>  mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 101 insertions(+)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 822e05f1a9646..20c8862ce594e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2976,8 +2976,109 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
>  	}
>  }
>  
> +static void prepare_compound_page_to_free(struct page *new_head,
> +					  unsigned int order,
> +					  unsigned long flags)
> +{
> +	new_head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> +	new_head->mapping = NULL;
> +	new_head->private = 0;
> +
> +	clear_compound_head(new_head);
> +	if (order)
> +		prep_compound_page(new_head, order);
> +}
> +
> +/*
> + * Given a range of pages physically contiguous physical, efficiently
> + * free them in blocks that meet __free_frozen_pages's requirements.
> + */
> +static void free_contiguous_pages(struct page *curr, struct page *next,
> +				  unsigned long flags)
> +{
> +	unsigned int order;
> +	unsigned int align_order;
> +	unsigned int size_order;
> +	unsigned long pfn;
> +	unsigned long end_pfn = page_to_pfn(next);
> +	unsigned long remaining;
> +
> +	/*
> +	 * This decomposition algorithm at every iteration chooses the
> +	 * order to be the minimum of two constraints:
> +	 * - Alignment: the largest power-of-two that divides current pfn.
> +	 * - Size: the largest power-of-two that fits in the
> +	 *   current remaining number of pages.
> +	 */
> +	while (curr < next) {
> +		pfn = page_to_pfn(curr);
> +		remaining = end_pfn - pfn;
> +
> +		align_order = ffs(pfn) - 1;
> +		size_order = fls_long(remaining) - 1;
> +		order = min(align_order, size_order);
> +
> +		prepare_compound_page_to_free(curr, order, flags);> +		__free_frozen_pages(curr, order, FPI_NONE);
> +		curr += (1UL << order);

For hwpoisoned pages, nothing is done for them. I think we should run at least
some portion of code snippet from free_pages_prepare():

if (unlikely(PageHWPoison(page)) && !order) {
		/* Do not let hwpoison pages hit pcplists/buddy */
		reset_page_owner(page, order);
		page_table_check_free(page, order);
		pgalloc_tag_sub(page, 1 << order);

		/*
		 * The page is isolated and accounted for.
		 * Mark the codetag as empty to avoid accounting error
		 * when the page is freed by unpoison_memory().
		 */
		clear_page_tag_ref(page);
		return false;
	}

> +	}
> +
> +	VM_WARN_ON(curr != next);
> +}
> +
> +/*
> + * Given a high-order compound page containing certain number of HWPoison
> + * pages, free only the healthy ones to buddy allocator.
> + *
> + * It calls __free_frozen_pages O(2^order) times and cause nontrivial
> + * overhead. So only use this when compound page really contains HWPoison.
> + *
> + * This implementation doesn't work in memdesc world.
> + */
> +static void free_has_hwpoison_pages(struct page *page, unsigned int order)
> +{
> +	struct page *curr = page;
> +	struct page *end = page + (1 << order);
> +	struct page *next;
> +	unsigned long flags = page->flags.f;
> +	unsigned long nr_pages;
> +	unsigned long total_freed = 0;
> +	unsigned long total_hwp = 0;
> +
> +	VM_WARN_ON(flags & PAGE_FLAGS_CHECK_AT_FREE);
> +
> +	while (curr < end) {
> +		next = curr;
> +		nr_pages = 0;
> +
> +		while (next < end && !PageHWPoison(next)) {
> +			++next;
> +			++nr_pages;
> +		}
> +
> +		if (PageHWPoison(next))
Would it be possible next points to end? In that case, irrelevant even nonexistent page
will be accessed ?

Thanks.
.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio
  2025-12-23  7:45   ` Miaohe Lin
@ 2025-12-27  1:50     ` Jiaqi Yan
  0 siblings, 0 replies; 14+ messages in thread
From: Jiaqi Yan @ 2025-12-27  1:50 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: nao.horiguchi, david, lorenzo.stoakes, william.roche, tony.luck,
	wangkefeng.wang, jane.chu, akpm, osalvador, muchun.song,
	rientjes, duenwen, jthoughton, linux-mm, linux-kernel,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, jackmanb, hannes,
	ziy, harry.yoo, willy

On Mon, Dec 22, 2025 at 11:45 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2025/12/20 2:33, Jiaqi Yan wrote:
> > At the end of dissolve_free_hugetlb_folio that a free HugeTLB
> > folio becomes non-HugeTLB, it is released to buddy allocator
> > as a high-order folio, e.g. a folio that contains 262144 pages
> > if the folio was a 1G HugeTLB hugepage.
> >
> > This is problematic if the HugeTLB hugepage contained HWPoison
> > subpages. In that case, since buddy allocator does not check
> > HWPoison for non-zero-order folio, the raw HWPoison page can
> > be given out with its buddy page and be re-used by either
> > kernel or userspace.
> >
> > Memory failure recovery (MFR) in kernel does attempt to take
> > raw HWPoison page off buddy allocator after
> > dissolve_free_hugetlb_folio. However, there is always a time
> > window between dissolve_free_hugetlb_folio frees a HWPoison
> > high-order folio to buddy allocator and MFR takes HWPoison
> > raw page off buddy allocator.
> >
> > One obvious way to avoid this problem is to add page sanity
> > checks in page allocate or free path. However, it is against
> > the past efforts to reduce sanity check overhead [1,2,3].
> >
> > Introduce free_has_hwpoison_pages to only free the healthy
> > pages and excludes the HWPoison ones in the high-order folio.
> > The idea is to iterate through the sub-pages of the folio to
> > identify contiguous ranges of healthy pages. Instead of freeing
> > pages one by one, decompose healthy ranges into the largest
> > possible blocks. Each block meets the requirements to be freed
> > to buddy allocator (__free_frozen_pages).
> >
> > free_has_hwpoison_pages has linear time complexity O(N) wrt the
> > number of pages in the folio. While the power-of-two decomposition
> > ensures that the number of calls to the buddy allocator is
> > logarithmic for each contiguous healthy range, the mandatory
> > linear scan of pages to identify PageHWPoison defines the
> > overall time complexity.
> >
>
> Thanks for your patch.

Thanks for your review/comments!

>
> > [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> > [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> > [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> >
> > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > ---
> >  mm/page_alloc.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 101 insertions(+)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 822e05f1a9646..20c8862ce594e 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2976,8 +2976,109 @@ static void __free_frozen_pages(struct page *page, unsigned int order,
> >       }
> >  }
> >
> > +static void prepare_compound_page_to_free(struct page *new_head,
> > +                                       unsigned int order,
> > +                                       unsigned long flags)
> > +{
> > +     new_head->flags.f = flags & (~PAGE_FLAGS_CHECK_AT_FREE);
> > +     new_head->mapping = NULL;
> > +     new_head->private = 0;
> > +
> > +     clear_compound_head(new_head);
> > +     if (order)
> > +             prep_compound_page(new_head, order);
> > +}
> > +
> > +/*
> > + * Given a range of pages physically contiguous physical, efficiently
> > + * free them in blocks that meet __free_frozen_pages's requirements.
> > + */
> > +static void free_contiguous_pages(struct page *curr, struct page *next,
> > +                               unsigned long flags)
> > +{
> > +     unsigned int order;
> > +     unsigned int align_order;
> > +     unsigned int size_order;
> > +     unsigned long pfn;
> > +     unsigned long end_pfn = page_to_pfn(next);
> > +     unsigned long remaining;
> > +
> > +     /*
> > +      * This decomposition algorithm at every iteration chooses the
> > +      * order to be the minimum of two constraints:
> > +      * - Alignment: the largest power-of-two that divides current pfn.
> > +      * - Size: the largest power-of-two that fits in the
> > +      *   current remaining number of pages.
> > +      */
> > +     while (curr < next) {
> > +             pfn = page_to_pfn(curr);
> > +             remaining = end_pfn - pfn;
> > +
> > +             align_order = ffs(pfn) - 1;
> > +             size_order = fls_long(remaining) - 1;
> > +             order = min(align_order, size_order);
> > +
> > +             prepare_compound_page_to_free(curr, order, flags);
> > +             __free_frozen_pages(curr, order, FPI_NONE);
> > +             curr += (1UL << order);
>
> For hwpoisoned pages, nothing is done for them. I think we should run at least
> some portion of code snippet from free_pages_prepare():

Agreed, will add in v3.

>
> if (unlikely(PageHWPoison(page)) && !order) {
>                 /* Do not let hwpoison pages hit pcplists/buddy */
>                 reset_page_owner(page, order);
>                 page_table_check_free(page, order);
>                 pgalloc_tag_sub(page, 1 << order);
>
>                 /*
>                  * The page is isolated and accounted for.
>                  * Mark the codetag as empty to avoid accounting error
>                  * when the page is freed by unpoison_memory().
>                  */
>                 clear_page_tag_ref(page);
>                 return false;
>         }
>
> > +     }
> > +
> > +     VM_WARN_ON(curr != next);
> > +}
> > +
> > +/*
> > + * Given a high-order compound page containing certain number of HWPoison
> > + * pages, free only the healthy ones to buddy allocator.
> > + *
> > + * It calls __free_frozen_pages O(2^order) times and cause nontrivial
> > + * overhead. So only use this when compound page really contains HWPoison.
> > + *
> > + * This implementation doesn't work in memdesc world.
> > + */
> > +static void free_has_hwpoison_pages(struct page *page, unsigned int order)
> > +{
> > +     struct page *curr = page;
> > +     struct page *end = page + (1 << order);
> > +     struct page *next;
> > +     unsigned long flags = page->flags.f;
> > +     unsigned long nr_pages;
> > +     unsigned long total_freed = 0;
> > +     unsigned long total_hwp = 0;
> > +
> > +     VM_WARN_ON(flags & PAGE_FLAGS_CHECK_AT_FREE);
> > +
> > +     while (curr < end) {
> > +             next = curr;
> > +             nr_pages = 0;
> > +
> > +             while (next < end && !PageHWPoison(next)) {
> > +                     ++next;
> > +                     ++nr_pages;
> > +             }
> > +
> > +             if (PageHWPoison(next))
> Would it be possible next points to end? In that case, irrelevant even nonexistent page
> will be accessed ?

Thanks for catching that. Let me avoid access end as a page at all,
both here and in free_contiguous_pages.

>
> Thanks.
> .


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 3/3] mm/memory-failure: simplify __page_handle_poison
  2025-12-19 18:33 [PATCH v2 0/3] Only free healthy pages in high-order HWPoison folio Jiaqi Yan
  2025-12-19 18:33 ` [PATCH v2 1/3] mm/memory-failure: set has_hwpoisoned flags on HugeTLB folio Jiaqi Yan
  2025-12-19 18:33 ` [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio Jiaqi Yan
@ 2025-12-19 18:33 ` Jiaqi Yan
  2025-12-22 22:12   ` Matthew Wilcox
  2025-12-22 22:06 ` [PATCH v2 0/3] Only free healthy pages in high-order HWPoison folio Jiaqi Yan
  3 siblings, 1 reply; 14+ messages in thread
From: Jiaqi Yan @ 2025-12-19 18:33 UTC (permalink / raw)
  To: jackmanb, hannes, linmiaohe, ziy, harry.yoo, willy
  Cc: nao.horiguchi, david, lorenzo.stoakes, william.roche, tony.luck,
	wangkefeng.wang, jane.chu, akpm, osalvador, muchun.song,
	rientjes, duenwen, jthoughton, linux-mm, linux-kernel,
	Liam.Howlett, vbabka, rppt, surenb, mhocko, Jiaqi Yan

Now that no HWPoison page will be given away to buddy allocator
at the end of dissolve_free_hugetlb_folio, there is no need to
drain_all_pages and take_page_off_buddy anymore, so remove them.

Also make __page_handle_poison return either 0 for success or
negative for failure, following the convention for functions
that perform an action.

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 mm/memory-failure.c | 31 +++++--------------------------
 1 file changed, 5 insertions(+), 26 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d204de6c9792a..54ea840ded162 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -164,33 +164,12 @@ static DEFINE_MUTEX(pfn_space_lock);
 
 /*
  * Return values:
- *   1:   the page is dissolved (if needed) and taken off from buddy,
- *   0:   the page is dissolved (if needed) and not taken off from buddy,
+ *   = 0: the page is dissolved (if needed)
  *   < 0: failed to dissolve.
  */
 static int __page_handle_poison(struct page *page)
 {
-	int ret;
-
-	/*
-	 * zone_pcp_disable() can't be used here. It will
-	 * hold pcp_batch_high_lock and dissolve_free_hugetlb_folio() might hold
-	 * cpu_hotplug_lock via static_key_slow_dec() when hugetlb vmemmap
-	 * optimization is enabled. This will break current lock dependency
-	 * chain and leads to deadlock.
-	 * Disabling pcp before dissolving the page was a deterministic
-	 * approach because we made sure that those pages cannot end up in any
-	 * PCP list. Draining PCP lists expels those pages to the buddy system,
-	 * but nothing guarantees that those pages do not get back to a PCP
-	 * queue if we need to refill those.
-	 */
-	ret = dissolve_free_hugetlb_folio(page_folio(page));
-	if (!ret) {
-		drain_all_pages(page_zone(page));
-		ret = take_page_off_buddy(page);
-	}
-
-	return ret;
+	return dissolve_free_hugetlb_folio(page_folio(page));
 }
 
 static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
@@ -200,7 +179,7 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
 		 * Doing this check for free pages is also fine since
 		 * dissolve_free_hugetlb_folio() returns 0 for non-hugetlb folios as well.
 		 */
-		if (__page_handle_poison(page) <= 0)
+		if (__page_handle_poison(page) < 0)
 			/*
 			 * We could fail to take off the target page from buddy
 			 * for example due to racy page allocation, but that's
@@ -1174,7 +1153,7 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 		 * subpages.
 		 */
 		folio_put(folio);
-		if (__page_handle_poison(p) > 0) {
+		if (!__page_handle_poison(p)) {
 			page_ref_inc(p);
 			res = MF_RECOVERED;
 		} else {
@@ -2067,7 +2046,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
 	 */
 	if (res == 0) {
 		folio_unlock(folio);
-		if (__page_handle_poison(p) > 0) {
+		if (!__page_handle_poison(p)) {
 			page_ref_inc(p);
 			res = MF_RECOVERED;
 		} else {
-- 
2.52.0.322.g1dd061c0dc-goog



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 3/3] mm/memory-failure: simplify __page_handle_poison
  2025-12-19 18:33 ` [PATCH v2 3/3] mm/memory-failure: simplify __page_handle_poison Jiaqi Yan
@ 2025-12-22 22:12   ` Matthew Wilcox
  2025-12-22 23:13     ` Jiaqi Yan
  0 siblings, 1 reply; 14+ messages in thread
From: Matthew Wilcox @ 2025-12-22 22:12 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: jackmanb, hannes, linmiaohe, ziy, harry.yoo, nao.horiguchi,
	david, lorenzo.stoakes, william.roche, tony.luck,
	wangkefeng.wang, jane.chu, akpm, osalvador, muchun.song,
	rientjes, duenwen, jthoughton, linux-mm, linux-kernel,
	Liam.Howlett, vbabka, rppt, surenb, mhocko

On Fri, Dec 19, 2025 at 06:33:46PM +0000, Jiaqi Yan wrote:
> Now that no HWPoison page will be given away to buddy allocator
> at the end of dissolve_free_hugetlb_folio, there is no need to
> drain_all_pages and take_page_off_buddy anymore, so remove them.

What if we discover a hardware error in a page which is currently
in the buddy system?  Why don't we need to remove that page from the
buddy system?


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 3/3] mm/memory-failure: simplify __page_handle_poison
  2025-12-22 22:12   ` Matthew Wilcox
@ 2025-12-22 23:13     ` Jiaqi Yan
  0 siblings, 0 replies; 14+ messages in thread
From: Jiaqi Yan @ 2025-12-22 23:13 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: jackmanb, hannes, linmiaohe, ziy, harry.yoo, nao.horiguchi,
	david, lorenzo.stoakes, william.roche, tony.luck,
	wangkefeng.wang, jane.chu, akpm, osalvador, muchun.song,
	rientjes, duenwen, jthoughton, linux-mm, linux-kernel,
	Liam.Howlett, vbabka, rppt, surenb, mhocko

On Mon, Dec 22, 2025 at 2:12 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Fri, Dec 19, 2025 at 06:33:46PM +0000, Jiaqi Yan wrote:
> > Now that no HWPoison page will be given away to buddy allocator
> > at the end of dissolve_free_hugetlb_folio, there is no need to
> > drain_all_pages and take_page_off_buddy anymore, so remove them.
>
> What if we discover a hardware error in a page which is currently
> in the buddy system?  Why don't we need to remove that page from the
> buddy system?

Thanks for your comment.

memory_failure() explicitly handles is_free_buddy_page() case and is
removing from buddy allocator with take_page_off_buddy() directly.

However, when soft_offline_page() handles free page with
page_handle_poison(), removing from buddy allocator is missing with
this patch.

I will fix this in v3. One way is to split hugepage_or_freepage and
only take_page_off_buddy() for non-huge free page.


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v2 0/3] Only free healthy pages in high-order HWPoison folio
  2025-12-19 18:33 [PATCH v2 0/3] Only free healthy pages in high-order HWPoison folio Jiaqi Yan
                   ` (2 preceding siblings ...)
  2025-12-19 18:33 ` [PATCH v2 3/3] mm/memory-failure: simplify __page_handle_poison Jiaqi Yan
@ 2025-12-22 22:06 ` Jiaqi Yan
  3 siblings, 0 replies; 14+ messages in thread
From: Jiaqi Yan @ 2025-12-22 22:06 UTC (permalink / raw)
  To: jackmanb, hannes, linmiaohe, ziy, harry.yoo, willy
  Cc: nao.horiguchi, david, lorenzo.stoakes, william.roche, tony.luck,
	wangkefeng.wang, jane.chu, akpm, osalvador, muchun.song,
	rientjes, duenwen, jthoughton, linux-mm, linux-kernel,
	Liam.Howlett, vbabka, rppt, surenb, mhocko

On Fri, Dec 19, 2025 at 10:33 AM Jiaqi Yan <jiaqiyan@google.com> wrote:
>
> At the end of dissolve_free_hugetlb_folio that a free HugeTLB
> folio becomes non-HugeTLB, it is released to buddy allocator
> as a high-order folio, e.g. a folio that contains 262144 pages
> if the folio was a 1G HugeTLB hugepage.
>
> This is problematic if the HugeTLB hugepage contained HWPoison
> subpages. In that case, since buddy allocator does not check
> HWPoison for non-zero-order folio, the raw HWPoison page can
> be given out with its buddy page and be re-used by either
> kernel or userspace.
>
> Memory failure recovery (MFR) in kernel does attempt to take
> raw HWPoison page off buddy allocator after
> dissolve_free_hugetlb_folio. However, there is always a time
> window between dissolve_free_hugetlb_folio frees a HWPoison
> high-order folio to buddy allocator and MFR takes HWPoison
> raw page off buddy allocator.
>
> One obvious way to avoid this problem is to add page sanity
> checks in page allocate or free path. However, it is against
> the past efforts to reduce sanity check overhead [1,2,3].
>
> Introduce free_has_hwpoison_pages to only free the healthy
> pages and excludes the HWPoison ones in the high-order folio.
> The idea is to iterate through the sub-pages of the folio to
> identify contiguous ranges of healthy pages. Instead of freeing
> pages one by one, decompose healthy ranges into the largest
> possible blocks. Each block meets the requirements to be freed
> to buddy allocator by calling __free_frozen_pages directly.
>
> free_has_hwpoison_pages has linear time complexity O(N) wrt the
> number of pages in the folio. While the power-of-two decomposition
> ensures that the number of calls to the buddy allocator is
> logarithmic for each contiguous healthy range, the mandatory
> linear scan of pages to identify PageHWPoison defines the
> overall time complexity.
>
> I tested with some test-only code [4] and hugetlb-mfr [5], by
> checking the status of pcplist and freelist immediately after
> dissolve_free_hugetlb_folio a free hugetlb page that contains
> 3 HWPoison raw pages:
>
> * HWPoison pages are excluded by free_has_hwpoison_pages.
>
> * Some healthy pages can be in zone->per_cpu_pageset (pcplist)
>   because pcp_count is not high enough.
>
> * Many healthy pages are already in some order's
>   zone->free_area[order].free_list (freelist).
>
> * In rare cases, some healthy pages are in neither pcplist
>   nor freelist. My best guest is they are allocated before
>   the test checks.

Sorry, just realized changelog is missing. Appending it here:

Changelog

v1 [6] => v2:
- Total reimplementation based on discussions with Mathew Wilcox,
  Harry Hoo, Zi Yan etc.
- hugetlb_free_hwpoison_folio => free_has_hwpoison_pages.
- Utilize has_hwpoisoned flag to tell buddy allocator a high-order
  folio contains HWPoison.
- Simplify __page_handle_poison given that HWPoison page
  won't be freed within the high-order folio.

>
> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> [4] https://drive.google.com/file/d/1CzJn1Cc4wCCm183Y77h244fyZIkTLzCt/view?usp=sharing
> [5] https://lore.kernel.org/linux-mm/20251116013223.1557158-3-jiaqiyan@google.com

[6] https://lore.kernel.org/linux-mm/20251116014721.1561456-1-jiaqiyan@google.com

>
> Jiaqi Yan (3):
>   mm/memory-failure: set has_hwpoisoned flags on HugeTLB folio
>   mm/page_alloc: only free healthy pages in high-order HWPoison folio
>   mm/memory-failure: simplify __page_handle_poison
>
>  include/linux/page-flags.h |   2 +-
>  mm/memory-failure.c        |  32 +++---------
>  mm/page_alloc.c            | 101 +++++++++++++++++++++++++++++++++++++
>  3 files changed, 108 insertions(+), 27 deletions(-)
>
> --
> 2.52.0.322.g1dd061c0dc-goog
>


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2025-12-31  4:38 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-12-19 18:33 [PATCH v2 0/3] Only free healthy pages in high-order HWPoison folio Jiaqi Yan
2025-12-19 18:33 ` [PATCH v2 1/3] mm/memory-failure: set has_hwpoisoned flags on HugeTLB folio Jiaqi Yan
2025-12-19 18:33 ` [PATCH v2 2/3] mm/page_alloc: only free healthy pages in high-order HWPoison folio Jiaqi Yan
2025-12-23  5:14   ` Harry Yoo
2025-12-27  1:50     ` Jiaqi Yan
2025-12-29  1:15       ` Harry Yoo
2025-12-31  0:19         ` Jiaqi Yan
2025-12-31  4:37           ` Harry Yoo
2025-12-23  7:45   ` Miaohe Lin
2025-12-27  1:50     ` Jiaqi Yan
2025-12-19 18:33 ` [PATCH v2 3/3] mm/memory-failure: simplify __page_handle_poison Jiaqi Yan
2025-12-22 22:12   ` Matthew Wilcox
2025-12-22 23:13     ` Jiaqi Yan
2025-12-22 22:06 ` [PATCH v2 0/3] Only free healthy pages in high-order HWPoison folio Jiaqi Yan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox