linux-mm.kvack.org archive mirror
* [PATCH v1 0/2] Only free healthy pages in high-order HWPoison folio
@ 2025-11-16  1:47 Jiaqi Yan
  2025-11-16  1:47 ` [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order Jiaqi Yan
  2025-11-16  1:47 ` [PATCH v1 2/2] mm/memory-failure: avoid free HWPoison high-order folio Jiaqi Yan
  0 siblings, 2 replies; 21+ messages in thread
From: Jiaqi Yan @ 2025-11-16  1:47 UTC (permalink / raw)
  To: nao.horiguchi, linmiaohe, ziy
  Cc: david, lorenzo.stoakes, william.roche, harry.yoo, tony.luck,
	wangkefeng.wang, willy, jane.chu, akpm, osalvador, muchun.song,
	linux-mm, linux-kernel, linux-fsdevel, Jiaqi Yan

At the end of dissolve_free_hugetlb_folio, when a free HugeTLB
folio becomes non-HugeTLB, it is released to the buddy allocator
as a high-order folio, e.g. a folio that contains 262144 pages
if the folio was a 1G HugeTLB hugepage.

This is problematic if the HugeTLB hugepage contained HWPoison
subpages. In that case, since the buddy allocator does not check
HWPoison for non-zero-order folios, the raw HWPoison page can
be handed out together with its buddy page and be re-used by
either the kernel or userspace.

Memory failure recovery (MFR) in the kernel does attempt to take
the raw HWPoison page off the buddy allocator after
dissolve_free_hugetlb_folio. However, there is always a time
window between dissolve_free_hugetlb_folio freeing a HWPoison
high-order folio to the buddy allocator and MFR taking the raw
HWPoison page off it.

One obvious way to avoid this problem is to add page sanity
checks in the page allocation or free path. However, that goes
against past efforts to reduce sanity check overhead [1,2,3].

Introduce hugetlb_free_hwpoison_folio to solve this problem.
The idea is: when a HugeTLB folio is known to contain HWPoison
pages, first split the now non-HugeTLB high-order folio uniformly
into 0-order folios, then let the healthy pages join the buddy
allocator while rejecting the HWPoison ones.

I tested with some test-only code [4] and hugetlb-mfr [5], by
checking the pcplist and freelist stats immediately after
hugetlb_free_hwpoison_folio. After dealing with a HugeTLB folio
that contains 3 HWPoison raw pages, the pages that used to be in
the folio end up in one of four states:

* Some pages can still be in zone->per_cpu_pageset (pcplist)
  because the pcp count is not high enough.

* Many others are, after merging, in some order's
  zone->free_area[order].free_list (freelist).

* There may be some pages in neither the pcplist nor the freelist.
  My best guess is that they have already been allocated.

* The 3 HWPoison pages are found in neither the pcplist nor the freelist.

For example:

* When hugepagesize=2M, all 509 healthy 0-order pages are placed
  in the pcplist, and no page from the hugepage is in the freelist.

* When hugepagesize=1G, in one of the tests, I observed that
  262069 pages are merged into buddy blocks of orders 0 to 10,
  72 are in the pcplist, and the 3 HWPoison ones are isolated.
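
(As a sanity check: 262069 + 72 + 3 = 262144, so every page of the
former 1G folio is accounted for.)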

[1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
[2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
[3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
[4] https://drive.google.com/file/d/1CzJn1Cc4wCCm183Y77h244fyZIkTLzCt/view?usp=sharing
[5] https://lore.kernel.org/linux-mm/20251116013223.1557158-3-jiaqiyan@google.com

Jiaqi Yan (2):
  mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order
  mm/memory-failure: avoid free HWPoison high-order folio

 include/linux/huge_mm.h |  6 ++++++
 include/linux/hugetlb.h |  4 ++++
 mm/huge_memory.c        |  8 ++++++++
 mm/hugetlb.c            |  8 ++++++--
 mm/memory-failure.c     | 43 +++++++++++++++++++++++++++++++++++++++++
 5 files changed, 67 insertions(+), 2 deletions(-)

-- 
2.52.0.rc1.455.g30608eb744-goog




* [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order
  2025-11-16  1:47 [PATCH v1 0/2] Only free healthy pages in high-order HWPoison folio Jiaqi Yan
@ 2025-11-16  1:47 ` Jiaqi Yan
  2025-11-16 11:51   ` Matthew Wilcox
                     ` (2 more replies)
  2025-11-16  1:47 ` [PATCH v1 2/2] mm/memory-failure: avoid free HWPoison high-order folio Jiaqi Yan
  1 sibling, 3 replies; 21+ messages in thread
From: Jiaqi Yan @ 2025-11-16  1:47 UTC (permalink / raw)
  To: nao.horiguchi, linmiaohe, ziy
  Cc: david, lorenzo.stoakes, william.roche, harry.yoo, tony.luck,
	wangkefeng.wang, willy, jane.chu, akpm, osalvador, muchun.song,
	linux-mm, linux-kernel, linux-fsdevel, Jiaqi Yan

When freeing a high-order folio that contains HWPoison pages,
to ensure these HWPoison pages are not added to the buddy
allocator, we can first uniformly split the free and unmapped
high-order folio into 0-order folios, then add only the
non-HWPoison folios to the buddy allocator and exclude the
HWPoison ones.

Introduce uniform_split_unmapped_folio_to_zero_order, a wrapper
around the existing __split_unmapped_folio. Callers can use it to
uniformly split an unmapped high-order folio into 0-order folios.

No functional change. It will be used in a subsequent commit.

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 include/linux/huge_mm.h | 6 ++++++
 mm/huge_memory.c        | 8 ++++++++
 2 files changed, 14 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 71ac78b9f834f..ef6a84973e157 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -365,6 +365,7 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 		vm_flags_t vm_flags);
 
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
+int uniform_split_unmapped_folio_to_zero_order(struct folio *folio);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order);
 int min_order_for_split(struct folio *folio);
@@ -569,6 +570,11 @@ can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
 {
 	return false;
 }
+static inline int uniform_split_unmapped_folio_to_zero_order(struct folio *folio)
+{
+	VM_WARN_ON_ONCE_PAGE(1, page);
+	return -EINVAL;
+}
 static inline int
 split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 323654fb4f8cf..c7b6c1c75a18e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3515,6 +3515,14 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 	return ret;
 }
 
+int uniform_split_unmapped_folio_to_zero_order(struct folio *folio)
+{
+	return __split_unmapped_folio(folio, /*new_order=*/0,
+				      /*split_at=*/&folio->page,
+				      /*xas=*/NULL, /*mapping=*/NULL,
+				      /*uniform_split=*/true);
+}
+
 bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
 		bool warns)
 {
-- 
2.52.0.rc1.455.g30608eb744-goog




* [PATCH v1 2/2] mm/memory-failure: avoid free HWPoison high-order folio
  2025-11-16  1:47 [PATCH v1 0/2] Only free healthy pages in high-order HWPoison folio Jiaqi Yan
  2025-11-16  1:47 ` [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order Jiaqi Yan
@ 2025-11-16  1:47 ` Jiaqi Yan
  2025-11-16  2:10   ` Zi Yan
  2025-11-17 17:15   ` David Hildenbrand (Red Hat)
  1 sibling, 2 replies; 21+ messages in thread
From: Jiaqi Yan @ 2025-11-16  1:47 UTC (permalink / raw)
  To: nao.horiguchi, linmiaohe, ziy
  Cc: david, lorenzo.stoakes, william.roche, harry.yoo, tony.luck,
	wangkefeng.wang, willy, jane.chu, akpm, osalvador, muchun.song,
	linux-mm, linux-kernel, linux-fsdevel, Jiaqi Yan

At the end of dissolve_free_hugetlb_folio, when a free HugeTLB
folio becomes non-HugeTLB, it is released to the buddy allocator
as a high-order folio, e.g. a folio that contains 262144 pages
if the folio was a 1G HugeTLB hugepage.

This is problematic if the HugeTLB hugepage contained HWPoison
subpages. In that case, since the buddy allocator does not check
HWPoison for non-zero-order folios, the raw HWPoison page can
be handed out together with its buddy page and be re-used by
either the kernel or userspace.

Memory failure recovery (MFR) in the kernel does attempt to take
the raw HWPoison page off the buddy allocator after
dissolve_free_hugetlb_folio. However, there is always a time
window between the folio being freed to the buddy allocator and
the raw page being taken off it.

One obvious way to avoid this problem is to add page sanity
checks in the page allocation or free path. However, that goes
against past efforts to reduce sanity check overhead [1,2,3].

Introduce hugetlb_free_hwpoison_folio to solve this problem.
The idea is: when a HugeTLB folio is known to contain HWPoison
page(s), first split the now non-HugeTLB high-order folio
uniformly into 0-order folios, then let the healthy pages join
the buddy allocator while rejecting the HWPoison ones.

[1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
[2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
[3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 include/linux/hugetlb.h |  4 ++++
 mm/hugetlb.c            |  8 ++++++--
 mm/memory-failure.c     | 43 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+), 2 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8e63e46b8e1f0..e1c334a7db2fe 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -870,8 +870,12 @@ int dissolve_free_hugetlb_folios(unsigned long start_pfn,
 				    unsigned long end_pfn);
 
 #ifdef CONFIG_MEMORY_FAILURE
+extern void hugetlb_free_hwpoison_folio(struct folio *folio);
 extern void folio_clear_hugetlb_hwpoison(struct folio *folio);
 #else
+static inline void hugetlb_free_hwpoison_folio(struct folio *folio)
+{
+}
 static inline void folio_clear_hugetlb_hwpoison(struct folio *folio)
 {
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 0455119716ec0..801ca1a14c0f0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1596,6 +1596,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
 						struct folio *folio)
 {
 	bool clear_flag = folio_test_hugetlb_vmemmap_optimized(folio);
+	bool has_hwpoison = folio_test_hwpoison(folio);
 
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;
@@ -1638,12 +1639,15 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
 	 * Move PageHWPoison flag from head page to the raw error pages,
 	 * which makes any healthy subpages reusable.
 	 */
-	if (unlikely(folio_test_hwpoison(folio)))
+	if (unlikely(has_hwpoison))
 		folio_clear_hugetlb_hwpoison(folio);
 
 	folio_ref_unfreeze(folio, 1);
 
-	hugetlb_free_folio(folio);
+	if (unlikely(has_hwpoison))
+		hugetlb_free_hwpoison_folio(folio);
+	else
+		hugetlb_free_folio(folio);
 }
 
 /*
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3edebb0cda30b..e6a9deba6292a 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2002,6 +2002,49 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
 	return ret;
 }
 
+void hugetlb_free_hwpoison_folio(struct folio *folio)
+{
+	struct folio *curr, *next;
+	struct folio *end_folio = folio_next(folio);
+	int ret;
+
+	VM_WARN_ON_FOLIO(folio_ref_count(folio) != 1, folio);
+
+	ret = uniform_split_unmapped_folio_to_zero_order(folio);
+	if (ret) {
+		/*
+		 * In case of split failure, none of the pages in folio
+		 * will be freed to buddy allocator.
+		 */
+		pr_err("%#lx: failed to split free %d-order folio with HWPoison page(s): %d\n",
+		       folio_pfn(folio), folio_order(folio), ret);
+		return;
+	}
+
+	/* Expect 1st folio's refcount==1, and other's refcount==0. */
+	for (curr = folio; curr != end_folio; curr = next) {
+		next = folio_next(curr);
+
+		VM_WARN_ON_FOLIO(folio_order(curr), curr);
+
+		if (PageHWPoison(&curr->page)) {
+			if (curr != folio)
+				folio_ref_inc(curr);
+
+			VM_WARN_ON_FOLIO(folio_ref_count(curr) != 1, curr);
+			pr_warn("%#lx: prevented freeing HWPoison page\n",
+				folio_pfn(curr));
+			continue;
+		}
+
+		if (curr == folio)
+			folio_ref_dec(curr);
+
+		VM_WARN_ON_FOLIO(folio_ref_count(curr), curr);
+		free_frozen_pages(&curr->page, folio_order(curr));
+	}
+}
+
 /*
  * Taking refcount of hugetlb pages needs extra care about race conditions
  * with basic operations like hugepage allocation/free/demotion.
-- 
2.52.0.rc1.455.g30608eb744-goog




* Re: [PATCH v1 2/2] mm/memory-failure: avoid free HWPoison high-order folio
  2025-11-16  1:47 ` [PATCH v1 2/2] mm/memory-failure: avoid free HWPoison high-order folio Jiaqi Yan
@ 2025-11-16  2:10   ` Zi Yan
  2025-11-18  5:12     ` Jiaqi Yan
  2025-11-17 17:15   ` David Hildenbrand (Red Hat)
  1 sibling, 1 reply; 21+ messages in thread
From: Zi Yan @ 2025-11-16  2:10 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: nao.horiguchi, linmiaohe, david, lorenzo.stoakes, william.roche,
	harry.yoo, tony.luck, wangkefeng.wang, willy, jane.chu, akpm,
	osalvador, muchun.song, linux-mm, linux-kernel, linux-fsdevel

On 15 Nov 2025, at 20:47, Jiaqi Yan wrote:

> At the end of dissolve_free_hugetlb_folio, when a free HugeTLB
> folio becomes non-HugeTLB, it is released to buddy allocator
> as a high-order folio, e.g. a folio that contains 262144 pages
> if the folio was a 1G HugeTLB hugepage.
>
> This is problematic if the HugeTLB hugepage contained HWPoison
> subpages. In that case, since buddy allocator does not check
> HWPoison for non-zero-order folio, the raw HWPoison page can
> be given out with its buddy page and be re-used by either
> kernel or userspace.
>
> Memory failure recovery (MFR) in kernel does attempt to take
> raw HWPoison page off buddy allocator after
> dissolve_free_hugetlb_folio. However, there is always a time
> window between freed to buddy allocator and taken off from
> buddy allocator.
>
> One obvious way to avoid this problem is to add page sanity
> checks in page allocate or free path. However, it is against
> the past efforts to reduce sanity check overhead [1,2,3].
>
> Introduce hugetlb_free_hwpoison_folio to solve this problem.
> The idea is, in case a HugeTLB folio for sure contains HWPoison
> page(s), first split the non-HugeTLB high-order folio uniformly
> into 0-order folios, then let healthy pages join the buddy
> allocator while reject the HWPoison ones.
>
> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
>
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> ---
>  include/linux/hugetlb.h |  4 ++++
>  mm/hugetlb.c            |  8 ++++++--
>  mm/memory-failure.c     | 43 +++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 53 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 8e63e46b8e1f0..e1c334a7db2fe 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -870,8 +870,12 @@ int dissolve_free_hugetlb_folios(unsigned long start_pfn,
>  				    unsigned long end_pfn);
>
>  #ifdef CONFIG_MEMORY_FAILURE
> +extern void hugetlb_free_hwpoison_folio(struct folio *folio);
>  extern void folio_clear_hugetlb_hwpoison(struct folio *folio);
>  #else
> +static inline void hugetlb_free_hwpoison_folio(struct folio *folio)
> +{
> +}
>  static inline void folio_clear_hugetlb_hwpoison(struct folio *folio)
>  {
>  }
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 0455119716ec0..801ca1a14c0f0 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1596,6 +1596,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
>  						struct folio *folio)
>  {
>  	bool clear_flag = folio_test_hugetlb_vmemmap_optimized(folio);
> +	bool has_hwpoison = folio_test_hwpoison(folio);
>
>  	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>  		return;
> @@ -1638,12 +1639,15 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
>  	 * Move PageHWPoison flag from head page to the raw error pages,
>  	 * which makes any healthy subpages reusable.
>  	 */
> -	if (unlikely(folio_test_hwpoison(folio)))
> +	if (unlikely(has_hwpoison))
>  		folio_clear_hugetlb_hwpoison(folio);
>
>  	folio_ref_unfreeze(folio, 1);
>
> -	hugetlb_free_folio(folio);
> +	if (unlikely(has_hwpoison))
> +		hugetlb_free_hwpoison_folio(folio);
> +	else
> +		hugetlb_free_folio(folio);
>  }
>
>  /*
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3edebb0cda30b..e6a9deba6292a 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2002,6 +2002,49 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
>  	return ret;
>  }
>
> +void hugetlb_free_hwpoison_folio(struct folio *folio)
> +{
> +	struct folio *curr, *next;
> +	struct folio *end_folio = folio_next(folio);
> +	int ret;
> +
> +	VM_WARN_ON_FOLIO(folio_ref_count(folio) != 1, folio);
> +
> +	ret = uniform_split_unmapped_folio_to_zero_order(folio);

I realize that __split_unmapped_folio() is the wrong name and causes
confusion. It should be __split_frozen_folio(), since when you look at
its current call site, it is called after the folio is frozen. There
should probably be a check in __split_unmapped_folio() to make sure the
folio is frozen.

Is it possible to change hugetlb_free_hwpoison_folio() so that it
can be called before folio_ref_unfreeze(folio, 1)? In this way,
__split_unmapped_folio() would be called on frozen folios.

You can add a preparation patch to rename __split_unmapped_folio() to
__split_frozen_folio() and add
VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio) != 0, folio) to the function.

Thanks.

> +	if (ret) {
> +		/*
> +		 * In case of split failure, none of the pages in folio
> +		 * will be freed to buddy allocator.
> +		 */
> +		pr_err("%#lx: failed to split free %d-order folio with HWPoison page(s): %d\n",
> +		       folio_pfn(folio), folio_order(folio), ret);
> +		return;
> +	}
> +
> +	/* Expect 1st folio's refcount==1, and other's refcount==0. */
> +	for (curr = folio; curr != end_folio; curr = next) {
> +		next = folio_next(curr);
> +
> +		VM_WARN_ON_FOLIO(folio_order(curr), curr);
> +
> +		if (PageHWPoison(&curr->page)) {
> +			if (curr != folio)
> +				folio_ref_inc(curr);
> +
> +			VM_WARN_ON_FOLIO(folio_ref_count(curr) != 1, curr);
> +			pr_warn("%#lx: prevented freeing HWPoison page\n",
> +				folio_pfn(curr));
> +			continue;
> +		}
> +
> +		if (curr == folio)
> +			folio_ref_dec(curr);
> +
> +		VM_WARN_ON_FOLIO(folio_ref_count(curr), curr);
> +		free_frozen_pages(&curr->page, folio_order(curr));
> +	}
> +}
> +
>  /*
>   * Taking refcount of hugetlb pages needs extra care about race conditions
>   * with basic operations like hugepage allocation/free/demotion.
> -- 
> 2.52.0.rc1.455.g30608eb744-goog


--
Best Regards,
Yan, Zi



* Re: [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order
  2025-11-16  1:47 ` [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order Jiaqi Yan
@ 2025-11-16 11:51   ` Matthew Wilcox
  2025-11-17  3:15     ` Harry Yoo
  2025-11-16 22:38   ` kernel test robot
  2025-11-17 17:12   ` David Hildenbrand (Red Hat)
  2 siblings, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2025-11-16 11:51 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: nao.horiguchi, linmiaohe, ziy, david, lorenzo.stoakes,
	william.roche, harry.yoo, tony.luck, wangkefeng.wang, jane.chu,
	akpm, osalvador, muchun.song, linux-mm, linux-kernel,
	linux-fsdevel

On Sun, Nov 16, 2025 at 01:47:20AM +0000, Jiaqi Yan wrote:
> Introduce uniform_split_unmapped_folio_to_zero_order, a wrapper
> to the existing __split_unmapped_folio. Caller can use it to
> uniformly split an unmapped high-order folio into 0-order folios.

Please don't make this function exist.  I appreciate what you're trying
to do, but let's try to do it differently?

When we have struct folio separately allocated from struct page,
splitting a folio will mean allocating new struct folios for every
new folio created.  I anticipate an order-0 folio will be about 80 or
96 bytes.  So if we create 512 * 512 folios in a single go, that'll be
an allocation of 20MB.

This is why I asked Zi Yan to create the asymmetrical folio split, so we
only end up creating log() of this.  In the case of a single hwpoison page
in an order-18 hugetlb, that'd be 19 allocations totalling 1520 bytes.

But since we're only doing this on free, we won't need to do folio
allocations at all; we'll just be able to release the good pages to the
page allocator and sequester the hwpoison pages.



* Re: [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order
  2025-11-16  1:47 ` [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order Jiaqi Yan
  2025-11-16 11:51   ` Matthew Wilcox
@ 2025-11-16 22:38   ` kernel test robot
  2025-11-17 17:12   ` David Hildenbrand (Red Hat)
  2 siblings, 0 replies; 21+ messages in thread
From: kernel test robot @ 2025-11-16 22:38 UTC (permalink / raw)
  To: Jiaqi Yan, nao.horiguchi, linmiaohe, ziy
  Cc: oe-kbuild-all, david, lorenzo.stoakes, william.roche, harry.yoo,
	tony.luck, wangkefeng.wang, willy, jane.chu, akpm, osalvador,
	muchun.song, linux-mm, linux-kernel, linux-fsdevel, Jiaqi Yan

Hi Jiaqi,

kernel test robot noticed the following build errors:

[auto build test ERROR on linus/master]
[also build test ERROR on v6.18-rc5]
[cannot apply to akpm-mm/mm-everything next-20251114]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Jiaqi-Yan/mm-huge_memory-introduce-uniform_split_unmapped_folio_to_zero_order/20251116-094846
base:   linus/master
patch link:    https://lore.kernel.org/r/20251116014721.1561456-2-jiaqiyan%40google.com
patch subject: [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order
config: arc-randconfig-001-20251117 (https://download.01.org/0day-ci/archive/20251117/202511170614.JnCyqo45-lkp@intel.com/config)
compiler: arc-linux-gcc (GCC) 14.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251117/202511170614.JnCyqo45-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202511170614.JnCyqo45-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from include/linux/mm.h:6,
                    from arch/arc/kernel/asm-offsets.c:8:
   include/linux/huge_mm.h: In function 'uniform_split_unmapped_folio_to_zero_order':
>> include/linux/huge_mm.h:575:33: error: 'page' undeclared (first use in this function)
     575 |         VM_WARN_ON_ONCE_PAGE(1, page);
         |                                 ^~~~
   include/linux/mmdebug.h:55:27: note: in definition of macro 'VM_WARN_ON_ONCE_PAGE'
      55 |                 dump_page(page, "VM_WARN_ON_ONCE_PAGE(" __stringify(cond)")");\
         |                           ^~~~
   include/linux/huge_mm.h:575:33: note: each undeclared identifier is reported only once for each function it appears in
     575 |         VM_WARN_ON_ONCE_PAGE(1, page);
         |                                 ^~~~
   include/linux/mmdebug.h:55:27: note: in definition of macro 'VM_WARN_ON_ONCE_PAGE'
      55 |                 dump_page(page, "VM_WARN_ON_ONCE_PAGE(" __stringify(cond)")");\
         |                           ^~~~
   make[3]: *** [scripts/Makefile.build:182: arch/arc/kernel/asm-offsets.s] Error 1 shuffle=3849756084
   make[3]: Target 'prepare' not remade because of errors.
   make[2]: *** [Makefile:1280: prepare0] Error 2 shuffle=3849756084
   make[2]: Target 'prepare' not remade because of errors.
   make[1]: *** [Makefile:248: __sub-make] Error 2 shuffle=3849756084
   make[1]: Target 'prepare' not remade because of errors.
   make: *** [Makefile:248: __sub-make] Error 2 shuffle=3849756084
   make: Target 'prepare' not remade because of errors.


vim +/page +575 include/linux/huge_mm.h

   567	
   568	static inline bool
   569	can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
   570	{
   571		return false;
   572	}
   573	static inline int uniform_split_unmapped_folio_to_zero_order(struct folio *folio)
   574	{
 > 575		VM_WARN_ON_ONCE_PAGE(1, page);
   576		return -EINVAL;
   577	}
   578	static inline int
   579	split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
   580			unsigned int new_order)
   581	{
   582		VM_WARN_ON_ONCE_PAGE(1, page);
   583		return -EINVAL;
   584	}
   585	static inline int split_huge_page(struct page *page)
   586	{
   587		VM_WARN_ON_ONCE_PAGE(1, page);
   588		return -EINVAL;
   589	}
   590	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki



* Re: [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order
  2025-11-16 11:51   ` Matthew Wilcox
@ 2025-11-17  3:15     ` Harry Yoo
  2025-11-17  3:21       ` Zi Yan
  2025-11-17 13:43       ` Matthew Wilcox
  0 siblings, 2 replies; 21+ messages in thread
From: Harry Yoo @ 2025-11-17  3:15 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jiaqi Yan, nao.horiguchi, linmiaohe, ziy, david, lorenzo.stoakes,
	william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	osalvador, muchun.song, linux-mm, linux-kernel, linux-fsdevel,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner

On Sun, Nov 16, 2025 at 11:51:14AM +0000, Matthew Wilcox wrote:
> On Sun, Nov 16, 2025 at 01:47:20AM +0000, Jiaqi Yan wrote:
> > Introduce uniform_split_unmapped_folio_to_zero_order, a wrapper
> > to the existing __split_unmapped_folio. Caller can use it to
> > uniformly split an unmapped high-order folio into 0-order folios.
> 
> Please don't make this function exist.  I appreciate what you're trying
> to do, but let's try to do it differently?
> 
> When we have struct folio separately allocated from struct page,
> splitting a folio will mean allocating new struct folios for every
> new folio created.  I anticipate an order-0 folio will be about 80 or
> 96 bytes.  So if we create 512 * 512 folios in a single go, that'll be
> an allocation of 20MB.
>
> This is why I asked Zi Yan to create the asymmetrical folio split, so we
> only end up creating log() of this.  In the case of a single hwpoison page
> in an order-18 hugetlb, that'd be 19 allocations totallying 1520 bytes.

Oh god, I completely overlooked this aspect when discussing this with Jiaqi.
Thanks for raising this concern.

> But since we're only doing this on free, we won't need to do folio
> allocations at all; we'll just be able to release the good pages to the
> page allocator and sequester the hwpoison pages.

[+Cc PAGE ALLOCATOR folks]

So we need an interface to free only the healthy portion of a hwpoison folio.

I think a proper approach to this should be to "free a hwpoison folio
just like freeing a normal folio via folio_put() or free_frozen_pages(),
then the page allocator will add only healthy pages to the freelist and
isolate the hwpoison pages". Otherwise we'll end up open coding a lot,
which is too fragile.

In fact, that can be done by teaching free_pages_prepare() how to handle
the case where one or more subpages of a folio are hwpoison pages.
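
As a very rough sketch of what that slow path might look like (the
function name is made up, not existing page allocator code, and it
glosses over tearing down the compound metadata first):

/*
 * Hypothetical slow path, entered only when the folio being freed is
 * known to contain HWPoison subpages: hand the healthy base pages back
 * one by one and keep the poisoned ones out of any freelist.
 */
static void free_healthy_pages_only(struct page *page, unsigned int order)
{
	unsigned long i;

	for (i = 0; i < (1UL << order); i++) {
		struct page *p = page + i;

		if (PageHWPoison(p))
			continue;	/* sequestered, never enters a freelist */
		free_frozen_pages(p, 0);
	}
}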

How should this be implemented in the page allocator in the memdescs
world? Hmm, we'll want to do some kind of non-uniform split, without
actually splitting the folio but allocating struct buddy?

But... for now I think hiding this complexity inside the page allocator
is good enough. For now this would just mean splitting a frozen page
inside the page allocator (probably non-uniform?). We can later re-implement
this to provide better support for memdescs.

-- 
Cheers,
Harry / Hyeonggon



* Re: [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order
  2025-11-17  3:15     ` Harry Yoo
@ 2025-11-17  3:21       ` Zi Yan
  2025-11-17  3:39         ` Harry Yoo
  2025-11-17 13:43       ` Matthew Wilcox
  1 sibling, 1 reply; 21+ messages in thread
From: Zi Yan @ 2025-11-17  3:21 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Matthew Wilcox, Jiaqi Yan, nao.horiguchi, linmiaohe, david,
	lorenzo.stoakes, william.roche, tony.luck, wangkefeng.wang,
	jane.chu, akpm, osalvador, muchun.song, linux-mm, linux-kernel,
	linux-fsdevel, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner

On 16 Nov 2025, at 22:15, Harry Yoo wrote:

> On Sun, Nov 16, 2025 at 11:51:14AM +0000, Matthew Wilcox wrote:
>> On Sun, Nov 16, 2025 at 01:47:20AM +0000, Jiaqi Yan wrote:
>>> Introduce uniform_split_unmapped_folio_to_zero_order, a wrapper
>>> to the existing __split_unmapped_folio. Caller can use it to
>>> uniformly split an unmapped high-order folio into 0-order folios.
>>
>> Please don't make this function exist.  I appreciate what you're trying
>> to do, but let's try to do it differently?
>>
>> When we have struct folio separately allocated from struct page,
>> splitting a folio will mean allocating new struct folios for every
>> new folio created.  I anticipate an order-0 folio will be about 80 or
>> 96 bytes.  So if we create 512 * 512 folios in a single go, that'll be
>> an allocation of 20MB.
>>
>> This is why I asked Zi Yan to create the asymmetrical folio split, so we
>> only end up creating log() of this.  In the case of a single hwpoison page
>> in an order-18 hugetlb, that'd be 19 allocations totallying 1520 bytes.
>
> Oh god, I completely overlooked this aspect when discussing this with Jiaqi.
> Thanks for raising this concern.
>
>> But since we're only doing this on free, we won't need to do folio
>> allocations at all; we'll just be able to release the good pages to the
>> page allocator and sequester the hwpoison pages.
>
> [+Cc PAGE ALLOCATOR folks]
>
> So we need an interface to free only healthy portion of a hwpoison folio.
>
> I think a proper approach to this should be to "free a hwpoison folio
> just like freeing a normal folio via folio_put() or free_frozen_pages(),
> then the page allocator will add only healthy pages to the freelist and
> isolate the hwpoison pages". Oherwise we'll end up open coding a lot,
> which is too fragile.

Why not use __split_unmapped_folio(folio, /*new_order=*/0,
                                   /*split_at=*/HWPoisoned_page,
                                   ..., /*uniform_split=*/false)?

If there are multiple HWPoisoned pages, just repeat.
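
E.g. something roughly like this (untested sketch; it assumes the
__split_unmapped_folio() signature shown in patch 1, and the helper
name is made up):

static int split_at_each_hwpoison(struct folio *folio)
{
	unsigned long pfn = folio_pfn(folio);
	unsigned long end_pfn = pfn + folio_nr_pages(folio);
	int ret;

	for (; pfn < end_pfn; pfn++) {
		struct page *p = pfn_to_page(pfn);

		if (!PageHWPoison(p))
			continue;

		/*
		 * A non-uniform split isolates the poisoned page into its
		 * own order-0 folio while creating only O(log) sibling
		 * folios per call.
		 */
		ret = __split_unmapped_folio(page_folio(p), /*new_order=*/0,
					     /*split_at=*/p, /*xas=*/NULL,
					     /*mapping=*/NULL,
					     /*uniform_split=*/false);
		if (ret)
			return ret;
	}

	return 0;
}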

>
> In fact, that can be done by teaching free_pages_prepare() how to handle
> the case where one or more subpages of a folio are hwpoison pages.
>
> How this should be implemented in the page allocator in memdescs world?
> Hmm, we'll want to do some kind of non-uniform split, without actually
> splitting the folio but allocating struct buddy?
>
> But... for now I think hiding this complexity inside the page allocator
> is good enough. For now this would just mean splitting a frozen page
> inside the page allocator (probably non-uniform?). We can later re-implement
> this to provide better support for memdescs.
>
> -- 
> Cheers,
> Harry / Hyeonggon


--
Best Regards,
Yan, Zi



* Re: [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order
  2025-11-17  3:21       ` Zi Yan
@ 2025-11-17  3:39         ` Harry Yoo
  0 siblings, 0 replies; 21+ messages in thread
From: Harry Yoo @ 2025-11-17  3:39 UTC (permalink / raw)
  To: Zi Yan
  Cc: Matthew Wilcox, Jiaqi Yan, nao.horiguchi, linmiaohe, david,
	lorenzo.stoakes, william.roche, tony.luck, wangkefeng.wang,
	jane.chu, akpm, osalvador, muchun.song, linux-mm, linux-kernel,
	linux-fsdevel, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner

On Sun, Nov 16, 2025 at 10:21:26PM -0500, Zi Yan wrote:
> On 16 Nov 2025, at 22:15, Harry Yoo wrote:
> 
> > On Sun, Nov 16, 2025 at 11:51:14AM +0000, Matthew Wilcox wrote:
> >> On Sun, Nov 16, 2025 at 01:47:20AM +0000, Jiaqi Yan wrote:
> >>> Introduce uniform_split_unmapped_folio_to_zero_order, a wrapper
> >>> to the existing __split_unmapped_folio. Caller can use it to
> >>> uniformly split an unmapped high-order folio into 0-order folios.
> >>
> >> Please don't make this function exist.  I appreciate what you're trying
> >> to do, but let's try to do it differently?
> >>
> >> When we have struct folio separately allocated from struct page,
> >> splitting a folio will mean allocating new struct folios for every
> >> new folio created.  I anticipate an order-0 folio will be about 80 or
> >> 96 bytes.  So if we create 512 * 512 folios in a single go, that'll be
> >> an allocation of 20MB.
> >>
> >> This is why I asked Zi Yan to create the asymmetrical folio split, so we
> >> only end up creating log() of this.  In the case of a single hwpoison page
> >> in an order-18 hugetlb, that'd be 19 allocations totallying 1520 bytes.
> >
> > Oh god, I completely overlooked this aspect when discussing this with Jiaqi.
> > Thanks for raising this concern.
> >
> >> But since we're only doing this on free, we won't need to do folio
> >> allocations at all; we'll just be able to release the good pages to the
> >> page allocator and sequester the hwpoison pages.
> >
> > [+Cc PAGE ALLOCATOR folks]
> >
> > So we need an interface to free only healthy portion of a hwpoison folio.
> >
> > I think a proper approach to this should be to "free a hwpoison folio
> > just like freeing a normal folio via folio_put() or free_frozen_pages(),
> > then the page allocator will add only healthy pages to the freelist and
> > isolate the hwpoison pages". Oherwise we'll end up open coding a lot,
> > which is too fragile.
> 
> Why not use __split_unmaped_folio(folio, /*new_order=*/0,
> 								  /split_at=*/HWPoisoned_page,
> 								  ..., /*uniform_split=*/ false)?
> 
> If there are multiple HWPoisoned pages, just repeat.

Using __split_unmapped_folio() is totally fine. I was just thinking that
maybe we should hide the complexity inside the page allocator if we want
to avoid allocating struct folio at all when handling this.

> > In fact, that can be done by teaching free_pages_prepare() how to handle
> > the case where one or more subpages of a folio are hwpoison pages.
> >
> > How this should be implemented in the page allocator in memdescs world?
> > Hmm, we'll want to do some kind of non-uniform split, without actually
> > splitting the folio but allocating struct buddy?
> >
> > But... for now I think hiding this complexity inside the page allocator
> > is good enough. For now this would just mean splitting a frozen page
> > inside the page allocator (probably non-uniform?). We can later re-implement
> > this to provide better support for memdescs.

-- 
Cheers,
Harry / Hyeonggon



* Re: [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order
  2025-11-17  3:15     ` Harry Yoo
  2025-11-17  3:21       ` Zi Yan
@ 2025-11-17 13:43       ` Matthew Wilcox
  2025-11-18  6:24         ` Jiaqi Yan
  1 sibling, 1 reply; 21+ messages in thread
From: Matthew Wilcox @ 2025-11-17 13:43 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Jiaqi Yan, nao.horiguchi, linmiaohe, ziy, david, lorenzo.stoakes,
	william.roche, tony.luck, wangkefeng.wang, jane.chu, akpm,
	osalvador, muchun.song, linux-mm, linux-kernel, linux-fsdevel,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner

On Mon, Nov 17, 2025 at 12:15:23PM +0900, Harry Yoo wrote:
> On Sun, Nov 16, 2025 at 11:51:14AM +0000, Matthew Wilcox wrote:
> > But since we're only doing this on free, we won't need to do folio
> > allocations at all; we'll just be able to release the good pages to the
> > page allocator and sequester the hwpoison pages.
> 
> [+Cc PAGE ALLOCATOR folks]
> 
> So we need an interface to free only healthy portion of a hwpoison folio.
> 
> I think a proper approach to this should be to "free a hwpoison folio
> just like freeing a normal folio via folio_put() or free_frozen_pages(),
> then the page allocator will add only healthy pages to the freelist and
> isolate the hwpoison pages". Oherwise we'll end up open coding a lot,
> which is too fragile.

Yes, I think it should be handled by the page allocator.  There may be
some complexity to this that I've missed, eg if hugetlb wants to retain
the good 2MB chunks of a 1GB allocation.  I'm not sure that's a useful
thing to do or not.

> In fact, that can be done by teaching free_pages_prepare() how to handle
> the case where one or more subpages of a folio are hwpoison pages.
> 
> How this should be implemented in the page allocator in memdescs world?
> Hmm, we'll want to do some kind of non-uniform split, without actually
> splitting the folio but allocating struct buddy?

Let me sketch that out, realising that it's subject to change.

A page in buddy state can't need a memdesc allocated.  Otherwise we're
allocating memory to free memory, and that way lies madness.  We can't
do the hack of "embed struct buddy in the page that we're freeing"
because HIGHMEM.  So we'll never shrink struct page smaller than struct
buddy (which is fine because I've laid out how to get to a 64 bit struct
buddy, and we're probably two years from getting there anyway).

My design for handling hwpoison is that we do allocate a struct hwpoison
for a page.  It looks like this (for now, in my head):

struct hwpoison {
	memdesc_t original;
	... other things ...
};

So we can replace the memdesc in a page with a hwpoison memdesc when we
encounter the error.  We still need a folio flag to indicate that "this
folio contains a page with hwpoison".  I haven't put much thought yet
into interaction with HUGETLB_PAGE_OPTIMIZE_VMEMMAP; maybe "other things"
includes an index of where the actually poisoned page is in the folio,
so it doesn't matter if the pages alias with each other as we can recover
the information when it becomes useful to do so.

> But... for now I think hiding this complexity inside the page allocator
> is good enough. For now this would just mean splitting a frozen page
> inside the page allocator (probably non-uniform?). We can later re-implement
> this to provide better support for memdescs.

Yes, I like this approach.  But then I'm not the page allocator
maintainer ;-)



* Re: [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order
  2025-11-16  1:47 ` [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order Jiaqi Yan
  2025-11-16 11:51   ` Matthew Wilcox
  2025-11-16 22:38   ` kernel test robot
@ 2025-11-17 17:12   ` David Hildenbrand (Red Hat)
  2 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-17 17:12 UTC (permalink / raw)
  To: Jiaqi Yan, nao.horiguchi, linmiaohe, ziy
  Cc: lorenzo.stoakes, william.roche, harry.yoo, tony.luck,
	wangkefeng.wang, willy, jane.chu, akpm, osalvador, muchun.song,
	linux-mm, linux-kernel, linux-fsdevel

On 16.11.25 02:47, Jiaqi Yan wrote:
> When freeing a high-order folio that contains HWPoison pages,
> to ensure these HWPoison pages are not added to buddy allocator,
> we can first uniformly split a free and unmapped high-order folio
> to 0-order folios first, then only add non-HWPoison folios to
> buddy allocator and exclude HWPoison ones.
> 
> Introduce uniform_split_unmapped_folio_to_zero_order, a wrapper
> to the existing __split_unmapped_folio. Caller can use it to
> uniformly split an unmapped high-order folio into 0-order folios.
> 
> No functional change. It will be used in a subsequent commit.
> 
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> ---
>   include/linux/huge_mm.h | 6 ++++++
>   mm/huge_memory.c        | 8 ++++++++
>   2 files changed, 14 insertions(+)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 71ac78b9f834f..ef6a84973e157 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -365,6 +365,7 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>   		vm_flags_t vm_flags);
>   
>   bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
> +int uniform_split_unmapped_folio_to_zero_order(struct folio *folio);
>   int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>   		unsigned int new_order);
>   int min_order_for_split(struct folio *folio);
> @@ -569,6 +570,11 @@ can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>   {
>   	return false;
>   }
> +static inline int uniform_split_unmapped_folio_to_zero_order(struct folio *folio)
> +{
> +	VM_WARN_ON_ONCE_PAGE(1, page);
> +	return -EINVAL;
> +}

IIUC this patch won't be required (I agree that ideally the page 
allocator takes care of this), but for the future, let's consistently 
name these things "folio_split_XXX".

-- 
Cheers

David



* Re: [PATCH v1 2/2] mm/memory-failure: avoid free HWPoison high-order folio
  2025-11-16  1:47 ` [PATCH v1 2/2] mm/memory-failure: avoid free HWPoison high-order folio Jiaqi Yan
  2025-11-16  2:10   ` Zi Yan
@ 2025-11-17 17:15   ` David Hildenbrand (Red Hat)
  2025-11-18  5:17     ` Jiaqi Yan
  1 sibling, 1 reply; 21+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-17 17:15 UTC (permalink / raw)
  To: Jiaqi Yan, nao.horiguchi, linmiaohe, ziy
  Cc: lorenzo.stoakes, william.roche, harry.yoo, tony.luck,
	wangkefeng.wang, willy, jane.chu, akpm, osalvador, muchun.song,
	linux-mm, linux-kernel, linux-fsdevel

On 16.11.25 02:47, Jiaqi Yan wrote:
> At the end of dissolve_free_hugetlb_folio, when a free HugeTLB
> folio becomes non-HugeTLB, it is released to buddy allocator
> as a high-order folio, e.g. a folio that contains 262144 pages
> if the folio was a 1G HugeTLB hugepage.
> 
> This is problematic if the HugeTLB hugepage contained HWPoison
> subpages. In that case, since buddy allocator does not check
> HWPoison for non-zero-order folio, the raw HWPoison page can
> be given out with its buddy page and be re-used by either
> kernel or userspace.
> 
> Memory failure recovery (MFR) in kernel does attempt to take
> raw HWPoison page off buddy allocator after
> dissolve_free_hugetlb_folio. However, there is always a time
> window between freed to buddy allocator and taken off from
> buddy allocator.
> 
> One obvious way to avoid this problem is to add page sanity
> checks in page allocate or free path. However, it is against
> the past efforts to reduce sanity check overhead [1,2,3].
> 
> Introduce hugetlb_free_hwpoison_folio to solve this problem.
> The idea is, in case a HugeTLB folio for sure contains HWPoison
> page(s), first split the non-HugeTLB high-order folio uniformly
> into 0-order folios, then let healthy pages join the buddy
> allocator while reject the HWPoison ones.
> 
> [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> 
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>


[...]

>   /*
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3edebb0cda30b..e6a9deba6292a 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2002,6 +2002,49 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
>   	return ret;
>   }
>   
> +void hugetlb_free_hwpoison_folio(struct folio *folio)

What is hugetlb specific in here? :)

Hint: if there is nothing, likely it should be generic infrastructure.

But I would prefer if the page allocator could just take care of that 
when freeing a folio.

-- 
Cheers

David



* Re: [PATCH v1 2/2] mm/memory-failure: avoid free HWPoison high-order folio
  2025-11-16  2:10   ` Zi Yan
@ 2025-11-18  5:12     ` Jiaqi Yan
  0 siblings, 0 replies; 21+ messages in thread
From: Jiaqi Yan @ 2025-11-18  5:12 UTC (permalink / raw)
  To: Zi Yan
  Cc: nao.horiguchi, linmiaohe, david, lorenzo.stoakes, william.roche,
	harry.yoo, tony.luck, wangkefeng.wang, willy, jane.chu, akpm,
	osalvador, muchun.song, linux-mm, linux-kernel, linux-fsdevel

On Sat, Nov 15, 2025 at 6:10 PM Zi Yan <ziy@nvidia.com> wrote:
>
> On 15 Nov 2025, at 20:47, Jiaqi Yan wrote:
>
> > At the end of dissolve_free_hugetlb_folio, when a free HugeTLB
> > folio becomes non-HugeTLB, it is released to buddy allocator
> > as a high-order folio, e.g. a folio that contains 262144 pages
> > if the folio was a 1G HugeTLB hugepage.
> >
> > This is problematic if the HugeTLB hugepage contained HWPoison
> > subpages. In that case, since buddy allocator does not check
> > HWPoison for non-zero-order folio, the raw HWPoison page can
> > be given out with its buddy page and be re-used by either
> > kernel or userspace.
> >
> > Memory failure recovery (MFR) in kernel does attempt to take
> > raw HWPoison page off buddy allocator after
> > dissolve_free_hugetlb_folio. However, there is always a time
> > window between freed to buddy allocator and taken off from
> > buddy allocator.
> >
> > One obvious way to avoid this problem is to add page sanity
> > checks in page allocate or free path. However, it is against
> > the past efforts to reduce sanity check overhead [1,2,3].
> >
> > Introduce hugetlb_free_hwpoison_folio to solve this problem.
> > The idea is, in case a HugeTLB folio for sure contains HWPoison
> > page(s), first split the non-HugeTLB high-order folio uniformly
> > into 0-order folios, then let healthy pages join the buddy
> > allocator while reject the HWPoison ones.
> >
> > [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> > [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> > [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> >
> > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > ---
> >  include/linux/hugetlb.h |  4 ++++
> >  mm/hugetlb.c            |  8 ++++++--
> >  mm/memory-failure.c     | 43 +++++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 53 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 8e63e46b8e1f0..e1c334a7db2fe 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -870,8 +870,12 @@ int dissolve_free_hugetlb_folios(unsigned long start_pfn,
> >                                   unsigned long end_pfn);
> >
> >  #ifdef CONFIG_MEMORY_FAILURE
> > +extern void hugetlb_free_hwpoison_folio(struct folio *folio);
> >  extern void folio_clear_hugetlb_hwpoison(struct folio *folio);
> >  #else
> > +static inline void hugetlb_free_hwpoison_folio(struct folio *folio)
> > +{
> > +}
> >  static inline void folio_clear_hugetlb_hwpoison(struct folio *folio)
> >  {
> >  }
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 0455119716ec0..801ca1a14c0f0 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1596,6 +1596,7 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
> >                                               struct folio *folio)
> >  {
> >       bool clear_flag = folio_test_hugetlb_vmemmap_optimized(folio);
> > +     bool has_hwpoison = folio_test_hwpoison(folio);
> >
> >       if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
> >               return;
> > @@ -1638,12 +1639,15 @@ static void __update_and_free_hugetlb_folio(struct hstate *h,
> >        * Move PageHWPoison flag from head page to the raw error pages,
> >        * which makes any healthy subpages reusable.
> >        */
> > -     if (unlikely(folio_test_hwpoison(folio)))
> > +     if (unlikely(has_hwpoison))
> >               folio_clear_hugetlb_hwpoison(folio);
> >
> >       folio_ref_unfreeze(folio, 1);
> >
> > -     hugetlb_free_folio(folio);
> > +     if (unlikely(has_hwpoison))
> > +             hugetlb_free_hwpoison_folio(folio);
> > +     else
> > +             hugetlb_free_folio(folio);
> >  }
> >
> >  /*
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 3edebb0cda30b..e6a9deba6292a 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -2002,6 +2002,49 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
> >       return ret;
> >  }
> >
> > +void hugetlb_free_hwpoison_folio(struct folio *folio)
> > +{
> > +     struct folio *curr, *next;
> > +     struct folio *end_folio = folio_next(folio);
> > +     int ret;
> > +
> > +     VM_WARN_ON_FOLIO(folio_ref_count(folio) != 1, folio);
> > +
> > +     ret = uniform_split_unmapped_folio_to_zero_order(folio);
>
> I realize that __split_unmapped_folio() is a wrong name and causes confusion.
> It should be __split_frozen_folio(), since when you look at its current
> call site, it is called after the folio is frozen. There probably
> should be a check in __split_unmapped_folio() to make sure the folio is frozen.
>
> Is it possible to change hugetlb_free_hwpoison_folio() so that it
> can be called before folio_ref_unfreeze(folio, 1)? In this way,
> __split_unmapped_folio() is called at frozen folios.
>
> You can add a preparation patch to rename __split_unmapped_folio() to
> __split_frozen_folio() and add
> VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio) != 0, folio) to the function.
>

FWIW, I am still going to follow your suggestion to improve code
health and readability :)

> Thanks.

Thanks, Zi!

>
> > +     if (ret) {
> > +             /*
> > +              * In case of split failure, none of the pages in folio
> > +              * will be freed to buddy allocator.
> > +              */
> > +             pr_err("%#lx: failed to split free %d-order folio with HWPoison page(s): %d\n",
> > +                    folio_pfn(folio), folio_order(folio), ret);
> > +             return;
> > +     }
> > +
> > +     /* Expect 1st folio's refcount==1, and other's refcount==0. */
> > +     for (curr = folio; curr != end_folio; curr = next) {
> > +             next = folio_next(curr);
> > +
> > +             VM_WARN_ON_FOLIO(folio_order(curr), curr);
> > +
> > +             if (PageHWPoison(&curr->page)) {
> > +                     if (curr != folio)
> > +                             folio_ref_inc(curr);
> > +
> > +                     VM_WARN_ON_FOLIO(folio_ref_count(curr) != 1, curr);
> > +                     pr_warn("%#lx: prevented freeing HWPoison page\n",
> > +                             folio_pfn(curr));
> > +                     continue;
> > +             }
> > +
> > +             if (curr == folio)
> > +                     folio_ref_dec(curr);
> > +
> > +             VM_WARN_ON_FOLIO(folio_ref_count(curr), curr);
> > +             free_frozen_pages(&curr->page, folio_order(curr));
> > +     }
> > +}
> > +
> >  /*
> >   * Taking refcount of hugetlb pages needs extra care about race conditions
> >   * with basic operations like hugepage allocation/free/demotion.
> > --
> > 2.52.0.rc1.455.g30608eb744-goog
>
>
> --
> Best Regards,
> Yan, Zi



* Re: [PATCH v1 2/2] mm/memory-failure: avoid free HWPoison high-order folio
  2025-11-17 17:15   ` David Hildenbrand (Red Hat)
@ 2025-11-18  5:17     ` Jiaqi Yan
  0 siblings, 0 replies; 21+ messages in thread
From: Jiaqi Yan @ 2025-11-18  5:17 UTC (permalink / raw)
  To: David Hildenbrand (Red Hat)
  Cc: nao.horiguchi, linmiaohe, ziy, lorenzo.stoakes, william.roche,
	harry.yoo, tony.luck, wangkefeng.wang, willy, jane.chu, akpm,
	osalvador, muchun.song, linux-mm, linux-kernel, linux-fsdevel

On Mon, Nov 17, 2025 at 9:15 AM David Hildenbrand (Red Hat)
<david@kernel.org> wrote:
>
> On 16.11.25 02:47, Jiaqi Yan wrote:
> > At the end of dissolve_free_hugetlb_folio, when a free HugeTLB
> > folio becomes non-HugeTLB, it is released to buddy allocator
> > as a high-order folio, e.g. a folio that contains 262144 pages
> > if the folio was a 1G HugeTLB hugepage.
> >
> > This is problematic if the HugeTLB hugepage contained HWPoison
> > subpages. In that case, since buddy allocator does not check
> > HWPoison for non-zero-order folio, the raw HWPoison page can
> > be given out with its buddy page and be re-used by either
> > kernel or userspace.
> >
> > Memory failure recovery (MFR) in kernel does attempt to take
> > raw HWPoison page off buddy allocator after
> > dissolve_free_hugetlb_folio. However, there is always a time
> > window between freed to buddy allocator and taken off from
> > buddy allocator.
> >
> > One obvious way to avoid this problem is to add page sanity
> > checks in page allocate or free path. However, it is against
> > the past efforts to reduce sanity check overhead [1,2,3].
> >
> > Introduce hugetlb_free_hwpoison_folio to solve this problem.
> > The idea is, in case a HugeTLB folio for sure contains HWPoison
> > page(s), first split the non-HugeTLB high-order folio uniformly
> > into 0-order folios, then let healthy pages join the buddy
> > allocator while reject the HWPoison ones.
> >
> > [1] https://lore.kernel.org/linux-mm/1460711275-1130-15-git-send-email-mgorman@techsingularity.net/
> > [2] https://lore.kernel.org/linux-mm/1460711275-1130-16-git-send-email-mgorman@techsingularity.net/
> > [3] https://lore.kernel.org/all/20230216095131.17336-1-vbabka@suse.cz
> >
> > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
>
>
> [...]
>
> >   /*
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 3edebb0cda30b..e6a9deba6292a 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -2002,6 +2002,49 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
> >       return ret;
> >   }
> >
> > +void hugetlb_free_hwpoison_folio(struct folio *folio)
>
> What is hugetlb specific in here? :)
>
> Hint: if there is nothing, likely it should be generic infrastructure.
>
> But I would prefer if the page allocator could just take care of that
> when freeing a folio.

Ack, and if it could be taken care of by the page allocator, it would
be generic infrastructure.

>
> --
> Cheers
>
> David



* Re: [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order
  2025-11-17 13:43       ` Matthew Wilcox
@ 2025-11-18  6:24         ` Jiaqi Yan
  2025-11-18 10:19           ` Harry Yoo
  0 siblings, 1 reply; 21+ messages in thread
From: Jiaqi Yan @ 2025-11-18  6:24 UTC (permalink / raw)
  To: Matthew Wilcox, Harry Yoo, ziy, david, Vlastimil Babka
  Cc: nao.horiguchi, linmiaohe, lorenzo.stoakes, william.roche,
	tony.luck, wangkefeng.wang, jane.chu, akpm, osalvador,
	muchun.song, linux-mm, linux-kernel, linux-fsdevel, Michal Hocko,
	Suren Baghdasaryan, Brendan Jackman, Johannes Weiner

On Mon, Nov 17, 2025 at 5:43 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Nov 17, 2025 at 12:15:23PM +0900, Harry Yoo wrote:
> > On Sun, Nov 16, 2025 at 11:51:14AM +0000, Matthew Wilcox wrote:
> > > But since we're only doing this on free, we won't need to do folio
> > > allocations at all; we'll just be able to release the good pages to the
> > > page allocator and sequester the hwpoison pages.
> >
> > [+Cc PAGE ALLOCATOR folks]
> >
> > So we need an interface to free only healthy portion of a hwpoison folio.

+1, with some of my own thoughts below.

> >
> > I think a proper approach to this should be to "free a hwpoison folio
> > just like freeing a normal folio via folio_put() or free_frozen_pages(),
> > then the page allocator will add only healthy pages to the freelist and
> > isolate the hwpoison pages". Oherwise we'll end up open coding a lot,
> > which is too fragile.
>
> Yes, I think it should be handled by the page allocator.  There may be

I agree with Matthew, Harry, and David. The page allocator seems best
suited to handle HWPoison subpages without any new folio allocations.

> some complexity to this that I've missed, eg if hugetlb wants to retain
> the good 2MB chunks of a 1GB allocation.  I'm not sure that's a useful
> thing to do or not.
>
> > In fact, that can be done by teaching free_pages_prepare() how to handle
> > the case where one or more subpages of a folio are hwpoison pages.
> >
> > How this should be implemented in the page allocator in memdescs world?
> > Hmm, we'll want to do some kind of non-uniform split, without actually
> > splitting the folio but allocating struct buddy?
>
> Let me sketch that out, realising that it's subject to change.
>
> A page in buddy state can't need a memdesc allocated.  Otherwise we're
> allocating memory to free memory, and that way lies madness.  We can't
> do the hack of "embed struct buddy in the page that we're freeing"
> because HIGHMEM.  So we'll never shrink struct page smaller than struct
> buddy (which is fine because I've laid out how to get to a 64 bit struct
> buddy, and we're probably two years from getting there anyway).
>
> My design for handling hwpoison is that we do allocate a struct hwpoison
> for a page.  It looks like this (for now, in my head):
>
> struct hwpoison {
>         memdesc_t original;
>         ... other things ...
> };
>
> So we can replace the memdesc in a page with a hwpoison memdesc when we
> encounter the error.  We still need a folio flag to indicate that "this
> folio contains a page with hwpoison".  I haven't put much thought yet
> into interaction with HUGETLB_PAGE_OPTIMIZE_VMEMMAP; maybe "other things"
> includes an index of where the actually poisoned page is in the folio,
> so it doesn't matter if the pages alias with each other as we can recover
> the information when it becomes useful to do so.
>
> > But... for now I think hiding this complexity inside the page allocator
> > is good enough. For now this would just mean splitting a frozen page

I want to add one more thing. For HugeTLB, the kernel clears the
HWPoison flag on the folio and moves it to each raw page in the
raw_hwp_page list (see folio_clear_hugetlb_hwpoison). So the page
allocator has no hint that some of the pages passed into
free_frozen_pages have HWPoison. It has to traverse 2^order pages to
tell, if I am not mistaken, which goes against the past effort to
reduce sanity checks. I believe this is one reason I chose to handle
the problem in hugetlb / memory-failure.
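In other words, without any hint the free path would have to do
something like the naive scan below on every high-order free, which is
exactly the kind of per-page check that past effort removed
(pseudo-code, just to illustrate the cost):

	/* 2^order PageHWPoison() checks, 262144 for a 1G hugepage */
	for (i = 0; i < (1UL << order); i++)
		if (PageHWPoison(folio_page(folio, i)))
			break;	/* poisoned page found, handle it */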

For the new interface Harry requested, is it the caller's
responsibility to ensure that the folio actually contains HWPoison
pages (or, even better, maybe to point out the exact ones?), so that
the page allocator at least doesn't waste cycles searching for
HWPoison pages that don't exist in the set of pages?

Or do the caller and the page allocator need to agree on some
contract? Say the caller has to set the has_hwpoisoned flag on a
non-zero-order folio before freeing it. That gives the old interface
free_frozen_pages an easy check via the has_hwpoisoned flag stored in
the second page. I know has_hwpoisoned is "#if defined" on THP and
using it for hugetlb probably is not very clean, but are there other
concerns?
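To make that contract concrete, the caller-side half could look
roughly like the sketch below. The helper name and its placement are
made up, it is untested, and it ignores for the moment that
has_hwpoisoned is guarded by CONFIG_TRANSPARENT_HUGEPAGE:

	/*
	 * Hypothetical helper, called before folio_clear_hugetlb_hwpoison()
	 * drops the folio-level flag, so that the high-order folio later
	 * handed to free_frozen_pages() still carries a cheap hint.
	 */
	static void hugetlb_mark_hwpoison_for_free(struct folio *folio)
	{
		if (folio_test_hwpoison(folio))
			folio_set_has_hwpoisoned(folio);
	}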


> > inside the page allocator (probably non-uniform?). We can later re-implement
> > this to provide better support for memdescs.
>
> Yes, I like this approach.  But then I'm not the page allocator
> maintainer ;-)

If page allocator maintainers can weigh in here, that will be very helpful!


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order
  2025-11-18  6:24         ` Jiaqi Yan
@ 2025-11-18 10:19           ` Harry Yoo
  2025-11-18 19:26             ` Jiaqi Yan
  0 siblings, 1 reply; 21+ messages in thread
From: Harry Yoo @ 2025-11-18 10:19 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: Matthew Wilcox, ziy, david, Vlastimil Babka, nao.horiguchi,
	linmiaohe, lorenzo.stoakes, william.roche, tony.luck,
	wangkefeng.wang, jane.chu, akpm, osalvador, muchun.song,
	linux-mm, linux-kernel, linux-fsdevel, Michal Hocko,
	Suren Baghdasaryan, Brendan Jackman, Johannes Weiner

On Mon, Nov 17, 2025 at 10:24:27PM -0800, Jiaqi Yan wrote:
> On Mon, Nov 17, 2025 at 5:43 AM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Nov 17, 2025 at 12:15:23PM +0900, Harry Yoo wrote:
> > > On Sun, Nov 16, 2025 at 11:51:14AM +0000, Matthew Wilcox wrote:
> > > > But since we're only doing this on free, we won't need to do folio
> > > > allocations at all; we'll just be able to release the good pages to the
> > > > page allocator and sequester the hwpoison pages.
> > >
> > > [+Cc PAGE ALLOCATOR folks]
> > >
> > > So we need an interface to free only healthy portion of a hwpoison folio.
> 
> +1, with some of my own thoughts below.
> 
> > >
> > > I think a proper approach to this should be to "free a hwpoison folio
> > > just like freeing a normal folio via folio_put() or free_frozen_pages(),
> > > then the page allocator will add only healthy pages to the freelist and
> > > isolate the hwpoison pages". Oherwise we'll end up open coding a lot,
> > > which is too fragile.
> >
> > Yes, I think it should be handled by the page allocator.  There may be
> 
> I agree with Matthew, Harry, and David. The page allocator seems best
> suited to handle HWPoison subpages without any new folio allocations.

Sorry, I should have been clearer. I don't think adding an **explicit**
interface to free an hwpoison folio is worth it; handling it implicitly
while freeing a folio seems more feasible instead.

> > some complexity to this that I've missed, eg if hugetlb wants to retain
> > the good 2MB chunks of a 1GB allocation.  I'm not sure that's a useful
> > thing to do or not.
> >
> > > In fact, that can be done by teaching free_pages_prepare() how to handle
> > > the case where one or more subpages of a folio are hwpoison pages.
> > >
> > > How this should be implemented in the page allocator in memdescs world?
> > > Hmm, we'll want to do some kind of non-uniform split, without actually
> > > splitting the folio but allocating struct buddy?
> >
> > Let me sketch that out, realising that it's subject to change.
> >
> > A page in buddy state can't need a memdesc allocated.  Otherwise we're
> > allocating memory to free memory, and that way lies madness.  We can't
> > do the hack of "embed struct buddy in the page that we're freeing"
> > because HIGHMEM.  So we'll never shrink struct page smaller than struct
> > buddy (which is fine because I've laid out how to get to a 64 bit struct
> > buddy, and we're probably two years from getting there anyway).
> >
> > My design for handling hwpoison is that we do allocate a struct hwpoison
> > for a page.  It looks like this (for now, in my head):
> >
> > struct hwpoison {
> >         memdesc_t original;
> >         ... other things ...
> > };
> >
> > So we can replace the memdesc in a page with a hwpoison memdesc when we
> > encounter the error.  We still need a folio flag to indicate that "this
> > folio contains a page with hwpoison".  I haven't put much thought yet
> > into interaction with HUGETLB_PAGE_OPTIMIZE_VMEMMAP; maybe "other things"
> > includes an index of where the actually poisoned page is in the folio,
> > so it doesn't matter if the pages alias with each other as we can recover
> > the information when it becomes useful to do so.
> >
> > > But... for now I think hiding this complexity inside the page allocator
> > > is good enough. For now this would just mean splitting a frozen page
> 
> I want to add one more thing. For HugeTLB, kernel clears the HWPoison
> flag on the folio and move it to every raw pages in raw_hwp_page list
> (see folio_clear_hugetlb_hwpoison). So page allocator has no hint that
> some pages passed into free_frozen_pages has HWPoison. It has to
> traverse 2^order pages to tell, if I am not mistaken, which goes
> against the past effort to reduce sanity checks. I believe this is one
> reason I choosed to handle the problem in hugetlb / memory-failure.

I think we can skip calling folio_clear_hugetlb_hwpoison() and teach the
buddy allocator to handle this. free_pages_prepare() already handles the
(PageHWPoison(page) && !order) case; we just need to extend that to
support hugetlb folios as well.

> For the new interface Harry requested, is it the caller's
> responsibility to ensure that the folio contains HWPoison pages (to be
> even better, maybe point out the exact ones?), so that page allocator
> at least doesn't waste cycles to search non-exist HWPoison in the set
> of pages?

With implicit handling it would be the page allocator's responsibility
to check and handle hwpoison hugetlb folios.

> Or caller and page allocator need to agree on some contract? Say
> caller has to set has_hwpoisoned flag in non-zero order folio to free.
> This allows the old interface free_frozen_pages an easy way using the
> has_hwpoison flag from the second page. I know has_hwpoison is "#if
> defined" on THP and using it for hugetlb probably is not very clean,
> but are there other concerns?

As you mentioned, has_hwpoisoned is used for THPs. And for a hugetlb
folio, folio_test_hwpoison() returns true if it has at least one
hwpoison page (assuming that we don't clear it before freeing).

So in free_pages_prepare():

if (folio_test_hugetlb(folio) && folio_test_hwpoison(folio)) {
  /*
   * Handle hwpoison hugetlb folios; transfer the error information
   * to individual pages, clear hwpoison flag of the folio,
   * perform non-uniform split on the frozen folio.
   */
} else if (PageHWPoison(page) && !order) {
  /* We already handle this in the allocator. */
}

Would this be sufficient?

Or do we want to handle THPs as well, in case of a split failure in
memory_failure()? If so, we need to handle the
folio_test_has_hwpoisoned() case as well...
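If we do go down that path, the combined check might look roughly like
this (only to show the shape, not working code;
free_hwpoison_folio_prepare() is a made-up name):

if (unlikely(order &&
	     (folio_test_has_hwpoisoned(folio) ||
	      (folio_test_hugetlb(folio) && folio_test_hwpoison(folio))))) {
	/*
	 * Transfer the error information to individual pages if needed,
	 * clear the folio-level flags, split the frozen folio
	 * non-uniformly and keep the poisoned pages off the freelists.
	 */
	free_hwpoison_folio_prepare(folio, order);
} else if (PageHWPoison(page) && !order) {
	/* We already handle this in the allocator. */
}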

> > > inside the page allocator (probably non-uniform?). We can later re-implement
> > > this to provide better support for memdescs.
> >
> > Yes, I like this approach.  But then I'm not the page allocator
> > maintainer ;-)
> 
> If page allocator maintainers can weigh in here, that will be very helpful!

Yeah, I'm not a maintainer either ;) it'll be great to get opinions
from page allocator folks!

-- 
Cheers,
Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order
  2025-11-18 10:19           ` Harry Yoo
@ 2025-11-18 19:26             ` Jiaqi Yan
  2025-11-18 21:54               ` Zi Yan
  0 siblings, 1 reply; 21+ messages in thread
From: Jiaqi Yan @ 2025-11-18 19:26 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Matthew Wilcox, ziy, david, Vlastimil Babka, nao.horiguchi,
	linmiaohe, lorenzo.stoakes, william.roche, tony.luck,
	wangkefeng.wang, jane.chu, akpm, osalvador, muchun.song,
	linux-mm, linux-kernel, linux-fsdevel, Michal Hocko,
	Suren Baghdasaryan, Brendan Jackman, Johannes Weiner

On Tue, Nov 18, 2025 at 2:20 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Mon, Nov 17, 2025 at 10:24:27PM -0800, Jiaqi Yan wrote:
> > On Mon, Nov 17, 2025 at 5:43 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Nov 17, 2025 at 12:15:23PM +0900, Harry Yoo wrote:
> > > > On Sun, Nov 16, 2025 at 11:51:14AM +0000, Matthew Wilcox wrote:
> > > > > But since we're only doing this on free, we won't need to do folio
> > > > > allocations at all; we'll just be able to release the good pages to the
> > > > > page allocator and sequester the hwpoison pages.
> > > >
> > > > [+Cc PAGE ALLOCATOR folks]
> > > >
> > > > So we need an interface to free only healthy portion of a hwpoison folio.
> >
> > +1, with some of my own thoughts below.
> >
> > > >
> > > > I think a proper approach to this should be to "free a hwpoison folio
> > > > just like freeing a normal folio via folio_put() or free_frozen_pages(),
> > > > then the page allocator will add only healthy pages to the freelist and
> > > > isolate the hwpoison pages". Oherwise we'll end up open coding a lot,
> > > > which is too fragile.
> > >
> > > Yes, I think it should be handled by the page allocator.  There may be
> >
> > I agree with Matthew, Harry, and David. The page allocator seems best
> > suited to handle HWPoison subpages without any new folio allocations.
>
> Sorry I should have been clearer. I don't think adding an **explicit**
> interface to free an hwpoison folio is worth; instead implicitly
> handling during freeing of a folio seems more feasible.

That's fine with me, just more to be taken care of by the page allocator.

>
> > > some complexity to this that I've missed, eg if hugetlb wants to retain
> > > the good 2MB chunks of a 1GB allocation.  I'm not sure that's a useful
> > > thing to do or not.
> > >
> > > > In fact, that can be done by teaching free_pages_prepare() how to handle
> > > > the case where one or more subpages of a folio are hwpoison pages.
> > > >
> > > > How this should be implemented in the page allocator in memdescs world?
> > > > Hmm, we'll want to do some kind of non-uniform split, without actually
> > > > splitting the folio but allocating struct buddy?
> > >
> > > Let me sketch that out, realising that it's subject to change.
> > >
> > > A page in buddy state can't need a memdesc allocated.  Otherwise we're
> > > allocating memory to free memory, and that way lies madness.  We can't
> > > do the hack of "embed struct buddy in the page that we're freeing"
> > > because HIGHMEM.  So we'll never shrink struct page smaller than struct
> > > buddy (which is fine because I've laid out how to get to a 64 bit struct
> > > buddy, and we're probably two years from getting there anyway).
> > >
> > > My design for handling hwpoison is that we do allocate a struct hwpoison
> > > for a page.  It looks like this (for now, in my head):
> > >
> > > struct hwpoison {
> > >         memdesc_t original;
> > >         ... other things ...
> > > };
> > >
> > > So we can replace the memdesc in a page with a hwpoison memdesc when we
> > > encounter the error.  We still need a folio flag to indicate that "this
> > > folio contains a page with hwpoison".  I haven't put much thought yet
> > > into interaction with HUGETLB_PAGE_OPTIMIZE_VMEMMAP; maybe "other things"
> > > includes an index of where the actually poisoned page is in the folio,
> > > so it doesn't matter if the pages alias with each other as we can recover
> > > the information when it becomes useful to do so.
> > >
> > > > But... for now I think hiding this complexity inside the page allocator
> > > > is good enough. For now this would just mean splitting a frozen page
> >
> > I want to add one more thing. For HugeTLB, kernel clears the HWPoison
> > flag on the folio and move it to every raw pages in raw_hwp_page list
> > (see folio_clear_hugetlb_hwpoison). So page allocator has no hint that
> > some pages passed into free_frozen_pages has HWPoison. It has to
> > traverse 2^order pages to tell, if I am not mistaken, which goes
> > against the past effort to reduce sanity checks. I believe this is one
> > reason I choosed to handle the problem in hugetlb / memory-failure.
>
> I think we can skip calling folio_clear_hugetlb_hwpoison() and teach the

Nit: also skip folio_free_raw_hwp, so that the hugetlb-specific llist
that contains the raw pages and is owned by memory-failure is
preserved? And then expect the page allocator to use it for whatever
purpose and free the llist? That doesn't seem to follow the correct
ownership rules.

> buddy allocator to handle this. free_pages_prepare() already handles
> (PageHWPoison(page) && !order) case, we just need to extend that to
> support hugetlb folios as well.
>
> > For the new interface Harry requested, is it the caller's
> > responsibility to ensure that the folio contains HWPoison pages (to be
> > even better, maybe point out the exact ones?), so that page allocator
> > at least doesn't waste cycles to search non-exist HWPoison in the set
> > of pages?
>
> With implicit handling it would be the page allocator's responsibility
> to check and handle hwpoison hugetlb folios.

Does this mean we must bake hugetlb-specific logic into the page
allocator's freeing path? AFAICT, today the contract of
free_frozen_pages doesn't involve much hugetlb info.

I see there is already some hugetlb-specific logic in page_alloc.c,
but that is perhaps not a good reason for adding more.

>
> > Or caller and page allocator need to agree on some contract? Say
> > caller has to set has_hwpoisoned flag in non-zero order folio to free.
> > This allows the old interface free_frozen_pages an easy way using the
> > has_hwpoison flag from the second page. I know has_hwpoison is "#if
> > defined" on THP and using it for hugetlb probably is not very clean,
> > but are there other concerns?
>
> As you mentioned has_hwpoisoned is used for THPs and for a hugetlb
> folio. But for a hugetlb folio folio_test_hwpoison() returns true
> if it has at least one hwpoison pages (assuming that we don't clear it
> before freeing).
>
> So in free_pages_prepare():
>
> if (folio_test_hugetlb(folio) && folio_test_hwpoison(folio)) {
>   /*
>    * Handle hwpoison hugetlb folios; transfer the error information
>    * to individual pages, clear hwpoison flag of the folio,
>    * perform non-uniform split on the frozen folio.
>    */
> } else if (PageHWPoison(page) && !order) {
>   /* We already handle this in the allocator. */
> }
>
> This would be sufficient?

Wouldn't this confuse the page allocator into thinking the healthy
head page is HWPoison (when it actually isn't)? I thought that was one
of the reasons has_hwpoisoned exists.

>
> Or do we want to handle THPs as well, in case of split failure in
> memory_failure()? if so we need to handle folio_test_has_hwpoisoned()
> case as well...

Yeah, I think this is another good use case for our request to the page allocator.

>
> > > > inside the page allocator (probably non-uniform?). We can later re-implement
> > > > this to provide better support for memdescs.
> > >
> > > Yes, I like this approach.  But then I'm not the page allocator
> > > maintainer ;-)
> >
> > If page allocator maintainers can weigh in here, that will be very helpful!
>
> Yeah, I'm not a maintainer either ;) it'll be great to get opinions
> from page allocator folks!
>
> --
> Cheers,
> Harry / Hyeonggon


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order
  2025-11-18 19:26             ` Jiaqi Yan
@ 2025-11-18 21:54               ` Zi Yan
  2025-11-19 12:37                 ` Harry Yoo
  0 siblings, 1 reply; 21+ messages in thread
From: Zi Yan @ 2025-11-18 21:54 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: Harry Yoo, Matthew Wilcox, david, Vlastimil Babka, nao.horiguchi,
	linmiaohe, lorenzo.stoakes, william.roche, tony.luck,
	wangkefeng.wang, jane.chu, akpm, osalvador, muchun.song,
	linux-mm, linux-kernel, linux-fsdevel, Michal Hocko,
	Suren Baghdasaryan, Brendan Jackman, Johannes Weiner

On 18 Nov 2025, at 14:26, Jiaqi Yan wrote:

> On Tue, Nov 18, 2025 at 2:20 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>>
>> On Mon, Nov 17, 2025 at 10:24:27PM -0800, Jiaqi Yan wrote:
>>> On Mon, Nov 17, 2025 at 5:43 AM Matthew Wilcox <willy@infradead.org> wrote:
>>>>
>>>> On Mon, Nov 17, 2025 at 12:15:23PM +0900, Harry Yoo wrote:
>>>>> On Sun, Nov 16, 2025 at 11:51:14AM +0000, Matthew Wilcox wrote:
>>>>>> But since we're only doing this on free, we won't need to do folio
>>>>>> allocations at all; we'll just be able to release the good pages to the
>>>>>> page allocator and sequester the hwpoison pages.
>>>>>
>>>>> [+Cc PAGE ALLOCATOR folks]
>>>>>
>>>>> So we need an interface to free only healthy portion of a hwpoison folio.
>>>
>>> +1, with some of my own thoughts below.
>>>
>>>>>
>>>>> I think a proper approach to this should be to "free a hwpoison folio
>>>>> just like freeing a normal folio via folio_put() or free_frozen_pages(),
>>>>> then the page allocator will add only healthy pages to the freelist and
>>>>> isolate the hwpoison pages". Oherwise we'll end up open coding a lot,
>>>>> which is too fragile.
>>>>
>>>> Yes, I think it should be handled by the page allocator.  There may be
>>>
>>> I agree with Matthew, Harry, and David. The page allocator seems best
>>> suited to handle HWPoison subpages without any new folio allocations.
>>
>> Sorry I should have been clearer. I don't think adding an **explicit**
>> interface to free an hwpoison folio is worth; instead implicitly
>> handling during freeing of a folio seems more feasible.
>
> That's fine with me, just more to be taken care of by page allocator.
>
>>
>>>> some complexity to this that I've missed, eg if hugetlb wants to retain
>>>> the good 2MB chunks of a 1GB allocation.  I'm not sure that's a useful
>>>> thing to do or not.
>>>>
>>>>> In fact, that can be done by teaching free_pages_prepare() how to handle
>>>>> the case where one or more subpages of a folio are hwpoison pages.
>>>>>
>>>>> How this should be implemented in the page allocator in memdescs world?
>>>>> Hmm, we'll want to do some kind of non-uniform split, without actually
>>>>> splitting the folio but allocating struct buddy?
>>>>
>>>> Let me sketch that out, realising that it's subject to change.
>>>>
>>>> A page in buddy state can't need a memdesc allocated.  Otherwise we're
>>>> allocating memory to free memory, and that way lies madness.  We can't
>>>> do the hack of "embed struct buddy in the page that we're freeing"
>>>> because HIGHMEM.  So we'll never shrink struct page smaller than struct
>>>> buddy (which is fine because I've laid out how to get to a 64 bit struct
>>>> buddy, and we're probably two years from getting there anyway).
>>>>
>>>> My design for handling hwpoison is that we do allocate a struct hwpoison
>>>> for a page.  It looks like this (for now, in my head):
>>>>
>>>> struct hwpoison {
>>>>         memdesc_t original;
>>>>         ... other things ...
>>>> };
>>>>
>>>> So we can replace the memdesc in a page with a hwpoison memdesc when we
>>>> encounter the error.  We still need a folio flag to indicate that "this
>>>> folio contains a page with hwpoison".  I haven't put much thought yet
>>>> into interaction with HUGETLB_PAGE_OPTIMIZE_VMEMMAP; maybe "other things"
>>>> includes an index of where the actually poisoned page is in the folio,
>>>> so it doesn't matter if the pages alias with each other as we can recover
>>>> the information when it becomes useful to do so.
>>>>
>>>>> But... for now I think hiding this complexity inside the page allocator
>>>>> is good enough. For now this would just mean splitting a frozen page
>>>
>>> I want to add one more thing. For HugeTLB, kernel clears the HWPoison
>>> flag on the folio and move it to every raw pages in raw_hwp_page list
>>> (see folio_clear_hugetlb_hwpoison). So page allocator has no hint that
>>> some pages passed into free_frozen_pages has HWPoison. It has to
>>> traverse 2^order pages to tell, if I am not mistaken, which goes
>>> against the past effort to reduce sanity checks. I believe this is one
>>> reason I choosed to handle the problem in hugetlb / memory-failure.
>>
>> I think we can skip calling folio_clear_hugetlb_hwpoison() and teach the
>
> Nit: also skip folio_free_raw_hwp so the hugetlb-specific llist
> containing the raw pages and owned by memory-failure is preserved? And
> expect the page allocator to use it for whatever purpose then free the
> llist? Doesn't seem to follow the correct ownership rule.
>
>> buddy allocator to handle this. free_pages_prepare() already handles
>> (PageHWPoison(page) && !order) case, we just need to extend that to
>> support hugetlb folios as well.
>>
>>> For the new interface Harry requested, is it the caller's
>>> responsibility to ensure that the folio contains HWPoison pages (to be
>>> even better, maybe point out the exact ones?), so that page allocator
>>> at least doesn't waste cycles to search non-exist HWPoison in the set
>>> of pages?
>>
>> With implicit handling it would be the page allocator's responsibility
>> to check and handle hwpoison hugetlb folios.
>
> Does this mean we must bake hugetlb-specific logic in the page
> allocator's freeing path? AFAICT today the contract in
> free_frozen_page doesn't contain much hugetlb info.
>
> I saw there is already some hugetlb-specific logic in page_alloc.c,
> but perhaps not valid for adding more.
>
>>
>>> Or caller and page allocator need to agree on some contract? Say
>>> caller has to set has_hwpoisoned flag in non-zero order folio to free.
>>> This allows the old interface free_frozen_pages an easy way using the
>>> has_hwpoison flag from the second page. I know has_hwpoison is "#if
>>> defined" on THP and using it for hugetlb probably is not very clean,
>>> but are there other concerns?
>>
>> As you mentioned has_hwpoisoned is used for THPs and for a hugetlb
>> folio. But for a hugetlb folio folio_test_hwpoison() returns true
>> if it has at least one hwpoison pages (assuming that we don't clear it
>> before freeing).
>>
>> So in free_pages_prepare():
>>
>> if (folio_test_hugetlb(folio) && folio_test_hwpoison(folio)) {
>>   /*
>>    * Handle hwpoison hugetlb folios; transfer the error information
>>    * to individual pages, clear hwpoison flag of the folio,
>>    * perform non-uniform split on the frozen folio.
>>    */
>> } else if (PageHWPoison(page) && !order) {
>>   /* We already handle this in the allocator. */
>> }
>>
>> This would be sufficient?
>
> Wouldn't this confuse the page allocator into thinking the healthy
> head page is HWPoison (when it actually isn't)? I thought that was one
> of the reasons has_hwpoison exists.

Is there a reason why hugetlb does not use the has_hwpoisoned flag?

BTW, __split_unmapped_folio() currently sets has_hwpoisoned on the
after-split folios by scanning every single page in the to-be-split
folio. The related code is in __split_folio_to_order(). But the code is
not efficient for a non-uniform split, since it calls
__split_folio_to_order() multiple times: when non-uniformly splitting
order-N to order-0, 2^(N-1) pages are scanned once, 2^(N-2) pages are
scanned twice, 2^(N-3) pages are scanned 3 times, ..., 4 pages are
scanned N-2 times. It can be optimized with some additional code in
__split_folio_to_order().

Something like the patch below; it assumes PageHWPoison(split_at) == true:

From 219466f5d5edc4e8bf0e5402c5deffb584c6a2a0 Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Tue, 18 Nov 2025 14:55:36 -0500
Subject: [PATCH] mm/huge_memory: optimize hwpoison page scan

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d716c6965e27..54a933a20f1b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3233,8 +3233,11 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
 					caller_pins;
 }

-static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
+static bool page_range_has_hwpoisoned(struct page *page, long nr_pages, struct page *donot_scan)
 {
+	if (donot_scan && donot_scan >= page && donot_scan < page + nr_pages)
+		return false;
+
 	for (; nr_pages; page++, nr_pages--)
 		if (PageHWPoison(page))
 			return true;
@@ -3246,7 +3249,7 @@ static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
  * all the resulting folios.
  */
 static void __split_folio_to_order(struct folio *folio, int old_order,
-		int new_order)
+		int new_order, struct page *donot_scan)
 {
 	/* Scan poisoned pages when split a poisoned folio to large folios */
 	const bool handle_hwpoison = folio_test_has_hwpoisoned(folio) && new_order;
@@ -3258,7 +3261,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,

 	/* Check first new_nr_pages since the loop below skips them */
 	if (handle_hwpoison &&
-	    page_range_has_hwpoisoned(folio_page(folio, 0), new_nr_pages))
+	    page_range_has_hwpoisoned(folio_page(folio, 0), new_nr_pages, donot_scan))
 		folio_set_has_hwpoisoned(folio);
 	/*
 	 * Skip the first new_nr_pages, since the new folio from them have all
@@ -3308,7 +3311,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
 				 LRU_GEN_MASK | LRU_REFS_MASK));

 		if (handle_hwpoison &&
-		    page_range_has_hwpoisoned(new_head, new_nr_pages))
+		    page_range_has_hwpoisoned(new_head, new_nr_pages, donot_scan))
 			folio_set_has_hwpoisoned(new_folio);

 		new_folio->mapping = folio->mapping;
@@ -3438,7 +3441,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 		folio_split_memcg_refs(folio, old_order, split_order);
 		split_page_owner(&folio->page, old_order, split_order);
 		pgalloc_tag_split(folio, old_order, split_order);
-		__split_folio_to_order(folio, old_order, split_order);
+		__split_folio_to_order(folio, old_order, split_order, uniform_split ? NULL : split_at);

 		if (is_anon) {
 			mod_mthp_stat(old_order, MTHP_STAT_NR_ANON, -1);
-- 
2.51.0


>
>>
>> Or do we want to handle THPs as well, in case of split failure in
>> memory_failure()? if so we need to handle folio_test_has_hwpoisoned()
>> case as well...
>
> Yeah, I think this is another good use case for our request to page allocator.
>
>>
>>>>> inside the page allocator (probably non-uniform?). We can later re-implement
>>>>> this to provide better support for memdescs.
>>>>
>>>> Yes, I like this approach.  But then I'm not the page allocator
>>>> maintainer ;-)
>>>
>>> If page allocator maintainers can weigh in here, that will be very helpful!
>>
>> Yeah, I'm not a maintainer either ;) it'll be great to get opinions
>> from page allocator folks!

I think this is a good approach as long as it does not add too much overhead
on the page freeing path. Otherwise dispatch the job to a workqueue?

Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order
  2025-11-18 21:54               ` Zi Yan
@ 2025-11-19 12:37                 ` Harry Yoo
  2025-11-19 19:21                   ` Jiaqi Yan
  0 siblings, 1 reply; 21+ messages in thread
From: Harry Yoo @ 2025-11-19 12:37 UTC (permalink / raw)
  To: Zi Yan
  Cc: Jiaqi Yan, Matthew Wilcox, david, Vlastimil Babka, nao.horiguchi,
	linmiaohe, lorenzo.stoakes, william.roche, tony.luck,
	wangkefeng.wang, jane.chu, akpm, osalvador, muchun.song,
	linux-mm, linux-kernel, linux-fsdevel, Michal Hocko,
	Suren Baghdasaryan, Brendan Jackman, Johannes Weiner

On Tue, Nov 18, 2025 at 04:54:31PM -0500, Zi Yan wrote:
> On 18 Nov 2025, at 14:26, Jiaqi Yan wrote:
> 
> > On Tue, Nov 18, 2025 at 2:20 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> >>
> >> On Mon, Nov 17, 2025 at 10:24:27PM -0800, Jiaqi Yan wrote:
> >>> On Mon, Nov 17, 2025 at 5:43 AM Matthew Wilcox <willy@infradead.org> wrote:
> >>>>
> >>>> On Mon, Nov 17, 2025 at 12:15:23PM +0900, Harry Yoo wrote:
> >>>>> On Sun, Nov 16, 2025 at 11:51:14AM +0000, Matthew Wilcox wrote:
> >>>>>> But since we're only doing this on free, we won't need to do folio
> >>>>>> allocations at all; we'll just be able to release the good pages to the
> >>>>>> page allocator and sequester the hwpoison pages.
> >>>>>
> >>>>> [+Cc PAGE ALLOCATOR folks]
> >>>>>
> >>>>> So we need an interface to free only healthy portion of a hwpoison folio.
> >>>
> >>> +1, with some of my own thoughts below.
> >>>
> >>>>>
> >>>>> I think a proper approach to this should be to "free a hwpoison folio
> >>>>> just like freeing a normal folio via folio_put() or free_frozen_pages(),
> >>>>> then the page allocator will add only healthy pages to the freelist and
> >>>>> isolate the hwpoison pages". Oherwise we'll end up open coding a lot,
> >>>>> which is too fragile.
> >>>>
> >>>> Yes, I think it should be handled by the page allocator.  There may be
> >>>
> >>> I agree with Matthew, Harry, and David. The page allocator seems best
> >>> suited to handle HWPoison subpages without any new folio allocations.
> >>
> >> Sorry I should have been clearer. I don't think adding an **explicit**
> >> interface to free an hwpoison folio is worth; instead implicitly
> >> handling during freeing of a folio seems more feasible.
> >
> > That's fine with me, just more to be taken care of by page allocator.
> >
> >>
> >>>> some complexity to this that I've missed, eg if hugetlb wants to retain
> >>>> the good 2MB chunks of a 1GB allocation.  I'm not sure that's a useful
> >>>> thing to do or not.
> >>>>
> >>>>> In fact, that can be done by teaching free_pages_prepare() how to handle
> >>>>> the case where one or more subpages of a folio are hwpoison pages.
> >>>>>
> >>>>> How this should be implemented in the page allocator in memdescs world?
> >>>>> Hmm, we'll want to do some kind of non-uniform split, without actually
> >>>>> splitting the folio but allocating struct buddy?
> >>>>
> >>>> Let me sketch that out, realising that it's subject to change.
> >>>>
> >>>> A page in buddy state can't need a memdesc allocated.  Otherwise we're
> >>>> allocating memory to free memory, and that way lies madness.  We can't
> >>>> do the hack of "embed struct buddy in the page that we're freeing"
> >>>> because HIGHMEM.  So we'll never shrink struct page smaller than struct
> >>>> buddy (which is fine because I've laid out how to get to a 64 bit struct
> >>>> buddy, and we're probably two years from getting there anyway).
> >>>>
> >>>> My design for handling hwpoison is that we do allocate a struct hwpoison
> >>>> for a page.  It looks like this (for now, in my head):
> >>>>
> >>>> struct hwpoison {
> >>>>         memdesc_t original;
> >>>>         ... other things ...
> >>>> };
> >>>>
> >>>> So we can replace the memdesc in a page with a hwpoison memdesc when we
> >>>> encounter the error.  We still need a folio flag to indicate that "this
> >>>> folio contains a page with hwpoison".  I haven't put much thought yet
> >>>> into interaction with HUGETLB_PAGE_OPTIMIZE_VMEMMAP; maybe "other things"
> >>>> includes an index of where the actually poisoned page is in the folio,
> >>>> so it doesn't matter if the pages alias with each other as we can recover
> >>>> the information when it becomes useful to do so.
> >>>>
> >>>>> But... for now I think hiding this complexity inside the page allocator
> >>>>> is good enough. For now this would just mean splitting a frozen page
> >>>
> >>> I want to add one more thing. For HugeTLB, kernel clears the HWPoison
> >>> flag on the folio and move it to every raw pages in raw_hwp_page list
> >>> (see folio_clear_hugetlb_hwpoison). So page allocator has no hint that
> >>> some pages passed into free_frozen_pages has HWPoison. It has to
> >>> traverse 2^order pages to tell, if I am not mistaken, which goes
> >>> against the past effort to reduce sanity checks. I believe this is one
> >>> reason I choosed to handle the problem in hugetlb / memory-failure.
> >>
> >> I think we can skip calling folio_clear_hugetlb_hwpoison() and teach the
> >
> > Nit: also skip folio_free_raw_hwp so the hugetlb-specific llist
> > containing the raw pages and owned by memory-failure is preserved? And
> > expect the page allocator to use it for whatever purpose then free the
> > llist? Doesn't seem to follow the correct ownership rule.
> >
> >> buddy allocator to handle this. free_pages_prepare() already handles
> >> (PageHWPoison(page) && !order) case, we just need to extend that to
> >> support hugetlb folios as well.
> >>
> >>> For the new interface Harry requested, is it the caller's
> >>> responsibility to ensure that the folio contains HWPoison pages (to be
> >>> even better, maybe point out the exact ones?), so that page allocator
> >>> at least doesn't waste cycles to search non-exist HWPoison in the set
> >>> of pages?
> >>
> >> With implicit handling it would be the page allocator's responsibility
> >> to check and handle hwpoison hugetlb folios.
> >
> > Does this mean we must bake hugetlb-specific logic in the page
> > allocator's freeing path? AFAICT today the contract in
> > free_frozen_page doesn't contain much hugetlb info.
> >
> > I saw there is already some hugetlb-specific logic in page_alloc.c,
> > but perhaps not valid for adding more.
> >
> >>
> >>> Or caller and page allocator need to agree on some contract? Say
> >>> caller has to set has_hwpoisoned flag in non-zero order folio to free.
> >>> This allows the old interface free_frozen_pages an easy way using the
> >>> has_hwpoison flag from the second page. I know has_hwpoison is "#if
> >>> defined" on THP and using it for hugetlb probably is not very clean,
> >>> but are there other concerns?
> >>
> >> As you mentioned has_hwpoisoned is used for THPs and for a hugetlb
> >> folio. But for a hugetlb folio folio_test_hwpoison() returns true
> >> if it has at least one hwpoison pages (assuming that we don't clear it
> >> before freeing).
> >>
> >> So in free_pages_prepare():
> >>
> >> if (folio_test_hugetlb(folio) && folio_test_hwpoison(folio)) {
> >>   /*
> >>    * Handle hwpoison hugetlb folios; transfer the error information
> >>    * to individual pages, clear hwpoison flag of the folio,
> >>    * perform non-uniform split on the frozen folio.
> >>    */
> >> } else if (PageHWPoison(page) && !order) {
> >>   /* We already handle this in the allocator. */
> >> }
> >>
> >> This would be sufficient?
> >
> > Wouldn't this confuse the page allocator into thinking the healthy
> > head page is HWPoison (when it actually isn't)? I thought that was one
> > of the reasons has_hwpoison exists.

AFAICT in the current code we don't set PG_hwpoison on individual
pages for hugetlb folios, so it won't confuse the page allocator.

> Is there a reason why hugetlb does not use has_hwpoison flag?

But yeah sounds like hugetlb is quite special here :)

I don't see why we should not use has_hwpoisoned and I think it's fine
to set has_hwpoisoned on hwpoison hugetlb folios in
folio_clear_hugetlb_hwpoison() and check the flag in the page allocator!

And since the split code already has to scan base pages to check
whether there is an hwpoison page in each new folio created by the
split (as Zi Yan mentioned), I think it's fine to keep calling
folio_free_raw_hwp() in folio_clear_hugetlb_hwpoison(), set
has_hwpoisoned there as well, and then scan the pages in
free_pages_prepare() once we know has_hwpoisoned is set.

That should address Jiaqi's concern on adding hugetlb-specific code
in the page allocator.

So summing up:

1. Transfer raw hwp list to individual pages by setting PG_hwpoison
   (that's done in folio_clear_hugetlb_hwpoison()->folio_free_raw_hwp()!)

2. Set has_hwpoisoned in folio_clear_hugetlb_hwpoison()

3. Check has_hwpoisoned in free_pages_prepare() and, if it's set,
   iterate over all base pages and do a non-uniform split by calling
   __split_unmapped_folio() at each hwpoisoned page (see the rough
   sketch after this list).

   I think it's fine to iterate over the base pages and check the
   hwpoison flag as long as we do that only when we know there is an
   hwpoison page.

   But maybe we need to dispatch the job to a workqueue as Zi Yan said,
   because it'll take a while to iterate 512 * 512 pages when we're
   freeing a 1GB hugetlb folio.

4. Optimize __split_unmapped_folio() as suggested by Zi Yan below.
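A rough, untested sketch of what steps 2 and 3 could look like (the
actual split and isolation are only described in comments, not spelled
out):

/* 2. in folio_clear_hugetlb_hwpoison(), after moving the raw hwp list
 *    into per-page PG_hwpoison: */
folio_set_has_hwpoisoned(folio);

/* 3. in free_pages_prepare(): */
if (unlikely(order && folio_test_has_hwpoisoned(folio))) {
	long i;

	folio_clear_has_hwpoisoned(folio);
	for (i = 0; i < (1L << order); i++) {
		struct page *p = folio_page(folio, i);

		if (!PageHWPoison(p))
			continue;
		/*
		 * Do the non-uniform split of the frozen folio around p
		 * and keep p off the pcplists / freelists; the healthy
		 * after-split folios are freed as usual.
		 */
	}
}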

BTW, I think we have to discard the folios that have hwpoison pages
when we fail to split some parts? (We don't have to discard all of
them; we may have managed to split some parts while other parts
failed.)

-- 
Cheers,
Harry / Hyeonggon

> BTW, __split_unmapped_folio() currently sets has_hwpoison to the after-split
> folios by scanning every single page in the to-be-split folio.
>
> The related code is in __split_folio_to_order(). But the code is not
> efficient for non-uniform split, since it calls __split_folio_to_order()
> multiple time, meaning when non-uniform split order-N to order-0,
> 2^(N-1) pages are scanned once, 2^(N-2) pages are scanned twice,
> 2^(N-3) pages are scanned 3 times, ..., 4 pages are scanned N-4 times.
> It can be optimized with some additional code in __split_folio_to_order().
> 
> Something like the patch below, it assumes PageHWPoison(split_at) == true:
> 
> From 219466f5d5edc4e8bf0e5402c5deffb584c6a2a0 Mon Sep 17 00:00:00 2001
> From: Zi Yan <ziy@nvidia.com>
> Date: Tue, 18 Nov 2025 14:55:36 -0500
> Subject: [PATCH] mm/huge_memory: optimize hwpoison page scan
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  mm/huge_memory.c | 13 ++++++++-----
>  1 file changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d716c6965e27..54a933a20f1b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3233,8 +3233,11 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>  					caller_pins;
>  }
> 
> -static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
> +static bool page_range_has_hwpoisoned(struct page *page, long nr_pages, struct page *donot_scan)
>  {
> +	if (donot_scan && donot_scan >= page && donot_scan < page + nr_pages)
> +		return false;
> +
>  	for (; nr_pages; page++, nr_pages--)
>  		if (PageHWPoison(page))
>  			return true;
> @@ -3246,7 +3249,7 @@ static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
>   * all the resulting folios.
>   */
>  static void __split_folio_to_order(struct folio *folio, int old_order,
> -		int new_order)
> +		int new_order, struct page *donot_scan)
>  {
>  	/* Scan poisoned pages when split a poisoned folio to large folios */
>  	const bool handle_hwpoison = folio_test_has_hwpoisoned(folio) && new_order;
> @@ -3258,7 +3261,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
> 
>  	/* Check first new_nr_pages since the loop below skips them */
>  	if (handle_hwpoison &&
> -	    page_range_has_hwpoisoned(folio_page(folio, 0), new_nr_pages))
> +	    page_range_has_hwpoisoned(folio_page(folio, 0), new_nr_pages, donot_scan))
>  		folio_set_has_hwpoisoned(folio);
>  	/*
>  	 * Skip the first new_nr_pages, since the new folio from them have all
> @@ -3308,7 +3311,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>  				 LRU_GEN_MASK | LRU_REFS_MASK));
> 
>  		if (handle_hwpoison &&
> -		    page_range_has_hwpoisoned(new_head, new_nr_pages))
> +		    page_range_has_hwpoisoned(new_head, new_nr_pages, donot_scan))
>  			folio_set_has_hwpoisoned(new_folio);
> 
>  		new_folio->mapping = folio->mapping;
> @@ -3438,7 +3441,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>  		folio_split_memcg_refs(folio, old_order, split_order);
>  		split_page_owner(&folio->page, old_order, split_order);
>  		pgalloc_tag_split(folio, old_order, split_order);
> -		__split_folio_to_order(folio, old_order, split_order);
> +		__split_folio_to_order(folio, old_order, split_order, uniform_split ? NULL : split_at);
> 
>  		if (is_anon) {
>  			mod_mthp_stat(old_order, MTHP_STAT_NR_ANON, -1);
> -- 
> 2.51.0
> 
> >> Or do we want to handle THPs as well, in case of split failure in
> >> memory_failure()? if so we need to handle folio_test_has_hwpoisoned()
> >> case as well...
> >
> > Yeah, I think this is another good use case for our request to page allocator.
> >
> >>
> >>>>> inside the page allocator (probably non-uniform?). We can later re-implement
> >>>>> this to provide better support for memdescs.
> >>>>
> >>>> Yes, I like this approach.  But then I'm not the page allocator
> >>>> maintainer ;-)
> >>>
> >>> If page allocator maintainers can weigh in here, that will be very helpful!
> >>
> >> Yeah, I'm not a maintainer either ;) it'll be great to get opinions
> >> from page allocator folks!
> 
> I think this is a good approach as long as it does not add too much overhead
> on the page freeing path. Otherwise dispatch the job to a workqueue?
> 
> Best Regards,
> Yan, Zi


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order
  2025-11-19 12:37                 ` Harry Yoo
@ 2025-11-19 19:21                   ` Jiaqi Yan
  2025-11-19 20:35                     ` Zi Yan
  0 siblings, 1 reply; 21+ messages in thread
From: Jiaqi Yan @ 2025-11-19 19:21 UTC (permalink / raw)
  To: Harry Yoo
  Cc: Zi Yan, Matthew Wilcox, david, Vlastimil Babka, nao.horiguchi,
	linmiaohe, lorenzo.stoakes, william.roche, tony.luck,
	wangkefeng.wang, jane.chu, akpm, osalvador, muchun.song,
	linux-mm, linux-kernel, linux-fsdevel, Michal Hocko,
	Suren Baghdasaryan, Brendan Jackman, Johannes Weiner

On Wed, Nov 19, 2025 at 4:37 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>
> On Tue, Nov 18, 2025 at 04:54:31PM -0500, Zi Yan wrote:
> > On 18 Nov 2025, at 14:26, Jiaqi Yan wrote:
> >
> > > On Tue, Nov 18, 2025 at 2:20 AM Harry Yoo <harry.yoo@oracle.com> wrote:
> > >>
> > >> On Mon, Nov 17, 2025 at 10:24:27PM -0800, Jiaqi Yan wrote:
> > >>> On Mon, Nov 17, 2025 at 5:43 AM Matthew Wilcox <willy@infradead.org> wrote:
> > >>>>
> > >>>> On Mon, Nov 17, 2025 at 12:15:23PM +0900, Harry Yoo wrote:
> > >>>>> On Sun, Nov 16, 2025 at 11:51:14AM +0000, Matthew Wilcox wrote:
> > >>>>>> But since we're only doing this on free, we won't need to do folio
> > >>>>>> allocations at all; we'll just be able to release the good pages to the
> > >>>>>> page allocator and sequester the hwpoison pages.
> > >>>>>
> > >>>>> [+Cc PAGE ALLOCATOR folks]
> > >>>>>
> > >>>>> So we need an interface to free only healthy portion of a hwpoison folio.
> > >>>
> > >>> +1, with some of my own thoughts below.
> > >>>
> > >>>>>
> > >>>>> I think a proper approach to this should be to "free a hwpoison folio
> > >>>>> just like freeing a normal folio via folio_put() or free_frozen_pages(),
> > >>>>> then the page allocator will add only healthy pages to the freelist and
> > >>>>> isolate the hwpoison pages". Oherwise we'll end up open coding a lot,
> > >>>>> which is too fragile.
> > >>>>
> > >>>> Yes, I think it should be handled by the page allocator.  There may be
> > >>>
> > >>> I agree with Matthew, Harry, and David. The page allocator seems best
> > >>> suited to handle HWPoison subpages without any new folio allocations.
> > >>
> > >> Sorry I should have been clearer. I don't think adding an **explicit**
> > >> interface to free an hwpoison folio is worth; instead implicitly
> > >> handling during freeing of a folio seems more feasible.
> > >
> > > That's fine with me, just more to be taken care of by page allocator.
> > >
> > >>
> > >>>> some complexity to this that I've missed, eg if hugetlb wants to retain
> > >>>> the good 2MB chunks of a 1GB allocation.  I'm not sure that's a useful
> > >>>> thing to do or not.
> > >>>>
> > >>>>> In fact, that can be done by teaching free_pages_prepare() how to handle
> > >>>>> the case where one or more subpages of a folio are hwpoison pages.
> > >>>>>
> > >>>>> How this should be implemented in the page allocator in memdescs world?
> > >>>>> Hmm, we'll want to do some kind of non-uniform split, without actually
> > >>>>> splitting the folio but allocating struct buddy?
> > >>>>
> > >>>> Let me sketch that out, realising that it's subject to change.
> > >>>>
> > >>>> A page in buddy state can't need a memdesc allocated.  Otherwise we're
> > >>>> allocating memory to free memory, and that way lies madness.  We can't
> > >>>> do the hack of "embed struct buddy in the page that we're freeing"
> > >>>> because HIGHMEM.  So we'll never shrink struct page smaller than struct
> > >>>> buddy (which is fine because I've laid out how to get to a 64 bit struct
> > >>>> buddy, and we're probably two years from getting there anyway).
> > >>>>
> > >>>> My design for handling hwpoison is that we do allocate a struct hwpoison
> > >>>> for a page.  It looks like this (for now, in my head):
> > >>>>
> > >>>> struct hwpoison {
> > >>>>         memdesc_t original;
> > >>>>         ... other things ...
> > >>>> };
> > >>>>
> > >>>> So we can replace the memdesc in a page with a hwpoison memdesc when we
> > >>>> encounter the error.  We still need a folio flag to indicate that "this
> > >>>> folio contains a page with hwpoison".  I haven't put much thought yet
> > >>>> into interaction with HUGETLB_PAGE_OPTIMIZE_VMEMMAP; maybe "other things"
> > >>>> includes an index of where the actually poisoned page is in the folio,
> > >>>> so it doesn't matter if the pages alias with each other as we can recover
> > >>>> the information when it becomes useful to do so.
> > >>>>
> > >>>>> But... for now I think hiding this complexity inside the page allocator
> > >>>>> is good enough. For now this would just mean splitting a frozen page
> > >>>
> > >>> I want to add one more thing. For HugeTLB, kernel clears the HWPoison
> > >>> flag on the folio and move it to every raw pages in raw_hwp_page list
> > >>> (see folio_clear_hugetlb_hwpoison). So page allocator has no hint that
> > >>> some pages passed into free_frozen_pages has HWPoison. It has to
> > >>> traverse 2^order pages to tell, if I am not mistaken, which goes
> > >>> against the past effort to reduce sanity checks. I believe this is one
> > >>> reason I choosed to handle the problem in hugetlb / memory-failure.
> > >>
> > >> I think we can skip calling folio_clear_hugetlb_hwpoison() and teach the
> > >
> > > Nit: also skip folio_free_raw_hwp so the hugetlb-specific llist
> > > containing the raw pages and owned by memory-failure is preserved? And
> > > expect the page allocator to use it for whatever purpose then free the
> > > llist? Doesn't seem to follow the correct ownership rule.
> > >
> > >> buddy allocator to handle this. free_pages_prepare() already handles
> > >> (PageHWPoison(page) && !order) case, we just need to extend that to
> > >> support hugetlb folios as well.
> > >>
> > >>> For the new interface Harry requested, is it the caller's
> > >>> responsibility to ensure that the folio contains HWPoison pages (to be
> > >>> even better, maybe point out the exact ones?), so that page allocator
> > >>> at least doesn't waste cycles to search non-exist HWPoison in the set
> > >>> of pages?
> > >>
> > >> With implicit handling it would be the page allocator's responsibility
> > >> to check and handle hwpoison hugetlb folios.
> > >
> > > Does this mean we must bake hugetlb-specific logic in the page
> > > allocator's freeing path? AFAICT today the contract in
> > > free_frozen_page doesn't contain much hugetlb info.
> > >
> > > I saw there is already some hugetlb-specific logic in page_alloc.c,
> > > but perhaps not valid for adding more.
> > >
> > >>
> > >>> Or caller and page allocator need to agree on some contract? Say
> > >>> caller has to set has_hwpoisoned flag in non-zero order folio to free.
> > >>> This allows the old interface free_frozen_pages an easy way using the
> > >>> has_hwpoison flag from the second page. I know has_hwpoison is "#if
> > >>> defined" on THP and using it for hugetlb probably is not very clean,
> > >>> but are there other concerns?
> > >>
> > >> As you mentioned has_hwpoisoned is used for THPs and for a hugetlb
> > >> folio. But for a hugetlb folio folio_test_hwpoison() returns true
> > >> if it has at least one hwpoison pages (assuming that we don't clear it
> > >> before freeing).
> > >>
> > >> So in free_pages_prepare():
> > >>
> > >> if (folio_test_hugetlb(folio) && folio_test_hwpoison(folio)) {
> > >>   /*
> > >>    * Handle hwpoison hugetlb folios; transfer the error information
> > >>    * to individual pages, clear hwpoison flag of the folio,
> > >>    * perform non-uniform split on the frozen folio.
> > >>    */
> > >> } else if (PageHWPoison(page) && !order) {
> > >>   /* We already handle this in the allocator. */
> > >> }
> > >>
> > >> This would be sufficient?
> > >
> > > Wouldn't this confuse the page allocator into thinking the healthy
> > > head page is HWPoison (when it actually isn't)? I thought that was one
> > > of the reasons has_hwpoison exists.
>
> AFAICT in the current code we don't set PG_hwpoison on individual
> pages for hugetlb folios, so it won't confuse the page allocator.
>
> > Is there a reason why hugetlb does not use has_hwpoison flag?
>
> But yeah sounds like hugetlb is quite special here :)
>
> I don't see why we should not use has_hwpoisoned and I think it's fine
> to set has_hwpoisoned on hwpoison hugetlb folios in
> folio_clear_hugetlb_hwpoison() and check the flag in the page allocator!
>
> And since the split code has to scan base pages to check if there
> is an hwpoison page in the new folio created by split (as Zi Yan mentioned),
> I think it's fine to not skip calling folio_free_raw_hwp() in
> folio_clear_hugetlb_hwpoison() and set has_hwpoisoned instead, and then
> scan pages in free_pages_prepare() when we know has_hwpoisoned is set.
>
> That should address Jiaqi's concern on adding hugetlb-specific code
> in the page allocator.
>
> So summing up:
>
> 1. Transfer raw hwp list to individual pages by setting PG_hwpoison
>    (that's done in folio_clear_hugetlb_hwpoison()->folio_free_raw_hwp()!)
>
> 2. Set has_hwpoisoned in folio_clear_hugetlb_hwpoison()

IIUC, #1 and #2 are exactly what I considered: no change in
folio_clear_hugetlb_hwpoison, but set the has_hwpoisoned flag (instead
of the HWPoison flag) on the folio as the hint to the page allocator.

>
> 3. Check has_hwpoisoned in free_pages_prepare() and if it's set,
>    iterate over all base pages and do non-uniform split by calling
>    __split_unmapped_folio() at each hwpoisoned pages.

IIUC, directly re-using __split_unmapped_folio still has some memory
overhead. But I believe that's O(log(n)) and much better than the
current uniform split version: for example, non-uniformly splitting a
1G hugetlb folio around one HWPoison page produces roughly one
after-split folio per order instead of 262144 order-0 folios. So if
that's acceptable, I can give this solution a try.

>
>    I think it's fine to iterate over base pages and check hwpoison flag
>    as long as we do that only when we know there's an hwpoison page.
>
>    But maybe we need to dispatch the job to a workqueue as Zi Yan said,
>    because it'll take a while to iterate 512 * 512 pages when we're freeing
>    1GB hugetlb folios.
>
> 4. Optimize __split_unmapped_folio() as suggested by Zi Yan below.
>
> BTW I think we have to discard folios that has hwpoison pages
> when we fail to split some parts? (we don't have to discard all of them,
> but we may have managed to split some parts while other parts failed)
>

Maybe we can fail in other places, but at least __split_unmapped_folio
can't fail when mapping is NULL, which is the case for us.

> --
> Cheers,
> Harry / Hyeonggon
>
> > BTW, __split_unmapped_folio() currently sets has_hwpoison to the after-split
> > folios by scanning every single page in the to-be-split folio.
> >
> > The related code is in __split_folio_to_order(). But the code is not
> > efficient for non-uniform split, since it calls __split_folio_to_order()
> > multiple time, meaning when non-uniform split order-N to order-0,
> > 2^(N-1) pages are scanned once, 2^(N-2) pages are scanned twice,
> > 2^(N-3) pages are scanned 3 times, ..., 4 pages are scanned N-4 times.
> > It can be optimized with some additional code in __split_folio_to_order().
> >
> > Something like the patch below, it assumes PageHWPoison(split_at) == true:
> >
> > From 219466f5d5edc4e8bf0e5402c5deffb584c6a2a0 Mon Sep 17 00:00:00 2001
> > From: Zi Yan <ziy@nvidia.com>
> > Date: Tue, 18 Nov 2025 14:55:36 -0500
> > Subject: [PATCH] mm/huge_memory: optimize hwpoison page scan
> >
> > Signed-off-by: Zi Yan <ziy@nvidia.com>
> > ---
> >  mm/huge_memory.c | 13 ++++++++-----
> >  1 file changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index d716c6965e27..54a933a20f1b 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -3233,8 +3233,11 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
> >                                       caller_pins;
> >  }
> >
> > -static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
> > +static bool page_range_has_hwpoisoned(struct page *page, long nr_pages, struct page *donot_scan)
> >  {
> > +     if (donot_scan && donot_scan >= page && donot_scan < page + nr_pages)
> > +             return false;
> > +
> >       for (; nr_pages; page++, nr_pages--)
> >               if (PageHWPoison(page))
> >                       return true;
> > @@ -3246,7 +3249,7 @@ static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
> >   * all the resulting folios.
> >   */
> >  static void __split_folio_to_order(struct folio *folio, int old_order,
> > -             int new_order)
> > +             int new_order, struct page *donot_scan)
> >  {
> >       /* Scan poisoned pages when split a poisoned folio to large folios */
> >       const bool handle_hwpoison = folio_test_has_hwpoisoned(folio) && new_order;
> > @@ -3258,7 +3261,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
> >
> >       /* Check first new_nr_pages since the loop below skips them */
> >       if (handle_hwpoison &&
> > -         page_range_has_hwpoisoned(folio_page(folio, 0), new_nr_pages))
> > +         page_range_has_hwpoisoned(folio_page(folio, 0), new_nr_pages, donot_scan))
> >               folio_set_has_hwpoisoned(folio);
> >       /*
> >        * Skip the first new_nr_pages, since the new folio from them have all
> > @@ -3308,7 +3311,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
> >                                LRU_GEN_MASK | LRU_REFS_MASK));
> >
> >               if (handle_hwpoison &&
> > -                 page_range_has_hwpoisoned(new_head, new_nr_pages))
> > +                 page_range_has_hwpoisoned(new_head, new_nr_pages, donot_scan))
> >                       folio_set_has_hwpoisoned(new_folio);
> >
> >               new_folio->mapping = folio->mapping;
> > @@ -3438,7 +3441,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
> >               folio_split_memcg_refs(folio, old_order, split_order);
> >               split_page_owner(&folio->page, old_order, split_order);
> >               pgalloc_tag_split(folio, old_order, split_order);
> > -             __split_folio_to_order(folio, old_order, split_order);
> > +             __split_folio_to_order(folio, old_order, split_order, uniform_split ? NULL : split_at);
> >
> >               if (is_anon) {
> >                       mod_mthp_stat(old_order, MTHP_STAT_NR_ANON, -1);
> > --
> > 2.51.0
> >
> > >> Or do we want to handle THPs as well, in case of split failure in
> > >> memory_failure()? if so we need to handle folio_test_has_hwpoisoned()
> > >> case as well...
> > >
> > > Yeah, I think this is another good use case for our request to page allocator.
> > >
> > >>
> > >>>>> inside the page allocator (probably non-uniform?). We can later re-implement
> > >>>>> this to provide better support for memdescs.
> > >>>>
> > >>>> Yes, I like this approach.  But then I'm not the page allocator
> > >>>> maintainer ;-)
> > >>>
> > >>> If page allocator maintainers can weigh in here, that will be very helpful!
> > >>
> > >> Yeah, I'm not a maintainer either ;) it'll be great to get opinions
> > >> from page allocator folks!
> >
> > I think this is a good approach as long as it does not add too much overhead
> > on the page freeing path. Otherwise dispatch the job to a workqueue?
> >
> > Best Regards,
> > Yan, Zi


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order
  2025-11-19 19:21                   ` Jiaqi Yan
@ 2025-11-19 20:35                     ` Zi Yan
  0 siblings, 0 replies; 21+ messages in thread
From: Zi Yan @ 2025-11-19 20:35 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: Harry Yoo, Matthew Wilcox, david, Vlastimil Babka, nao.horiguchi,
	linmiaohe, lorenzo.stoakes, william.roche, tony.luck,
	wangkefeng.wang, jane.chu, akpm, osalvador, muchun.song,
	linux-mm, linux-kernel, linux-fsdevel, Michal Hocko,
	Suren Baghdasaryan, Brendan Jackman, Johannes Weiner

On 19 Nov 2025, at 14:21, Jiaqi Yan wrote:

> On Wed, Nov 19, 2025 at 4:37 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>>
>> On Tue, Nov 18, 2025 at 04:54:31PM -0500, Zi Yan wrote:
>>> On 18 Nov 2025, at 14:26, Jiaqi Yan wrote:
>>>
>>>> On Tue, Nov 18, 2025 at 2:20 AM Harry Yoo <harry.yoo@oracle.com> wrote:
>>>>>
>>>>> On Mon, Nov 17, 2025 at 10:24:27PM -0800, Jiaqi Yan wrote:
>>>>>> On Mon, Nov 17, 2025 at 5:43 AM Matthew Wilcox <willy@infradead.org> wrote:
>>>>>>>
>>>>>>> On Mon, Nov 17, 2025 at 12:15:23PM +0900, Harry Yoo wrote:
>>>>>>>> On Sun, Nov 16, 2025 at 11:51:14AM +0000, Matthew Wilcox wrote:
>>>>>>>>> But since we're only doing this on free, we won't need to do folio
>>>>>>>>> allocations at all; we'll just be able to release the good pages to the
>>>>>>>>> page allocator and sequester the hwpoison pages.
>>>>>>>>
>>>>>>>> [+Cc PAGE ALLOCATOR folks]
>>>>>>>>
> >>>>>>>> So we need an interface to free only the healthy portion of a hwpoison folio.
>>>>>>
>>>>>> +1, with some of my own thoughts below.
>>>>>>
>>>>>>>>
>>>>>>>> I think a proper approach to this should be to "free a hwpoison folio
>>>>>>>> just like freeing a normal folio via folio_put() or free_frozen_pages(),
>>>>>>>> then the page allocator will add only healthy pages to the freelist and
> >>>>>>>> isolate the hwpoison pages". Otherwise we'll end up open coding a lot,
>>>>>>>> which is too fragile.
>>>>>>>
>>>>>>> Yes, I think it should be handled by the page allocator.  There may be
>>>>>>
>>>>>> I agree with Matthew, Harry, and David. The page allocator seems best
>>>>>> suited to handle HWPoison subpages without any new folio allocations.
>>>>>
>>>>> Sorry I should have been clearer. I don't think adding an **explicit**
> >>>>> interface to free an hwpoison folio is worth it; instead, implicitly
> >>>>> handling it during the freeing of a folio seems more feasible.
>>>>
>>>> That's fine with me, just more to be taken care of by page allocator.
>>>>
>>>>>
>>>>>>> some complexity to this that I've missed, eg if hugetlb wants to retain
>>>>>>> the good 2MB chunks of a 1GB allocation.  I'm not sure that's a useful
>>>>>>> thing to do or not.
>>>>>>>
>>>>>>>> In fact, that can be done by teaching free_pages_prepare() how to handle
>>>>>>>> the case where one or more subpages of a folio are hwpoison pages.
>>>>>>>>
> >>>>>>>> How should this be implemented in the page allocator in the memdescs world?
>>>>>>>> Hmm, we'll want to do some kind of non-uniform split, without actually
>>>>>>>> splitting the folio but allocating struct buddy?
>>>>>>>
>>>>>>> Let me sketch that out, realising that it's subject to change.
>>>>>>>
>>>>>>> A page in buddy state can't need a memdesc allocated.  Otherwise we're
>>>>>>> allocating memory to free memory, and that way lies madness.  We can't
>>>>>>> do the hack of "embed struct buddy in the page that we're freeing"
>>>>>>> because HIGHMEM.  So we'll never shrink struct page smaller than struct
>>>>>>> buddy (which is fine because I've laid out how to get to a 64 bit struct
>>>>>>> buddy, and we're probably two years from getting there anyway).
>>>>>>>
>>>>>>> My design for handling hwpoison is that we do allocate a struct hwpoison
>>>>>>> for a page.  It looks like this (for now, in my head):
>>>>>>>
>>>>>>> struct hwpoison {
>>>>>>>         memdesc_t original;
>>>>>>>         ... other things ...
>>>>>>> };
>>>>>>>
>>>>>>> So we can replace the memdesc in a page with a hwpoison memdesc when we
>>>>>>> encounter the error.  We still need a folio flag to indicate that "this
>>>>>>> folio contains a page with hwpoison".  I haven't put much thought yet
>>>>>>> into interaction with HUGETLB_PAGE_OPTIMIZE_VMEMMAP; maybe "other things"
>>>>>>> includes an index of where the actually poisoned page is in the folio,
>>>>>>> so it doesn't matter if the pages alias with each other as we can recover
>>>>>>> the information when it becomes useful to do so.
>>>>>>>
>>>>>>>> But... for now I think hiding this complexity inside the page allocator
>>>>>>>> is good enough. For now this would just mean splitting a frozen page
>>>>>>
> >>>>>> I want to add one more thing. For HugeTLB, the kernel clears the HWPoison
> >>>>>> flag on the folio and moves it to the raw pages in the raw_hwp_page list
> >>>>>> (see folio_clear_hugetlb_hwpoison). So the page allocator has no hint that
> >>>>>> some pages passed into free_frozen_pages have HWPoison. It has to
> >>>>>> traverse 2^order pages to tell, if I am not mistaken, which goes
> >>>>>> against the past effort to reduce sanity checks. I believe this is one
> >>>>>> reason I chose to handle the problem in hugetlb / memory-failure.
>>>>>
>>>>> I think we can skip calling folio_clear_hugetlb_hwpoison() and teach the
>>>>
>>>> Nit: also skip folio_free_raw_hwp so the hugetlb-specific llist
>>>> containing the raw pages and owned by memory-failure is preserved? And
>>>> expect the page allocator to use it for whatever purpose then free the
>>>> llist? Doesn't seem to follow the correct ownership rule.
>>>>
>>>>> buddy allocator to handle this. free_pages_prepare() already handles
>>>>> (PageHWPoison(page) && !order) case, we just need to extend that to
>>>>> support hugetlb folios as well.
>>>>>
>>>>>> For the new interface Harry requested, is it the caller's
>>>>>> responsibility to ensure that the folio contains HWPoison pages (to be
> >>>>>> even better, maybe point out the exact ones?), so that the page allocator
> >>>>>> at least doesn't waste cycles searching for non-existent HWPoison pages in
> >>>>>> the set of pages?
>>>>>
>>>>> With implicit handling it would be the page allocator's responsibility
>>>>> to check and handle hwpoison hugetlb folios.
>>>>
>>>> Does this mean we must bake hugetlb-specific logic in the page
>>>> allocator's freeing path? AFAICT today the contract in
>>>> free_frozen_page doesn't contain much hugetlb info.
>>>>
>>>> I saw there is already some hugetlb-specific logic in page_alloc.c,
> >>>> but perhaps that's not a good justification for adding more.
>>>>
>>>>>
> >>>>>> Or do the caller and page allocator need to agree on some contract? Say
> >>>>>> the caller has to set the has_hwpoisoned flag in the non-zero-order folio
> >>>>>> to free. This gives the old interface free_frozen_pages an easy way to
> >>>>>> check, using the has_hwpoison flag from the second page. I know
> >>>>>> has_hwpoison is "#if defined" on THP and using it for hugetlb probably
> >>>>>> is not very clean, but are there other concerns?
>>>>>
> >>>>> As you mentioned, has_hwpoisoned is used for THPs. But for a hugetlb
> >>>>> folio, folio_test_hwpoison() returns true if it has at least one
> >>>>> hwpoison page (assuming that we don't clear it before freeing).
>>>>>
>>>>> So in free_pages_prepare():
>>>>>
>>>>> if (folio_test_hugetlb(folio) && folio_test_hwpoison(folio)) {
>>>>>   /*
>>>>>    * Handle hwpoison hugetlb folios; transfer the error information
>>>>>    * to individual pages, clear hwpoison flag of the folio,
>>>>>    * perform non-uniform split on the frozen folio.
>>>>>    */
>>>>> } else if (PageHWPoison(page) && !order) {
>>>>>   /* We already handle this in the allocator. */
>>>>> }
>>>>>
>>>>> This would be sufficient?
>>>>
>>>> Wouldn't this confuse the page allocator into thinking the healthy
>>>> head page is HWPoison (when it actually isn't)? I thought that was one
>>>> of the reasons has_hwpoison exists.
>>
>> AFAICT in the current code we don't set PG_hwpoison on individual
>> pages for hugetlb folios, so it won't confuse the page allocator.
>>
>>> Is there a reason why hugetlb does not use has_hwpoison flag?
>>
>> But yeah sounds like hugetlb is quite special here :)
>>
>> I don't see why we should not use has_hwpoisoned and I think it's fine
>> to set has_hwpoisoned on hwpoison hugetlb folios in
>> folio_clear_hugetlb_hwpoison() and check the flag in the page allocator!
>>
>> And since the split code has to scan base pages to check if there
>> is an hwpoison page in the new folio created by split (as Zi Yan mentioned),
>> I think it's fine to not skip calling folio_free_raw_hwp() in
>> folio_clear_hugetlb_hwpoison() and set has_hwpoisoned instead, and then
>> scan pages in free_pages_prepare() when we know has_hwpoisoned is set.
>>
>> That should address Jiaqi's concern on adding hugetlb-specific code
>> in the page allocator.
>>
>> So summing up:
>>
>> 1. Transfer raw hwp list to individual pages by setting PG_hwpoison
>>    (that's done in folio_clear_hugetlb_hwpoison()->folio_free_raw_hwp()!)
>>
>> 2. Set has_hwpoisoned in folio_clear_hugetlb_hwpoison()
>
> IIUC, #1 and #2 are exactly what I considered: no change in
> folio_clear_hugetlb_hwpoison, but set the has_hwpoisoned flag (instead
> of the HWPoison flag) on the folio as the hint to the page allocator.
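
For concreteness, a minimal (untested) sketch of what #1 + #2 could look
like on top of the existing folio_clear_hugetlb_hwpoison(), modulo the
"#if defined" THP dependency of has_hwpoisoned that Jiaqi mentioned:

void folio_clear_hugetlb_hwpoison(struct folio *folio)
{
        if (folio_test_hugetlb_raw_hwp_unreliable(folio))
                return;
        folio_clear_hwpoison(folio);
        /* #1: sets PG_hwpoison on each raw page recorded in the llist. */
        folio_free_raw_hwp(folio, true);
        /* #2 (new): hint for free_pages_prepare() to scan the base pages. */
        folio_set_has_hwpoisoned(folio);
}
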
>
>>
>> 3. Check has_hwpoisoned in free_pages_prepare() and if it's set,
>>    iterate over all base pages and do non-uniform split by calling
>>    __split_unmapped_folio() at each hwpoisoned page.
>
> IIUC, directly re-using __split_unmapped_folio still needs some memory
> overhead. But I believe that's log(n) and much better than the current
> uniform split version. So if that's acceptable, I can give this
> solution a try.

If we do not want to allocate additional memory, we will need to know
all the HWPoison pages within the given folio and skip them.

I notice that free_pages_prepare() actually scans all pages of a large
folio. This means we can get all the HWPoison pages within a large folio.

A naive implementation would be that if any page within a large folio
is HWPoison, which can be determined in free_pages_prepare(), we can
free this large folio at order-0 granularity and skip HWPoison pages.

An improvement would be to store these HWPoison page pfns and
use the offset of each pfn from the first page to determine the max order
for freeing the non-HWPoison pages between HWPoison pages. For example,
for an order-3 folio with HWPoison page0 and page3, we skip page0,
free page1 at order-0, free page2 at order-0, free page4 to page7 at
order-2. But we will need memory to store these pfns.
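
To make the order computation concrete, here is an untested sketch of
freeing one healthy range, where end_pfn is either the next HWPoison pfn
or the end of the folio. The order of each block is limited both by the
alignment of the current pfn and by how many pages remain before end_pfn,
similar in spirit to how memblock hands arbitrary ranges back to the buddy
allocator; free_one_block() is again a hypothetical stand-in:

static void free_healthy_range(unsigned long start_pfn, unsigned long end_pfn)
{
        while (start_pfn < end_pfn) {
                /* Largest order start_pfn is aligned to, capped at MAX_PAGE_ORDER. */
                int order = min_t(int, MAX_PAGE_ORDER, __ffs(start_pfn));

                /* Shrink the block until it fits before end_pfn. */
                while (start_pfn + (1UL << order) > end_pfn)
                        order--;

                free_one_block(pfn_to_page(start_pfn), order);
                start_pfn += 1UL << order;
        }
}

For the order-3 example above (poison at page0 and page3, folio starting at
an order-3 aligned pfn), this frees page1 and page2 at order-0 and page4 to
page7 at order-2, as described.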

Or we can move the page scanning (which calls free_tail_page_prepare()) to a
later stage of the page freeing path and free pages during the scan. Basically,
keep scanning as long as no HWPoison page is encountered; otherwise,
free the non-HWPoison pages between HWPoison pages using >0 orders if possible.

This change will be non-trivial but should not need additional memory
like __split_unmapped_folio().
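
A rough, untested sketch of that scan-and-free structure, reusing the
free_healthy_range() helper sketched above, so no extra memory is needed
to remember the poisoned pfns:

static void scan_and_free_folio(struct page *head, unsigned int order)
{
        unsigned long head_pfn = page_to_pfn(head);
        unsigned long end_pfn = head_pfn + (1UL << order);
        unsigned long pfn, run_start = head_pfn;

        for (pfn = head_pfn; pfn < end_pfn; pfn++) {
                if (!PageHWPoison(pfn_to_page(pfn)))
                        continue;
                /* Flush the healthy run before the poisoned page, then skip it. */
                free_healthy_range(run_start, pfn);
                run_start = pfn + 1;
        }
        /* Tail run after the last poisoned page (or the whole folio). */
        free_healthy_range(run_start, end_pfn);
}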

>
>>
>>    I think it's fine to iterate over base pages and check hwpoison flag
>>    as long as we do that only when we know there's an hwpoison page.
>>
>>    But maybe we need to dispatch the job to a workqueue as Zi Yan said,
>>    because it'll take a while to iterate 512 * 512 pages when we're freeing
>>    1GB hugetlb folios.
>>
>> 4. Optimize __split_unmapped_folio() as suggested by Zi Yan below.
>>
>> BTW I think we have to discard folios that have hwpoison pages
>> when we fail to split some parts? (we don't have to discard all of them,
>> but we may have managed to split some parts while other parts failed)
>>
>
> Maybe we can fail in other places, but at least __split_unmapped_folio
> can't fail when mapping is NULL, which is the case for us.
>
>> --
>> Cheers,
>> Harry / Hyeonggon
>>
>>> BTW, __split_unmapped_folio() currently sets has_hwpoison to the after-split
>>> folios by scanning every single page in the to-be-split folio.
>>>
>>> The related code is in __split_folio_to_order(). But the code is not
>>> efficient for non-uniform split, since it calls __split_folio_to_order()
>>> multiple times, meaning when non-uniform split order-N to order-0,
>>> 2^(N-1) pages are scanned once, 2^(N-2) pages are scanned twice,
>>> 2^(N-3) pages are scanned 3 times, ..., 4 pages are scanned N-4 times.
>>> It can be optimized with some additional code in __split_folio_to_order().
>>>
>>> Something like the patch below, it assumes PageHWPoison(split_at) == true:
>>>
>>> From 219466f5d5edc4e8bf0e5402c5deffb584c6a2a0 Mon Sep 17 00:00:00 2001
>>> From: Zi Yan <ziy@nvidia.com>
>>> Date: Tue, 18 Nov 2025 14:55:36 -0500
>>> Subject: [PATCH] mm/huge_memory: optimize hwpoison page scan
>>>
>>> Signed-off-by: Zi Yan <ziy@nvidia.com>
>>> ---
>>>  mm/huge_memory.c | 13 ++++++++-----
>>>  1 file changed, 8 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index d716c6965e27..54a933a20f1b 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3233,8 +3233,11 @@ bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins)
>>>                                       caller_pins;
>>>  }
>>>
>>> -static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
>>> +static bool page_range_has_hwpoisoned(struct page *page, long nr_pages, struct page *donot_scan)
>>>  {
>>> +     if (donot_scan && donot_scan >= page && donot_scan < page + nr_pages)
>>> +             return false;
>>> +
>>>       for (; nr_pages; page++, nr_pages--)
>>>               if (PageHWPoison(page))
>>>                       return true;
>>> @@ -3246,7 +3249,7 @@ static bool page_range_has_hwpoisoned(struct page *page, long nr_pages)
>>>   * all the resulting folios.
>>>   */
>>>  static void __split_folio_to_order(struct folio *folio, int old_order,
>>> -             int new_order)
>>> +             int new_order, struct page *donot_scan)
>>>  {
>>>       /* Scan poisoned pages when split a poisoned folio to large folios */
>>>       const bool handle_hwpoison = folio_test_has_hwpoisoned(folio) && new_order;
>>> @@ -3258,7 +3261,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>>>
>>>       /* Check first new_nr_pages since the loop below skips them */
>>>       if (handle_hwpoison &&
>>> -         page_range_has_hwpoisoned(folio_page(folio, 0), new_nr_pages))
>>> +         page_range_has_hwpoisoned(folio_page(folio, 0), new_nr_pages, donot_scan))
>>>               folio_set_has_hwpoisoned(folio);
>>>       /*
>>>        * Skip the first new_nr_pages, since the new folio from them have all
>>> @@ -3308,7 +3311,7 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>>>                                LRU_GEN_MASK | LRU_REFS_MASK));
>>>
>>>               if (handle_hwpoison &&
>>> -                 page_range_has_hwpoisoned(new_head, new_nr_pages))
>>> +                 page_range_has_hwpoisoned(new_head, new_nr_pages, donot_scan))
>>>                       folio_set_has_hwpoisoned(new_folio);
>>>
>>>               new_folio->mapping = folio->mapping;
>>> @@ -3438,7 +3441,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>               folio_split_memcg_refs(folio, old_order, split_order);
>>>               split_page_owner(&folio->page, old_order, split_order);
>>>               pgalloc_tag_split(folio, old_order, split_order);
>>> -             __split_folio_to_order(folio, old_order, split_order);
>>> +             __split_folio_to_order(folio, old_order, split_order, uniform_split ? NULL : split_at);
>>>
>>>               if (is_anon) {
>>>                       mod_mthp_stat(old_order, MTHP_STAT_NR_ANON, -1);
>>> --
>>> 2.51.0
>>>
>>>>> Or do we want to handle THPs as well, in case of split failure in
>>>>> memory_failure()? if so we need to handle folio_test_has_hwpoisoned()
>>>>> case as well...
>>>>
>>>> Yeah, I think this is another good use case for our request to page allocator.
>>>>
>>>>>
>>>>>>>> inside the page allocator (probably non-uniform?). We can later re-implement
>>>>>>>> this to provide better support for memdescs.
>>>>>>>
>>>>>>> Yes, I like this approach.  But then I'm not the page allocator
>>>>>>> maintainer ;-)
>>>>>>
>>>>>> If page allocator maintainers can weigh in here, that will be very helpful!
>>>>>
>>>>> Yeah, I'm not a maintainer either ;) it'll be great to get opinions
>>>>> from page allocator folks!
>>>
>>> I think this is a good approach as long as it does not add too much overhead
>>> on the page freeing path. Otherwise dispatch the job to a workqueue?
>>>
>>> Best Regards,
>>> Yan, Zi


Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-11-19 20:35 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-11-16  1:47 [PATCH v1 0/2] Only free healthy pages in high-order HWPoison folio Jiaqi Yan
2025-11-16  1:47 ` [PATCH v1 1/2] mm/huge_memory: introduce uniform_split_unmapped_folio_to_zero_order Jiaqi Yan
2025-11-16 11:51   ` Matthew Wilcox
2025-11-17  3:15     ` Harry Yoo
2025-11-17  3:21       ` Zi Yan
2025-11-17  3:39         ` Harry Yoo
2025-11-17 13:43       ` Matthew Wilcox
2025-11-18  6:24         ` Jiaqi Yan
2025-11-18 10:19           ` Harry Yoo
2025-11-18 19:26             ` Jiaqi Yan
2025-11-18 21:54               ` Zi Yan
2025-11-19 12:37                 ` Harry Yoo
2025-11-19 19:21                   ` Jiaqi Yan
2025-11-19 20:35                     ` Zi Yan
2025-11-16 22:38   ` kernel test robot
2025-11-17 17:12   ` David Hildenbrand (Red Hat)
2025-11-16  1:47 ` [PATCH v1 2/2] mm/memory-failure: avoid free HWPoison high-order folio Jiaqi Yan
2025-11-16  2:10   ` Zi Yan
2025-11-18  5:12     ` Jiaqi Yan
2025-11-17 17:15   ` David Hildenbrand (Red Hat)
2025-11-18  5:17     ` Jiaqi Yan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox