* [PATCH 1/1] mm: prevent poison consumption when splitting THP
@ 2025-09-28 3:28 Qiuxu Zhuo
2025-09-28 21:55 ` Jiaqi Yan
` (4 more replies)
0 siblings, 5 replies; 28+ messages in thread
From: Qiuxu Zhuo @ 2025-09-28 3:28 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes, linmiaohe, tony.luck
Cc: qiuxu.zhuo, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, nao.horiguchi, farrah.chen, linux-mm,
linux-kernel, Andrew Zaborowski
From: Andrew Zaborowski <andrew.zaborowski@intel.com>
When performing memory error injection on a THP (Transparent Huge Page)
mapped to userspace on an x86 server, the kernel panics with the following
trace. The expected behavior is to terminate the affected process instead
of panicking the kernel, as the x86 Machine Check code can recover from an
in-userspace #MC.
mce: [Hardware Error]: CPU 0: Machine Check Exception: f Bank 3: bd80000000070134
mce: [Hardware Error]: RIP 10:<ffffffff8372f8bc> {memchr_inv+0x4c/0xf0}
mce: [Hardware Error]: TSC afff7bbff88a ADDR 1d301b000 MISC 80 PPIN 1e741e77539027db
mce: [Hardware Error]: PROCESSOR 0:d06d0 TIME 1758093249 SOCKET 0 APIC 0 microcode 80000320
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
Kernel panic - not syncing: Fatal local machine check
The root cause of this panic is that handling a memory failure triggered by
an in-userspace #MC necessitates splitting the THP. The splitting process
employs a mechanism, implemented in try_to_map_unused_to_zeropage(), which
reads the sub-pages of the THP to identify zero-filled pages. However,
reading the sub-pages results in a second in-kernel #MC, occurring before
the initial memory_failure() completes, ultimately leading to a kernel
panic. See the kernel panic call trace on the two #MCs.
  First Machine Check occurs // [1]
    memory_failure()         // [2]
      try_to_split_thp_page()
        split_huge_page()
          split_huge_page_to_list_to_order()
            __folio_split()  // [3]
              remap_page()
                remove_migration_ptes()
                  remove_migration_pte()
                    try_to_map_unused_to_zeropage()
                      memchr_inv()                   // [4]
                        Second Machine Check occurs  // [5]
                          Kernel panic
[1] Triggered by accessing a hardware-poisoned THP in userspace, which is
typically recoverable by terminating the affected process.
[2] Call folio_set_has_hwpoisoned() before try_to_split_thp_page().
[3] Pass the RMP_USE_SHARED_ZEROPAGE remap flag to remap_page().
[4] Re-access sub-pages of the hw-poisoned THP in the kernel.
[5] Triggered in-kernel, leading to a kernel panic.
In Step[2], memory_failure() sets the has_hwpoisoned flag on the THP,
right before calling try_to_split_thp_page(). Fix this panic by not
passing the RMP_USE_SHARED_ZEROPAGE flag to remap_page() in Step[3]
if the THP has the has_hwpoisoned flag set. This prevents access to
sub-pages of the poisoned THP for zero-page identification, avoiding
a second in-kernel #MC that would cause kernel panic.
[ Qiuxu: Re-wrote the commit message. ]
Reported-by: Farrah Chen <farrah.chen@intel.com>
Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
mm/huge_memory.c | 3 ++-
mm/memory-failure.c | 6 ++++--
2 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9c38a95e9f09..1568f0308b90 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3588,6 +3588,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
struct list_head *list, bool uniform_split)
{
struct deferred_split *ds_queue = get_deferred_split_queue(folio);
+ bool has_hwpoisoned = folio_test_has_hwpoisoned(folio);
XA_STATE(xas, &folio->mapping->i_pages, folio->index);
struct folio *end_folio = folio_next(folio);
bool is_anon = folio_test_anon(folio);
@@ -3858,7 +3859,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
if (nr_shmem_dropped)
shmem_uncharge(mapping->host, nr_shmem_dropped);
- if (!ret && is_anon)
+ if (!ret && is_anon && !has_hwpoisoned)
remap_flags = RMP_USE_SHARED_ZEROPAGE;
remap_page(folio, 1 << order, remap_flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index df6ee59527dd..3ba6fd4079ab 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2351,8 +2351,10 @@ int memory_failure(unsigned long pfn, int flags)
* otherwise it may race with THP split.
* And the flag can't be set in get_hwpoison_page() since
* it is called by soft offline too and it is just called
- * for !MF_COUNT_INCREASED. So here seems to be the best
- * place.
+ * for !MF_COUNT_INCREASED.
+ * It also tells split_huge_page() to not bother using
+ * the shared zeropage -- the all-zeros check would
+ * consume the poison. So here seems to be the best place.
*
* Don't need care about the above error handling paths for
* get_hwpoison_page() since they handle either free page
--
2.43.0
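
The folio-level gate added to __folio_split() above can be modelled in userspace as follows. This is only a sketch for reasoning about the condition: `struct model_folio`, `model_remap_flags()` and the value of `MODEL_RMP_USE_SHARED_ZEROPAGE` are hypothetical stand-ins, not kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's remap flag; the value is arbitrary. */
#define MODEL_RMP_USE_SHARED_ZEROPAGE 0x1

/* Hypothetical, simplified view of the folio state the patch looks at. */
struct model_folio {
	bool is_anon;        /* folio backs anonymous memory */
	bool has_hwpoisoned; /* folio-level "contains a poisoned page" flag */
};

/*
 * Mirrors the condition in the patched __folio_split():
 *     if (!ret && is_anon && !has_hwpoisoned)
 *             remap_flags = RMP_USE_SHARED_ZEROPAGE;
 * ret is the split result (0 on success).
 */
int model_remap_flags(const struct model_folio *folio, int ret)
{
	int remap_flags = 0;

	if (!ret && folio->is_anon && !folio->has_hwpoisoned)
		remap_flags = MODEL_RMP_USE_SHARED_ZEROPAGE;

	return remap_flags;
}
```

With this gate, one poisoned sub-page disables zero-page detection for every sub-page of the THP, including the healthy ones.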
* Re: [PATCH 1/1] mm: prevent poison consumption when splitting THP
2025-09-28 3:28 [PATCH 1/1] mm: prevent poison consumption when splitting THP Qiuxu Zhuo
@ 2025-09-28 21:55 ` Jiaqi Yan
2025-09-29 12:29 ` Miaohe Lin
2025-09-29 13:27 ` Zhuo, Qiuxu
2025-09-29 7:34 ` David Hildenbrand
` (3 subsequent siblings)
4 siblings, 2 replies; 28+ messages in thread
From: Jiaqi Yan @ 2025-09-28 21:55 UTC (permalink / raw)
To: Qiuxu Zhuo
Cc: akpm, david, lorenzo.stoakes, linmiaohe, tony.luck, ziy,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, nao.horiguchi, farrah.chen, linux-mm, linux-kernel,
Andrew Zaborowski
On Sat, Sep 27, 2025 at 8:30 PM Qiuxu Zhuo <qiuxu.zhuo@intel.com> wrote:
>
> From: Andrew Zaborowski <andrew.zaborowski@intel.com>
>
> When performing memory error injection on a THP (Transparent Huge Page)
> mapped to userspace on an x86 server, the kernel panics with the following
> trace. The expected behavior is to terminate the affected process instead
> of panicking the kernel, as the x86 Machine Check code can recover from an
> in-userspace #MC.
>
> mce: [Hardware Error]: CPU 0: Machine Check Exception: f Bank 3: bd80000000070134
> mce: [Hardware Error]: RIP 10:<ffffffff8372f8bc> {memchr_inv+0x4c/0xf0}
> mce: [Hardware Error]: TSC afff7bbff88a ADDR 1d301b000 MISC 80 PPIN 1e741e77539027db
> mce: [Hardware Error]: PROCESSOR 0:d06d0 TIME 1758093249 SOCKET 0 APIC 0 microcode 80000320
> mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
> Kernel panic - not syncing: Fatal local machine check
>
> The root cause of this panic is that handling a memory failure triggered by
> an in-userspace #MC necessitates splitting the THP. The splitting process
> employs a mechanism, implemented in try_to_map_unused_to_zeropage(), which
> reads the sub-pages of the THP to identify zero-filled pages. However,
> reading the sub-pages results in a second in-kernel #MC, occurring before
> the initial memory_failure() completes, ultimately leading to a kernel
> panic. See the kernel panic call trace on the two #MCs.
>
> First Machine Check occurs // [1]
> memory_failure() // [2]
> try_to_split_thp_page()
> split_huge_page()
> split_huge_page_to_list_to_order()
> __folio_split() // [3]
> remap_page()
> remove_migration_ptes()
> remove_migration_pte()
> try_to_map_unused_to_zeropage()
Just an observation: unfortunately, a THP only has PageHasHWPoisoned and
doesn't record which sub-page is HWPoisoned. Otherwise, we could still use
the zeropage for the sub-pages that are not HWPoisoned.
> memchr_inv() // [4]
> Second Machine Check occurs // [5]
> Kernel panic
>
> [1] Triggered by accessing a hardware-poisoned THP in userspace, which is
> typically recoverable by terminating the affected process.
>
> [2] Call folio_set_has_hwpoisoned() before try_to_split_thp_page().
>
> [3] Pass the RMP_USE_SHARED_ZEROPAGE remap flag to remap_page().
>
> [4] Re-access sub-pages of the hw-poisoned THP in the kernel.
>
> [5] Triggered in-kernel, leading to a kernel panic.
>
> In Step[2], memory_failure() sets the has_hwpoisoned flag on the THP,
> right before calling try_to_split_thp_page(). Fix this panic by not
> passing the RMP_USE_SHARED_ZEROPAGE flag to remap_page() in Step[3]
> if the THP has the has_hwpoisoned flag set. This prevents access to
> sub-pages of the poisoned THP for zero-page identification, avoiding
> a second in-kernel #MC that would cause kernel panic.
>
> [ Qiuxu: Re-wrote the commit message. ]
>
> Reported-by: Farrah Chen <farrah.chen@intel.com>
> Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com>
> Tested-by: Farrah Chen <farrah.chen@intel.com>
> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> ---
> mm/huge_memory.c | 3 ++-
> mm/memory-failure.c | 6 ++++--
> 2 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9c38a95e9f09..1568f0308b90 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3588,6 +3588,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> struct list_head *list, bool uniform_split)
> {
> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> + bool has_hwpoisoned = folio_test_has_hwpoisoned(folio);
> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> struct folio *end_folio = folio_next(folio);
> bool is_anon = folio_test_anon(folio);
> @@ -3858,7 +3859,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> if (nr_shmem_dropped)
> shmem_uncharge(mapping->host, nr_shmem_dropped);
>
> - if (!ret && is_anon)
> + if (!ret && is_anon && !has_hwpoisoned)
> remap_flags = RMP_USE_SHARED_ZEROPAGE;
> remap_page(folio, 1 << order, remap_flags);
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index df6ee59527dd..3ba6fd4079ab 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2351,8 +2351,10 @@ int memory_failure(unsigned long pfn, int flags)
> * otherwise it may race with THP split.
> * And the flag can't be set in get_hwpoison_page() since
> * it is called by soft offline too and it is just called
> - * for !MF_COUNT_INCREASED. So here seems to be the best
> - * place.
> + * for !MF_COUNT_INCREASED.
> + * It also tells split_huge_page() to not bother using
nit: this may confuse readers of split_huge_page() when they don't see
any check on the hwpoison flag there. So from a readability PoV, it may be
better to refer to this in a more generic term like the "following THP
splitting process" (I would prefer this), or to point precisely to
__folio_split().
Everything else looks good to me.
Reviewed-by: Jiaqi Yan <jiaqiyan@google.com>
> + * the shared zeropage -- the all-zeros check would
> + * consume the poison. So here seems to be the best place.
> *
> * Don't need care about the above error handling paths for
> * get_hwpoison_page() since they handle either free page
> --
> 2.43.0
>
>
* Re: [PATCH 1/1] mm: prevent poison consumption when splitting THP
2025-09-28 3:28 [PATCH 1/1] mm: prevent poison consumption when splitting THP Qiuxu Zhuo
2025-09-28 21:55 ` Jiaqi Yan
@ 2025-09-29 7:34 ` David Hildenbrand
2025-09-29 13:52 ` Zhuo, Qiuxu
2025-10-12 1:37 ` Wei Yang
2025-10-11 7:55 ` [PATCH v2 " Qiuxu Zhuo
` (2 subsequent siblings)
4 siblings, 2 replies; 28+ messages in thread
From: David Hildenbrand @ 2025-09-29 7:34 UTC (permalink / raw)
To: Qiuxu Zhuo, akpm, lorenzo.stoakes, linmiaohe, tony.luck
Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, nao.horiguchi, farrah.chen, linux-mm, linux-kernel,
Andrew Zaborowski
On 28.09.25 05:28, Qiuxu Zhuo wrote:
> From: Andrew Zaborowski <andrew.zaborowski@intel.com>
>
> When performing memory error injection on a THP (Transparent Huge Page)
> mapped to userspace on an x86 server, the kernel panics with the following
> trace. The expected behavior is to terminate the affected process instead
> of panicking the kernel, as the x86 Machine Check code can recover from an
> in-userspace #MC.
>
> mce: [Hardware Error]: CPU 0: Machine Check Exception: f Bank 3: bd80000000070134
> mce: [Hardware Error]: RIP 10:<ffffffff8372f8bc> {memchr_inv+0x4c/0xf0}
> mce: [Hardware Error]: TSC afff7bbff88a ADDR 1d301b000 MISC 80 PPIN 1e741e77539027db
> mce: [Hardware Error]: PROCESSOR 0:d06d0 TIME 1758093249 SOCKET 0 APIC 0 microcode 80000320
> mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
> Kernel panic - not syncing: Fatal local machine check
>
> The root cause of this panic is that handling a memory failure triggered by
> an in-userspace #MC necessitates splitting the THP. The splitting process
> employs a mechanism, implemented in try_to_map_unused_to_zeropage(), which
> reads the sub-pages of the THP to identify zero-filled pages. However,
> reading the sub-pages results in a second in-kernel #MC, occurring before
> the initial memory_failure() completes, ultimately leading to a kernel
> panic. See the kernel panic call trace on the two #MCs.
>
> First Machine Check occurs // [1]
> memory_failure() // [2]
> try_to_split_thp_page()
> split_huge_page()
> split_huge_page_to_list_to_order()
> __folio_split() // [3]
> remap_page()
> remove_migration_ptes()
> remove_migration_pte()
> try_to_map_unused_to_zeropage()
> memchr_inv() // [4]
> Second Machine Check occurs // [5]
> Kernel panic
>
> [1] Triggered by accessing a hardware-poisoned THP in userspace, which is
> typically recoverable by terminating the affected process.
>
> [2] Call folio_set_has_hwpoisoned() before try_to_split_thp_page().
>
> [3] Pass the RMP_USE_SHARED_ZEROPAGE remap flag to remap_page().
>
> [4] Re-access sub-pages of the hw-poisoned THP in the kernel.
>
> [5] Triggered in-kernel, leading to a kernel panic.
>
> In Step[2], memory_failure() sets the has_hwpoisoned flag on the THP,
> right before calling try_to_split_thp_page(). Fix this panic by not
> passing the RMP_USE_SHARED_ZEROPAGE flag to remap_page() in Step[3]
> if the THP has the has_hwpoisoned flag set. This prevents access to
> sub-pages of the poisoned THP for zero-page identification, avoiding
> a second in-kernel #MC that would cause kernel panic.
>
> [ Qiuxu: Re-wrote the commit message. ]
>
> Reported-by: Farrah Chen <farrah.chen@intel.com>
> Signed-off-by: Andrew Zaborowski <andrew.zaborowski@intel.com>
> Tested-by: Farrah Chen <farrah.chen@intel.com>
> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> ---
> mm/huge_memory.c | 3 ++-
> mm/memory-failure.c | 6 ++++--
> 2 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9c38a95e9f09..1568f0308b90 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3588,6 +3588,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> struct list_head *list, bool uniform_split)
> {
> struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> + bool has_hwpoisoned = folio_test_has_hwpoisoned(folio);
> XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> struct folio *end_folio = folio_next(folio);
> bool is_anon = folio_test_anon(folio);
> @@ -3858,7 +3859,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> if (nr_shmem_dropped)
> shmem_uncharge(mapping->host, nr_shmem_dropped);
>
> - if (!ret && is_anon)
> + if (!ret && is_anon && !has_hwpoisoned)
> remap_flags = RMP_USE_SHARED_ZEROPAGE;
> remap_page(folio, 1 << order, remap_flags);
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index df6ee59527dd..3ba6fd4079ab 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2351,8 +2351,10 @@ int memory_failure(unsigned long pfn, int flags)
> * otherwise it may race with THP split.
> * And the flag can't be set in get_hwpoison_page() since
> * it is called by soft offline too and it is just called
> - * for !MF_COUNT_INCREASED. So here seems to be the best
> - * place.
> + * for !MF_COUNT_INCREASED.
> + * It also tells split_huge_page() to not bother using
> + * the shared zeropage -- the all-zeros check would
> + * consume the poison. So here seems to be the best place.
> *
> * Don't need care about the above error handling paths for
> * get_hwpoison_page() since they handle either free page
Hm, I wonder if we should actually check in try_to_map_unused_to_zeropage()
whether the page has the hwpoison flag set. Nothing wrong with scanning
non-affected pages.
In thp_underused() we should just skip the folio entirely I guess, so keep
it simple.
So what about something like this:
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9c38a95e9f091..d4109fd7fa1f2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4121,6 +4121,9 @@ static bool thp_underused(struct folio *folio)
if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
return false;
+ folio_contain_hwpoisoned_page(folio)
+ return false;
+
for (i = 0; i < folio_nr_pages(folio); i++) {
kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
diff --git a/mm/migrate.c b/mm/migrate.c
index 9e5ef39ce73af..393fc2ffc96e5 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -305,8 +305,9 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
pte_t newpte;
void *addr;
- if (PageCompound(page))
+ if (PageCompound(page) || PageHWPoison(page))
return false;
+
VM_BUG_ON_PAGE(!PageAnon(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
--
Cheers
David / dhildenb
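
The per-sub-page check proposed above for try_to_map_unused_to_zeropage() can be sketched as a userspace model. It assumes the poison flag is tested before any byte of the page is read; `struct model_page`, `model_can_use_zeropage()` and the `scans` counter are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096 /* assumed 4 KiB base page */

/* Hypothetical model of one sub-page after the split. */
struct model_page {
	bool hwpoison;                       /* per-page poison flag */
	unsigned int scans;                  /* number of content scans, for illustration */
	unsigned char data[MODEL_PAGE_SIZE]; /* page contents */
};

/* Returns true if the page could be replaced by the shared zero page. */
bool model_can_use_zeropage(struct model_page *page)
{
	size_t i;

	/* Test the flag first so a poisoned page's contents are never read. */
	if (page->hwpoison)
		return false;

	page->scans++;
	for (i = 0; i < MODEL_PAGE_SIZE; i++) {
		if (page->data[i] != 0)
			return false;
	}
	return true;
}
```

Unlike a folio-level gate, healthy sub-pages are still scanned here and can still be mapped to the zero page; only the flagged page is skipped.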
* Re: [PATCH 1/1] mm: prevent poison consumption when splitting THP
2025-09-28 21:55 ` Jiaqi Yan
@ 2025-09-29 12:29 ` Miaohe Lin
2025-09-29 13:57 ` Zhuo, Qiuxu
2025-09-29 13:27 ` Zhuo, Qiuxu
1 sibling, 1 reply; 28+ messages in thread
From: Miaohe Lin @ 2025-09-29 12:29 UTC (permalink / raw)
To: Jiaqi Yan, Qiuxu Zhuo
Cc: akpm, david, lorenzo.stoakes, tony.luck, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
nao.horiguchi, farrah.chen, linux-mm, linux-kernel,
Andrew Zaborowski
On 2025/9/29 5:55, Jiaqi Yan wrote:
> On Sat, Sep 27, 2025 at 8:30 PM Qiuxu Zhuo <qiuxu.zhuo@intel.com> wrote:
>>
>> From: Andrew Zaborowski <andrew.zaborowski@intel.com>
>>
>> When performing memory error injection on a THP (Transparent Huge Page)
>> mapped to userspace on an x86 server, the kernel panics with the following
>> trace. The expected behavior is to terminate the affected process instead
>> of panicking the kernel, as the x86 Machine Check code can recover from an
>> in-userspace #MC.
>>
>> mce: [Hardware Error]: CPU 0: Machine Check Exception: f Bank 3: bd80000000070134
>> mce: [Hardware Error]: RIP 10:<ffffffff8372f8bc> {memchr_inv+0x4c/0xf0}
>> mce: [Hardware Error]: TSC afff7bbff88a ADDR 1d301b000 MISC 80 PPIN 1e741e77539027db
>> mce: [Hardware Error]: PROCESSOR 0:d06d0 TIME 1758093249 SOCKET 0 APIC 0 microcode 80000320
>> mce: [Hardware Error]: Run the above through 'mcelog --ascii'
>> mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
>> Kernel panic - not syncing: Fatal local machine check
>>
>> The root cause of this panic is that handling a memory failure triggered by
>> an in-userspace #MC necessitates splitting the THP. The splitting process
>> employs a mechanism, implemented in try_to_map_unused_to_zeropage(), which
>> reads the sub-pages of the THP to identify zero-filled pages. However,
>> reading the sub-pages results in a second in-kernel #MC, occurring before
>> the initial memory_failure() completes, ultimately leading to a kernel
>> panic. See the kernel panic call trace on the two #MCs.
>>
>> First Machine Check occurs // [1]
>> memory_failure() // [2]
>> try_to_split_thp_page()
>> split_huge_page()
>> split_huge_page_to_list_to_order()
>> __folio_split() // [3]
>> remap_page()
>> remove_migration_ptes()
>> remove_migration_pte()
>> try_to_map_unused_to_zeropage()
>
> Just an observation: unfortunately, a THP only has PageHasHWPoisoned and
> doesn't record which sub-page is HWPoisoned. Otherwise, we could still use
> the zeropage for the sub-pages that are not HWPoisoned.
IIUC, the raw error page will have HWPoisoned flag set while the THP has
PageHasHWPoisoned set. So I think we could use zeropage for healthy sub-pages.
Thanks.
.
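
The relationship between the two flags described above can be pictured with a small userspace model. It assumes 512 sub-pages per THP (2 MiB with 4 KiB base pages, as on x86-64); `struct model_thp` and both helpers are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_SUBPAGES 512 /* assumed: 2 MiB THP, 4 KiB base pages */

/* Hypothetical model: one summary flag plus one flag per sub-page. */
struct model_thp {
	bool has_hwpoisoned;                   /* folio-level summary flag */
	bool page_hwpoison[MODEL_NR_SUBPAGES]; /* flag on the raw error page(s) */
};

/* Mark one sub-page as the raw error page; both flags end up set. */
void model_mark_poison(struct model_thp *thp, size_t idx)
{
	assert(idx < MODEL_NR_SUBPAGES);
	thp->page_hwpoison[idx] = true;
	thp->has_hwpoisoned = true;
}

/* Count sub-pages that remain safe to scan for all-zero contents. */
size_t model_count_scannable(const struct model_thp *thp)
{
	size_t i, n = 0;

	for (i = 0; i < MODEL_NR_SUBPAGES; i++) {
		if (!thp->page_hwpoison[i])
			n++;
	}
	return n;
}
```

In this model a single error leaves 511 of the 512 sub-pages eligible for the zero-page check, which is the gain of testing the per-page flag rather than the folio-level one.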
* RE: [PATCH 1/1] mm: prevent poison consumption when splitting THP
2025-09-28 21:55 ` Jiaqi Yan
2025-09-29 12:29 ` Miaohe Lin
@ 2025-09-29 13:27 ` Zhuo, Qiuxu
2025-09-29 15:51 ` Luck, Tony
1 sibling, 1 reply; 28+ messages in thread
From: Zhuo, Qiuxu @ 2025-09-29 13:27 UTC (permalink / raw)
To: Jiaqi Yan
Cc: akpm, david, lorenzo.stoakes, linmiaohe, Luck, Tony, ziy,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, nao.horiguchi, Chen, Farrah, linux-mm, linux-kernel,
Andrew Zaborowski
Hi Jiaqi,
> From: Jiaqi Yan <jiaqiyan@google.com>
> [...]
> > First Machine Check occurs // [1]
> > memory_failure() // [2]
> > try_to_split_thp_page()
> > split_huge_page()
> > split_huge_page_to_list_to_order()
> > __folio_split() // [3]
> > remap_page()
> > remove_migration_ptes()
> > remove_migration_pte()
> > try_to_map_unused_to_zeropage()
>
> Just an observation: unfortunately, a THP only has PageHasHWPoisoned and
> doesn't record which sub-page is HWPoisoned. Otherwise, we could still use
> the zeropage for the sub-pages that are not HWPoisoned.
>
Thanks for catching this.
Miaohe mentioned in another e-mail that there was an HWPoisoned flag for the raw error 4K page.
We could use that flag just to skip that raw error page and still use the zeropage for other
healthy sub-pages. I'll try that.
> > memchr_inv() // [4]
> > Second Machine Check occurs // [5]
> > Kernel panic
> >
> [...]
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -2351,8 +2351,10 @@ int memory_failure(unsigned long pfn, int flags)
> > * otherwise it may race with THP split.
> > * And the flag can't be set in get_hwpoison_page() since
> > * it is called by soft offline too and it is just called
> > - * for !MF_COUNT_INCREASED. So here seems to be the best
> > - * place.
> > + * for !MF_COUNT_INCREASED.
> > + * It also tells split_huge_page() to not bother using
>
> nit: this may confuse readers of split_huge_page() when they don't see any
> check on the hwpoison flag there. So from a readability PoV, it may be better
> to refer to this in a more generic term like the "following THP splitting
> process" (I would prefer this), or to point precisely to __folio_split().
>
OK. I'll update this comment in v2.
> Everything else looks good to me.
>
> Reviewed-by: Jiaqi Yan <jiaqiyan@google.com>
Thanks.
-Qiuxu
* RE: [PATCH 1/1] mm: prevent poison consumption when splitting THP
2025-09-29 7:34 ` David Hildenbrand
@ 2025-09-29 13:52 ` Zhuo, Qiuxu
2025-09-29 16:12 ` David Hildenbrand
2025-10-12 1:37 ` Wei Yang
1 sibling, 1 reply; 28+ messages in thread
From: Zhuo, Qiuxu @ 2025-09-29 13:52 UTC (permalink / raw)
To: David Hildenbrand, akpm, lorenzo.stoakes, linmiaohe, Luck, Tony
Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, nao.horiguchi, Chen, Farrah, linux-mm, linux-kernel,
Andrew Zaborowski
Hi David,
> From: David Hildenbrand <david@redhat.com>
> [...]
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -2351,8 +2351,10 @@ int memory_failure(unsigned long pfn, int flags)
> > * otherwise it may race with THP split.
> > * And the flag can't be set in get_hwpoison_page() since
> > * it is called by soft offline too and it is just called
> > - * for !MF_COUNT_INCREASED. So here seems to be the best
> > - * place.
> > + * for !MF_COUNT_INCREASED.
> > + * It also tells split_huge_page() to not bother using
> > + * the shared zeropage -- the all-zeros check would
> > + * consume the poison. So here seems to be the best place.
> > *
> > * Don't need care about the above error handling paths for
> > * get_hwpoison_page() since they handle either free page
>
> Hm, I wonder if we should actually check in
> try_to_map_unused_to_zeropage() whether the page has the hwpoison flag
> set. Nothing wrong with scanning non-affected pages.
>
Good point about continuing to scan non-affected pages for possible zeropage mapping.
> In thp_underused() we should just skip the folio entirely I guess, so keep it
> simple.
>
> So what about something like this:
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9c38a95e9f091..d4109fd7fa1f2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -4121,6 +4121,9 @@ static bool thp_underused(struct folio *folio)
> if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
> return false;
>
> + folio_contain_hwpoisoned_page(folio)
Typo here 😊?
if (folio_contain_hwpoisoned_page(folio))
> + return false;
> +
> for (i = 0; i < folio_nr_pages(folio); i++) {
> kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
> if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 9e5ef39ce73af..393fc2ffc96e5 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -305,8 +305,9 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
> pte_t newpte;
> void *addr;
>
> - if (PageCompound(page))
> + if (PageCompound(page) || PageHWPoison(page))
> return false;
> +
> VM_BUG_ON_PAGE(!PageAnon(page), page);
> VM_BUG_ON_PAGE(!PageLocked(page), page);
> VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
>
I tested this diff and it works well.
If there are no objections, I'll use this diff for v2.
Thanks David.
-Qiuxu
* RE: [PATCH 1/1] mm: prevent poison consumption when splitting THP
2025-09-29 12:29 ` Miaohe Lin
@ 2025-09-29 13:57 ` Zhuo, Qiuxu
2025-09-29 15:15 ` Jiaqi Yan
0 siblings, 1 reply; 28+ messages in thread
From: Zhuo, Qiuxu @ 2025-09-29 13:57 UTC (permalink / raw)
To: Miaohe Lin, Jiaqi Yan
Cc: akpm, david, lorenzo.stoakes, Luck, Tony, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
nao.horiguchi, Chen, Farrah, linux-mm, linux-kernel,
Andrew Zaborowski
Hi Miaohe,
> From: Miaohe Lin <linmiaohe@huawei.com>
> [...]
> >> First Machine Check occurs // [1]
> >> memory_failure() // [2]
> >> try_to_split_thp_page()
> >> split_huge_page()
> >> split_huge_page_to_list_to_order()
> >> __folio_split() // [3]
> >> remap_page()
> >> remove_migration_ptes()
> >> remove_migration_pte()
> >> try_to_map_unused_to_zeropage()
> >
> > Just an observation: unfortunately, a THP only has PageHasHWPoisoned and
> > doesn't record which sub-page is HWPoisoned. Otherwise, we could still use
> > the zeropage for the sub-pages that are not HWPoisoned.
>
> IIUC, the raw error page will have HWPoisoned flag set while the THP has
> PageHasHWPoisoned set. So I think we could use zeropage for healthy sub-
> pages.
Good point.
David's suggested diff in another e-mail checks the raw error page instead of
the entire folio. I tested that diff and it worked well.
-Qiuxu
* Re: [PATCH 1/1] mm: prevent poison consumption when splitting THP
2025-09-29 13:57 ` Zhuo, Qiuxu
@ 2025-09-29 15:15 ` Jiaqi Yan
0 siblings, 0 replies; 28+ messages in thread
From: Jiaqi Yan @ 2025-09-29 15:15 UTC (permalink / raw)
To: Zhuo, Qiuxu, Miaohe Lin, david
Cc: akpm, lorenzo.stoakes, Luck, Tony, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
nao.horiguchi, Chen, Farrah, linux-mm, linux-kernel,
Andrew Zaborowski
On Mon, Sep 29, 2025 at 6:57 AM Zhuo, Qiuxu <qiuxu.zhuo@intel.com> wrote:
>
> Hi Miaohe,
>
> > From: Miaohe Lin <linmiaohe@huawei.com>
> > [...]
> > >> First Machine Check occurs // [1]
> > >> memory_failure() // [2]
> > >> try_to_split_thp_page()
> > >> split_huge_page()
> > >> split_huge_page_to_list_to_order()
> > >> __folio_split() // [3]
> > >> remap_page()
> > >> remove_migration_ptes()
> > >> remove_migration_pte()
> > >> try_to_map_unused_to_zeropage()
> > >
> > > Just an observation: unfortunately, a THP only has PageHasHWPoisoned and
> > > doesn't record which sub-page is HWPoisoned. Otherwise, we could still use
> > > the zeropage for the sub-pages that are not HWPoisoned.
> >
> > IIUC, the raw error page will have HWPoisoned flag set while the THP has
> > PageHasHWPoisoned set. So I think we could use zeropage for healthy sub-
> > pages.
Oh, sorry, somehow I forgot this, so I thought there was no better place
to do the HWPoison check than in __folio_split. Yeah, since we know
the exact raw error page, checking in try_to_map_unused_to_zeropage
like David suggested is much better!
>
> Good point.
>
> David's suggested diff in another e-mail checks the raw error page instead of
> the entire folio. I tested that diff and it worked well.
>
> -Qiuxu
* RE: [PATCH 1/1] mm: prevent poison consumption when splitting THP
2025-09-29 13:27 ` Zhuo, Qiuxu
@ 2025-09-29 15:51 ` Luck, Tony
2025-09-29 16:30 ` Zhuo, Qiuxu
0 siblings, 1 reply; 28+ messages in thread
From: Luck, Tony @ 2025-09-29 15:51 UTC (permalink / raw)
To: Zhuo, Qiuxu, Jiaqi Yan
Cc: akpm, david, lorenzo.stoakes, linmiaohe, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
nao.horiguchi, Chen, Farrah, linux-mm, linux-kernel,
Andrew Zaborowski
> Miaohe mentioned in another e-mail that there was an HWPoisoned flag for the raw error 4K page.
> We could use that flag just to skip that raw error page and still use the zeropage for other
> healthy sub-pages. I'll try that.
That HWPoisoned flag is only set for raw pages where an error has been detected. Maybe Linux
could implement an "is_this_page_all_zero_mc_safe()"[1] that would catch undetected poison and
avoid a crash in that case too?
-Tony
[1] terrible name, pick something better.
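
The idea above, a zero check that tolerates poison which has not been flagged yet, can be sketched as a userspace simulation. Everything here is hypothetical: `mc_model_load()` only imitates a load that reports failure instead of raising a fatal machine check, and the three-way result is an assumption about how a caller might want to tell "not zero" from "poison hit". The kernel does have machine-check-safe copy helpers such as copy_mc_to_kernel() on some architectures; whether one of those would suit this scan is not something the thread settles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MC_MODEL_PAGE_SIZE 4096 /* assumed 4 KiB base page */

enum mc_model_result {
	MC_MODEL_ALL_ZERO, /* every byte was read and was zero */
	MC_MODEL_NOT_ZERO, /* a non-zero byte was found */
	MC_MODEL_POISON,   /* a load failed: treat the page as poisoned */
};

/* Hypothetical page with at most one simulated, not-yet-flagged poisoned byte. */
struct mc_model_page {
	unsigned char data[MC_MODEL_PAGE_SIZE];
	bool has_poison;
	size_t poison_offset;
};

/* Simulated fault-tolerant load: returns false instead of crashing. */
bool mc_model_load(const struct mc_model_page *page, size_t off,
		   unsigned char *out)
{
	if (page->has_poison && off == page->poison_offset)
		return false;
	*out = page->data[off];
	return true;
}

/* Scan a page for all-zero contents, stopping at the first failed load. */
enum mc_model_result mc_model_page_all_zero(const struct mc_model_page *page)
{
	unsigned char byte;
	size_t off;

	for (off = 0; off < MC_MODEL_PAGE_SIZE; off++) {
		if (!mc_model_load(page, off, &byte))
			return MC_MODEL_POISON;
		if (byte != 0)
			return MC_MODEL_NOT_ZERO;
	}
	return MC_MODEL_ALL_ZERO;
}
```

A caller would skip the zero-page mapping for both MC_MODEL_NOT_ZERO and MC_MODEL_POISON; the separate poison result only matters if the caller also wants to report the newly discovered error.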
* Re: [PATCH 1/1] mm: prevent poison consumption when splitting THP
2025-09-29 13:52 ` Zhuo, Qiuxu
@ 2025-09-29 16:12 ` David Hildenbrand
0 siblings, 0 replies; 28+ messages in thread
From: David Hildenbrand @ 2025-09-29 16:12 UTC (permalink / raw)
To: Zhuo, Qiuxu, akpm, lorenzo.stoakes, linmiaohe, Luck, Tony
Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, nao.horiguchi, Chen, Farrah, linux-mm, linux-kernel,
Andrew Zaborowski
On 29.09.25 15:52, Zhuo, Qiuxu wrote:
> Hi David,
>
>> From: David Hildenbrand <david@redhat.com>
>> [...]
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -2351,8 +2351,10 @@ int memory_failure(unsigned long pfn, int flags)
>>> * otherwise it may race with THP split.
>>> * And the flag can't be set in get_hwpoison_page() since
>>> * it is called by soft offline too and it is just called
>>> - * for !MF_COUNT_INCREASED. So here seems to be the best
>>> - * place.
>>> + * for !MF_COUNT_INCREASED.
>>> + * It also tells split_huge_page() to not bother using
>>> + * the shared zeropage -- the all-zeros check would
>>> + * consume the poison. So here seems to be the best place.
>>> *
>>> * Don't need care about the above error handling paths for
>>> * get_hwpoison_page() since they handle either free page
>>
>> Hm, I wonder if we should actually check in
>> try_to_map_unused_to_zeropage() whether the page has the hwpoison flag
>> set. Nothing wrong with scanning non-affected pages.
>>
>
> Good point about continuing to scan non-affected pages for possible zeropage mapping.
>
>> In thp_underused() we should just skip the folio entirely I guess, so keep it
>> simple.
>>
>> So what about something like this:
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 9c38a95e9f091..d4109fd7fa1f2 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -4121,6 +4121,9 @@ static bool thp_underused(struct folio *folio)
>> if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
>> return false;
>>
>> + folio_contain_hwpoisoned_page(folio)
>
> Typo here 😊?
Yes! :) As always, completely uncompiled.
>
> if (folio_contain_hwpoisoned_page(folio))
>
>> + return false;
>> +
>> for (i = 0; i < folio_nr_pages(folio); i++) {
>> kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
>> if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 9e5ef39ce73af..393fc2ffc96e5 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -305,8 +305,9 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
>> pte_t newpte;
>> void *addr;
>>
>> - if (PageCompound(page))
>> + if (PageCompound(page) || PageHWPoison(page))
>> return false;
>> +
>> VM_BUG_ON_PAGE(!PageAnon(page), page);
>> VM_BUG_ON_PAGE(!PageLocked(page), page);
>> VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
>>
>
> I tested this diff and it works well.
> If there are no objections, I'll use this diff for v2.
Sounds good, thanks!
--
Cheers
David / dhildenb
* RE: [PATCH 1/1] mm: prevent poison consumption when splitting THP
2025-09-29 15:51 ` Luck, Tony
@ 2025-09-29 16:30 ` Zhuo, Qiuxu
2025-09-29 17:25 ` David Hildenbrand
0 siblings, 1 reply; 28+ messages in thread
From: Zhuo, Qiuxu @ 2025-09-29 16:30 UTC (permalink / raw)
To: Luck, Tony, Jiaqi Yan
Cc: akpm, david, lorenzo.stoakes, linmiaohe, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
nao.horiguchi, Chen, Farrah, linux-mm, linux-kernel,
Andrew Zaborowski
Hi Tony,
> From: Luck, Tony <tony.luck@intel.com>
> [...]
> Subject: RE: [PATCH 1/1] mm: prevent poison consumption when splitting THP
>
> > Miaohe mentioned in another e-mail that there was an HWPoisoned flag
> for the raw error 4K page.
> > We could use that flag just to skip that raw error page and still use
> > the zeropage for other healthy sub-pages. I'll try that.
>
> That HWPoisoned flag is only set for raw pages where an error has been
> detected. Maybe Linux could implement an
> "is_this_page_all_zero_mc_safe()"[1] that would catch undetected poison
This sounds like a great suggestion to me.
Let's see what others think about this and the name (though the name already LGTM 😊).
> and avoid a crash in that case too?
>
> -Tony
>
> [1] terrible name, pick something better.
* Re: [PATCH 1/1] mm: prevent poison consumption when splitting THP
2025-09-29 16:30 ` Zhuo, Qiuxu
@ 2025-09-29 17:25 ` David Hildenbrand
2025-09-30 1:48 ` Lance Yang
0 siblings, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2025-09-29 17:25 UTC (permalink / raw)
To: Zhuo, Qiuxu, Luck, Tony, Jiaqi Yan
Cc: akpm, lorenzo.stoakes, linmiaohe, ziy, baolin.wang, Liam.Howlett,
npache, ryan.roberts, dev.jain, baohua, nao.horiguchi, Chen,
Farrah, linux-mm, linux-kernel, Andrew Zaborowski
On 29.09.25 18:30, Zhuo, Qiuxu wrote:
> Hi Tony,
>
>> From: Luck, Tony <tony.luck@intel.com>
>> [...]
>> Subject: RE: [PATCH 1/1] mm: prevent poison consumption when splitting THP
>>
>>> Miaohe mentioned in another e-mail that there was an HWPoisoned flag
>> for the raw error 4K page.
>>> We could use that flag just to skip that raw error page and still use
>>> the zeropage for other healthy sub-pages. I'll try that.
>>
>> That HWPoisoned flag is only set for raw pages where an error has been
>> detected. Maybe Linux could implement an
>> "is_this_page_all_zero_mc_safe()"[1] that would catch undetected poison
>
> This sounds like a great suggestion to me.
> Let's see what others think about this and the name (though the name already LGTM 😊).
The function name is just ... special. Not the good type of special IMHO. :)
Note that we'll be moving to pages_identical() in [1]. Maybe we would
want a pages_identical_mc() or something like that as a follow-up later.
So in any case, make that follow-up work on top of a simple fix.
[1] https://lore.kernel.org/all/20250922021458.68123-1-lance.yang@linux.dev/
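For illustration only, the machine-check-safe zero check discussed above
could be modelled in userspace roughly as below. This is a minimal sketch
under stated assumptions, not kernel code: struct model_page,
mc_safe_read() and page_is_zero_mc_safe() are hypothetical names, and
mc_safe_read() merely imitates a copy routine that reports the number of
bytes it could not read instead of faulting (the return convention
assumed here is the one copy_mc_to_kernel() is understood to use).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/*
 * Hypothetical model of a 4K page. 'poisoned' stands in for hardware
 * poison that has not been flagged yet and would raise #MC on a load.
 */
struct model_page {
	unsigned char data[MODEL_PAGE_SIZE];
	bool poisoned;
};

/*
 * Hypothetical stand-in for an MC-safe copy: returns the number of
 * bytes NOT copied (0 on success) instead of faulting.
 */
static size_t mc_safe_read(void *dst, const struct model_page *src, size_t len)
{
	assert(len <= MODEL_PAGE_SIZE);
	if (src->poisoned)
		return len;	/* nothing could be read safely */
	memcpy(dst, src->data, len);
	return 0;
}

/* Returns true only if the page could be read AND is all zeros. */
static bool page_is_zero_mc_safe(const struct model_page *page)
{
	unsigned char buf[MODEL_PAGE_SIZE];
	size_t i;

	if (mc_safe_read(buf, page, sizeof(buf)))
		return false;	/* unreadable pages count as "not zero" */

	for (i = 0; i < sizeof(buf); i++)
		if (buf[i])
			return false;
	return true;
}
```

The point of the design is that an unreadable page is reported as "not
zero", so the caller simply keeps the original mapping rather than
consuming the poison.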
--
Cheers
David / dhildenb
* Re: [PATCH 1/1] mm: prevent poison consumption when splitting THP
2025-09-29 17:25 ` David Hildenbrand
@ 2025-09-30 1:48 ` Lance Yang
2025-09-30 8:53 ` David Hildenbrand
0 siblings, 1 reply; 28+ messages in thread
From: Lance Yang @ 2025-09-30 1:48 UTC (permalink / raw)
To: Zhuo, Qiuxu, David Hildenbrand
Cc: Luck, Tony, Jiaqi Yan, akpm, lorenzo.stoakes, linmiaohe, ziy,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, nao.horiguchi, Chen, Farrah, linux-mm, linux-kernel,
Andrew Zaborowski
On Tue, Sep 30, 2025 at 3:07 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 29.09.25 18:30, Zhuo, Qiuxu wrote:
> > Hi Tony,
> >
> >> From: Luck, Tony <tony.luck@intel.com>
> >> [...]
> >> Subject: RE: [PATCH 1/1] mm: prevent poison consumption when splitting THP
> >>
> >>> Miaohe mentioned in another e-mail that there was an HWPoisoned flag
> >> for the raw error 4K page.
> >>> We could use that flag just to skip that raw error page and still use
> >>> the zeropage for other healthy sub-pages. I'll try that.
> >>
> >> That HWPoisoned flag is only set for raw pages where an error has been
> >> detected. Maybe Linux could implement an
> >> "is_this_page_all_zero_mc_safe()"[1] that would catch undetected poison
> >
> > This sounds like a great suggestion to me.
> > Let's see what others think about this and the name (though the name already LGTM 😊).
>
> The function name is just ... special. Not the good type of special IMHO. :)
>
> Note that we'll be moving to pages_identical() in [1]. Maybe we would
> want a pages_identical_mc() or sth. like that as a follow up later.
>
>
> So in any case, make that a follow-up work on top of a simple fix.
Yeah. IIRC, as David suggested earlier, we can just check if a page is
poisoned using PageHWPoison().
Perhaps we should move this check into pages_identical()? This would make
it a central place to determine if pages are safe to access and merge ;)
BTW, could you please keep me in the loop for the next version?
Thanks,
Lance
>
> [1] https://lore.kernel.org/all/20250922021458.68123-1-lance.yang@linux.dev/
>
> --
> Cheers
>
> David / dhildenb
>
>
* Re: [PATCH 1/1] mm: prevent poison consumption when splitting THP
2025-09-30 1:48 ` Lance Yang
@ 2025-09-30 8:53 ` David Hildenbrand
2025-09-30 10:13 ` Lance Yang
0 siblings, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2025-09-30 8:53 UTC (permalink / raw)
To: Lance Yang, Zhuo, Qiuxu
Cc: Luck, Tony, Jiaqi Yan, akpm, lorenzo.stoakes, linmiaohe, ziy,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, nao.horiguchi, Chen, Farrah, linux-mm, linux-kernel,
Andrew Zaborowski
On 30.09.25 03:48, Lance Yang wrote:
> On Tue, Sep 30, 2025 at 3:07 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 29.09.25 18:30, Zhuo, Qiuxu wrote:
>>> Hi Tony,
>>>
>>>> From: Luck, Tony <tony.luck@intel.com>
>>>> [...]
>>>> Subject: RE: [PATCH 1/1] mm: prevent poison consumption when splitting THP
>>>>
>>>>> Miaohe mentioned in another e-mail that there was an HWPoisoned flag
>>>> for the raw error 4K page.
>>>>> We could use that flag just to skip that raw error page and still use
>>>>> the zeropage for other healthy sub-pages. I'll try that.
>>>>
>>>> That HWPoisoned flag is only set for raw pages where an error has been
>>>> detected. Maybe Linux could implement an
>>>> "is_this_page_all_zero_mc_safe()"[1] that would catch undetected poison
>>>
>>> This sounds like a great suggestion to me.
>>> Let's see what others think about this and the name (though the name already LGTM 😊).
>>
>> The function name is just ... special. Not the good type of special IMHO. :)
>>
>> Note that we'll be moving to pages_identical() in [1]. Maybe we would
>> want a pages_identical_mc() or sth. like that as a follow up later.
>>
>>
>> So in any case, make that a follow-up work on top of a simple fix.
>
> Yeah. IIRC, as David suggested earlier, we can just check if a page is
> poisoned using PageHWPoison().
>
> Perhaps we should move this check into pages_identical()? This would make
> it a central place to determine if pages are safe to access and merge ;)
It would have to go into memcmp_pages(). Would be an option, but not sure
if we should rather let callers deal with that.
For example, in some cases it might be sufficient to just check if the
large folio has any poisoned page and give up early.
--
Cheers
David / dhildenb
* Re: [PATCH 1/1] mm: prevent poison consumption when splitting THP
2025-09-30 8:53 ` David Hildenbrand
@ 2025-09-30 10:13 ` Lance Yang
2025-09-30 10:20 ` Lance Yang
0 siblings, 1 reply; 28+ messages in thread
From: Lance Yang @ 2025-09-30 10:13 UTC (permalink / raw)
To: David Hildenbrand, Zhuo, Qiuxu
Cc: Luck, Tony, Jiaqi Yan, akpm, lorenzo.stoakes, linmiaohe, ziy,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, nao.horiguchi, Chen, Farrah, linux-mm, linux-kernel,
Andrew Zaborowski
On 2025/9/30 16:53, David Hildenbrand wrote:
> On 30.09.25 03:48, Lance Yang wrote:
>> On Tue, Sep 30, 2025 at 3:07 AM David Hildenbrand <david@redhat.com>
>> wrote:
>>>
>>> On 29.09.25 18:30, Zhuo, Qiuxu wrote:
>>>> Hi Tony,
>>>>
>>>>> From: Luck, Tony <tony.luck@intel.com>
>>>>> [...]
>>>>> Subject: RE: [PATCH 1/1] mm: prevent poison consumption when
>>>>> splitting THP
>>>>>
>>>>>> Miaohe mentioned in another e-mail that there was an HWPoisoned flag
>>>>> for the raw error 4K page.
>>>>>> We could use that flag just to skip that raw error page and still use
>>>>>> the zeropage for other healthy sub-pages. I'll try that.
>>>>>
>>>>> That HWPoisoned flag is only set for raw pages where an error has been
>>>>> detected. Maybe Linux could implement an
>>>>> "is_this_page_all_zero_mc_safe()"[1] that would catch undetected
>>>>> poison
>>>>
>>>> This sounds like a great suggestion to me.
>>>> Let's see what others think about this and the name (though the name
>>>> already LGTM 😊).
>>>
>>> The function name is just ... special. Not the good type of special
>>> IMHO. :)
>>>
>>> Note that we'll be moving to pages_identical() in [1]. Maybe we would
>>> want a pages_identical_mc() or sth. like that as a follow up later.
>>>
>>>
>>> So in any case, make that a follow-up work on top of a simple fix.
>>
>> Yeah. IIRC, as David suggested earlier, we can just check if a page is
>> poisoned using PageHWPoison().
>>
>> Perhaps we should move this check into pages_identical()? This would make
>> it a central place to determine if pages are safe to access and merge ;)
>
> I would have to go into memcmp_pages(). Would be an option, but not sure
> if we should rather let callers deal with that.
>
> For example, in some cases it might be sufficient to just check if the
> large folio has any poisoned page and give up early.
FWIW, one idea I had was to create a unified pre-flight checker, like
folio_pages_identical_prepare(struct folio *folio). A caller could use
it before a loop of pages_identical() calls to pre-check a folio :)
* Re: [PATCH 1/1] mm: prevent poison consumption when splitting THP
2025-09-30 10:13 ` Lance Yang
@ 2025-09-30 10:20 ` Lance Yang
0 siblings, 0 replies; 28+ messages in thread
From: Lance Yang @ 2025-09-30 10:20 UTC (permalink / raw)
To: David Hildenbrand, Zhuo, Qiuxu
Cc: Luck, Tony, Jiaqi Yan, akpm, lorenzo.stoakes, linmiaohe, ziy,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, nao.horiguchi, Chen, Farrah, linux-mm, linux-kernel,
Andrew Zaborowski
On 2025/9/30 18:13, Lance Yang wrote:
>
>
> On 2025/9/30 16:53, David Hildenbrand wrote:
>> On 30.09.25 03:48, Lance Yang wrote:
>>> On Tue, Sep 30, 2025 at 3:07 AM David Hildenbrand <david@redhat.com>
>>> wrote:
>>>>
>>>> On 29.09.25 18:30, Zhuo, Qiuxu wrote:
>>>>> Hi Tony,
>>>>>
>>>>>> From: Luck, Tony <tony.luck@intel.com>
>>>>>> [...]
>>>>>> Subject: RE: [PATCH 1/1] mm: prevent poison consumption when
>>>>>> splitting THP
>>>>>>
>>>>>>> Miaohe mentioned in another e-mail that there was an HWPoisoned flag
>>>>>> for the raw error 4K page.
>>>>>>> We could use that flag just to skip that raw error page and still
>>>>>>> use
>>>>>>> the zeropage for other healthy sub-pages. I'll try that.
>>>>>>
>>>>>> That HWPoisoned flag is only set for raw pages where an error has
>>>>>> been
>>>>>> detected. Maybe Linux could implement an
>>>>>> "is_this_page_all_zero_mc_safe()"[1] that would catch undetected
>>>>>> poison
>>>>>
>>>>> This sounds like a great suggestion to me.
>>>>> Let's see what others think about this and the name (though the
>>>>> name already LGTM 😊).
>>>>
>>>> The function name is just ... special. Not the good type of special
>>>> IMHO. :)
>>>>
>>>> Note that we'll be moving to pages_identical() in [1]. Maybe we would
>>>> want a pages_identical_mc() or sth. like that as a follow up later.
>>>>
>>>>
>>>> So in any case, make that a follow-up work on top of a simple fix.
>>>
>>> Yeah. IIRC, as David suggested earlier, we can just check if a page is
>>> poisoned using PageHWPoison().
>>>
>>> Perhaps we should move this check into pages_identical()? This would
>>> make
>>> it a central place to determine if pages are safe to access and merge ;)
>>
>> I would have to go into memcmp_pages(). Would be an option, but not
>> sure if we should rather let callers deal with that.
>>
>> For example, in some cases it might be sufficient to just check if the
>> large folio has any poisoned page and give up early.
>
> FWIW, one idea I had was to create a unified pre-flight checker, like
> folio_pages_identical_prepare(struct folio *folio). A caller could use
> it before a loop of pages_identical() calls to pre-check a folio :)
Forgot to add:
It would centralize all folio-level checks.
So if we ever need a new check in the future, we'd only modify the
prepare helper, not all the individual callers.
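As a rough illustration of the pre-flight idea, here is a minimal
userspace sketch. It is only a model under stated assumptions: struct
model_page, struct model_folio and all model_*() functions are
hypothetical stand-ins, and the only folio-level check shown is the
"has a poisoned page" test mentioned earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096

/* Hypothetical userspace stand-ins for struct page / struct folio. */
struct model_page {
	unsigned char data[MODEL_PAGE_SIZE];
};

struct model_folio {
	struct model_page *pages;
	size_t nr_pages;
	bool has_hwpoisoned;	/* models the folio-level poison flag */
};

/* Models pages_identical(): a plain byte-wise comparison of two pages. */
static bool model_pages_identical(const struct model_page *a,
				  const struct model_page *b)
{
	return memcmp(a->data, b->data, MODEL_PAGE_SIZE) == 0;
}

/*
 * Hypothetical pre-flight helper: all folio-level checks live here, so
 * a new check would be added in one place rather than in every caller.
 */
static bool model_folio_pages_identical_prepare(const struct model_folio *folio)
{
	assert(folio && folio->pages);
	return !folio->has_hwpoisoned;
}

/*
 * Example caller: counts sub-pages identical to 'ref'. If the pre-flight
 * check fails, it gives up early and reads no page data at all.
 */
static size_t model_count_identical(const struct model_folio *folio,
				    const struct model_page *ref)
{
	size_t i, n = 0;

	if (!model_folio_pages_identical_prepare(folio))
		return 0;

	for (i = 0; i < folio->nr_pages; i++)
		if (model_pages_identical(&folio->pages[i], ref))
			n++;
	return n;
}
```

A caller that can tolerate giving up on the whole folio gets the early
exit for free; a caller that wants to keep scanning healthy sub-pages
would still need a per-page check instead.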
* [PATCH v2 1/1] mm: prevent poison consumption when splitting THP
2025-09-28 3:28 [PATCH 1/1] mm: prevent poison consumption when splitting THP Qiuxu Zhuo
2025-09-28 21:55 ` Jiaqi Yan
2025-09-29 7:34 ` David Hildenbrand
@ 2025-10-11 7:55 ` Qiuxu Zhuo
2025-10-11 9:09 ` Lance Yang
` (4 more replies)
2025-10-14 14:19 ` [PATCH v3 " Qiuxu Zhuo
2025-10-15 6:49 ` [PATCH v4 " Qiuxu Zhuo
4 siblings, 5 replies; 28+ messages in thread
From: Qiuxu Zhuo @ 2025-10-11 7:55 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes, linmiaohe, tony.luck
Cc: qiuxu.zhuo, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, nao.horiguchi, farrah.chen, jiaqiyan,
lance.yang, linux-mm, linux-kernel
When performing memory error injection on a THP (Transparent Huge Page)
mapped to userspace on an x86 server, the kernel panics with the following
trace. The expected behavior is to terminate the affected process instead
of panicking the kernel, as the x86 Machine Check code can recover from an
in-userspace #MC.
mce: [Hardware Error]: CPU 0: Machine Check Exception: f Bank 3: bd80000000070134
mce: [Hardware Error]: RIP 10:<ffffffff8372f8bc> {memchr_inv+0x4c/0xf0}
mce: [Hardware Error]: TSC afff7bbff88a ADDR 1d301b000 MISC 80 PPIN 1e741e77539027db
mce: [Hardware Error]: PROCESSOR 0:d06d0 TIME 1758093249 SOCKET 0 APIC 0 microcode 80000320
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
Kernel panic - not syncing: Fatal local machine check
The root cause of this panic is that handling a memory failure triggered by
an in-userspace #MC necessitates splitting the THP. The splitting process
employs a mechanism, implemented in try_to_map_unused_to_zeropage(), which
reads the sub-pages of the THP to identify zero-filled pages. However,
reading the sub-pages results in a second in-kernel #MC, occurring before
the initial memory_failure() completes, ultimately leading to a kernel
panic. The call trace below shows the two #MCs.
First Machine Check occurs // [1]
memory_failure() // [2]
try_to_split_thp_page()
split_huge_page()
split_huge_page_to_list_to_order()
__folio_split() // [3]
remap_page()
remove_migration_ptes()
remove_migration_pte()
try_to_map_unused_to_zeropage() // [4]
memchr_inv() // [5]
Second Machine Check occurs // [6]
Kernel panic
[1] Triggered by accessing a hardware-poisoned THP in userspace, which is
typically recoverable by terminating the affected process.
[2] Call folio_set_has_hwpoisoned() before try_to_split_thp_page().
[3] Pass the RMP_USE_SHARED_ZEROPAGE remap flag to remap_page().
[4] Try to map the unused sub-pages of the THP to the shared zeropage.
[5] Re-access sub-pages of the hw-poisoned THP in the kernel.
[6] Triggered in-kernel, leading to a kernel panic.
In Step[2], memory_failure() sets the HWPoison flag on the affected
sub-page of the THP via TestSetPageHWPoison() before calling
try_to_split_thp_page().
As suggested by David Hildenbrand, fix this panic by not accessing the
poisoned sub-page of the THP during zeropage identification, while
continuing to scan unaffected sub-pages of the THP for possible zeropage
mapping. This prevents a second in-kernel #MC that would cause a kernel
panic in Step[4].
[ Credits to Andrew Zaborowski <andrew.zaborowski@intel.com> for his
original fix that prevents passing the RMP_USE_SHARED_ZEROPAGE flag
to remap_page() in Step[3] if the THP has the has_hwpoisoned flag set,
avoiding access to the entire THP for zero-page identification. ]
Reported-by: Farrah Chen <farrah.chen@intel.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
v1 -> v2:
- Apply David Hildenbrand's fix suggestion.
- Update the commit message to reflect the new fix.
- Add David Hildenbrand's "Suggested-by:" tag.
- Remove Andrew Zaborowski's SoB but add credits to him in the commit message.
[ I cannot reach him to get his SoB for the completely rewritten commit
message and new fix approach. ]
mm/huge_memory.c | 3 +++
mm/migrate.c | 3 ++-
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9c38a95e9f09..2bf5178cca96 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4121,6 +4121,9 @@ static bool thp_underused(struct folio *folio)
if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
return false;
+ if (folio_contain_hwpoisoned_page(folio))
+ return false;
+
for (i = 0; i < folio_nr_pages(folio); i++) {
kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
diff --git a/mm/migrate.c b/mm/migrate.c
index 9e5ef39ce73a..393fc2ffc96e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -305,8 +305,9 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
pte_t newpte;
void *addr;
- if (PageCompound(page))
+ if (PageCompound(page) || PageHWPoison(page))
return false;
+
VM_BUG_ON_PAGE(!PageAnon(page), page);
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
base-commit: e5f0a698b34ed76002dc5cff3804a61c80233a7a
--
2.43.0
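For illustration, the per-sub-page behaviour of the mm/migrate.c hunk
above can be modelled in userspace roughly as follows. This is a sketch
under stated assumptions, not the kernel implementation: struct
model_page, the 'reads' counter and the model_*() functions are
hypothetical, and the 'hwpoison' field merely stands in for
PageHWPoison().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096

enum model_mapping { MODEL_MAP_ORIGINAL, MODEL_MAP_ZEROPAGE };

/* Hypothetical userspace stand-in for one sub-page after the split. */
struct model_page {
	unsigned char data[MODEL_PAGE_SIZE];
	bool hwpoison;			/* models PageHWPoison() */
	unsigned int reads;		/* how often 'data' was scanned */
	enum model_mapping mapping;
};

/*
 * Models the patched check: a poisoned sub-page is rejected before its
 * contents are touched, so the all-zeros scan cannot consume the poison.
 */
static bool model_try_map_to_zeropage(struct model_page *page)
{
	size_t i;

	assert(page);
	if (page->hwpoison)
		return false;

	page->reads++;
	for (i = 0; i < MODEL_PAGE_SIZE; i++)
		if (page->data[i])
			return false;

	page->mapping = MODEL_MAP_ZEROPAGE;
	return true;
}

/* Walks all sub-pages; healthy zero-filled ones still get the zeropage. */
static size_t model_remap_subpages(struct model_page *pages, size_t nr)
{
	size_t i, mapped = 0;

	for (i = 0; i < nr; i++)
		if (model_try_map_to_zeropage(&pages[i]))
			mapped++;
	return mapped;
}
```

In this model only the poisoned sub-page loses the zeropage
optimization; the remaining sub-pages are scanned as before.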
* Re: [PATCH v2 1/1] mm: prevent poison consumption when splitting THP
2025-10-11 7:55 ` [PATCH v2 " Qiuxu Zhuo
@ 2025-10-11 9:09 ` Lance Yang
2025-10-11 18:18 ` Andrew Morton
` (3 subsequent siblings)
4 siblings, 0 replies; 28+ messages in thread
From: Lance Yang @ 2025-10-11 9:09 UTC (permalink / raw)
To: Qiuxu Zhuo
Cc: ziy, baolin.wang, akpm, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, lorenzo.stoakes, nao.horiguchi, farrah.chen,
jiaqiyan, linux-mm, linux-kernel, tony.luck, linmiaohe, david
On 2025/10/11 15:55, Qiuxu Zhuo wrote:
> When performing memory error injection on a THP (Transparent Huge Page)
> mapped to userspace on an x86 server, the kernel panics with the following
> trace. The expected behavior is to terminate the affected process instead
> of panicking the kernel, as the x86 Machine Check code can recover from an
> in-userspace #MC.
>
> mce: [Hardware Error]: CPU 0: Machine Check Exception: f Bank 3: bd80000000070134
> mce: [Hardware Error]: RIP 10:<ffffffff8372f8bc> {memchr_inv+0x4c/0xf0}
> mce: [Hardware Error]: TSC afff7bbff88a ADDR 1d301b000 MISC 80 PPIN 1e741e77539027db
> mce: [Hardware Error]: PROCESSOR 0:d06d0 TIME 1758093249 SOCKET 0 APIC 0 microcode 80000320
> mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
> Kernel panic - not syncing: Fatal local machine check
>
> The root cause of this panic is that handling a memory failure triggered by
> an in-userspace #MC necessitates splitting the THP. The splitting process
> employs a mechanism, implemented in try_to_map_unused_to_zeropage(), which
> reads the sub-pages of the THP to identify zero-filled pages. However,
> reading the sub-pages results in a second in-kernel #MC, occurring before
> the initial memory_failure() completes, ultimately leading to a kernel
> panic. See the kernel panic call trace on the two #MCs.
>
> First Machine Check occurs // [1]
> memory_failure() // [2]
> try_to_split_thp_page()
> split_huge_page()
> split_huge_page_to_list_to_order()
> __folio_split() // [3]
> remap_page()
> remove_migration_ptes()
> remove_migration_pte()
> try_to_map_unused_to_zeropage() // [4]
> memchr_inv() // [5]
> Second Machine Check occurs // [6]
> Kernel panic
>
> [1] Triggered by accessing a hardware-poisoned THP in userspace, which is
> typically recoverable by terminating the affected process.
>
> [2] Call folio_set_has_hwpoisoned() before try_to_split_thp_page().
>
> [3] Pass the RMP_USE_SHARED_ZEROPAGE remap flag to remap_page().
>
> [4] Try to map the unused THP to zeropage.
>
> [5] Re-access sub-pages of the hw-poisoned THP in the kernel.
>
> [6] Triggered in-kernel, leading to a panic kernel.
>
> In Step[2], memory_failure() sets the poisoned flag on the sub-page of the
> THP by TestSetPageHWPoison() before calling try_to_split_thp_page().
>
> As suggested by David Hildenbrand, fix this panic by not accessing to the
> poisoned sub-page of the THP during zeropage identification, while
> continuing to scan unaffected sub-pages of the THP for possible zeropage
> mapping. This prevents a second in-kernel #MC that would cause kernel
> panic in Step[4].
>
> [ Credits to Andrew Zaborowski <andrew.zaborowski@intel.com> for his
> original fix that prevents passing the RMP_USE_SHARED_ZEROPAGE flag
> to remap_page() in Step[3] if the THP has the has_hwpoisoned flag set,
> avoiding access to the entire THP for zero-page identification. ]
>
Thanks for the fix!
But one thing is missing: a "Fixes:" tag here. And also add:
Cc: <stable@vger.kernel.org>
> Reported-by: Farrah Chen <farrah.chen@intel.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Tested-by: Farrah Chen <farrah.chen@intel.com>
> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> ---
Well, I think this fix should work ;)
Acked-by: Lance Yang <lance.yang@linux.dev>
* Re: [PATCH v2 1/1] mm: prevent poison consumption when splitting THP
2025-10-11 7:55 ` [PATCH v2 " Qiuxu Zhuo
2025-10-11 9:09 ` Lance Yang
@ 2025-10-11 18:18 ` Andrew Morton
2025-10-12 1:23 ` Wei Yang
` (2 subsequent siblings)
4 siblings, 0 replies; 28+ messages in thread
From: Andrew Morton @ 2025-10-11 18:18 UTC (permalink / raw)
To: Qiuxu Zhuo
Cc: david, lorenzo.stoakes, linmiaohe, tony.luck, ziy, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
nao.horiguchi, farrah.chen, jiaqiyan, lance.yang, linux-mm,
linux-kernel
On Sat, 11 Oct 2025 15:55:19 +0800 Qiuxu Zhuo <qiuxu.zhuo@intel.com> wrote:
> When performing memory error injection on a THP (Transparent Huge Page)
> mapped to userspace on an x86 server, the kernel panics with the following
> trace. The expected behavior is to terminate the affected process instead
> of panicking the kernel, as the x86 Machine Check code can recover from an
> in-userspace #MC.
>
> mce: [Hardware Error]: CPU 0: Machine Check Exception: f Bank 3: bd80000000070134
> mce: [Hardware Error]: RIP 10:<ffffffff8372f8bc> {memchr_inv+0x4c/0xf0}
> mce: [Hardware Error]: TSC afff7bbff88a ADDR 1d301b000 MISC 80 PPIN 1e741e77539027db
> mce: [Hardware Error]: PROCESSOR 0:d06d0 TIME 1758093249 SOCKET 0 APIC 0 microcode 80000320
> mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
> Kernel panic - not syncing: Fatal local machine check
>
> The root cause of this panic is that handling a memory failure triggered by
> an in-userspace #MC necessitates splitting the THP. The splitting process
> employs a mechanism, implemented in try_to_map_unused_to_zeropage(), which
> reads the sub-pages of the THP to identify zero-filled pages. However,
> reading the sub-pages results in a second in-kernel #MC,
Well that sounds dumb. To me this suggests a lack of selftesting code.
Perhaps someone could prepare a test for this case.
> occurring before
> the initial memory_failure() completes, ultimately leading to a kernel
> panic. See the kernel panic call trace on the two #MCs.
>
> ...
>
> Reported-by: Farrah Chen <farrah.chen@intel.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Tested-by: Farrah Chen <farrah.chen@intel.com>
> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Yes please, a Fixes: would be good.
> + if (folio_contain_hwpoisoned_page(folio))
Offtopic, that should have been "folio_contains_hwpoisoned_page".
* Re: [PATCH v2 1/1] mm: prevent poison consumption when splitting THP
2025-10-11 7:55 ` [PATCH v2 " Qiuxu Zhuo
2025-10-11 9:09 ` Lance Yang
2025-10-11 18:18 ` Andrew Morton
@ 2025-10-12 1:23 ` Wei Yang
2025-10-13 17:15 ` Zi Yan
2025-10-14 2:42 ` Miaohe Lin
4 siblings, 0 replies; 28+ messages in thread
From: Wei Yang @ 2025-10-12 1:23 UTC (permalink / raw)
To: Qiuxu Zhuo
Cc: akpm, david, lorenzo.stoakes, linmiaohe, tony.luck, ziy,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, nao.horiguchi, farrah.chen, jiaqiyan, lance.yang,
linux-mm, linux-kernel
On Sat, Oct 11, 2025 at 03:55:19PM +0800, Qiuxu Zhuo wrote:
[...]
>
>diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>index 9c38a95e9f09..2bf5178cca96 100644
>--- a/mm/huge_memory.c
>+++ b/mm/huge_memory.c
>@@ -4121,6 +4121,9 @@ static bool thp_underused(struct folio *folio)
> if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
> return false;
>
>+ if (folio_contain_hwpoisoned_page(folio))
>+ return false;
>+
> for (i = 0; i < folio_nr_pages(folio); i++) {
> kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
> if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
>diff --git a/mm/migrate.c b/mm/migrate.c
>index 9e5ef39ce73a..393fc2ffc96e 100644
>--- a/mm/migrate.c
>+++ b/mm/migrate.c
>@@ -305,8 +305,9 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
> pte_t newpte;
> void *addr;
>
>- if (PageCompound(page))
>+ if (PageCompound(page) || PageHWPoison(page))
> return false;
>+
> VM_BUG_ON_PAGE(!PageAnon(page), page);
> VM_BUG_ON_PAGE(!PageLocked(page), page);
> VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
>
The code change LGTM.
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
>base-commit: e5f0a698b34ed76002dc5cff3804a61c80233a7a
>--
>2.43.0
>
--
Wei Yang
Help you, Help me
* Re: [PATCH 1/1] mm: prevent poison consumption when splitting THP
2025-09-29 7:34 ` David Hildenbrand
2025-09-29 13:52 ` Zhuo, Qiuxu
@ 2025-10-12 1:37 ` Wei Yang
2025-10-12 4:23 ` Jiaqi Yan
1 sibling, 1 reply; 28+ messages in thread
From: Wei Yang @ 2025-10-12 1:37 UTC (permalink / raw)
To: David Hildenbrand
Cc: Qiuxu Zhuo, akpm, lorenzo.stoakes, linmiaohe, tony.luck, ziy,
baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, nao.horiguchi, farrah.chen, linux-mm, linux-kernel,
Andrew Zaborowski
On Mon, Sep 29, 2025 at 09:34:12AM +0200, David Hildenbrand wrote:
>On 28.09.25 05:28, Qiuxu Zhuo wrote:
[...]
>
>Hm, I wonder if we should actually check in try_to_map_unused_to_zeropage()
>whether the page has the hwpoison flag set. Nothing wrong with scanning
>non-affected pages.
>
>In thp_underused() we should just skip the folio entirely I guess, so keep
>it simple.
>
>So what about something like this:
>
>diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>index 9c38a95e9f091..d4109fd7fa1f2 100644
>--- a/mm/huge_memory.c
>+++ b/mm/huge_memory.c
>@@ -4121,6 +4121,9 @@ static bool thp_underused(struct folio *folio)
> if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
> return false;
>+ folio_contain_hwpoisoned_page(folio)
>+ return false;
>+
One question.
When hardware detects an error, does it immediately trigger
memory_failure(), or does it wait until the memory is accessed?
> for (i = 0; i < folio_nr_pages(folio); i++) {
> kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
> if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
>diff --git a/mm/migrate.c b/mm/migrate.c
>index 9e5ef39ce73af..393fc2ffc96e5 100644
>--- a/mm/migrate.c
>+++ b/mm/migrate.c
>@@ -305,8 +305,9 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
> pte_t newpte;
> void *addr;
>- if (PageCompound(page))
>+ if (PageCompound(page) || PageHWPoison(page))
> return false;
>+
> VM_BUG_ON_PAGE(!PageAnon(page), page);
> VM_BUG_ON_PAGE(!PageLocked(page), page);
> VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
>
>
>--
>Cheers
>
>David / dhildenb
>
--
Wei Yang
Help you, Help me
* Re: [PATCH 1/1] mm: prevent poison consumption when splitting THP
2025-10-12 1:37 ` Wei Yang
@ 2025-10-12 4:23 ` Jiaqi Yan
0 siblings, 0 replies; 28+ messages in thread
From: Jiaqi Yan @ 2025-10-12 4:23 UTC (permalink / raw)
To: Wei Yang
Cc: David Hildenbrand, Qiuxu Zhuo, akpm, lorenzo.stoakes, linmiaohe,
tony.luck, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, nao.horiguchi, farrah.chen, linux-mm,
linux-kernel, Andrew Zaborowski
On Sat, Oct 11, 2025 at 6:37 PM Wei Yang <richard.weiyang@gmail.com> wrote:
>
> On Mon, Sep 29, 2025 at 09:34:12AM +0200, David Hildenbrand wrote:
> >On 28.09.25 05:28, Qiuxu Zhuo wrote:
> [...]
> >
> >Hm, I wonder if we should actually check in try_to_map_unused_to_zeropage()
> >whether the page has the hwpoison flag set. Nothing wrong with scanning
> >non-affected pages.
> >
> >In thp_underused() we should just skip the folio entirely I guess, so keep
> >it simple.
> >
> >So what about something like this:
> >
> >diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >index 9c38a95e9f091..d4109fd7fa1f2 100644
> >--- a/mm/huge_memory.c
> >+++ b/mm/huge_memory.c
> >@@ -4121,6 +4121,9 @@ static bool thp_underused(struct folio *folio)
> > if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
> > return false;
> >+ folio_contain_hwpoisoned_page(folio)
> >+ return false;
> >+
>
> One question.
>
> When hardware detect error, it would immediately trigger memory_failure()? Or
> it will wait until the memory is accessed?
Hardware detecting a memory error usually results in a poison
generation event. The kernel expects to receive such poison generation
events from modern platforms and then kicks off memory_failure() without
any poison context, e.g. the current process cannot be assumed to be the
"culprit".
>
> > for (i = 0; i < folio_nr_pages(folio); i++) {
> > kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
> > if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
> >diff --git a/mm/migrate.c b/mm/migrate.c
> >index 9e5ef39ce73af..393fc2ffc96e5 100644
> >--- a/mm/migrate.c
> >+++ b/mm/migrate.c
> >@@ -305,8 +305,9 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
> > pte_t newpte;
> > void *addr;
> >- if (PageCompound(page))
> >+ if (PageCompound(page) || PageHWPoison(page))
> > return false;
> >+
> > VM_BUG_ON_PAGE(!PageAnon(page), page);
> > VM_BUG_ON_PAGE(!PageLocked(page), page);
> > VM_BUG_ON_PAGE(pte_present(ptep_get(pvmw->pte)), page);
> >
> >
> >--
> >Cheers
> >
> >David / dhildenb
> >
>
> --
> Wei Yang
> Help you, Help me
>
* Re: [PATCH v2 1/1] mm: prevent poison consumption when splitting THP
2025-10-11 7:55 ` [PATCH v2 " Qiuxu Zhuo
` (2 preceding siblings ...)
2025-10-12 1:23 ` Wei Yang
@ 2025-10-13 17:15 ` Zi Yan
2025-10-14 2:42 ` Miaohe Lin
4 siblings, 0 replies; 28+ messages in thread
From: Zi Yan @ 2025-10-13 17:15 UTC (permalink / raw)
To: Qiuxu Zhuo
Cc: akpm, david, lorenzo.stoakes, linmiaohe, tony.luck, baolin.wang,
Liam.Howlett, npache, ryan.roberts, dev.jain, baohua,
nao.horiguchi, farrah.chen, jiaqiyan, lance.yang, linux-mm,
linux-kernel
On 11 Oct 2025, at 3:55, Qiuxu Zhuo wrote:
> When performing memory error injection on a THP (Transparent Huge Page)
> mapped to userspace on an x86 server, the kernel panics with the following
> trace. The expected behavior is to terminate the affected process instead
> of panicking the kernel, as the x86 Machine Check code can recover from an
> in-userspace #MC.
>
> mce: [Hardware Error]: CPU 0: Machine Check Exception: f Bank 3: bd80000000070134
> mce: [Hardware Error]: RIP 10:<ffffffff8372f8bc> {memchr_inv+0x4c/0xf0}
> mce: [Hardware Error]: TSC afff7bbff88a ADDR 1d301b000 MISC 80 PPIN 1e741e77539027db
> mce: [Hardware Error]: PROCESSOR 0:d06d0 TIME 1758093249 SOCKET 0 APIC 0 microcode 80000320
> mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
> Kernel panic - not syncing: Fatal local machine check
>
> The root cause of this panic is that handling a memory failure triggered by
> an in-userspace #MC necessitates splitting the THP. The splitting process
> employs a mechanism, implemented in try_to_map_unused_to_zeropage(), which
> reads the sub-pages of the THP to identify zero-filled pages. However,
> reading the sub-pages results in a second in-kernel #MC, occurring before
> the initial memory_failure() completes, ultimately leading to a kernel
> panic. See the kernel panic call trace on the two #MCs.
>
> First Machine Check occurs // [1]
> memory_failure() // [2]
> try_to_split_thp_page()
> split_huge_page()
> split_huge_page_to_list_to_order()
> __folio_split() // [3]
> remap_page()
> remove_migration_ptes()
> remove_migration_pte()
> try_to_map_unused_to_zeropage() // [4]
> memchr_inv() // [5]
> Second Machine Check occurs // [6]
> Kernel panic
>
> [1] Triggered by accessing a hardware-poisoned THP in userspace, which is
> typically recoverable by terminating the affected process.
>
> [2] Call folio_set_has_hwpoisoned() before try_to_split_thp_page().
>
> [3] Pass the RMP_USE_SHARED_ZEROPAGE remap flag to remap_page().
>
> [4] Try to map the unused THP to zeropage.
>
> [5] Re-access sub-pages of the hw-poisoned THP in the kernel.
>
> [6] Triggered in-kernel, leading to a kernel panic.
>
> In Step[2], memory_failure() sets the poisoned flag on the sub-page of the
> THP by TestSetPageHWPoison() before calling try_to_split_thp_page().
>
> As suggested by David Hildenbrand, fix this panic by not accessing the
> poisoned sub-page of the THP during zeropage identification, while
> continuing to scan unaffected sub-pages of the THP for possible zeropage
> mapping. This prevents a second in-kernel #MC that would cause kernel
> panic in Step[4].
>
> [ Credits to Andrew Zaborowski <andrew.zaborowski@intel.com> for his
> original fix that prevents passing the RMP_USE_SHARED_ZEROPAGE flag
> to remap_page() in Step[3] if the THP has the has_hwpoisoned flag set,
> avoiding access to the entire THP for zero-page identification. ]
>
> Reported-by: Farrah Chen <farrah.chen@intel.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Tested-by: Farrah Chen <farrah.chen@intel.com>
> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> ---
> v1 -> v2:
> - Apply David Hildenbrand's fix suggestion.
>
> - Update the commit message to reflect the new fix.
>
> - Add David Hildenbrand's "Suggested-by:" tag.
>
> - Remove Andrew Zaborowski's SoB but add credits to him in the commit message.
> [ I cannot reach him to get his SoB for the completely rewritten commit
> message and new fix approach. ]
>
> mm/huge_memory.c | 3 +++
> mm/migrate.c | 3 ++-
> 2 files changed, 5 insertions(+), 1 deletion(-)
>
LGTM. Acked-by: Zi Yan <ziy@nvidia.com>
--
Best Regards,
Yan, Zi
* Re: [PATCH v2 1/1] mm: prevent poison consumption when splitting THP
2025-10-11 7:55 ` [PATCH v2 " Qiuxu Zhuo
` (3 preceding siblings ...)
2025-10-13 17:15 ` Zi Yan
@ 2025-10-14 2:42 ` Miaohe Lin
4 siblings, 0 replies; 28+ messages in thread
From: Miaohe Lin @ 2025-10-14 2:42 UTC (permalink / raw)
To: Qiuxu Zhuo
Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, nao.horiguchi, farrah.chen, jiaqiyan, lance.yang,
linux-mm, linux-kernel, akpm, david, lorenzo.stoakes, tony.luck
On 2025/10/11 15:55, Qiuxu Zhuo wrote:
> When performing memory error injection on a THP (Transparent Huge Page)
> mapped to userspace on an x86 server, the kernel panics with the following
> trace. The expected behavior is to terminate the affected process instead
> of panicking the kernel, as the x86 Machine Check code can recover from an
> in-userspace #MC.
>
> mce: [Hardware Error]: CPU 0: Machine Check Exception: f Bank 3: bd80000000070134
> mce: [Hardware Error]: RIP 10:<ffffffff8372f8bc> {memchr_inv+0x4c/0xf0}
> mce: [Hardware Error]: TSC afff7bbff88a ADDR 1d301b000 MISC 80 PPIN 1e741e77539027db
> mce: [Hardware Error]: PROCESSOR 0:d06d0 TIME 1758093249 SOCKET 0 APIC 0 microcode 80000320
> mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
> Kernel panic - not syncing: Fatal local machine check
>
> The root cause of this panic is that handling a memory failure triggered by
> an in-userspace #MC necessitates splitting the THP. The splitting process
> employs a mechanism, implemented in try_to_map_unused_to_zeropage(), which
> reads the sub-pages of the THP to identify zero-filled pages. However,
> reading the sub-pages results in a second in-kernel #MC, occurring before
> the initial memory_failure() completes, ultimately leading to a kernel
> panic. See the kernel panic call trace on the two #MCs.
>
> First Machine Check occurs // [1]
> memory_failure() // [2]
> try_to_split_thp_page()
> split_huge_page()
> split_huge_page_to_list_to_order()
> __folio_split() // [3]
> remap_page()
> remove_migration_ptes()
> remove_migration_pte()
> try_to_map_unused_to_zeropage() // [4]
> memchr_inv() // [5]
> Second Machine Check occurs // [6]
> Kernel panic
>
> [1] Triggered by accessing a hardware-poisoned THP in userspace, which is
> typically recoverable by terminating the affected process.
>
> [2] Call folio_set_has_hwpoisoned() before try_to_split_thp_page().
>
> [3] Pass the RMP_USE_SHARED_ZEROPAGE remap flag to remap_page().
>
> [4] Try to map the unused THP to zeropage.
>
> [5] Re-access sub-pages of the hw-poisoned THP in the kernel.
>
> [6] Triggered in-kernel, leading to a kernel panic.
>
> In Step[2], memory_failure() sets the poisoned flag on the sub-page of the
> THP by TestSetPageHWPoison() before calling try_to_split_thp_page().
>
> As suggested by David Hildenbrand, fix this panic by not accessing the
> poisoned sub-page of the THP during zeropage identification, while
> continuing to scan unaffected sub-pages of the THP for possible zeropage
> mapping. This prevents a second in-kernel #MC that would cause kernel
> panic in Step[4].
>
> [ Credits to Andrew Zaborowski <andrew.zaborowski@intel.com> for his
> original fix that prevents passing the RMP_USE_SHARED_ZEROPAGE flag
> to remap_page() in Step[3] if the THP has the has_hwpoisoned flag set,
> avoiding access to the entire THP for zero-page identification. ]
>
> Reported-by: Farrah Chen <farrah.chen@intel.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Tested-by: Farrah Chen <farrah.chen@intel.com>
> Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
> ---
> v1 -> v2:
> - Apply David Hildenbrand's fix suggestion.
>
> - Update the commit message to reflect the new fix.
>
> - Add David Hildenbrand's "Suggested-by:" tag.
>
> - Remove Andrew Zaborowski's SoB but add credits to him in the commit message.
> [ I cannot reach him to get his SoB for the completely rewritten commit
> message and new fix approach. ]
>
> mm/huge_memory.c | 3 +++
> mm/migrate.c | 3 ++-
> 2 files changed, 5 insertions(+), 1 deletion(-)
>
LGTM. Thanks for your fix.
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Thanks.
.
* [PATCH v3 1/1] mm: prevent poison consumption when splitting THP
2025-09-28 3:28 [PATCH 1/1] mm: prevent poison consumption when splitting THP Qiuxu Zhuo
` (2 preceding siblings ...)
2025-10-11 7:55 ` [PATCH v2 " Qiuxu Zhuo
@ 2025-10-14 14:19 ` Qiuxu Zhuo
2025-10-14 14:29 ` David Hildenbrand
2025-10-15 6:49 ` [PATCH v4 " Qiuxu Zhuo
4 siblings, 1 reply; 28+ messages in thread
From: Qiuxu Zhuo @ 2025-10-14 14:19 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes, linmiaohe, tony.luck
Cc: qiuxu.zhuo, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, nao.horiguchi, farrah.chen, jiaqiyan,
lance.yang, richard.weiyang, linux-mm, linux-kernel
When performing memory error injection on a THP (Transparent Huge Page)
mapped to userspace on an x86 server, the kernel panics with the following
trace. The expected behavior is to terminate the affected process instead
of panicking the kernel, as the x86 Machine Check code can recover from an
in-userspace #MC.
mce: [Hardware Error]: CPU 0: Machine Check Exception: f Bank 3: bd80000000070134
mce: [Hardware Error]: RIP 10:<ffffffff8372f8bc> {memchr_inv+0x4c/0xf0}
mce: [Hardware Error]: TSC afff7bbff88a ADDR 1d301b000 MISC 80 PPIN 1e741e77539027db
mce: [Hardware Error]: PROCESSOR 0:d06d0 TIME 1758093249 SOCKET 0 APIC 0 microcode 80000320
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
Kernel panic - not syncing: Fatal local machine check
The root cause of this panic is that handling a memory failure triggered by
an in-userspace #MC necessitates splitting the THP. The splitting process
employs a mechanism, implemented in try_to_map_unused_to_zeropage(), which
reads the sub-pages of the THP to identify zero-filled pages. However,
reading the sub-pages results in a second in-kernel #MC, occurring before
the initial memory_failure() completes, ultimately leading to a kernel
panic. See the kernel panic call trace on the two #MCs.
  First Machine Check occurs // [1]
    memory_failure() // [2]
      try_to_split_thp_page()
        split_huge_page()
          split_huge_page_to_list_to_order()
            __folio_split() // [3]
              remap_page()
                remove_migration_ptes()
                  remove_migration_pte()
                    try_to_map_unused_to_zeropage() // [4]
                      memchr_inv() // [5]
                        Second Machine Check occurs // [6]
                          Kernel panic
[1] Triggered by accessing a hardware-poisoned THP in userspace, which is
typically recoverable by terminating the affected process.
[2] Call folio_set_has_hwpoisoned() before try_to_split_thp_page().
[3] Pass the RMP_USE_SHARED_ZEROPAGE remap flag to remap_page().
[4] Try to map the unused THP to zeropage.
[5] Re-access sub-pages of the hw-poisoned THP in the kernel.
[6] Triggered in-kernel, leading to a kernel panic.
In Step[2], memory_failure() sets the poisoned flag on the sub-page of the
THP by TestSetPageHWPoison() before calling try_to_split_thp_page().
As suggested by David Hildenbrand, fix this panic by not accessing the
poisoned sub-page of the THP during zeropage identification, while
continuing to scan unaffected sub-pages of the THP for possible zeropage
mapping. This prevents a second in-kernel #MC that would cause kernel
panic in Step[4].
[ Credits to Andrew Zaborowski <andrew.zaborowski@intel.com> for his
original fix that prevents passing the RMP_USE_SHARED_ZEROPAGE flag
to remap_page() in Step[3] if the THP has the has_hwpoisoned flag set,
avoiding access to the entire THP for zero-page identification. ]
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Fixes: dafff3f4c850 ("mm: split underused THPs")
Reported-by: Farrah Chen <farrah.chen@intel.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Acked-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
v2 -> v3:
- No code changes.
- Rebased on top of v6.18-rc1 and retested.
- Add two "Fixes:" tags.
- Collect Lance Yang's "Acked-by:" tag.
- Collect Wei Yang's "Reviewed-by:" tag.
- Collect Zi Yan's "Acked-by:" tag.
- Collect Miaohe's "Reviewed-by:" tag.
v1 -> v2:
- Apply David Hildenbrand's fix suggestion.
- Update the commit message to reflect the new fix.
- Add David Hildenbrand's "Suggested-by:" tag.
- Remove Andrew Zaborowski's SoB but add credits to him in the commit message.
[ I cannot reach him to get his SoB for the completely rewritten commit
message and new fix approach. ]
mm/huge_memory.c | 3 +++
mm/migrate.c | 3 ++-
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1b81680b4225..1d1b74950332 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4109,6 +4109,9 @@ static bool thp_underused(struct folio *folio)
 	if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
 		return false;
 
+	if (folio_contain_hwpoisoned_page(folio))
+		return false;
+
 	for (i = 0; i < folio_nr_pages(folio); i++) {
 		if (pages_identical(folio_page(folio, i), ZERO_PAGE(0))) {
 			if (++num_zero_pages > khugepaged_max_ptes_none)
diff --git a/mm/migrate.c b/mm/migrate.c
index e3065c9edb55..c0e9f15be2a2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -301,8 +301,9 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
 	struct page *page = folio_page(folio, idx);
 	pte_t newpte;
 
-	if (PageCompound(page))
+	if (PageCompound(page) || PageHWPoison(page))
 		return false;
+
 	VM_BUG_ON_PAGE(!PageAnon(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(pte_present(old_pte), page);
base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787
--
2.43.0
* Re: [PATCH v3 1/1] mm: prevent poison consumption when splitting THP
2025-10-14 14:19 ` [PATCH v3 " Qiuxu Zhuo
@ 2025-10-14 14:29 ` David Hildenbrand
2025-10-14 14:51 ` Zhuo, Qiuxu
0 siblings, 1 reply; 28+ messages in thread
From: David Hildenbrand @ 2025-10-14 14:29 UTC (permalink / raw)
To: Qiuxu Zhuo, akpm, lorenzo.stoakes, linmiaohe, tony.luck
Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, nao.horiguchi, farrah.chen, jiaqiyan, lance.yang,
richard.weiyang, linux-mm, linux-kernel
On 14.10.25 16:19, Qiuxu Zhuo wrote:
> When performing memory error injection on a THP (Transparent Huge Page)
> mapped to userspace on an x86 server, the kernel panics with the following
> trace. The expected behavior is to terminate the affected process instead
> of panicking the kernel, as the x86 Machine Check code can recover from an
> in-userspace #MC.
>
> mce: [Hardware Error]: CPU 0: Machine Check Exception: f Bank 3: bd80000000070134
> mce: [Hardware Error]: RIP 10:<ffffffff8372f8bc> {memchr_inv+0x4c/0xf0}
> mce: [Hardware Error]: TSC afff7bbff88a ADDR 1d301b000 MISC 80 PPIN 1e741e77539027db
> mce: [Hardware Error]: PROCESSOR 0:d06d0 TIME 1758093249 SOCKET 0 APIC 0 microcode 80000320
> mce: [Hardware Error]: Run the above through 'mcelog --ascii'
> mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
> Kernel panic - not syncing: Fatal local machine check
>
> The root cause of this panic is that handling a memory failure triggered by
> an in-userspace #MC necessitates splitting the THP. The splitting process
> employs a mechanism, implemented in try_to_map_unused_to_zeropage(), which
> reads the sub-pages of the THP to identify zero-filled pages. However,
> reading the sub-pages results in a second in-kernel #MC, occurring before
> the initial memory_failure() completes, ultimately leading to a kernel
> panic. See the kernel panic call trace on the two #MCs.
>
> First Machine Check occurs // [1]
> memory_failure() // [2]
> try_to_split_thp_page()
> split_huge_page()
> split_huge_page_to_list_to_order()
> __folio_split() // [3]
> remap_page()
> remove_migration_ptes()
> remove_migration_pte()
> try_to_map_unused_to_zeropage() // [4]
> memchr_inv() // [5]
> Second Machine Check occurs // [6]
> Kernel panic
>
> [1] Triggered by accessing a hardware-poisoned THP in userspace, which is
> typically recoverable by terminating the affected process.
>
> [2] Call folio_set_has_hwpoisoned() before try_to_split_thp_page().
>
> [3] Pass the RMP_USE_SHARED_ZEROPAGE remap flag to remap_page().
>
> [4] Try to map the unused THP to zeropage.
>
> [5] Re-access sub-pages of the hw-poisoned THP in the kernel.
>
> [6] Triggered in-kernel, leading to a kernel panic.
>
> In Step[2], memory_failure() sets the poisoned flag on the sub-page of the
> THP by TestSetPageHWPoison() before calling try_to_split_thp_page().
>
> As suggested by David Hildenbrand, fix this panic by not accessing the
> poisoned sub-page of the THP during zeropage identification, while
> continuing to scan unaffected sub-pages of the THP for possible zeropage
> mapping. This prevents a second in-kernel #MC that would cause kernel
> panic in Step[4].
>
> [ Credits to Andrew Zaborowski <andrew.zaborowski@intel.com> for his
> original fix that prevents passing the RMP_USE_SHARED_ZEROPAGE flag
> to remap_page() in Step[3] if the THP has the has_hwpoisoned flag set,
> avoiding access to the entire THP for zero-page identification. ]
Two smaller things:
(a) Sub-page is the wrong terminology. We simply call it "page in a
THP". So consider changing the multiple occurrences above.
(b) You should probably trim the credits to something simple like
"Thanks to Andrew Zaborowski for his initial work on fixing
this issue."
removing the brackets. If you want, you could then link to one of the
submissions from him. The details of how he would have fixed it are not
really relevant to this patch.
LGTM, thanks
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
* RE: [PATCH v3 1/1] mm: prevent poison consumption when splitting THP
2025-10-14 14:29 ` David Hildenbrand
@ 2025-10-14 14:51 ` Zhuo, Qiuxu
0 siblings, 0 replies; 28+ messages in thread
From: Zhuo, Qiuxu @ 2025-10-14 14:51 UTC (permalink / raw)
To: David Hildenbrand, akpm, lorenzo.stoakes, linmiaohe, Luck, Tony
Cc: ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts, dev.jain,
baohua, nao.horiguchi, Chen, Farrah, jiaqiyan, lance.yang,
richard.weiyang, linux-mm, linux-kernel
Hi David,
> From: David Hildenbrand <david@redhat.com>
> [...]
> > [ Credits to Andrew Zaborowski <andrew.zaborowski@intel.com> for his
> > original fix that prevents passing the RMP_USE_SHARED_ZEROPAGE flag
> > to remap_page() in Step[3] if the THP has the has_hwpoisoned flag set,
> > avoiding access to the entire THP for zero-page identification. ]
>
> Two smaller things:
>
> (a) Sub-page is the wrong terminology. We simply call it "page in a THP". So
> consider changing the multiple occurrences above.
>
> (b) You should probably trim the credits to something simple like
>
> "Thanks to Andrew Zaborowski for his initial work on fixing
> this issue."
>
> removing the brackets. If you want, you could then link to one of the
> submissions from him. The details how he would have fixed it are not really
> relevant to be had in this patch.
>
Thanks for the guidance.
I'll create a v4 incorporating these suggestions.
> LGTM, thanks
>
> Acked-by: David Hildenbrand <david@redhat.com>
Thanks for your review.
- Qiuxu
* [PATCH v4 1/1] mm: prevent poison consumption when splitting THP
2025-09-28 3:28 [PATCH 1/1] mm: prevent poison consumption when splitting THP Qiuxu Zhuo
` (3 preceding siblings ...)
2025-10-14 14:19 ` [PATCH v3 " Qiuxu Zhuo
@ 2025-10-15 6:49 ` Qiuxu Zhuo
4 siblings, 0 replies; 28+ messages in thread
From: Qiuxu Zhuo @ 2025-10-15 6:49 UTC (permalink / raw)
To: akpm, david, lorenzo.stoakes, linmiaohe, tony.luck
Cc: qiuxu.zhuo, ziy, baolin.wang, Liam.Howlett, npache, ryan.roberts,
dev.jain, baohua, nao.horiguchi, farrah.chen, jiaqiyan,
lance.yang, richard.weiyang, linux-mm, linux-kernel
When performing memory error injection on a THP (Transparent Huge Page)
mapped to userspace on an x86 server, the kernel panics with the following
trace. The expected behavior is to terminate the affected process instead
of panicking the kernel, as the x86 Machine Check code can recover from an
in-userspace #MC.
mce: [Hardware Error]: CPU 0: Machine Check Exception: f Bank 3: bd80000000070134
mce: [Hardware Error]: RIP 10:<ffffffff8372f8bc> {memchr_inv+0x4c/0xf0}
mce: [Hardware Error]: TSC afff7bbff88a ADDR 1d301b000 MISC 80 PPIN 1e741e77539027db
mce: [Hardware Error]: PROCESSOR 0:d06d0 TIME 1758093249 SOCKET 0 APIC 0 microcode 80000320
mce: [Hardware Error]: Run the above through 'mcelog --ascii'
mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
Kernel panic - not syncing: Fatal local machine check
The root cause of this panic is that handling a memory failure triggered by
an in-userspace #MC necessitates splitting the THP. The splitting process
employs a mechanism, implemented in try_to_map_unused_to_zeropage(), which
reads the pages in the THP to identify zero-filled pages. However, reading
the pages in the THP results in a second in-kernel #MC, occurring before
the initial memory_failure() completes, ultimately leading to a kernel
panic. See the kernel panic call trace on the two #MCs.
  First Machine Check occurs // [1]
    memory_failure() // [2]
      try_to_split_thp_page()
        split_huge_page()
          split_huge_page_to_list_to_order()
            __folio_split() // [3]
              remap_page()
                remove_migration_ptes()
                  remove_migration_pte()
                    try_to_map_unused_to_zeropage() // [4]
                      memchr_inv() // [5]
                        Second Machine Check occurs // [6]
                          Kernel panic
[1] Triggered by accessing a hardware-poisoned THP in userspace, which is
typically recoverable by terminating the affected process.
[2] Call folio_set_has_hwpoisoned() before try_to_split_thp_page().
[3] Pass the RMP_USE_SHARED_ZEROPAGE remap flag to remap_page().
[4] Try to map the unused THP to zeropage.
[5] Re-access pages in the hw-poisoned THP in the kernel.
[6] Triggered in-kernel, leading to a kernel panic.
In Step[2], memory_failure() sets the poisoned flag on the page in the
THP by TestSetPageHWPoison() before calling try_to_split_thp_page().
As suggested by David Hildenbrand, fix this panic by not accessing the
poisoned page in the THP during zeropage identification, while continuing
to scan unaffected pages in the THP for possible zeropage mapping. This
prevents a second in-kernel #MC that would cause kernel panic in Step[4].
Thanks to Andrew Zaborowski for his initial work on fixing this issue.
Fixes: b1f202060afe ("mm: remap unused subpages to shared zeropage when splitting isolated thp")
Fixes: dafff3f4c850 ("mm: split underused THPs")
Reported-by: Farrah Chen <farrah.chen@intel.com>
Suggested-by: David Hildenbrand <david@redhat.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Acked-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
---
v3 -> v4:
- No code changes.
- s/sub-page of the THP/page in the THP/ in the commit message.
- s/sub-pages of the THP/pages in the THP/ in the commit message.
- Simplify the credits in the commit message.
- Collect David Hildenbrand's "Acked-by:" tag.
v2 -> v3:
- No code changes.
- Rebased on top of v6.18-rc1 and retested.
- Add two "Fixes:" tags.
- Collect Lance Yang's "Acked-by:" tag.
- Collect Wei Yang's "Reviewed-by:" tag.
- Collect Zi Yan's "Acked-by:" tag.
- Collect Miaohe's "Reviewed-by:" tag.
v1 -> v2:
- Apply David Hildenbrand's fix suggestion.
- Update the commit message to reflect the new fix.
- Add David Hildenbrand's "Suggested-by:" tag.
- Remove Andrew Zaborowski's SoB but add credits to him in the commit message.
[ I cannot reach him to get his SoB for the completely rewritten commit
message and new fix approach. ]
mm/huge_memory.c | 3 +++
mm/migrate.c | 3 ++-
2 files changed, 5 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1b81680b4225..1d1b74950332 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4109,6 +4109,9 @@ static bool thp_underused(struct folio *folio)
 	if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
 		return false;
 
+	if (folio_contain_hwpoisoned_page(folio))
+		return false;
+
 	for (i = 0; i < folio_nr_pages(folio); i++) {
 		if (pages_identical(folio_page(folio, i), ZERO_PAGE(0))) {
 			if (++num_zero_pages > khugepaged_max_ptes_none)
diff --git a/mm/migrate.c b/mm/migrate.c
index e3065c9edb55..c0e9f15be2a2 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -301,8 +301,9 @@ static bool try_to_map_unused_to_zeropage(struct page_vma_mapped_walk *pvmw,
 	struct page *page = folio_page(folio, idx);
 	pte_t newpte;
 
-	if (PageCompound(page))
+	if (PageCompound(page) || PageHWPoison(page))
 		return false;
+
 	VM_BUG_ON_PAGE(!PageAnon(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(pte_present(old_pte), page);
base-commit: 3a8660878839faadb4f1a6dd72c3179c1df56787
--
2.43.0
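
For readers following the thread, the two checks in the patch above can be modelled in plain userspace C. This is only an illustrative sketch, not kernel code: struct model_page, struct model_folio and every model_*() function are hypothetical stand-ins invented here, the page size and page count are shrunk, and the assert() in model_page_is_zero() merely plays the role of the fatal second #MC that a kernel read of poisoned memory would raise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 64	/* hypothetical, tiny stand-in for PAGE_SIZE */
#define MODEL_NR_PAGES  4	/* hypothetical, tiny stand-in for pages in a THP */

/* Hypothetical model of one page in a THP. */
struct model_page {
	bool hwpoison;			/* models the per-page HWPoison flag */
	unsigned int reads;		/* counts content scans, for inspection */
	unsigned char data[MODEL_PAGE_SIZE];
};

/* Hypothetical model of a folio made of several pages. */
struct model_folio {
	struct model_page pages[MODEL_NR_PAGES];
};

/* Scan the page contents, as memchr_inv() does in the trace above. */
static bool model_page_is_zero(struct model_page *page)
{
	size_t i;

	/* Reading a poisoned page stands for the second, fatal #MC. */
	assert(!page->hwpoison);
	page->reads++;
	for (i = 0; i < MODEL_PAGE_SIZE; i++)
		if (page->data[i])
			return false;
	return true;
}

/* Per-page check: skip only the poisoned page, still scan the others. */
static bool model_try_map_to_zeropage(struct model_page *page)
{
	if (page->hwpoison)
		return false;
	return model_page_is_zero(page);
}

/* Flag-only walk; it never touches page contents. */
static bool model_folio_has_poison(const struct model_folio *folio)
{
	size_t i;

	for (i = 0; i < MODEL_NR_PAGES; i++)
		if (folio->pages[i].hwpoison)
			return true;
	return false;
}

/* Whole-folio check: give up on the folio if any page is poisoned. */
static bool model_thp_underused(struct model_folio *folio,
				unsigned int max_none)
{
	unsigned int num_zero = 0;
	size_t i;

	if (model_folio_has_poison(folio))
		return false;
	for (i = 0; i < MODEL_NR_PAGES; i++)
		if (model_page_is_zero(&folio->pages[i]) &&
		    ++num_zero > max_none)
			return true;
	return false;
}
```

In this model, calling model_try_map_to_zeropage() on a poisoned page returns false without incrementing its reads counter, while the clean pages of the same folio are still scanned. model_thp_underused() refuses a folio with any poisoned page before reading a single byte, which mirrors the "keep it simple" choice discussed earlier in the thread.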
Thread overview: 28+ messages
2025-09-28 3:28 [PATCH 1/1] mm: prevent poison consumption when splitting THP Qiuxu Zhuo
2025-09-28 21:55 ` Jiaqi Yan
2025-09-29 12:29 ` Miaohe Lin
2025-09-29 13:57 ` Zhuo, Qiuxu
2025-09-29 15:15 ` Jiaqi Yan
2025-09-29 13:27 ` Zhuo, Qiuxu
2025-09-29 15:51 ` Luck, Tony
2025-09-29 16:30 ` Zhuo, Qiuxu
2025-09-29 17:25 ` David Hildenbrand
2025-09-30 1:48 ` Lance Yang
2025-09-30 8:53 ` David Hildenbrand
2025-09-30 10:13 ` Lance Yang
2025-09-30 10:20 ` Lance Yang
2025-09-29 7:34 ` David Hildenbrand
2025-09-29 13:52 ` Zhuo, Qiuxu
2025-09-29 16:12 ` David Hildenbrand
2025-10-12 1:37 ` Wei Yang
2025-10-12 4:23 ` Jiaqi Yan
2025-10-11 7:55 ` [PATCH v2 " Qiuxu Zhuo
2025-10-11 9:09 ` Lance Yang
2025-10-11 18:18 ` Andrew Morton
2025-10-12 1:23 ` Wei Yang
2025-10-13 17:15 ` Zi Yan
2025-10-14 2:42 ` Miaohe Lin
2025-10-14 14:19 ` [PATCH v3 " Qiuxu Zhuo
2025-10-14 14:29 ` David Hildenbrand
2025-10-14 14:51 ` Zhuo, Qiuxu
2025-10-15 6:49 ` [PATCH v4 " Qiuxu Zhuo