[PATCH v1 0/4] fmm/hwpoison: Fix regressions in memory failure handling

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v1 0/4] fmm/hwpoison: Fix regressions in memory failure handling
@ 2025-02-11  6:01 Shuai Xue
  2025-02-11  6:01 ` [PATCH v1 1/4] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY Shuai Xue
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Shuai Xue @ 2025-02-11  6:01 UTC (permalink / raw)
  To: tony.luck, bp, nao.horiguchi
  Cc: tglx, mingo, dave.hansen, x86, hpa, linmiaohe, akpm, linux-edac,
	linux-kernel, linux-mm, baolin.wang, tianruidong

This patch addresses three regressions identified in memory failure
handling, as discovered using ras-tools[1]:

- `./einj_mem_uc copyin -f`
- `./einj_mem_uc futex -f`
- `./einj_mem_uc instr`

The regressions in the copyin and futex cases were caused by the
replacement of `EX_TYPE_UACCESS` with `EX_TYPE_EFAULT_REG` in some
copy-from-user operations, leading to kernel panics. The instr case
regression resulted from the PTE entry not being marked as hwpoison,
causing the system to send unnecessary SIGBUS signals.

These fixes ensure proper handling of memory errors and prevent kernel
panics and unnecessary signal dispatch.

[1]https://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git

Shuai Xue (4):
  x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY
  x86/mce: dump error msg from severities
  x86/mce: add EX_TYPE_EFAULT_REG as in-kernel recovery context to fix
    copy-from-user operations regression
  mm/hwpoison: Fix incorrect "not recovered" report for recovered clean
    pages

 arch/x86/kernel/cpu/mce/core.c     | 19 +++++++++++++------
 arch/x86/kernel/cpu/mce/severity.c | 21 ++++++++++++++++-----
 mm/memory-failure.c                |  5 ++---
 3 files changed, 31 insertions(+), 14 deletions(-)

-- 
2.39.3



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v1 1/4] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY
  2025-02-11  6:01 [PATCH v1 0/4] fmm/hwpoison: Fix regressions in memory failure handling Shuai Xue
@ 2025-02-11  6:01 ` Shuai Xue
  2025-02-11 16:51   ` Luck, Tony
  2025-02-11  6:01 ` [PATCH v1 2/4] x86/mce: dump error msg from severities Shuai Xue
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Shuai Xue @ 2025-02-11  6:01 UTC (permalink / raw)
  To: tony.luck, bp, nao.horiguchi
  Cc: tglx, mingo, dave.hansen, x86, hpa, linmiaohe, akpm, linux-edac,
	linux-kernel, linux-mm, baolin.wang, tianruidong

Currently, mce_no_way_out() only collects error messages when the error
severity is equal to `MCE_PANIC_SEVERITY`. To improve diagnostics,
modify the behavior to also collect error messages when the severity is
less than `MCE_PANIC_SEVERITY`.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 arch/x86/kernel/cpu/mce/core.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 0dc00c9894c7..2919a077cd66 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -925,11 +925,12 @@ static __always_inline void quirk_zen_ifu(int bank, struct mce *m, struct pt_reg
  * Do a quick check if any of the events requires a panic.
  * This decides if we keep the events around or clear them.
  */
-static __always_inline int mce_no_way_out(struct mce_hw_err *err, char **msg, unsigned long *validp,
-					  struct pt_regs *regs)
+static __always_inline bool mce_no_way_out(struct mce_hw_err *err, char **msg,
+					   unsigned long *validp,
+					   struct pt_regs *regs)
 {
 	struct mce *m = &err->m;
-	char *tmp = *msg;
+	char *tmp = *msg, cur_sev = MCE_NO_SEVERITY, sev;
 	int i;
 
 	for (i = 0; i < this_cpu_read(mce_num_banks); i++) {
@@ -945,13 +946,17 @@ static __always_inline int mce_no_way_out(struct mce_hw_err *err, char **msg, un
 			quirk_zen_ifu(i, m, regs);
 
 		m->bank = i;
-		if (mce_severity(m, regs, &tmp, true) >= MCE_PANIC_SEVERITY) {
+		sev = mce_severity(m, regs, &tmp, true);
+		if (sev >= cur_sev) {
 			mce_read_aux(err, i);
 			*msg = tmp;
-			return 1;
+			cur_sev = sev;
 		}
+
+		if (cur_sev == MCE_PANIC_SEVERITY)
+			return true;
 	}
-	return 0;
+	return false;
 }
 
 /*
-- 
2.39.3



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v1 2/4] x86/mce: dump error msg from severities
  2025-02-11  6:01 [PATCH v1 0/4] fmm/hwpoison: Fix regressions in memory failure handling Shuai Xue
  2025-02-11  6:01 ` [PATCH v1 1/4] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY Shuai Xue
@ 2025-02-11  6:01 ` Shuai Xue
  2025-02-11 16:44   ` Luck, Tony
  2025-02-11  6:01 ` [PATCH v1 3/4] x86/mce: add EX_TYPE_EFAULT_REG as in-kernel recovery context to fix copy-from-user operations regression Shuai Xue
  2025-02-11  6:02 ` [PATCH v1 4/4] mm/hwpoison: Fix incorrect "not recovered" report for recovered clean pages Shuai Xue
  3 siblings, 1 reply; 17+ messages in thread
From: Shuai Xue @ 2025-02-11  6:01 UTC (permalink / raw)
  To: tony.luck, bp, nao.horiguchi
  Cc: tglx, mingo, dave.hansen, x86, hpa, linmiaohe, akpm, linux-edac,
	linux-kernel, linux-mm, baolin.wang, tianruidong

The message in severities is useful for identifying the type of MCE that
has occurred; dump it if it is valid.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 arch/x86/kernel/cpu/mce/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2919a077cd66..c1319db45b0a 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1456,6 +1456,8 @@ static void queue_task_work(struct mce_hw_err *err, char *msg, void (*func)(stru
 	if (count > 1)
 		return;
 
+	if (msg)
+		pr_err("%s\n", msg);
 	task_work_add(current, &current->mce_kill_me, TWA_RESUME);
 }
 
-- 
2.39.3



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v1 3/4] x86/mce: add EX_TYPE_EFAULT_REG as in-kernel recovery context to fix copy-from-user operations regression
  2025-02-11  6:01 [PATCH v1 0/4] fmm/hwpoison: Fix regressions in memory failure handling Shuai Xue
  2025-02-11  6:01 ` [PATCH v1 1/4] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY Shuai Xue
  2025-02-11  6:01 ` [PATCH v1 2/4] x86/mce: dump error msg from severities Shuai Xue
@ 2025-02-11  6:01 ` Shuai Xue
  2025-02-11  6:02 ` [PATCH v1 4/4] mm/hwpoison: Fix incorrect "not recovered" report for recovered clean pages Shuai Xue
  3 siblings, 0 replies; 17+ messages in thread
From: Shuai Xue @ 2025-02-11  6:01 UTC (permalink / raw)
  To: tony.luck, bp, nao.horiguchi
  Cc: tglx, mingo, dave.hansen, x86, hpa, linmiaohe, akpm, linux-edac,
	linux-kernel, linux-mm, baolin.wang, tianruidong

Commit 4c132d1d844a ("x86/futex: Remove .fixup usage") introduced a new
extable fixup type, EX_TYPE_EFAULT_REG, and later patches updated the
extable fixup type for copy-from-user operations, changing it from
EX_TYPE_UACCESS to EX_TYPE_EFAULT_REG.

Specifically, commit 99641e094d6c ("x86/uaccess: Remove .fixup usage")
altered the extable fixup type for the get_user family, while commit
4c132d1d844a ("x86/futex: Remove .fixup usage") addressed the futex
operations. This change inadvertently caused a regression where the error
context for some copy-from-user operations no longer functions as an
in-kernel recovery context, leading to kernel panics with the message:
"Machine check: Data load in unrecoverable area of kernel."

To fix the regression, add EX_TYPE_EFAULT_REG as a in-kernel recovery
context for copy-from-user operations.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 arch/x86/kernel/cpu/mce/severity.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
index dac4d64dfb2a..14c2d71c3ce1 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -16,6 +16,7 @@
 #include <asm/traps.h>
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
+#include <linux/extable.h>
 
 #include "internal.h"
 
@@ -285,7 +286,8 @@ static bool is_copy_from_user(struct pt_regs *regs)
  */
 static noinstr int error_context(struct mce *m, struct pt_regs *regs)
 {
-	int fixup_type;
+	const struct exception_table_entry *e;
+	int fixup_type, imm;
 	bool copy_user;
 
 	if ((m->cs & 3) == 3)
@@ -294,9 +296,14 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
 	if (!mc_recoverable(m->mcgstatus))
 		return IN_KERNEL;
 
+	e = search_exception_tables(m->ip);
+	if (!e)
+		return IN_KERNEL;
+
 	/* Allow instrumentation around external facilities usage. */
 	instrumentation_begin();
-	fixup_type = ex_get_fixup_type(m->ip);
+	fixup_type = FIELD_GET(EX_DATA_TYPE_MASK, e->data);
+	imm  = FIELD_GET(EX_DATA_IMM_MASK,  e->data);
 	copy_user  = is_copy_from_user(regs);
 	instrumentation_end();
 
@@ -304,9 +311,13 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
 	case EX_TYPE_UACCESS:
 		if (!copy_user)
 			return IN_KERNEL;
-		m->kflags |= MCE_IN_KERNEL_COPYIN;
-		fallthrough;
-
+		m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_RECOV;
+		return IN_KERNEL_RECOV;
+	case EX_TYPE_IMM_REG:
+		if (!copy_user || imm != -EFAULT)
+			return IN_KERNEL;
+		m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_RECOV;
+		return IN_KERNEL_RECOV;
 	case EX_TYPE_FAULT_MCE_SAFE:
 	case EX_TYPE_DEFAULT_MCE_SAFE:
 		m->kflags |= MCE_IN_KERNEL_RECOV;
-- 
2.39.3



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v1 4/4] mm/hwpoison: Fix incorrect "not recovered" report for recovered clean pages
  2025-02-11  6:01 [PATCH v1 0/4] fmm/hwpoison: Fix regressions in memory failure handling Shuai Xue
                   ` (2 preceding siblings ...)
  2025-02-11  6:01 ` [PATCH v1 3/4] x86/mce: add EX_TYPE_EFAULT_REG as in-kernel recovery context to fix copy-from-user operations regression Shuai Xue
@ 2025-02-11  6:02 ` Shuai Xue
  2025-02-12  8:09   ` Miaohe Lin
  3 siblings, 1 reply; 17+ messages in thread
From: Shuai Xue @ 2025-02-11  6:02 UTC (permalink / raw)
  To: tony.luck, bp, nao.horiguchi
  Cc: tglx, mingo, dave.hansen, x86, hpa, linmiaohe, akpm, linux-edac,
	linux-kernel, linux-mm, baolin.wang, tianruidong

When an uncorrected memory error is consumed there is a race between
the CMCI from the memory controller reporting an uncorrected error
with a UCNA signature, and the core reporting and SRAR signature
machine check when the data is about to be consumed.

If the CMCI wins that race, the page is marked poisoned when
uc_decode_notifier() calls memory_failure(). For dirty pages,
memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag,
converting the PTE to a hwpoison entry. However, for clean pages, the
TTU_HWPOISON flag is cleared, leaving the PTE unchanged and not converted
to a hwpoison entry. Consequently, for an unmapped dirty page, the PTE is
marked as a hwpoison entry allowing kill_accessing_process() to:

- call walk_page_range() and return 1
- call kill_proc() to make sure a SIGBUS is sent
- return -EHWPOISON to indicate that SIGBUS is already sent to the process
  and kill_me_maybe() doesn't have to send it again.

Conversely, for clean pages where PTE entries are not marked as hwpoison,
kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send a
SIGBUS.

Console log looks like this:

    Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects
    Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered
    Memory failure: 0x827ca68: already hardware poisoned
    mce: Memory error not recovered

To fix it, return -EHWPOISON if no hwpoison PTE entry is found, preventing
an unnecessary SIGBUS.

Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 mm/memory-failure.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 995a15eb67e2..f9a6b136a6f0 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
 			      (void *)&priv);
 	if (ret == 1 && priv.tk.addr)
 		kill_proc(&priv.tk, pfn, flags);
-	else
-		ret = 0;
 	mmap_read_unlock(p->mm);
-	return ret > 0 ? -EHWPOISON : -EFAULT;
+
+	return ret >= 0 ? -EHWPOISON : -EFAULT;
 }

 /*
-- 
2.39.3

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH v1 2/4] x86/mce: dump error msg from severities
  2025-02-11  6:01 ` [PATCH v1 2/4] x86/mce: dump error msg from severities Shuai Xue
@ 2025-02-11 16:44   ` Luck, Tony
  2025-02-14  9:29     ` Shuai Xue
  0 siblings, 1 reply; 17+ messages in thread
From: Luck, Tony @ 2025-02-11 16:44 UTC (permalink / raw)
  To: Shuai Xue, bp, nao.horiguchi
  Cc: tglx, mingo, dave.hansen, x86, hpa, linmiaohe, akpm, linux-edac,
	linux-kernel, linux-mm, baolin.wang, tianruidong

> The message in severities is useful for identifying the type of MCE that
> has occurred; dump it if it is valid.
>
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> ---
>  arch/x86/kernel/cpu/mce/core.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 2919a077cd66..c1319db45b0a 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -1456,6 +1456,8 @@ static void queue_task_work(struct mce_hw_err *err, char *msg, void (*func)(stru
>       if (count > 1)
>               return;
>
> +     if (msg)
> +             pr_err("%s\n", msg);
>       task_work_add(current, &current->mce_kill_me, TWA_RESUME);
>  }

This is called from the #MC handler. Is that a safe context to print a console
message? It wasn't in the past, but maybe changes to how console messages
are handled have changed this.

-Tony


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH v1 1/4] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY
  2025-02-11  6:01 ` [PATCH v1 1/4] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY Shuai Xue
@ 2025-02-11 16:51   ` Luck, Tony
  2025-02-12  1:51     ` Shuai Xue
  0 siblings, 1 reply; 17+ messages in thread
From: Luck, Tony @ 2025-02-11 16:51 UTC (permalink / raw)
  To: Shuai Xue, bp, nao.horiguchi
  Cc: tglx, mingo, dave.hansen, x86, hpa, linmiaohe, akpm, linux-edac,
	linux-kernel, linux-mm, baolin.wang, tianruidong

> +	char *tmp = *msg, cur_sev = MCE_NO_SEVERITY, sev;

Should cur_sev and sev be of type "int" (since that's the type returned by mce_severity())?

It doesn't matter today since the list of return value does fit into "char", but it is setting up
to fail if that should ever change.

-Tony



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 1/4] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY
  2025-02-11 16:51   ` Luck, Tony
@ 2025-02-12  1:51     ` Shuai Xue
  0 siblings, 0 replies; 17+ messages in thread
From: Shuai Xue @ 2025-02-12  1:51 UTC (permalink / raw)
  To: Luck, Tony, bp, nao.horiguchi
  Cc: tglx, mingo, dave.hansen, x86, hpa, linmiaohe, akpm, linux-edac,
	linux-kernel, linux-mm, baolin.wang, tianruidong



在 2025/2/12 00:51, Luck, Tony 写道:
>> +	char *tmp = *msg, cur_sev = MCE_NO_SEVERITY, sev;
> 
> Should cur_sev and sev be of type "int" (since that's the type returned by mce_severity())?
> 
> It doesn't matter today since the list of return value does fit into "char", but it is setting up
> to fail if that should ever change.
> 
> -Tony

You are right, I previously only focused on the fact that the field 'sev' of
struct severity is an unsigned char.

Will fix it in next version.

Thanks.
Shuai


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 4/4] mm/hwpoison: Fix incorrect "not recovered" report for recovered clean pages
  2025-02-11  6:02 ` [PATCH v1 4/4] mm/hwpoison: Fix incorrect "not recovered" report for recovered clean pages Shuai Xue
@ 2025-02-12  8:09   ` Miaohe Lin
  2025-02-12 13:55     ` Shuai Xue
  0 siblings, 1 reply; 17+ messages in thread
From: Miaohe Lin @ 2025-02-12  8:09 UTC (permalink / raw)
  To: Shuai Xue
  Cc: tglx, mingo, dave.hansen, x86, hpa, akpm, linux-edac,
	linux-kernel, linux-mm, baolin.wang, tianruidong, tony.luck, bp,
	nao.horiguchi

On 2025/2/11 14:02, Shuai Xue wrote:
> When an uncorrected memory error is consumed there is a race between
> the CMCI from the memory controller reporting an uncorrected error
> with a UCNA signature, and the core reporting and SRAR signature
> machine check when the data is about to be consumed.
> 
> If the CMCI wins that race, the page is marked poisoned when
> uc_decode_notifier() calls memory_failure(). For dirty pages,
> memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag,
> converting the PTE to a hwpoison entry. However, for clean pages, the
> TTU_HWPOISON flag is cleared, leaving the PTE unchanged and not converted
> to a hwpoison entry. Consequently, for an unmapped dirty page, the PTE is
> marked as a hwpoison entry allowing kill_accessing_process() to:
> 
> - call walk_page_range() and return 1
> - call kill_proc() to make sure a SIGBUS is sent
> - return -EHWPOISON to indicate that SIGBUS is already sent to the process
>   and kill_me_maybe() doesn't have to send it again.
> 
> Conversely, for clean pages where PTE entries are not marked as hwpoison,
> kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send a
> SIGBUS.
> 
> Console log looks like this:
> 
>     Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects
>     Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered
>     Memory failure: 0x827ca68: already hardware poisoned
>     mce: Memory error not recovered
> 
> To fix it, return -EHWPOISON if no hwpoison PTE entry is found, preventing
> an unnecessary SIGBUS.

Thanks for your patch.

> 
> Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> ---
>  mm/memory-failure.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 995a15eb67e2..f9a6b136a6f0 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
>  			      (void *)&priv);
>  	if (ret == 1 && priv.tk.addr)
>  		kill_proc(&priv.tk, pfn, flags);
> -	else
> -		ret = 0;
>  	mmap_read_unlock(p->mm);
> -	return ret > 0 ? -EHWPOISON : -EFAULT;
> +
> +	return ret >= 0 ? -EHWPOISON : -EFAULT;

IIUC, kill_accessing_process() is supposed to return -EHWPOISON to notify that SIGBUS is already
sent to the process and kill_me_maybe() doesn't have to send it again. But with your change,
kill_accessing_process() will return -EHWPOISON even if SIGBUS is not sent. Does this break
the semantics of -EHWPOISON?

BTW I scanned the code of walk_page_range(). It seems with implementation of hwpoison_walk_ops
walk_page_range() will only return 0 or 1, i.e. always >= 0. So kill_accessing_process() will always
return -EHWPOISON if this patch is applied.

Correct me if I miss something.

Thanks.
.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 4/4] mm/hwpoison: Fix incorrect "not recovered" report for recovered clean pages
  2025-02-12  8:09   ` Miaohe Lin
@ 2025-02-12 13:55     ` Shuai Xue
  2025-02-13  3:20       ` Miaohe Lin
  0 siblings, 1 reply; 17+ messages in thread
From: Shuai Xue @ 2025-02-12 13:55 UTC (permalink / raw)
  To: Miaohe Lin, nao.horiguchi
  Cc: tglx, mingo, dave.hansen, x86, hpa, akpm, linux-edac,
	linux-kernel, linux-mm, baolin.wang, tianruidong, tony.luck, bp



在 2025/2/12 16:09, Miaohe Lin 写道:
> On 2025/2/11 14:02, Shuai Xue wrote:
>> When an uncorrected memory error is consumed there is a race between
>> the CMCI from the memory controller reporting an uncorrected error
>> with a UCNA signature, and the core reporting and SRAR signature
>> machine check when the data is about to be consumed.
>>
>> If the CMCI wins that race, the page is marked poisoned when
>> uc_decode_notifier() calls memory_failure(). For dirty pages,
>> memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag,
>> converting the PTE to a hwpoison entry. However, for clean pages, the
>> TTU_HWPOISON flag is cleared, leaving the PTE unchanged and not converted
>> to a hwpoison entry. Consequently, for an unmapped dirty page, the PTE is
>> marked as a hwpoison entry allowing kill_accessing_process() to:
>>
>> - call walk_page_range() and return 1
>> - call kill_proc() to make sure a SIGBUS is sent
>> - return -EHWPOISON to indicate that SIGBUS is already sent to the process
>>    and kill_me_maybe() doesn't have to send it again.
>>
>> Conversely, for clean pages where PTE entries are not marked as hwpoison,
>> kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send a
>> SIGBUS.
>>
>> Console log looks like this:
>>
>>      Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects
>>      Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered
>>      Memory failure: 0x827ca68: already hardware poisoned
>>      mce: Memory error not recovered
>>
>> To fix it, return -EHWPOISON if no hwpoison PTE entry is found, preventing
>> an unnecessary SIGBUS.
> 
> Thanks for your patch.
> 
>>
>> Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>> ---
>>   mm/memory-failure.c | 5 ++---
>>   1 file changed, 2 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 995a15eb67e2..f9a6b136a6f0 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
>>   			      (void *)&priv);
>>   	if (ret == 1 && priv.tk.addr)
>>   		kill_proc(&priv.tk, pfn, flags);
>> -	else
>> -		ret = 0;
>>   	mmap_read_unlock(p->mm);
>> -	return ret > 0 ? -EHWPOISON : -EFAULT;
>> +
>> +	return ret >= 0 ? -EHWPOISON : -EFAULT;
> 
> IIUC, kill_accessing_process() is supposed to return -EHWPOISON to notify that SIGBUS is already
> sent to the process and kill_me_maybe() doesn't have to send it again. But with your change,
> kill_accessing_process() will return -EHWPOISON even if SIGBUS is not sent. Does this break
> the semantics of -EHWPOISON?

Yes, from the comment of kill_me_maybe(),

	 * -EHWPOISON from memory_failure() means that it already sent SIGBUS
	 * to the current process with the proper error info,
	 * -EOPNOTSUPP means hwpoison_filter() filtered the error event,

this patch break the comment.

But the defination of EHWPOISON is quite different from the comment.

  #define EHWPOISON	133	/* Memory page has hardware error */

As for this issue, returning 0 or EHWPOISON can both prevent a SIGBUS signal
from being sent in kill_me_maybe().

Which way do you prefer?

> 
> BTW I scanned the code of walk_page_range(). It seems with implementation of hwpoison_walk_ops
> walk_page_range() will only return 0 or 1, i.e. always >= 0. So kill_accessing_process() will always
> return -EHWPOISON if this patch is applied.
> 
> Correct me if I miss something.

Yes, you are right. Let's count the cases one by one:

1. clean page: try_to_remap(!TTU_HWPOISON), walk_page_range() will return 0 and
we should not send sigbus in kill_me_maybe().

2. dirty page:
2.1 MCE wins race
           CMCI:w/o Action Require         MCE: w/ Action Require
                                       TestSetPageHWPoison
       TestSetPageHWPoison
       return -EHWPOISON
                                       try_to_unmap(TTU_HWPOISON)
                                       kill_proc in hwpoison_user_mappings()

If MCE wins the race, because the flag of memory_fialure() called by CMCI is
not set as MF_ACTION_REQUIRED, everything goes well, kill_proc() will send
SIGBUS in hwpoison_user_mappings().

2.2 CMCI win
           CMCI:w/o Action Require         MCE: w/ Action Require
     TestSetPageHWPoison
     try_to_unmap(TTU_HWPOISON)
                                        walk_page_range() return 1 due to hwpoison PTE entry
                                        kill_proc in kill_accessing_process()

If the CMCI wins the race, we need to kill the process in
kill_accessing_process(). And if try_to_remap() success, everything goes well,
kill_proc() will send SIGBUS in kill_accessing_process().

But if try_to_remap() fails, the PTE entry will not be marked as hwpoison, and
walk_page_range() return 0 as case 1 clean page, NO SIGBUS will be sent.

In summary, hwpoison_walk_ops cannot distinguish between try_to_unmap failing
and causing the PTE entry not to be set to hwpoison, and a clean page that
originally does not have the PTE entry set to hwpoison.

+naoya for orginal patch intend.

Thanks.
Best Regard,
Shuai
				


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 4/4] mm/hwpoison: Fix incorrect "not recovered" report for recovered clean pages
  2025-02-12 13:55     ` Shuai Xue
@ 2025-02-13  3:20       ` Miaohe Lin
  2025-02-13  6:59         ` Shuai Xue
  0 siblings, 1 reply; 17+ messages in thread
From: Miaohe Lin @ 2025-02-13  3:20 UTC (permalink / raw)
  To: Shuai Xue
  Cc: tglx, mingo, dave.hansen, x86, hpa, akpm, linux-edac,
	linux-kernel, linux-mm, baolin.wang, tianruidong, tony.luck, bp,
	nao.horiguchi

On 2025/2/12 21:55, Shuai Xue wrote:
> 
> 
> 在 2025/2/12 16:09, Miaohe Lin 写道:
>> On 2025/2/11 14:02, Shuai Xue wrote:
>>> When an uncorrected memory error is consumed there is a race between
>>> the CMCI from the memory controller reporting an uncorrected error
>>> with a UCNA signature, and the core reporting and SRAR signature
>>> machine check when the data is about to be consumed.
>>>
>>> If the CMCI wins that race, the page is marked poisoned when
>>> uc_decode_notifier() calls memory_failure(). For dirty pages,
>>> memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag,
>>> converting the PTE to a hwpoison entry. However, for clean pages, the
>>> TTU_HWPOISON flag is cleared, leaving the PTE unchanged and not converted
>>> to a hwpoison entry. Consequently, for an unmapped dirty page, the PTE is
>>> marked as a hwpoison entry allowing kill_accessing_process() to:
>>>
>>> - call walk_page_range() and return 1
>>> - call kill_proc() to make sure a SIGBUS is sent
>>> - return -EHWPOISON to indicate that SIGBUS is already sent to the process
>>>    and kill_me_maybe() doesn't have to send it again.
>>>
>>> Conversely, for clean pages where PTE entries are not marked as hwpoison,
>>> kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send a
>>> SIGBUS.
>>>
>>> Console log looks like this:
>>>
>>>      Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects
>>>      Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered
>>>      Memory failure: 0x827ca68: already hardware poisoned
>>>      mce: Memory error not recovered
>>>
>>> To fix it, return -EHWPOISON if no hwpoison PTE entry is found, preventing
>>> an unnecessary SIGBUS.
>>
>> Thanks for your patch.
>>
>>>
>>> Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")
>>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>>> ---
>>>   mm/memory-failure.c | 5 ++---
>>>   1 file changed, 2 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index 995a15eb67e2..f9a6b136a6f0 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
>>>                     (void *)&priv);
>>>       if (ret == 1 && priv.tk.addr)
>>>           kill_proc(&priv.tk, pfn, flags);
>>> -    else
>>> -        ret = 0;
>>>       mmap_read_unlock(p->mm);
>>> -    return ret > 0 ? -EHWPOISON : -EFAULT;
>>> +
>>> +    return ret >= 0 ? -EHWPOISON : -EFAULT;
>>
>> IIUC, kill_accessing_process() is supposed to return -EHWPOISON to notify that SIGBUS is already
>> sent to the process and kill_me_maybe() doesn't have to send it again. But with your change,
>> kill_accessing_process() will return -EHWPOISON even if SIGBUS is not sent. Does this break
>> the semantics of -EHWPOISON?
> 
> Yes, from the comment of kill_me_maybe(),
> 
>      * -EHWPOISON from memory_failure() means that it already sent SIGBUS
>      * to the current process with the proper error info,
>      * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
> 
> this patch break the comment.
> 
> But the defination of EHWPOISON is quite different from the comment.
> 
>  #define EHWPOISON    133    /* Memory page has hardware error */
> 
> As for this issue, returning 0 or EHWPOISON can both prevent a SIGBUS signal
> from being sent in kill_me_maybe().
> 
> Which way do you prefer?
> 
>>
>> BTW I scanned the code of walk_page_range(). It seems with implementation of hwpoison_walk_ops
>> walk_page_range() will only return 0 or 1, i.e. always >= 0. So kill_accessing_process() will always
>> return -EHWPOISON if this patch is applied.
>>
>> Correct me if I miss something.
> 
> Yes, you are right. Let's count the cases one by one:
> 
> 1. clean page: try_to_remap(!TTU_HWPOISON), walk_page_range() will return 0 and

Do you mean try_to_unmap?

> we should not send sigbus in kill_me_maybe().
> 
> 2. dirty page:
> 2.1 MCE wins race
>           CMCI:w/o Action Require         MCE: w/ Action Require
>                                       TestSetPageHWPoison
>       TestSetPageHWPoison
>       return -EHWPOISON
>                                       try_to_unmap(TTU_HWPOISON)
>                                       kill_proc in hwpoison_user_mappings()
> 
> If MCE wins the race, because the flag of memory_fialure() called by CMCI is
> not set as MF_ACTION_REQUIRED, everything goes well, kill_proc() will send
> SIGBUS in hwpoison_user_mappings().
> 
> 2.2 CMCI win
>           CMCI:w/o Action Require         MCE: w/ Action Require
>     TestSetPageHWPoison
>     try_to_unmap(TTU_HWPOISON)
>                                        walk_page_range() return 1 due to hwpoison PTE entry
>                                        kill_proc in kill_accessing_process()
> 
> If the CMCI wins the race, we need to kill the process in
> kill_accessing_process(). And if try_to_remap() success, everything goes well,
> kill_proc() will send SIGBUS in kill_accessing_process().
> 
> But if try_to_remap() fails, the PTE entry will not be marked as hwpoison, and
> walk_page_range() return 0 as case 1 clean page, NO SIGBUS will be sent.

If try_to_unmap() fails, the PTE entry will still point to the dirty page. Then in
check_hwpoisoned_entry(), we will have pfn == poisoned_pfn. So walk_page_range()
will return 1 in this case. Or am I miss something?

> 
> In summary, hwpoison_walk_ops cannot distinguish between try_to_unmap failing
> and causing the PTE entry not to be set to hwpoison, and a clean page that
> originally does not have the PTE entry set to hwpoison.

Is it possible current process is not the one accessing the hwpoisoned page? E.g. memory_failure
is deferred and called from kworker context or something like that. If it's possible, this is
another scene needs to be considered.

Thanks.
.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 4/4] mm/hwpoison: Fix incorrect "not recovered" report for recovered clean pages
  2025-02-13  3:20       ` Miaohe Lin
@ 2025-02-13  6:59         ` Shuai Xue
  2025-02-14  6:54           ` Miaohe Lin
  0 siblings, 1 reply; 17+ messages in thread
From: Shuai Xue @ 2025-02-13  6:59 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: tglx, mingo, dave.hansen, x86, hpa, akpm, linux-edac,
	linux-kernel, linux-mm, baolin.wang, tianruidong, tony.luck, bp,
	nao.horiguchi



在 2025/2/13 11:20, Miaohe Lin 写道:
> On 2025/2/12 21:55, Shuai Xue wrote:
>>
>>
>> 在 2025/2/12 16:09, Miaohe Lin 写道:
>>> On 2025/2/11 14:02, Shuai Xue wrote:
>>>> When an uncorrected memory error is consumed there is a race between
>>>> the CMCI from the memory controller reporting an uncorrected error
>>>> with a UCNA signature, and the core reporting and SRAR signature
>>>> machine check when the data is about to be consumed.
>>>>
>>>> If the CMCI wins that race, the page is marked poisoned when
>>>> uc_decode_notifier() calls memory_failure(). For dirty pages,
>>>> memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag,
>>>> converting the PTE to a hwpoison entry. However, for clean pages, the
>>>> TTU_HWPOISON flag is cleared, leaving the PTE unchanged and not converted
>>>> to a hwpoison entry. Consequently, for an unmapped dirty page, the PTE is
>>>> marked as a hwpoison entry allowing kill_accessing_process() to:
>>>>
>>>> - call walk_page_range() and return 1
>>>> - call kill_proc() to make sure a SIGBUS is sent
>>>> - return -EHWPOISON to indicate that SIGBUS is already sent to the process
>>>>     and kill_me_maybe() doesn't have to send it again.
>>>>
>>>> Conversely, for clean pages where PTE entries are not marked as hwpoison,
>>>> kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send a
>>>> SIGBUS.
>>>>
>>>> Console log looks like this:
>>>>
>>>>       Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects
>>>>       Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered
>>>>       Memory failure: 0x827ca68: already hardware poisoned
>>>>       mce: Memory error not recovered
>>>>
>>>> To fix it, return -EHWPOISON if no hwpoison PTE entry is found, preventing
>>>> an unnecessary SIGBUS.
>>>
>>> Thanks for your patch.
>>>
>>>>
>>>> Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")
>>>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>>>> ---
>>>>    mm/memory-failure.c | 5 ++---
>>>>    1 file changed, 2 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>> index 995a15eb67e2..f9a6b136a6f0 100644
>>>> --- a/mm/memory-failure.c
>>>> +++ b/mm/memory-failure.c
>>>> @@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
>>>>                      (void *)&priv);
>>>>        if (ret == 1 && priv.tk.addr)
>>>>            kill_proc(&priv.tk, pfn, flags);
>>>> -    else
>>>> -        ret = 0;
>>>>        mmap_read_unlock(p->mm);
>>>> -    return ret > 0 ? -EHWPOISON : -EFAULT;
>>>> +
>>>> +    return ret >= 0 ? -EHWPOISON : -EFAULT;
>>>
>>> IIUC, kill_accessing_process() is supposed to return -EHWPOISON to notify that SIGBUS is already
>>> sent to the process and kill_me_maybe() doesn't have to send it again. But with your change,
>>> kill_accessing_process() will return -EHWPOISON even if SIGBUS is not sent. Does this break
>>> the semantics of -EHWPOISON?
>>
>> Yes, from the comment of kill_me_maybe(),
>>
>>       * -EHWPOISON from memory_failure() means that it already sent SIGBUS
>>       * to the current process with the proper error info,
>>       * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
>>
>> this patch break the comment.
>>
>> But the defination of EHWPOISON is quite different from the comment.
>>
>>   #define EHWPOISON    133    /* Memory page has hardware error */
>>
>> As for this issue, returning 0 or EHWPOISON can both prevent a SIGBUS signal
>> from being sent in kill_me_maybe().
>>
>> Which way do you prefer?
>>
>>>
>>> BTW I scanned the code of walk_page_range(). It seems with implementation of hwpoison_walk_ops
>>> walk_page_range() will only return 0 or 1, i.e. always >= 0. So kill_accessing_process() will always
>>> return -EHWPOISON if this patch is applied.
>>>
>>> Correct me if I miss something.
>>
>> Yes, you are right. Let's count the cases one by one:
>>
>> 1. clean page: try_to_remap(!TTU_HWPOISON), walk_page_range() will return 0 and
> 
> Do you mean try_to_unmap?

Yes, sorry for the typo.
> 
>> we should not send sigbus in kill_me_maybe().
>>
>> 2. dirty page:
>> 2.1 MCE wins race
>>            CMCI:w/o Action Require         MCE: w/ Action Require
>>                                        TestSetPageHWPoison
>>        TestSetPageHWPoison
>>        return -EHWPOISON
>>                                        try_to_unmap(TTU_HWPOISON)
>>                                        kill_proc in hwpoison_user_mappings()
>>
>> If MCE wins the race, because the flag of memory_fialure() called by CMCI is
>> not set as MF_ACTION_REQUIRED, everything goes well, kill_proc() will send
>> SIGBUS in hwpoison_user_mappings().
>>
>> 2.2 CMCI win
>>            CMCI:w/o Action Require         MCE: w/ Action Require
>>      TestSetPageHWPoison
>>      try_to_unmap(TTU_HWPOISON)
>>                                         walk_page_range() return 1 due to hwpoison PTE entry
>>                                         kill_proc in kill_accessing_process()
>>
>> If the CMCI wins the race, we need to kill the process in
>> kill_accessing_process(). And if try_to_remap() success, everything goes well,
>> kill_proc() will send SIGBUS in kill_accessing_process().
>>
>> But if try_to_remap() fails, the PTE entry will not be marked as hwpoison, and
>> walk_page_range() return 0 as case 1 clean page, NO SIGBUS will be sent.
> 
> If try_to_unmap() fails, the PTE entry will still point to the dirty page. Then in
> check_hwpoisoned_entry(), we will have pfn == poisoned_pfn. So walk_page_range()
> will return 1 in this case. Or am I miss something?
> 

You’re right; I overlooked the pte_present() branch.

Therefore, in the walk_page_range() function:
- It returns 0 when the poison page is a clean page.
- It returns 1 when CMCI wins, regardless of whether try_to_unmap succeeds
   or fails.

Then the patch will be like:

@@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
  			      (void *)&priv);
  	if (ret == 1 && priv.tk.addr)
  		kill_proc(&priv.tk, pfn, flags);
-	else
-		ret = 0;
  	mmap_read_unlock(p->mm);
-	return ret > 0 ? -EHWPOISON : -EFAULT;
+
+	return ret > 0 ? -EHWPOISON : 0;

Here, returning 0 indicates that memory_failure() successfully handled the
error by dropping the clean page.


>>
>> In summary, hwpoison_walk_ops cannot distinguish between try_to_unmap failing
>> and causing the PTE entry not to be set to hwpoison, and a clean page that
>> originally does not have the PTE entry set to hwpoison.
> 
> Is it possible current process is not the one accessing the hwpoisoned page? E.g. memory_failure
> is deferred and called from kworker context or something like that. If it's possible, this is
> another scene needs to be considered.

Yes, it possibale.

But kill_accessing_process() will only be called with MF_ACTION_REQUIRED.
MF_ACTION_REQUIRED indates that current process is exactly the one accesing the
poison data.

Fox x86 platform, GHES driver may queue a kwoker to defer memory_failure() with
flag=0.  So kill_accessing_process() will not be called in such case.

Thanks.
Shuai


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 4/4] mm/hwpoison: Fix incorrect "not recovered" report for recovered clean pages
  2025-02-13  6:59         ` Shuai Xue
@ 2025-02-14  6:54           ` Miaohe Lin
  2025-02-14  7:59             ` Shuai Xue
  2025-02-14 16:51             ` Luck, Tony
  0 siblings, 2 replies; 17+ messages in thread
From: Miaohe Lin @ 2025-02-14  6:54 UTC (permalink / raw)
  To: Shuai Xue
  Cc: tglx, mingo, dave.hansen, x86, hpa, akpm, linux-edac,
	linux-kernel, linux-mm, baolin.wang, tianruidong, tony.luck, bp,
	nao.horiguchi

On 2025/2/13 14:59, Shuai Xue wrote:
> 
> 
> 在 2025/2/13 11:20, Miaohe Lin 写道:
>> On 2025/2/12 21:55, Shuai Xue wrote:
>>>
>>>
>>> 在 2025/2/12 16:09, Miaohe Lin 写道:
>>>> On 2025/2/11 14:02, Shuai Xue wrote:
>>>>> When an uncorrected memory error is consumed there is a race between
>>>>> the CMCI from the memory controller reporting an uncorrected error
>>>>> with a UCNA signature, and the core reporting and SRAR signature
>>>>> machine check when the data is about to be consumed.
>>>>>
>>>>> If the CMCI wins that race, the page is marked poisoned when
>>>>> uc_decode_notifier() calls memory_failure(). For dirty pages,
>>>>> memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag,
>>>>> converting the PTE to a hwpoison entry. However, for clean pages, the
>>>>> TTU_HWPOISON flag is cleared, leaving the PTE unchanged and not converted
>>>>> to a hwpoison entry. Consequently, for an unmapped dirty page, the PTE is
>>>>> marked as a hwpoison entry allowing kill_accessing_process() to:
>>>>>
>>>>> - call walk_page_range() and return 1
>>>>> - call kill_proc() to make sure a SIGBUS is sent
>>>>> - return -EHWPOISON to indicate that SIGBUS is already sent to the process
>>>>>     and kill_me_maybe() doesn't have to send it again.
>>>>>
>>>>> Conversely, for clean pages where PTE entries are not marked as hwpoison,
>>>>> kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send a
>>>>> SIGBUS.
>>>>>
>>>>> Console log looks like this:
>>>>>
>>>>>       Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects
>>>>>       Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered
>>>>>       Memory failure: 0x827ca68: already hardware poisoned
>>>>>       mce: Memory error not recovered
>>>>>
>>>>> To fix it, return -EHWPOISON if no hwpoison PTE entry is found, preventing
>>>>> an unnecessary SIGBUS.
>>>>
>>>> Thanks for your patch.
>>>>
>>>>>
>>>>> Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")
>>>>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>>>>> ---
>>>>>    mm/memory-failure.c | 5 ++---
>>>>>    1 file changed, 2 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>>> index 995a15eb67e2..f9a6b136a6f0 100644
>>>>> --- a/mm/memory-failure.c
>>>>> +++ b/mm/memory-failure.c
>>>>> @@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
>>>>>                      (void *)&priv);
>>>>>        if (ret == 1 && priv.tk.addr)
>>>>>            kill_proc(&priv.tk, pfn, flags);
>>>>> -    else
>>>>> -        ret = 0;
>>>>>        mmap_read_unlock(p->mm);
>>>>> -    return ret > 0 ? -EHWPOISON : -EFAULT;
>>>>> +
>>>>> +    return ret >= 0 ? -EHWPOISON : -EFAULT;
>>>>
>>>> IIUC, kill_accessing_process() is supposed to return -EHWPOISON to notify that SIGBUS is already
>>>> sent to the process and kill_me_maybe() doesn't have to send it again. But with your change,
>>>> kill_accessing_process() will return -EHWPOISON even if SIGBUS is not sent. Does this break
>>>> the semantics of -EHWPOISON?
>>>
>>> Yes, from the comment of kill_me_maybe(),
>>>
>>>       * -EHWPOISON from memory_failure() means that it already sent SIGBUS
>>>       * to the current process with the proper error info,
>>>       * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
>>>
>>> this patch break the comment.
>>>
>>> But the defination of EHWPOISON is quite different from the comment.
>>>
>>>   #define EHWPOISON    133    /* Memory page has hardware error */
>>>
>>> As for this issue, returning 0 or EHWPOISON can both prevent a SIGBUS signal
>>> from being sent in kill_me_maybe().
>>>
>>> Which way do you prefer?
>>>
>>>>
>>>> BTW I scanned the code of walk_page_range(). It seems with implementation of hwpoison_walk_ops
>>>> walk_page_range() will only return 0 or 1, i.e. always >= 0. So kill_accessing_process() will always
>>>> return -EHWPOISON if this patch is applied.
>>>>
>>>> Correct me if I miss something.
>>>
>>> Yes, you are right. Let's count the cases one by one:
>>>
>>> 1. clean page: try_to_remap(!TTU_HWPOISON), walk_page_range() will return 0 and
>>
>> Do you mean try_to_unmap?
> 
> Yes, sorry for the typo.
>>
>>> we should not send sigbus in kill_me_maybe().
>>>
>>> 2. dirty page:
>>> 2.1 MCE wins race
>>>            CMCI:w/o Action Require         MCE: w/ Action Require
>>>                                        TestSetPageHWPoison
>>>        TestSetPageHWPoison
>>>        return -EHWPOISON
>>>                                        try_to_unmap(TTU_HWPOISON)
>>>                                        kill_proc in hwpoison_user_mappings()
>>>
>>> If MCE wins the race, because the flag of memory_fialure() called by CMCI is
>>> not set as MF_ACTION_REQUIRED, everything goes well, kill_proc() will send
>>> SIGBUS in hwpoison_user_mappings().
>>>
>>> 2.2 CMCI win
>>>            CMCI:w/o Action Require         MCE: w/ Action Require
>>>      TestSetPageHWPoison
>>>      try_to_unmap(TTU_HWPOISON)
>>>                                         walk_page_range() return 1 due to hwpoison PTE entry
>>>                                         kill_proc in kill_accessing_process()
>>>
>>> If the CMCI wins the race, we need to kill the process in
>>> kill_accessing_process(). And if try_to_remap() success, everything goes well,
>>> kill_proc() will send SIGBUS in kill_accessing_process().
>>>
>>> But if try_to_remap() fails, the PTE entry will not be marked as hwpoison, and
>>> walk_page_range() return 0 as case 1 clean page, NO SIGBUS will be sent.
>>
>> If try_to_unmap() fails, the PTE entry will still point to the dirty page. Then in
>> check_hwpoisoned_entry(), we will have pfn == poisoned_pfn. So walk_page_range()
>> will return 1 in this case. Or am I miss something?
>>
> 
> You’re right; I overlooked the pte_present() branch.
> 
> Therefore, in the walk_page_range() function:
> - It returns 0 when the poison page is a clean page.
> - It returns 1 when CMCI wins, regardless of whether try_to_unmap succeeds
>   or fails.
> 
> Then the patch will be like:
> 
> @@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
>                    (void *)&priv);
>      if (ret == 1 && priv.tk.addr)
>          kill_proc(&priv.tk, pfn, flags);
> -    else
> -        ret = 0;
>      mmap_read_unlock(p->mm);
> -    return ret > 0 ? -EHWPOISON : -EFAULT;
> +
> +    return ret > 0 ? -EHWPOISON : 0;
> 
> Here, returning 0 indicates that memory_failure() successfully handled the
> error by dropping the clean page.

I'm not sure whether there's another scene that can make walk_page_range() returns 0. But if the
only reason for walk_page_range() returning 0 is the poison page is a clean page and it's dropped,
then this modification should be appropriate. With this change, the callers never send SIGBUS now.
They might need to be changed too.

Thanks.
.

> 
> 
>>>
>>> In summary, hwpoison_walk_ops cannot distinguish between try_to_unmap failing
>>> and causing the PTE entry not to be set to hwpoison, and a clean page that
>>> originally does not have the PTE entry set to hwpoison.
>>
>> Is it possible current process is not the one accessing the hwpoisoned page? E.g. memory_failure
>> is deferred and called from kworker context or something like that. If it's possible, this is
>> another scene needs to be considered.
> 
> Yes, it possibale.
> 
> But kill_accessing_process() will only be called with MF_ACTION_REQUIRED.
> MF_ACTION_REQUIRED indates that current process is exactly the one accesing the
> poison data.
> 
> Fox x86 platform, GHES driver may queue a kwoker to defer memory_failure() with
> flag=0.  So kill_accessing_process() will not be called in such case.
> 
> Thanks.
> Shuai
> .



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 4/4] mm/hwpoison: Fix incorrect "not recovered" report for recovered clean pages
  2025-02-14  6:54           ` Miaohe Lin
@ 2025-02-14  7:59             ` Shuai Xue
  2025-02-14 16:51             ` Luck, Tony
  1 sibling, 0 replies; 17+ messages in thread
From: Shuai Xue @ 2025-02-14  7:59 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: tglx, mingo, dave.hansen, x86, hpa, akpm, linux-edac,
	linux-kernel, linux-mm, baolin.wang, tianruidong, tony.luck, bp,
	nao.horiguchi



在 2025/2/14 14:54, Miaohe Lin 写道:
> On 2025/2/13 14:59, Shuai Xue wrote:
>>
>>
>> 在 2025/2/13 11:20, Miaohe Lin 写道:
>>> On 2025/2/12 21:55, Shuai Xue wrote:
>>>>
>>>>
>>>> 在 2025/2/12 16:09, Miaohe Lin 写道:
>>>>> On 2025/2/11 14:02, Shuai Xue wrote:
>>>>>> When an uncorrected memory error is consumed there is a race between
>>>>>> the CMCI from the memory controller reporting an uncorrected error
>>>>>> with a UCNA signature, and the core reporting and SRAR signature
>>>>>> machine check when the data is about to be consumed.
>>>>>>
>>>>>> If the CMCI wins that race, the page is marked poisoned when
>>>>>> uc_decode_notifier() calls memory_failure(). For dirty pages,
>>>>>> memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag,
>>>>>> converting the PTE to a hwpoison entry. However, for clean pages, the
>>>>>> TTU_HWPOISON flag is cleared, leaving the PTE unchanged and not converted
>>>>>> to a hwpoison entry. Consequently, for an unmapped dirty page, the PTE is
>>>>>> marked as a hwpoison entry allowing kill_accessing_process() to:
>>>>>>
>>>>>> - call walk_page_range() and return 1
>>>>>> - call kill_proc() to make sure a SIGBUS is sent
>>>>>> - return -EHWPOISON to indicate that SIGBUS is already sent to the process
>>>>>>      and kill_me_maybe() doesn't have to send it again.
>>>>>>
>>>>>> Conversely, for clean pages where PTE entries are not marked as hwpoison,
>>>>>> kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to send a
>>>>>> SIGBUS.
>>>>>>
>>>>>> Console log looks like this:
>>>>>>
>>>>>>        Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects
>>>>>>        Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered
>>>>>>        Memory failure: 0x827ca68: already hardware poisoned
>>>>>>        mce: Memory error not recovered
>>>>>>
>>>>>> To fix it, return -EHWPOISON if no hwpoison PTE entry is found, preventing
>>>>>> an unnecessary SIGBUS.
>>>>>
>>>>> Thanks for your patch.
>>>>>
>>>>>>
>>>>>> Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")
>>>>>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>>>>>> ---
>>>>>>     mm/memory-failure.c | 5 ++---
>>>>>>     1 file changed, 2 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>>>> index 995a15eb67e2..f9a6b136a6f0 100644
>>>>>> --- a/mm/memory-failure.c
>>>>>> +++ b/mm/memory-failure.c
>>>>>> @@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
>>>>>>                       (void *)&priv);
>>>>>>         if (ret == 1 && priv.tk.addr)
>>>>>>             kill_proc(&priv.tk, pfn, flags);
>>>>>> -    else
>>>>>> -        ret = 0;
>>>>>>         mmap_read_unlock(p->mm);
>>>>>> -    return ret > 0 ? -EHWPOISON : -EFAULT;
>>>>>> +
>>>>>> +    return ret >= 0 ? -EHWPOISON : -EFAULT;
>>>>>
>>>>> IIUC, kill_accessing_process() is supposed to return -EHWPOISON to notify that SIGBUS is already
>>>>> sent to the process and kill_me_maybe() doesn't have to send it again. But with your change,
>>>>> kill_accessing_process() will return -EHWPOISON even if SIGBUS is not sent. Does this break
>>>>> the semantics of -EHWPOISON?
>>>>
>>>> Yes, from the comment of kill_me_maybe(),
>>>>
>>>>        * -EHWPOISON from memory_failure() means that it already sent SIGBUS
>>>>        * to the current process with the proper error info,
>>>>        * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
>>>>
>>>> this patch break the comment.
>>>>
>>>> But the defination of EHWPOISON is quite different from the comment.
>>>>
>>>>    #define EHWPOISON    133    /* Memory page has hardware error */
>>>>
>>>> As for this issue, returning 0 or EHWPOISON can both prevent a SIGBUS signal
>>>> from being sent in kill_me_maybe().
>>>>
>>>> Which way do you prefer?
>>>>
>>>>>
>>>>> BTW I scanned the code of walk_page_range(). It seems with implementation of hwpoison_walk_ops
>>>>> walk_page_range() will only return 0 or 1, i.e. always >= 0. So kill_accessing_process() will always
>>>>> return -EHWPOISON if this patch is applied.
>>>>>
>>>>> Correct me if I miss something.
>>>>
>>>> Yes, you are right. Let's count the cases one by one:
>>>>
>>>> 1. clean page: try_to_remap(!TTU_HWPOISON), walk_page_range() will return 0 and
>>>
>>> Do you mean try_to_unmap?
>>
>> Yes, sorry for the typo.
>>>
>>>> we should not send sigbus in kill_me_maybe().
>>>>
>>>> 2. dirty page:
>>>> 2.1 MCE wins race
>>>>             CMCI:w/o Action Require         MCE: w/ Action Require
>>>>                                         TestSetPageHWPoison
>>>>         TestSetPageHWPoison
>>>>         return -EHWPOISON
>>>>                                         try_to_unmap(TTU_HWPOISON)
>>>>                                         kill_proc in hwpoison_user_mappings()
>>>>
>>>> If MCE wins the race, because the flag of memory_fialure() called by CMCI is
>>>> not set as MF_ACTION_REQUIRED, everything goes well, kill_proc() will send
>>>> SIGBUS in hwpoison_user_mappings().
>>>>
>>>> 2.2 CMCI win
>>>>             CMCI:w/o Action Require         MCE: w/ Action Require
>>>>       TestSetPageHWPoison
>>>>       try_to_unmap(TTU_HWPOISON)
>>>>                                          walk_page_range() return 1 due to hwpoison PTE entry
>>>>                                          kill_proc in kill_accessing_process()
>>>>
>>>> If the CMCI wins the race, we need to kill the process in
>>>> kill_accessing_process(). And if try_to_remap() success, everything goes well,
>>>> kill_proc() will send SIGBUS in kill_accessing_process().
>>>>
>>>> But if try_to_remap() fails, the PTE entry will not be marked as hwpoison, and
>>>> walk_page_range() return 0 as case 1 clean page, NO SIGBUS will be sent.
>>>
>>> If try_to_unmap() fails, the PTE entry will still point to the dirty page. Then in
>>> check_hwpoisoned_entry(), we will have pfn == poisoned_pfn. So walk_page_range()
>>> will return 1 in this case. Or am I miss something?
>>>
>>
>> You’re right; I overlooked the pte_present() branch.
>>
>> Therefore, in the walk_page_range() function:
>> - It returns 0 when the poison page is a clean page.
>> - It returns 1 when CMCI wins, regardless of whether try_to_unmap succeeds
>>    or fails.
>>
>> Then the patch will be like:
>>
>> @@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
>>                     (void *)&priv);
>>       if (ret == 1 && priv.tk.addr)
>>           kill_proc(&priv.tk, pfn, flags);
>> -    else
>> -        ret = 0;
>>       mmap_read_unlock(p->mm);
>> -    return ret > 0 ? -EHWPOISON : -EFAULT;
>> +
>> +    return ret > 0 ? -EHWPOISON : 0;
>>
>> Here, returning 0 indicates that memory_failure() successfully handled the
>> error by dropping the clean page.
> 
> I'm not sure whether there's another scene that can make walk_page_range() returns 0. But if the
> only reason for walk_page_range() returning 0 is the poison page is a clean page and it's dropped,
> then this modification should be appropriate. With this change, the callers never send SIGBUS now.
> They might need to be changed too.

Yes, if memory_failure() successfully handled the error, the caller should be nothing.

Thanks.
Shuai



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v1 2/4] x86/mce: dump error msg from severities
  2025-02-11 16:44   ` Luck, Tony
@ 2025-02-14  9:29     ` Shuai Xue
  2025-02-14 16:57       ` Luck, Tony
  0 siblings, 1 reply; 17+ messages in thread
From: Shuai Xue @ 2025-02-14  9:29 UTC (permalink / raw)
  To: Luck, Tony, bp, nao.horiguchi
  Cc: tglx, mingo, dave.hansen, x86, hpa, linmiaohe, akpm, linux-edac,
	linux-kernel, linux-mm, baolin.wang, tianruidong



在 2025/2/12 00:44, Luck, Tony 写道:
>> The message in severities is useful for identifying the type of MCE that
>> has occurred; dump it if it is valid.
>>
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>> ---
>>   arch/x86/kernel/cpu/mce/core.c | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
>> index 2919a077cd66..c1319db45b0a 100644
>> --- a/arch/x86/kernel/cpu/mce/core.c
>> +++ b/arch/x86/kernel/cpu/mce/core.c
>> @@ -1456,6 +1456,8 @@ static void queue_task_work(struct mce_hw_err *err, char *msg, void (*func)(stru
>>        if (count > 1)
>>                return;
>>
>> +     if (msg)
>> +             pr_err("%s\n", msg);
>>        task_work_add(current, &current->mce_kill_me, TWA_RESUME);
>>   }
> 
> This is called from the #MC handler. Is that a safe context to print a console
> message? It wasn't in the past, but maybe changes to how console messages
> are handled have changed this.
> 
> -Tony

#MC is a kind of NMI context, as far as I know, since

commit 42a0bb3f71383b457a7db362f1c69e7afb96732b
printk/nmi: generic solution for safe printk in NMI

print a console message is safe.

Please correct me if I missed anything.

Thanks.
Shuai


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH v1 4/4] mm/hwpoison: Fix incorrect "not recovered" report for recovered clean pages
  2025-02-14  6:54           ` Miaohe Lin
  2025-02-14  7:59             ` Shuai Xue
@ 2025-02-14 16:51             ` Luck, Tony
  1 sibling, 0 replies; 17+ messages in thread
From: Luck, Tony @ 2025-02-14 16:51 UTC (permalink / raw)
  To: Miaohe Lin, Shuai Xue
  Cc: tglx, mingo, dave.hansen, x86, hpa, akpm, linux-edac,
	linux-kernel, linux-mm, baolin.wang, tianruidong, bp,
	nao.horiguchi

> > Then the patch will be like:
> >
> > @@ -883,10 +883,9 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
> >                    (void *)&priv);
> >      if (ret == 1 && priv.tk.addr)
> >          kill_proc(&priv.tk, pfn, flags);
> > -    else
> > -        ret = 0;
> >      mmap_read_unlock(p->mm);
> > -    return ret > 0 ? -EHWPOISON : -EFAULT;
> > +
> > +    return ret > 0 ? -EHWPOISON : 0;
> >
> > Here, returning 0 indicates that memory_failure() successfully handled the
> > error by dropping the clean page.
>
> I'm not sure whether there's another scene that can make walk_page_range() returns 0. But if the
> only reason for walk_page_range() returning 0 is the poison page is a clean page and it's dropped,
> then this modification should be appropriate. With this change, the callers never send SIGBUS now.
> They might need to be changed too.

Note there shouldn't be a SIGBUS when the action was "dropping a clean page". Full recovery
is possible in this case (user process takes #PF, Linux allocates a new page and fills by reading
from storage).

-Tony

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: [PATCH v1 2/4] x86/mce: dump error msg from severities
  2025-02-14  9:29     ` Shuai Xue
@ 2025-02-14 16:57       ` Luck, Tony
  0 siblings, 0 replies; 17+ messages in thread
From: Luck, Tony @ 2025-02-14 16:57 UTC (permalink / raw)
  To: Shuai Xue, bp, nao.horiguchi
  Cc: tglx, mingo, dave.hansen, x86, hpa, linmiaohe, akpm, linux-edac,
	linux-kernel, linux-mm, baolin.wang, tianruidong

> > This is called from the #MC handler. Is that a safe context to print a console
> > message? It wasn't in the past, but maybe changes to how console messages
> > are handled have changed this.
> >
> > -Tony
>
> #MC is a kind of NMI context, as far as I know, since
>
> commit 42a0bb3f71383b457a7db362f1c69e7afb96732b
> printk/nmi: generic solution for safe printk in NMI
>
> print a console message is safe.
>
> Please correct me if I missed anything.

wow, that's v4.7 (ancient history). I thought I'd had issues with debug
messages in the machine check handler more recently than that, but
perhaps I'm misremembering.

-Tony

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2025-02-14 16:57 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-11  6:01 [PATCH v1 0/4] fmm/hwpoison: Fix regressions in memory failure handling Shuai Xue
2025-02-11  6:01 ` [PATCH v1 1/4] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY Shuai Xue
2025-02-11 16:51   ` Luck, Tony
2025-02-12  1:51     ` Shuai Xue
2025-02-11  6:01 ` [PATCH v1 2/4] x86/mce: dump error msg from severities Shuai Xue
2025-02-11 16:44   ` Luck, Tony
2025-02-14  9:29     ` Shuai Xue
2025-02-14 16:57       ` Luck, Tony
2025-02-11  6:01 ` [PATCH v1 3/4] x86/mce: add EX_TYPE_EFAULT_REG as in-kernel recovery context to fix copy-from-user operations regression Shuai Xue
2025-02-11  6:02 ` [PATCH v1 4/4] mm/hwpoison: Fix incorrect "not recovered" report for recovered clean pages Shuai Xue
2025-02-12  8:09   ` Miaohe Lin
2025-02-12 13:55     ` Shuai Xue
2025-02-13  3:20       ` Miaohe Lin
2025-02-13  6:59         ` Shuai Xue
2025-02-14  6:54           ` Miaohe Lin
2025-02-14  7:59             ` Shuai Xue
2025-02-14 16:51             ` Luck, Tony

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox