linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
@ 2025-02-17  6:33 Shuai Xue
  2025-02-17  6:33 ` [PATCH v2 1/5] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY Shuai Xue
                   ` (6 more replies)
  0 siblings, 7 replies; 59+ messages in thread
From: Shuai Xue @ 2025-02-17  6:33 UTC (permalink / raw)
  To: tony.luck, bp, nao.horiguchi
  Cc: tglx, mingo, dave.hansen, x86, hpa, linmiaohe, akpm, peterz,
	jpoimboe, linux-edac, linux-kernel, linux-mm, baolin.wang,
	tianruidong

changes singce v1:
- Patch 1: Fix cur_sev and sev type to `int` per Tony
- Patch 4: Fix return value to 0 for clean pages per Miaohe
- Patch 5: pick return value comments of memory-failure()

This patch addresses three regressions identified in memory failure
handling, as discovered using ras-tools[1]:

- `./einj_mem_uc copyin -f`
- `./einj_mem_uc futex -f`
- `./einj_mem_uc instr`

The regressions in the copyin and futex cases were caused by the
replacement of `EX_TYPE_UACCESS` with `EX_TYPE_EFAULT_REG` in some
copy-from-user operations, leading to kernel panics. The instr case
regression resulted from the PTE entry not being marked as hwpoison,
causing the system to send unnecessary SIGBUS signals.

These fixes ensure proper handling of memory errors and prevent kernel
panics and unnecessary signal dispatch.

[1]https://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git

Shuai Xue (5):
  x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY
  x86/mce: dump error msg from severities
  x86/mce: add EX_TYPE_EFAULT_REG as in-kernel recovery context to fix
    copy-from-user operations regression
  mm/hwpoison: Fix incorrect "not recovered" report for recovered clean
    pages
  mm: memory-failure: move return value documentation to function
    declaration

 arch/x86/kernel/cpu/mce/core.c     | 26 +++++++++++++-------------
 arch/x86/kernel/cpu/mce/severity.c | 21 ++++++++++++++++-----
 mm/memory-failure.c                | 21 +++++++++++++++------
 3 files changed, 44 insertions(+), 24 deletions(-)

-- 
2.39.3



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v2 1/5] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY
  2025-02-17  6:33 [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling Shuai Xue
@ 2025-02-17  6:33 ` Shuai Xue
  2025-02-18  7:58   ` Borislav Petkov
  2025-02-17  6:33 ` [PATCH v2 2/5] x86/mce: dump error msg from severities Shuai Xue
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 59+ messages in thread
From: Shuai Xue @ 2025-02-17  6:33 UTC (permalink / raw)
  To: tony.luck, bp, nao.horiguchi
  Cc: tglx, mingo, dave.hansen, x86, hpa, linmiaohe, akpm, peterz,
	jpoimboe, linux-edac, linux-kernel, linux-mm, baolin.wang,
	tianruidong

Currently, mce_no_way_out() only collects error messages when the error
severity is equal to `MCE_PANIC_SEVERITY`. To improve diagnostics,
modify the behavior to also collect error messages when the severity is
less than `MCE_PANIC_SEVERITY`.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 arch/x86/kernel/cpu/mce/core.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 0dc00c9894c7..f2e730d4acc5 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -925,12 +925,13 @@ static __always_inline void quirk_zen_ifu(int bank, struct mce *m, struct pt_reg
  * Do a quick check if any of the events requires a panic.
  * This decides if we keep the events around or clear them.
  */
-static __always_inline int mce_no_way_out(struct mce_hw_err *err, char **msg, unsigned long *validp,
-					  struct pt_regs *regs)
+static __always_inline bool mce_no_way_out(struct mce_hw_err *err, char **msg,
+					   unsigned long *validp,
+					   struct pt_regs *regs)
 {
 	struct mce *m = &err->m;
 	char *tmp = *msg;
-	int i;
+	int i, cur_sev = MCE_NO_SEVERITY, sev;
 
 	for (i = 0; i < this_cpu_read(mce_num_banks); i++) {
 		m->status = mce_rdmsrl(mca_msr_reg(i, MCA_STATUS));
@@ -945,13 +946,17 @@ static __always_inline int mce_no_way_out(struct mce_hw_err *err, char **msg, un
 			quirk_zen_ifu(i, m, regs);
 
 		m->bank = i;
-		if (mce_severity(m, regs, &tmp, true) >= MCE_PANIC_SEVERITY) {
+		sev = mce_severity(m, regs, &tmp, true);
+		if (sev >= cur_sev) {
 			mce_read_aux(err, i);
 			*msg = tmp;
-			return 1;
+			cur_sev = sev;
 		}
+
+		if (cur_sev == MCE_PANIC_SEVERITY)
+			return true;
 	}
-	return 0;
+	return false;
 }
 
 /*
-- 
2.39.3



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v2 2/5] x86/mce: dump error msg from severities
  2025-02-17  6:33 [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling Shuai Xue
  2025-02-17  6:33 ` [PATCH v2 1/5] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY Shuai Xue
@ 2025-02-17  6:33 ` Shuai Xue
  2025-02-28 12:37   ` Borislav Petkov
  2025-02-17  6:33 ` [PATCH v2 3/5] x86/mce: add EX_TYPE_EFAULT_REG as in-kernel recovery context to fix copy-from-user operations regression Shuai Xue
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 59+ messages in thread
From: Shuai Xue @ 2025-02-17  6:33 UTC (permalink / raw)
  To: tony.luck, bp, nao.horiguchi
  Cc: tglx, mingo, dave.hansen, x86, hpa, linmiaohe, akpm, peterz,
	jpoimboe, linux-edac, linux-kernel, linux-mm, baolin.wang,
	tianruidong

The message in severities is useful for identifying the type of MCE that
has occurred; dump it if it is valid.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 arch/x86/kernel/cpu/mce/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index f2e730d4acc5..c1b945dbccdf 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1456,6 +1456,8 @@ static void queue_task_work(struct mce_hw_err *err, char *msg, void (*func)(stru
 	if (count > 1)
 		return;
 
+	if (msg)
+		pr_err("%s\n", msg);
 	task_work_add(current, &current->mce_kill_me, TWA_RESUME);
 }
 
-- 
2.39.3



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v2 3/5] x86/mce: add EX_TYPE_EFAULT_REG as in-kernel recovery context to fix copy-from-user operations regression
  2025-02-17  6:33 [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling Shuai Xue
  2025-02-17  6:33 ` [PATCH v2 1/5] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY Shuai Xue
  2025-02-17  6:33 ` [PATCH v2 2/5] x86/mce: dump error msg from severities Shuai Xue
@ 2025-02-17  6:33 ` Shuai Xue
  2025-02-18 12:54   ` Peter Zijlstra
  2025-02-17  6:33 ` [PATCH v2 4/5] mm/hwpoison: Fix incorrect "not recovered" report for recovered clean pages Shuai Xue
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 59+ messages in thread
From: Shuai Xue @ 2025-02-17  6:33 UTC (permalink / raw)
  To: tony.luck, bp, nao.horiguchi
  Cc: tglx, mingo, dave.hansen, x86, hpa, linmiaohe, akpm, peterz,
	jpoimboe, linux-edac, linux-kernel, linux-mm, baolin.wang,
	tianruidong

Commit 4c132d1d844a ("x86/futex: Remove .fixup usage") introduced a new
extable fixup type, EX_TYPE_EFAULT_REG, and later patches updated the
extable fixup type for copy-from-user operations, changing it from
EX_TYPE_UACCESS to EX_TYPE_EFAULT_REG.

Specifically, commit 99641e094d6c ("x86/uaccess: Remove .fixup usage")
altered the extable fixup type for the get_user family, while commit
4c132d1d844a ("x86/futex: Remove .fixup usage") addressed the futex
operations. This change inadvertently caused a regression where the error
context for some copy-from-user operations no longer functions as an
in-kernel recovery context, leading to kernel panics with the message:
"Machine check: Data load in unrecoverable area of kernel."

To fix the regression, add EX_TYPE_EFAULT_REG as a in-kernel recovery
context for copy-from-user operations.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Fixes: 4c132d1d844a ("x86/futex: Remove .fixup usage")
Cc: stable@vger.kernel.org
---
 arch/x86/kernel/cpu/mce/severity.c | 21 ++++++++++++++++-----
 1 file changed, 16 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
index dac4d64dfb2a..14c2d71c3ce1 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -16,6 +16,7 @@
 #include <asm/traps.h>
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
+#include <linux/extable.h>
 
 #include "internal.h"
 
@@ -285,7 +286,8 @@ static bool is_copy_from_user(struct pt_regs *regs)
  */
 static noinstr int error_context(struct mce *m, struct pt_regs *regs)
 {
-	int fixup_type;
+	const struct exception_table_entry *e;
+	int fixup_type, imm;
 	bool copy_user;
 
 	if ((m->cs & 3) == 3)
@@ -294,9 +296,14 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
 	if (!mc_recoverable(m->mcgstatus))
 		return IN_KERNEL;
 
+	e = search_exception_tables(m->ip);
+	if (!e)
+		return IN_KERNEL;
+
 	/* Allow instrumentation around external facilities usage. */
 	instrumentation_begin();
-	fixup_type = ex_get_fixup_type(m->ip);
+	fixup_type = FIELD_GET(EX_DATA_TYPE_MASK, e->data);
+	imm  = FIELD_GET(EX_DATA_IMM_MASK,  e->data);
 	copy_user  = is_copy_from_user(regs);
 	instrumentation_end();
 
@@ -304,9 +311,13 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
 	case EX_TYPE_UACCESS:
 		if (!copy_user)
 			return IN_KERNEL;
-		m->kflags |= MCE_IN_KERNEL_COPYIN;
-		fallthrough;
-
+		m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_RECOV;
+		return IN_KERNEL_RECOV;
+	case EX_TYPE_IMM_REG:
+		if (!copy_user || imm != -EFAULT)
+			return IN_KERNEL;
+		m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_RECOV;
+		return IN_KERNEL_RECOV;
 	case EX_TYPE_FAULT_MCE_SAFE:
 	case EX_TYPE_DEFAULT_MCE_SAFE:
 		m->kflags |= MCE_IN_KERNEL_RECOV;
-- 
2.39.3



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v2 4/5] mm/hwpoison: Fix incorrect "not recovered" report for recovered clean pages
  2025-02-17  6:33 [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling Shuai Xue
                   ` (2 preceding siblings ...)
  2025-02-17  6:33 ` [PATCH v2 3/5] x86/mce: add EX_TYPE_EFAULT_REG as in-kernel recovery context to fix copy-from-user operations regression Shuai Xue
@ 2025-02-17  6:33 ` Shuai Xue
  2025-02-19  6:34   ` Miaohe Lin
  2025-02-17  6:33 ` [PATCH v2 5/5] mm: memory-failure: move return value documentation to function declaration Shuai Xue
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 59+ messages in thread
From: Shuai Xue @ 2025-02-17  6:33 UTC (permalink / raw)
  To: tony.luck, bp, nao.horiguchi
  Cc: tglx, mingo, dave.hansen, x86, hpa, linmiaohe, akpm, peterz,
	jpoimboe, linux-edac, linux-kernel, linux-mm, baolin.wang,
	tianruidong

When an uncorrected memory error is consumed there is a race between
the CMCI from the memory controller reporting an uncorrected error
with a UCNA signature, and the core reporting and SRAR signature
machine check when the data is about to be consumed.

If the CMCI wins that race, the page is marked poisoned when
uc_decode_notifier() calls memory_failure(). For dirty pages,
memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag,
converting the PTE to a hwpoison entry. As a result,
kill_accessing_process():

- call walk_page_range() and return 1 regardless of whether
  try_to_unmap() succeeds or fails,
- call kill_proc() to make sure a SIGBUS is sent
- return -EHWPOISON to indicate that SIGBUS is already sent to the
  process and kill_me_maybe() doesn't have to send it again.

However, for clean pages, the TTU_HWPOISON flag is cleared, leaving the
PTE unchanged and not converted to a hwpoison entry. Conversely, for
clean pages where PTE entries are not marked as hwpoison,
kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to
send a SIGBUS.

Console log looks like this:

    Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects
    Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered
    Memory failure: 0x827ca68: already hardware poisoned
    mce: Memory error not recovered

To fix it, return 0 for "corrupted page was clean", preventing an
unnecessary SIGBUS.

Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Cc: stable@vger.kernel.org
---
 mm/memory-failure.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 995a15eb67e2..b037952565be 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -881,12 +881,17 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
 	mmap_read_lock(p->mm);
 	ret = walk_page_range(p->mm, 0, TASK_SIZE, &hwpoison_walk_ops,
 			      (void *)&priv);
+	/*
+	 * ret = 1 when CMCI wins, regardless of whether try_to_unmap()
+	 * succeeds or fails, then kill the process with SIGBUS.
+	 * ret = 0 when poison page is a clean page and it's dropped, no
+	 * SIGBUS is needed.
+	 */
 	if (ret == 1 && priv.tk.addr)
 		kill_proc(&priv.tk, pfn, flags);
-	else
-		ret = 0;
 	mmap_read_unlock(p->mm);
-	return ret > 0 ? -EHWPOISON : -EFAULT;
+
+	return ret > 0 ? -EHWPOISON : 0;
 }
 
 /*
-- 
2.39.3



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [PATCH v2 5/5] mm: memory-failure: move return value documentation to function declaration
  2025-02-17  6:33 [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling Shuai Xue
                   ` (3 preceding siblings ...)
  2025-02-17  6:33 ` [PATCH v2 4/5] mm/hwpoison: Fix incorrect "not recovered" report for recovered clean pages Shuai Xue
@ 2025-02-17  6:33 ` Shuai Xue
  2025-02-19  6:31   ` Miaohe Lin
  2025-02-18  3:29 ` [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling Andrew Morton
  2025-02-18  8:27 ` Borislav Petkov
  6 siblings, 1 reply; 59+ messages in thread
From: Shuai Xue @ 2025-02-17  6:33 UTC (permalink / raw)
  To: tony.luck, bp, nao.horiguchi
  Cc: tglx, mingo, dave.hansen, x86, hpa, linmiaohe, akpm, peterz,
	jpoimboe, linux-edac, linux-kernel, linux-mm, baolin.wang,
	tianruidong

Part of return value comments for memory_failure() were originally
documented at the call site. Move those comments to the function
declaration to improve code readability and to provide developers with
immediate access to function usage and return information.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Reviewed-by: Jane Chu <jane.chu@oracle.com>
---
 arch/x86/kernel/cpu/mce/core.c |  7 -------
 mm/memory-failure.c            | 10 +++++++---
 2 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index c1b945dbccdf..c81695f320bc 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1403,13 +1403,6 @@ static void kill_me_maybe(struct callback_head *cb)
 		return;
 	}
 
-	/*
-	 * -EHWPOISON from memory_failure() means that it already sent SIGBUS
-	 * to the current process with the proper error info,
-	 * -EOPNOTSUPP means hwpoison_filter() filtered the error event,
-	 *
-	 * In both cases, no further processing is required.
-	 */
 	if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
 		return;
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index b037952565be..8649849bcdb4 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2216,9 +2216,13 @@ static void kill_procs_now(struct page *p, unsigned long pfn, int flags,
  * Must run in process context (e.g. a work queue) with interrupts
  * enabled and no spinlocks held.
  *
- * Return: 0 for successfully handled the memory error,
- *         -EOPNOTSUPP for hwpoison_filter() filtered the error event,
- *         < 0(except -EOPNOTSUPP) on failure.
+ * Return:
+ *   0             - success,
+ *   -ENXIO        - memory not managed by the kernel
+ *   -EOPNOTSUPP   - hwpoison_filter() filtered the error event,
+ *   -EHWPOISON    - the page was already poisoned, potentially
+ *                   kill process,
+ *   other negative values - failure.
  */
 int memory_failure(unsigned long pfn, int flags)
 {
-- 
2.39.3



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-17  6:33 [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling Shuai Xue
                   ` (4 preceding siblings ...)
  2025-02-17  6:33 ` [PATCH v2 5/5] mm: memory-failure: move return value documentation to function declaration Shuai Xue
@ 2025-02-18  3:29 ` Andrew Morton
  2025-02-18  8:03   ` Borislav Petkov
  2025-02-18  8:27 ` Borislav Petkov
  6 siblings, 1 reply; 59+ messages in thread
From: Andrew Morton @ 2025-02-18  3:29 UTC (permalink / raw)
  To: Shuai Xue
  Cc: tony.luck, bp, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, peterz, jpoimboe, linux-edac, linux-kernel, linux-mm,
	baolin.wang, tianruidong

On Mon, 17 Feb 2025 14:33:30 +0800 Shuai Xue <xueshuai@linux.alibaba.com> wrote:

> This patch addresses three regressions identified in memory failure
> handling, as discovered using ras-tools[1]:

Thanks.  I added these to mm.git for further exposure and testing. 
They do appear to be more an x86 thing so I'll do the usual thing: if
they later turn up in the x86 tree I shall drop the mm.git copies.  


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 1/5] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY
  2025-02-17  6:33 ` [PATCH v2 1/5] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY Shuai Xue
@ 2025-02-18  7:58   ` Borislav Petkov
  2025-02-18  9:39     ` Shuai Xue
  0 siblings, 1 reply; 59+ messages in thread
From: Borislav Petkov @ 2025-02-18  7:58 UTC (permalink / raw)
  To: Shuai Xue
  Cc: tony.luck, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong

On Mon, Feb 17, 2025 at 02:33:31PM +0800, Shuai Xue wrote:
> Currently, mce_no_way_out() only collects error messages when the error
> severity is equal to `MCE_PANIC_SEVERITY`. To improve diagnostics,
> modify the behavior to also collect error messages when the severity is
> less than `MCE_PANIC_SEVERITY`.

That function is literally called "no way out" as in, is the error severe
enough so that there is no way out.

Now you went and stomped all over that to "improve diagnostics". What
diagnostsics? Your commit messages need to explain in detail why exactly
a patch exists.

So nope.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-18  3:29 ` [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling Andrew Morton
@ 2025-02-18  8:03   ` Borislav Petkov
  0 siblings, 0 replies; 59+ messages in thread
From: Borislav Petkov @ 2025-02-18  8:03 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Shuai Xue, tony.luck, nao.horiguchi, tglx, mingo, dave.hansen,
	x86, hpa, linmiaohe, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong

On Mon, Feb 17, 2025 at 07:29:54PM -0800, Andrew Morton wrote:
> On Mon, 17 Feb 2025 14:33:30 +0800 Shuai Xue <xueshuai@linux.alibaba.com> wrote:
> 
> > This patch addresses three regressions identified in memory failure
> > handling, as discovered using ras-tools[1]:
> 
> Thanks.  I added these to mm.git for further exposure and testing. 
> They do appear to be more an x86 thing so I'll do the usual thing: if
> they later turn up in the x86 tree I shall drop the mm.git copies.  

Please drop 'em. They look confused and need proper review first.

Also, your commits have

"Cc: Acked-by:Thomas Gleixner <tglx@linutronix.de>"

When did tglx ack them? I don't see anything in my mbox.

Andrew, again, please do not take unreviewed x86 patches through your tree
without synchronizing with us. This keeps happening and it is not helping at
all - the opposite is happening.

:-(

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-17  6:33 [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling Shuai Xue
                   ` (5 preceding siblings ...)
  2025-02-18  3:29 ` [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling Andrew Morton
@ 2025-02-18  8:27 ` Borislav Petkov
  2025-02-18 11:31   ` Shuai Xue
  6 siblings, 1 reply; 59+ messages in thread
From: Borislav Petkov @ 2025-02-18  8:27 UTC (permalink / raw)
  To: Shuai Xue
  Cc: tony.luck, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong

On Mon, Feb 17, 2025 at 02:33:30PM +0800, Shuai Xue wrote:
> changes singce v1:
> - Patch 1: Fix cur_sev and sev type to `int` per Tony
> - Patch 4: Fix return value to 0 for clean pages per Miaohe
> - Patch 5: pick return value comments of memory-failure()
> 
> This patch addresses three regressions identified in memory failure
> handling, as discovered using ras-tools[1]:
> 
> - `./einj_mem_uc copyin -f`
> - `./einj_mem_uc futex -f`
> - `./einj_mem_uc instr`

This is not how you write a problem statement and explain why your patches
exist.

You need to state:

1. What are you trying to do
2. What is the expected outcome and why
3. What actually happens and why
4. The fix, in your opinion, should be X or Y

Not quote some ras tools commands. Show me that you actually know what you're
doing and explain the problem in human understandable way.  And then we can
talk fixes.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 1/5] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY
  2025-02-18  7:58   ` Borislav Petkov
@ 2025-02-18  9:39     ` Shuai Xue
  2025-02-18  9:50       ` Borislav Petkov
  0 siblings, 1 reply; 59+ messages in thread
From: Shuai Xue @ 2025-02-18  9:39 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: tony.luck, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong



在 2025/2/18 15:58, Borislav Petkov 写道:
> On Mon, Feb 17, 2025 at 02:33:31PM +0800, Shuai Xue wrote:
>> Currently, mce_no_way_out() only collects error messages when the error
>> severity is equal to `MCE_PANIC_SEVERITY`. To improve diagnostics,
>> modify the behavior to also collect error messages when the severity is
>> less than `MCE_PANIC_SEVERITY`.
> 
> That function is literally called "no way out" as in, is the error severe
> enough so that there is no way out.
> 
> Now you went and stomped all over that to "improve diagnostics". What
> diagnostsics? Your commit messages need to explain in detail why exactly
> a patch exists.
> 
> So nope.
> 

Hi, Borislav,

Thank you for reply.

The msg in predefined `severities`, e.g.

	MCESEV(
		AO, "Action optional: last level cache writeback error",
		SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB)
		),

is helpful for users to know what kind of MCE is happened. For a fatal machine
check, kernel panic use the message and I want to extend to collect the message
and print it out for non-fatal one.

If you don't object, let's go on to discuss how to implement it.
Otherwise, you can ignore the following response.

Yes, mce_no_way_out() means "no way out" literally. It only collects message
for MCE_PANIC_SEVERITY but use in common path. So I used this function to
extend it to non-fatal, assuming it was obvious.
  
Is __mc_scan_banks() a proper function to extend?

Thanks.
Shuai


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 1/5] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY
  2025-02-18  9:39     ` Shuai Xue
@ 2025-02-18  9:50       ` Borislav Petkov
  0 siblings, 0 replies; 59+ messages in thread
From: Borislav Petkov @ 2025-02-18  9:50 UTC (permalink / raw)
  To: Shuai Xue
  Cc: tony.luck, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong

On Tue, Feb 18, 2025 at 05:39:33PM +0800, Shuai Xue wrote:
> Is __mc_scan_banks() a proper function to extend?

No, first you need to explain the big picture:

https://lore.kernel.org/r/20250218082727.GCZ7REb7OG6NTAY-V-@fat_crate.local

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-18  8:27 ` Borislav Petkov
@ 2025-02-18 11:31   ` Shuai Xue
  2025-02-18 12:24     ` Borislav Petkov
  0 siblings, 1 reply; 59+ messages in thread
From: Shuai Xue @ 2025-02-18 11:31 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: tony.luck, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong



在 2025/2/18 16:27, Borislav Petkov 写道:
> On Mon, Feb 17, 2025 at 02:33:30PM +0800, Shuai Xue wrote:
>> changes singce v1:
>> - Patch 1: Fix cur_sev and sev type to `int` per Tony
>> - Patch 4: Fix return value to 0 for clean pages per Miaohe
>> - Patch 5: pick return value comments of memory-failure()
>>
>> This patch addresses three regressions identified in memory failure
>> handling, as discovered using ras-tools[1]:
>>
>> - `./einj_mem_uc copyin -f`
>> - `./einj_mem_uc futex -f`
>> - `./einj_mem_uc instr`
> 
> This is not how you write a problem statement and explain why your patches
> exist.
> 
> You need to state:
> 
> 1. What are you trying to do
> 2. What is the expected outcome and why
> 3. What actually happens and why
> 4. The fix, in your opinion, should be X or Y
> 
> Not quote some ras tools commands. Show me that you actually know what you're
> doing and explain the problem in human understandable way.  And then we can
> talk fixes.
> 
> Thx.
> 

Sorry for the confusion.

> 1. What are you trying to do

I am tring to fix two memory failure regression in upstream kernel compared
with 5.10 LTS.

- copyin case: poison found while copying from user space.
- instr case: poison found while instruction fetching in user space

> 2. What is the expected outcome and why

For copyin case:

Kernel can recover from poison found while copying from user space.  MCE check
the fixup handler type to decide whether an in kernel #MC can be recovered.
When EX_TYPE_UACCESS is found, the PC jumps to recovery code specified in
_ASM_EXTABLE_FAULT() and return a -EFAULT to user space.

For instr case:

If a poison found while instruction fetching in user space, full recovery is
possible. User process takes #PF, Linux allocates a new page and fills by
reading from storage.

> 3. What actually happens and why

For copyin case: kernel panic since v5.17

Commit 4c132d1d844a ("x86/futex: Remove .fixup usage") introduced a new extable
fixup type, EX_TYPE_EFAULT_REG, and later patches updated the extable fixup
type for copy-from-user operations, changing it from EX_TYPE_UACCESS to
EX_TYPE_EFAULT_REG.

For instr case: user process is killed by a SIGBUS signal

Commit 046545a661af ("mm/hwpoison: fix error page recovered but reported "not
recovered"") introduced a bug that kill_accessing_process() return -EHWPOISON
for instr case, as result, kill_me_maybe() send a SIGBUS to user process.

> 4. The fix, in your opinion, should be X or Y

For copyin case: add EX_TYPE_EFAULT_REG as a recovery type.
For instr case: let kill_accessing_process return 0 to prevent a SIGBUS.

For patch 1 and 2:

While debuging the two regression, I found `msg` in predefined `severities`, e.g.

     MCESEV(
         AO, "Action optional: last level cache writeback error",
         SER, MASK(MCI_UC_AR|MCACOD, MCI_STATUS_UC|MCACOD_L3WB)
         ),

is helpful for me to know what kind of MCE is happened. For a fatal machine
check, kernel panic use the message and I want to extend to collect the message
and print it out for non-fatal one.

For patch 5:

The return value of memory_failure() is quite important while discussed instr
case regression with Tony and Miaohe for patch 4, so move comment to the place
it belongs to.

I hope the information provided above effectively addresses your concerns.
Please feel free to let me know if you have any further questions or need
additional clarification.

Thanks.
Shuai


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-18 11:31   ` Shuai Xue
@ 2025-02-18 12:24     ` Borislav Petkov
  2025-02-18 13:08       ` Shuai Xue
  2025-02-18 17:30       ` Luck, Tony
  0 siblings, 2 replies; 59+ messages in thread
From: Borislav Petkov @ 2025-02-18 12:24 UTC (permalink / raw)
  To: Shuai Xue, tony.luck
  Cc: nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa, linmiaohe,
	akpm, peterz, jpoimboe, linux-edac, linux-kernel, linux-mm,
	baolin.wang, tianruidong

On Tue, Feb 18, 2025 at 07:31:34PM +0800, Shuai Xue wrote:
> Kernel can recover from poison found while copying from user space.

Where was that poison found? On user pages? So reading them consumes the
poison?

So you're not really seeing real issues on real hw - you're using ras tools to
trigger those, correct?

If so, what guarantees ras tools are doing the right thing?

> MCE check the fixup handler type to decide whether an in kernel #MC can be
> recovered.  When EX_TYPE_UACCESS is found,

Sounds like poison on user memory...

> the PC jumps to recovery code specified in _ASM_EXTABLE_FAULT() and return
> a -EFAULT to user space.

> For instr case:
> 
> If a poison found while instruction fetching in user space, full recovery is
> possible. User process takes #PF, Linux allocates a new page and fills by
> reading from storage.
> 
> > 3. What actually happens and why
> 
> For copyin case: kernel panic since v5.17
> 
> Commit 4c132d1d844a ("x86/futex: Remove .fixup usage") introduced a new extable
> fixup type, EX_TYPE_EFAULT_REG, and later patches updated the extable fixup
> type for copy-from-user operations, changing it from EX_TYPE_UACCESS to
> EX_TYPE_EFAULT_REG.

What do futexes have to do with copying user memory?

> For instr case: user process is killed by a SIGBUS signal
> 
> Commit 046545a661af ("mm/hwpoison: fix error page recovered but reported "not
> recovered"") introduced a bug that kill_accessing_process() return -EHWPOISON
> for instr case, as result, kill_me_maybe() send a SIGBUS to user process.

This makes my head hurt... a race between the CMCI reporting an uncorrected
error... why does the CMCI report uncorrected errors? This sounds like some
nasty confusion.

And you've basically reused the format and wording of 046545a661af for your
commit message and makes staring at those a PITA.

Tony, what's going on with that CMCI and SRAR race?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 3/5] x86/mce: add EX_TYPE_EFAULT_REG as in-kernel recovery context to fix copy-from-user operations regression
  2025-02-17  6:33 ` [PATCH v2 3/5] x86/mce: add EX_TYPE_EFAULT_REG as in-kernel recovery context to fix copy-from-user operations regression Shuai Xue
@ 2025-02-18 12:54   ` Peter Zijlstra
  2025-02-18 13:02     ` Peter Zijlstra
  2025-02-18 13:28     ` Shuai Xue
  0 siblings, 2 replies; 59+ messages in thread
From: Peter Zijlstra @ 2025-02-18 12:54 UTC (permalink / raw)
  To: Shuai Xue
  Cc: tony.luck, bp, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, jpoimboe, linux-edac, linux-kernel, linux-mm,
	baolin.wang, tianruidong

On Mon, Feb 17, 2025 at 02:33:33PM +0800, Shuai Xue wrote:

> diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
> index dac4d64dfb2a..14c2d71c3ce1 100644
> --- a/arch/x86/kernel/cpu/mce/severity.c
> +++ b/arch/x86/kernel/cpu/mce/severity.c
> @@ -16,6 +16,7 @@
>  #include <asm/traps.h>
>  #include <asm/insn.h>
>  #include <asm/insn-eval.h>
> +#include <linux/extable.h>
>  
>  #include "internal.h"
>  
> @@ -285,7 +286,8 @@ static bool is_copy_from_user(struct pt_regs *regs)
>   */
>  static noinstr int error_context(struct mce *m, struct pt_regs *regs)
>  {
> -	int fixup_type;
> +	const struct exception_table_entry *e;
> +	int fixup_type, imm;
>  	bool copy_user;
>  
>  	if ((m->cs & 3) == 3)
> @@ -294,9 +296,14 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
>  	if (!mc_recoverable(m->mcgstatus))
>  		return IN_KERNEL;
>  
> +	e = search_exception_tables(m->ip);
> +	if (!e)
> +		return IN_KERNEL;

You didn't actually build this, did you? Or did you ignore the extra
noinstr warnings?

>  	/* Allow instrumentation around external facilities usage. */
>  	instrumentation_begin();
> -	fixup_type = ex_get_fixup_type(m->ip);
> +	fixup_type = FIELD_GET(EX_DATA_TYPE_MASK, e->data);
> +	imm  = FIELD_GET(EX_DATA_IMM_MASK,  e->data);
>  	copy_user  = is_copy_from_user(regs);
>  	instrumentation_end();
>  
> @@ -304,9 +311,13 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
>  	case EX_TYPE_UACCESS:
>  		if (!copy_user)
>  			return IN_KERNEL;
> -		m->kflags |= MCE_IN_KERNEL_COPYIN;
> -		fallthrough;
> -
> +		m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_RECOV;
> +		return IN_KERNEL_RECOV;
> +	case EX_TYPE_IMM_REG:
> +		if (!copy_user || imm != -EFAULT)
> +			return IN_KERNEL;
> +		m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_RECOV;
> +		return IN_KERNEL_RECOV;

Maybe I'm justnot understanding things, but what's wrong with something
like the below; why do we care about the ex-type if we know its a MOV
reading from userspace?

The less we muck about with the extable here, the better.

---
diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
index dac4d64dfb2a..cb021058165f 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -300,13 +300,12 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
 	copy_user  = is_copy_from_user(regs);
 	instrumentation_end();
 
-	switch (fixup_type) {
-	case EX_TYPE_UACCESS:
-		if (!copy_user)
-			return IN_KERNEL;
-		m->kflags |= MCE_IN_KERNEL_COPYIN;
-		fallthrough;
+	if (copy_user) {
+		m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_COPYIN;
+		return IN_KERNEL_RECOV
+	}
 
+	switch (fixup_type) {
 	case EX_TYPE_FAULT_MCE_SAFE:
 	case EX_TYPE_DEFAULT_MCE_SAFE:
 		m->kflags |= MCE_IN_KERNEL_RECOV;


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 3/5] x86/mce: add EX_TYPE_EFAULT_REG as in-kernel recovery context to fix copy-from-user operations regression
  2025-02-18 12:54   ` Peter Zijlstra
@ 2025-02-18 13:02     ` Peter Zijlstra
  2025-02-18 14:03       ` Shuai Xue
  2025-02-18 13:28     ` Shuai Xue
  1 sibling, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2025-02-18 13:02 UTC (permalink / raw)
  To: Shuai Xue
  Cc: tony.luck, bp, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, jpoimboe, linux-edac, linux-kernel, linux-mm,
	baolin.wang, tianruidong

On Tue, Feb 18, 2025 at 01:54:08PM +0100, Peter Zijlstra wrote:

> ---
> diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
> index dac4d64dfb2a..cb021058165f 100644
> --- a/arch/x86/kernel/cpu/mce/severity.c
> +++ b/arch/x86/kernel/cpu/mce/severity.c
> @@ -300,13 +300,12 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
>  	copy_user  = is_copy_from_user(regs);
>  	instrumentation_end();
>  
> -	switch (fixup_type) {
> -	case EX_TYPE_UACCESS:
> -		if (!copy_user)
> -			return IN_KERNEL;
> -		m->kflags |= MCE_IN_KERNEL_COPYIN;
> -		fallthrough;
> +	if (copy_user) {
> +		m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_COPYIN;

Typing is hard, obviously that second should'be been _RECOV.

> +		return IN_KERNEL_RECOV

But why are we having that bit *and* a return value saying the same
thing?

> +	}
>  
> +	switch (fixup_type) {
>  	case EX_TYPE_FAULT_MCE_SAFE:
>  	case EX_TYPE_DEFAULT_MCE_SAFE:
>  		m->kflags |= MCE_IN_KERNEL_RECOV;


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-18 12:24     ` Borislav Petkov
@ 2025-02-18 13:08       ` Shuai Xue
  2025-02-18 13:17         ` Borislav Petkov
  2025-02-18 17:59         ` Luck, Tony
  2025-02-18 17:30       ` Luck, Tony
  1 sibling, 2 replies; 59+ messages in thread
From: Shuai Xue @ 2025-02-18 13:08 UTC (permalink / raw)
  To: Borislav Petkov, tony.luck
  Cc: nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa, linmiaohe,
	akpm, peterz, jpoimboe, linux-edac, linux-kernel, linux-mm,
	baolin.wang, tianruidong



在 2025/2/18 20:24, Borislav Petkov 写道:
> On Tue, Feb 18, 2025 at 07:31:34PM +0800, Shuai Xue wrote:
>> Kernel can recover from poison found while copying from user space.
> 
> Where was that poison found? On user pages? So reading them consumes the
> poison?

Yes, the poison is found on user pages.

Form commit log, the mechanism is added by Tony and suggested by you.
https://lkml.kernel.org/r/20210818002942.1607544-3-tony.luck@intel.com

> 
> So you're not really seeing real issues on real hw - you're using ras tools to
> trigger those, correct?
> 
> If so, what guarantees ras tools are doing the right thing?

Ras-tools do it by three step:

- alloc memory in userspace
- inject two bit ECC error (uncorretable error) to the memory (by EINJ interface)
- write the memory to a file fd ( by write(2) )

It's the same as with real issue. There's no magic to it.

Doesn't AMD support it?

> 
>> MCE check the fixup handler type to decide whether an in kernel #MC can be
>> recovered.  When EX_TYPE_UACCESS is found,
> 
> Sounds like poison on user memory...

Yes, sorry for confusion.

> 
>> the PC jumps to recovery code specified in _ASM_EXTABLE_FAULT() and return
>> a -EFAULT to user space.
> 
>> For instr case:
>>
>> If a poison found while instruction fetching in user space, full recovery is
>> possible. User process takes #PF, Linux allocates a new page and fills by
>> reading from storage.
>>
>>> 3. What actually happens and why
>>
>> For copyin case: kernel panic since v5.17
>>
>> Commit 4c132d1d844a ("x86/futex: Remove .fixup usage") introduced a new extable
>> fixup type, EX_TYPE_EFAULT_REG, and later patches updated the extable fixup
>> type for copy-from-user operations, changing it from EX_TYPE_UACCESS to
>> EX_TYPE_EFAULT_REG.
> 
> What do futexes have to do with copying user memory?

Return -EFAULT to userspace.

> 
>> For instr case: user process is killed by a SIGBUS signal
>>
>> Commit 046545a661af ("mm/hwpoison: fix error page recovered but reported "not
>> recovered"") introduced a bug that kill_accessing_process() return -EHWPOISON
>> for instr case, as result, kill_me_maybe() send a SIGBUS to user process.
> 
> This makes my head hurt... a race between the CMCI reporting an uncorrected
> error... why does the CMCI report uncorrected errors? This sounds like some
> nasty confusion.
> 
> And you've basically reused the format and wording of 046545a661af for your
> commit message and makes staring at those a PITA.
> 
> Tony, what's going on with that CMCI and SRAR race?
> 

I try to answer why the CMCI reporting an uncorrected error. Tony, please
correct me if I missed anyting.

When core issue a memory to a memory controller finds a 2 bit ECC error, it
will pass data with a poison flag through bus.

1. Home Agent logs UNCA error and signals CMCI ifenable.
2. Home Agent forwards data with poison indication bit set.
3. DCU detects the posion data, logs SRAR eror and triggers #MCE if recoverable.


Thanks.
Shuai


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-18 13:08       ` Shuai Xue
@ 2025-02-18 13:17         ` Borislav Petkov
  2025-02-18 13:53           ` Shuai Xue
  2025-02-18 17:59         ` Luck, Tony
  1 sibling, 1 reply; 59+ messages in thread
From: Borislav Petkov @ 2025-02-18 13:17 UTC (permalink / raw)
  To: Shuai Xue
  Cc: tony.luck, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong

On Tue, Feb 18, 2025 at 09:08:25PM +0800, Shuai Xue wrote:
> Yes, the poison is found on user pages.
> 
> Form commit log, the mechanism is added by Tony and suggested by you.
> https://lkml.kernel.org/r/20210818002942.1607544-3-tony.luck@intel.com

I'm not talking about how it is detected - I'm asking about *what* you're
doing exactly. I want to figure out what and why you're doing what you're
doing.

> It's the same as with real issue. There's no magic to it.

Magic or not, doesn't matter. The only question is whether this can happen in
real life and it is not just you using some tools and "fixing" things that
ain't broke.

> > What do futexes have to do with copying user memory?
> 
> Return -EFAULT to userspace.

This doesn't even begin to answer my question so I'll ask again:

"What do futexes have to do with copying user memory?"

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 3/5] x86/mce: add EX_TYPE_EFAULT_REG as in-kernel recovery context to fix copy-from-user operations regression
  2025-02-18 12:54   ` Peter Zijlstra
  2025-02-18 13:02     ` Peter Zijlstra
@ 2025-02-18 13:28     ` Shuai Xue
  2025-02-18 14:15       ` Peter Zijlstra
  1 sibling, 1 reply; 59+ messages in thread
From: Shuai Xue @ 2025-02-18 13:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tony.luck, bp, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, jpoimboe, linux-edac, linux-kernel, linux-mm,
	baolin.wang, tianruidong

Hi, Peter

在 2025/2/18 20:54, Peter Zijlstra 写道:
> On Mon, Feb 17, 2025 at 02:33:33PM +0800, Shuai Xue wrote:
> 
>> diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
>> index dac4d64dfb2a..14c2d71c3ce1 100644
>> --- a/arch/x86/kernel/cpu/mce/severity.c
>> +++ b/arch/x86/kernel/cpu/mce/severity.c
>> @@ -16,6 +16,7 @@
>>   #include <asm/traps.h>
>>   #include <asm/insn.h>
>>   #include <asm/insn-eval.h>
>> +#include <linux/extable.h>
>>   
>>   #include "internal.h"
>>   
>> @@ -285,7 +286,8 @@ static bool is_copy_from_user(struct pt_regs *regs)
>>    */
>>   static noinstr int error_context(struct mce *m, struct pt_regs *regs)
>>   {
>> -	int fixup_type;
>> +	const struct exception_table_entry *e;
>> +	int fixup_type, imm;
>>   	bool copy_user;
>>   
>>   	if ((m->cs & 3) == 3)
>> @@ -294,9 +296,14 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
>>   	if (!mc_recoverable(m->mcgstatus))
>>   		return IN_KERNEL;
>>   
>> +	e = search_exception_tables(m->ip);
>> +	if (!e)
>> +		return IN_KERNEL;
> 
> You didn't actually build this, did you? Or did you ignore the extra
> noinstr warnings?

I did build and test this patch set on it. But I did not find any warnings.
Could you provide more details?

> 
>>   	/* Allow instrumentation around external facilities usage. */
>>   	instrumentation_begin();
>> -	fixup_type = ex_get_fixup_type(m->ip);
>> +	fixup_type = FIELD_GET(EX_DATA_TYPE_MASK, e->data);
>> +	imm  = FIELD_GET(EX_DATA_IMM_MASK,  e->data);
>>   	copy_user  = is_copy_from_user(regs);
>>   	instrumentation_end();
>>   
>> @@ -304,9 +311,13 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
>>   	case EX_TYPE_UACCESS:
>>   		if (!copy_user)
>>   			return IN_KERNEL;
>> -		m->kflags |= MCE_IN_KERNEL_COPYIN;
>> -		fallthrough;
>> -
>> +		m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_RECOV;
>> +		return IN_KERNEL_RECOV;
>> +	case EX_TYPE_IMM_REG:
>> +		if (!copy_user || imm != -EFAULT)
>> +			return IN_KERNEL;
>> +		m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_RECOV;
>> +		return IN_KERNEL_RECOV;
> 
> Maybe I'm justnot understanding things, but what's wrong with something
> like the below; why do we care about the ex-type if we know its a MOV
> reading from userspace?
> 
> The less we muck about with the extable here, the better.

We need to make sure that we have register a fixup handler for the copy_user
case.  If no fixup handler found, the PC accessing posion will trigger #MCE
again and again resulting a hardlock up.

> 
> ---
> diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
> index dac4d64dfb2a..cb021058165f 100644
> --- a/arch/x86/kernel/cpu/mce/severity.c
> +++ b/arch/x86/kernel/cpu/mce/severity.c
> @@ -300,13 +300,12 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
>   	copy_user  = is_copy_from_user(regs);
>   	instrumentation_end();
>   
> -	switch (fixup_type) {
> -	case EX_TYPE_UACCESS:
> -		if (!copy_user)
> -			return IN_KERNEL;
> -		m->kflags |= MCE_IN_KERNEL_COPYIN;
> -		fallthrough;
> +	if (copy_user) {
> +		m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_COPYIN;
> +		return IN_KERNEL_RECOV
> +	}
>   
> +	switch (fixup_type) {
>   	case EX_TYPE_FAULT_MCE_SAFE:
>   	case EX_TYPE_DEFAULT_MCE_SAFE:
>   		m->kflags |= MCE_IN_KERNEL_RECOV;

Thanks.
Shuai


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-18 13:17         ` Borislav Petkov
@ 2025-02-18 13:53           ` Shuai Xue
  2025-02-18 15:31             ` Borislav Petkov
  0 siblings, 1 reply; 59+ messages in thread
From: Shuai Xue @ 2025-02-18 13:53 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: tony.luck, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong



在 2025/2/18 21:17, Borislav Petkov 写道:
> On Tue, Feb 18, 2025 at 09:08:25PM +0800, Shuai Xue wrote:
>> Yes, the poison is found on user pages.
>>
>> Form commit log, the mechanism is added by Tony and suggested by you.
>> https://lkml.kernel.org/r/20210818002942.1607544-3-tony.luck@intel.com
> 
> I'm not talking about how it is detected - I'm asking about *what* you're
> doing exactly. I want to figure out what and why you're doing what you're
> doing.
> 
>> It's the same as with real issue. There's no magic to it.
> 
> Magic or not, doesn't matter. The only question is whether this can happen in
> real life and it is not just you using some tools and "fixing" things that
> ain't broke.

The regression is reported by end user and we also observed in the production.

[5056863.064239] task: ffff8837d2a2a0c0 task.stack: ffffc90065814000
[5056863.137299] RIP: 0010:[<ffffffff813ad231>]  [<ffffffff813ad231>] __get_user_8+0x21/0x2b
...
[5056864.512018] Call Trace:
[5056864.543440]  [<ffffffff8111c203>] ? exit_robust_list+0x33/0x110
[5056864.616456]  [<ffffffff81088399>] mm_release+0x109/0x140
[5056864.682178]  [<ffffffff8108faf9>] do_exit+0x159/0xb60
[5056864.744785]  [<ffffffff81090583>] do_group_exit+0x43/0xb0
[5056864.811551]  [<ffffffff8109bdc9>] get_signal+0x289/0x630
[5056864.877277]  [<ffffffff8102d227>] do_signal+0x37/0x690
[5056864.940925]  [<ffffffff8111c4e5>] ? do_futex+0x205/0x520
[5056865.006651]  [<ffffffff8111c885>] ? SyS_futex+0x85/0x170
[5056865.072378]  [<ffffffff81003726>] exit_to_usermode_loop+0x76/0xc0
[5056865.147464]  [<ffffffff81003d01>] do_syscall_64+0x171/0x180
[5056865.216311]  [<ffffffff81741c8e>] entry_SYSCALL_64_after_swapgs+0x58/0xc6
> 
>>> What do futexes have to do with copying user memory?
>>
>> Return -EFAULT to userspace.
> 
> This doesn't even begin to answer my question so I'll ask again:
> 
> "What do futexes have to do with copying user memory?"
> 

Sorry, I did not get your point.

Thanks.
Shuai


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 3/5] x86/mce: add EX_TYPE_EFAULT_REG as in-kernel recovery context to fix copy-from-user operations regression
  2025-02-18 13:02     ` Peter Zijlstra
@ 2025-02-18 14:03       ` Shuai Xue
  0 siblings, 0 replies; 59+ messages in thread
From: Shuai Xue @ 2025-02-18 14:03 UTC (permalink / raw)
  To: Peter Zijlstra, Borislav Petkov
  Cc: tony.luck, bp, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, jpoimboe, linux-edac, linux-kernel, linux-mm,
	baolin.wang, tianruidong



在 2025/2/18 21:02, Peter Zijlstra 写道:
> On Tue, Feb 18, 2025 at 01:54:08PM +0100, Peter Zijlstra wrote:
> 
>> ---
>> diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
>> index dac4d64dfb2a..cb021058165f 100644
>> --- a/arch/x86/kernel/cpu/mce/severity.c
>> +++ b/arch/x86/kernel/cpu/mce/severity.c
>> @@ -300,13 +300,12 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
>>   	copy_user  = is_copy_from_user(regs);
>>   	instrumentation_end();
>>   
>> -	switch (fixup_type) {
>> -	case EX_TYPE_UACCESS:
>> -		if (!copy_user)
>> -			return IN_KERNEL;
>> -		m->kflags |= MCE_IN_KERNEL_COPYIN;
>> -		fallthrough;
>> +	if (copy_user) {
>> +		m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_COPYIN;
> 
> Typing is hard, obviously that second should'be been _RECOV.

Hah, it doesn't matter, I got your point.

> 
>> +		return IN_KERNEL_RECOV
> 
> But why are we having that bit *and* a return value saying the same
> thing?

Yes, it is rather redundant. I can refactor this out if @Borislav is
happy with that.

> 
>> +	}
>>   
>> +	switch (fixup_type) {
>>   	case EX_TYPE_FAULT_MCE_SAFE:
>>   	case EX_TYPE_DEFAULT_MCE_SAFE:
>>   		m->kflags |= MCE_IN_KERNEL_RECOV;

Thanks.
Shuai


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 3/5] x86/mce: add EX_TYPE_EFAULT_REG as in-kernel recovery context to fix copy-from-user operations regression
  2025-02-18 13:28     ` Shuai Xue
@ 2025-02-18 14:15       ` Peter Zijlstra
  2025-02-18 16:48         ` Borislav Petkov
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2025-02-18 14:15 UTC (permalink / raw)
  To: Shuai Xue
  Cc: tony.luck, bp, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, jpoimboe, linux-edac, linux-kernel, linux-mm,
	baolin.wang, tianruidong

On Tue, Feb 18, 2025 at 09:28:33PM +0800, Shuai Xue wrote:

> I did build and test this patch set on it. But I did not find any warnings.
> Could you provide more details?

NOINSTR_VALIDATION=y helps

> > >   	/* Allow instrumentation around external facilities usage. */
> > >   	instrumentation_begin();
> > > -	fixup_type = ex_get_fixup_type(m->ip);
> > > +	fixup_type = FIELD_GET(EX_DATA_TYPE_MASK, e->data);
> > > +	imm  = FIELD_GET(EX_DATA_IMM_MASK,  e->data);
> > >   	copy_user  = is_copy_from_user(regs);
> > >   	instrumentation_end();
> > > @@ -304,9 +311,13 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
> > >   	case EX_TYPE_UACCESS:
> > >   		if (!copy_user)
> > >   			return IN_KERNEL;
> > > -		m->kflags |= MCE_IN_KERNEL_COPYIN;
> > > -		fallthrough;
> > > -
> > > +		m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_RECOV;
> > > +		return IN_KERNEL_RECOV;
> > > +	case EX_TYPE_IMM_REG:
> > > +		if (!copy_user || imm != -EFAULT)
> > > +			return IN_KERNEL;
> > > +		m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_RECOV;
> > > +		return IN_KERNEL_RECOV;
> > 
> > Maybe I'm justnot understanding things, but what's wrong with something
> > like the below; why do we care about the ex-type if we know its a MOV
> > reading from userspace?
> > 
> > The less we muck about with the extable here, the better.
> 
> We need to make sure that we have register a fixup handler for the copy_user
> case.  If no fixup handler found, the PC accessing posion will trigger #MCE
> again and again resulting a hardlock up.

Well, then write it like so. Afaict, you don't care what the actual
exception type is, just that there is one, for the copy_user case.

diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
index dac4d64dfb2a..cfdae25eacd7 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -301,18 +301,19 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
 	instrumentation_end();
 
 	switch (fixup_type) {
-	case EX_TYPE_UACCESS:
-		if (!copy_user)
-			return IN_KERNEL;
-		m->kflags |= MCE_IN_KERNEL_COPYIN;
-		fallthrough;
-
 	case EX_TYPE_FAULT_MCE_SAFE:
 	case EX_TYPE_DEFAULT_MCE_SAFE:
 		m->kflags |= MCE_IN_KERNEL_RECOV;
 		return IN_KERNEL_RECOV;
 
 	default:
+		if (copy_user) {
+			m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_RECOV;
+			return IN_KERNEL_RECOV;
+		}
+		fallthrough;
+
+	case EX_TYPE_NONE:
 		return IN_KERNEL;
 	}
 }


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-18 13:53           ` Shuai Xue
@ 2025-02-18 15:31             ` Borislav Petkov
  2025-02-19  7:13               ` Shuai Xue
  0 siblings, 1 reply; 59+ messages in thread
From: Borislav Petkov @ 2025-02-18 15:31 UTC (permalink / raw)
  To: Shuai Xue
  Cc: tony.luck, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong

On Tue, Feb 18, 2025 at 09:53:17PM +0800, Shuai Xue wrote:
> The regression is reported by end user and we also observed in the production.

Where is that report? How many times do I have to ask about different aspects
of your patches until you explain the *whole* picture?

> [5056863.064239] task: ffff8837d2a2a0c0 task.stack: ffffc90065814000
> [5056863.137299] RIP: 0010:[<ffffffff813ad231>]  [<ffffffff813ad231>] __get_user_8+0x21/0x2b
> ...
> [5056864.512018] Call Trace:

This tells me exactly 0 - I see some truncated stack trace.

> Sorry, I did not get your point.

I don't get your text either. Until this is explained and debugged properly,
it is not going anywhere.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 3/5] x86/mce: add EX_TYPE_EFAULT_REG as in-kernel recovery context to fix copy-from-user operations regression
  2025-02-18 14:15       ` Peter Zijlstra
@ 2025-02-18 16:48         ` Borislav Petkov
  2025-02-19 10:40           ` Peter Zijlstra
  0 siblings, 1 reply; 59+ messages in thread
From: Borislav Petkov @ 2025-02-18 16:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Shuai Xue, tony.luck, nao.horiguchi, tglx, mingo, dave.hansen,
	x86, hpa, linmiaohe, akpm, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong

On Tue, Feb 18, 2025 at 03:15:35PM +0100, Peter Zijlstra wrote:
> diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
> index dac4d64dfb2a..cfdae25eacd7 100644
> --- a/arch/x86/kernel/cpu/mce/severity.c
> +++ b/arch/x86/kernel/cpu/mce/severity.c
> @@ -301,18 +301,19 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
>  	instrumentation_end();
>  
>  	switch (fixup_type) {
> -	case EX_TYPE_UACCESS:
> -		if (!copy_user)
> -			return IN_KERNEL;
> -		m->kflags |= MCE_IN_KERNEL_COPYIN;
> -		fallthrough;
> -
>  	case EX_TYPE_FAULT_MCE_SAFE:
>  	case EX_TYPE_DEFAULT_MCE_SAFE:
>  		m->kflags |= MCE_IN_KERNEL_RECOV;
>  		return IN_KERNEL_RECOV;
>  
>  	default:
> +		if (copy_user) {

As said on chat, if we can make is_copy_from_user() *always* correctly detect
user access, then sure but I'm afraid EX_TYPE_UACCESS being generated at the
handful places where we do user memory access is there for a reason as it
makes it pretty explicit.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-18 12:24     ` Borislav Petkov
  2025-02-18 13:08       ` Shuai Xue
@ 2025-02-18 17:30       ` Luck, Tony
  2025-02-19  8:10         ` Borislav Petkov
  1 sibling, 1 reply; 59+ messages in thread
From: Luck, Tony @ 2025-02-18 17:30 UTC (permalink / raw)
  To: Borislav Petkov, Shuai Xue
  Cc: nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa, linmiaohe,
	akpm, peterz, jpoimboe, linux-edac, linux-kernel, linux-mm,
	baolin.wang, tianruidong

> > For instr case: user process is killed by a SIGBUS signal
> >
> > Commit 046545a661af ("mm/hwpoison: fix error page recovered but reported "not
> > recovered"") introduced a bug that kill_accessing_process() return -EHWPOISON
> > for instr case, as result, kill_me_maybe() send a SIGBUS to user process.
>
> This makes my head hurt... a race between the CMCI reporting an uncorrected
> error... why does the CMCI report uncorrected errors? This sounds like some
> nasty confusion.

My head hurts too. The problem is the evolution and subsequent overloading of
limited signal options in Intel error reporting.

Prior to Icelake memory controllers reported patrol scrub events that detected
a previously unseen uncorrected error in memory by signaling a broadcast
machine check with an SRAO (Software Recoverable Action Optional) signature
in the machine check bank.

This was overkill. It's not an urgent problem. No core is on the verge of consuming
that bad data.

But the fix causes the confusion. The machine check bank signature was changed
to UCNA (Uncorrected, No Action required), and signal changed to #CMCI (since
that was the only option available in the toolbox :-(

That's how we ended up with *UN*corrected errors tied to *C*MCI.

Just to add to the confusion, Linux does take an action (in uc_decode_notifier())
to try to offline the page despite the UC*NA* signature name.

> And you've basically reused the format and wording of 046545a661af for your
> commit message and makes staring at those a PITA.
>
> Tony, what's going on with that CMCI and SRAR race?

Now the race ... having decided that CMCI/UCNA is the best action for patrol
scrub errors, the memory controller uses it for reads too. But the memory controller
is executing asynchronously from the core, and can't tell the difference between a
"real" read and a speculative read. So it will do CMCI/UCNA if an error is found in
any read.

Thus:

1) Core is clever and thinks address A is needed soon, issues a speculative read.
2) Core finds it is going to use address A soon after sending the read request
3) The CMCI from the memory controller is in a race with the core that will soon try to retire the load from address A.

Quite often (because speculation has got better) the CMCI from the memory controller
is delivered before the core is committed to the instruction reading address A, so the
interrupt is taken, and Linux offlines the page (marking it as poison).

When the interrupt returns, the core gets to the load instruction, and gets a #PF because
the offline process marked the page not-present and flushed the TLB.

Finally the #PF handler tries to fix the page fault, sees that page is marked as poison
so sends SIGBUS to the process.

Note, AMD might have a similar race with the MCE_DEFERRED_SEVERITY signal?
(but with less confusing naming).

-Tony

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-18 13:08       ` Shuai Xue
  2025-02-18 13:17         ` Borislav Petkov
@ 2025-02-18 17:59         ` Luck, Tony
  2025-02-19  6:04           ` Shuai Xue
  1 sibling, 1 reply; 59+ messages in thread
From: Luck, Tony @ 2025-02-18 17:59 UTC (permalink / raw)
  To: Shuai Xue, Borislav Petkov
  Cc: nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa, linmiaohe,
	akpm, peterz, jpoimboe, linux-edac, linux-kernel, linux-mm,
	baolin.wang, tianruidong

>> What do futexes have to do with copying user memory?
>
> Return -EFAULT to userspace.

Missed this bit. Kernel code for futex does a get_user() to read the
value of the futex from user memory.

-Tony



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-18 17:59         ` Luck, Tony
@ 2025-02-19  6:04           ` Shuai Xue
  0 siblings, 0 replies; 59+ messages in thread
From: Shuai Xue @ 2025-02-19  6:04 UTC (permalink / raw)
  To: Luck, Tony, Borislav Petkov
  Cc: nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa, linmiaohe,
	akpm, peterz, jpoimboe, linux-edac, linux-kernel, linux-mm,
	baolin.wang, tianruidong



在 2025/2/19 01:59, Luck, Tony 写道:
>>> What do futexes have to do with copying user memory?
>>
>> Return -EFAULT to userspace.
> 
> Missed this bit. Kernel code for futex does a get_user() to read the
> value of the futex from user memory.
> 
> -Tony
> 
> 

Tony, you saved me.

Thanks.
Shuai


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 5/5] mm: memory-failure: move return value documentation to function declaration
  2025-02-17  6:33 ` [PATCH v2 5/5] mm: memory-failure: move return value documentation to function declaration Shuai Xue
@ 2025-02-19  6:31   ` Miaohe Lin
  0 siblings, 0 replies; 59+ messages in thread
From: Miaohe Lin @ 2025-02-19  6:31 UTC (permalink / raw)
  To: Shuai Xue
  Cc: tglx, mingo, dave.hansen, x86, hpa, akpm, peterz, jpoimboe,
	linux-edac, linux-kernel, linux-mm, baolin.wang, tianruidong,
	tony.luck, bp, nao.horiguchi

On 2025/2/17 14:33, Shuai Xue wrote:
> Part of return value comments for memory_failure() were originally
> documented at the call site. Move those comments to the function
> declaration to improve code readability and to provide developers with
> immediate access to function usage and return information.
> 
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
> Reviewed-by: Jane Chu <jane.chu@oracle.com>

LGTM.

Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Thanks.
.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 4/5] mm/hwpoison: Fix incorrect "not recovered" report for recovered clean pages
  2025-02-17  6:33 ` [PATCH v2 4/5] mm/hwpoison: Fix incorrect "not recovered" report for recovered clean pages Shuai Xue
@ 2025-02-19  6:34   ` Miaohe Lin
  2025-02-19  8:54     ` Shuai Xue
  0 siblings, 1 reply; 59+ messages in thread
From: Miaohe Lin @ 2025-02-19  6:34 UTC (permalink / raw)
  To: Shuai Xue
  Cc: tglx, mingo, dave.hansen, x86, hpa, akpm, peterz, jpoimboe,
	linux-edac, linux-kernel, linux-mm, baolin.wang, tianruidong,
	tony.luck, bp, nao.horiguchi

On 2025/2/17 14:33, Shuai Xue wrote:
> When an uncorrected memory error is consumed there is a race between
> the CMCI from the memory controller reporting an uncorrected error
> with a UCNA signature, and the core reporting and SRAR signature
> machine check when the data is about to be consumed.
> 
> If the CMCI wins that race, the page is marked poisoned when
> uc_decode_notifier() calls memory_failure(). For dirty pages,
> memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag,
> converting the PTE to a hwpoison entry. As a result,
> kill_accessing_process():
> 
> - call walk_page_range() and return 1 regardless of whether
>   try_to_unmap() succeeds or fails,
> - call kill_proc() to make sure a SIGBUS is sent
> - return -EHWPOISON to indicate that SIGBUS is already sent to the
>   process and kill_me_maybe() doesn't have to send it again.
> 
> However, for clean pages, the TTU_HWPOISON flag is cleared, leaving the
> PTE unchanged and not converted to a hwpoison entry. Conversely, for
> clean pages where PTE entries are not marked as hwpoison,
> kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to
> send a SIGBUS.
> 
> Console log looks like this:
> 
>     Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects
>     Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered
>     Memory failure: 0x827ca68: already hardware poisoned
>     mce: Memory error not recovered
> 
> To fix it, return 0 for "corrupted page was clean", preventing an
> unnecessary SIGBUS.
> 
> Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> Cc: stable@vger.kernel.org
> ---
>  mm/memory-failure.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 995a15eb67e2..b037952565be 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -881,12 +881,17 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
>  	mmap_read_lock(p->mm);
>  	ret = walk_page_range(p->mm, 0, TASK_SIZE, &hwpoison_walk_ops,
>  			      (void *)&priv);
> +	/*
> +	 * ret = 1 when CMCI wins, regardless of whether try_to_unmap()
> +	 * succeeds or fails, then kill the process with SIGBUS.
> +	 * ret = 0 when poison page is a clean page and it's dropped, no
> +	 * SIGBUS is needed.
> +	 */
>  	if (ret == 1 && priv.tk.addr)
>  		kill_proc(&priv.tk, pfn, flags);
> -	else
> -		ret = 0;
>  	mmap_read_unlock(p->mm);
> -	return ret > 0 ? -EHWPOISON : -EFAULT;
> +
> +	return ret > 0 ? -EHWPOISON : 0;

The caller kill_me_maybe will do set_mce_nospec + sync_core again.

static void kill_me_maybe(struct callback_head *cb)
{
	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
	int flags = MF_ACTION_REQUIRED;
	...
	ret = memory_failure(pfn, flags);
	if (!ret) {
		set_mce_nospec(pfn);
		sync_core();
		return;
	}

Is this expected?

Thanks.
.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-18 15:31             ` Borislav Petkov
@ 2025-02-19  7:13               ` Shuai Xue
  0 siblings, 0 replies; 59+ messages in thread
From: Shuai Xue @ 2025-02-19  7:13 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: tony.luck, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong



在 2025/2/18 23:31, Borislav Petkov 写道:
> On Tue, Feb 18, 2025 at 09:53:17PM +0800, Shuai Xue wrote:
>> The regression is reported by end user and we also observed in the production.
> 
> Where is that report? How many times do I have to ask about different aspects
> of your patches until you explain the *whole* picture?

Sorry, I can not provide the internal report.

Thank you for your continued attention to this matter.

I appreciate your thoroughness, and I'd like to clarify a few points: I've made
every effort to explain the issue comprehensively in my previous responses,
following your previous instructions. And I do not have a more bigger picture.
My picture is not introducing a new feature but addressing a regression caused
by previous commits and improving system reliability. The root cause and its
impact are detailed in patches 3 and 4.

> 
>> [5056863.064239] task: ffff8837d2a2a0c0 task.stack: ffffc90065814000
>> [5056863.137299] RIP: 0010:[<ffffffff813ad231>]  [<ffffffff813ad231>] __get_user_8+0x21/0x2b
>> ...
>> [5056864.512018] Call Trace:
> 
> This tells me exactly 0 - I see some truncated stack trace.
> 
>> Sorry, I did not get your point.
> 
> I don't get your text either. Until this is explained and debugged properly,
> it is not going anywhere.
> 

Tony helped to answer this answer. If you are a asking for how futex trigger a
poison and handle it, hope bellow callstack helps.

futex(2)
     do_futex
         futex_wait
	    __futex_wait
	        futex_wait_setup
    		    get_user // return -EFAULT to top futex(2)

I've strived to provide all the relevant information within the constraints I'm
operating under. If there are specific aspects you feel need further
clarification, please let me know, and I'll do my best to address them.

Thank you for your patience and guidance throughout this process. I look
forward to your feedback and to moving this patch forward.

Shuai


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-18 17:30       ` Luck, Tony
@ 2025-02-19  8:10         ` Borislav Petkov
  2025-02-19 17:11           ` Luck, Tony
  0 siblings, 1 reply; 59+ messages in thread
From: Borislav Petkov @ 2025-02-19  8:10 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Shuai Xue, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong

On Tue, Feb 18, 2025 at 05:30:19PM +0000, Luck, Tony wrote:
First of all, thanks for explaining - that helps a lot!

> That's how we ended up with *UN*corrected errors tied to *C*MCI.
> 
> Just to add to the confusion, Linux does take an action (in uc_decode_notifier())
> to try to offline the page despite the UC*NA* signature name.

So, AFAIU, hw folks are basically trying to tell us: well, this is
*technically* an uncorrectable error but meh, not really important. We just
met it while fetching some data while scrubbing so who knows whether you'll
consume it or not. Meh...

So why don't we simply do that?

We report the signature but we do not try to offline anything. When we get to
*actually* consume it non-speculatively, *then* we run memory failure and then
we offline the page.

Hmmm?

Would that solve that particular debacle?

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 4/5] mm/hwpoison: Fix incorrect "not recovered" report for recovered clean pages
  2025-02-19  6:34   ` Miaohe Lin
@ 2025-02-19  8:54     ` Shuai Xue
  2025-02-19 17:15       ` Luck, Tony
  0 siblings, 1 reply; 59+ messages in thread
From: Shuai Xue @ 2025-02-19  8:54 UTC (permalink / raw)
  To: Miaohe Lin, Luck, Tony
  Cc: tglx, mingo, dave.hansen, x86, hpa, akpm, peterz, jpoimboe,
	linux-edac, linux-kernel, linux-mm, baolin.wang, tianruidong,
	tony.luck, bp, nao.horiguchi



在 2025/2/19 14:34, Miaohe Lin 写道:
> On 2025/2/17 14:33, Shuai Xue wrote:
>> When an uncorrected memory error is consumed there is a race between
>> the CMCI from the memory controller reporting an uncorrected error
>> with a UCNA signature, and the core reporting and SRAR signature
>> machine check when the data is about to be consumed.
>>
>> If the CMCI wins that race, the page is marked poisoned when
>> uc_decode_notifier() calls memory_failure(). For dirty pages,
>> memory_failure() invokes try_to_unmap() with the TTU_HWPOISON flag,
>> converting the PTE to a hwpoison entry. As a result,
>> kill_accessing_process():
>>
>> - call walk_page_range() and return 1 regardless of whether
>>    try_to_unmap() succeeds or fails,
>> - call kill_proc() to make sure a SIGBUS is sent
>> - return -EHWPOISON to indicate that SIGBUS is already sent to the
>>    process and kill_me_maybe() doesn't have to send it again.
>>
>> However, for clean pages, the TTU_HWPOISON flag is cleared, leaving the
>> PTE unchanged and not converted to a hwpoison entry. Conversely, for
>> clean pages where PTE entries are not marked as hwpoison,
>> kill_accessing_process() returns -EFAULT, causing kill_me_maybe() to
>> send a SIGBUS.
>>
>> Console log looks like this:
>>
>>      Memory failure: 0x827ca68: corrupted page was clean: dropped without side effects
>>      Memory failure: 0x827ca68: recovery action for clean LRU page: Recovered
>>      Memory failure: 0x827ca68: already hardware poisoned
>>      mce: Memory error not recovered
>>
>> To fix it, return 0 for "corrupted page was clean", preventing an
>> unnecessary SIGBUS.
>>
>> Fixes: 046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>> Cc: stable@vger.kernel.org
>> ---
>>   mm/memory-failure.c | 11 ++++++++---
>>   1 file changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 995a15eb67e2..b037952565be 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -881,12 +881,17 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
>>   	mmap_read_lock(p->mm);
>>   	ret = walk_page_range(p->mm, 0, TASK_SIZE, &hwpoison_walk_ops,
>>   			      (void *)&priv);
>> +	/*
>> +	 * ret = 1 when CMCI wins, regardless of whether try_to_unmap()
>> +	 * succeeds or fails, then kill the process with SIGBUS.
>> +	 * ret = 0 when poison page is a clean page and it's dropped, no
>> +	 * SIGBUS is needed.
>> +	 */
>>   	if (ret == 1 && priv.tk.addr)
>>   		kill_proc(&priv.tk, pfn, flags);
>> -	else
>> -		ret = 0;
>>   	mmap_read_unlock(p->mm);
>> -	return ret > 0 ? -EHWPOISON : -EFAULT;
>> +
>> +	return ret > 0 ? -EHWPOISON : 0;
> 
> The caller kill_me_maybe will do set_mce_nospec + sync_core again.
> 
> static void kill_me_maybe(struct callback_head *cb)
> {
> 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
> 	int flags = MF_ACTION_REQUIRED;
> 	...
> 	ret = memory_failure(pfn, flags);
> 	if (!ret) {
> 		set_mce_nospec(pfn);
> 		sync_core();
> 		return;
> 	}
> 
> Is this expected?
> 

the second set_mce_nospec do nothing and have no side affect.

sync_core() is introduced by Tony [1]:

Also moved sync_core(). The comments for this function say that it should
only be called when instructions have been changed/re-mapped. Recovery for
an instruction fetch may change the physical address. But that doesn't happen
until the scheduled work runs (which could be on another CPU).

[1]https://lore.kernel.org/all/20200824221237.5397-1-tony.luck@intel.com/T/#u

IMHO, I think it also has no side affect.

@Tony, could you help to confirm this?

Thank.
Shuai





^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 3/5] x86/mce: add EX_TYPE_EFAULT_REG as in-kernel recovery context to fix copy-from-user operations regression
  2025-02-18 16:48         ` Borislav Petkov
@ 2025-02-19 10:40           ` Peter Zijlstra
  2025-02-21  6:52             ` Shuai Xue
  0 siblings, 1 reply; 59+ messages in thread
From: Peter Zijlstra @ 2025-02-19 10:40 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Shuai Xue, tony.luck, nao.horiguchi, tglx, mingo, dave.hansen,
	x86, hpa, linmiaohe, akpm, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong

On Tue, Feb 18, 2025 at 05:48:00PM +0100, Borislav Petkov wrote:
> On Tue, Feb 18, 2025 at 03:15:35PM +0100, Peter Zijlstra wrote:
> > diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
> > index dac4d64dfb2a..cfdae25eacd7 100644
> > --- a/arch/x86/kernel/cpu/mce/severity.c
> > +++ b/arch/x86/kernel/cpu/mce/severity.c
> > @@ -301,18 +301,19 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
> >  	instrumentation_end();
> >  
> >  	switch (fixup_type) {
> > -	case EX_TYPE_UACCESS:
> > -		if (!copy_user)
> > -			return IN_KERNEL;
> > -		m->kflags |= MCE_IN_KERNEL_COPYIN;
> > -		fallthrough;
> > -
> >  	case EX_TYPE_FAULT_MCE_SAFE:
> >  	case EX_TYPE_DEFAULT_MCE_SAFE:
> >  		m->kflags |= MCE_IN_KERNEL_RECOV;
> >  		return IN_KERNEL_RECOV;
> >  
> >  	default:
> > +		if (copy_user) {
> 
> As said on chat, if we can make is_copy_from_user() *always* correctly detect
> user access, then sure but I'm afraid EX_TYPE_UACCESS being generated at the
> handful places where we do user memory access is there for a reason as it
> makes it pretty explicit.

Thing is, we have copy routines that do not know if its user or not.
is_copy_from_user() must be reliable.

Anyway, if you all really want to go all funny, try the below.

Someone has to go and stick some EX_FLAG_USER on things, but I just
really don't believe that's doing to be useful. Because while you're
doing that, you should also audit if is_copy_from_user() will catch it
and if it does, you don't need the tag.

See how much tags you end up with..


---
diff --git a/arch/x86/include/asm/extable_fixup_types.h b/arch/x86/include/asm/extable_fixup_types.h
index 906b0d5541e8..1d6c6ff51d28 100644
--- a/arch/x86/include/asm/extable_fixup_types.h
+++ b/arch/x86/include/asm/extable_fixup_types.h
@@ -31,6 +31,9 @@
 #define EX_FLAG_CLEAR_DX		EX_DATA_FLAG(2)
 #define EX_FLAG_CLEAR_AX_DX		EX_DATA_FLAG(3)
 
+#define EX_FLAG_USER			EX_DATA_FLAG(4)
+#define EX_FLAG_MCE			EX_DATA_FLAG(8)
+
 /* types */
 #define	EX_TYPE_NONE			 0
 #define	EX_TYPE_DEFAULT			 1
@@ -46,8 +49,6 @@
 #define	EX_TYPE_RDMSR_SAFE		11 /* reg := -EIO */
 #define	EX_TYPE_WRMSR_IN_MCE		12
 #define	EX_TYPE_RDMSR_IN_MCE		13
-#define	EX_TYPE_DEFAULT_MCE_SAFE	14
-#define	EX_TYPE_FAULT_MCE_SAFE		15
 
 #define	EX_TYPE_POP_REG			16 /* sp += sizeof(long) */
 #define EX_TYPE_POP_ZERO		(EX_TYPE_POP_REG | EX_DATA_IMM(0))
diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
index dac4d64dfb2a..86a32fa020d2 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -300,21 +300,20 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
 	copy_user  = is_copy_from_user(regs);
 	instrumentation_end();
 
-	switch (fixup_type) {
-	case EX_TYPE_UACCESS:
-		if (!copy_user)
-			return IN_KERNEL;
-		m->kflags |= MCE_IN_KERNEL_COPYIN;
-		fallthrough;
-
-	case EX_TYPE_FAULT_MCE_SAFE:
-	case EX_TYPE_DEFAULT_MCE_SAFE:
+	if (fixup_type == EX_TYPE_NONE)
+		return IN_KERNEL;
+
+	if (fixup_type & EX_FLAG_MCE) {
 		m->kflags |= MCE_IN_KERNEL_RECOV;
 		return IN_KERNEL_RECOV;
+	}
 
-	default:
-		return IN_KERNEL;
+	if ((fixup_type & EX_FLAG_USER) || copy_user) {
+		m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_RECOV;
+		return IN_KERNEL_RECOV;
 	}
+
+	return IN_KERNEL;
 }
 
 /* See AMD PPR(s) section Machine Check Error Handling. */
diff --git a/arch/x86/kernel/fpu/legacy.h b/arch/x86/kernel/fpu/legacy.h
index 098f367bb8a7..3f6036840d65 100644
--- a/arch/x86/kernel/fpu/legacy.h
+++ b/arch/x86/kernel/fpu/legacy.h
@@ -24,7 +24,7 @@ static inline void ldmxcsr(u32 mxcsr)
 	asm volatile(ASM_STAC "\n"					\
 		     "1: " #insn "\n"					\
 		     "2: " ASM_CLAC "\n"				\
-		     _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_FAULT_MCE_SAFE)	\
+		     _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_FAULT | EX_FLAG_MCE)	\
 		     : [err] "=a" (err), output				\
 		     : "0"(0), input);					\
 	err;								\
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index aa16f1a1bbcf..eef534091105 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -115,7 +115,7 @@ static inline int update_pkru_in_sigframe(struct xregs_state __user *buf, u64 ma
 	asm volatile("1:" op "\n\t"					\
 		     "xor %[err], %[err]\n"				\
 		     "2:\n\t"						\
-		     _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_FAULT_MCE_SAFE)	\
+		     _ASM_EXTABLE_TYPE(1b, 2b, EX_TYPE_FAULT | EX_FLAG_MCE)	\
 		     : [err] "=a" (err)					\
 		     : "D" (st), "m" (*st), "a" (lmask), "d" (hmask)	\
 		     : "memory")
diff --git a/arch/x86/lib/copy_mc_64.S b/arch/x86/lib/copy_mc_64.S
index c859a8a09860..7977689ad46e 100644
--- a/arch/x86/lib/copy_mc_64.S
+++ b/arch/x86/lib/copy_mc_64.S
@@ -103,9 +103,9 @@ SYM_FUNC_START(copy_mc_fragile)
 	movl	%ecx, %edx
 	jmp copy_mc_fragile_handle_tail
 
-	_ASM_EXTABLE_TYPE(.L_read_leading_bytes, .E_leading_bytes, EX_TYPE_DEFAULT_MCE_SAFE)
-	_ASM_EXTABLE_TYPE(.L_read_words, .E_read_words, EX_TYPE_DEFAULT_MCE_SAFE)
-	_ASM_EXTABLE_TYPE(.L_read_trailing_bytes, .E_trailing_bytes, EX_TYPE_DEFAULT_MCE_SAFE)
+	_ASM_EXTABLE_TYPE(.L_read_leading_bytes, .E_leading_bytes, EX_TYPE_DEFAULT | EX_FLAG_MCE)
+	_ASM_EXTABLE_TYPE(.L_read_words, .E_read_words, EX_TYPE_DEFAULT | EX_FLAG_MCE)
+	_ASM_EXTABLE_TYPE(.L_read_trailing_bytes, .E_trailing_bytes, EX_TYPE_DEFAULT | EX_FLAG_MCE)
 	_ASM_EXTABLE(.L_write_leading_bytes, .E_leading_bytes)
 	_ASM_EXTABLE(.L_write_words, .E_write_words)
 	_ASM_EXTABLE(.L_write_trailing_bytes, .E_trailing_bytes)
@@ -143,7 +143,7 @@ SYM_FUNC_START(copy_mc_enhanced_fast_string)
 	movq %rcx, %rax
 	RET
 
-	_ASM_EXTABLE_TYPE(.L_copy, .E_copy, EX_TYPE_DEFAULT_MCE_SAFE)
+	_ASM_EXTABLE_TYPE(.L_copy, .E_copy, EX_TYPE_DEFAULT | EX_FLAG_MCE)
 
 SYM_FUNC_END(copy_mc_enhanced_fast_string)
 #endif /* !CONFIG_UML */
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 51986e8a9d35..7358bf10baba 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -293,8 +293,10 @@ static bool ex_handler_eretu(const struct exception_table_entry *fixup,
 int ex_get_fixup_type(unsigned long ip)
 {
 	const struct exception_table_entry *e = search_exception_tables(ip);
+	if (!e)
+		return EX_TYPE_NONE;
 
-	return e ? FIELD_GET(EX_DATA_TYPE_MASK, e->data) : EX_TYPE_NONE;
+	return FIELD_GET(EX_DATA_TYPE_MASK, e->data) | (e->data & (EX_FLAG_USER | EX_FLAG_MCE));
 }
 
 int fixup_exception(struct pt_regs *regs, int trapnr, unsigned long error_code,
@@ -327,10 +329,8 @@ int fixup_exception(struct pt_regs *regs, int trapnr, unsigned long error_code,
 
 	switch (type) {
 	case EX_TYPE_DEFAULT:
-	case EX_TYPE_DEFAULT_MCE_SAFE:
 		return ex_handler_default(e, regs);
 	case EX_TYPE_FAULT:
-	case EX_TYPE_FAULT_MCE_SAFE:
 		return ex_handler_fault(e, regs, trapnr);
 	case EX_TYPE_UACCESS:
 		return ex_handler_uaccess(e, regs, trapnr, fault_addr);


^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-19  8:10         ` Borislav Petkov
@ 2025-02-19 17:11           ` Luck, Tony
  2025-02-20 11:19             ` Borislav Petkov
  0 siblings, 1 reply; 59+ messages in thread
From: Luck, Tony @ 2025-02-19 17:11 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Shuai Xue, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong

> First of all, thanks for explaining - that helps a lot!
>
> > That's how we ended up with *UN*corrected errors tied to *C*MCI.
> >
> > Just to add to the confusion, Linux does take an action (in uc_decode_notifier())
> > to try to offline the page despite the UC*NA* signature name.
>
> So, AFAIU, hw folks are basically trying to tell us: well, this is
> *technically* an uncorrectable error but meh, not really important. We just
> met it while fetching some data while scrubbing so who knows whether you'll
> consume it or not. Meh...
>
> So why don't we simply do that?

We could, but I don't like it much. By taking the page offline from the relatively
kind environment of a regular interrupt, we often avoid taking a machine check
(which is an unfriendly environment for software).

> We report the signature but we do not try to offline anything. When we get to
> *actually* consume it non-speculatively, *then* we run memory failure and then
> we offline the page.
>
> Hmmm?
>
> Would that solve that particular debacle?

Perhaps. It removes the race. But at the cost of always taking a machine
check instead of frequently avoiding it with the uc_decode_notifier() offline.

Modern Intel Xeons (>= SkyLake) support local machine check[1]. Even
newer Xeons report #MC as recoverable from pretty much all user mode
poison consumption. So machine check isn't as painful on modern systems
as it used to be.

We could make the action in uc_decode_notifier() configurable. Default=off
but with a command line option to enable for systems that are stuck with
broadcast machine checks.

On Intel that would mean not registering the notifier at all. What about AMD?
Do you have similar races for MCE_DEFERRED_SEVERITY errors?

-Tony

[1] Some OEMs still do not enable LMCE in their BIOS.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH v2 4/5] mm/hwpoison: Fix incorrect "not recovered" report for recovered clean pages
  2025-02-19  8:54     ` Shuai Xue
@ 2025-02-19 17:15       ` Luck, Tony
  2025-02-20  1:16         ` Miaohe Lin
  0 siblings, 1 reply; 59+ messages in thread
From: Luck, Tony @ 2025-02-19 17:15 UTC (permalink / raw)
  To: Shuai Xue, Miaohe Lin
  Cc: tglx, mingo, dave.hansen, x86, hpa, akpm, peterz, jpoimboe,
	linux-edac, linux-kernel, linux-mm, baolin.wang, tianruidong, bp,
	nao.horiguchi

> > The caller kill_me_maybe will do set_mce_nospec + sync_core again.
> >
> > static void kill_me_maybe(struct callback_head *cb)
> > {
> >     struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
> >     int flags = MF_ACTION_REQUIRED;
> >     ...
> >     ret = memory_failure(pfn, flags);
> >     if (!ret) {
> >             set_mce_nospec(pfn);
> >             sync_core();
> >             return;
> >     }
> >
> > Is this expected?
> >
>
> the second set_mce_nospec do nothing and have no side affect.
>
> sync_core() is introduced by Tony [1]:
>
> Also moved sync_core(). The comments for this function say that it should
> only be called when instructions have been changed/re-mapped. Recovery for
> an instruction fetch may change the physical address. But that doesn't happen
> until the scheduled work runs (which could be on another CPU).
>
> [1]https://lore.kernel.org/all/20200824221237.5397-1-tony.luck@intel.com/T/#u
>
> IMHO, I think it also has no side affect.
>
> @Tony, could you help to confirm this?

Correct. Re-runing these calls is harmless.

-Tony

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 4/5] mm/hwpoison: Fix incorrect "not recovered" report for recovered clean pages
  2025-02-19 17:15       ` Luck, Tony
@ 2025-02-20  1:16         ` Miaohe Lin
  0 siblings, 0 replies; 59+ messages in thread
From: Miaohe Lin @ 2025-02-20  1:16 UTC (permalink / raw)
  To: Luck, Tony, Shuai Xue
  Cc: tglx, mingo, dave.hansen, x86, hpa, akpm, peterz, jpoimboe,
	linux-edac, linux-kernel, linux-mm, baolin.wang, tianruidong, bp,
	nao.horiguchi

On 2025/2/20 1:15, Luck, Tony wrote:
>>> The caller kill_me_maybe will do set_mce_nospec + sync_core again.
>>>
>>> static void kill_me_maybe(struct callback_head *cb)
>>> {
>>>     struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
>>>     int flags = MF_ACTION_REQUIRED;
>>>     ...
>>>     ret = memory_failure(pfn, flags);
>>>     if (!ret) {
>>>             set_mce_nospec(pfn);
>>>             sync_core();
>>>             return;
>>>     }
>>>
>>> Is this expected?
>>>
>>
>> the second set_mce_nospec do nothing and have no side affect.
>>
>> sync_core() is introduced by Tony [1]:
>>
>> Also moved sync_core(). The comments for this function say that it should
>> only be called when instructions have been changed/re-mapped. Recovery for
>> an instruction fetch may change the physical address. But that doesn't happen
>> until the scheduled work runs (which could be on another CPU).
>>
>> [1]https://lore.kernel.org/all/20200824221237.5397-1-tony.luck@intel.com/T/#u
>>
>> IMHO, I think it also has no side affect.
>>
>> @Tony, could you help to confirm this?
> 
> Correct. Re-runing these calls is harmless.

Got it. Thanks both.

> 
> -Tony
> 



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-19 17:11           ` Luck, Tony
@ 2025-02-20 11:19             ` Borislav Petkov
  2025-02-20 17:50               ` Luck, Tony
  0 siblings, 1 reply; 59+ messages in thread
From: Borislav Petkov @ 2025-02-20 11:19 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Shuai Xue, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong

On Wed, Feb 19, 2025 at 05:11:00PM +0000, Luck, Tony wrote:
> We could, but I don't like it much. By taking the page offline from the relatively
> kind environment of a regular interrupt, we often avoid taking a machine check
> (which is an unfriendly environment for software).

Right.

> We could make the action in uc_decode_notifier() configurable. Default=off
> but with a command line option to enable for systems that are stuck with
> broadcast machine checks.

So we can figure that out during boot - no need for yet another cmdline
option.

It still doesn't fix the race and I'd like to fix that instead, in the optimal
case.

But looking at Shuai's patch, I guess fixing the reporting is fine too - we
need to fix the commit message to explain why this thing even happens.

I.e., basically what you wrote and Shuai could use that explanation to write
a commit message explaining what the situation is along with the background so
that when we go back to this later, we will actually know what is going on.

But looking at

  046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")

That thing was trying to fix the same reporting fail. Why didn't it do that?

Ooooh, now I see what the issue is. He doesn't want to kill the process which
gets the wrong SIGBUS. Maybe the commit title should've said that:

  mm/hwpoison: Do not send SIGBUS to processes with recovered clean pages

or so.

But how/why is that ok?

Are we confident that

+        * ret = 0 when poison page is a clean page and it's dropped, no
+        * SIGBUS is needed.

can *always* and *only* happen when there's a CMCI *and* a #MC race and the
CMCI has won the race?

Can memory poison return 0 there too, for another reason and we end up *not
killing* a process which we should have?

Hmmm.

> On Intel that would mean not registering the notifier at all. What about AMD?
> Do you have similar races for MCE_DEFERRED_SEVERITY errors?

Probably. Lemme ask around.

> [1] Some OEMs still do not enable LMCE in their BIOS.

Oh, ofc. Gotta love BIOS. They'll get the message when LMCE becomes obsolete,
trust me.

Are we force-enabling LMCE in this case when booting?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-20 11:19             ` Borislav Petkov
@ 2025-02-20 17:50               ` Luck, Tony
  2025-02-21  6:05                 ` Shuai Xue
  2025-02-24 21:50                 ` Borislav Petkov
  0 siblings, 2 replies; 59+ messages in thread
From: Luck, Tony @ 2025-02-20 17:50 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Shuai Xue, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong

> > We could, but I don't like it much. By taking the page offline from the relatively
> > kind environment of a regular interrupt, we often avoid taking a machine check
> > (which is an unfriendly environment for software).
>
> Right.
>
> > We could make the action in uc_decode_notifier() configurable. Default=off
> > but with a command line option to enable for systems that are stuck with
> > broadcast machine checks.
>
> So we can figure that out during boot - no need for yet another cmdline
> option.

Yup. I think the boot time test might be something like:

	// Enable UCNA offline for systems with broadcast machine check
	if (!(AMD || LMCE))
		mce_register_decode_chain(&mce_uc_nb);
>
> It still doesn't fix the race and I'd like to fix that instead, in the optimal
> case.
>
> But looking at Shuai's patch, I guess fixing the reporting is fine too - we
> need to fix the commit message to explain why this thing even happens.
>
> I.e., basically what you wrote and Shuai could use that explanation to write
> a commit message explaining what the situation is along with the background so
> that when we go back to this later, we will actually know what is going on.

Agreed. Shaui needs to harvest this thread to fill out the details in the commit
messages.

>
> But looking at
>
>   046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")
>
> That thing was trying to fix the same reporting fail. Why didn't it do that?
>
> Ooooh, now I see what the issue is. He doesn't want to kill the process which
> gets the wrong SIGBUS. Maybe the commit title should've said that:
>
>   mm/hwpoison: Do not send SIGBUS to processes with recovered clean pages
>
> or so.
>
> But how/why is that ok?
>
> Are we confident that
>
> +        * ret = 0 when poison page is a clean page and it's dropped, no
> +        * SIGBUS is needed.
>
> can *always* and *only* happen when there's a CMCI *and* a #MC race and the
> CMCI has won the race?

There are probably other races. Two CPUs both take local #MC on the same page
(maybe not all that rare in threaded processes ... or even with some hot code in 
a shared library).

> Can memory poison return 0 there too, for another reason and we end up *not
> killing* a process which we should have?
>
> Hmmm.

Hmmm indeed. Needs some thought. Though failing to kill a process likely means
it retries the access and comes right back to try again (without the race this time).

>
> > On Intel that would mean not registering the notifier at all. What about AMD?
> > Do you have similar races for MCE_DEFERRED_SEVERITY errors?
>
> Probably. Lemme ask around.
>
> > [1] Some OEMs still do not enable LMCE in their BIOS.
>
> Oh, ofc. Gotta love BIOS. They'll get the message when LMCE becomes obsolete,
> trust me.
>
> Are we force-enabling LMCE in this case when booting?

Linux tries to enable if LMCE is supported, but BIOS has veto power.
See the bit in lmce_supported() that checks MSR_IA32_FEAT_CTL

-Tony




^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-20 17:50               ` Luck, Tony
@ 2025-02-21  6:05                 ` Shuai Xue
  2025-02-24 22:01                   ` Borislav Petkov
  2025-02-24 21:50                 ` Borislav Petkov
  1 sibling, 1 reply; 59+ messages in thread
From: Shuai Xue @ 2025-02-21  6:05 UTC (permalink / raw)
  To: Luck, Tony, Borislav Petkov
  Cc: nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa, linmiaohe,
	akpm, peterz, jpoimboe, linux-edac, linux-kernel, linux-mm,
	baolin.wang, tianruidong



在 2025/2/21 01:50, Luck, Tony 写道:
>>> We could, but I don't like it much. By taking the page offline from the relatively
>>> kind environment of a regular interrupt, we often avoid taking a machine check
>>> (which is an unfriendly environment for software).
>>
>> Right.
>>
>>> We could make the action in uc_decode_notifier() configurable. Default=off
>>> but with a command line option to enable for systems that are stuck with
>>> broadcast machine checks.
>>
>> So we can figure that out during boot - no need for yet another cmdline
>> option.
> 
> Yup. I think the boot time test might be something like:
> 
> 	// Enable UCNA offline for systems with broadcast machine check
> 	if (!(AMD || LMCE))
> 		mce_register_decode_chain(&mce_uc_nb);
>>
>> It still doesn't fix the race and I'd like to fix that instead, in the optimal
>> case.
>>
>> But looking at Shuai's patch, I guess fixing the reporting is fine too - we
>> need to fix the commit message to explain why this thing even happens.
>>
>> I.e., basically what you wrote and Shuai could use that explanation to write
>> a commit message explaining what the situation is along with the background so
>> that when we go back to this later, we will actually know what is going on.
> 
> Agreed. Shaui needs to harvest this thread to fill out the details in the commit
> messages.

Sure, I'd like to add more backgroud details with Tony's explanation.

> 
>>
>> But looking at
>>
>>    046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")
>>
>> That thing was trying to fix the same reporting fail. Why didn't it do that?
>>
>> Ooooh, now I see what the issue is. He doesn't want to kill the process which
>> gets the wrong SIGBUS. Maybe the commit title should've said that:
>>
>>    mm/hwpoison: Do not send SIGBUS to processes with recovered clean pages
>>
>> or so.
>>
>> But how/why is that ok?
>>
>> Are we confident that
>>
>> +        * ret = 0 when poison page is a clean page and it's dropped, no
>> +        * SIGBUS is needed.
>>
>> can *always* and *only* happen when there's a CMCI *and* a #MC race and the
>> CMCI has won the race?
> 
> There are probably other races. Two CPUs both take local #MC on the same page
> (maybe not all that rare in threaded processes ... or even with some hot code in
> a shared library).
> 
>> Can memory poison return 0 there too, for another reason and we end up *not
>> killing* a process which we should have?
>>
>> Hmmm.
> 
> Hmmm indeed. Needs some thought. Though failing to kill a process likely means
> it retries the access and comes right back to try again (without the race this time).
> 

Emmm, if two threaded processes consume a poisond data, there may three CPUs
race, two of which take local #MC on the same page and one take CMCI. For,
example:

#perf script
kworker/48:1-mm 25516 [048]  1713.893549: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa25aa93 uc_decode_notifier+0x73 ([kernel.kallsyms])
         ffffffffaa3068bb notifier_call_chain+0x5b ([kernel.kallsyms])
         ffffffffaa306ae1 blocking_notifier_call_chain+0x41 ([kernel.kallsyms])
         ffffffffaa25bbfe mce_gen_pool_process+0x3e ([kernel.kallsyms])
         ffffffffaa2f455f process_one_work+0x19f ([kernel.kallsyms])
         ffffffffaa2f509c worker_thread+0x20c ([kernel.kallsyms])
         ffffffffaa2fec89 kthread+0xd9 ([kernel.kallsyms])
         ffffffffaa245131 ret_from_fork+0x31 ([kernel.kallsyms])
         ffffffffaa2076ca ret_from_fork_asm+0x1a ([kernel.kallsyms])

einj_mem_uc 44530 [184]  1713.908089: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 44531 [089]  1713.916319: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

It seems to complicate the issue further.

IMHO, we should focus on three main points:

- kill_accessing_process() is only called when the flags are set to
   MF_ACTION_REQUIRED, which means it is in the MCE path.
- Whether the page is clean determines the behavior of try_to_unmap. For a
   dirty page, try_to_unmap uses TTU_HWPOISON to unmap the PTE and convert the
   PTE entry to a swap entry. For a clean page, try_to_unmap uses ~TTU_HWPOISON
   and simply unmaps the PTE.
- When does walk_page_range() with hwpoison_walk_ops return 1?
   1. If the poison page still exists, we should of course kill the current
      process.
   2. If the poison page does not exist, but is_hwpoison_entry is true, meaning
      it is a dirty page, we should also kill the current process, too.
   3. Otherwise, it returns 0, which means the page is clean.


Thanks.
Shuai


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 3/5] x86/mce: add EX_TYPE_EFAULT_REG as in-kernel recovery context to fix copy-from-user operations regression
  2025-02-19 10:40           ` Peter Zijlstra
@ 2025-02-21  6:52             ` Shuai Xue
  0 siblings, 0 replies; 59+ messages in thread
From: Shuai Xue @ 2025-02-21  6:52 UTC (permalink / raw)
  To: Peter Zijlstra, Borislav Petkov, Luck, Tony
  Cc: nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa, linmiaohe,
	akpm, jpoimboe, linux-edac, linux-kernel, linux-mm, baolin.wang,
	tianruidong



在 2025/2/19 18:40, Peter Zijlstra 写道:
> On Tue, Feb 18, 2025 at 05:48:00PM +0100, Borislav Petkov wrote:
>> On Tue, Feb 18, 2025 at 03:15:35PM +0100, Peter Zijlstra wrote:
>>> diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
>>> index dac4d64dfb2a..cfdae25eacd7 100644
>>> --- a/arch/x86/kernel/cpu/mce/severity.c
>>> +++ b/arch/x86/kernel/cpu/mce/severity.c
>>> @@ -301,18 +301,19 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
>>>   	instrumentation_end();
>>>   
>>>   	switch (fixup_type) {
>>> -	case EX_TYPE_UACCESS:
>>> -		if (!copy_user)
>>> -			return IN_KERNEL;
>>> -		m->kflags |= MCE_IN_KERNEL_COPYIN;
>>> -		fallthrough;
>>> -
>>>   	case EX_TYPE_FAULT_MCE_SAFE:
>>>   	case EX_TYPE_DEFAULT_MCE_SAFE:
>>>   		m->kflags |= MCE_IN_KERNEL_RECOV;
>>>   		return IN_KERNEL_RECOV;
>>>   
>>>   	default:
>>> +		if (copy_user) {
>>
>> As said on chat, if we can make is_copy_from_user() *always* correctly detect
>> user access, then sure but I'm afraid EX_TYPE_UACCESS being generated at the
>> handful places where we do user memory access is there for a reason as it
>> makes it pretty explicit.
> 
> Thing is, we have copy routines that do not know if its user or not.
> is_copy_from_user() must be reliable.
> 
> Anyway, if you all really want to go all funny, try the below.
> 
> Someone has to go and stick some EX_FLAG_USER on things, but I just
> really don't believe that's doing to be useful. Because while you're
> doing that, you should also audit if is_copy_from_user() will catch it
> and if it does, you don't need the tag.
> 
> See how much tags you end up with..

Agreed, I think the key point whether the error context is in a read from user
memory. We do not care about the ex-type if we know its a MOV
reading from userspace.

is_copy_from_user() return true when both of the following two checks are
true:

- the current instruction is copy
- source address is user memory

If copy_user is true, we set

m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_RECOV;

Then do_machine_check will try fixup_exception first.

	/*
	 * Handle an MCE which has happened in kernel space but from
	 * which the kernel can recover: ex_has_fault_handler() has
	 * already verified that the rIP at which the error happened is
	 * a rIP from which the kernel can recover (by jumping to
	 * recovery code specified in _ASM_EXTABLE_FAULT()) and the
	 * corresponding exception handler which would do that is the
	 * proper one.
	 */
	if (m->kflags & MCE_IN_KERNEL_RECOV) {
		if (!fixup_exception(regs, X86_TRAP_MC, 0, 0))
			mce_panic("Failed kernel mode recovery", &err, msg);
	}

	if (m->kflags & MCE_IN_KERNEL_COPYIN)
		queue_task_work(&err, msg, kill_me_never);

So Peter's code is fine to me.

---
diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
index dac4d64dfb2a..cb021058165f 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -300,13 +300,12 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
  	copy_user  = is_copy_from_user(regs);
  	instrumentation_end();
  
-	switch (fixup_type) {
-	case EX_TYPE_UACCESS:
-		if (!copy_user)
-			return IN_KERNEL;
-		m->kflags |= MCE_IN_KERNEL_COPYIN;
-		fallthrough;
+	if (copy_user) {
+		m->kflags |= MCE_IN_KERNEL_COPYIN | MCE_IN_KERNEL_COPYIN;
+		return IN_KERNEL_RECOV
+	}
  
+	switch (fixup_type) {
  	case EX_TYPE_FAULT_MCE_SAFE:
  	case EX_TYPE_DEFAULT_MCE_SAFE:
  		m->kflags |= MCE_IN_KERNEL_RECOV;


Is that ok? Please correct me if I missed anyting.

Thanks.
Shuai


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-20 17:50               ` Luck, Tony
  2025-02-21  6:05                 ` Shuai Xue
@ 2025-02-24 21:50                 ` Borislav Petkov
  1 sibling, 0 replies; 59+ messages in thread
From: Borislav Petkov @ 2025-02-24 21:50 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Shuai Xue, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong

On Thu, Feb 20, 2025 at 05:50:14PM +0000, Luck, Tony wrote:
> Agreed. Shaui needs to harvest this thread to fill out the details in the commit
> messages.

Yap.

> There are probably other races. Two CPUs both take local #MC on the same page
> (maybe not all that rare in threaded processes ... or even with some hot code in 
> a shared library).

Yap, exactly. And I think there's nothing we can do - the hw is out there so
the sw needs to handle them cases correctly.

> Hmmm indeed. Needs some thought. Though failing to kill a process likely means
> it retries the access and comes right back to try again (without the race this time).

What happens if it fails to kill the process? It'll return to it, it'll try to
touch the faulty memory and raise another #MC? Right, I think so.


> > > On Intel that would mean not registering the notifier at all. What about AMD?
> > > Do you have similar races for MCE_DEFERRED_SEVERITY errors?
> >
> > Probably. Lemme ask around.

After talking to folks internally, yeah, I think we'll probably have a similar
thing. Haven't seen it happen yet.

> Linux tries to enable if LMCE is supported, but BIOS has veto power.
> See the bit in lmce_supported() that checks MSR_IA32_FEAT_CTL

I'm trying to educate our hw folks to not rely on OEM BIOS if possible. For
every chance I get. Otherwise you get crap like that and this is never getting
better.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-21  6:05                 ` Shuai Xue
@ 2025-02-24 22:01                   ` Borislav Petkov
  2025-02-25  1:51                     ` Shuai Xue
  0 siblings, 1 reply; 59+ messages in thread
From: Borislav Petkov @ 2025-02-24 22:01 UTC (permalink / raw)
  To: Shuai Xue
  Cc: Luck, Tony, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong

On Fri, Feb 21, 2025 at 02:05:28PM +0800, Shuai Xue wrote:
> #perf script
> kworker/48:1-mm 25516 [048]  1713.893549: probe:memory_failure: (ffffffffaa622db4)
>         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
>         ffffffffaa25aa93 uc_decode_notifier+0x73 ([kernel.kallsyms])
>         ffffffffaa3068bb notifier_call_chain+0x5b ([kernel.kallsyms])
>         ffffffffaa306ae1 blocking_notifier_call_chain+0x41 ([kernel.kallsyms])
>         ffffffffaa25bbfe mce_gen_pool_process+0x3e ([kernel.kallsyms])
>         ffffffffaa2f455f process_one_work+0x19f ([kernel.kallsyms])
>         ffffffffaa2f509c worker_thread+0x20c ([kernel.kallsyms])
>         ffffffffaa2fec89 kthread+0xd9 ([kernel.kallsyms])
>         ffffffffaa245131 ret_from_fork+0x31 ([kernel.kallsyms])
>         ffffffffaa2076ca ret_from_fork_asm+0x1a ([kernel.kallsyms])
> 
> einj_mem_uc 44530 [184]  1713.908089: probe:memory_failure: (ffffffffaa622db4)
>         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
>         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
>         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
>         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
>         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
>         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
>                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)
> 
> einj_mem_uc 44531 [089]  1713.916319: probe:memory_failure: (ffffffffaa622db4)
>         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
>         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
>         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
>         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
>         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
>         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
>                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

What are those stack traces supposed to say?

Two processes are injecting, cause a #MC and a kworker gets to handle the UC?

All injecting to the same page?

What's the upper limit on CPUs seeing the same hw error and all raising
a CMCI/#MC?

> - kill_accessing_process() is only called when the flags are set to
>   MF_ACTION_REQUIRED, which means it is in the MCE path.
> - Whether the page is clean determines the behavior of try_to_unmap. For a
>   dirty page, try_to_unmap uses TTU_HWPOISON to unmap the PTE and convert the
>   PTE entry to a swap entry. For a clean page, try_to_unmap uses ~TTU_HWPOISON
>   and simply unmaps the PTE.
> - When does walk_page_range() with hwpoison_walk_ops return 1?
>   1. If the poison page still exists, we should of course kill the current
>      process.
>   2. If the poison page does not exist, but is_hwpoison_entry is true, meaning
>      it is a dirty page, we should also kill the current process, too.
>   3. Otherwise, it returns 0, which means the page is clean.

I think you're too deep into detail. What I'd do is step back, think what
would be the *proper* recovery action and then make sure memory_failure does
that. If it doesn't - fix it to do so.

So, what should really happen wrt recovery action if any number of CPUs see
the same memory error?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-24 22:01                   ` Borislav Petkov
@ 2025-02-25  1:51                     ` Shuai Xue
  2025-02-28 12:35                       ` Borislav Petkov
  0 siblings, 1 reply; 59+ messages in thread
From: Shuai Xue @ 2025-02-25  1:51 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong



在 2025/2/25 06:01, Borislav Petkov 写道:
> On Fri, Feb 21, 2025 at 02:05:28PM +0800, Shuai Xue wrote:
>> #perf script
>> kworker/48:1-mm 25516 [048]  1713.893549: probe:memory_failure: (ffffffffaa622db4)
>>          ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
>>          ffffffffaa25aa93 uc_decode_notifier+0x73 ([kernel.kallsyms])
>>          ffffffffaa3068bb notifier_call_chain+0x5b ([kernel.kallsyms])
>>          ffffffffaa306ae1 blocking_notifier_call_chain+0x41 ([kernel.kallsyms])
>>          ffffffffaa25bbfe mce_gen_pool_process+0x3e ([kernel.kallsyms])
>>          ffffffffaa2f455f process_one_work+0x19f ([kernel.kallsyms])
>>          ffffffffaa2f509c worker_thread+0x20c ([kernel.kallsyms])
>>          ffffffffaa2fec89 kthread+0xd9 ([kernel.kallsyms])
>>          ffffffffaa245131 ret_from_fork+0x31 ([kernel.kallsyms])
>>          ffffffffaa2076ca ret_from_fork_asm+0x1a ([kernel.kallsyms])
>>
>> einj_mem_uc 44530 [184]  1713.908089: probe:memory_failure: (ffffffffaa622db4)
>>          ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
>>          ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
>>          ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
>>          ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
>>          ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
>>          ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
>>                    405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)
>>
>> einj_mem_uc 44531 [089]  1713.916319: probe:memory_failure: (ffffffffaa622db4)
>>          ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
>>          ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
>>          ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
>>          ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
>>          ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
>>          ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
>>                    405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)
> 
> What are those stack traces supposed to say?
> 
> Two processes are injecting, cause a #MC and a kworker gets to handle the UC?
> 
> All injecting to the same page?

Yes, I inject poison to a page and create two process with pthread_create() which
trigger the same poison page.

> 
> What's the upper limit on CPUs seeing the same hw error and all raising
> a CMCI/#MC?

It depends on the forked process which trying to read the poison.

> 
>> - kill_accessing_process() is only called when the flags are set to
>>    MF_ACTION_REQUIRED, which means it is in the MCE path.
>> - Whether the page is clean determines the behavior of try_to_unmap. For a
>>    dirty page, try_to_unmap uses TTU_HWPOISON to unmap the PTE and convert the
>>    PTE entry to a swap entry. For a clean page, try_to_unmap uses ~TTU_HWPOISON
>>    and simply unmaps the PTE.
>> - When does walk_page_range() with hwpoison_walk_ops return 1?
>>    1. If the poison page still exists, we should of course kill the current
>>       process.
>>    2. If the poison page does not exist, but is_hwpoison_entry is true, meaning
>>       it is a dirty page, we should also kill the current process, too.
>>    3. Otherwise, it returns 0, which means the page is clean.
> 
> I think you're too deep into detail. What I'd do is step back, think what
> would be the *proper* recovery action and then make sure memory_failure does
> that. If it doesn't - fix it to do so.
> 
> So, what should really happen wrt recovery action if any number of CPUs see
> the same memory error?
> 

IMHO, we should send a SIGBUS signal to the processes running on the CPUs that
detect a memory error for dirty page, which is the current behavior in the
memory_failure.

Thanks
Shuai


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-25  1:51                     ` Shuai Xue
@ 2025-02-28 12:35                       ` Borislav Petkov
  2025-03-01  5:54                         ` Shuai Xue
  0 siblings, 1 reply; 59+ messages in thread
From: Borislav Petkov @ 2025-02-28 12:35 UTC (permalink / raw)
  To: Shuai Xue
  Cc: Luck, Tony, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong

On Tue, Feb 25, 2025 at 09:51:25AM +0800, Shuai Xue wrote:
> It depends on the forked process which trying to read the poison.

And? Can you try creating more processes and see what happens then?

> IMHO, we should send a SIGBUS signal to the processes running on the CPUs that
> detect a memory error for dirty page, which is the current behavior in the
> memory_failure.

And for all those other processes which do get to see the already
poisoned/clean page, they should continue on their merry way instead of
getting killed by a SIGBUS?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 2/5] x86/mce: dump error msg from severities
  2025-02-17  6:33 ` [PATCH v2 2/5] x86/mce: dump error msg from severities Shuai Xue
@ 2025-02-28 12:37   ` Borislav Petkov
  2025-03-01  6:16     ` Shuai Xue
  0 siblings, 1 reply; 59+ messages in thread
From: Borislav Petkov @ 2025-02-28 12:37 UTC (permalink / raw)
  To: Shuai Xue
  Cc: tony.luck, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong

On Mon, Feb 17, 2025 at 02:33:32PM +0800, Shuai Xue wrote:
> The message in severities is useful for identifying the type of MCE that
> has occurred; dump it if it is valid.

Needs more explanation as to what "useful" means. We already log and report
MCEs in gazillion ways.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
  2025-02-28 12:35                       ` Borislav Petkov
@ 2025-03-01  5:54                         ` Shuai Xue
  0 siblings, 0 replies; 59+ messages in thread
From: Shuai Xue @ 2025-03-01  5:54 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong



在 2025/2/28 20:35, Borislav Petkov 写道:
> On Tue, Feb 25, 2025 at 09:51:25AM +0800, Shuai Xue wrote:
>> It depends on the forked process which trying to read the poison.
> 
> And? Can you try creating more processes and see what happens then?
> 

Sure.

The experimental model includes:

1. inject UE to a memory buffer
2. create 10 processes
3. all 10 process read the posioned buffer
4. 10 MCEs and 1 UCNA will be triggered
5. each process receives a SIGBUS

Some details:

#perf record -e probe:memory_failure -agR -- ./einj_mem_uc thread
0: thread   vaddr = 0x7f65f08da400 paddr = 82702ec400
injecting ...
>> trigger_thread
>> trigger_thread
>> trigger_thread
>> trigger_thread
>> trigger_thread
>> trigger_thread
>> trigger_thread
>> trigger_thread
>> trigger_thread
>> trigger_thread
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
signal 7 code 4 addr 0x7f65f08da000
page not present
Unusual number of MCEs seen: 10
Test passed
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.640 MB perf.data (11 samples) ]


#perf script
einj_mem_uc 1722254 [151] 695128.161644: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722255 [014] 695128.161712: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722256 [153] 695128.161716: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722257 [124] 695128.161759: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722258 [154] 695128.161782: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722259 [026] 695128.161819: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722260 [157] 695128.161852: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722261 [158] 695128.161895: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

kworker/50:3-mm 1714430 [050] 695128.168736: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa25aa93 uc_decode_notifier+0x73 ([kernel.kallsyms])
         ffffffffaa3068bb notifier_call_chain+0x5b ([kernel.kallsyms])
         ffffffffaa306ae1 blocking_notifier_call_chain+0x41 ([kernel.kallsyms])
         ffffffffaa25bbfe mce_gen_pool_process+0x3e ([kernel.kallsyms])
         ffffffffaa2f455f process_one_work+0x19f ([kernel.kallsyms])
         ffffffffaa2f509c worker_thread+0x20c ([kernel.kallsyms])
         ffffffffaa2fec89 kthread+0xd9 ([kernel.kallsyms])
         ffffffffaa245131 ret_from_fork+0x31 ([kernel.kallsyms])
         ffffffffaa2076ca ret_from_fork_asm+0x1a ([kernel.kallsyms])

einj_mem_uc 1722252 [050] 695128.183025: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 1722253 [051] 695128.191348: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

>> IMHO, we should send a SIGBUS signal to the processes running on the CPUs that
>> detect a memory error for dirty page, which is the current behavior in the
>> memory_failure.
> 
> And for all those other processes which do get to see the already
> poisoned/clean page, they should continue on their merry way instead of
> getting killed by a SIGBUS?
> 

Yes, memory_failure() only sends a SIGBUS signal to the process that
is actively reading a poisoned page. Other processes that share the
poisoned page will not receive a SIGBUS signal unless they have the
PF_MCE_EARLY flag set.[1]

[1]https://lkml.kernel.org/r/20220218090118.1105-4-linmiaohe@huawei.com

Thanks.
Shuai




^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 2/5] x86/mce: dump error msg from severities
  2025-02-28 12:37   ` Borislav Petkov
@ 2025-03-01  6:16     ` Shuai Xue
  2025-03-01 11:10       ` Borislav Petkov
  0 siblings, 1 reply; 59+ messages in thread
From: Shuai Xue @ 2025-03-01  6:16 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: tony.luck, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong



在 2025/2/28 20:37, Borislav Petkov 写道:
> On Mon, Feb 17, 2025 at 02:33:32PM +0800, Shuai Xue wrote:
>> The message in severities is useful for identifying the type of MCE that
>> has occurred; dump it if it is valid.
> 
> Needs more explanation as to what "useful" means. We already log and report
> MCEs in gazillion ways.
> 

You are right.

While MCE information is indeed decoded by the EDAC and APEI drivers,
the decoding is limited to the interpretation of hardware registers
and lacks the contextual details of the error.
  
For instance, it does not specify whether the error occurred in the
context of IN_KERNEL or IN_KERNEL_RECOV, which are crucial for
understanding the error's circumstances.

For the regression cases (copy from user) in Patch 3, an error message

     "mce: Action required: data load in error recoverable area of kernel"

will be added if this patch is applied.

I could add more explanations in next version if you have no objection.

Thanks.
Shuai


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 2/5] x86/mce: dump error msg from severities
  2025-03-01  6:16     ` Shuai Xue
@ 2025-03-01 11:10       ` Borislav Petkov
  2025-03-01 14:03         ` Shuai Xue
  0 siblings, 1 reply; 59+ messages in thread
From: Borislav Petkov @ 2025-03-01 11:10 UTC (permalink / raw)
  To: Shuai Xue
  Cc: tony.luck, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong

On Sat, Mar 01, 2025 at 02:16:12PM +0800, Shuai Xue wrote:
> For instance, it does not specify whether the error occurred in the
> context of IN_KERNEL or IN_KERNEL_RECOV, which are crucial for
> understanding the error's circumstances.

1. Crucial for whom? For you? Or for users?

You need to explain how this error message is going to be used. Because simply
issuing such a message causes a lot of panicked people calling a lot of admins
to figure out why their machine is broken. Because they see "mce" and think
"hw broken, need to replace it immediately."

This is one of the reasons we did the cec.c thing - just to save people from
panicking unnecessarily and causing expensive and useless maintenance calls.

2. This message goes to dmesg which means something needs to parse it, beside
   a human. An AI?

3. Dmesg is a ring buffer which gets overwritten and this message is
   eventually lost

There's a reason why MCEs get logged with the notifiers and through
a tracepoint - so that agents can act upon them properly.

And we have had this discussion for years now - I'm sorry that you're late to
the party.

> For the regression cases (copy from user) in Patch 3, an error message
> 
>     "mce: Action required: data load in error recoverable area of kernel"

See above.

Besides, this message is completely useless as it has no concrete info about
the error and what is being done about it.

> I could add more explanations in next version if you have no objection.

All of the above are objections.

Please go into git history and read why we're avoiding dumping useless
messages instead of proposing silly patches.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 2/5] x86/mce: dump error msg from severities
  2025-03-01 11:10       ` Borislav Petkov
@ 2025-03-01 14:03         ` Shuai Xue
  2025-03-01 18:47           ` Borislav Petkov
  0 siblings, 1 reply; 59+ messages in thread
From: Shuai Xue @ 2025-03-01 14:03 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: tony.luck, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong



在 2025/3/1 19:10, Borislav Petkov 写道:
> On Sat, Mar 01, 2025 at 02:16:12PM +0800, Shuai Xue wrote:
>> For instance, it does not specify whether the error occurred in the
>> context of IN_KERNEL or IN_KERNEL_RECOV, which are crucial for
>> understanding the error's circumstances.
> 
> 1. Crucial for whom? For you? Or for users?
> 
> You need to explain how this error message is going to be used. Because simply
> issuing such a message causes a lot of panicked people calling a lot of admins
> to figure out why their machine is broken. Because they see "mce" and think
> "hw broken, need to replace it immediately."
> 
> This is one of the reasons we did the cec.c thing - just to save people from
> panicking unnecessarily and causing expensive and useless maintenance calls.


For me, and cloud providers which maintains million servers.

(By the way, Cenots/Redhat build kernel without CONFIG_RAS_CEC set, becase
it breaks EDAC decoding. We do not use CEC in production at all for the same
reasion.)

> 
> 2. This message goes to dmesg which means something needs to parse it, beside
>     a human. An AI?

Yes, we collect all kernel message from host, parse the logs and predict panic
with AI tools. The more details we collect, the better the performance of
the AI model.

> 
> 3. Dmesg is a ring buffer which gets overwritten and this message is
>     eventually lost
> 
> There's a reason why MCEs get logged with the notifiers and through
> a tracepoint - so that agents can act upon them properly.
> 
> And we have had this discussion for years now - I'm sorry that you're late to
> the party.

Agreed, tracepoint is a more elegant way. However, it does not include error context,
just some hardware registers.

> 
>> For the regression cases (copy from user) in Patch 3, an error message
>>
>>      "mce: Action required: data load in error recoverable area of kernel"
> 
> See above.
> 
> Besides, this message is completely useless as it has no concrete info about
> the error and what is being done about it.

I don't think so,

"Action required" means MCI_UC_AR
"data load" means MCACOD_DATA
"recoverable area of kernel" means KERNEL_RECOV

It is more readable and concrete than "Uncorrected hardware memory error", e.g.
message in kill_me_maybe():

     "mce: Uncorrected hardware memory error in user-access at 3b116c400"

> 
>> I could add more explanations in next version if you have no objection.
> 
> All of the above are objections.
> 
> Please go into git history and read why we're avoiding dumping useless
> messages instead of proposing silly patches.
> 

Anyway, I respect the maintainer's opinion.

Thanks
Shuai



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 2/5] x86/mce: dump error msg from severities
  2025-03-01 14:03         ` Shuai Xue
@ 2025-03-01 18:47           ` Borislav Petkov
  2025-03-02  7:14             ` Shuai Xue
  2025-03-03 16:49             ` Luck, Tony
  0 siblings, 2 replies; 59+ messages in thread
From: Borislav Petkov @ 2025-03-01 18:47 UTC (permalink / raw)
  To: Shuai Xue
  Cc: tony.luck, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong

On Sat, Mar 01, 2025 at 10:03:13PM +0800, Shuai Xue wrote:
> (By the way, Cenots/Redhat build kernel without CONFIG_RAS_CEC set, becase
> it breaks EDAC decoding. We do not use CEC in production at all for the same
> reasion.)

It doesn't "break" error decoding - it collects every correctable DRAM error
and puts it in "leaky" bucket of sorts. And when a certain error address
generates too many errors, it memory_failure()s the page and poisons it.

You do not use it in production because you want to see every error, collect
it, massage it and perhaps decide when DIMMs go bad and you can replace
them... or whatever you do.

All the others who enable it and we can sleep properly, without getting
unnecessarily upset about a correctable error.

> Yes, we collect all kernel message from host, parse the logs and predict panic
> with AI tools. The more details we collect, the better the performance of
> the AI model.

LOL.

We go the great effort of going a MCE tracepoint which gives a *structured*
error record, show an example how to use
it in rasdaemon and you go and do the crazy hard and, at the same time, silly
thing and parse dmesg?!??!

This is priceless. Oh boy.

> Agreed, tracepoint is a more elegant way. However, it does not include error
> context, just some hardware registers.

The error context is in the behavior of the hw. If the error is fatal, you
won't see it - the machine will panic or do something else to prevent error
propagation. It definitely won't run any software anymore.

If you see the error getting logged, it means it is not fatal enough to kill
the machine.

> > Besides, this message is completely useless as it has no concrete info about
> > the error and what is being done about it.
> 
> I don't think so,

I think so and you're not reading my mail.

>     "mce: Uncorrected hardware memory error in user-access at 3b116c400"

Ask yourself: what can you do when you see a message like that?

Exactly *nothing* because there's not nearly enough information to recover
from it or log it or whatever. That error message is *totally useless* and
you're upsetting your users unnecessarily and even if they report it to you,
you can't help them.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 2/5] x86/mce: dump error msg from severities
  2025-03-01 18:47           ` Borislav Petkov
@ 2025-03-02  7:14             ` Shuai Xue
  2025-03-02  7:37               ` Borislav Petkov
  2025-03-03 16:49             ` Luck, Tony
  1 sibling, 1 reply; 59+ messages in thread
From: Shuai Xue @ 2025-03-02  7:14 UTC (permalink / raw)
  To: Borislav Petkov, Luck, Tony
  Cc: nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa, linmiaohe,
	akpm, peterz, jpoimboe, linux-edac, linux-kernel, linux-mm,
	baolin.wang, tianruidong



在 2025/3/2 02:47, Borislav Petkov 写道:
> On Sat, Mar 01, 2025 at 10:03:13PM +0800, Shuai Xue wrote:
>> (By the way, Cenots/Redhat build kernel without CONFIG_RAS_CEC set, becase
>> it breaks EDAC decoding. We do not use CEC in production at all for the same
>> reasion.)
> 
> It doesn't "break" error decoding - it collects every correctable DRAM error
> and puts it in "leaky" bucket of sorts. And when a certain error address
> generates too many errors, it memory_failure()s the page and poisons it.
> 
> You do not use it in production because you want to see every error, collect
> it, massage it and perhaps decide when DIMMs go bad and you can replace
> them... or whatever you do.
> 
> All the others who enable it and we can sleep properly, without getting
> unnecessarily upset about a correctable error.

Yes, we want to see event CE error and use the CE pattern (e.g. correctable
error-bit)[1][2] to  predict whether a row fault is prone to UEs or not.
And we are not upset to CE error, becasue it have corrected by hardware :)

[1]https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fault-aware-prediction-guide.pdf
[2]https://arxiv.org/html/2312.02855v2

> 
>> Yes, we collect all kernel message from host, parse the logs and predict panic
>> with AI tools. The more details we collect, the better the performance of
>> the AI model.
> 
> LOL.
> 
> We go the great effort of going a MCE tracepoint which gives a *structured*
> error record, show an example how to use
> it in rasdaemon and you go and do the crazy hard and, at the same time, silly
> thing and parse dmesg?!??!
> 
> This is priceless. Oh boy.
> 
>> Agreed, tracepoint is a more elegant way. However, it does not include error
>> context, just some hardware registers.
> 
> The error context is in the behavior of the hw. If the error is fatal, you
> won't see it - the machine will panic or do something else to prevent error
> propagation. It definitely won't run any software anymore.
> 
> If you see the error getting logged, it means it is not fatal enough to kill
> the machine.

Agreed.

> 
>>> Besides, this message is completely useless as it has no concrete info about
>>> the error and what is being done about it.
>>
>> I don't think so,
> 
> I think so and you're not reading my mail.
> 
>>      "mce: Uncorrected hardware memory error in user-access at 3b116c400"

It is the current message in kill_me_maybe(), not added by me.

> 
> Ask yourself: what can you do when you see a message like that?
> 
> Exactly *nothing* because there's not nearly enough information to recover
> from it or log it or whatever. That error message is *totally useless* and
> you're upsetting your users unnecessarily and even if they report it to you,
> you can't help them.
> 

I believe we are approaching this issue from different perspectives.
As a cloud service provider, I need to address the following points:

1. I must be able to explain to end users why the MCE has occurred.
2. It is important to determine whether there are any kernel bugs that could
    compromise the overall stability of the cloud platform.
3. We need to identify and implement potential improvements.

"mce: Uncorrected hardware memory error in user-access at 3b116c400"

is *nothing* but

"mce: Action required: data load in error recoverable area of kernel"

helps.


Thanks for your time.
Shuai


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 2/5] x86/mce: dump error msg from severities
  2025-03-02  7:14             ` Shuai Xue
@ 2025-03-02  7:37               ` Borislav Petkov
  2025-03-02  9:13                 ` Shuai Xue
  0 siblings, 1 reply; 59+ messages in thread
From: Borislav Petkov @ 2025-03-02  7:37 UTC (permalink / raw)
  To: Shuai Xue
  Cc: Luck, Tony, nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa,
	linmiaohe, akpm, peterz, jpoimboe, linux-edac, linux-kernel,
	linux-mm, baolin.wang, tianruidong

On Sun, Mar 02, 2025 at 03:14:52PM +0800, Shuai Xue wrote:
> > >      "mce: Uncorrected hardware memory error in user-access at 3b116c400"
> 
> It is the current message in kill_me_maybe(), not added by me.

Doesn't change the fact that it is not really helpful when it comes to logging
all errors properly.

  [ Properly means using a structured log format with the tracepoint and not
    dumping it into dmesg. ]

And figuring out what hw is failing so that it can be replaced. No one has
come with a real need for making it better, more useful.

You're coming with what I think is such a need and I'm trying to explain to
you what needs to be done. But you want to feed your AI with dmesg and solve
it this way.

If you wanna do it right, we can talk. Otherwise, have fun.

> 3. We need to identify and implement potential improvements.
> 
> "mce: Uncorrected hardware memory error in user-access at 3b116c400"
> 
> is *nothing* but
> 
> "mce: Action required: data load in error recoverable area of kernel"
> 
> helps.

I don't think you've read what I wrote but that's ok. If you think it helps,
you can keep it in your kernels.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 2/5] x86/mce: dump error msg from severities
  2025-03-02  7:37               ` Borislav Petkov
@ 2025-03-02  9:13                 ` Shuai Xue
  0 siblings, 0 replies; 59+ messages in thread
From: Shuai Xue @ 2025-03-02  9:13 UTC (permalink / raw)
  To: Borislav Petkov, Luck, Tony
  Cc: nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa, linmiaohe,
	akpm, peterz, jpoimboe, linux-edac, linux-kernel, linux-mm,
	baolin.wang, tianruidong



在 2025/3/2 15:37, Borislav Petkov 写道:
> On Sun, Mar 02, 2025 at 03:14:52PM +0800, Shuai Xue wrote:
>>>>       "mce: Uncorrected hardware memory error in user-access at 3b116c400"
>>
>> It is the current message in kill_me_maybe(), not added by me.
> 
> Doesn't change the fact that it is not really helpful when it comes to logging
> all errors properly.
> 
>    [ Properly means using a structured log format with the tracepoint and not
>      dumping it into dmesg. ]
> 
> And figuring out what hw is failing so that it can be replaced. No one has
> come with a real need for making it better, more useful.
> 
> You're coming with what I think is such a need and I'm trying to explain to
> you what needs to be done. But you want to feed your AI with dmesg and solve
> it this way.
> 
> If you wanna do it right, we can talk. Otherwise, have fun.

I see. So I am just curious why we define `msg` in `severities`?

I perfer to use structured log format with the tracepoint, and we do use it in
production, but it lacks of process context.

AMD folks add error message for panic errors[1] to help debugging
in which the EDAC driver is not able to decode.

For non-fatal errors, is it reasonable to assume that all users are using
tracepoint-based tools like Rasdaemon?

[1]https://lore.kernel.org/all/20220405183212.354606-1-carlos.bilbao@amd.com/

> 
>> 3. We need to identify and implement potential improvements.
>>
>> "mce: Uncorrected hardware memory error in user-access at 3b116c400"
>>
>> is *nothing* but
>>
>> "mce: Action required: data load in error recoverable area of kernel"
>>
>> helps.
> 
> I don't think you've read what I wrote but that's ok. If you think it helps,
> you can keep it in your kernels.
> 

Fine, I could drop patch 1 and 2 in next version.

Thanks.
Shuai


^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH v2 2/5] x86/mce: dump error msg from severities
  2025-03-01 18:47           ` Borislav Petkov
  2025-03-02  7:14             ` Shuai Xue
@ 2025-03-03 16:49             ` Luck, Tony
  2025-03-03 18:08               ` Yazen Ghannam
  2025-03-05  1:50               ` Shuai Xue
  1 sibling, 2 replies; 59+ messages in thread
From: Luck, Tony @ 2025-03-03 16:49 UTC (permalink / raw)
  To: Borislav Petkov, Shuai Xue, Yazen.Ghannam@amd.com
  Cc: nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa, linmiaohe,
	akpm, peterz, jpoimboe, linux-edac, linux-kernel, linux-mm,
	baolin.wang, tianruidong

> The error context is in the behavior of the hw. If the error is fatal, you
> won't see it - the machine will panic or do something else to prevent error
> propagation. It definitely won't run any software anymore.
>
> If you see the error getting logged, it means it is not fatal enough to kill
> the machine.

One place in the fatal case where I would like to see more information is the

  "Action required: data load in error *UN*recoverable area of kernel"

[emphasis on the "UN" added].

case.  We have a few places where the kernel does recover. And most places
we crash. Our code for the recoverable cases is fragile. Most of this series is
about repairing regressions where we used to recover from places where kernel
is doing get_user() or copy_from_user() which can be recovered if those places
get an error return and the kernel kills the process instead of crashing.

A long time ago I posted some patches to include a stack trace for this type
of crash. It didn't make it into the kernel, and I got distracted by other things.

If we had that, it would have been easier to diagnose this regression (Shaui
Xie would have seen crashes with a stack trace pointing to code that used
to recover in older kernels). Folks with big clusters would also be able to
point out other places where the kernel crashes often enough that additional
EXTABLE recovery paths would be worth investigating.

So:

1) We need to fix the regressions. That just needs new commit messages
for these patches that explain the issue better.

2) I'd like to see a patch for a stack trace for the unrecoverable case.

3) I don't see much value in a message that reports the recoverable case.

Yazen: At one point I think you said you were looking at adding additional
decorations to the return value from mce_severity() to indicate actions
needed for recoverable errors (kill the process, offline the page) rather
than have do_machine_check() figure it out by looking at various fields
in the "struct mce". Did that go anywhere? Those extra details might be
interesting in the tracepoint.

-Tony


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 2/5] x86/mce: dump error msg from severities
  2025-03-03 16:49             ` Luck, Tony
@ 2025-03-03 18:08               ` Yazen Ghannam
  2025-03-05  1:50               ` Shuai Xue
  1 sibling, 0 replies; 59+ messages in thread
From: Yazen Ghannam @ 2025-03-03 18:08 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, Shuai Xue, nao.horiguchi, tglx, mingo,
	dave.hansen, x86, hpa, linmiaohe, akpm, peterz, jpoimboe,
	linux-edac, linux-kernel, linux-mm, baolin.wang, tianruidong

On Mon, Mar 03, 2025 at 04:49:25PM +0000, Luck, Tony wrote:
> > The error context is in the behavior of the hw. If the error is fatal, you
> > won't see it - the machine will panic or do something else to prevent error
> > propagation. It definitely won't run any software anymore.
> >
> > If you see the error getting logged, it means it is not fatal enough to kill
> > the machine.
> 
> One place in the fatal case where I would like to see more information is the
> 
>   "Action required: data load in error *UN*recoverable area of kernel"
> 
> [emphasis on the "UN" added].
> 
> case.  We have a few places where the kernel does recover. And most places
> we crash. Our code for the recoverable cases is fragile. Most of this series is
> about repairing regressions where we used to recover from places where kernel
> is doing get_user() or copy_from_user() which can be recovered if those places
> get an error return and the kernel kills the process instead of crashing.
> 
> A long time ago I posted some patches to include a stack trace for this type
> of crash. It didn't make it into the kernel, and I got distracted by other things.
> 
> If we had that, it would have been easier to diagnose this regression (Shaui
> Xie would have seen crashes with a stack trace pointing to code that used
> to recover in older kernels). Folks with big clusters would also be able to
> point out other places where the kernel crashes often enough that additional
> EXTABLE recovery paths would be worth investigating.
> 
> So:
> 
> 1) We need to fix the regressions. That just needs new commit messages
> for these patches that explain the issue better.
> 
> 2) I'd like to see a patch for a stack trace for the unrecoverable case.
> 
> 3) I don't see much value in a message that reports the recoverable case.
> 
> Yazen: At one point I think you said you were looking at adding additional
> decorations to the return value from mce_severity() to indicate actions
> needed for recoverable errors (kill the process, offline the page) rather
> than have do_machine_check() figure it out by looking at various fields
> in the "struct mce". Did that go anywhere? Those extra details might be
> interesting in the tracepoint.
> 

Hi Tony,

Yes, I have a patch here:
https://github.com/AMDESE/linux/commit/cf0b8a97240abf0fbd98a91cd8deb262f827721b

Branch:
https://github.com/AMDESE/linux/commits/wip-mca/

This work is at the tail-end of a lot of other refactoring. But it can
be prioritized if there's interest. Most of the dependencies have
already been merged.

Thanks,
Yazen


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 2/5] x86/mce: dump error msg from severities
  2025-03-03 16:49             ` Luck, Tony
  2025-03-03 18:08               ` Yazen Ghannam
@ 2025-03-05  1:50               ` Shuai Xue
  2025-03-05 16:16                 ` Luck, Tony
  1 sibling, 1 reply; 59+ messages in thread
From: Shuai Xue @ 2025-03-05  1:50 UTC (permalink / raw)
  To: Luck, Tony, Borislav Petkov, Yazen.Ghannam@amd.com
  Cc: nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa, linmiaohe,
	akpm, peterz, jpoimboe, linux-edac, linux-kernel, linux-mm,
	baolin.wang, tianruidong



在 2025/3/4 00:49, Luck, Tony 写道:
>> The error context is in the behavior of the hw. If the error is fatal, you
>> won't see it - the machine will panic or do something else to prevent error
>> propagation. It definitely won't run any software anymore.
>>
>> If you see the error getting logged, it means it is not fatal enough to kill
>> the machine.
> 
> One place in the fatal case where I would like to see more information is the
> 
>    "Action required: data load in error *UN*recoverable area of kernel"
> 
> [emphasis on the "UN" added].

Do you mean this one?

     MCESEV(
         PANIC, "Data load in unrecoverable area of kernel",
         SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
         KERNEL
        ),


> 
> case.  We have a few places where the kernel does recover. And most places
> we crash. Our code for the recoverable cases is fragile.Most of this series is
> about repairing regressions where we used to recover from places where kernel
> is doing get_user() or copy_from_user() which can be recovered if those places
> get an error return and the kernel kills the process instead of crashing.

I can’t agree with you more.


> A long time ago I posted some patches to include a stack trace for this type
> of crash. It didn't make it into the kernel, and I got distracted by other things.
> 
> If we had that, it would have been easier to diagnose this regression (Shaui
> Xie would have seen crashes with a stack trace pointing to code that used
> to recover in older kernels). Folks with big clusters would also be able to
> point out other places where the kernel crashes often enough that additional
> EXTABLE recovery paths would be worth investigating.

Agreed, a stack trace will be helpful for debug unrecoverable cases.
The current panic message is bellow:

[ 1879.726794] mce: [Hardware Error]: CPU 178: Machine Check Exception: f Bank 1: bd80000000100134
[ 1879.726798] mce: [Hardware Error]: RIP 10:<ffffffff981d7af3> {futex_wait_setup+0x83/0xf0}
[ 1879.726807] mce: [Hardware Error]: TSC 49a1e6001c1 ADDR 80f7ada400 MISC 86 PPIN fc6b80e0ba9d616
[ 1879.726809] mce: [Hardware Error]: PROCESSOR 0:806f4 TIME 1741091252 SOCKET 1 APIC c5 microcode 2b000571
[ 1879.726811] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 1879.726813] mce: [Hardware Error]: Machine check events logged
[ 1879.727166] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
[ 1879.727168] Kernel panic - not syncing: Fatal local machine check


It only provides a RIP and I spent a lot time to figure out the root cause about
why get_user() and copy_from_user() fail in upstream kernel.

> 
> So:
> 
> 1) We need to fix the regressions. That just needs new commit messages
> for these patches that explain the issue better.

I will polish commit message.

> 
> 2) I'd like to see a patch for a stack trace for the unrecoverable case.

Could you provide any reference link to your previous patch?

> 
> 3) I don't see much value in a message that reports the recoverable case.
> 

Got it.

Thanks
Shuai


^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH v2 2/5] x86/mce: dump error msg from severities
  2025-03-05  1:50               ` Shuai Xue
@ 2025-03-05 16:16                 ` Luck, Tony
  2025-03-05 22:33                   ` Luck, Tony
  0 siblings, 1 reply; 59+ messages in thread
From: Luck, Tony @ 2025-03-05 16:16 UTC (permalink / raw)
  To: Shuai Xue, Borislav Petkov, Yazen.Ghannam@amd.com
  Cc: nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa, linmiaohe,
	akpm, peterz, jpoimboe, linux-edac, linux-kernel, linux-mm,
	baolin.wang, tianruidong

>> 2) I'd like to see a patch for a stack trace for the unrecoverable case.
>
> Could you provide any reference link to your previous patch?

https://lore.kernel.org/all/20220922195136.54575-1-tony.luck@intel.com/

-Tony


^ permalink raw reply	[flat|nested] 59+ messages in thread

* RE: [PATCH v2 2/5] x86/mce: dump error msg from severities
  2025-03-05 16:16                 ` Luck, Tony
@ 2025-03-05 22:33                   ` Luck, Tony
  2025-03-06 15:58                     ` Yazen Ghannam
  0 siblings, 1 reply; 59+ messages in thread
From: Luck, Tony @ 2025-03-05 22:33 UTC (permalink / raw)
  To: Shuai Xue, Borislav Petkov, Yazen.Ghannam@amd.com
  Cc: nao.horiguchi, tglx, mingo, dave.hansen, x86, hpa, linmiaohe,
	akpm, peterz, jpoimboe, linux-edac, linux-kernel, linux-mm,
	baolin.wang, tianruidong

>>> 2) I'd like to see a patch for a stack trace for the unrecoverable case.
>>
>> Could you provide any reference link to your previous patch?
>
> https://lore.kernel.org/all/20220922195136.54575-1-tony.luck@intel.com/

Yazen has some cleanups to the severity code under consideration
here:

https://github.com/AMDESE/linux/commit/cf0b8a97240abf0fbd98a91cd8deb262f827721b


I wonder if a slightly different approach with the "mce_action" that Yazen
defines being a bitmask of options, instead of an enum, would be possible?

If that happened, then adding a "dump the call stack to the console" option
would fit neatly into the scheme.

-Tony


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH v2 2/5] x86/mce: dump error msg from severities
  2025-03-05 22:33                   ` Luck, Tony
@ 2025-03-06 15:58                     ` Yazen Ghannam
  0 siblings, 0 replies; 59+ messages in thread
From: Yazen Ghannam @ 2025-03-06 15:58 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Shuai Xue, Borislav Petkov, nao.horiguchi, tglx, mingo,
	dave.hansen, x86, hpa, linmiaohe, akpm, peterz, jpoimboe,
	linux-edac, linux-kernel, linux-mm, baolin.wang, tianruidong

On Wed, Mar 05, 2025 at 10:33:04PM +0000, Luck, Tony wrote:
> >>> 2) I'd like to see a patch for a stack trace for the unrecoverable case.
> >>
> >> Could you provide any reference link to your previous patch?
> >
> > https://lore.kernel.org/all/20220922195136.54575-1-tony.luck@intel.com/
> 
> Yazen has some cleanups to the severity code under consideration
> here:
> 
> https://github.com/AMDESE/linux/commit/cf0b8a97240abf0fbd98a91cd8deb262f827721b
> 
> 
> I wonder if a slightly different approach with the "mce_action" that Yazen
> defines being a bitmask of options, instead of an enum, would be possible?
> 
> If that happened, then adding a "dump the call stack to the console" option
> would fit neatly into the scheme.
> 

Right, I was thinking something similar. Basically, replace single
action (enum) with multiple possible actions (flags/bitmask). And we can
have a mask for a "class" of actions for general cases.

Thanks,
Yazen


^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2025-03-06 15:58 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-17  6:33 [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling Shuai Xue
2025-02-17  6:33 ` [PATCH v2 1/5] x86/mce: Collect error message for severities below MCE_PANIC_SEVERITY Shuai Xue
2025-02-18  7:58   ` Borislav Petkov
2025-02-18  9:39     ` Shuai Xue
2025-02-18  9:50       ` Borislav Petkov
2025-02-17  6:33 ` [PATCH v2 2/5] x86/mce: dump error msg from severities Shuai Xue
2025-02-28 12:37   ` Borislav Petkov
2025-03-01  6:16     ` Shuai Xue
2025-03-01 11:10       ` Borislav Petkov
2025-03-01 14:03         ` Shuai Xue
2025-03-01 18:47           ` Borislav Petkov
2025-03-02  7:14             ` Shuai Xue
2025-03-02  7:37               ` Borislav Petkov
2025-03-02  9:13                 ` Shuai Xue
2025-03-03 16:49             ` Luck, Tony
2025-03-03 18:08               ` Yazen Ghannam
2025-03-05  1:50               ` Shuai Xue
2025-03-05 16:16                 ` Luck, Tony
2025-03-05 22:33                   ` Luck, Tony
2025-03-06 15:58                     ` Yazen Ghannam
2025-02-17  6:33 ` [PATCH v2 3/5] x86/mce: add EX_TYPE_EFAULT_REG as in-kernel recovery context to fix copy-from-user operations regression Shuai Xue
2025-02-18 12:54   ` Peter Zijlstra
2025-02-18 13:02     ` Peter Zijlstra
2025-02-18 14:03       ` Shuai Xue
2025-02-18 13:28     ` Shuai Xue
2025-02-18 14:15       ` Peter Zijlstra
2025-02-18 16:48         ` Borislav Petkov
2025-02-19 10:40           ` Peter Zijlstra
2025-02-21  6:52             ` Shuai Xue
2025-02-17  6:33 ` [PATCH v2 4/5] mm/hwpoison: Fix incorrect "not recovered" report for recovered clean pages Shuai Xue
2025-02-19  6:34   ` Miaohe Lin
2025-02-19  8:54     ` Shuai Xue
2025-02-19 17:15       ` Luck, Tony
2025-02-20  1:16         ` Miaohe Lin
2025-02-17  6:33 ` [PATCH v2 5/5] mm: memory-failure: move return value documentation to function declaration Shuai Xue
2025-02-19  6:31   ` Miaohe Lin
2025-02-18  3:29 ` [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling Andrew Morton
2025-02-18  8:03   ` Borislav Petkov
2025-02-18  8:27 ` Borislav Petkov
2025-02-18 11:31   ` Shuai Xue
2025-02-18 12:24     ` Borislav Petkov
2025-02-18 13:08       ` Shuai Xue
2025-02-18 13:17         ` Borislav Petkov
2025-02-18 13:53           ` Shuai Xue
2025-02-18 15:31             ` Borislav Petkov
2025-02-19  7:13               ` Shuai Xue
2025-02-18 17:59         ` Luck, Tony
2025-02-19  6:04           ` Shuai Xue
2025-02-18 17:30       ` Luck, Tony
2025-02-19  8:10         ` Borislav Petkov
2025-02-19 17:11           ` Luck, Tony
2025-02-20 11:19             ` Borislav Petkov
2025-02-20 17:50               ` Luck, Tony
2025-02-21  6:05                 ` Shuai Xue
2025-02-24 22:01                   ` Borislav Petkov
2025-02-25  1:51                     ` Shuai Xue
2025-02-28 12:35                       ` Borislav Petkov
2025-03-01  5:54                         ` Shuai Xue
2025-02-24 21:50                 ` Borislav Petkov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox