From: "Luck, Tony" <tony.luck@intel.com>
To: Borislav Petkov <bp@alien8.de>,
	Shuai Xue <xueshuai@linux.alibaba.com>,
	"Yazen.Ghannam@amd.com" <yazen.ghannam@amd.com>
Cc: "nao.horiguchi@gmail.com" <nao.horiguchi@gmail.com>,
	"tglx@linutronix.de" <tglx@linutronix.de>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
	"x86@kernel.org" <x86@kernel.org>,
	"hpa@zytor.com" <hpa@zytor.com>,
	"linmiaohe@huawei.com" <linmiaohe@huawei.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"jpoimboe@kernel.org" <jpoimboe@kernel.org>,
	"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"baolin.wang@linux.alibaba.com" <baolin.wang@linux.alibaba.com>,
	"tianruidong@linux.alibaba.com" <tianruidong@linux.alibaba.com>
Subject: RE: [PATCH v2 2/5] x86/mce: dump error msg from severities
Date: Mon, 3 Mar 2025 16:49:25 +0000
Message-ID: <SJ1PR11MB6083697C08D8B6B8BFD3CC98FCC92@SJ1PR11MB6083.namprd11.prod.outlook.com>
In-Reply-To: <20250301184724.GGZ8NWPI2Ys_BX-w2F@fat_crate.local>

> The error context is in the behavior of the hw. If the error is fatal, you
> won't see it - the machine will panic or do something else to prevent error
> propagation. It definitely won't run any software anymore.
>
> If you see the error getting logged, it means it is not fatal enough to kill
> the machine.

One place in the fatal case where I would like to see more information is the

  "Action required: data load in error *UN*recoverable area of kernel"

[emphasis on the "UN" added].

case. We have a few places where the kernel does recover, and in most places
we crash. Our code for the recoverable cases is fragile. Most of this series is
about repairing regressions in paths where we used to recover: places where the
kernel is doing get_user() or copy_from_user(). Those can be recovered if the
faulting code gets an error return and the kernel kills the process instead of
crashing.
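
To make the recover-or-crash decision concrete, here is a rough user-space
sketch. Everything in it is hypothetical (the sketch_* names, the table
layout, the linear search); it is not the kernel's extable code, only an
illustration of "an instruction with a registered fixup gets an error return
and the task is killed, anything else panics".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified exception-table entry (not the kernel's layout). */
struct sketch_extable_entry {
	uintptr_t insn;   /* address of an instruction that may consume poison */
	uintptr_t fixup;  /* address to resume at so the caller sees an error */
};

/* What the handler decides for a machine check taken in kernel mode. */
enum sketch_outcome {
	SKETCH_PANIC,      /* no recovery path registered: crash */
	SKETCH_KILL_TASK,  /* resume at the fixup and kill the current process */
};

/* Linear search is enough for a sketch; returns NULL when ip has no fixup. */
static const struct sketch_extable_entry *
sketch_find_fixup(const struct sketch_extable_entry *tbl, size_t n, uintptr_t ip)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].insn == ip)
			return &tbl[i];
	return NULL;
}

/* Redirect *ip to the fixup if one exists; otherwise leave it and panic. */
static enum sketch_outcome
sketch_handle_kernel_mc(const struct sketch_extable_entry *tbl, size_t n,
			uintptr_t *ip)
{
	const struct sketch_extable_entry *e;

	assert(ip != NULL);
	e = sketch_find_fixup(tbl, n, *ip);
	if (!e)
		return SKETCH_PANIC;
	*ip = e->fixup;
	return SKETCH_KILL_TASK;
}
```

In this sketch a regression of the kind discussed here would look like a
copy-from-user instruction whose address is no longer matched by the table, so
the handler falls through to SKETCH_PANIC.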

A long time ago I posted some patches to include a stack trace for this type
of crash. They didn't make it into the kernel, and I got distracted by other things.

If we had that, it would have been easier to diagnose this regression (Shuai
Xue would have seen crashes with a stack trace pointing to code that used
to recover in older kernels). Folks with big clusters would also be able to
point out other places where the kernel crashes often enough that additional
EXTABLE recovery paths would be worth investigating.

So:

1) We need to fix the regressions. That just needs new commit messages
for these patches that explain the issue better.

2) I'd like to see a patch for a stack trace for the unrecoverable case.

3) I don't see much value in a message that reports the recoverable case.

Yazen: At one point I think you said you were looking at adding additional
decorations to the return value from mce_severity() to indicate actions
needed for recoverable errors (kill the process, offline the page) rather
than have do_machine_check() figure it out by looking at various fields
in the "struct mce". Did that go anywhere? Those extra details might be
interesting in the tracepoint.
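
Purely as an illustration of what "decorations" on the return value could look
like, here is a user-space sketch. The names, bit positions and severity levels
are all made up for the example; this is not the existing mce_severity()
interface and not anything Yazen has posted. It assumes the severity level
lives in the low bits and the required actions are OR-ed in above it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical layout: low nibble is the severity, higher bits are actions. */
#define SKETCH_SEV_MASK          0x0fu
#define SKETCH_ACT_KILL_TASK     (1u << 4)  /* kill the affected process */
#define SKETCH_ACT_OFFLINE_PAGE  (1u << 5)  /* take the poisoned page offline */

/* Made-up severity levels for the sketch, not the kernel's enum. */
enum sketch_severity {
	SKETCH_SEV_NONE  = 0,
	SKETCH_SEV_AO    = 1,
	SKETCH_SEV_AR    = 2,
	SKETCH_SEV_PANIC = 3,
};

/* Combine a severity level with the actions the caller should take. */
static unsigned int sketch_pack(enum sketch_severity level, unsigned int actions)
{
	assert(((unsigned int)level & ~SKETCH_SEV_MASK) == 0);
	assert((actions & SKETCH_SEV_MASK) == 0);
	return (unsigned int)level | actions;
}

/* Extract the plain severity level from a decorated return value. */
static enum sketch_severity sketch_level(unsigned int ret)
{
	return (enum sketch_severity)(ret & SKETCH_SEV_MASK);
}

/* True if the decorated return value asks for the given action. */
static bool sketch_wants(unsigned int ret, unsigned int action)
{
	return (ret & action) != 0;
}
```

With something of this shape the caller would test the action bits directly
instead of re-deriving them from fields in "struct mce", and the same value
could be passed to the tracepoint unchanged.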

-Tony

