From: Shuai Xue <xueshuai@linux.alibaba.com>
To: Borislav Petkov <bp@alien8.de>
Cc: tony.luck@intel.com, nao.horiguchi@gmail.com, tglx@linutronix.de,
mingo@redhat.com, dave.hansen@linux.intel.com, x86@kernel.org,
hpa@zytor.com, linmiaohe@huawei.com, akpm@linux-foundation.org,
peterz@infradead.org, jpoimboe@kernel.org,
linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, baolin.wang@linux.alibaba.com,
tianruidong@linux.alibaba.com
Subject: Re: [PATCH v2 2/5] x86/mce: dump error msg from severities
Date: Sat, 1 Mar 2025 22:03:13 +0800
Message-ID: <dee8d758-dd65-4438-8e42-251fb1a305a7@linux.alibaba.com>
In-Reply-To: <20250301111022.GAZ8LrHkal1bR4G1QR@fat_crate.local>
On 2025/3/1 19:10, Borislav Petkov wrote:
> On Sat, Mar 01, 2025 at 02:16:12PM +0800, Shuai Xue wrote:
>> For instance, it does not specify whether the error occurred in the
>> context of IN_KERNEL or IN_KERNEL_RECOV, which are crucial for
>> understanding the error's circumstances.
>
> 1. Crucial for whom? For you? Or for users?
>
> You need to explain how this error message is going to be used. Because simply
> issuing such a message causes a lot of panicked people calling a lot of admins
> to figure out why their machine is broken. Because they see "mce" and think
> "hw broken, need to replace it immediately."
>
> This is one of the reasons we did the cec.c thing - just to save people from
> panicking unnecessarily and causing expensive and useless maintenance calls.
For me, and for cloud providers that maintain millions of servers.
(By the way, CentOS/Red Hat build their kernels without CONFIG_RAS_CEC set,
because it breaks EDAC decoding. We do not use CEC in production at all, for
the same reason.)
>
> 2. This message goes to dmesg which means something needs to parse it, beside
> a human. An AI?
Yes, we collect all kernel messages from our hosts, parse the logs, and predict
panics with AI tools. The more details we collect, the better the AI model
performs.
>
> 3. Dmesg is a ring buffer which gets overwritten and this message is
> eventually lost
>
> There's a reason why MCEs get logged with the notifiers and through
> a tracepoint - so that agents can act upon them properly.
>
> And we have had this discussion for years now - I'm sorry that you're late to
> the party.
Agreed, a tracepoint is a more elegant way. However, it does not include the
error context, only some hardware registers.
>
>> For the regression cases (copy from user) in Patch 3, an error message
>>
>> "mce: Action required: data load in error recoverable area of kernel"
>
> See above.
>
> Besides, this message is completely useless as it has no concrete info about
> the error and what is being done about it.
I don't think so:
"Action required" means MCI_UC_AR
"data load" means MCACOD_DATA
"recoverable area of kernel" means KERNEL_RECOV
It is more readable and concrete than "Uncorrected hardware memory error", e.g.
the message in kill_me_maybe():
"mce: Uncorrected hardware memory error in user-access at 3b116c400"
>
>> I could add more explanations in next version if you have no objection.
>
> All of the above are objections.
>
> Please go into git history and read why we're avoiding dumping useless
> messages instead of proposing silly patches.
>
Anyway, I respect the maintainer's opinion.
Thanks
Shuai