linux-mm.kvack.org archive mirror
From: Shuai Xue <xueshuai@linux.alibaba.com>
To: Borislav Petkov <bp@alien8.de>, "Luck, Tony" <tony.luck@intel.com>
Cc: nao.horiguchi@gmail.com, tglx@linutronix.de, mingo@redhat.com,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	linmiaohe@huawei.com, akpm@linux-foundation.org,
	peterz@infradead.org, jpoimboe@kernel.org,
	linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, baolin.wang@linux.alibaba.com,
	tianruidong@linux.alibaba.com
Subject: Re: [PATCH v2 2/5] x86/mce: dump error msg from severities
Date: Sun, 2 Mar 2025 15:14:52 +0800
Message-ID: <7eddced6-bf45-44c8-abbf-7d0d541511ab@linux.alibaba.com>
In-Reply-To: <20250301184724.GGZ8NWPI2Ys_BX-w2F@fat_crate.local>



On 2025/3/2 02:47, Borislav Petkov wrote:
> On Sat, Mar 01, 2025 at 10:03:13PM +0800, Shuai Xue wrote:
>> (By the way, CentOS/Red Hat build kernels without CONFIG_RAS_CEC set, because
>> it breaks EDAC decoding. We do not use CEC in production at all for the same
>> reason.)
> 
> It doesn't "break" error decoding - it collects every correctable DRAM error
> and puts it in "leaky" bucket of sorts. And when a certain error address
> generates too many errors, it memory_failure()s the page and poisons it.
> 
> You do not use it in production because you want to see every error, collect
> it, massage it and perhaps decide when DIMMs go bad and you can replace
> them... or whatever you do.
> 
> All the others who enable it and we can sleep properly, without getting
> unnecessarily upset about a correctable error.

Yes, we want to see every CE and use the CE pattern (e.g. the correctable
error bits) [1][2] to predict whether a row fault is prone to UEs or not.
And we are not upset by CEs, because they have been corrected by hardware :)

[1]https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fault-aware-prediction-guide.pdf
[2]https://arxiv.org/html/2312.02855v2
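
For readers unfamiliar with the "leaky bucket" behaviour described in the
quoted text above, here is a minimal sketch of the idea. It is not the
kernel's CEC code: the class name, the threshold, the decay interval and
the 4 KiB page size are all assumptions made for illustration, and the
"poisoned" set merely stands in for a memory_failure() call.

```python
from collections import defaultdict

PAGE_SHIFT = 12  # assumption: 4 KiB pages


class LeakyBucket:
    """Toy per-page correctable-error counter with periodic decay."""

    def __init__(self, threshold=4, decay_every=8):
        self.threshold = threshold      # hypothetical: errors before a page is poisoned
        self.decay_every = decay_every  # hypothetical: halve all counts every N reports
        self.counts = defaultdict(int)  # page frame number -> current error count
        self.reports = 0
        self.poisoned = set()

    def report(self, phys_addr):
        """Record one correctable error; return True if the page gets poisoned."""
        pfn = phys_addr >> PAGE_SHIFT
        if pfn in self.poisoned:
            return False
        self.reports += 1
        self.counts[pfn] += 1
        if self.counts[pfn] >= self.threshold:
            # Stand-in for memory_failure(): take the page out of use.
            self.poisoned.add(pfn)
            del self.counts[pfn]
            return True
        if self.reports % self.decay_every == 0:
            self._decay()
        return False

    def _decay(self):
        """The 'leak': halve every count and forget pages that reach zero."""
        for pfn in list(self.counts):
            self.counts[pfn] //= 2
            if self.counts[pfn] == 0:
                del self.counts[pfn]
```

With these toy parameters, an address that keeps erroring is poisoned once
it reaches the threshold, while isolated errors spread over many pages
decay away and are never reported.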

> 
>> Yes, we collect all kernel messages from the host, parse the logs and predict
>> panics with AI tools. The more details we collect, the better the performance
>> of the AI model.
> 
> LOL.
> 
> We go to the great effort of doing an MCE tracepoint which gives a
> *structured* error record, show an example how to use it in rasdaemon and
> you go and do the crazy hard and, at the same time, silly thing and parse
> dmesg?!??!
> 
> This is priceless. Oh boy.
> 
>> Agreed, tracepoint is a more elegant way. However, it does not include error
>> context, just some hardware registers.
> 
> The error context is in the behavior of the hw. If the error is fatal, you
> won't see it - the machine will panic or do something else to prevent error
> propagation. It definitely won't run any software anymore.
> 
> If you see the error getting logged, it means it is not fatal enough to kill
> the machine.

Agreed.

> 
>>> Besides, this message is completely useless as it has no concrete info about
>>> the error and what is being done about it.
>>
>> I don't think so,
> 
> I think so and you're not reading my mail.
> 
>>      "mce: Uncorrected hardware memory error in user-access at 3b116c400"

It is the current message in kill_me_maybe(), not added by me.

> 
> Ask yourself: what can you do when you see a message like that?
> 
> Exactly *nothing* because there's not nearly enough information to recover
> from it or log it or whatever. That error message is *totally useless* and
> you're upsetting your users unnecessarily and even if they report it to you,
> you can't help them.
> 

I believe we are approaching this issue from different perspectives.
As a cloud service provider, I need to address the following points:

1. I must be able to explain to end users why the MCE has occurred.
2. It is important to determine whether there are any kernel bugs that could
    compromise the overall stability of the cloud platform.
3. We need to identify and implement potential improvements.

"mce: Uncorrected hardware memory error in user-access at 3b116c400"

tells us *nothing*, but

"mce: Action required: data load in error recoverable area of kernel"

helps.
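
To make concrete what a log parser can recover from the first message,
here is a small sketch. The regular expression is written against the
message text quoted in this thread only; the function name, the returned
field names and the 4 KiB page size are assumptions for illustration.

```python
import re

PAGE_SHIFT = 12  # assumption: 4 KiB pages

# Matches the existing kill_me_maybe() message as quoted in this thread.
MCE_LINE = re.compile(
    r"mce: Uncorrected hardware memory error in (?P<context>\S+) "
    r"at (?P<addr>[0-9a-fA-F]+)"
)


def parse_mce_line(line):
    """Return the access context and physical address, or None if no match."""
    m = MCE_LINE.search(line)
    if m is None:
        return None
    addr = int(m.group("addr"), 16)  # the address is printed as bare hex
    return {
        "context": m.group("context"),
        "addr": addr,
        "pfn": addr >> PAGE_SHIFT,
    }
```

The only fields available are the access context and the physical address;
nothing in this line says how the error was classified or handled.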


Thanks for your time.
Shuai

