Message-ID: <7eddced6-bf45-44c8-abbf-7d0d541511ab@linux.alibaba.com>
Date: Sun, 2 Mar 2025 15:14:52 +0800
Subject: Re: [PATCH v2 2/5] x86/mce: dump error msg from severities
From: Shuai Xue
To: Borislav Petkov, "Luck, Tony"
Cc: nao.horiguchi@gmail.com, tglx@linutronix.de, mingo@redhat.com,
 dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
 linmiaohe@huawei.com, akpm@linux-foundation.org, peterz@infradead.org,
 jpoimboe@kernel.org, linux-edac@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 baolin.wang@linux.alibaba.com, tianruidong@linux.alibaba.com
In-Reply-To: <20250301184724.GGZ8NWPI2Ys_BX-w2F@fat_crate.local>
References: <20250217063335.22257-1-xueshuai@linux.alibaba.com>
 <20250217063335.22257-3-xueshuai@linux.alibaba.com>
 <20250228123724.GDZ8GuBOuDy5xeHvjc@fat_crate.local>
 <20250301111022.GAZ8LrHkal1bR4G1QR@fat_crate.local>
 <20250301184724.GGZ8NWPI2Ys_BX-w2F@fat_crate.local>

On 2025/3/2 02:47, Borislav Petkov wrote:
> On Sat, Mar 01, 2025 at 10:03:13PM +0800, Shuai Xue wrote:
>> (By the way, CentOS/Red Hat build kernels without CONFIG_RAS_CEC set, because
>> it breaks EDAC decoding. We do not use CEC in production at all for the same
>> reason.)
>
> It doesn't "break" error decoding - it collects every correctable DRAM error
> and puts it in "leaky" bucket of sorts. And when a certain error address
> generates too many errors, it memory_failure()s the page and poisons it.
>
> You do not use it in production because you want to see every error, collect
> it, massage it and perhaps decide when DIMMs go bad and you can replace
> them... or whatever you do.
>
> All the others who enable it and we can sleep properly, without getting
> unnecessarily upset about a correctable error.

Yes, we want to see every CE and use the CE pattern (e.g. the correctable
error bits) [1][2] to predict whether a row fault is prone to UEs.

And we are not upset by CEs, because they have already been corrected by
hardware :)

[1] https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/fault-aware-prediction-guide.pdf
[2] https://arxiv.org/html/2312.02855v2

>
>> Yes, we collect all kernel messages from the host, parse the logs and predict
>> panics with AI tools. The more details we collect, the better the performance
>> of the AI model.
>
> LOL.
>
> We go to the great effort of doing an MCE tracepoint which gives a
> *structured* error record, show an example how to use
> it in rasdaemon and you go and do the crazy hard and, at the same time, silly
> thing and parse dmesg?!??!
>
> This is priceless. Oh boy.
>
>> Agreed, tracepoint is a more elegant way. However, it does not include error
>> context, just some hardware registers.
>
> The error context is in the behavior of the hw. If the error is fatal, you
> won't see it - the machine will panic or do something else to prevent error
> propagation. It definitely won't run any software anymore.
>
> If you see the error getting logged, it means it is not fatal enough to kill
> the machine.

Agreed.

>
>>> Besides, this message is completely useless as it has no concrete info about
>>> the error and what is being done about it.
>>
>> I don't think so,
>
> I think so and you're not reading my mail.
>
>> "mce: Uncorrected hardware memory error in user-access at 3b116c400"

That is the existing message in kill_me_maybe(); it was not added by me.

>
> Ask yourself: what can you do when you see a message like that?
>
> Exactly *nothing* because there's not nearly enough information to recover
> from it or log it or whatever. That error message is *totally useless* and
> you're upsetting your users unnecessarily and even if they report it to you,
> you can't help them.
>

I believe we are approaching this issue from different perspectives. As a
cloud service provider, I need to address the following points:

1. I must be able to explain to end users why the MCE occurred.
2. It is important to determine whether there are any kernel bugs that could
   compromise the overall stability of the cloud platform.
3. We need to identify and implement potential improvements.

"mce: Uncorrected hardware memory error in user-access at 3b116c400" is
*nothing*, but
"mce: Action required: data load in error recoverable area of kernel" helps.

Thanks for your time.
Shuai
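
For readers unfamiliar with the "leaky bucket" behaviour that the quoted text attributes to CONFIG_RAS_CEC, here is a minimal sketch of the idea. It is a toy model written for illustration only, not the kernel's CEC implementation: the class name, the threshold value, the halving decay, and the `offlined` list (standing in for the memory_failure() call) are all assumptions.

```python
from collections import defaultdict


class LeakyBucketCollector:
    """Toy per-page correctable-error counter (hypothetical, not kernel code)."""

    def __init__(self, threshold=3):
        self.threshold = threshold      # assumed limit before a page is offlined
        self.counts = defaultdict(int)  # page frame number -> CE count
        self.offlined = []              # stand-in for memory_failure() on a page

    def record_error(self, pfn):
        """Count one correctable error; return True if the page gets offlined."""
        self.counts[pfn] += 1
        if self.counts[pfn] >= self.threshold:
            del self.counts[pfn]
            self.offlined.append(pfn)
            return True
        return False

    def decay(self):
        """The 'leak': halve every counter and forget pages that reach zero."""
        for pfn in list(self.counts):
            self.counts[pfn] //= 2
            if self.counts[pfn] == 0:
                del self.counts[pfn]


# Usage: the address from the log message above, shifted to a 4 KiB page frame.
collector = LeakyBucketCollector(threshold=3)
pfn = 0x3b116c400 >> 12
collector.record_error(pfn)
collector.record_error(pfn)
collector.decay()            # the count leaks from 2 down to 1
collector.record_error(pfn)  # back to 2, still below the threshold
```

In this model a page is only acted on when errors arrive faster than they leak away, which is why such a collector hides individual CEs from userspace tools that want to see every event.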