From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Sat, 1 Mar 2025 19:47:24 +0100
From: Borislav Petkov
To: Shuai Xue
Cc: tony.luck@intel.com, nao.horiguchi@gmail.com, tglx@linutronix.de,
	mingo@redhat.com, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, linmiaohe@huawei.com, akpm@linux-foundation.org,
	peterz@infradead.org, jpoimboe@kernel.org, linux-edac@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	baolin.wang@linux.alibaba.com, tianruidong@linux.alibaba.com
Subject: Re: [PATCH v2 2/5] x86/mce: dump error msg from severities
Message-ID: <20250301184724.GGZ8NWPI2Ys_BX-w2F@fat_crate.local>
References: <20250217063335.22257-1-xueshuai@linux.alibaba.com>
	<20250217063335.22257-3-xueshuai@linux.alibaba.com>
	<20250228123724.GDZ8GuBOuDy5xeHvjc@fat_crate.local>
	<20250301111022.GAZ8LrHkal1bR4G1QR@fat_crate.local>

On Sat, Mar 01, 2025 at 10:03:13PM +0800, Shuai Xue wrote:
> (By the way, CentOS/Red Hat build kernels without CONFIG_RAS_CEC set, because
> it breaks EDAC decoding. We do not use CEC in production at all for the same
> reason.)

It doesn't "break" error decoding - it collects every correctable DRAM error
and puts it in a "leaky" bucket of sorts. And when a certain error address
generates too many errors, it memory_failure()s the page and poisons it.

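To illustrate the "leaky" bucket idea described above, here is a minimal
sketch in Python. It is not the kernel's CEC code: the class name, the
threshold of 3, the 4 KiB page size and the on_offline callback (a stand-in
for memory_failure()) are all assumptions made for the example.

```python
from collections import defaultdict

PAGE_SHIFT = 12  # assumption: 4 KiB pages


class LeakyBucket:
    """Toy model: count correctable errors per page and let counts decay."""

    def __init__(self, threshold=3, on_offline=None):
        self.threshold = threshold        # hypothetical value, for illustration
        self.counts = defaultdict(int)    # page frame number -> error count
        self.offlined = set()             # pages already taken out of use
        self.on_offline = on_offline or (lambda pfn: None)

    def report(self, addr):
        """Record one correctable error; return True if the page was offlined."""
        pfn = addr >> PAGE_SHIFT
        if pfn in self.offlined:
            return False
        self.counts[pfn] += 1
        if self.counts[pfn] >= self.threshold:
            del self.counts[pfn]
            self.offlined.add(pfn)
            self.on_offline(pfn)          # stand-in for memory_failure()
            return True
        return False

    def decay(self):
        """The 'leak': periodically forget one error per tracked page."""
        for pfn in list(self.counts):
            self.counts[pfn] -= 1
            if self.counts[pfn] <= 0:
                del self.counts[pfn]
```

An address that errors only occasionally leaks back to zero through decay()
and never triggers anything; only an address that keeps producing errors
faster than they decay crosses the threshold and gets its page offlined.
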
You do not use it in production because you want to see every error, collect
it, massage it and perhaps decide when DIMMs go bad so that you can replace
them... or whatever you do.

All the others who enable it, and we too, can sleep properly, without getting
unnecessarily upset about a correctable error.

> Yes, we collect all kernel messages from the host, parse the logs and predict
> panics with AI tools. The more details we collect, the better the performance
> of the AI model.

LOL.

We go to the great effort of doing an MCE tracepoint which gives
a *structured* error record, show an example of how to use it in rasdaemon,
and you go and do the crazy hard and, at the same time, silly thing and parse
dmesg?!??!

This is priceless. Oh boy.

> Agreed, tracepoint is a more elegant way. However, it does not include error
> context, just some hardware registers.

The error context is in the behavior of the hw. If the error is fatal, you
won't see it - the machine will panic or do something else to prevent error
propagation. It definitely won't run any software anymore.

If you see the error getting logged, it means it is not fatal enough to kill
the machine.

> > Besides, this message is completely useless as it has no concrete info about
> > the error and what is being done about it.
>
> I don't think so,

I think so and you're not reading my mail.

> "mce: Uncorrected hardware memory error in user-access at 3b116c400"

Ask yourself: what can you do when you see a message like that?

Exactly *nothing* because there's not nearly enough information to recover
from it or log it or whatever. That error message is *totally useless* and
you're upsetting your users unnecessarily and even if they report it to you,
you can't help them.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette