Date: Sat, 1 Mar 2025 22:03:13 +0800
Subject: Re: [PATCH v2 2/5] x86/mce: dump error msg from severities
From: Shuai Xue
To: Borislav Petkov
Cc: tony.luck@intel.com, nao.horiguchi@gmail.com, tglx@linutronix.de,
 mingo@redhat.com, dave.hansen@linux.intel.com, x86@kernel.org,
 hpa@zytor.com, linmiaohe@huawei.com, akpm@linux-foundation.org,
 peterz@infradead.org, jpoimboe@kernel.org, linux-edac@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 baolin.wang@linux.alibaba.com, tianruidong@linux.alibaba.com
References: <20250217063335.22257-1-xueshuai@linux.alibaba.com>
 <20250217063335.22257-3-xueshuai@linux.alibaba.com>
 <20250228123724.GDZ8GuBOuDy5xeHvjc@fat_crate.local>
 <20250301111022.GAZ8LrHkal1bR4G1QR@fat_crate.local>
In-Reply-To: <20250301111022.GAZ8LrHkal1bR4G1QR@fat_crate.local>

On 2025/3/1 19:10, Borislav Petkov wrote:
> On Sat, Mar 01, 2025 at 02:16:12PM +0800, Shuai Xue wrote:
>> For instance, it does not specify whether the error occurred in the
>> context of IN_KERNEL or IN_KERNEL_RECOV, which are crucial for
>> understanding the error's circumstances.
>
> 1. Crucial for whom? For you? Or for users?
>
> You need to explain how this error message is going to be used. Because simply
> issuing such a message causes a lot of panicked people calling a lot of admins
> to figure out why their machine is broken. Because they see "mce" and think
> "hw broken, need to replace it immediately."
>
> This is one of the reasons we did the cec.c thing - just to save people from
> panicking unnecessarily and causing expensive and useless maintenance calls.

For me, and for cloud providers that maintain millions of servers.

(By the way, CentOS/Red Hat build their kernels without CONFIG_RAS_CEC set,
because it breaks EDAC decoding. We do not use CEC in production at all for
the same reason.)

>
> 2. This message goes to dmesg which means something needs to parse it, beside
> a human. An AI?

Yes, we collect all kernel messages from the host, parse the logs, and predict
panics with AI tools. The more details we collect, the better the AI model
performs.

>
> 3. Dmesg is a ring buffer which gets overwritten and this message is
> eventually lost
>
> There's a reason why MCEs get logged with the notifiers and through
> a tracepoint - so that agents can act upon them properly.
>
> And we have had this discussion for years now - I'm sorry that you're late to
> the party.

Agreed, a tracepoint is a more elegant way. However, it does not include the
error context, just some hardware registers.

>
>> For the regression cases (copy from user) in Patch 3, an error message
>>
>> "mce: Action required: data load in error recoverable area of kernel"
>
> See above.
>
> Besides, this message is completely useless as it has no concrete info about
> the error and what is being done about it.

I don't think so:

"Action required" means MCI_UC_AR
"data load" means MCACOD_DATA
"recoverable area of kernel" means KERNEL_RECOV

It is more readable and concrete than "Uncorrected hardware memory error",
e.g. the message in kill_me_maybe():

"mce: Uncorrected hardware memory error in user-access at 3b116c400"

>
>> I could add more explanations in next version if you have no objection.
>
> All of the above are objections.
>
> Please go into git history and read why we're avoiding dumping useless
> messages instead of proposing silly patches.
>

Anyway, I respect the maintainer's opinion.

Thanks
Shuai
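
For illustration only, here is a minimal sketch of how a log collector might
decode the proposed message into the three fields listed above. The regular
expression, the function name decode_mce_line() and the dictionary layout are
all hypothetical; only the phrase-to-identifier mapping comes from this mail,
and the message format is assumed to be
"mce: <severity>: <access> in error <context>".

```python
import re

# Phrase-to-identifier mapping as described in this mail.
SEVERITY = {"Action required": "MCI_UC_AR"}
ACCESS = {"data load": "MCACOD_DATA"}
CONTEXT = {"recoverable area of kernel": "KERNEL_RECOV"}

# Hypothetical pattern for the assumed message format:
#   "mce: <severity>: <access> in error <context>"
MSG_RE = re.compile(
    r"mce: (?P<severity>[^:]+): (?P<access>.+?) in error (?P<context>.+)$"
)


def decode_mce_line(line):
    """Decode one dmesg line; return None if it is not in the assumed format."""
    m = MSG_RE.search(line.strip())
    if m is None:
        return None
    return {
        "severity": SEVERITY.get(m["severity"], "unknown"),
        "access": ACCESS.get(m["access"], "unknown"),
        "context": CONTEXT.get(m["context"], "unknown"),
    }


if __name__ == "__main__":
    sample = "mce: Action required: data load in error recoverable area of kernel"
    print(decode_mce_line(sample))
```

The existing kill_me_maybe() message does not follow this pattern, so the
sketch returns None for it; a real collector would need a separate rule for
that line.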