Message-ID: <89027155-8ca3-46a5-8c3a-e24b903cb3eb@linux.alibaba.com>
Date: Wed, 5 Mar 2025 09:50:13 +0800
Subject: Re: [PATCH v2 2/5] x86/mce: dump error msg from severities
From: Shuai Xue
To: "Luck, Tony", Borislav Petkov, "Yazen.Ghannam@amd.com"
Cc: "nao.horiguchi@gmail.com", "tglx@linutronix.de", "mingo@redhat.com",
 "dave.hansen@linux.intel.com", "x86@kernel.org", "hpa@zytor.com",
 "linmiaohe@huawei.com", "akpm@linux-foundation.org", "peterz@infradead.org",
 "jpoimboe@kernel.org", "linux-edac@vger.kernel.org",
 "linux-kernel@vger.kernel.org", "linux-mm@kvack.org",
 "baolin.wang@linux.alibaba.com", "tianruidong@linux.alibaba.com"
References: <20250217063335.22257-1-xueshuai@linux.alibaba.com>
 <20250217063335.22257-3-xueshuai@linux.alibaba.com>
 <20250228123724.GDZ8GuBOuDy5xeHvjc@fat_crate.local>
 <20250301111022.GAZ8LrHkal1bR4G1QR@fat_crate.local>
 <20250301184724.GGZ8NWPI2Ys_BX-w2F@fat_crate.local>
Content-Type: text/plain; charset=UTF-8; format=flowed

On 2025/3/4 00:49, Luck, Tony wrote:
>> The error context is in the behavior of the hw. If the error is fatal, you
>> won't see it - the machine will panic or do something else to prevent error
>> propagation. It definitely won't run any software anymore.
>>
>> If you see the error getting logged, it means it is not fatal enough to kill
>> the machine.
>
> One place in the fatal case where I would like to see more information is the
>
> "Action required: data load in error *UN*recoverable area of kernel"
>
> [emphasis on the "UN" added].

Do you mean this one?

	MCESEV(
		PANIC, "Data load in unrecoverable area of kernel",
		SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
		KERNEL
		),

>
> case. We have a few places where the kernel does recover. And most places
> we crash. Our code for the recoverable cases is fragile. Most of this series is
> about repairing regressions where we used to recover from places where kernel
> is doing get_user() or copy_from_user() which can be recovered if those places
> get an error return and the kernel kills the process instead of crashing.

I couldn't agree with you more.

> A long time ago I posted some patches to include a stack trace for this type
> of crash. It didn't make it into the kernel, and I got distracted by other things.
>
> If we had that, it would have been easier to diagnose this regression (Shaui
> Xie would have seen crashes with a stack trace pointing to code that used
> to recover in older kernels). Folks with big clusters would also be able to
> point out other places where the kernel crashes often enough that additional
> EXTABLE recovery paths would be worth investigating.

Agreed, a stack trace would be helpful for debugging unrecoverable cases.

The current panic message is below:

[ 1879.726794] mce: [Hardware Error]: CPU 178: Machine Check Exception: f Bank 1: bd80000000100134
[ 1879.726798] mce: [Hardware Error]: RIP 10: {futex_wait_setup+0x83/0xf0}
[ 1879.726807] mce: [Hardware Error]: TSC 49a1e6001c1 ADDR 80f7ada400 MISC 86 PPIN fc6b80e0ba9d616
[ 1879.726809] mce: [Hardware Error]: PROCESSOR 0:806f4 TIME 1741091252 SOCKET 1 APIC c5 microcode 2b000571
[ 1879.726811] mce: [Hardware Error]: Run the above through 'mcelog --ascii'
[ 1879.726813] mce: [Hardware Error]: Machine check events logged
[ 1879.727166] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel
[ 1879.727168] Kernel panic - not syncing: Fatal local machine check

It only provides a RIP, and I spent a lot of time figuring out the root cause
of why get_user() and copy_from_user() fail in the upstream kernel.

>
> So:
>
> 1) We need to fix the regressions. That just needs new commit messages
> for these patches that explain the issue better.

I will polish the commit messages.

>
> 2) I'd like to see a patch for a stack trace for the unrecoverable case.

Could you provide a reference link to your previous patch?

>
> 3) I don't see much value in a message that reports the recoverable case.
>

Got it.

Thanks,
Shuai
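
For reference, here is a minimal userspace sketch of how the Bank 1 status
word from the log above lines up with the MASK() arguments of the quoted
MCESEV entry. It is not the kernel's code. The bit positions follow the
architectural IA32_MCi_STATUS layout; the MCI_UC_SAR, MCI_ADDR, MCACOD and
MCACOD_DATA values are written from my reading of the x86 MCE sources and
should be treated as assumptions to verify against the tree. The function
name matches_data_load_rule() is hypothetical.

```c
#include <stdint.h>

/* Architectural IA32_MCi_STATUS flag bits. */
#define MCI_STATUS_OVER  (1ULL << 62)	/* previous error overwritten */
#define MCI_STATUS_UC    (1ULL << 61)	/* uncorrected error */
#define MCI_STATUS_MISCV (1ULL << 59)	/* MISC register valid */
#define MCI_STATUS_ADDRV (1ULL << 58)	/* ADDR register valid */
#define MCI_STATUS_S     (1ULL << 56)	/* signaled via machine check */
#define MCI_STATUS_AR    (1ULL << 55)	/* action required */

/* Assumed to mirror the helpers used by the severity table. */
#define MCI_UC_SAR  (MCI_STATUS_UC | MCI_STATUS_S | MCI_STATUS_AR)
#define MCI_ADDR    (MCI_STATUS_ADDRV | MCI_STATUS_MISCV)
#define MCACOD      0xefffULL	/* MCA error code field */
#define MCACOD_DATA 0x0134ULL	/* data load */

/*
 * Hypothetical helper: returns 1 when the status word satisfies
 * (status & mask) == result for the two MASK() arguments quoted above.
 */
static int matches_data_load_rule(uint64_t status)
{
	const uint64_t mask = MCI_STATUS_OVER | MCI_UC_SAR | MCI_ADDR | MCACOD;
	const uint64_t result = MCI_UC_SAR | MCI_ADDR | MCACOD_DATA;

	return (status & mask) == result;
}
```

Calling matches_data_load_rule(0xbd80000000100134ULL) returns 1: UC, S, AR,
ADDRV and MISCV are set, OVER is clear, and the low 16 bits are 0x0134. The
sketch only covers the status-word comparison; the SER and KERNEL
qualifiers of the entry are not modeled here.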