Date: Wed, 30 Oct 2024 09:54:00 +0800
Subject: Re: [PATCH v15 1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
From: Shuai Xue
To: Yazen Ghannam
Cc: mark.rutland@arm.com, catalin.marinas@arm.com, mingo@redhat.com,
 robin.murphy@arm.com, Jonathan.Cameron@huawei.com, bp@alien8.de,
 rafael@kernel.org, wangkefeng.wang@huawei.com, tanxiaofei@huawei.com,
 mawupeng1@huawei.com, tony.luck@intel.com, linmiaohe@huawei.com,
 naoya.horiguchi@nec.com, james.morse@arm.com, tongtiangen@huawei.com,
 gregkh@linuxfoundation.org, will@kernel.org, jarkko@kernel.org,
 linux-acpi@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
 linux-edac@vger.kernel.org, x86@kernel.org, justin.he@arm.com,
 ardb@kernel.org, ying.huang@intel.com, ashish.kalra@amd.com,
 baolin.wang@linux.alibaba.com, tglx@linutronix.de,
 dave.hansen@linux.intel.com, lenb@kernel.org, hpa@zytor.com,
 robert.moore@intel.com, lvying6@huawei.com, xiexiuqi@huawei.com,
 zhuo.song@linux.alibaba.com
References: <20221027042445.60108-1-xueshuai@linux.alibaba.com>
 <20241028081142.66028-2-xueshuai@linux.alibaba.com>
 <20241029204848.GA1229628@yaz-khff2.amd.com>
In-Reply-To: <20241029204848.GA1229628@yaz-khff2.amd.com>

On 2024/10/30 04:48, Yazen Ghannam wrote:
> On Mon, Oct 28, 2024 at 04:11:40PM +0800, Shuai Xue wrote:
>> Synchronous error was detected as a result of user-space process accessing
>> a 2-bit uncorrected error. The CPU will take a synchronous error exception
>> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
>> memory_failure() work which poisons the related page, unmaps the page, and
>> then sends a SIGBUS to the process, so that a system wide panic can be
>> avoided.
>>
>> However, no memory_failure() work will be queued when abnormal synchronous
>> errors occur. These errors can include situations such as invalid PA,
>> unexpected severity, no memory failure config support, invalid GUID
>> section, etc. In such case, the user-space process will trigger SEA again.
>> This loop can potentially exceed the platform firmware threshold or even
>> trigger a kernel hard lockup, leading to a system reboot.
>>
>> Fix it by performing a force kill if no memory_failure() work is queued
>> for synchronous errors.
>>
>> Signed-off-by: Shuai Xue
>> Reviewed-by: Jarkko Sakkinen
>> Reviewed-by: Jonathan Cameron
>> ---
>>  drivers/acpi/apei/ghes.c | 10 ++++++++++
>>  1 file changed, 10 insertions(+)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index ada93cfde9ba..f2ee28c44d7a 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -801,6 +801,16 @@ static bool ghes_do_proc(struct ghes *ghes,
>>  		}
>>  	}
>>
>> +	/*
>> +	 * If no memory failure work is queued for abnormal synchronous
>> +	 * errors, do a force kill.
>> +	 */
>> +	if (sync && !queued) {
>> +		pr_err("%s:%d: hardware memory corruption (SIGBUS)\n",
>> +			current->comm, task_pid_nr(current));
>
> I think it would help to include the GHES_PFX to indicate where this
> message is coming from. The pr_fmt() macro could also be introduced
> instead.

Yes, GHES_PFX is an effective prefix and would make this message
consistent with the other messages in the GHES driver. I will add it in
the next version.

What do you mean by pr_fmt()?

>
> Also, you may want to include the HW_ERR prefix. Not all kernel messages
> related to hardware errors have this prefix today. But maybe that should
> be changed so there is more consistent messaging.
>

Do we really need the HW_ERR prefix? The other cases that use the HW_ERR
prefix are for hardware registers. The messages printed when sending
SIGBUS do not include HW_ERR, e.g. in kill_proc() and kill_procs():

	pr_err("%#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",...
	pr_err("%#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",...

> Thanks,
> Yazen

Thanks for the valuable comments.

Best Regards,
Shuai
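
For reference on the pr_fmt() suggestion, here is a minimal userspace
sketch of how a pr_fmt() definition would prepend a prefix to every
pr_err() call in a file. It is an illustration only, not the change
proposed for ghes.c: the value of GHES_PFX is assumed to be "GHES: ",
and pr_err_to_buf() and format_sigbus_msg() are hypothetical stand-ins
that write into a buffer instead of the kernel log.

```c
#include <stdio.h>
#include <stddef.h>

/* Assumed value of the driver prefix, reproduced here for illustration. */
#define GHES_PFX "GHES: "

/*
 * Kernel convention: pr_fmt() is defined at the top of a source file and
 * every pr_*() helper pastes it in front of its format string literal.
 */
#define pr_fmt(fmt) GHES_PFX fmt

/*
 * Hypothetical userspace stand-in for pr_err(): it formats into a buffer
 * so the resulting message can be inspected.
 */
#define pr_err_to_buf(buf, len, fmt, ...) \
	snprintf((buf), (len), pr_fmt(fmt), __VA_ARGS__)

/* Hypothetical helper that reuses the format string from the patch. */
static int format_sigbus_msg(char *buf, size_t len, const char *comm, int pid)
{
	return pr_err_to_buf(buf, len,
			     "%s:%d: hardware memory corruption (SIGBUS)\n",
			     comm, pid);
}
```

With this in place the call site keeps the unprefixed format string and
the prefix is applied in one spot, which is the main difference from
writing GHES_PFX at each pr_err() call.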