Message-ID: <653abdd4-46d2-4956-b49c-8f9c309af34d@linux.alibaba.com>
Date: Fri, 18 Apr 2025 20:35:03 +0800
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
To: Hanjun Guo, "Luck, Tony", rafael@kernel.org, Catalin Marinas
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org, linux-edac@vger.kernel.org, x86@kernel.org, justin.he@arm.com, ardb@kernel.org, ying.huang@linux.alibaba.com, ashish.kalra@amd.com,
 baolin.wang@linux.alibaba.com, tglx@linutronix.de, dave.hansen@linux.intel.com, lenb@kernel.org, hpa@zytor.com, robert.moore@intel.com, lvying6@huawei.com, xiexiuqi@huawei.com, zhuo.song@linux.alibaba.com, Hanjun Guo, catalin.marinas@arm.com, sudeep.holla@arm.com, lpieralisi@kernel.org, linux-acpi@vger.kernel.org, yazen.ghannam@amd.com, mark.rutland@arm.com, mingo@redhat.com, robin.murphy@arm.com, Jonathan.Cameron@Huawei.com, bp@alien8.de, rafael@kernel.org, linux-arm-kernel@lists.infradead.org, wangkefeng.wang@huawei.com, tanxiaofei@huawei.com, mawupeng1@huawei.com, tony.luck@intel.com, linmiaohe@huawei.com, naoya.horiguchi@nec.com, james.morse@arm.com, tongtiangen@huawei.com, gregkh@linuxfoundation.org, will@kernel.org, jarkko@kernel.org
References: <20250404112050.42040-1-xueshuai@linux.alibaba.com> <20250404112050.42040-2-xueshuai@linux.alibaba.com> <0c0bc332-0323-4e43-a96b-dd5f5957ecc9@huawei.com> <709ee8d2-8969-424c-b32b-101c6a8220fb@linux.alibaba.com> <353809e7-5373-0d54-6ddb-767bc5af9e5f@huawei.com>
From: Shuai Xue
In-Reply-To: <353809e7-5373-0d54-6ddb-767bc5af9e5f@huawei.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

On 2025/4/18 15:48, Hanjun Guo wrote:
> On 2025/4/14 23:02, Shuai Xue wrote:
>>
>>
>> On 2025/4/14 22:37, Hanjun Guo wrote:
>>> On 2025/4/4 19:20, Shuai Xue wrote:
>>>> Synchronous error was detected as a result of user-space process accessing
>>>> a 2-bit uncorrected error. The CPU will take a synchronous error exception
>>>> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
>>>> memory_failure() work which poisons the related page, unmaps the page, and
>>>> then sends a SIGBUS to the process, so that a system wide panic can be
>>>> avoided.
>>>>
>>>> However, no memory_failure() work will be queued when abnormal synchronous
>>>> errors occur. These errors can include situations such as invalid PA,
>>>> unexpected severity, no memory failure config support, invalid GUID
>>>> section, etc.
>>>> In such case, the user-space process will trigger SEA again.
>>>> This loop can potentially exceed the platform firmware threshold or even
>>>> trigger a kernel hard lockup, leading to a system reboot.
>>>>
>>>> Fix it by performing a force kill if no memory_failure() work is queued
>>>> for synchronous errors.
>>>>
>>>> Signed-off-by: Shuai Xue
>>>> Reviewed-by: Jarkko Sakkinen
>>>> Reviewed-by: Jonathan Cameron
>>>> Reviewed-by: Yazen Ghannam
>>>> Reviewed-by: Jane Chu
>>>> ---
>>>>   drivers/acpi/apei/ghes.c | 11 +++++++++++
>>>>   1 file changed, 11 insertions(+)
>>>>
>>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>>> index b72772494655..50e4d924aa8b 100644
>>>> --- a/drivers/acpi/apei/ghes.c
>>>> +++ b/drivers/acpi/apei/ghes.c
>>>> @@ -799,6 +799,17 @@ static bool ghes_do_proc(struct ghes *ghes,
>>>>           }
>>>>       }
>>>> +    /*
>>>> +     * If no memory failure work is queued for abnormal synchronous
>>>> +     * errors, do a force kill.
>>>> +     */
>>>> +    if (sync && !queued) {
>>>> +        dev_err(ghes->dev,
>>>> +            HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error (SIGBUS)\n",
>>>> +            current->comm, task_pid_nr(current));
>>>> +        force_sig(SIGBUS);
>>>> +    }
>>>
>>> I think it's reasonable to send a force kill to the task when the
>>> synchronous memory error is not recovered.
>>>
>>> But I hope this code will not trigger some legacy firmware issues,
>>> let's be careful for this, so can we just introduce arch specific
>>> callbacks for this?
>>
>> Sorry, can you give more details? I am not sure I got your point.
>>
>> For x86, Tony confirmed that ghes will not dispatch x86 synchronous errors
>> (a.k.a. machine check exception), in the previous version.
>> Sync is only used on the arm64 platform, see is_hest_sync_notify().
>
> Sorry for the late reply, from the code I can see that x86 will reuse
> ghes_do_proc(), if Tony confirmed that x86 is OK, it's OK to me as well.

Hi, Hanjun,

Glad to hear that. I have copied the original discussion with Tony from
the mailing list here. [1]

> On x86 the "action required" cases are signaled by a synchronous machine check
> that is delivered before the instruction that is attempting to consume the uncorrected
> data retires. I.e., it is guaranteed that the uncorrected error has not been propagated
> because it is not visible in any architectural state.
> APEI signaled errors don't fall into that category on x86 ... the uncorrected data
> could have been consumed and propagated long before the signaling used for
> APEI can alert the OS.

I also added comments in the code:

/*
 * A platform may describe one error source for the handling of synchronous
 * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
 * or External Interrupt). On x86, the HEST notifications are always
 * asynchronous, so only SEA on ARM is delivered as a synchronous
 * notification.
 */
static inline bool is_hest_sync_notify(struct ghes *ghes)
{
	u8 notify_type = ghes->generic->notify.type;

	return notify_type == ACPI_HEST_NOTIFY_SEA;
}

If you are happy with the code, please explicitly give me your Reviewed-by tag :)

>
> Thanks
> Hanjun

Thanks.

Best Regards,
Shuai

[1] https://lore.kernel.org/lkml/CAJZ5v0hdgxsDiXqOmeqBQoZUQJ1RssM=3jpYpWt3qzy0n2eyaA@mail.gmail.com/t/#u
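
For anyone following the thread without the kernel tree at hand, the decision
discussed above can be modeled in user space. This is only a rough sketch, not
the kernel code: the enum, its values, and every model_* name are hypothetical
stand-ins made up for illustration. It assumes, as described in the thread,
that only an SEA notification is treated as synchronous and that the force
kill applies only when no memory_failure() work was queued.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-ins for HEST notification types. The names and values
 * are illustrative only and are not taken from the ACPI headers.
 */
enum model_notify_type {
	MODEL_NOTIFY_SCI,
	MODEL_NOTIFY_EXTERNAL,
	MODEL_NOTIFY_SEA,
};

/* Models the idea behind is_hest_sync_notify(): only SEA counts as synchronous. */
static bool model_is_sync_notify(enum model_notify_type type)
{
	return type == MODEL_NOTIFY_SEA;
}

/*
 * Models the new check in ghes_do_proc(): a force kill is needed only when
 * the error is synchronous and no memory_failure() work was queued.
 */
static bool model_needs_force_kill(enum model_notify_type type, bool queued)
{
	return model_is_sync_notify(type) && !queued;
}

/* Small self-check covering the cases discussed in this thread. */
static void model_self_check(void)
{
	assert(model_needs_force_kill(MODEL_NOTIFY_SEA, false));
	assert(!model_needs_force_kill(MODEL_NOTIFY_SEA, true));
	assert(!model_needs_force_kill(MODEL_NOTIFY_SCI, false));
	assert(!model_needs_force_kill(MODEL_NOTIFY_EXTERNAL, false));
}
```

In this model the asynchronous notification types never reach the force-kill
path regardless of whether work was queued, which matches the point that
APEI-signaled errors on x86 are not affected by the new check.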