From: Shuai Xue
Date: Fri, 25 Apr 2025 09:10:09 +0800
Subject: Re: [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
To: Hanjun Guo, "Luck, Tony", rafael@kernel.org, Catalin Marinas
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org, linux-edac@vger.kernel.org, x86@kernel.org, justin.he@arm.com, ardb@kernel.org, ying.huang@linux.alibaba.com, ashish.kalra@amd.com, baolin.wang@linux.alibaba.com, tglx@linutronix.de, dave.hansen@linux.intel.com, lenb@kernel.org, hpa@zytor.com, robert.moore@intel.com, lvying6@huawei.com, xiexiuqi@huawei.com, zhuo.song@linux.alibaba.com, sudeep.holla@arm.com, lpieralisi@kernel.org, linux-acpi@vger.kernel.org, yazen.ghannam@amd.com, mark.rutland@arm.com, mingo@redhat.com, robin.murphy@arm.com, Jonathan.Cameron@Huawei.com, bp@alien8.de, linux-arm-kernel@lists.infradead.org, wangkefeng.wang@huawei.com, tanxiaofei@huawei.com, mawupeng1@huawei.com, linmiaohe@huawei.com, naoya.horiguchi@nec.com, james.morse@arm.com, tongtiangen@huawei.com, gregkh@linuxfoundation.org, will@kernel.org, jarkko@kernel.org
References: <20250404112050.42040-1-xueshuai@linux.alibaba.com> <20250404112050.42040-2-xueshuai@linux.alibaba.com> <0c0bc332-0323-4e43-a96b-dd5f5957ecc9@huawei.com> <709ee8d2-8969-424c-b32b-101c6a8220fb@linux.alibaba.com> <353809e7-5373-0d54-6ddb-767bc5af9e5f@huawei.com> <653abdd4-46d2-4956-b49c-8f9c309af34d@linux.alibaba.com>

On 2025/4/25 09:00, Hanjun Guo wrote:
> On 2025/4/18 20:35, Shuai Xue wrote:
>>
>>
>> On 2025/4/18 15:48, Hanjun Guo wrote:
>>> On 2025/4/14 23:02, Shuai Xue wrote:
>>>>
>>>>
>>>> On 2025/4/14 22:37, Hanjun Guo wrote:
>>>>> On 2025/4/4 19:20, Shuai Xue wrote:
>>>>>> Synchronous error was detected as a result of user-space process accessing
>>>>>> a 2-bit uncorrected error.
>>>>>> The CPU will take a synchronous error exception
>>>>>> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
>>>>>> memory_failure() work which poisons the related page, unmaps the page, and
>>>>>> then sends a SIGBUS to the process, so that a system wide panic can be
>>>>>> avoided.
>>>>>>
>>>>>> However, no memory_failure() work will be queued when abnormal synchronous
>>>>>> errors occur. These errors can include situations such as invalid PA,
>>>>>> unexpected severity, no memory failure config support, invalid GUID
>>>>>> section, etc. In such case, the user-space process will trigger SEA again.
>>>>>> This loop can potentially exceed the platform firmware threshold or even
>>>>>> trigger a kernel hard lockup, leading to a system reboot.
>>>>>>
>>>>>> Fix it by performing a force kill if no memory_failure() work is queued
>>>>>> for synchronous errors.
>>>>>>
>>>>>> Signed-off-by: Shuai Xue
>>>>>> Reviewed-by: Jarkko Sakkinen
>>>>>> Reviewed-by: Jonathan Cameron
>>>>>> Reviewed-by: Yazen Ghannam
>>>>>> Reviewed-by: Jane Chu
>>>>>> ---
>>>>>>   drivers/acpi/apei/ghes.c | 11 +++++++++++
>>>>>>   1 file changed, 11 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>>>>> index b72772494655..50e4d924aa8b 100644
>>>>>> --- a/drivers/acpi/apei/ghes.c
>>>>>> +++ b/drivers/acpi/apei/ghes.c
>>>>>> @@ -799,6 +799,17 @@ static bool ghes_do_proc(struct ghes *ghes,
>>>>>>           }
>>>>>>       }
>>>>>> +    /*
>>>>>> +     * If no memory failure work is queued for abnormal synchronous
>>>>>> +     * errors, do a force kill.
>>>>>> +     */
>>>>>> +    if (sync && !queued) {
>>>>>> +        dev_err(ghes->dev,
>>>>>> +            HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error (SIGBUS)\n",
>>>>>> +            current->comm, task_pid_nr(current));
>>>>>> +        force_sig(SIGBUS);
>>>>>> +    }
>>>>>
>>>>> I think it's reasonable to send a force kill to the task when the
>>>>> synchronous memory error is not recovered.
>>>>>
>>>>> But I hope this code will not trigger some legacy firmware issues,
>>>>> let's be careful for this, so can we just introduce arch specific
>>>>> callbacks for this?
>>>>
>>>> Sorry, can you give more details? I am not sure I got your point.
>>>>
>>>> For x86, Tony confirmed in a previous version that ghes will not
>>>> dispatch x86 synchronous errors (a.k.a. machine check exceptions).
>>>> Sync is only used on the arm64 platform, see is_hest_sync_notify().
>>>
>>> Sorry for the late reply, from the code I can see that x86 will reuse
>>> ghes_do_proc(), if Tony confirmed that x86 is OK, it's OK to me as well.
>>
>> Hi, Hanjun,
>>
>> Glad to hear that.
>>
>> I have copied the original discussion with @Tony from the mailing list. [1]
>>
>>> On x86 the "action required" cases are signaled by a synchronous machine check
>>> that is delivered before the instruction that is attempting to consume the uncorrected
>>> data retires. I.e., it is guaranteed that the uncorrected error has not been propagated
>>> because it is not visible in any architectural state.
>>
>>> APEI signaled errors don't fall into that category on x86 ... the uncorrected data
>>> could have been consumed and propagated long before the signaling used for
>>> APEI can alert the OS.
>>
>> I also added comments in the code.
>>
>> /*
>>   * A platform may describe one error source for the handling of synchronous
>>   * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
>>   * or External Interrupt).
>>   * On x86, the HEST notifications are always
>>   * asynchronous, so only SEA on ARM is delivered as a synchronous
>>   * notification.
>>   */
>> static inline bool is_hest_sync_notify(struct ghes *ghes)
>> {
>>      u8 notify_type = ghes->generic->notify.type;
>>
>>      return notify_type == ACPI_HEST_NOTIFY_SEA;
>> }
>>
>>
>> If you are happy with the code, please explicitly give me your Reviewed-by tag :)
>
> Calling force_sig(SIGBUS) directly in ghes_do_proc() is not my favourite,
> but I can bear that, please add
>
> Reviewed-by: Hanjun Guo
>
> Thanks
> Hanjun

Thanks, Hanjun.

@Rafael, @Catalin,

Both patch 1 and 2 now have a Reviewed-by tag from Hanjun, an arm64 ACPI
maintainer. Are you happy to pick up and queue this patch set in the ACPI tree
or the arm tree? If you need me to send a new version to collect the
Reviewed-by tag, please let me know.

Thanks.

Best Regards,
Shuai
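
P.S. For anyone skimming the thread, the decision this patch adds can be modelled in plain user-space C. This is only a rough sketch under stated assumptions: every model_* name and enum value is a hypothetical stand-in rather than a kernel definition, and nothing here calls a kernel API. It assumes, as the patch does, that only an SEA notification counts as synchronous and that a synchronous error with no memory_failure() work queued ends in a forced SIGBUS.

```c
#include <stdbool.h>

/*
 * Hypothetical stand-ins for a few HEST notification types.
 * The values are illustrative only, not the kernel's constants.
 */
enum model_notify_type {
	MODEL_NOTIFY_SCI,	/* asynchronous interrupt-style notification */
	MODEL_NOTIFY_MCE,	/* treated as asynchronous for HEST on x86 */
	MODEL_NOTIFY_SEA,	/* Synchronous External Abort on arm64 */
};

/* Possible outcomes after all error sections have been processed. */
enum model_action {
	MODEL_NO_EXTRA_ACTION,	/* async error, nothing queued: hunk does nothing */
	MODEL_RECOVERY_QUEUED,	/* memory_failure() work will handle the page */
	MODEL_FORCE_SIGBUS,	/* sync error, nothing queued: kill current task */
};

/* Mirrors the idea of is_hest_sync_notify(): only SEA is synchronous. */
bool model_is_sync_notify(enum model_notify_type type)
{
	return type == MODEL_NOTIFY_SEA;
}

/*
 * Mirrors the "if (sync && !queued)" check from the patch.
 * 'queued' says whether any memory_failure() work was queued.
 */
enum model_action model_decide(enum model_notify_type type, bool queued)
{
	bool sync = model_is_sync_notify(type);

	if (queued)
		return MODEL_RECOVERY_QUEUED;
	if (sync)
		return MODEL_FORCE_SIGBUS;
	return MODEL_NO_EXTRA_ACTION;
}
```

In this model, calling model_decide() with MODEL_NOTIFY_SEA and queued == false is the only combination that yields MODEL_FORCE_SIGBUS, which is why the change should not affect asynchronously notified error sources.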