Subject: Re: [RESEND PATCH v18 1/2] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
From: Hanjun Guo
To: Shuai Xue, "Luck, Tony", Catalin Marinas
Date: Fri, 25 Apr 2025 09:00:34 +0800
In-Reply-To: <653abdd4-46d2-4956-b49c-8f9c309af34d@linux.alibaba.com>
References: <20250404112050.42040-1-xueshuai@linux.alibaba.com> <20250404112050.42040-2-xueshuai@linux.alibaba.com> <0c0bc332-0323-4e43-a96b-dd5f5957ecc9@huawei.com> <709ee8d2-8969-424c-b32b-101c6a8220fb@linux.alibaba.com> <353809e7-5373-0d54-6ddb-767bc5af9e5f@huawei.com> <653abdd4-46d2-4956-b49c-8f9c309af34d@linux.alibaba.com>

On 2025/4/18 20:35, Shuai Xue
wrote:
>
>
> On 2025/4/18 15:48, Hanjun Guo wrote:
>> On 2025/4/14 23:02, Shuai Xue wrote:
>>>
>>>
>>> On 2025/4/14 22:37, Hanjun Guo wrote:
>>>> On 2025/4/4 19:20, Shuai Xue wrote:
>>>>> Synchronous error was detected as a result of user-space process accessing
>>>>> a 2-bit uncorrected error. The CPU will take a synchronous error exception
>>>>> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
>>>>> memory_failure() work which poisons the related page, unmaps the page, and
>>>>> then sends a SIGBUS to the process, so that a system wide panic can be
>>>>> avoided.
>>>>>
>>>>> However, no memory_failure() work will be queued when abnormal synchronous
>>>>> errors occur. These errors can include situations such as invalid PA,
>>>>> unexpected severity, no memory failure config support, invalid GUID
>>>>> section, etc. In such case, the user-space process will trigger SEA again.
>>>>> This loop can potentially exceed the platform firmware threshold or even
>>>>> trigger a kernel hard lockup, leading to a system reboot.
>>>>>
>>>>> Fix it by performing a force kill if no memory_failure() work is queued
>>>>> for synchronous errors.
>>>>>
>>>>> Signed-off-by: Shuai Xue
>>>>> Reviewed-by: Jarkko Sakkinen
>>>>> Reviewed-by: Jonathan Cameron
>>>>> Reviewed-by: Yazen Ghannam
>>>>> Reviewed-by: Jane Chu
>>>>> ---
>>>>>   drivers/acpi/apei/ghes.c | 11 +++++++++++
>>>>>   1 file changed, 11 insertions(+)
>>>>>
>>>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>>>>> index b72772494655..50e4d924aa8b 100644
>>>>> --- a/drivers/acpi/apei/ghes.c
>>>>> +++ b/drivers/acpi/apei/ghes.c
>>>>> @@ -799,6 +799,17 @@ static bool ghes_do_proc(struct ghes *ghes,
>>>>>           }
>>>>>       }
>>>>> +    /*
>>>>> +     * If no memory failure work is queued for abnormal synchronous
>>>>> +     * errors, do a force kill.
>>>>> +     */
>>>>> +    if (sync && !queued) {
>>>>> +        dev_err(ghes->dev,
>>>>> +            HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error (SIGBUS)\n",
>>>>> +            current->comm, task_pid_nr(current));
>>>>> +        force_sig(SIGBUS);
>>>>> +    }
>>>>
>>>> I think it's reasonable to send a force kill to the task when the
>>>> synchronous memory error is not recovered.
>>>>
>>>> But I hope this code will not trigger some legacy firmware issues,
>>>> let's be careful for this, so can we just introduce arch specific
>>>> callbacks for this?
>>>
>>> Sorry, can you give more details? I am not sure I got your point.
>>>
>>> For x86, Tony confirmed in a previous version that ghes will not
>>> dispatch x86 synchronous errors (a.k.a. machine check exceptions).
>>> Sync is only used on the arm64 platform, see is_hest_sync_notify().
>>
>> Sorry for the late reply, from the code I can see that x86 will reuse
>> ghes_do_proc(), if Tony confirmed that x86 is OK, it's OK to me as well.
>
> Hi, Hanjun,
>
> Glad to hear that.
>
> I copy and paste the original discussion with @Tony from the mailing list.[1]
>
>> On x86 the "action required" cases are signaled by a synchronous machine check
>> that is delivered before the instruction that is attempting to consume the uncorrected
>> data retires. I.e., it is guaranteed that the uncorrected error has not been propagated
>> because it is not visible in any architectural state.
>
>> APEI signaled errors don't fall into that category on x86 ... the uncorrected data
>> could have been consumed and propagated long before the signaling used for
>> APEI can alert the OS.
>
> I also added comments in the code.
>
> /*
>  * A platform may describe one error source for the handling of synchronous
>  * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
>  * or External Interrupt). On x86, the HEST notifications are always
>  * asynchronous, so only SEA on ARM is delivered as a synchronous
>  * notification.
>  */
> static inline bool is_hest_sync_notify(struct ghes *ghes)
> {
>     u8 notify_type = ghes->generic->notify.type;
>
>     return notify_type == ACPI_HEST_NOTIFY_SEA;
> }
>
>
> If you are happy with the code, please explicitly give me your Reviewed-by
> tag :)

Calling force_sig(SIGBUS) directly in ghes_do_proc() is not my favourite,
but I can bear that, please add

Reviewed-by: Hanjun Guo

Thanks
Hanjun
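
For readers following the thread, the decision the patch adds at the end of ghes_do_proc() can be modelled in user space roughly as below. This is only a sketch under stated assumptions, not kernel code: the enum values, `is_sync_notify()`, `decide_action()` and `action_name()` are hypothetical names introduced for illustration, and the real code calls dev_err() and force_sig(SIGBUS) rather than returning a value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical stand-ins for HEST notification types. The names and
 * values are illustrative only and are not the kernel's definitions.
 */
enum notify_type { NOTIFY_SCI, NOTIFY_EXTERNAL, NOTIFY_SEA };

/* What the modelled handler ends up doing for one error record. */
enum action { ACTION_NONE, ACTION_RECOVERY_QUEUED, ACTION_FORCE_SIGBUS };

/*
 * Mirrors the quoted is_hest_sync_notify(): only an SEA notification
 * is treated as synchronous.
 */
static bool is_sync_notify(enum notify_type type)
{
    return type == NOTIFY_SEA;
}

/*
 * Models the check added by the patch: a synchronous error for which
 * no memory_failure() work was queued results in a forced SIGBUS.
 * `queued` stands for "memory_failure() work was queued".
 */
static enum action decide_action(enum notify_type type, bool queued)
{
    bool sync = is_sync_notify(type);

    if (sync && !queued)
        return ACTION_FORCE_SIGBUS;

    return queued ? ACTION_RECOVERY_QUEUED : ACTION_NONE;
}

/* Human-readable name for an action, e.g. for logging in a test. */
static const char *action_name(enum action a)
{
    switch (a) {
    case ACTION_NONE:
        return "none";
    case ACTION_RECOVERY_QUEUED:
        return "recovery queued";
    case ACTION_FORCE_SIGBUS:
        return "force SIGBUS";
    }
    assert(!"unreachable action value");
    return "unknown";
}
```

In this model an asynchronous notification never leads to a forced SIGBUS, which matches the point made in the thread that x86 HEST notifications are always asynchronous and so are not affected by the new check.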