From: Kefeng Wang
To: Shuai Xue
Subject: Re: [PATCH v5 1/2] ACPI: APEI: set memory failure flags as MF_ACTION_REQUIRED on synchronous events
Date: Tue, 11 Apr 2023 22:17:14 +0800
References: <20221027042445.60108-1-xueshuai@linux.alibaba.com> <20230411104842.37079-2-xueshuai@linux.alibaba.com>
In-Reply-To: <20230411104842.37079-2-xueshuai@linux.alibaba.com>

Hi Shuai Xue,

On 2023/4/11 18:48, Shuai Xue wrote:
> There are two major types of uncorrected recoverable (UCR) errors :
>
> - Action Required (AR): The error is detected and the processor already
>   consumes the memory. OS requires to take action (for example, offline
>   failure page/kill failure thread) to recover this uncorrectable error.
>
> - Action Optional (AO): The error is detected out of processor execution
>   context. Some data in the memory are corrupted. But the data have not
>   been consumed. OS is optional to take action to recover this
>   uncorrectable error.
>
> The essential difference between AR and AO errors is that AR is a
> synchronous event, while AO is an asynchronous event. The hardware will
> signal a synchronous exception (Machine Check Exception on X86 and
> Synchronous External Abort on Arm64) when an error is detected and the
> memory access has been architecturally executed.
>
> When APEI firmware first is enabled, a platform may describe one error
> source for the handling of synchronous errors (e.g. MCE or SEA notification
> ), or for handling asynchronous errors (e.g. SCI or External Interrupt
> notification). In other words, we can distinguish synchronous errors by
> APEI notification. For AR errors, kernel will kill current process
> accessing the poisoned page by sending SIGBUS with BUS_MCEERR_AR. In
> addition, for AO errors, kernel will notify the process who owns the
> poisoned page by sending SIGBUS with BUS_MCEERR_AO in early kill mode.
> However, the GHES driver always sets mf_flags to 0 so that all UCR errors
> are handled as AO errors in memory failure.
>
> To this end, set memory failure flags as MF_ACTION_REQUIRED on synchronous
> events.
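
For anyone following along, the mapping described above can be sketched as
standalone C (not kernel code). This is only an illustration under stated
assumptions: it follows the code in this patch, where SEA is the only
notification treated as synchronous, and every sketch_* / SKETCH_* name is a
hypothetical stand-in invented for the example. Only SIGBUS, siginfo_t,
BUS_MCEERR_AR and BUS_MCEERR_AO are real userspace names.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* expose siginfo_t and the BUS_* si_code values */
#endif
#include <signal.h>
#include <stdbool.h>

/* Linux si_code values, provided only in case the libc headers hide them. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical stand-ins for HEST notification types; not the ACPI values. */
enum sketch_notify { SKETCH_NOTIFY_SCI, SKETCH_NOTIFY_MCE, SKETCH_NOTIFY_SEA };

/* Hypothetical stand-in for the kernel's MF_ACTION_REQUIRED flag bit. */
enum { SKETCH_MF_ACTION_REQUIRED = 1 << 1 };

/* As in the patch: only SEA is treated as a synchronous notification. */
bool sketch_is_sync_notify(enum sketch_notify type)
{
	return type == SKETCH_NOTIFY_SEA;
}

/* Flags chosen for a recoverable error: AR if synchronous, otherwise 0 (AO). */
int sketch_mf_flags(enum sketch_notify type)
{
	return sketch_is_sync_notify(type) ? SKETCH_MF_ACTION_REQUIRED : 0;
}

/*
 * si_code a process would expect to see for a given flags value.
 * The AO signal is only delivered in early kill mode.
 */
int sketch_expected_si_code(int mf_flags)
{
	return (mf_flags & SKETCH_MF_ACTION_REQUIRED) ? BUS_MCEERR_AR
						      : BUS_MCEERR_AO;
}

enum sketch_sigbus_kind {
	SKETCH_SIGBUS_OTHER,	/* not a memory-failure SIGBUS */
	SKETCH_SIGBUS_AR,	/* poison was consumed; action required */
	SKETCH_SIGBUS_AO,	/* poison found but not consumed; advisory */
};

/* Userspace side: classify a SIGBUS from the siginfo a handler receives. */
enum sketch_sigbus_kind sketch_classify_sigbus(const siginfo_t *si)
{
	if (si->si_signo != SIGBUS)
		return SKETCH_SIGBUS_OTHER;

	switch (si->si_code) {
	case BUS_MCEERR_AR:
		return SKETCH_SIGBUS_AR;
	case BUS_MCEERR_AO:
		return SKETCH_SIGBUS_AO;
	default:
		return SKETCH_SIGBUS_OTHER;
	}
}
```

A handler installed with sigaction() and SA_SIGINFO could pass its siginfo_t
to sketch_classify_sigbus() and decide whether it must abandon the faulting
access (AR) or can simply drop the affected page at its convenience (AO).
Before this patch, the first case never showed up for GHES-reported errors
because the flags were always 0.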

As you mentioned in the cover letter, we hit the same issue and hope it
can be fixed ASAP. This patch looks good to me.

Reviewed-by: Kefeng Wang

>
> Fixes: ba61ca4aab47 ("ACPI, APEI, GHES: Add hardware memory error recovery support")'
> Signed-off-by: Shuai Xue
> Tested-by: Ma Wupeng
> ---
>  drivers/acpi/apei/ghes.c | 29 +++++++++++++++++++++++------
>  1 file changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 34ad071a64e9..c479b85899f5 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -101,6 +101,20 @@ static inline bool is_hest_type_generic_v2(struct ghes *ghes)
>  	return ghes->generic->header.type == ACPI_HEST_TYPE_GENERIC_ERROR_V2;
>  }
>
> +/*
> + * A platform may describe one error source for the handling of synchronous
> + * errors (e.g. MCE or SEA), or for handling asynchronous errors (e.g. SCI
> + * or External Interrupt). On x86, the HEST notifications are always
> + * asynchronous, so only SEA on ARM is delivered as a synchronous
> + * notification.
> + */
> +static inline bool is_hest_sync_notify(struct ghes *ghes)
> +{
> +	u8 notify_type = ghes->generic->notify.type;
> +
> +	return notify_type == ACPI_HEST_NOTIFY_SEA;
> +}
> +
>  /*
>   * This driver isn't really modular, however for the time being,
>   * continuing to use module_param is the easiest way to remain
> @@ -477,7 +491,7 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags)
>  }
>
>  static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
> -				       int sev)
> +				       int sev, bool sync)
>  {
>  	int flags = -1;
>  	int sec_sev = ghes_severity(gdata->error_severity);
> @@ -491,7 +505,7 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>  	    (gdata->flags & CPER_SEC_ERROR_THRESHOLD_EXCEEDED))
>  		flags = MF_SOFT_OFFLINE;
>  	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
> -		flags = 0;
> +		flags = sync ? MF_ACTION_REQUIRED : 0;
>
>  	if (flags != -1)
>  		return ghes_do_memory_failure(mem_err->physical_addr, flags);
> @@ -499,9 +513,11 @@ static bool ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata,
>  	return false;
>  }
>
> -static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int sev)
> +static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
> +				     int sev, bool sync)
>  {
>  	struct cper_sec_proc_arm *err = acpi_hest_get_payload(gdata);
> +	int flags = sync ? MF_ACTION_REQUIRED : 0;
>  	bool queued = false;
>  	int sec_sev, i;
>  	char *p;
> @@ -526,7 +542,7 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata, int s
>  		 * and don't filter out 'corrected' error here.
>  		 */
>  		if (is_cache && has_pa) {
> -			queued = ghes_do_memory_failure(err_info->physical_fault_addr, 0);
> +			queued = ghes_do_memory_failure(err_info->physical_fault_addr, flags);
>  			p += err_info->length;
>  			continue;
>  		}
> @@ -647,6 +663,7 @@ static bool ghes_do_proc(struct ghes *ghes,
>  	const guid_t *fru_id = &guid_null;
>  	char *fru_text = "";
>  	bool queued = false;
> +	bool sync = is_hest_sync_notify(ghes);
>
>  	sev = ghes_severity(estatus->error_severity);
>  	apei_estatus_for_each_section(estatus, gdata) {
> @@ -664,13 +681,13 @@ static bool ghes_do_proc(struct ghes *ghes,
>  			atomic_notifier_call_chain(&ghes_report_chain, sev, mem_err);
>
>  			arch_apei_report_mem_error(sev, mem_err);
> -			queued = ghes_handle_memory_failure(gdata, sev);
> +			queued = ghes_handle_memory_failure(gdata, sev, sync);
>  		}
>  		else if (guid_equal(sec_type, &CPER_SEC_PCIE)) {
>  			ghes_handle_aer(gdata);
>  		}
>  		else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM)) {
> -			queued = ghes_handle_arm_hw_error(gdata, sev);
> +			queued = ghes_handle_arm_hw_error(gdata, sev, sync);
>  		} else {
>  			void *err = acpi_hest_get_payload(gdata);
>