From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 21684D2126D for ; Thu, 17 Oct 2024 09:56:33 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id AB4BB6B008A; Thu, 17 Oct 2024 05:56:32 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id A657B6B008C; Thu, 17 Oct 2024 05:56:32 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 8DE2C6B009D; Thu, 17 Oct 2024 05:56:32 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 6AA6E6B008A for ; Thu, 17 Oct 2024 05:56:32 -0400 (EDT) Received: from smtpin06.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id B7CBE80FBA for ; Thu, 17 Oct 2024 09:56:22 +0000 (UTC) X-FDA: 82682639046.06.94A6D4B Received: from frasgout.his.huawei.com (frasgout.his.huawei.com [185.176.79.56]) by imf23.hostedemail.com (Postfix) with ESMTP id 43B1B140012 for ; Thu, 17 Oct 2024 09:56:23 +0000 (UTC) Authentication-Results: imf23.hostedemail.com; dkim=none; dmarc=pass (policy=quarantine) header.from=huawei.com; spf=pass (imf23.hostedemail.com: domain of jonathan.cameron@huawei.com designates 185.176.79.56 as permitted sender) smtp.mailfrom=jonathan.cameron@huawei.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1729158830; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=AquKsbD3/L0uo/TJsRVL7yDitCPcDwBgzuws62eFY4A=; b=jBJMX8e8BMOtnP4PSbAOs9HjFIsHr/ZmHu3B8TQwAfhwLhxGhBBk9Ymt1cfNePLlxgrYiq h936Q1YPdrvTCTZYwCpir7/YDuqGI1T4HDFtVDR3TfXRf0UzgkqMgu+cbW2C2zA21RreIT m1a/yuA0KM+OjTVtJ5cl1XqoCfIfHWI= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1729158830; a=rsa-sha256; cv=none; b=GUSLicBLgw5ydI36UYWWKdb7D+bj0Ho8O4CBvAmEoVV7peFw68Pm9kP7cWSx0K210ff16e J5VYtYHjlbSAgCU5RNs7ylGySq/2gV59n/DLu9GBs8nM+b5KmdCVIwBTlkqagUWbI8dzGF veJCsIheWvbqL0RdjTwCbQvi2F6wK6U= ARC-Authentication-Results: i=1; imf23.hostedemail.com; dkim=none; dmarc=pass (policy=quarantine) header.from=huawei.com; spf=pass (imf23.hostedemail.com: domain of jonathan.cameron@huawei.com designates 185.176.79.56 as permitted sender) smtp.mailfrom=jonathan.cameron@huawei.com Received: from mail.maildlp.com (unknown [172.18.186.216]) by frasgout.his.huawei.com (SkyGuard) with ESMTP id 4XTjmB1zcpz6J7Db; Thu, 17 Oct 2024 17:51:54 +0800 (CST) Received: from frapeml500008.china.huawei.com (unknown [7.182.85.71]) by mail.maildlp.com (Postfix) with ESMTPS id A29B4140119; Thu, 17 Oct 2024 17:56:25 +0800 (CST) Received: from localhost (10.126.174.164) by frapeml500008.china.huawei.com (7.182.85.71) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.1.2507.39; Thu, 17 Oct 2024 11:56:23 +0200 Date: Thu, 17 Oct 2024 10:56:21 +0100 From: Jonathan Cameron To: Shuai Xue CC: , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , Subject: Re: [PATCH v14 3/3] ACPI: APEI: handle synchronous exceptions in task work Message-ID: <20241017105621.00000c1e@Huawei.com> In-Reply-To: <20241014084240.18614-4-xueshuai@linux.alibaba.com> References: <20221027042445.60108-1-xueshuai@linux.alibaba.com> <20241014084240.18614-4-xueshuai@linux.alibaba.com> Organization: Huawei Technologies Research and Development (UK) Ltd. X-Mailer: Claws Mail 4.1.0 (GTK 3.24.33; x86_64-w64-mingw32) MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.126.174.164] X-ClientProxiedBy: lhrpeml500001.china.huawei.com (7.191.163.213) To frapeml500008.china.huawei.com (7.182.85.71) X-Rspamd-Server: rspam07 X-Rspamd-Queue-Id: 43B1B140012 X-Stat-Signature: 18o7fuks5run637hizgbdjfgtfmd8wm3 X-Rspam-User: X-HE-Tag: 1729158983-448739 X-HE-Meta: U2FsdGVkX197UPkS1IWpkCPWJ8b2uMO6AJL5268bniWZZYlnLW/IHxckXU5ijkgKXbCfcTesb5DTrtz2Wgo76yXo2f+0WV4MFoqUSzJ0VUL/ANwhEdbcuVFIHjHI2/bxDlyDZEpNQ70fuHoCNoS+XwBHvmz9uHVfO41VTybtDeKr46Ocuc135BM74exALd1jozYmNfKPQZe6aHtULr0LUElv4skXD9v79tx/kcGJnZxQhsWxRRRlKA2+aM7wGJXYZM8sx/L/cUF5BN+6bsc0wEj5iV7JtvUtl2SmTnNl+1wBsc1Cpk8OwgPOo+7rXXlj1xcjCE3dOJPp85MgHs41AN5pMc9fuq/7i0wYWWJI6zFN70wVOz01tkU8IHWfataGUedvuZ+dDXNZdRUpQJWRsD8iooNo44x/lx3eVjwOtt1vpzhC9uCR1mS4u6IjLDSEFp8XC2z5UEK45nNZQ653wEqd3oRLsBeCa5t8LJK1FBwQaKdeysR0q5w5kJf4n6GCUN6BWnWqG6S6gi9EHe84ORyZL8S1JfChg0DTUmwR55iVPx0IF6tlROU7rxJ1eM5L7mRuuf1d3HOGO0XAOhI+/nJbjkhaod+LGGq3pEZMl6FDIgU3HTCArOJh5d7blp2JGsgfTAoD4w0I5paMcVWMULEHGi7JcvgZfOzOSU5re3zGyKrR32/+6TqB0k0OTCnCR47GP3eYUdTqz4WDDPOAQ2nbuhTRrtppG7JjIKjx0eqqe8rktUNJaLGnK2z2J/ym0xpAkmLMumTgn5msj45ZPhgDIPwlSKmy+M7tZtUU5JZp00IGaxyknLwjSTExgMrwQCf1oa6uUc9mCusaZ7uV8TyASLcPzpXC/SsZ7+eD4rdeBKHVFR5WQPKM98xnF1cKiqo7V1USMTkD5+sVdF+zRglcSqI6J80K2wb3j5EOnaMLwE3UzHG19V6zBkZUl09ZrXRBTAq3fgXJWQl0Ndw R6Iw/yVx XRfvruojGUTIiQf0WkXdOMaUykq+PK1/9es7VftJ3n31PQZuN3oK8DL5LCTv2fuPrfh2GSLIdBqixqrbrPefnxdZ+38oRjkWdI9QJQNOLZuFvdVFW3PgHajr4OPFfzykWqWHnFQNLFiqH3q3Q1xBkqFiqLnKw+BKDzlLfVjVr4g45Z0ZjRjzjswd7rQNuhXg5z3rqynkNqJOciWo91nS9nQSeyFsmnRQDetNqrqjq2GKckuaiU38khgFM8WaR/p/sKmt+ X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Mon, 14 Oct 2024 16:42:40 +0800 Shuai Xue wrote: > The memory uncorrected error could be signaled by asynchronous interrupt > (specifically, SPI in arm64 platform), e.g. when an error is detected by > a background scrubber, or signaled by synchronous exception > (specifically, data abort excepction in arm64 platform), e.g. when a CPU exception > tries to access a poisoned cache line. Currently, both synchronous and > asynchronous error use memory_failure_queue() to schedule > memory_failure() exectute in kworker context. memory_failure() to execute in kworker context. (spell check in general) > > As a result, when a user-space process is accessing a poisoned data, a > data abort is taken and the memory_failure() is executed in the kworker > context: context, memory_failure(): //No subject of the following two bullets otherwise. > > - will send wrong si_code by SIGBUS signal in early_kill mode, and > - can not kill the user-space in some cases resulting a synchronous > error infinite loop > > Issue 1: send wrong si_code in early_kill mode > > Since commit a70297d22132 ("ACPI: APEI: set memory failure flags as > MF_ACTION_REQUIRED on synchronous events")', the flag MF_ACTION_REQUIRED > could be used to determine whether a synchronous exception occurs on > ARM64 platform. When a synchronous exception is detected, the kernel is > expected to terminate the current process which has accessed poisoned > page. This is done by sending a SIGBUS signal with an error code > BUS_MCEERR_AR, indicating an action-required machine check error on > read. > > However, when kill_proc() is called to terminate the processes who have > the poisoned page mapped, it sends the incorrect SIGBUS error code > BUS_MCEERR_AO because the context in which it operates is not the one > where the error was triggered. > > To reproduce this problem: > > #sysctl -w vm.memory_failure_early_kill=1 > vm.memory_failure_early_kill = 1 > > # STEP2: inject an UCE error and consume it to trigger a synchronous error > #einj_mem_uc single > 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400 > injecting ... > triggering ... > signal 7 code 5 addr 0xffffb0d75000 > page not present > Test passed > > The si_code (code 5) from einj_mem_uc indicates that it is BUS_MCEERR_AO > error and it is not fact. > > After this patch: > > # STEP1: enable early kill mode > #sysctl -w vm.memory_failure_early_kill=1 > vm.memory_failure_early_kill = 1 > # STEP2: inject an UCE error and consume it to trigger a synchronous error > #einj_mem_uc single > 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400 > injecting ... > triggering ... > signal 7 code 4 addr 0xffffb0d75000 > page not present > Test passed > > The si_code (code 4) from einj_mem_uc indicates that it is BUS_MCEERR_AR > error as we expected. > > Issue 2: a synchronous error infinite loop > > If a user-space process, e.g. devmem, a poisoned page which has been set devmem accesses a poisoned page for which the HWPoison flag is set > HWPosion flag, kill_accessing_process() is called to send SIGBUS to the > current processs with error info. Because the memory_failure() is > executed in the kworker contex, it will just do nothing but return context > EFAULT. So, devmem will access the posioned page and trigger an > excepction again, resulting in a synchronous error infinite loop. Such exception > loop may cause platform firmware to exceed some threshold and reboot > when Linux could have recovered from this error. > > To reproduce this problem: > > # STEP 1: inject an UCE error, and kernel will set HWPosion flag for related page > #einj_mem_uc single > 0: single vaddr = 0xffffb0d75400 paddr = 4092d55b400 > injecting ... > triggering ... > signal 7 code 4 addr 0xffffb0d75000 > page not present > Test passed > > # STEP 2: access the same page and it will trigger a synchronous error infinite loop > devmem 0x4092d55b400 > > To fix above two issues, queue memory_failure() as a task_work so that it runs in > the context of the process that is actually consuming the poisoned data. > > Signed-off-by: Shuai Xue > Tested-by: Ma Wupeng > Reviewed-by: Kefeng Wang > Reviewed-by: Xiaofei Tan > Reviewed-by: Baolin Wang One other trivial comment inline. Whilst this also looks fine to me, there are others who (hopefully) understand these paths better than me. With that said. Reviewed-by: Jonathan Cameron > --- > drivers/acpi/apei/ghes.c | 78 +++++++++++++++++++++++----------------- > include/acpi/ghes.h | 3 -- > include/linux/mm.h | 1 - > mm/memory-failure.c | 13 ------- > 4 files changed, 45 insertions(+), 50 deletions(-) > > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c > index f2ee28c44d7a..95e9520eb803 100644 > --- a/drivers/acpi/apei/ghes.c > +++ b/drivers/acpi/apei/ghes.c > @@ -467,28 +467,42 @@ static void ghes_clear_estatus(struct ghes *ghes, > } > > /* > - * Called as task_work before returning to user-space. > - * Ensure any queued work has been done before we return to the context that > - * triggered the notification. > + * struct ghes_task_work - for synchronous RAS event > + * > + * @twork: callback_head for task work > + * @pfn: page frame number of corrupted page > + * @flags: work control flags > + * > + * Structure to pass task work to be handled before > + * returning to user-space via task_work_add(). > */ > -static void ghes_kick_task_work(struct callback_head *head) > +struct ghes_task_work { > + struct callback_head twork; > + u64 pfn; > + int flags; > +}; > + > +static void memory_failure_cb(struct callback_head *twork) > { > - struct acpi_hest_generic_status *estatus; > - struct ghes_estatus_node *estatus_node; > - u32 node_len; > + struct ghes_task_work *twcb = container_of(twork, struct ghes_task_work, twork); > + unsigned long pfn = twcb->pfn; This local variable is not used consistently. I'd just drop it in favor of always accessing via twcb->pfn > + int ret; > > - estatus_node = container_of(head, struct ghes_estatus_node, task_work); > - if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE)) > - memory_failure_queue_kick(estatus_node->task_work_cpu); > + ret = memory_failure(twcb->pfn, twcb->flags); > + gen_pool_free(ghes_estatus_pool, (unsigned long)twcb, sizeof(*twcb)); > > - estatus = GHES_ESTATUS_FROM_NODE(estatus_node); > - node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus)); > - gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len); > + if (!ret || ret == -EHWPOISON || ret == -EOPNOTSUPP) > + return; > + > + pr_err("%#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n", > + pfn, current->comm, task_pid_nr(current)); > + force_sig(SIGBUS); > }