Date: Thu, 05 Sep 2024 17:17:41 +0300
From: "Jarkko Sakkinen"
To: "Jarkko Sakkinen", "Shuai Xue"
Subject: Re: [PATCH v12 1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
References: <20221027042445.60108-1-xueshuai@linux.alibaba.com> <20240902030034.67152-2-xueshuai@linux.alibaba.com>

On Thu Sep 5, 2024 at 5:14 PM EEST, Jarkko Sakkinen wrote:
> On Thu Sep 5, 2024 at 6:04 AM EEST, Shuai Xue wrote:
> >
> >
> > On 2024/9/4 00:09, Jarkko Sakkinen wrote:
> > > On Mon Sep 2, 2024 at 6:00 AM EEST, Shuai Xue wrote:
> > >> Synchronous error was detected as a result of user-space process accessing
> > >> a 2-bit uncorrected error. The CPU will take a synchronous error exception
> > >> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> > >> memory_failure() work which poisons the related page, unmaps the page, and
> > >> then sends a SIGBUS to the process, so that a system wide panic can be
> > >> avoided.
> > >>
> > >> However, no memory_failure() work will be queued unless all of the
> > >> precondition checks below pass:
> > >>
> > >> - `if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))` in ghes_handle_memory_failure()
> > >> - `if (flags == -1)` in ghes_handle_memory_failure()
> > >> - `if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))` in ghes_do_memory_failure()
> > >> - `if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr))` in ghes_do_memory_failure()
> > >>
> > >> In such a case, the user-space process will trigger SEA again. This loop
> > >> can potentially exceed the platform firmware threshold or even trigger a
> > >> kernel hard lockup, leading to a system reboot.
> > >>
> > >> Fix it by performing a force kill if no memory_failure() work is queued
> > >> for synchronous errors.
> > >>
> > >> Suggested-by: Xiaofei Tan
> > >> Signed-off-by: Shuai Xue
> > >>
> > >> ---
> > >>  drivers/acpi/apei/ghes.c | 10 ++++++++++
> > >>  1 file changed, 10 insertions(+)
> > >>
> > >> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> > >> index 623cc0cb4a65..b0b20ee533d9 100644
> > >> --- a/drivers/acpi/apei/ghes.c
> > >> +++ b/drivers/acpi/apei/ghes.c
> > >> @@ -801,6 +801,16 @@ static bool ghes_do_proc(struct ghes *ghes,
> > >>  		}
> > >>  	}
> > >>
> > >> +	/*
> > >> +	 * If no memory failure work is queued for abnormal synchronous
> > >> +	 * errors, do a force kill.
> > >> +	 */
> > >> +	if (sync && !queued) {
> > >> +		pr_err("Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> > >> +			current->comm, task_pid_nr(current));
> > >
> > > Hmm... does this need "hardware" or would "memory corruption" be
> > > enough?
> > >
> > > Also, does this need to say that it is sending SIGBUS when the signal
> > > itself tells that already?
> > >
> > > I.e. could "%s:%d has memory corruption" be enough information?
> >
> > Hi, Jarkko,
> >
> > Thank you for your suggestion. Maybe it could.
> >
> > There are some similar error messages which use "hardware memory error", e.g.
>
> By tweaking my original suggestion just a bit:
>
> "%s:%d: hardware memory corruption"
>
> Can't get clearer than that, right?

The obvious reason is that a shorter and more consistent klog message is
easier to spot and grep. It is simply less convoluted.

If you also want SIGBUS, I'd just put it as

"%s:%d: hardware memory corruption (SIGBUS)"

BR, Jarkko
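
For illustration only, the two message formats suggested above can be compared in a small userspace sketch. This is not kernel code: format_klog() is a hypothetical helper, snprintf() stands in for pr_err(), and the task name and PID are passed in by the caller rather than read from current.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not part of the patch): build the short log
 * message proposed in this thread. When with_signal is non-zero, the
 * "(SIGBUS)" suffix is appended.
 *
 * Returns the value of snprintf(), i.e. the length the full message
 * would have had, excluding the terminating NUL.
 */
static int format_klog(char *buf, size_t len, const char *comm, int pid,
		       int with_signal)
{
	return snprintf(buf, len, "%s:%d: hardware memory corruption%s",
			comm, pid, with_signal ? " (SIGBUS)" : "");
}
```

Called with a made-up task "demo" and PID 1234, this yields "demo:1234: hardware memory corruption", or "demo:1234: hardware memory corruption (SIGBUS)" when with_signal is set. Both keep the fixed "hardware memory corruption" substring, so a single grep pattern would match either variant.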