From: "Jarkko Sakkinen"
To: "Shuai Xue"
Date: Fri, 06 Sep 2024 17:42:20 +0300
Subject: Re: [PATCH v12 1/3] ACPI: APEI: send SIGBUS to current task if synchronous memory error not recovered
References: <20221027042445.60108-1-xueshuai@linux.alibaba.com> <20240902030034.67152-2-xueshuai@linux.alibaba.com> <34d5d58b-7fc2-4f93-9d3b-3051ec5e6a23@linux.alibaba.com>
In-Reply-To: <34d5d58b-7fc2-4f93-9d3b-3051ec5e6a23@linux.alibaba.com>

On Fri Sep 6, 2024 at 4:53 AM EEST, Shuai Xue wrote:
>
>
> On 2024/9/5 22:17, Jarkko Sakkinen wrote:
> > On Thu Sep 5, 2024 at 5:14 PM EEST, Jarkko Sakkinen wrote:
> >> On Thu Sep 5, 2024 at 6:04 AM EEST, Shuai Xue wrote:
> >>>
> >>>
> >>> On 2024/9/4 00:09, Jarkko Sakkinen wrote:
> >>>> On Mon Sep 2, 2024 at 6:00 AM EEST, Shuai Xue wrote:
> >>>>> Synchronous error was detected as a result of user-space process accessing
> >>>>> a 2-bit uncorrected error. The CPU will take a synchronous error exception
> >>>>> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a
> >>>>> memory_failure() work which poisons the related page, unmaps the page, and
> >>>>> then sends a SIGBUS to the process, so that a system wide panic can be
> >>>>> avoided.
> >>>>>
> >>>>> However, no memory_failure() work will be queued unless all the
> >>>>> precondition checks below pass:
> >>>>>
> >>>>> - `if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))` in ghes_handle_memory_failure()
> >>>>> - `if (flags == -1)` in ghes_handle_memory_failure()
> >>>>> - `if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE))` in ghes_do_memory_failure()
> >>>>> - `if (!pfn_valid(pfn) && !arch_is_platform_page(physical_addr))` in ghes_do_memory_failure()
> >>>>>
> >>>>> In such case, the user-space process will trigger SEA again. This loop
> >>>>> can potentially exceed the platform firmware threshold or even trigger a
> >>>>> kernel hard lockup, leading to a system reboot.
> >>>>>
> >>>>> Fix it by performing a force kill if no memory_failure() work is queued
> >>>>> for synchronous errors.
> >>>>>
> >>>>> Suggested-by: Xiaofei Tan
> >>>>> Signed-off-by: Shuai Xue
> >>>>>
> >>>>> ---
> >>>>>  drivers/acpi/apei/ghes.c | 10 ++++++++++
> >>>>>  1 file changed, 10 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> >>>>> index 623cc0cb4a65..b0b20ee533d9 100644
> >>>>> --- a/drivers/acpi/apei/ghes.c
> >>>>> +++ b/drivers/acpi/apei/ghes.c
> >>>>> @@ -801,6 +801,16 @@ static bool ghes_do_proc(struct ghes *ghes,
> >>>>>  		}
> >>>>>  	}
> >>>>>
> >>>>> +	/*
> >>>>> +	 * If no memory failure work is queued for abnormal synchronous
> >>>>> +	 * errors, do a force kill.
> >>>>> +	 */
> >>>>> +	if (sync && !queued) {
> >>>>> +		pr_err("Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> >>>>> +			current->comm, task_pid_nr(current));
> >>>>
> >>>> Hmm... does this need "hardware" or would "memory corruption" be
> >>>> enough?
> >>>>
> >>>> Also, does this need to say that it is sending SIGBUS when the signal
> >>>> itself tells that already?
> >>>>
> >>>> I.e. could "%s:%d has memory corruption" be enough information?
> >>>
> >>> Hi, Jarkko,
> >>>
> >>> Thank you for your suggestion. Maybe it could.
> >>>
> >>> There are some similar error info which use "hardware memory error", e.g.
> >>
> >> By tweaking my original suggestion just a bit:
> >>
> >> "%s:%d: hardware memory corruption"
> >>
> >> Can't get clearer than that, right?
> >
> > And the obvious reason is that a shorter and more consistent klog message
> > is easy to spot and grep. It is simply less convoluted.
> >
> > If you also want SIGBUS, I'd just put it as "%s:%d: hardware memory
> > corruption (SIGBUS)"
> >
> > BR, Jarkko
>
> Hi, Jarkko,
>
> I will change it to "%s:%d: hardware memory corruption (SIGBUS)".
>
> Thank you for the valuable suggestion.

Yeah, no intention to nitpick, it has a practical value when debugging
issues :-)

>
> Best Regards,
> Shuai

BR, Jarkko
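
For reference, a minimal user-space sketch of where the thread landed, not the actual kernel change. It assumes two hypothetical helpers, needs_force_kill() and format_sigbus_msg(), which do not exist in ghes.c. The first mirrors the `sync && !queued` check from the quoted hunk. The second builds the agreed message with snprintf() instead of pr_err(), taking the task name and PID as plain arguments where the kernel code would use current->comm and task_pid_nr(current).

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical helper (not in ghes.c): a force kill is needed only for a
 * synchronous error for which no memory_failure() work was queued.
 */
bool needs_force_kill(bool sync, bool queued)
{
	return sync && !queued;
}

/*
 * Hypothetical helper (not in ghes.c): formats the log message agreed on
 * in this thread. Returns the snprintf() result, i.e. the number of
 * characters the full message needs, excluding the terminating NUL.
 */
int format_sigbus_msg(char *buf, size_t len, const char *comm, int pid)
{
	return snprintf(buf, len,
			"%s:%d: hardware memory corruption (SIGBUS)",
			comm, pid);
}
```

Keeping the "comm:pid: " prefix first and the rest of the string fixed is what makes the line easy to grep for, which was the point of the shorter format.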