Date: Thu, 20 Feb 2025 12:19:03 +0100
From: Borislav Petkov
To: "Luck, Tony"
Cc: Shuai Xue, "nao.horiguchi@gmail.com", "tglx@linutronix.de",
 "mingo@redhat.com", "dave.hansen@linux.intel.com", "x86@kernel.org",
 "hpa@zytor.com", "linmiaohe@huawei.com", "akpm@linux-foundation.org",
 "peterz@infradead.org", "jpoimboe@kernel.org",
 "linux-edac@vger.kernel.org", "linux-kernel@vger.kernel.org",
 "linux-mm@kvack.org", "baolin.wang@linux.alibaba.com",
 "tianruidong@linux.alibaba.com"
Subject: Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
Message-ID: <20250220111903.GDZ7cPp1qVq3t9Jgs6@fat_crate.local>
References: <20250217063335.22257-1-xueshuai@linux.alibaba.com>
 <20250218082727.GCZ7REb7OG6NTAY-V-@fat_crate.local>
 <7393bcfb-fe94-4967-b664-f32da19ae5f9@linux.alibaba.com>
 <20250218122417.GHZ7R78fPm32jKYUlx@fat_crate.local>
 <20250219081037.GAZ7WR_YmRtRvN_LKA@fat_crate.local>

On Wed, Feb 19, 2025 at 05:11:00PM +0000, Luck, Tony wrote:
> We could, but I don't like it much. By taking the page offline from the relatively
> kind environment of a regular interrupt, we often avoid taking a machine check
> (which is an unfriendly environment for software).

Right.

> We could make the action in uc_decode_notifier() configurable. Default=off
> but with a command line option to enable for systems that are stuck with
> broadcast machine checks.

So we can figure that out during boot - no need for yet another cmdline
option.

It still doesn't fix the race and I'd like to fix that instead, in the optimal
case.

But looking at Shuai's patch, I guess fixing the reporting is fine too - we
need to fix the commit message to explain why this thing even happens.

I.e., basically what you wrote and Shuai could use that explanation to write
a commit message explaining what the situation is along with the background so
that when we go back to this later, we will actually know what is going on.

But looking at

  046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")

That thing was trying to fix the same reporting fail. Why didn't it do that?

Ooooh, now I see what the issue is. He doesn't want to kill the process which
gets the wrong SIGBUS. Maybe the commit title should've said that:

  mm/hwpoison: Do not send SIGBUS to processes with recovered clean pages

or so.

But how/why is that ok?

Are we confident that

+	 * ret = 0 when poison page is a clean page and it's dropped, no
+	 * SIGBUS is needed.

can *always* and *only* happen when there's a CMCI *and* a #MC race and the
CMCI has won the race?

Can memory poison return 0 there too, for another reason and we end up *not
killing* a process which we should have?

Hmmm.

> On Intel that would mean not registering the notifier at all. What about AMD?
> Do you have similar races for MCE_DEFERRED_SEVERITY errors?

Probably. Lemme ask around.

> [1] Some OEMs still do not enable LMCE in their BIOS.

Oh, ofc. Gotta love BIOS. They'll get the message when LMCE becomes obsolete,
trust me. Are we force-enabling LMCE in this case when booting?

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette