From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 2E906C021AA
	for <linux-mm@archiver.kernel.org>; Fri, 21 Feb 2025 06:05:44 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 556096B00A5; Fri, 21 Feb 2025 01:05:43 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 5065F6B00A6; Fri, 21 Feb 2025 01:05:43 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 3CE59280001; Fri, 21 Feb 2025 01:05:43 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13])
	by kanga.kvack.org (Postfix) with ESMTP id 08FB36B00A5
	for <linux-mm@kvack.org>; Fri, 21 Feb 2025 01:05:42 -0500 (EST)
Received: from smtpin17.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay04.hostedemail.com (Postfix) with ESMTP id 4B69F1A07E6
	for <linux-mm@kvack.org>; Fri, 21 Feb 2025 06:05:42 +0000 (UTC)
X-FDA: 83142915324.17.BDAEC49
Received: from out30-118.freemail.mail.aliyun.com (out30-118.freemail.mail.aliyun.com [115.124.30.118])
	by imf24.hostedemail.com (Postfix) with ESMTP id DFD28180007
	for <linux-mm@kvack.org>; Fri, 21 Feb 2025 06:05:37 +0000 (UTC)
Authentication-Results: imf24.hostedemail.com;
	dkim=pass header.d=linux.alibaba.com header.s=default header.b=OlmLdicA;
	dmarc=pass (policy=none) header.from=linux.alibaba.com;
	spf=pass (imf24.hostedemail.com: domain of xueshuai@linux.alibaba.com designates 115.124.30.118 as permitted sender) smtp.mailfrom=xueshuai@linux.alibaba.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1740117939; a=rsa-sha256;
	cv=none;
	b=cudGLWeo6TTxVAW2v2AVnXkMhwqUY8X1QFWk7VccpoOvs2dDRJWaaGRWjxkN9az7nUF0wn
	TSDjcaJnzL6D6Sdx870nIciB+jLqq57mMno+0QL1VCEKkbs42LgC06DanX8QHESRdUUl/V
	pgqlZsRGEXqhM0liEoqchpwcB9eBo5g=
ARC-Authentication-Results: i=1;
	imf24.hostedemail.com;
	dkim=pass header.d=linux.alibaba.com header.s=default header.b=OlmLdicA;
	dmarc=pass (policy=none) header.from=linux.alibaba.com;
	spf=pass (imf24.hostedemail.com: domain of xueshuai@linux.alibaba.com designates 115.124.30.118 as permitted sender) smtp.mailfrom=xueshuai@linux.alibaba.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1740117939;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=RIQVHMJ2EOCtMNc/t3pzTdzMfo+2yjnJUkeVVbernOw=;
	b=w7PMKZUTKjpm0/l1MFimkJLyM1H7DoYNd9WZFFR7K0hDx7IRK+DAUp3iklR9x7giDgj8HY
	HrxUKskguquYlc/IOwyDyYwyAA6FuSVZYdCbN7G0q83OiBuo6O3mQmABx6ZOqQqijjoSfz
	bJamFIK8T3AReNbrytb+cfxdD1yfakg=
DKIM-Signature:v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=linux.alibaba.com; s=default;
	t=1740117932; h=Message-ID:Date:MIME-Version:Subject:To:From:Content-Type;
	bh=RIQVHMJ2EOCtMNc/t3pzTdzMfo+2yjnJUkeVVbernOw=;
	b=OlmLdicANherrU9AnCn+sibjtWJ0ATqciuoxTmmOZqfXydgGTUxqGeuSi3ts3eIbBpGLihWYFrFZ5VOfJcq/Te95maaVjdTikcwR4tGoDEN73Bz6t9/95JP7ABmLaCk3yLaqROjcRcU3VmKseiwi/dbqkJEGoZvdqpkyz8l2zC4=
Received: from 30.246.161.128(mailfrom:xueshuai@linux.alibaba.com fp:SMTPD_---0WPvDVI8_1740117929 cluster:ay36)
          by smtp.aliyun-inc.com;
          Fri, 21 Feb 2025 14:05:31 +0800
Message-ID: <4e13bef2-7402-4f75-8f0c-4a3cc210c5a6@linux.alibaba.com>
Date: Fri, 21 Feb 2025 14:05:28 +0800
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure
 handling
To: "Luck, Tony" <tony.luck@intel.com>, Borislav Petkov <bp@alien8.de>
Cc: "nao.horiguchi@gmail.com" <nao.horiguchi@gmail.com>,
 "tglx@linutronix.de" <tglx@linutronix.de>,
 "mingo@redhat.com" <mingo@redhat.com>,
 "dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
 "x86@kernel.org" <x86@kernel.org>, "hpa@zytor.com" <hpa@zytor.com>,
 "linmiaohe@huawei.com" <linmiaohe@huawei.com>,
 "akpm@linux-foundation.org" <akpm@linux-foundation.org>,
 "peterz@infradead.org" <peterz@infradead.org>,
 "jpoimboe@kernel.org" <jpoimboe@kernel.org>,
 "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
 "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
 "linux-mm@kvack.org" <linux-mm@kvack.org>,
 "baolin.wang@linux.alibaba.com" <baolin.wang@linux.alibaba.com>,
 "tianruidong@linux.alibaba.com" <tianruidong@linux.alibaba.com>
References: <20250217063335.22257-1-xueshuai@linux.alibaba.com>
 <20250218082727.GCZ7REb7OG6NTAY-V-@fat_crate.local>
 <7393bcfb-fe94-4967-b664-f32da19ae5f9@linux.alibaba.com>
 <20250218122417.GHZ7R78fPm32jKYUlx@fat_crate.local>
 <SJ1PR11MB60836781C4CE26C4B43AFF0BFCFA2@SJ1PR11MB6083.namprd11.prod.outlook.com>
 <20250219081037.GAZ7WR_YmRtRvN_LKA@fat_crate.local>
 <SJ1PR11MB6083F7AC9C5AED072141B8CAFCC52@SJ1PR11MB6083.namprd11.prod.outlook.com>
 <20250220111903.GDZ7cPp1qVq3t9Jgs6@fat_crate.local>
 <SJ1PR11MB608335ACA7AEC51F7F6A75D2FCC42@SJ1PR11MB6083.namprd11.prod.outlook.com>
From: Shuai Xue <xueshuai@linux.alibaba.com>
In-Reply-To: <SJ1PR11MB608335ACA7AEC51F7F6A75D2FCC42@SJ1PR11MB6083.namprd11.prod.outlook.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
X-Rspamd-Server: rspam02
X-Rspamd-Queue-Id: DFD28180007
X-Stat-Signature: 7qdhskg8rhwjogqmaby574sqzykikcy8
X-Rspam-User: 
X-HE-Tag: 1740117937-938877
X-HE-Meta: U2FsdGVkX19wOmXYUF/0TTSDEJnizx3n6R6m2ggoxGoeMIVmoh/u2aHIg4bct1XdfEucmsOxLHK8htezb/0E/c7NytEWRS8wSDNB8//G6Im8Smb7szpAK+D72s+rjMkALPeNdyWRUiLLO/6u9rNJWGC2HNaq0l0DVzz2JaJX+Tfki8kH8tXu9SNdH05KmFXk3c28R8OtVGb4tFxi0//KlzPAtcze7arYRAdYXMM1ycg4Wz5G94Dd9r4cWFXL1iIeS2QQZvVBNXqndHhoj7/EfyKyYMQS0YFozYsGVI5SVyg01REvV/yp+Tb6K/I09/7i2h/A7F+EsXBlmmsXASw+S8up+3ACNmYc5VfpZXbv0O7eKjhQiLGuoLmxxc9+2BENahdrAEAi99bTbwgf1iE4vjsaW3X41t0ombRWUGtf/uri3gV7dvgoxbLuTgb6MKOIVV8zxJSLmOOe8mt1dqZls+pDHUJKi5/PAcR+rAZBa63ybhaoRzujghQlXbO+KTVpVBNmXt+8eXhZC5wrVZAPMUzVgVf/qWK1MQF+sh05EcR601ots2G88PTk45sM3Wa2THHpOVW3p+LrxNOWRMpSc4NNoHCCP36vnQNGpiA23TAUqX8PA45ZIEI3C7YT5eVMumhJNHkljZnqpUqzFDFfhoU2PFHe+ow6gEMXthDLCARKv8aCx2vyVT0GKfB/WVs6m+2wqdwKijsGYWADC1zxI71pqEpLKddxPIUz2dUIWgn8gjtmsr+KQ9TRD6ymDRKbreBOZq5/+C4TbUtRFWMWCx51eRF/Wu2o1j8/Zv82UiNQOJjVKujFqyIFNQ/IWBkeq30J+D4fKXpFBsOXdj8cKiOTGmju7ywkN+K9syjcYciCzD0Zk7gLJ4gGk2YIZHsRs9zsZJDDGy6/gYg/mYiwKzl0EclqEQMLCJNr2Ukd3fpDUbJncz9TJVnlKTaw70TET2RdBSgBZ2N9VpUOc9+
 WseSPlBm
 LWa0/BmQeMdYt9oI6kaF/APUsBlT/AKUHUCgf8BLFWCgHL0ilSbAwALZ1f2I8foIY9kvAZguzaZZsWItecFnbj9d4JC2uHspGAh6pWhD7bP+OJ6umkZOn7gxCBoFl96cs0003zTBy1Ps8uylJKMypzu4MoxyRw4CN6SgJLYYuJXcdV2cciP0fZCbDRKVl6xcAETNgLhuvnTfzZN0UhLUM5AY0JnYskb3P6FmHWVAXOrpY6d0q758VENOfRcet8sElafBg/55wyXJtqN4E9bZP+39EVA1WV4906/b58ODES5VdDifVuAg5okjsP1xqq8eWrSuLDhLrcqyUi1AJxejFUshXjyx/iv+c8gHycd1tSB1nUTRnnpv18I81AMk8gRG5J9ooZJVnDAB8yuA=
X-Bogosity: Ham, tests=bogofilter, spamicity=0.037145, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>


On 2025/2/21 01:50, Luck, Tony wrote:
>>> We could, but I don't like it much. By taking the page offline from the relatively
>>> kind environment of a regular interrupt, we often avoid taking a machine check
>>> (which is an unfriendly environment for software).
>>
>> Right.
>>
>>> We could make the action in uc_decode_notifier() configurable. Default=off
>>> but with a command line option to enable for systems that are stuck with
>>> broadcast machine checks.
>>
>> So we can figure that out during boot - no need for yet another cmdline
>> option.
> 
> Yup. I think the boot time test might be something like:
> 
> 	// Enable UCNA offline for systems with broadcast machine check
> 	if (!(AMD || LMCE))
> 		mce_register_decode_chain(&mce_uc_nb);
>>
>> It still doesn't fix the race and I'd like to fix that instead, in the optimal
>> case.
>>
>> But looking at Shuai's patch, I guess fixing the reporting is fine too - we
>> need to fix the commit message to explain why this thing even happens.
>>
>> I.e., basically what you wrote and Shuai could use that explanation to write
>> a commit message explaining what the situation is along with the background so
>> that when we go back to this later, we will actually know what is going on.
> 
> Agreed. Shaui needs to harvest this thread to fill out the details in the commit
> messages.

Sure, I will add more background details based on Tony's explanation.

> 
>>
>> But looking at
>>
>>    046545a661af ("mm/hwpoison: fix error page recovered but reported "not recovered"")
>>
>> That thing was trying to fix the same reporting fail. Why didn't it do that?
>>
>> Ooooh, now I see what the issue is. He doesn't want to kill the process which
>> gets the wrong SIGBUS. Maybe the commit title should've said that:
>>
>>    mm/hwpoison: Do not send SIGBUS to processes with recovered clean pages
>>
>> or so.
>>
>> But how/why is that ok?
>>
>> Are we confident that
>>
>> +        * ret = 0 when poison page is a clean page and it's dropped, no
>> +        * SIGBUS is needed.
>>
>> can *always* and *only* happen when there's a CMCI *and* a #MC race and the
>> CMCI has won the race?
> 
> There are probably other races. Two CPUs both take local #MC on the same page
> (maybe not all that rare in threaded processes ... or even with some hot code in
> a shared library).
> 
>> Can memory poison return 0 there too, for another reason and we end up *not
>> killing* a process which we should have?
>>
>> Hmmm.
> 
> Hmmm indeed. Needs some thought. Though failing to kill a process likely means
> it retries the access and comes right back to try again (without the race this time).
> 

Hmm, if two threads consume the same poisoned data, three CPUs may race:
two of them take a local #MC on the same page and one takes a CMCI. For
example:

#perf script
kworker/48:1-mm 25516 [048]  1713.893549: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa25aa93 uc_decode_notifier+0x73 ([kernel.kallsyms])
         ffffffffaa3068bb notifier_call_chain+0x5b ([kernel.kallsyms])
         ffffffffaa306ae1 blocking_notifier_call_chain+0x41 ([kernel.kallsyms])
         ffffffffaa25bbfe mce_gen_pool_process+0x3e ([kernel.kallsyms])
         ffffffffaa2f455f process_one_work+0x19f ([kernel.kallsyms])
         ffffffffaa2f509c worker_thread+0x20c ([kernel.kallsyms])
         ffffffffaa2fec89 kthread+0xd9 ([kernel.kallsyms])
         ffffffffaa245131 ret_from_fork+0x31 ([kernel.kallsyms])
         ffffffffaa2076ca ret_from_fork_asm+0x1a ([kernel.kallsyms])

einj_mem_uc 44530 [184]  1713.908089: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

einj_mem_uc 44531 [089]  1713.916319: probe:memory_failure: (ffffffffaa622db4)
         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)

It seems to complicate the issue further.

IMHO, we should focus on three main points:

- kill_accessing_process() is only called when the flags are set to
   MF_ACTION_REQUIRED, which means it is in the MCE path.
- Whether the page is clean determines the behavior of try_to_unmap. For a
   dirty page, try_to_unmap uses TTU_HWPOISON to unmap the PTE and convert the
   PTE entry to a swap entry. For a clean page, try_to_unmap uses ~TTU_HWPOISON
   and simply unmaps the PTE.
- When does walk_page_range() with hwpoison_walk_ops return 1?
   1. If the poison page still exists, we should of course kill the current
      process.
   2. If the poison page does not exist, but is_hwpoison_entry is true, meaning
      it was a dirty page, we should also kill the current process.
   3. Otherwise, it returns 0, which means the page is clean.
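
To make the three cases concrete, here is a minimal user-space sketch in C.
It is only a model of the reasoning above, not kernel code: enum pte_state,
model_unmap(), model_walk_result() and model_needs_sigbus() are hypothetical
names invented for illustration. The real decisions are made by
try_to_unmap() and by walk_page_range() with hwpoison_walk_ops.

```c
#include <stdbool.h>

/*
 * Hypothetical model of what the page walk can find at the faulting
 * address. These states are an assumption made for illustration only.
 */
enum pte_state {
	PTE_MAPS_POISON_PAGE,	/* case 1: the poison page is still mapped */
	PTE_HWPOISON_ENTRY,	/* case 2: dirty page, PTE converted to a swap entry */
	PTE_UNMAPPED,		/* case 3: clean page, PTE simply unmapped */
};

/*
 * Models the effect of unmapping described above: a dirty page is unmapped
 * with TTU_HWPOISON and leaves a hwpoison entry behind, a clean page is
 * unmapped without TTU_HWPOISON and leaves nothing behind.
 */
static enum pte_state model_unmap(bool page_dirty)
{
	return page_dirty ? PTE_HWPOISON_ENTRY : PTE_UNMAPPED;
}

/* Models the walk result: 1 if poison is found at the address, else 0. */
static int model_walk_result(enum pte_state state)
{
	switch (state) {
	case PTE_MAPS_POISON_PAGE:
	case PTE_HWPOISON_ENTRY:
		return 1;
	default:
		return 0;
	}
}

/* SIGBUS is needed only when the walk reports poison (cases 1 and 2). */
static bool model_needs_sigbus(enum pte_state state)
{
	return model_walk_result(state) == 1;
}
```

In this model, model_needs_sigbus(model_unmap(false)) is false, which is the
clean-page case where no SIGBUS is needed, while a dirty page or a page that
is still mapped yields true.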


Thanks.
Shuai