From: Shuai Xue
Subject: Re: [PATCH v2 0/5] mm/hwpoison: Fix regressions in memory failure handling
Date: Tue, 25 Feb 2025 09:51:25 +0800
Message-ID: <6f34c17c-4113-46d9-aa66-53ff5a1feed5@linux.alibaba.com>
In-Reply-To: <20250224220146.GBZ7zsSnXLftyqWzW_@fat_crate.local>
To: Borislav Petkov
Cc: "Luck, Tony", "nao.horiguchi@gmail.com", "tglx@linutronix.de",
    "mingo@redhat.com", "dave.hansen@linux.intel.com", "x86@kernel.org",
    "hpa@zytor.com", "linmiaohe@huawei.com", "akpm@linux-foundation.org",
    "peterz@infradead.org", "jpoimboe@kernel.org",
    "linux-edac@vger.kernel.org", "linux-kernel@vger.kernel.org",
    "linux-mm@kvack.org", "baolin.wang@linux.alibaba.com",
    "tianruidong@linux.alibaba.com"

On 2025/2/25 06:01, Borislav Petkov wrote:
> On Fri, Feb 21, 2025 at 02:05:28PM +0800, Shuai Xue wrote:
>> #perf script
>> kworker/48:1-mm 25516 [048]  1713.893549: probe:memory_failure: (ffffffffaa622db4)
>>         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
>>         ffffffffaa25aa93 uc_decode_notifier+0x73 ([kernel.kallsyms])
>>         ffffffffaa3068bb notifier_call_chain+0x5b ([kernel.kallsyms])
>>         ffffffffaa306ae1 blocking_notifier_call_chain+0x41 ([kernel.kallsyms])
>>         ffffffffaa25bbfe mce_gen_pool_process+0x3e ([kernel.kallsyms])
>>         ffffffffaa2f455f process_one_work+0x19f ([kernel.kallsyms])
>>         ffffffffaa2f509c worker_thread+0x20c ([kernel.kallsyms])
>>         ffffffffaa2fec89 kthread+0xd9 ([kernel.kallsyms])
>>         ffffffffaa245131 ret_from_fork+0x31 ([kernel.kallsyms])
>>         ffffffffaa2076ca ret_from_fork_asm+0x1a ([kernel.kallsyms])
>>
>> einj_mem_uc 44530 [184]  1713.908089: probe:memory_failure: (ffffffffaa622db4)
>>         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
>>         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
>>         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
>>         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
>>         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
>>         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
>>                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)
>>
>> einj_mem_uc 44531 [089]  1713.916319: probe:memory_failure: (ffffffffaa622db4)
>>         ffffffffaa622db5 memory_failure+0x5 ([kernel.kallsyms])
>>         ffffffffaa2594fb kill_me_maybe+0x5b ([kernel.kallsyms])
>>         ffffffffaa2fac29 task_work_run+0x59 ([kernel.kallsyms])
>>         ffffffffaaf52347 irqentry_exit_to_user_mode+0x1c7 ([kernel.kallsyms])
>>         ffffffffaaf50bce noist_exc_machine_check+0x3e ([kernel.kallsyms])
>>         ffffffffaa001303 asm_exc_machine_check+0x33 ([kernel.kallsyms])
>>                   405046 thread+0xe (/home/shawn.xs/ras-tools/einj_mem_uc)
>
> What are those stack traces supposed to say?
>
> Two processes are injecting, cause a #MC and a kworker gets to handle the UC?
>
> All injecting to the same page?

Yes. I inject poison into a single page and create two threads with
pthread_create(), both of which then access the same poisoned page.

>
> What's the upper limit on CPUs seeing the same hw error and all raising
> a CMCI/#MC?

It depends on how many forked processes or threads try to read the
poisoned page.

>
>> - kill_accessing_process() is only called when the flags are set to
>>   MF_ACTION_REQUIRED, which means it is in the MCE path.
>> - Whether the page is clean determines the behavior of try_to_unmap. For a
>>   dirty page, try_to_unmap uses TTU_HWPOISON to unmap the PTE and convert the
>>   PTE entry to a swap entry. For a clean page, try_to_unmap uses ~TTU_HWPOISON
>>   and simply unmaps the PTE.
>> - When does walk_page_range() with hwpoison_walk_ops return 1?
>>   1. If the poison page still exists, we should of course kill the current
>>      process.
>>   2. If the poison page does not exist, but is_hwpoison_entry is true, meaning
>>      it is a dirty page, we should also kill the current process, too.
>>   3. Otherwise, it returns 0, which means the page is clean.
>
> I think you're too deep into detail. What I'd do is step back, think what
> would be the *proper* recovery action and then make sure memory_failure does
> that. If it doesn't - fix it to do so.
>
> So, what should really happen wrt recovery action if any number of CPUs see
> the same memory error?
>

IMHO, we should send a SIGBUS signal to the processes running on the CPUs
that detect a memory error on a dirty page, which is the current behavior
of memory_failure().

Thanks.
Shuai
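
To make the receiving side of that SIGBUS concrete, here is a minimal
user-space sketch. It is only an illustration and is not taken from
einj_mem_uc or from the patch series. The helper names sigbus_kind(),
on_sigbus() and install_sigbus_handler() are hypothetical. BUS_MCEERR_AR
and BUS_MCEERR_AO are the si_code values documented in sigaction(2) for
hardware memory errors; the fallback numeric definitions are an assumption
for libcs that do not expose them.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* si_code values for hardware memory errors; normally provided by libc. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5
#endif

/* Hypothetical helper: describe a SIGBUS by its si_code. */
static const char *sigbus_kind(int si_code)
{
	switch (si_code) {
	case BUS_MCEERR_AR:
		/* Error was consumed by this task: action required. */
		return "action required";
	case BUS_MCEERR_AO:
		/* Error was detected but not consumed: action optional. */
		return "action optional";
	default:
		return "not a memory error";
	}
}

/* Last report seen by the handler (illustrative globals). */
static volatile sig_atomic_t last_si_code;
static void *volatile last_addr;

static void on_sigbus(int sig, siginfo_t *info, void *ucontext)
{
	(void)sig;
	(void)ucontext;

	/* Record what was reported and where. */
	last_si_code = info->si_code;
	last_addr = info->si_addr;

	/*
	 * Returning after a consumed error would re-run the faulting
	 * access, so a real program would typically siglongjmp() away
	 * or exit here. That part is left out of this sketch.
	 */
}

/* Install the handler with SA_SIGINFO so si_code and si_addr are passed. */
static int install_sigbus_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_sigbus;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);

	return sigaction(SIGBUS, &sa, NULL);
}
```

A test program would call install_sigbus_handler() before touching the
injected page and then inspect last_si_code and last_addr, for example by
printing sigbus_kind(last_si_code) once it is back in normal context.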