From: Bixuan Cui
Date: Wed, 7 Dec 2022 20:29:04 +0800
Subject: Re: [RFC 2/2] ACPI: APEI: fix reboot caused by synchronous error loop because of memory_failure() failed
To: Lv Ying, rafael@kernel.org, lenb@kernel.org, james.morse@arm.com, tony.luck@intel.com, bp@alien8.de, naoya.horiguchi@nec.com, linmiaohe@huawei.com, akpm@linux-foundation.org, xueshuai@linux.alibaba.com, ashish.kalra@amd.com
Cc: xiezhipeng1@huawei.com, wangkefeng.wang@huawei.com, xiexiuqi@huawei.com, tanxiaofei@huawei.com, linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org
In-Reply-To: <20221205115111.131568-3-lvying6@huawei.com>
References: <20221205115111.131568-1-lvying6@huawei.com> <20221205115111.131568-3-lvying6@huawei.com>

On 2022/12/5 19:51, Lv Ying wrote:
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3b6ac3694b8d..4c1c558f7161 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -2266,7 +2266,11 @@ static void __memory_failure_work_func(struct work_struct *work, bool sync)
>  			break;
>  		if (entry.flags & MF_SOFT_OFFLINE)
>  			soft_offline_page(entry.pfn, entry.flags);
> -		else if (!sync || (entry.flags & MF_ACTION_REQUIRED))
> +		else if (sync) {
> +			if ((entry.flags & MF_ACTION_REQUIRED) &&
> +			    memory_failure(entry.pfn, entry.flags))
> +				force_sig_mceerr(BUS_MCEERR_AR, 0, 0);
> +		} else
>  			memory_failure(entry.pfn, entry.flags);

Hi,

Some of the ideas in this patch are wrong :-(

1. As Shuai Xue said, it is wrong to tell synchronous errors from
   asynchronous errors by which function handles them
   (memory_failure_queue_kick()/ghes_proc()/ghes_proc_in_irq()), because
   both synchronous and asynchronous errors may be reported through the
   same notification.

2.
There is no need to pass 'sync' to __memory_failure_work_func(),
   because memory_failure() can already handle both synchronous and
   asynchronous errors based on whether MF_ACTION_REQUIRED is set in
   entry.flags:

   MF_ACTION_REQUIRED set:
       Action: poison page and kill task (synchronous error)
   MF_ACTION_REQUIRED clear:
       Action: poison page (asynchronous error)

   For reference, x86 does:

   do_machine_check                      # MCE, synchronous
     ->kill_me_maybe
       ->memory_failure(p->mce_addr >> PAGE_SHIFT, MF_ACTION_REQUIRED);

   uc_decode_notifier                    # CMCI, asynchronous
     ->memory_failure(pfn, 0)

   Also, the change here duplicates what your patch 01 already does:

   	if (sev == GHES_SEV_RECOVERABLE && sec_sev == GHES_SEV_RECOVERABLE)
   -		flags = 0;
   +		flags = sync ? MF_ACTION_REQUIRED : 0;

3. Why add 'force_sig_mceerr(BUS_MCEERR_AR, 0, 0)' after
   memory_failure(pfn, MF_ACTION_REQUIRED)?

   The task is already killed inside memory_failure():

   if the page is already poisoned:
       kill_accessing_process()->kill_proc()
   if the page is not yet poisoned:
       hwpoison_user_mappings()->collect_procs()->kill_procs()

   For reference, this is how x86 handles a synchronous error:

   kill_me_maybe()
   {
   	int flags = MF_ACTION_REQUIRED;

   	ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
   	if (!ret) {
   		...
   		return;
   	}
   	if (ret == -EHWPOISON || ret == -EOPNOTSUPP)
   		return;

   	pr_err("Memory error not recovered");
   	kill_me_now(cb);
   }

Thanks,
Bixuan Cui
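
P.S. To make points 2 and 3 concrete, here is a minimal userspace sketch of the decision logic. It is not kernel code and not a proposed patch. The names mf_dispatch(), needs_forced_kill(), struct mf_entry and the ACT_*/ERR_* constants are hypothetical and exist only for illustration; the flag and errno values are my assumption of what the kernel headers define and should be checked against the tree.

```c
#include <stdbool.h>

/* Assumed to mirror enum mf_flags in include/linux/mm.h; illustration only. */
#define MF_ACTION_REQUIRED (1 << 1)
#define MF_SOFT_OFFLINE    (1 << 3)

/* Assumed to mirror the Linux errno values; illustration only. */
#define ERR_OPNOTSUPP 95
#define ERR_HWPOISON  133

/* Hypothetical labels for what the work function ends up doing. */
enum mf_action {
	ACT_SOFT_OFFLINE,     /* soft_offline_page() path */
	ACT_POISON_AND_KILL,  /* memory_failure() with MF_ACTION_REQUIRED */
	ACT_POISON_ONLY       /* memory_failure() without MF_ACTION_REQUIRED */
};

/* Hypothetical stand-in for the queued entry (pfn plus flags). */
struct mf_entry {
	unsigned long pfn;
	int flags;
};

/* Point 2: the action follows from entry->flags alone, no 'sync' argument. */
static enum mf_action mf_dispatch(const struct mf_entry *entry)
{
	if (entry->flags & MF_SOFT_OFFLINE)
		return ACT_SOFT_OFFLINE;
	if (entry->flags & MF_ACTION_REQUIRED)
		return ACT_POISON_AND_KILL;  /* synchronous error */
	return ACT_POISON_ONLY;              /* asynchronous error */
}

/*
 * Point 3: following the shape of kill_me_maybe(), an extra kill is only
 * considered when memory_failure() failed for a reason other than
 * -EHWPOISON or -EOPNOTSUPP. 'ret' is the memory_failure() return value.
 */
static bool needs_forced_kill(int ret)
{
	if (ret == 0)
		return false;  /* recovered */
	if (ret == -ERR_HWPOISON || ret == -ERR_OPNOTSUPP)
		return false;  /* treated as handled in the x86 path */
	return true;           /* "Memory error not recovered" */
}
```

With this shape, a queued entry carrying MF_ACTION_REQUIRED takes the poison-and-kill path whichever notification delivered it, and a second signal is only considered when memory_failure() itself reports an unrecovered error.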