From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 812F1C77B6F for ; Tue, 11 Apr 2023 09:48:23 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id C4B7F900002; Tue, 11 Apr 2023 05:48:22 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id BACB06B007E; Tue, 11 Apr 2023 05:48:22 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A4D45900002; Tue, 11 Apr 2023 05:48:22 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com [216.40.44.13]) by kanga.kvack.org (Postfix) with ESMTP id 8E6256B0071 for ; Tue, 11 Apr 2023 05:48:22 -0400 (EDT) Received: from smtpin15.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id 44D81A0A69 for ; Tue, 11 Apr 2023 09:48:22 +0000 (UTC) X-FDA: 80668634844.15.AE7CAFF Received: from out30-111.freemail.mail.aliyun.com (out30-111.freemail.mail.aliyun.com [115.124.30.111]) by imf17.hostedemail.com (Postfix) with ESMTP id 849484000B for ; Tue, 11 Apr 2023 09:48:17 +0000 (UTC) Authentication-Results: imf17.hostedemail.com; dkim=none; spf=pass (imf17.hostedemail.com: domain of xueshuai@linux.alibaba.com designates 115.124.30.111 as permitted sender) smtp.mailfrom=xueshuai@linux.alibaba.com; dmarc=pass (policy=none) header.from=alibaba.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1681206499; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=8Sq99xvKBXzuU6yAfru03DUTP1l9slg788ldGfTk0p4=; b=O9Hxwy0qAjJ4oXDtt1+Oeo1RhGZiAoyEtSPgpORs8+EB8V/AC6a8dm5ixabgYuttShq4gf CuMUaiZ7XrwmZgr0RyySK1CbIWg2pj3YfRBxyNgcuzbKaQs3GwTmHgi2KHPNWRTQuzAKmq GLEJFtb0vKw/xKXu5rL4Wgx25/fUKfE= ARC-Authentication-Results: i=1; imf17.hostedemail.com; dkim=none; spf=pass (imf17.hostedemail.com: domain of xueshuai@linux.alibaba.com designates 115.124.30.111 as permitted sender) smtp.mailfrom=xueshuai@linux.alibaba.com; dmarc=pass (policy=none) header.from=alibaba.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1681206499; a=rsa-sha256; cv=none; b=lb71U3w/K5m2Hc9mCq5Uh+4q5N+IdBiylIRIKJFywbsckRAJbIBL8UsKrjvbHXCdjrK4sW dH01XHdbMm5Qdv7sjAGk+Dqes8vOZGNKbWYzqwJZqIfv6ve+bEZQFetIxZdb06r7VHPUfC cwDQ5Ev+m6NGX4otbwnatHdNvNHr90Q= X-Alimail-AntiSpam:AC=PASS;BC=-1|-1;BR=01201311R211e4;CH=green;DM=||false|;DS=||;FP=0|-1|-1|-1|0|-1|-1|-1;HT=ay29a033018045176;MF=xueshuai@linux.alibaba.com;NM=1;PH=DS;RN=23;SR=0;TI=SMTPD_---0Vfrny5N_1681206490; Received: from 30.240.113.3(mailfrom:xueshuai@linux.alibaba.com fp:SMTPD_---0Vfrny5N_1681206490) by smtp.aliyun-inc.com; Tue, 11 Apr 2023 17:48:12 +0800 Message-ID: <422ceece-64f6-46ae-22f6-71f1b32a2b10@linux.alibaba.com> Date: Tue, 11 Apr 2023 17:48:07 +0800 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:102.0) Gecko/20100101 Thunderbird/102.9.0 Subject: Re: [PATCH v4 2/2] ACPI: APEI: handle synchronous exceptions in task work Content-Language: en-US To: Xiaofei Tan , mawupeng1@huawei.com, tony.luck@intel.com, naoya.horiguchi@nec.com Cc: linux-acpi@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, justin.he@arm.com, akpm@linux-foundation.org, ardb@kernel.org, ashish.kalra@amd.com, baolin.wang@linux.alibaba.com, bp@alien8.de, cuibixuan@linux.alibaba.com, dave.hansen@linux.intel.com, james.morse@arm.com, jarkko@kernel.org, lenb@kernel.org, linmiaohe@huawei.com, lvying6@huawei.com, rafael@kernel.org, xiexiuqi@huawei.com, zhuo.song@linux.alibaba.com References: <20221027042445.60108-1-xueshuai@linux.alibaba.com> <20230408091359.31554-3-xueshuai@linux.alibaba.com> <10b26d51-5f9d-25bc-0d15-6d78a08c43cb@huawei.com> <34492b33-ce55-6aeb-4cf2-3cae7bd5faba@linux.alibaba.com> <10a10bc8-bf08-43b4-7f31-f017a9144534@huawei.com> From: Shuai Xue In-Reply-To: <10a10bc8-bf08-43b4-7f31-f017a9144534@huawei.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Rspam-User: X-Rspamd-Server: rspam03 X-Stat-Signature: xwb77pq563kup9btssjp7mp7xpz7xko9 X-Rspamd-Queue-Id: 849484000B X-HE-Tag: 1681206497-749374 X-HE-Meta: U2FsdGVkX1/BPYiMY1UtQuyN4h3r3bUw+W7mNbqfyz/czGiI1BETSNZvqdgst8LqWypYRppRn2CAvNsT/4Bz59/LE5Nz7B3zxNwtnWU35NcqWiUNMm8WutmjnKO+z92pYsuGS0TTyt5BAfziOKrtObsnrknTrJK5625wofLHhOkdMoSjznBZrJ6hskcjc64l0RnMCTS2A+rnTwy8Fi726OIDlIDvx5ebC7BnGb8uMl3nsgOYhJ3GuxPw53aOKmBLe0CPSIJj6UiKdtWoNMGOc8KQqXu4wKBDHSnSydns60Vj/9ZXB/c0BZzhH1SSSqaHXeVBWJ2/0VItVNmRgML4uTAtiKbXaO0UKEIkSLoRv19vOzrN1VOhskjkGNmpXOgpsoLvVRHdWvL6Jvq2HdVyCshx9zSFvwTzSOJM4P07Sa67eVLKr3b6jQs5Brox7GCAjQ7E0t5BXYk62gqza9nAXzH0MD3+n35zoJRlh1Ox854GI7WN/iOMCRl9EIhsR/ZoBJwYSPoSEwbk7qTGi6RB+DVOpeoCzVQxekc9m6ITpkce73A7NOG2Mh+H5euEUEEwpFABE6QZCWHUw1ESbVh+cAHEjVz0i7nvnkI4fMY/kMfukRD8jjbx4Um0lBhhMIORi8VMTK6I0V122zPKmFZLhvrE3vq8PNm7VcT2XD3xpNWmVObYmS44u10FgaEbGI+Dsd0GetbcxUtAxqullB6pHJnp/kUrrwanCfYLZN9OK2LpQ0ONQ6RiWJHE23ZoQcJxLIDhgxE8lI2T06hKW+ff016YQCk0kPBJtMoG+LBUwwWbITZYmLG6Ovbdysi8f6KABdKG/bjva0VKIxa/vcE7TWgq3DmVEyv2bD+vFVKj2OUTnEIuxXkAX1wPS1yFkwpb0m7LDHwTxcaYgipZ5i5VN1ktW0LX8iwmv9nhhK4WHpgeuY1f/MMUEQYHt9OwXnPpfP37Zg8zw6sEDDL9nyM kz/rR2K7 SKZH0qPg5AbtIYGGDuKjUcahTbsvrumRqt8P3Lm+/gjQkGtEFcJytTYnYwlsnhf22TRvDIIPyN2lUCqCo4Qw0vZDY5x+4/CZ0Wx41Cxeh3W6bsRzDVvNNySmzN4XwU8qx58d574Zi5tjZG/p+lytJGVO1Ow== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 2023/4/11 PM5:02, Xiaofei Tan wrote: > > 在 2023/4/11 11:16, Shuai Xue 写道: >> >> On 2023/4/11 AM9:44, Xiaofei Tan wrote: >>> Hi Shuai, >>> >>> 在 2023/4/8 17:13, Shuai Xue 写道: >>>> Hardware errors could be signaled by synchronous interrupt, e.g.  when an >>>> error is detected by a background scrubber, or signaled by synchronous >>>> exception, e.g. when an uncorrected error is consumed. Both synchronous and >>>> asynchronous error are queued and handled by a dedicated kthread in >>>> workqueue. >>>> >>>> commit 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for >>>> synchronous errors") keep track of whether memory_failure() work was >>>> queued, and make task_work pending to flush out the workqueue so that the >>>> work for synchronous error is processed before returning to user-space. >>>> The trick ensures that the corrupted page is unmapped and poisoned. And >>>> after returning to user-space, the task starts at current instruction which >>>> triggering a page fault in which kernel will send SIGBUS to current process >>>> due to VM_FAULT_HWPOISON. >>>> >>>> However, the memory failure recovery for hwpoison-aware mechanisms does not >>>> work as expected. For example, hwpoison-aware user-space processes like >>>> QEMU register their customized SIGBUS handler and enable early kill mode by >>>> seting PF_MCE_EARLY at initialization. Then the kernel will directy notify >>>> the process by sending a SIGBUS signal in memory failure with wrong >>>> si_code: the actual user-space process accessing the corrupt memory >>>> location, but its memory failure work is handled in a kthread context, so >>>> it will send SIGBUS with BUS_MCEERR_AO si_code to the actual user-space >>>> process instead of BUS_MCEERR_AR in kill_proc(). >>>> >>>> To this end, separate synchronous and asynchronous error handling into >>>> different paths like X86 platform does: >>>> >>>> - valid synchronous errors: queue a task_work to synchronously send SIGBUS >>>>     before ret_to_user. >>>> - valid asynchronous errors: queue a work into workqueue to asynchronously >>>>     handle memory failure. >>>> - abnormal branches such as invalid PA, unexpected severity, no memory >>>>     failure config support, invalid GUID section, OOM, etc. >>>> >>>> Then for valid synchronous errors, the current context in memory failure is >>>> exactly belongs to the task consuming poison data and it will send SIBBUS >>>> with proper si_code. >>>> >>>> Fixes: 7f17b4a121d0 ("ACPI: APEI: Kick the memory_failure() queue for synchronous errors") >>>> Signed-off-by: Shuai Xue >>>> Tested-by: Ma Wupeng >>>> --- >>>>    drivers/acpi/apei/ghes.c | 91 +++++++++++++++++++++++++++------------- >>>>    include/acpi/ghes.h      |  3 -- >>>>    mm/memory-failure.c      | 13 ------ >>>>    3 files changed, 61 insertions(+), 46 deletions(-) >>>> >>>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c >>>> index c479b85899f5..df5574264d1b 100644 >>>> --- a/drivers/acpi/apei/ghes.c >>>> +++ b/drivers/acpi/apei/ghes.c >>>> @@ -452,28 +452,51 @@ static void ghes_clear_estatus(struct ghes *ghes, >>>>    } >>>>      /* >>>> - * Called as task_work before returning to user-space. >>>> - * Ensure any queued work has been done before we return to the context that >>>> - * triggered the notification. >>>> + * struct sync_task_work - for synchronous RAS event >>>> + * >>>> + * @twork:                callback_head for task work >>>> + * @pfn:                  page frame number of corrupted page >>>> + * @flags:                fine tune action taken >>>> + * >>>> + * Structure to pass task work to be handled before >>>> + * ret_to_user via task_work_add(). >>>>     */ >>>> -static void ghes_kick_task_work(struct callback_head *head) >>>> +struct sync_task_work { >>>> +    struct callback_head twork; >>>> +    u64 pfn; >>>> +    int flags; >>>> +}; >>>> + >>>> +static void memory_failure_cb(struct callback_head *twork) >>>>    { >>>> -    struct acpi_hest_generic_status *estatus; >>>> -    struct ghes_estatus_node *estatus_node; >>>> -    u32 node_len; >>>> +    int ret; >>>> +    struct sync_task_work *twcb = >>>> +        container_of(twork, struct sync_task_work, twork); >>>>    -    estatus_node = container_of(head, struct ghes_estatus_node, task_work); >>>> -    if (IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE)) >>>> -        memory_failure_queue_kick(estatus_node->task_work_cpu); >>>> +    ret = memory_failure(twcb->pfn, twcb->flags); >>>> +    kfree(twcb); >>>>    -    estatus = GHES_ESTATUS_FROM_NODE(estatus_node); >>>> -    node_len = GHES_ESTATUS_NODE_LEN(cper_estatus_len(estatus)); >>>> -    gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, node_len); >>>> +    if (!ret) >>>> +        return; >>>> + >>>> +    /* >>>> +     * -EHWPOISON from memory_failure() means that it already sent SIGBUS >>>> +     * to the current process with the proper error info, >>>> +     * -EOPNOTSUPP means hwpoison_filter() filtered the error event, >>>> +     * >>>> +     * In both cases, no further processing is required. >>>> +     */ >>>> +    if (ret == -EHWPOISON || ret == -EOPNOTSUPP) >>>> +        return; >>>> + >>>> +    pr_err("Memory error not recovered"); >>>> +    force_sig(SIGBUS); >>>>    } >>>>      static bool ghes_do_memory_failure(u64 physical_addr, int flags) >>>>    { >>>>        unsigned long pfn; >>>> +    struct sync_task_work *twcb; >>>>          if (!IS_ENABLED(CONFIG_ACPI_APEI_MEMORY_FAILURE)) >>>>            return false; >>>> @@ -486,6 +509,18 @@ static bool ghes_do_memory_failure(u64 physical_addr, int flags) >>>>            return false; >>>>        } >>>>    +    if (flags == MF_ACTION_REQUIRED && current->mm) { >>>> +        twcb = kmalloc(sizeof(*twcb), GFP_ATOMIC); >>>> +        if (!twcb) >>>> +            return false; >>>> + >>>> +        twcb->pfn = pfn; >>>> +        twcb->flags = flags; >>>> +        init_task_work(&twcb->twork, memory_failure_cb); >>>> +        task_work_add(current, &twcb->twork, TWA_RESUME); >>>> +        return true; >>>> +    } >>>> + >>>>        memory_failure_queue(pfn, flags); >>>>        return true; >>>>    } >>>> @@ -1000,9 +1035,8 @@ static void ghes_proc_in_irq(struct irq_work *irq_work) >>>>        struct ghes_estatus_node *estatus_node; >>>>        struct acpi_hest_generic *generic; >>>>        struct acpi_hest_generic_status *estatus; >>>> -    bool task_work_pending; >>>> +    bool queued; >>>>        u32 len, node_len; >>>> -    int ret; >>>>          llnode = llist_del_all(&ghes_estatus_llist); >>>>        /* >>>> @@ -1017,25 +1051,23 @@ static void ghes_proc_in_irq(struct irq_work *irq_work) >>>>            estatus = GHES_ESTATUS_FROM_NODE(estatus_node); >>>>            len = cper_estatus_len(estatus); >>>>            node_len = GHES_ESTATUS_NODE_LEN(len); >>>> -        task_work_pending = ghes_do_proc(estatus_node->ghes, estatus); >>>> + >>>> +        queued = ghes_do_proc(estatus_node->ghes, estatus); >>>> +        /* >>>> +         * No memory failure work is queued into work queue or task queue >>>> +         * due to invalid PA, unexpected severity, OOM, etc, do a force >>>> +         * kill. >>>> +         */ >>>> +        if (!queued && current->mm) >>>> +            force_sig(SIGBUS); >>> The SIGBUS needs to be sent to the current only for synchronous exceptions. The judgment of this if statement does not guarantee this. >>> Because the function ghes_proc_in_irq() is used for NMI, but NMI not only used for synchronous exception. One user SEA is synchronous >>> exception, and some other users, such as SDEI, may be not synchronous exception. >> Yes, you are right. I was going to handle abnormal cases for sync error >> and async error. But SIGBUS sent to the current task for an asynchronous >> error is totally wrong. > > yes > >> Is it safe to keep running when an asynchronous >> error is not handled? > > I think so. Corrupt data should not be consumed silently. It should be guaranteed by Chip platorm. > If platform can't support this, it will still not be 100% safe even we panic the system here, once received > uncorrected memory error section. > > >> >> And should we add some warning message in abnormal cases? >> e.g pr_warn_ratelimited on invalid PA? > > Do you mean here ? it is not needed, as ghes_print_estatus() has included this info. > >     if (!(mem_err->validation_bits & CPER_MEM_VALID_PA)) >         return false; > >>> You could transfer the sync flag out from ghes_do_proc() and judge it here, or change meaning of the ghes_do_proc() return value >>> as if recovered. >> I think we could get sync flag by estatus_node, e.g: >> >>     bool sync = is_hest_sync_notify(estatus_node->ghes); > > It's ok for me. > >> >> Then the condition in if statement should be: >> >>     if (sync && !queued) >> >> I drop out current->mm from if statement. For sync errors, the current >> is guaranteed to be in user task, kernel task for sync error will panic >> in do_sea(), the caller of ghes_proc_in_irq(). For async errors, SIGBUS >> to current is meaningless. > > OK. It is correct for ARM SEA, if want to support more sync notify type, > should consider in the future. Thanks for confirmations. I will send a new version later. Best Regards, Shuai > >> >> Thank you. >> >> Best Regards, >> Shuai >> >>> >>>> + >>>>            if (!ghes_estatus_cached(estatus)) { >>>>                generic = estatus_node->generic; >>>>                if (ghes_print_estatus(NULL, generic, estatus)) >>>>                    ghes_estatus_cache_add(generic, estatus); >>>>            } >>>> - >>>> -        if (task_work_pending && current->mm) { >>>> -            estatus_node->task_work.func = ghes_kick_task_work; >>>> -            estatus_node->task_work_cpu = smp_processor_id(); >>>> -            ret = task_work_add(current, &estatus_node->task_work, >>>> -                        TWA_RESUME); >>>> -            if (ret) >>>> -                estatus_node->task_work.func = NULL; >>>> -        } >>>> - >>>> -        if (!estatus_node->task_work.func) >>>> -            gen_pool_free(ghes_estatus_pool, >>>> -                      (unsigned long)estatus_node, node_len); >>>> +        gen_pool_free(ghes_estatus_pool, (unsigned long)estatus_node, >>>> +                  node_len); >>>>              llnode = next; >>>>        } >>>> @@ -1096,7 +1128,6 @@ static int ghes_in_nmi_queue_one_entry(struct ghes *ghes, >>>>          estatus_node->ghes = ghes; >>>>        estatus_node->generic = ghes->generic; >>>> -    estatus_node->task_work.func = NULL; >>>>        estatus = GHES_ESTATUS_FROM_NODE(estatus_node); >>>>          if (__ghes_read_estatus(estatus, buf_paddr, fixmap_idx, len)) { >>>> diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h >>>> index 3c8bba9f1114..e5e0c308d27f 100644 >>>> --- a/include/acpi/ghes.h >>>> +++ b/include/acpi/ghes.h >>>> @@ -35,9 +35,6 @@ struct ghes_estatus_node { >>>>        struct llist_node llnode; >>>>        struct acpi_hest_generic *generic; >>>>        struct ghes *ghes; >>>> - >>>> -    int task_work_cpu; >>>> -    struct callback_head task_work; >>>>    }; >>>>      struct ghes_estatus_cache { >>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c >>>> index fae9baf3be16..6ea8c325acb3 100644 >>>> --- a/mm/memory-failure.c >>>> +++ b/mm/memory-failure.c >>>> @@ -2355,19 +2355,6 @@ static void memory_failure_work_func(struct work_struct *work) >>>>        } >>>>    } >>>>    -/* >>>> - * Process memory_failure work queued on the specified CPU. >>>> - * Used to avoid return-to-userspace racing with the memory_failure workqueue. >>>> - */ >>>> -void memory_failure_queue_kick(int cpu) >>>> -{ >>>> -    struct memory_failure_cpu *mf_cpu; >>>> - >>>> -    mf_cpu = &per_cpu(memory_failure_cpu, cpu); >>>> -    cancel_work_sync(&mf_cpu->work); >>>> -    memory_failure_work_func(&mf_cpu->work); >>>> -} >>>> - >>>>    static int __init memory_failure_init(void) >>>>    { >>>>        struct memory_failure_cpu *mf_cpu; >> .