From: Andrew Morton <akpm@linux-foundation.org>
To: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Cc: naoya.horiguchi@nec.com, linmiaohe@huawei.com,
tony.luck@intel.com, ying.huang@intel.com, fengwei.yin@intel.com,
linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 2/2] mm: memory-failure: Re-split hw-poisoned huge page on -EAGAIN
Date: Fri, 22 Dec 2023 11:42:33 -0800
Message-ID: <20231222114233.68a4fcf2428ae50da6b249f4@linux-foundation.org>
In-Reply-To: <20231222062706.5221-2-qiuxu.zhuo@intel.com>

On Fri, 22 Dec 2023 14:27:06 +0800 Qiuxu Zhuo <qiuxu.zhuo@intel.com> wrote:

> During the process of splitting a hw-poisoned huge page, it is possible
> for the reference count of the huge page to be increased by the threads
> within the affected process, leading to a failure in splitting the
> hw-poisoned huge page with an error code of -EAGAIN.
>
> This issue can be reproduced when doing memory error injection to a
> multiple-thread process, and the error occurs within a huge page.
> The call path with the returned -EAGAIN during the testing is shown below:
>
>   memory_failure()
>     try_to_split_thp_page()
>       split_huge_page()
>         split_huge_page_to_list() {
>           ...
>           Step A: can_split_folio() - Checked that the thp can be split.
>           Step B: unmap_folio()
>           Step C: folio_ref_freeze() - Failed and returned -EAGAIN.
>           ...
>         }
>
> The testing logs indicated that some huge pages were split successfully
> via the call path above (Step C was successful for these huge pages).
> However, some huge pages failed to split due to a failure at Step C, and
> it was observed that the reference count of the huge page increased between
> Step A and Step C.
>
> Testing has shown that after receiving -EAGAIN, simply re-splitting the
> hw-poisoned huge page within memory_failure() always results in the same
> -EAGAIN. This is possibly because memory_failure() is executed in the
> currently affected process. Before this process exits memory_failure() and
> is terminated, its threads could increase the reference count of the
> hw-poisoned page.
>
> Furthermore, if the hw-poisoned huge page was mapped as the victim
> application's text, was present in the file cache, and failed to be
> split, then attempting to restart the process without splitting the
> hw-poisoned huge page failed. This was possibly because its text was
> remapped to the hw-poisoned huge page from the file cache, leading to
> its swift termination due to another MCE.

So we're hoping that when the worker runs to split the page, the
process and its threads have exited. What guarantees this timing?

And we're hoping that the worker has split the page before userspace
attempts to restart the process. What guarantees this timing?

All this reliance upon fortunate timing sounds rather unreliable,
doesn't it?
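
For reference, the Step A / Step C window in the quoted changelog can be
modelled in userspace. The sketch below is only a toy under stated
assumptions, not the kernel code: struct toy_folio, toy_can_split(),
toy_ref_freeze() and toy_split() are hypothetical names, only the
reference count is modelled, and the toy returns -EAGAIN for both the
failed check and the failed freeze. It shows that a retry can succeed
only once the extra reference has been dropped, which is exactly the
timing being relied upon.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a folio: only the reference count is modelled. */
struct toy_folio {
	atomic_int refcount;
};

/* Step A analogue: racy check that no unexpected reference is held. */
static bool toy_can_split(struct toy_folio *f, int expected)
{
	return atomic_load(&f->refcount) == expected;
}

/* Step C analogue: swap the count from 'expected' to 0, or fail untouched. */
static bool toy_ref_freeze(struct toy_folio *f, int expected)
{
	int old = expected;

	return atomic_compare_exchange_strong(&f->refcount, &old, 0);
}

/*
 * One split attempt. 'racing_gets' models references taken by other
 * threads in the window between Step A and Step C (assumption: the
 * unmap step in between is not modelled at all).
 */
static int toy_split(struct toy_folio *f, int expected, int racing_gets)
{
	assert(expected > 0 && racing_gets >= 0);

	if (!toy_can_split(f, expected))
		return -EAGAIN;

	/* Window between the check and the freeze. */
	atomic_fetch_add(&f->refcount, racing_gets);

	if (!toy_ref_freeze(f, expected))
		return -EAGAIN;

	return 0;
}
```

With the count initialised to 3, toy_split(&f, 3, 1) passes the check
and then fails the freeze, leaving the count at 4. An immediate
toy_split(&f, 3, 0) fails again while that reference is held. Only after
the holder drops it does the same call return 0 and leave the count
frozen at 0.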