Re: [PATCH v2 2/2] mm: memory-failure: Re-split hw-poisoned huge page on -EAGAIN


From: Miaohe Lin <linmiaohe@huawei.com>
To: "Zhuo, Qiuxu" <qiuxu.zhuo@intel.com>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: "naoya.horiguchi@nec.com" <naoya.horiguchi@nec.com>,
	"Luck, Tony" <tony.luck@intel.com>,
	"Huang, Ying" <ying.huang@intel.com>,
	"Yin, Fengwei" <fengwei.yin@intel.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v2 2/2] mm: memory-failure: Re-split hw-poisoned huge page on -EAGAIN
Date: Wed, 3 Jan 2024 10:47:13 +0800
Message-ID: <bb76102e-bfe7-ea96-3e26-be68752d1664@huawei.com>
In-Reply-To: <CY8PR11MB7134D3ADA0BCDAB938E6E2A58961A@CY8PR11MB7134.namprd11.prod.outlook.com>

On 2024/1/2 10:41, Zhuo, Qiuxu wrote:
>> From: Andrew Morton <akpm@linux-foundation.org>
> 
> Hi Andrew, 
> 
> Happy New Year. 
> Thanks for reviewing the patch.
> Please see the comments inline.
> 
>> ...
>>
>> So we're hoping that when the worker runs to split the page, the process and
>> its threads have exited.  What guarantees this timing?
> 
> Case 1: If the threads of the victim process do not access the new mapping to
> the h/w-poisoned huge page (no refcnt increase), the h/w-poisoned huge page
> should be successfully split in the process context. There is no need for the
> worker to split this h/w-poisoned page.
> 
> Case 2: If the threads of the victim process access the new mapping to the
> hardware-poisoned huge page (refcnt increase), causing the failure of splitting
> the hardware-poisoned huge page, a new MCE will be re-triggered immediately.
> Consequently, the process will be promptly terminated upon re-entering the
> code below:
> 
> MCE occurs:
>   memory_failure()
>   {
>       ...
>       if (TestSetPageHWPoison(p)) {
>           ...
>           kill_accessing_process(current, pfn, flags);
>           ...
>       }
>       ...
>   }
> 
> The worker splits the h/w-poisoned huge page in the background with retry
> delays of 1ms, 2ms, 4ms, 8ms, ..., 512ms. Before the maximum 512ms delay is
> reached, the process and its threads should already have exited. So, the
> retry delays can guarantee the timing.
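
As a quick check on the retry schedule quoted above, here is a minimal sketch
in C of the doubling delays. The names retry_schedule(), total_delay_ms() and
MAX_RETRY_DELAY_MS are made up for illustration and are not taken from the
patch; only the 1ms ... 512ms doubling comes from the thread. It is plain
userspace arithmetic, not kernel code.

```c
/* Assumption: 512 ms is the last delay in the schedule, as quoted above. */
#define MAX_RETRY_DELAY_MS 512u

/*
 * Fill delays[] with 1, 2, 4, ... up to MAX_RETRY_DELAY_MS (in ms).
 * Writes at most cap entries and returns how many were written.
 */
unsigned int retry_schedule(unsigned int *delays, unsigned int cap)
{
    unsigned int n = 0;

    for (unsigned int d = 1; d <= MAX_RETRY_DELAY_MS && n < cap; d <<= 1)
        delays[n++] = d;
    return n;
}

/* Sum of the first n delays: the longest the worker would keep waiting. */
unsigned int total_delay_ms(const unsigned int *delays, unsigned int n)
{
    unsigned int sum = 0;

    for (unsigned int i = 0; i < n; i++)
        sum += delays[i];
    return sum;
}
```

Under these assumptions the schedule has 10 delays that sum to 1023 ms, so the
worker would stop retrying about one second after the first failed split.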
> 
>> And we're hoping that the worker has split the page before userspace
>> attempts to restart the process.  What guarantees this timing?
> 
> Our experiments showed that an immediate restart of the victim process was
> consistently successful. This success could be attributed to the duration between
> the process being killed and its subsequent restart being sufficiently long,
> allowing the worker enough time to split the hardware-poisoned page.
> However, in theory, this timing indeed isn't guaranteed.
> 
>> All this reliance upon fortunate timing sounds rather unreliable, doesn't it?
> 
> The timing of the victim process exit can be guaranteed.
> The timing of the new restart of the process cannot be guaranteed in theory.
> 
> The patch is not perfect, but it still provides the victim process with the
> opportunity to be restarted successfully.

Would it be better if each affected process tried re-splitting the hw-poisoned huge page itself before
returning to userspace? Every affected process (including any later restarted one) would then attempt
the re-split, and the last one, with no competitor left, would get the work done. The delayed work
would then not be needed. Would this be more reliable?

Thanks.

> 
> Thanks!
> -Qiuxu
>

Thread overview: 10+ messages
2023-12-15  8:12 [PATCH 1/1] " Qiuxu Zhuo
2023-12-19  2:17 ` Naoya Horiguchi
2023-12-20  8:44   ` Zhuo, Qiuxu
2023-12-19 11:50 ` Miaohe Lin
2023-12-20  8:56   ` Zhuo, Qiuxu
2023-12-22  6:27 ` [PATCH v2 1/2] mm: memory-failure: Make memory_failure_queue_delayed() helper Qiuxu Zhuo
2023-12-22  6:27   ` [PATCH v2 2/2] mm: memory-failure: Re-split hw-poisoned huge page on -EAGAIN Qiuxu Zhuo
2023-12-22 19:42     ` Andrew Morton
2024-01-02  2:41       ` Zhuo, Qiuxu
2024-01-03  2:47         ` Miaohe Lin [this message]
