From: Miaohe Lin <linmiaohe@huawei.com>
To: David Hildenbrand <david@redhat.com>
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
<akpm@linux-foundation.org>, <naoya.horiguchi@nec.com>,
<osalvador@suse.de>
Subject: Re: [PATCH RFC] mm/memory-failure.c: fix memory failure race with memory offline
Date: Tue, 1 Mar 2022 21:22:11 +0800
Message-ID: <b36f92bd-e0ec-89db-c830-5cf21d3b61a1@huawei.com>
In-Reply-To: <4307e915-ac24-58bc-23ad-7e94e2b37170@redhat.com>
On 2022/3/1 17:53, David Hildenbrand wrote:
> On 26.02.22 10:40, Miaohe Lin wrote:
>> There is a theoretical race window between memory failure and memory
>> offline. Think about the below scene:
>>
>>   CPU A                                  CPU B
>> memory_failure                          offline_pages
>>   mutex_lock(&mf_mutex);
>>   TestSetPageHWPoison(p)
>>                                           start_isolate_page_range
>>                                             has_unmovable_pages
>>                                               --PageHWPoison is movable
>>                                           do {
>>                                             scan_movable_pages
>>                                             do_migrate_range
>>                                               --PageHWPoison isn't migrated
>>                                           }
>>                                           test_pages_isolated
>>                                             --PageHWPoison is isolated
>>                                         remove_memory
>>   access page... bang
>>   ...
>
> I think the motivation for the offlining code was to not block memory
> hotunplug (especially on ZONE_MOVABLE) just because there is a
> HWpoisoned page. But how often does that happen?
This should be really rare. Memory failure itself shouldn't be common;
otherwise we have other problems.
>
> It's all semi-broken either way. Assume you just offlined a memory block
> with a hwpoisoned page. The memmap is stale and the information about
> hwpoison is lost. You can happily re-online that memory block and use
> *all* memory, including previously hwpoisoned memory. Note that this
Agreed. This is how it works now. But it seems the hwpoisoned memory could
then be used again as normal memory after offline+online.
> used to be different in the past, when the memmap was initialized when
> adding memory, not when onlining that memory.
>
>
> IMHO, we should stop special casing hwpoison. Either fail offlining
> completely if we stumble over a hwpoisoned page, or allow offlining only
> if the refcount==0 -- just as any other page.
>
I'm not sure whether this "rare" race condition is worth fixing. But the problem
is there and we might come across it. Failing offlining completely doesn't sound
that good, but it looks hard to reliably detect an "offline-safe" hwpoisoned page.
I can't come up with a solution...

Many thanks for your reply and comments. :)
>
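For illustration only, the two options suggested in the quoted text above
(fail offlining on any hwpoisoned page, or treat it like any other page and
require refcount == 0) can be modeled in user space. This is a rough sketch,
not kernel code and not part of any posted patch: struct fake_page,
enum offline_policy and range_offline_ok() are hypothetical names, and a
non-zero refcount is assumed to mean "still in use after migration was tried".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parts of struct page that matter here. */
struct fake_page {
	bool hwpoison;	/* models PageHWPoison() */
	int refcount;	/* models page_count() after migration was attempted */
};

/* The two policies discussed in the thread (names are made up). */
enum offline_policy {
	POLICY_FAIL_ON_HWPOISON,	/* refuse offlining if any page is hwpoisoned */
	POLICY_REFCOUNT_ONLY,		/* no special casing: only refcount == 0 counts */
};

/* Return true if every page in the range may be offlined under @policy. */
static bool range_offline_ok(const struct fake_page *pages, size_t n,
			     enum offline_policy policy)
{
	for (size_t i = 0; i < n; i++) {
		if (policy == POLICY_FAIL_ON_HWPOISON && pages[i].hwpoison)
			return false;
		/* Assumption: a remaining reference means the page is still in use. */
		if (pages[i].refcount != 0)
			return false;
	}
	return true;
}
```

Under the first policy a single hwpoisoned page blocks the whole range; under
the second, a hwpoisoned page only blocks offlining while something still
holds a reference to it.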