From: Miaohe Lin <linmiaohe@huawei.com>
To: David Hildenbrand <david@redhat.com>, <naoya.horiguchi@nec.com>
Cc: <linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
Oscar Salvador <osalvador@suse.de>
Subject: Re: [PATCH RFC] mm/memory-failure.c: fix memory failure race with memory offline
Date: Thu, 10 Mar 2022 21:04:23 +0800
Message-ID: <dee60691-3873-ac7b-021e-d4eb73d494cc@huawei.com>
In-Reply-To: <4307e915-ac24-58bc-23ad-7e94e2b37170@redhat.com>

On 2022/3/1 17:53, David Hildenbrand wrote:
> On 26.02.22 10:40, Miaohe Lin wrote:
>> There is a theoretical race window between memory failure and memory
>> offline. Think about the below scene:
>>
>> CPU A                                   CPU B
>> memory_failure                          offline_pages
>>   mutex_lock(&mf_mutex);
>>   TestSetPageHWPoison(p)
>>                                           start_isolate_page_range
>>                                             has_unmovable_pages
>>                                               --PageHWPoison is movable
>>                                           do {
>>                                             scan_movable_pages
>>                                             do_migrate_range
>>                                               --PageHWPoison isn't migrated
>>                                           }
>>                                           test_pages_isolated
>>                                             --PageHWPoison is isolated
>>                                         remove_memory
>>   access page... bang
>>   ...
>
> I think the motivation for the offlining code was to not block memory
> hotunplug (especially on ZONE_MOVABLE) just because there is a
> HWpoisoned page. But how often does that happen?
>
> It's all semi-broken either way. Assume you just offlined a memory block
> with a hwpoisoned page. The memmap is stale and the information about
> hwpoison is lost. You can happily re-online that memory block and use
> *all* memory, including previously hwpoisoned memory. Note that this
> used to be different in the past, when the memmap was initialized when
> adding memory, not when onlining that memory.
>
>
> IMHO, we should stop special casing hwpoison. Either fail offlining
> completely if we stumble over a hwpoisoned page, or allow offlining only
> if the refcount==0 -- just as any other page.
>
>
IIUC, there is no easy way to find out whether a hwpoisoned page can be
safely offlined. If memory_failure succeeds, the page refcnt should be 1. But
if it fails, the page refcnt is unknown. So failing offlining completely when
we stumble over a hwpoisoned page seems to be the most suitable way to close
the race. But is this overkill for such rare cases? Any suggestions?
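
To make the two options above concrete, here is a rough userspace model of
the decision the offline path would have to take. This is only a sketch and
not kernel code: struct toy_page, toy_offline_strict() and
toy_offline_refcount() are made-up names for illustration, and the model
assumes migration has already been attempted, so any remaining reference
means the page is still in use.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page; NOT the kernel structure. */
struct toy_page {
	bool hwpoison;	/* models PageHWPoison() */
	int refcount;	/* models page_count() */
};

/*
 * Option 1: refuse to offline the range as soon as any hwpoisoned
 * page is seen, regardless of its refcount.
 */
static int toy_offline_strict(const struct toy_page *pages, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (pages[i].hwpoison)
			return -EBUSY;
	return 0;
}

/*
 * Option 2: stop special casing hwpoison and only allow offlining
 * when no page in the range is still referenced (refcount == 0).
 */
static int toy_offline_refcount(const struct toy_page *pages, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (pages[i].refcount != 0)
			return -EBUSY;
	return 0;
}
```

In this model, a page that memory_failure handled successfully (hwpoison set,
refcount 1) blocks offlining under both variants. The two only differ for a
hwpoisoned page whose refcount happens to be 0: the strict variant still
returns -EBUSY, while the refcount variant lets the range go.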
Many thanks!