From: David Hildenbrand <david@redhat.com>
To: "HORIGUCHI NAOYA(堀口 直也)" <naoya.horiguchi@nec.com>
Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Miaohe Lin <linmiaohe@huawei.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Yang Shi <shy828301@gmail.com>,
	Oscar Salvador <osalvador@suse.de>,
	Muchun Song <songmuchun@bytedance.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH v1 0/4] mm, hwpoison: improve handling workload related to hugetlb and memory_hotplug
Date: Thu, 28 Apr 2022 10:44:15 +0200
Message-ID: <bb1caf48-7e9d-61bf-e0dc-72fcc0228f28@redhat.com>
In-Reply-To: <20220427122049.GA3918978@hori.linux.bs1.fc.nec.co.jp>

>> 2) It happens rarely (ever?), so do we even care?
> 
> I'm not certain of the rarity.  Some cloud service providers who maintain
> lots of servers may care?

About replacing broken DIMMs? I'm not so sure, especially because it
requires a special setup with ZONE_MOVABLE (i.e., movablecore) to be
somewhat reliable, and individual DIMMs usually cannot be replaced at all.

> 
>> 3) Once the memory is offline, we can re-online it and lose HWPoison.
>>    The memory can be happily used.
>>
>> 3) can happen easily if our DIMM consists of multiple memory blocks and
>> offlining of some memory block fails -> we'll re-online all already
>> offlined ones. We'll happily reuse previously HWPoisoned pages, which
>> feels more dangerous to me than just leaving the DIMM around (and
>> eventually hwpoisoning all pages on it such that it won't get used
>> anymore?).
> 
> I see. This scenario can often happen.
> 
>>
>> So maybe we should just fail offlining once we stumble over a hwpoisoned
>> page?
> 
> That could be one choice.
> 
> Maybe another is like this: offlining can succeed but HWPoison flags are
> kept over offline-reonline operations.  If the system noticed that the
> re-onlined blocks are backed by the original DIMMs or NUMA nodes, then the
> saved HWPoison flags are still effective, so keep using them.  If the
> re-onlined blocks are backed by replaced DIMMs/NUMA nodes, then we can clear
> all HWPoison flags associated with replaced physical address range.  This
> can be done automatically in re-onlining if there's a way for kernel to know
> whether DIMM/NUMA nodes are replaced with new ones.  But if there isn't,
> system applications have to check the HW and explicitly reset the HWPoison
> flags.

Offline memory sections have a stale memmap, so we cannot trust anything
stored in it. And trying to work around that, or adjusting the memory
onlining code, overcomplicates something we really don't care about
supporting.

So if we continue allowing offlining memory blocks with poisoned pages,
we could simply remember that the memory block had a poisoned page
(either per memory section or, maybe better, for the whole memory
block). We can then simply reject/fail memory onlining of these memory
blocks.

So that leaves us with either

1) Fail offlining -> no need to care about reonlining
2) Succeed offlining but fail re-onlining
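
To make the two options concrete, here is a minimal user-space sketch.
It is not kernel code: struct toy_memory_block, enum poison_policy and
the toy_*() helpers are hypothetical names made up for illustration.
The only assumption carried over from above is that the poison count
lives with the memory block itself and not in the memmap, so it
survives offlining.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; none of these names exist in the kernel. */
enum poison_policy {
	POLICY_FAIL_OFFLINE,	/* option 1: refuse to offline poisoned blocks */
	POLICY_FAIL_REONLINE,	/* option 2: offline works, re-onlining is refused */
};

struct toy_memory_block {
	bool online;
	/* Tracked per block, outside the memmap, so it survives offlining. */
	unsigned int nr_hwpoison;
};

/* Record a memory error that hit a page of an online block. */
void toy_mark_poisoned(struct toy_memory_block *blk)
{
	assert(blk != NULL);
	if (blk->online)
		blk->nr_hwpoison++;
}

/* Offline a block; under option 1 a poisoned block stays online. */
int toy_offline(struct toy_memory_block *blk, enum poison_policy policy)
{
	assert(blk != NULL);
	if (!blk->online)
		return -EINVAL;
	if (policy == POLICY_FAIL_OFFLINE && blk->nr_hwpoison > 0)
		return -EBUSY;
	blk->online = false;
	return 0;
}

/*
 * Online a block. A block that was offlined with poisoned pages is
 * rejected. Under option 1 such a block can never be offline in this
 * model, so the check only ever triggers for option 2.
 */
int toy_online(struct toy_memory_block *blk)
{
	assert(blk != NULL);
	if (blk->online)
		return -EINVAL;
	if (blk->nr_hwpoison > 0)
		return -EBUSY;
	blk->online = true;
	return 0;
}
```

With option 1 the poisoned block simply stays online and the onlining
path needs no extra check. With option 2 the block goes offline but can
never come back, so the previously poisoned pages are not reused either
way.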

-- 
Thanks,

David / dhildenb



