From: "HORIGUCHI NAOYA(堀口 直也)" <naoya.horiguchi@nec.com>
To: David Rientjes <rientjes@google.com>
Cc: Jiaqi Yan <jiaqiyan@google.com>,
"Ghannam, Yazen" <Yazen.Ghannam@amd.com>,
"tony.luck@intel.com" <tony.luck@intel.com>,
"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
"david@redhat.com" <david@redhat.com>,
"erdemaktas@google.com" <erdemaktas@google.com>,
"pgonda@google.com" <pgonda@google.com>,
"duenwen@google.com" <duenwen@google.com>,
"Vilas.Sridharan@amd.com" <Vilas.Sridharan@amd.com>,
"mike.malvestuto@intel.com" <mike.malvestuto@intel.com>,
"gthelen@google.com" <gthelen@google.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"jthoughton@google.com" <jthoughton@google.com>
Subject: Re: [RFC] Kernel Support of Memory Error Detection.
Date: Tue, 13 Dec 2022 09:27:47 +0000
Message-ID: <20221213092743.GA1977915@hori.linux.bs1.fc.nec.co.jp>
In-Reply-To: <6bb93638-5702-076c-b72a-f33b39f35842@google.com>
On Tue, Nov 29, 2022 at 09:31:15PM -0800, David Rientjes wrote:
> On Thu, 3 Nov 2022, Jiaqi Yan wrote:
>
> > This RFC is a followup for [1]. We’d like to first revisit the problem
> > statement, then explain the motivation for kernel support of memory
> > error detection. We attempt to answer two key questions raised in the
> > initial memory-scanning based solution: what memory to scan and how the
> > scanner should be designed. Different from what [1] originally proposed,
> > we think a kernel-driven design similar to khugepaged/kcompactd would
> > work better than the userspace-driven design.
> >
>
> Lots of great discussion in this thread, thanks Jiaqi for a very detailed
> overview of what is trying to be addressed and the multiple options that
> we can consider.
>
> I think this thread has been a very useful starting point for us to
> discuss what should comprise the first patchset. I haven't seen any
> objections to enlightening the kernel for this support, but any additional
> feedback would indeed be useful.
>
> Let me suggest a possible way forward: if we can agree on a kernel-driven
> approach and its design allows for it to be extended for future use cases,
> then it should be possible to introduce something generally useful that
> can then be built upon later if needed.
>
> I can think about a couple future use cases that may arise that will
> impact the minimal design that you intend to introduce: (1) the ability to
> configure a hardware patrol scrubber depending on the platform, if
> possible, as a substitute for driving the scanning by a kthread, and (2)
> the ability to scan different types of memory rather than all system
> memory.
>
> Imagining the simplest possible design, I assume we could introduce a
> /sys/devices/system/node/nodeN/mcescan/* for each NUMA node on the system.
> As a foundation, this can include only a "stat" file which provides the
> interface to the memory poison subsystem that describes detected errors
> and their resolution (this would be a good starting point).
>
> Building on that, and using your reference to khugepaged, we can add
> pages_to_scan and scan_sleep_millisecs files. This will allow us to
> control scanning on demotion nodes differently. We'd want the kthread to
> be NUMA aware for the memory it is scanning, so this would simply control
> when each thread wakes up and how much memory it scans before going to
> sleep. Defaults would be disabled, so no kthreads are forked.
>
> If this needs to be extended later for a hardware patrol scrubber, we'd
> make this a request to cpu vendors to make configurable on a per socket
> basis and used only with an ACPI capability that would put it under the
> control of the kernel in place of the kthread (there would be a single
> source of truth for the scan configuration). If this is not possible,
> we'd decouple the software and hardware approach and configure the HPS
> through the ACPI subsystem independently.
>
> Subsequently, if there is a need to only scan certain types of memory per
> NUMA node, we could introduce a "type" file later under the mcescan
> directory. Idea would be to specify a bitmask to include certain memory
> types into the scan. Bits for things such as buddy pages, pcp pages,
> hugetlb pages, etc.
>
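As a rough illustration of the "type" bitmask idea quoted above, here is a
minimal user-space sketch in C. The enum name, the helper name, and the bit
positions are all hypothetical; the thread only names the page types (buddy,
pcp, hugetlb) and does not assign values to them.

```c
#include <stdbool.h>

/*
 * Hypothetical bit assignments for a per-node "type" file.
 * Only the page types come from the thread; the values are assumptions.
 */
enum mcescan_type {
	MCESCAN_TYPE_BUDDY   = 1u << 0,	/* free pages held by the buddy allocator */
	MCESCAN_TYPE_PCP     = 1u << 1,	/* pages on per-cpu lists */
	MCESCAN_TYPE_HUGETLB = 1u << 2,	/* hugetlb pages */
};

/* Return true if pages of the given type are included in the scan mask. */
static bool mcescan_type_selected(unsigned int mask, enum mcescan_type type)
{
	return (mask & (unsigned int)type) != 0;
}
```

With this layout, a mask of 0x5 would select buddy and hugetlb pages and
skip pcp pages.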
> [ And if userspace, perhaps non-root, wanted to trigger a scan of its own
> virtual memory, for example, another future extension could allow you
> to explicitly trigger a scan of the calling process, but this would be
> done in process context, not by the kthreads. ]
>
> If this is deemed acceptable, the minimal viable patchset would:
>
> - introduce the per-node mcescan directories
>
> - introduce a "stat" file that would describe the state of memory errors
> on each NUMA node and their disposition
>
> - introduce a per-node kthread driven by pages_to_scan and
> scan_sleep_millisecs to do software controlled memory scanning
>
> All future possible use cases could be extended using this later if the
> demand arises.
>
> Thoughts? It would be very useful to agree on a path forward since I
> think this would be generally useful for the kernel.
Thank you for the ideas. The above looks simple enough to me as a starting
point. One point not mentioned yet is how the in-kernel scanner finds a
broken page before the page is marked with PG_hwpoison. Some mechanism
similar to mcsafe-memcpy could be used, but a copy is probably unnecessary
because we only want to check the health of pages. So would a core routine
like mcsafe-read be introduced in the first patchset (or do we already
have it)?
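
To make the question concrete, here is a minimal user-space model in C of one
scanner wakeup that only reads pages, with no copy. It is a sketch under
stated assumptions, not kernel code: mcescan_config, page_read_check() and
mcescan_pass() are hypothetical names, the 4096-byte page and 64-byte cache
line are assumed sizes, and a machine check cannot be caught in user space,
so a poisoned page is simulated with a flag array. Only pages_to_scan and
scan_sleep_millisecs come from the proposal itself.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES  4096u	/* assumed page size */
#define CACHE_LINE_BYTES 64u	/* assumed cache line size */

/* Hypothetical per-node tunables, named after the proposed sysfs files. */
struct mcescan_config {
	unsigned int pages_to_scan;		/* pages per wakeup; 0 means disabled */
	unsigned int scan_sleep_millisecs;	/* sleep between wakeups (unused here) */
};

/*
 * Stand-in for a "mcsafe-read" style routine: touch every cache line of
 * the page without copying it anywhere. Returns 0 if the page was read,
 * -1 if it is bad. The simulated_poison flag replaces the machine-check
 * recovery that a real kernel routine would need.
 */
static int page_read_check(const volatile uint8_t *page, int simulated_poison)
{
	if (simulated_poison)
		return -1;
	for (size_t off = 0; off < PAGE_SIZE_BYTES; off += CACHE_LINE_BYTES)
		(void)page[off];	/* volatile read, result discarded */
	return 0;
}

/*
 * One wakeup: check up to cfg->pages_to_scan pages starting at *cursor,
 * wrapping around at nr_pages. Returns the number of bad pages found and
 * leaves *cursor at the page where the next wakeup should resume.
 */
static unsigned int mcescan_pass(const struct mcescan_config *cfg,
				 const uint8_t *mem, const uint8_t *poison,
				 size_t nr_pages, size_t *cursor)
{
	unsigned int bad = 0;

	if (nr_pages == 0)
		return 0;

	for (unsigned int i = 0; i < cfg->pages_to_scan; i++) {
		size_t pfn = *cursor;

		if (page_read_check(mem + pfn * PAGE_SIZE_BYTES, poison[pfn]) != 0)
			bad++;
		*cursor = (pfn + 1) % nr_pages;
	}
	return bad;
}
```

A per-node thread would call mcescan_pass() once per wakeup and then sleep
for scan_sleep_millisecs. With pages_to_scan left at 0 the pass does nothing,
which matches the "disabled by default" behaviour in the proposal.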
Thanks,
Naoya Horiguchi