Re: [RFC] Kernel Support of Memory Error Detection. - HORIGUCHI NAOYA(堀口　直也)

linux-mm.kvack.org archive mirror

From: "HORIGUCHI NAOYA(堀口　直也)" <naoya.horiguchi@nec.com>
To: David Rientjes <rientjes@google.com>
Cc: Jiaqi Yan <jiaqiyan@google.com>,
	"Ghannam, Yazen" <Yazen.Ghannam@amd.com>,
	"tony.luck@intel.com" <tony.luck@intel.com>,
	"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
	"david@redhat.com" <david@redhat.com>,
	"erdemaktas@google.com" <erdemaktas@google.com>,
	"pgonda@google.com" <pgonda@google.com>,
	"duenwen@google.com" <duenwen@google.com>,
	"Vilas.Sridharan@amd.com" <Vilas.Sridharan@amd.com>,
	"mike.malvestuto@intel.com" <mike.malvestuto@intel.com>,
	"gthelen@google.com" <gthelen@google.com>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"jthoughton@google.com" <jthoughton@google.com>
Subject: Re: [RFC] Kernel Support of Memory Error Detection.
Date: Tue, 13 Dec 2022 09:27:47 +0000
Message-ID: <20221213092743.GA1977915@hori.linux.bs1.fc.nec.co.jp>
In-Reply-To: <6bb93638-5702-076c-b72a-f33b39f35842@google.com>

On Tue, Nov 29, 2022 at 09:31:15PM -0800, David Rientjes wrote:
> On Thu, 3 Nov 2022, Jiaqi Yan wrote:
> 
> > This RFC is a followup for [1]. We’d like to first revisit the problem
> > statement, then explain the motivation for kernel support of memory
> > error detection. We attempt to answer two key questions raised in the
> > initial memory-scanning based solution: what memory to scan and how the
> > scanner should be designed. Different from what [1] originally proposed,
> > we think a kernel-driven design similar to khugepaged/kcompactd would
> > work better than the userspace-driven design.
> > 
> 
> Lots of great discussion in this thread, thanks Jiaqi for a very detailed 
> overview of what is trying to be addressed and the multiple options that 
> we can consider.
> 
> I think this thread has been a very useful starting point for us to 
> discuss what should comprise the first patchset.  I haven't seen any 
> objections to enlightening the kernel for this support, but any additional 
> feedback would indeed be useful.
> 
> Let me suggest a possible way forward: if we can agree on a kernel-driven 
> approach and its design allows for it to be extended for future use cases, 
> then it should be possible to introduce something generally useful that 
> can then be built upon later if needed.
> 
> I can think of a couple of future use cases that may arise that will 
> impact the minimal design that you intend to introduce: (1) the ability to 
> configure a hardware patrol scrubber depending on the platform, if 
> possible, as a substitute for driving the scanning by a kthread, and (2) 
> the ability to scan different types of memory rather than all system 
> memory.
> 
> Imagining the simplest possible design, I assume we could introduce a
> /sys/devices/system/node/nodeN/mcescan/* for each NUMA node on the system.  
> As a foundation, this can include only a "stat" file which provides the 
> interface to the memory poison subsystem that describes detected errors 
> and their resolution (this would be a good starting point).
> 
> Building on that, and using your reference to khugepaged, we can add 
> pages_to_scan and scan_sleep_millisecs files.  This will allow us to 
> control scanning on demotion nodes differently.  We'd want the kthread to 
> be NUMA aware for the memory it is scanning, so this would simply control 
> when each thread wakes up and how much memory it scans before going to 
> sleep.  Defaults would be disabled, so no kthreads are forked.
> 
> If this needs to be extended later for a hardware patrol scrubber, we'd 
> make this a request to cpu vendors to make configurable on a per socket 
> basis and used only with an ACPI capability that would put it under the 
> control of the kernel in place of the kthread (there would be a single 
> source of truth for the scan configuration).  If this is not possible, 
> we'd decouple the software and hardware approach and configure the HPS 
> through the ACPI subsystem independently.
> 
> Subsequently, if there is a need to only scan certain types of memory per 
> NUMA node, we could introduce a "type" file later under the mcescan 
> directory.  Idea would be to specify a bitmask to include certain memory 
> types into the scan.  Bits for things such as buddy pages, pcp pages, 
> hugetlb pages, etc.
> 
>  [ And if userspace, perhaps non-root, wanted to trigger a scan of its own 
>    virtual memory, for example, another future extension could allow you 
>    to explicitly trigger a scan of the calling process, but this would be 
>    done in process context, not by the kthreads. ]
> 
> If this is deemed acceptable, the minimal viable patchset would:
> 
>  - introduce the per-node mcescan directories
> 
>  - introduce a "stat" file that would describe the state of memory errors
>    on each NUMA node and their disposition
> 
>  - introduce a per-node kthread driven by pages_to_scan and
>    scan_sleep_millisecs to do software controlled memory scanning
> 
> All future possible use cases could be extended using this later if the 
> demand arises.
> 
> Thoughts?  It would be very useful to agree on a path forward since I 
> think this would be generally useful for the kernel.

Thank you for the ideas, the above looks simple enough to me to start with.
I think one point not mentioned yet is how the in-kernel scanner finds
a broken page before the page is marked with PG_hwpoison.  Some mechanism
similar to mcsafe-memcpy could be used, but maybe memcpy is not necessary
because we just want to check the health of pages.  So would a core routine
like mcsafe-read be introduced in the first patchset (or do we already
have one)?

Thanks,
Naoya Horiguchi
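
As a rough illustration of the design discussed above, here is a minimal
userspace sketch in C of one scanner wakeup.  It is not kernel code and not
anyone's proposed implementation.  pages_to_scan and scan_sleep_millisecs
are the knob names from the quoted proposal; everything else (struct
sim_page, struct node_scan, the SCAN_TYPE_* bits, sim_page_read_ok() and
scan_one_pass()) is a hypothetical name made up for this example.
sim_page_read_ok() stands in for the "mcsafe-read" idea: it only checks a
page and copies nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bits for the proposed "type" file (names are assumptions). */
enum scan_type {
	SCAN_TYPE_BUDDY   = 1u << 0,
	SCAN_TYPE_PCP     = 1u << 1,
	SCAN_TYPE_HUGETLB = 1u << 2,
};

/* Simulated page: a real scanner would look at struct page instead. */
struct sim_page {
	unsigned int type;	/* exactly one SCAN_TYPE_* bit */
	bool bad;		/* simulated uncorrectable memory error */
	bool hwpoison;		/* stands in for PG_hwpoison */
};

/* Per-node scanner state; the first two fields mirror the proposed files. */
struct node_scan {
	unsigned int pages_to_scan;		/* pages visited per wakeup, 0 = disabled */
	unsigned int scan_sleep_millisecs;	/* a real kthread would sleep this long between passes */
	unsigned int type_mask;			/* which page types to include */
	size_t cursor;				/* next page index to visit */
	unsigned long scanned;			/* pages actually read; would feed "stat" */
	unsigned long detected;			/* errors found; would feed "stat" */
};

/*
 * Stand-in for an "mcsafe-read": report whether reading the page would
 * succeed, without copying its contents anywhere.
 */
static bool sim_page_read_ok(const struct sim_page *p)
{
	return !p->bad;
}

/*
 * One wakeup of the per-node scanner: visit at most pages_to_scan pages,
 * skipping types outside type_mask and pages already marked poisoned.
 * Returns the number of pages visited.
 */
static unsigned int scan_one_pass(struct node_scan *ns,
				  struct sim_page *pages, size_t nr_pages)
{
	unsigned int visited = 0;

	assert(ns != NULL && pages != NULL);
	if (nr_pages == 0)
		return 0;

	while (visited < ns->pages_to_scan) {
		struct sim_page *p = &pages[ns->cursor];

		ns->cursor = (ns->cursor + 1) % nr_pages;
		visited++;

		if (!(p->type & ns->type_mask) || p->hwpoison)
			continue;

		ns->scanned++;
		if (!sim_page_read_ok(p)) {
			p->hwpoison = true;	/* hand-off to memory failure handling would go here */
			ns->detected++;
		}
	}
	return visited;
}
```

In this sketch the pages_to_scan budget counts pages visited rather than
pages read, so a pass stays bounded even when most pages are filtered out
by the type mask.  With pages_to_scan left at 0 the pass does nothing,
which matches the "defaults would be disabled" idea in the quoted text.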

Thread overview: 20+ messages
2022-11-03 15:50 Jiaqi Yan
2022-11-03 16:27 ` Luck, Tony
2022-11-03 16:40   ` Nadav Amit
2022-11-08  2:24     ` Jiaqi Yan
2022-11-08 16:17       ` Luck, Tony
2022-11-09  5:04         ` HORIGUCHI NAOYA(堀口　直也)
2022-11-10 20:23           ` Jiaqi Yan
2022-11-18  1:19           ` Jiaqi Yan
2022-11-18 14:38             ` Sridharan, Vilas
2022-11-18 17:10               ` Luck, Tony
2022-11-07 16:59 ` Sridharan, Vilas
2022-11-09  5:29 ` HORIGUCHI NAOYA(堀口　直也)
2022-11-09 16:15   ` Luck, Tony
2022-11-10 20:25     ` Jiaqi Yan
2022-11-10 20:23   ` Jiaqi Yan
2022-11-30  5:31 ` David Rientjes
2022-12-13  9:27   ` HORIGUCHI NAOYA(堀口　直也) [this message]
2022-12-13 18:09     ` Luck, Tony
2022-12-13 19:03       ` Jiaqi Yan
2022-12-14 14:45         ` Yazen Ghannam
