From: Borislav Petkov <bp@alien8.de>
To: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Shiju Jose <shiju.jose@huawei.com>,
"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"tony.luck@intel.com" <tony.luck@intel.com>,
"rafael@kernel.org" <rafael@kernel.org>,
"lenb@kernel.org" <lenb@kernel.org>,
"mchehab@kernel.org" <mchehab@kernel.org>,
"dan.j.williams@intel.com" <dan.j.williams@intel.com>,
"dave@stgolabs.net" <dave@stgolabs.net>,
"dave.jiang@intel.com" <dave.jiang@intel.com>,
"alison.schofield@intel.com" <alison.schofield@intel.com>,
"vishal.l.verma@intel.com" <vishal.l.verma@intel.com>,
"ira.weiny@intel.com" <ira.weiny@intel.com>,
"david@redhat.com" <david@redhat.com>,
"Vilas.Sridharan@amd.com" <Vilas.Sridharan@amd.com>,
"leo.duran@amd.com" <leo.duran@amd.com>,
"Yazen.Ghannam@amd.com" <Yazen.Ghannam@amd.com>,
"rientjes@google.com" <rientjes@google.com>,
"jiaqiyan@google.com" <jiaqiyan@google.com>,
"Jon.Grimm@amd.com" <Jon.Grimm@amd.com>,
"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
"naoya.horiguchi@nec.com" <naoya.horiguchi@nec.com>,
"james.morse@arm.com" <james.morse@arm.com>,
"jthoughton@google.com" <jthoughton@google.com>,
"somasundaram.a@hpe.com" <somasundaram.a@hpe.com>,
"erdemaktas@google.com" <erdemaktas@google.com>,
"pgonda@google.com" <pgonda@google.com>,
"duenwen@google.com" <duenwen@google.com>,
"gthelen@google.com" <gthelen@google.com>,
"wschwartz@amperecomputing.com" <wschwartz@amperecomputing.com>,
"dferguson@amperecomputing.com" <dferguson@amperecomputing.com>,
"wbs@os.amperecomputing.com" <wbs@os.amperecomputing.com>,
"nifan.cxl@gmail.com" <nifan.cxl@gmail.com>,
tanxiaofei <tanxiaofei@huawei.com>,
"Zengtao (B)" <prime.zeng@hisilicon.com>,
Roberto Sassu <roberto.sassu@huawei.com>,
"kangkang.shen@futurewei.com" <kangkang.shen@futurewei.com>,
wanghuiqiang <wanghuiqiang@huawei.com>,
Linuxarm <linuxarm@huawei.com>
Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
Date: Tue, 21 Jan 2025 17:16:53 +0100 [thread overview]
Message-ID: <20250121161653.GAZ4_IdYDQ9_-QoEvn@fat_crate.local> (raw)
In-Reply-To: <20250113110740.00003a7c@huawei.com>
On Mon, Jan 13, 2025 at 11:07:40AM +0000, Jonathan Cameron wrote:
> We can do that if you prefer. I'm not that fussed how this is handled
> because, for tooling at least, I don't see why we'd ever read it.
> It's for human parsing only and the above is fine.
Is there even a concrete use case for humans currently? Because if not, we
might as well not do it at all and keep it simple.
All I see is an avalanche of sysfs nodes and I'm questioning the usefulness of
the interface and what the 30K ft big picture for all this is.
If this all is just wishful thinking on the part of how this is going to be
used, then I agree with Dan: less is more. But I need to read the rest of that
thread when there's time.
...
> Repair can be a feature of the DIMMs themselves or it can be a feature
> of the memory controller. It is basically replacing failed memory with spare
> memory from somewhere else (usually elsewhere on the same DIMMs, which have
> a bit of spare capacity for this). Bit like a hot spare in a RAID setup.
Ooh, so this is what you call repair. I know one example of that: using a
spare rank or so.
What I thought you meant by repair is what you mean by "correct". Ok,
I see.
> In some other systems the OS gets the errors and is responsible for making
> the decision.
This decision has been kept away from the OS in my world so far. So yes, the
FW doing all the RAS recovery work is more like it. And the FW is the better
agent in some sense because it has a lot more intimate knowledge of the
platform. However...
> Sticking to the corrected error case (uncorrected handling
> is going to require a lot more work given we've lost data, Dan asked about that
> in the other branch of the thread), the OS as a whole (kernel + userspace)
> gets the error records and makes the policy decision to repair based on
> assessment of risk vs resource availability to make a repair.
>
> Two reasons for this
> 1) Hardware isn't necessarily capable of repairing autonomously as
> other actions may be needed (memory traffic to some granularity of
> memory may need to be stopped to avoid timeouts). Note there are many
> gradations of this, from A) can do it live with no visible effect, through
> B) offline a page, to C) offlining the whole device.
> 2) Policy can be a lot more sophisticated than a BMC can do.
... yes, that's why you can't rely only on the FW to do recovery but involve
the OS too. Basically what I've been saying all those years. Oh well...
> In some cases perhaps, but another very strong driver is that policy is involved.
>
> We can either try to put a complex design in firmware and poke it with N opaque
> parameters from a userspace tool or via some out-of-band method, or we can put
> the algorithm in userspace where it can be designed to incorporate lessons learnt
> over time. We will start simple and see what is appropriate as this starts
> to get used in large fleets. This stuff is a reasonable target for AI type
> algorithms etc that we aren't going to put in the kernel.
>
> Doing this at all is a reliability optimization, normally it isn't required for
> correct operation.
I'm not saying you should put an AI engine into the kernel - all I'm saying
is, the stuff which the kernel can decide itself without user input doesn't
need user input. The only toggle needed is whether the kernel should do this
correction and/or repair automatically or not.
What is clear here is that you can't design an interface properly right now
for algorithms which you don't have yet. And there's experience missing from
running this in large fleets.
But the interface you're adding now will remain forever cast in stone. Just
for us to realize one day that we're not really using it but it is sitting out
there dead in the water and we can't retract it. Or we're not using it as
originally designed but differently and we need this and that hack to make it
work for the current sensible use case.
So the way it looks to me right now is, you want this to be in debugfs. You
want to go nuts there, collect experience, algorithms, lessons learned etc and
*then*, the parts which are really useful and sensible should be moved to
sysfs and cast in stone. But not preemptively like that.
> Offline has no permanent cost and no limit on number of times you can
> do it. Repair is definitely a limited resource and may permanently use
> up that resource (discoverable as a policy wants to know that too!)
> In some cases once you run out of repair resources you have to send an
> engineer to replace the memory before you can do it again.
Yes, and until you can do that, and because cloud doesn't want to *ever*
reboot, you must keep the machine running with diminished but still present
capabilities by offlining pages and cordoning off faulty hw, etc, etc.
> Ok. I guess it is an option (I wasn't aware of that work).
>
> I was thinking that was far more complex to deal with than just doing it in
> userspace tooling. From a quick look that solution seems to rely on ACPI ERST
> infrastructure to provide that persistence that we won't generally have but
> I suppose we can read it from the filesystem or other persistent stores.
> We'd need to be a lot more general about that as can't make system assumptions
> that can be made in AMD specific code.
>
> So could be done, I don't think it is a good idea in this case, but that
> example does suggest it is possible.
You can look at this as specialized solutions. Could they be more general?
Ofc. But we don't have a general RAS architecture which is vendor-agnostic.
> In the approach we are targeting, there is no round-trip situation. We let the kernel
> deal with any synchronous error just fine and run its existing logic
> to offline problematic memory. That needs to be timely and to carry on operating
> exactly as it always has.
>
> In parallel with that we gather the error reports that we will already be
> gathering and run analysis on those. From that we decide if a memory region is
> likely to fail again and perform a sparing operation if appropriate.
> Effectively this is 'free'. All the information is already there in userspace
> and already understood by tools like rasdaemon, we are not expanding that
> reporting interface at all.
That is fair. I think you can do that even now if the errors logged have
enough hw information to classify them and use them for predictive analysis.
> Ok. It seems you correlate the number of files with complexity.
No, wrong. I'm looking at the interface and am wondering how is this going to
be used and whether it is worth it to have it cast in stone forever.
> I correlated difficulty of understanding those files with complexity.
> Every one of the files is clearly defined and aligned with a long history
> of how to describe DRAM (see how long CPER records have used these
> fields for example - they go back to the beginning).
Ok, then pls point me to the actual use cases showing how those files are
going to be used or how they are used already.
> I'm all in favor of building an interface up by providing minimum first
> and then adding to it, but here what is proposed is the minimum for basic
> functionality and the alternative of doing the whole thing in kernel both
> puts complexity in the wrong place and restricts us in what is possible.
There's another point to consider: if this is the correct and proper solution
for *your* fleet, that doesn't necessarily mean it is the correct and
generic solution for *everybody* using the kernel. So you can imagine that I'd
like to have a generic solution which can maximally include everyone instead
of *some* special case only.
> To some degree but I think there is a major mismatch in what we think
> this is for.
>
> What I've asked Shiju to look at is splitting the repair infrastructure
> into two cases so that maybe we can make partial progress:
>
> 1) Systems that support repair by Physical Address
> - Covers Post Package Repair for CXL
>
> 2) Systems that support repair by description of the underlying hardware
> - Covers Memory Sparing interfaces for CXL.
>
> We need both longer term anyway, but maybe 1 is less controversial simply
> on the basis that it has fewer control parameters.
>
> This still fundamentally puts the policy in userspace where I
> believe it belongs.
Ok, this is more concrete. Let's start with those. Can I have some more
details on how this works pls and who does what? Is it generic enough?
If not, can it live in debugfs for now? See above what I mean about this.
Big picture: what is the kernel's role here? To be a parrot to carry data
back'n'forth or can it simply do clear-cut decisions itself without the need
for userspace involvement?
So far I get the idea that this is something for your RAS needs. This should
have general usability for the rest of the kernel users - otherwise it should
remain a vendor-specific solution until it is needed by others and can be
generalized.
Also, can already existing solutions in the kernel be generalized so that you
can use them too and others can benefit from your improvements?
I hope this makes more sense.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette