From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
To: Borislav Petkov <bp@alien8.de>
Cc: Shiju Jose <shiju.jose@huawei.com>,
"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"tony.luck@intel.com" <tony.luck@intel.com>,
"rafael@kernel.org" <rafael@kernel.org>,
"lenb@kernel.org" <lenb@kernel.org>,
"mchehab@kernel.org" <mchehab@kernel.org>,
"dan.j.williams@intel.com" <dan.j.williams@intel.com>,
"dave@stgolabs.net" <dave@stgolabs.net>,
"gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
"sudeep.holla@arm.com" <sudeep.holla@arm.com>,
"jassisinghbrar@gmail.com" <jassisinghbrar@gmail.com>,
"dave.jiang@intel.com" <dave.jiang@intel.com>,
"alison.schofield@intel.com" <alison.schofield@intel.com>,
"vishal.l.verma@intel.com" <vishal.l.verma@intel.com>,
"ira.weiny@intel.com" <ira.weiny@intel.com>,
"david@redhat.com" <david@redhat.com>,
"Vilas.Sridharan@amd.com" <Vilas.Sridharan@amd.com>,
"leo.duran@amd.com" <leo.duran@amd.com>,
"Yazen.Ghannam@amd.com" <Yazen.Ghannam@amd.com>,
"rientjes@google.com" <rientjes@google.com>,
"jiaqiyan@google.com" <jiaqiyan@google.com>,
"Jon.Grimm@amd.com" <Jon.Grimm@amd.com>,
"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
"naoya.horiguchi@nec.com" <naoya.horiguchi@nec.com>,
"james.morse@arm.com" <james.morse@arm.com>,
"jthoughton@google.com" <jthoughton@google.com>,
"somasundaram.a@hpe.com" <somasundaram.a@hpe.com>,
"erdemaktas@google.com" <erdemaktas@google.com>,
"pgonda@google.com" <pgonda@google.com>,
"duenwen@google.com" <duenwen@google.com>,
"gthelen@google.com" <gthelen@google.com>,
"wschwartz@amperecomputing.com" <wschwartz@amperecomputing.com>,
"dferguson@amperecomputing.com" <dferguson@amperecomputing.com>,
"wbs@os.amperecomputing.com" <wbs@os.amperecomputing.com>,
"nifan.cxl@gmail.com" <nifan.cxl@gmail.com>,
tanxiaofei <tanxiaofei@huawei.com>,
"Zengtao (B)" <prime.zeng@hisilicon.com>,
"Roberto Sassu" <roberto.sassu@huawei.com>,
"kangkang.shen@futurewei.com" <kangkang.shen@futurewei.com>,
wanghuiqiang <wanghuiqiang@huawei.com>,
Linuxarm <linuxarm@huawei.com>
Subject: Re: [PATCH v15 11/15] EDAC: Add memory repair control feature
Date: Fri, 15 Nov 2024 12:14:15 +0000
Message-ID: <20241115121415.00005c76@huawei.com>
In-Reply-To: <20241114133249.GEZzX8ATNyc_Xw1L52@fat_crate.local>
Hi Borislav,
I'll just jump in on one element.
> > This will work for the CXL PPR feature, where the result of the query for resource availability
> > is returned by the command itself. However, for the CXL memory sparing features, the result of the
> > query is returned later in a Memory Sparing Event Record from the device.
> > Userspace shall issue the repair operation with the attribute values received in the Memory Sparing trace event.
> > Thus, for the CXL memory sparing feature, the query for resource availability and the repair operation
> > cannot be combined.
>
> What happens if the resources availability changes between the query and the
> start of the repair operation?
>
Short answer: you get an error return.
The query is an optional step / optimization. You can just skip it.
There is no point in querying if you are going to immediately issue the command to repair
(as that will report an error if you can't do it).
A typical flow where it might be useful is:
1) Lots of corrected errors reported on a particular part of the memory.
2) The OS decides enough is enough and that the row/bank/nibble should be replaced.
3) Before doing so, it checks that it can actually replace it. Otherwise we may be disrupting a
gigantic page or similar, where the perf cost of just offlining is higher than we want.
4) After the query, the page is offlined etc. (this may or may not be necessary depending on the
hardware design - we may be able to do it 'live').
5) 'Try' to repair. Hopefully no one raced with us and used up the remaining resources.
Given this is typically only driven by something like RASDaemon, that race should be
a very unlikely corner case.
6) If repair fails, we can just bring the memory back - but this dance was expensive and
we will carry on working with less than ideal memory (probably schedule some
real maintenance to swap out the device).
7) If repair succeeds, bring the memory back, as we now have shiny new memory.
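To make the flow above a bit more concrete, here is a rough sketch in Python against a made-up device model. None of these names (FakeRepairDevice, query_resources, repair, NoRepairResources, try_repair) exist in the EDAC or CXL code; they are purely illustrative, and the offline/online callbacks stand in for whatever the OS actually does to the memory.

```python
class NoRepairResources(Exception):
    """Hypothetical error: the device has no spare resources left."""


class FakeRepairDevice:
    """Toy stand-in for a device with a limited pool of repair resources."""

    def __init__(self, spares):
        self.spares = spares

    def query_resources(self):
        # Optional step: only a snapshot, it may be stale by the time we repair.
        return self.spares > 0

    def repair(self, address):
        # The repair itself is the authoritative check.
        if self.spares == 0:
            raise NoRepairResources(f"no resources to repair {address:#x}")
        self.spares -= 1


def try_repair(dev, address, offline, online, racer=None):
    """Return 'skipped', 'failed' or 'repaired' for the flow described above."""
    # Step 3: check first so we do not offline memory for nothing.
    if not dev.query_resources():
        return "skipped"
    # Step 4: offline the memory (may not be needed on all hardware).
    offline(address)
    # Illustration only: someone else may use up the last resource here.
    if racer is not None:
        racer()
    try:
        # Step 5: 'try' the repair; a lost race shows up as a normal error.
        dev.repair(address)
        result = "repaired"
    except NoRepairResources:
        # Step 6: the dance was expensive and the memory is still less than ideal.
        result = "failed"
    # Steps 6 and 7: bring the memory back either way.
    online(address)
    return result
```

The point of the sketch is that the query only saves the cost of offlining when there is nothing to repair with. Correctness never depends on it, because the repair call reports the lack of resources on its own.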
We could drop the query for now and bring it back later once more of the surrounding
infrastructure becomes clearer. To me it's a useful feature, but I appreciate
this is early days and we shouldn't always try for all the bells and whistles on
day 1.
> The cat catches fire?
Dog person? :) Just a nice normal error return to indicate no resources.
Jonathan
>