From: Borislav Petkov <bp@alien8.de>
To: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Shiju Jose <shiju.jose@huawei.com>,
Dan Williams <dan.j.williams@intel.com>,
"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"dave@stgolabs.net" <dave@stgolabs.net>,
"dave.jiang@intel.com" <dave.jiang@intel.com>,
"alison.schofield@intel.com" <alison.schofield@intel.com>,
"vishal.l.verma@intel.com" <vishal.l.verma@intel.com>,
"ira.weiny@intel.com" <ira.weiny@intel.com>,
"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"david@redhat.com" <david@redhat.com>,
"Vilas.Sridharan@amd.com" <Vilas.Sridharan@amd.com>,
"leo.duran@amd.com" <leo.duran@amd.com>,
"Yazen.Ghannam@amd.com" <Yazen.Ghannam@amd.com>,
"rientjes@google.com" <rientjes@google.com>,
"jiaqiyan@google.com" <jiaqiyan@google.com>,
"tony.luck@intel.com" <tony.luck@intel.com>,
"Jon.Grimm@amd.com" <Jon.Grimm@amd.com>,
"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
"rafael@kernel.org" <rafael@kernel.org>,
"lenb@kernel.org" <lenb@kernel.org>,
"naoya.horiguchi@nec.com" <naoya.horiguchi@nec.com>,
"james.morse@arm.com" <james.morse@arm.com>,
"jthoughton@google.com" <jthoughton@google.com>,
"somasundaram.a@hpe.com" <somasundaram.a@hpe.com>,
"erdemaktas@google.com" <erdemaktas@google.com>,
"pgonda@google.com" <pgonda@google.com>,
"duenwen@google.com" <duenwen@google.com>,
"mike.malvestuto@intel.com" <mike.malvestuto@intel.com>,
"gthelen@google.com" <gthelen@google.com>,
"wschwartz@amperecomputing.com" <wschwartz@amperecomputing.com>,
"dferguson@amperecomputing.com" <dferguson@amperecomputing.com>,
"wbs@os.amperecomputing.com" <wbs@os.amperecomputing.com>,
"nifan.cxl@gmail.com" <nifan.cxl@gmail.com>,
tanxiaofei <tanxiaofei@huawei.com>,
"Zengtao (B)" <prime.zeng@hisilicon.com>,
"kangkang.shen@futurewei.com" <kangkang.shen@futurewei.com>,
wanghuiqiang <wanghuiqiang@huawei.com>,
Linuxarm <linuxarm@huawei.com>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Jean Delvare <jdelvare@suse.com>,
Guenter Roeck <linux@roeck-us.net>,
Dmitry Torokhov <dmitry.torokhov@gmail.com>
Subject: Re: [RFC PATCH v8 01/10] ras: scrub: Add scrub subsystem
Date: Thu, 6 Jun 2024 18:05:33 +0200
Message-ID: <20240606160533.GDZmHeTbhCoJYKSsD2@fat_crate.local>
In-Reply-To: <20240528100645.00000765@Huawei.com>

On Tue, May 28, 2024 at 10:06:45AM +0100, Jonathan Cameron wrote:
> If dealing with disabling, I'd be surprised if it was a normal policy but
> if it were, a udev script or boot script. If unusual event (i.e. someone is

Yeah, I wouldn't disable it during boot but around my workload only. You
still want automatic scrubs to happen on the system.

> trying to reduce jitter in a benchmark targeting something else) then
> interface is simple enough that an admin can poke it directly.

Right, for benchmarks direct poking is fine.

When it is something more involved like, dunno, HPC doing a heavy workload
that wants to squeeze out all the performance, I guess turning off the
scrubbers would be part of the setup script. So yeah, if this is properly
documented, scripting around it is easy.
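
A minimal sketch of the "disable around my workload only" flow, in Python.
The attribute path is supplied by the caller and is hypothetical: the sysfs
layout and the accepted values ("0" to disable here) are assumptions and are
not settled by this thread. The helper saves the current value, writes "0",
and restores the saved value once the workload ends, even if it fails.

```python
import contextlib
import pathlib


@contextlib.contextmanager
def scrub_paused(enable_attr):
    """Disable scrubbing for the duration of a with-block, then restore it.

    enable_attr: path to a (hypothetical) scrub enable attribute that holds
    "0" when scrubbing is off. Name, location and semantics are assumptions.
    """
    attr = pathlib.Path(enable_attr)
    previous = attr.read_text().strip()   # remember what the admin had set
    attr.write_text("0\n")                # turn the scrubber off
    try:
        yield
    finally:
        attr.write_text(previous + "\n")  # restore, even on failure
```

A setup script would then wrap only the workload itself, e.g.
`with scrub_paused(attr): subprocess.run(cmd, check=True)`, so background
scrubbing resumes as soon as the job is done.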

> To a certain extent this is bounded by what the hardware lets us
> do but agreed we should make sure it 'works' for the use cases we know
> about. Starting point is some more documentation in the patch set
> giving common flows (and maybe some example scripts).

Yap, sounds good. As in: "These are the envisioned usages at the time of
writing..." or so.

> > Do you go and start a scrub cycle by hand?
>
> Typically no, but the option would be there to support an admin who is
> suspicious or who is trying to gather statistics or similar.

Ok.

> That definitely makes sense for NVDIMM scrub as the model there is
> to only ever do it on demand as a single scrub pass.
> For a cyclic scrub we can spin a policy in rasdaemon or similar to
> possibly crank up the frequency if we are getting lots of 'non scrub'
> faults (i.e. corrected errors reported on demand accesses).

I was going to suggest that: automating stuff with rasdaemon. It would
definitely simplify talking to that API.

> Shiju is our expert on this sort of userspace stats monitoring and
> handling so I'll leave him to come back with a proposal / PoC for doing that.
>
> I can see two motivations though:
> a) Gather better stats on suspect device by ensuring more correctable
>    error detections.
> b) Increase scrubbing on a device which is on its way out but not replaceable
>    yet for some reason.
>
> I would suggest this will be PoC level only for now as it will need
> a lot of testing on large fleets to do anything sophisticated.

Yeah, sounds like a good start.

> > Do you automate it? I wanna say yes because that's miles better than
> > having to explain yet another set of knobs to users.
>
> In the first instance, I'd expect a udev policy so when a new CXL memory
> turns up we set a default value. A cautious admin would have tweaked
> that script to set the default to scrub more often, an admin who
> knows they don't care might turn it off. We can include an example of that
> in next version I think.

Yes, and then hook into rasdaemon the moment it logs an error in some
component to go and increase scrubbing of that component. But yeah, you
said that above already.
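
To make the "increase scrubbing of that component" idea concrete, here is a
rough Python sketch of one possible policy. Everything in it is an
assumption for illustration: the function name, expressing the rate as a
scrub cycle time in hours, the threshold, and the halving step. How such a
policy would be fed from rasdaemon's error records is not shown.

```python
def next_scrub_cycle_hours(current_hours, corrected_errors,
                           threshold=10, min_hours=1):
    """Return the scrub cycle time to use for the next interval.

    Hypothetical policy: if a component logged at least `threshold`
    corrected errors in the last interval, scrub twice as often (halve the
    cycle time), but never go below `min_hours`. Otherwise keep the current
    setting.
    """
    if corrected_errors >= threshold:
        return max(min_hours, current_hours // 2)
    return current_hours
```

A real policy would also want a way to back off again once the error rate
drops, which is left out here to keep the sketch short.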

> Absolutely. One area that needs to improve (Dan raised it) is
> association with HPA ranges so we can easily correlate error reports
> with the relevant scrub engine. That can be done with existing version but
> it's fiddlier than it needs to be. This 'might' be a userspace script
> example, or maybe making associations tighter in kernel.

Right.
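
As a sketch of what the userspace side of that correlation could look like:
given a table of host physical address (HPA) ranges and the scrub engine
covering each, look up the engine for the address in an error report. The
table format, the engine names and the function name are hypothetical; how
the ranges would actually be discovered is exactly the open question above.

```python
import bisect


def engine_for_address(ranges, hpa):
    """Return the scrub engine whose range contains `hpa`, or None.

    ranges: list of (start, end_exclusive, engine_name) tuples, sorted by
    start and non-overlapping. This layout is an assumption for the sketch.
    """
    starts = [start for start, _end, _name in ranges]
    # Index of the last range starting at or below hpa (-1 if none).
    i = bisect.bisect_right(starts, hpa) - 1
    if i >= 0 and hpa < ranges[i][1]:
        return ranges[i][2]
    return None
```

With such a lookup, a tool that sees an error at a given HPA can pick the
one scrub engine to adjust instead of touching all of them.
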
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette