RE: [PATCH v12 1/2] ACPI:RAS2: Add ACPI RAS2 driver

From: Shiju Jose <shiju.jose@huawei.com>
To: Borislav Petkov <bp@alien8.de>
Cc: "rafael@kernel.org" <rafael@kernel.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"rppt@kernel.org" <rppt@kernel.org>,
	"dferguson@amperecomputing.com" <dferguson@amperecomputing.com>,
	"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
	"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
	"tony.luck@intel.com" <tony.luck@intel.com>,
	"lenb@kernel.org" <lenb@kernel.org>,
	"leo.duran@amd.com" <leo.duran@amd.com>,
	"Yazen.Ghannam@amd.com" <Yazen.Ghannam@amd.com>,
	"mchehab@kernel.org" <mchehab@kernel.org>,
	Jonathan Cameron <jonathan.cameron@huawei.com>,
	Linuxarm <linuxarm@huawei.com>,
	"rientjes@google.com" <rientjes@google.com>,
	"jiaqiyan@google.com" <jiaqiyan@google.com>,
	"Jon.Grimm@amd.com" <Jon.Grimm@amd.com>,
	"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
	"naoya.horiguchi@nec.com" <naoya.horiguchi@nec.com>,
	"james.morse@arm.com" <james.morse@arm.com>,
	"jthoughton@google.com" <jthoughton@google.com>,
	"somasundaram.a@hpe.com" <somasundaram.a@hpe.com>,
	"erdemaktas@google.com" <erdemaktas@google.com>,
	"pgonda@google.com" <pgonda@google.com>,
	"duenwen@google.com" <duenwen@google.com>,
	"gthelen@google.com" <gthelen@google.com>,
	"wschwartz@amperecomputing.com" <wschwartz@amperecomputing.com>,
	"wbs@os.amperecomputing.com" <wbs@os.amperecomputing.com>,
	"nifan.cxl@gmail.com" <nifan.cxl@gmail.com>,
	tanxiaofei <tanxiaofei@huawei.com>,
	"Zengtao (B)" <prime.zeng@hisilicon.com>,
	"Roberto Sassu" <roberto.sassu@huawei.com>,
	"kangkang.shen@futurewei.com" <kangkang.shen@futurewei.com>,
	wanghuiqiang <wanghuiqiang@huawei.com>
Subject: RE: [PATCH v12 1/2] ACPI:RAS2: Add ACPI RAS2 driver
Date: Mon, 15 Sep 2025 11:50:16 +0000	[thread overview]
Message-ID: <9433067c142b45d583eb96587b929878@huawei.com> (raw)
In-Reply-To: <20250912141155.GAaMQqK4vS8zHd1z4_@fat_crate.local>

>-----Original Message-----
>From: Borislav Petkov <bp@alien8.de>
>Sent: 12 September 2025 15:12
>To: Shiju Jose <shiju.jose@huawei.com>
>Cc: rafael@kernel.org; akpm@linux-foundation.org; rppt@kernel.org;
>dferguson@amperecomputing.com; linux-edac@vger.kernel.org; linux-
>acpi@vger.kernel.org; linux-mm@kvack.org; linux-doc@vger.kernel.org;
>tony.luck@intel.com; lenb@kernel.org; leo.duran@amd.com;
>Yazen.Ghannam@amd.com; mchehab@kernel.org; Jonathan Cameron
><jonathan.cameron@huawei.com>; Linuxarm <linuxarm@huawei.com>;
>rientjes@google.com; jiaqiyan@google.com; Jon.Grimm@amd.com;
>dave.hansen@linux.intel.com; naoya.horiguchi@nec.com;
>james.morse@arm.com; jthoughton@google.com; somasundaram.a@hpe.com;
>erdemaktas@google.com; pgonda@google.com; duenwen@google.com;
>gthelen@google.com; wschwartz@amperecomputing.com;
>wbs@os.amperecomputing.com; nifan.cxl@gmail.com; tanxiaofei
><tanxiaofei@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Roberto
>Sassu <roberto.sassu@huawei.com>; kangkang.shen@futurewei.com;
>wanghuiqiang <wanghuiqiang@huawei.com>
>Subject: Re: [PATCH v12 1/2] ACPI:RAS2: Add ACPI RAS2 driver
>
>On Fri, Sep 12, 2025 at 12:04:57PM +0000, Shiju Jose wrote:
>> >Why is this requirement here?
>> The physical memory address range retrieved here for the NUMA domain
>> is used in the subsequent patch  [PATCH v12 2/2] ras: mem: Add memory
>> ACPI RAS2 driver, 1. to set Requested Address Range(INPUT) field of
>> Table 5.87: Parameter Block Structure for PATROL_SCRUB when send
>> GET_PATROL_PARAMETERS command to the firmware, to get scrub
>parameters, running status, current scrub rate etc.
>> 2. for the validity check of the user requested memory address range to scrub.
>
>Again, why does it have to be *lowest* and *contiguous*?
>
>Your answer doesn't explain that.
This was added at Jonathan's suggestion, to account for interleaved NUMA nodes.
Link to the related discussion in v11:
https://lore.kernel.org/all/20250821100655.00003942@huawei.com/#t

With a PA address map like the following, the memory of node 0 is not a single contiguous range:
| node 0 | node 1 | node 0 |   PA address map.
Can you suggest what we should do about it?
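
To make the "lowest contiguous" wording concrete, here is a minimal userspace sketch, not the code from the patch. It assumes a hypothetical table of memory ranges sorted by ascending base address, each tagged with a node id. The names mem_range, pa_range and lowest_contiguous_range() are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one memory range in the PA map. */
struct mem_range {
	uint64_t base;
	uint64_t size;
	int nid;	/* NUMA node that owns this range */
};

/* Hypothetical result: a single base/size pair. */
struct pa_range {
	uint64_t base;
	uint64_t size;
};

/*
 * Find the lowest range that belongs to @nid and extend it over the
 * directly adjacent ranges of the same node. @r must be sorted by
 * ascending base. Returns 0 on success, -1 if the node has no memory.
 */
static int lowest_contiguous_range(const struct mem_range *r, size_t n,
				   int nid, struct pa_range *out)
{
	size_t i = 0;

	/* Skip ranges that belong to other nodes. */
	while (i < n && r[i].nid != nid)
		i++;
	if (i == n)
		return -1;

	out->base = r[i].base;
	out->size = r[i].size;

	/* Stop at the first gap or at the first range of another node. */
	for (i++; i < n; i++) {
		if (r[i].nid != nid || r[i].base != out->base + out->size)
			break;
		out->size += r[i].size;
	}
	return 0;
}
```

For the map above, this reports only the first node 0 block for node 0. The second node 0 block, which sits above node 1, is not covered by the resulting base/size pair.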

>
>> Also intended to expose this supported memory address range to the
>> userspace via EDAC scrub control interface, though it is not present now.
>
>Why? To tie ourselves with even more user ABI?!
>
>There better be a good reason and not a better design for what this is trying to
>do.

The "address_range_base" and "address_range_size" sysfs attributes (present until v13 of the
EDAC scrub interface) could have been used to publish this physical address range of the memory
in the NUMA domain to userspace when demand scrubbing is not in progress. However,
"address_range_base" was changed to report the status of on-demand scrubbing, based on the
feedback here:
https://lore.kernel.org/all/4ee36d03a2894606a571b37f440da36f@huawei.com/#t
Also, this requirement could not be presented for RAS2 earlier because, until recent versions,
no method had been found to retrieve the memory physical address range, as RAS2 does not provide it.

Use case for RAS2 publishing the supported PA range of the node memory to userspace:
On systems with multiple NUMA domains, support for demand scrubbing is exposed to the user
via the EDAC scrub interface as acpi_ras_mem0/scrub, acpi_ras_mem1/scrub, acpi_ras_mem2/scrub, etc.
When a userspace tool (e.g. rasdaemon) or an admin detects a faulty page or faulty address,
system policy may decide to scrub the corresponding memory. To do that, it is necessary to find
the EDAC scrub instance for the corresponding memory in the NUMA domain, set the scrub
parameters and issue the scrub request.
There are two options:
(1) Set the scrub parameters and issue the scrub request on all the EDAC scrub instances present
    for RAS2. The scrub request should fail for the invalid cases.
(2) Locate the EDAC scrub instance for the corresponding node memory by reading the published
    PA range and checking the address against it.

I think option (2) is better?
If so, can the EDAC scrub interface be updated to include attributes for publishing the supported
PA range of the memory device to scrub?
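
A minimal userspace sketch of the lookup in option (2), under the assumption that each scrub instance publishes one base/size pair. The struct scrub_instance and find_scrub_instance() names are hypothetical and are not part of the EDAC scrub interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one entry per EDAC scrub instance (acpi_ras_memN). */
struct scrub_instance {
	int id;			/* N in acpi_ras_memN */
	uint64_t pa_base;	/* published base of the supported PA range */
	uint64_t pa_size;	/* published size of the supported PA range */
};

/*
 * Return the id of the instance whose published range contains @addr,
 * or -1 if no instance covers it.
 */
static int find_scrub_instance(const struct scrub_instance *tbl, size_t n,
			       uint64_t addr)
{
	for (size_t i = 0; i < n; i++) {
		/* Compare as an offset so base + size cannot overflow. */
		if (addr >= tbl[i].pa_base &&
		    addr - tbl[i].pa_base < tbl[i].pa_size)
			return tbl[i].id;
	}
	return -1;
}
```

A tool would fill the table by reading the published range of each instance once, then call find_scrub_instance() with the faulty address and issue the scrub request only on the matching instance.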

>
>> >What happens with the aux devices you created successfully here? Unwind?
>> Please see the previous discussions on this were about allowing the
>> successfully created auxiliary devices to exist.
>> https://lore.kernel.org/all/20250415210504.GA854098@yaz-khff2.amd.com/
>
>There's no discussion here. And nothing answers the question "why" this is ok to
>do this way.

This was changed based on Yazen's feedback on v3 of the series.
Copy of Yazen's feedback from the above link:
=============================
> +	}
> +
> +	pcc_desc_list = (struct acpi_ras2_pcc_desc *)(ras2_tab + 1);
> +	for (i = 0; i < ras2_tab->num_pcc_descs; i++, pcc_desc_list++) {
> +		if (pcc_desc_list->feature_type != RAS2_FEAT_TYPE_MEMORY)
> +			continue;
> +
> +		rc = ras2_add_aux_device(RAS2_MEM_DEV_ID_NAME, pcc_desc_list->channel_id);
> +		if (rc)
> +			return rc;

This returns error on the first failure.

What if there was a success before? Does that aux_device need to be removed?

If not, then why return failure at all? Why not just try to add all devices? Some may fail and some may succeed.
============================= 

We thought the second option was better because an aux device that was added successfully for a
memory device, and its corresponding EDAC interface, continue to exist and support the scrub memory feature.
We do not mind stopping on the first failure to add an aux_device and freeing the previously created
aux devices, though that may require some additional dynamically allocated memory to track the
successfully created aux devices so that they can be freed on a later failure. Hope that is acceptable?
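
The two behaviours can be compared with a small userspace mock. This is only a sketch: mock_add_aux_device() and mock_remove_aux_device() are hypothetical stand-ins for the real add and remove paths, and a fixed-size array replaces the dynamically allocated tracking storage mentioned above.

```c
#include <stddef.h>

#define MAX_AUX_DEVS 8

/* Mock state: live[id] is 1 while the mock device for that id exists. */
static int live[MAX_AUX_DEVS];

/* Hypothetical stand-in for adding an aux device; fails for @fail_id. */
static int mock_add_aux_device(int channel_id, int fail_id)
{
	if (channel_id < 0 || channel_id >= MAX_AUX_DEVS ||
	    channel_id == fail_id)
		return -1;
	live[channel_id] = 1;
	return 0;
}

/* Hypothetical stand-in for removing an aux device. */
static void mock_remove_aux_device(int channel_id)
{
	live[channel_id] = 0;
}

/* Stop on the first failure and remove every device added so far. */
static int add_all_or_unwind(const int *ids, size_t n, int fail_id)
{
	int added[MAX_AUX_DEVS];	/* ids added successfully so far */
	size_t n_added = 0;
	int rc;

	if (n > MAX_AUX_DEVS)
		return -1;

	for (size_t i = 0; i < n; i++) {
		rc = mock_add_aux_device(ids[i], fail_id);
		if (rc) {
			/* Unwind in reverse order of creation. */
			while (n_added > 0)
				mock_remove_aux_device(added[--n_added]);
			return rc;
		}
		added[n_added++] = ids[i];
	}
	return 0;
}

/* Keep going on failure; return how many devices were added. */
static size_t add_best_effort(const int *ids, size_t n, int fail_id)
{
	size_t ok = 0;

	for (size_t i = 0; i < n; i++) {
		if (!mock_add_aux_device(ids[i], fail_id))
			ok++;
	}
	return ok;
}
```

The unwind variant needs the added[] bookkeeping, which is the extra storage cost noted above. The best-effort variant needs none, but leaves a partial set of devices behind when one addition fails.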

Thanks,
Shiju  

>
>--
>Regards/Gruss,
>    Boris.
>
>https://people.kernel.org/tglx/notes-about-netiquette
