linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Daniel Ferguson <danielf@os.amperecomputing.com>
To: Jonathan Cameron <jonathan.cameron@huawei.com>,
	Borislav Petkov <bp@alien8.de>
Cc: Shiju Jose <shiju.jose@huawei.com>,
	"rafael@kernel.org" <rafael@kernel.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"rppt@kernel.org" <rppt@kernel.org>,
	"dferguson@amperecomputing.com" <dferguson@amperecomputing.com>,
	"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
	"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>,
	"tony.luck@intel.com" <tony.luck@intel.com>,
	"lenb@kernel.org" <lenb@kernel.org>,
	"Yazen.Ghannam@amd.com" <Yazen.Ghannam@amd.com>,
	"mchehab@kernel.org" <mchehab@kernel.org>,
	Linuxarm <linuxarm@huawei.com>,
	"rientjes@google.com" <rientjes@google.com>,
	"jiaqiyan@google.com" <jiaqiyan@google.com>,
	"Jon.Grimm@amd.com" <Jon.Grimm@amd.com>,
	"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
	"naoya.horiguchi@nec.com" <naoya.horiguchi@nec.com>,
	"james.morse@arm.com" <james.morse@arm.com>,
	"jthoughton@google.com" <jthoughton@google.com>,
	"somasundaram.a@hpe.com" <somasundaram.a@hpe.com>,
	"erdemaktas@google.com" <erdemaktas@google.com>,
	"pgonda@google.com" <pgonda@google.com>,
	"duenwen@google.com" <duenwen@google.com>,
	"gthelen@google.com" <gthelen@google.com>,
	"wschwartz@amperecomputing.com" <wschwartz@amperecomputing.com>,
	"wbs@os.amperecomputing.com" <wbs@os.amperecomputing.com>,
	"nifan.cxl@gmail.com" <nifan.cxl@gmail.com>,
	tanxiaofei <tanxiaofei@huawei.com>,
	"Zengtao (B)" <prime.zeng@hisilicon.com>,
	Roberto Sassu <roberto.sassu@huawei.com>,
	"kangkang.shen@futurewei.com" <kangkang.shen@futurewei.com>,
	wanghuiqiang <wanghuiqiang@huawei.com>,
	vanshikonda@os.amperecomputing.com,
	danielf@os.amperecomputing.com
Subject: Re: [PATCH v12 1/2] ACPI:RAS2: Add ACPI RAS2 driver
Date: Thu, 18 Sep 2025 13:22:16 -0700	[thread overview]
Message-ID: <7a211c5c-174c-438b-9a98-fd47b057ea4a@os.amperecomputing.com> (raw)
In-Reply-To: <20250917183608.000038c4@huawei.com>



On 9/17/2025 10:36 AM, Jonathan Cameron wrote:
> On Wed, 17 Sep 2025 18:22:53 +0200
> Borislav Petkov <bp@alien8.de> wrote:
> 
>> On Mon, Sep 15, 2025 at 11:50:16AM +0000, Shiju Jose wrote:
>>> This has been added as suggested by Jonathan considering the interleaved NUMA node.
>>> Link to the related discussion in V11:
>>> https://lore.kernel.org/all/20250821100655.00003942@huawei.com/#t  
>>
>> Sorry, this doesn't work this way.
>>
>> If something in the code is being done which is not obvious and trivial, then
>> the reason for it is written down in a prominent place so that it is clear to
>> people.
>>
>> Not pointing to a discussion or some funky place on the web where someone
>> might've said something.
>>
>> Your patch submission should contain that info and not have reviewers ask for
>> it.
>>
>>> | node 0 | node 1 | node 0 |   PA address map.
>>> Can you give your suggestion what we should do about it?  
>>
>> I don't know what the problem is to begin with...
>>
>>> I think Option (2) seems better?  If so, can the EDAC scrub interface  be
>>> updated to include attributes for publishing the supported PA range for the
>>> memory device to scrub?  
>>
>> The memory ranges should already be available somewhere in the NUMA/mm code or
>> so and for starters, we should start a scrub for all ranges and do the
>> single-range only when there really is a good reason for it.
>>
>> Also, you don't have to expose any ranges to userspace in order to start
>> a scrub activity - you can simply start the scrub in the affected range
>> automatically.
>>
>> Like I preached the last time, your aim should be to make as much of the
>> variables that control the scrub automatic and not expose everything to
>> userspace so that some userspace tool decides. The tool should simply start
>> the scrub and the kernel should DTRT.
> 
> This 'first contiguous range' is an attempt to DTRT in a corner
> case that is real but where there is not an obvious right thing due to spec limitations.
> 
> The problem is the ACPI specification ties a controller to a NUMA / _PXM but
> then controls it via a single memory range with rules on whether that range can
> include holes (that are actually covered by a different controller).  There
> is no way to discover if two disconnected ranges may be scanned at once other
> that trying that.
> 
> Without resolving this corner, there is no way we could come up with for the kernel
> to DTRT.
> 
> Aim is to automatically establish the range that can be scrubbed. The corner
> case is the hole problem.  Alternative as discussed in earlier versions of
> this series was ignore holes.
> 
> The options discussed in earlier versions of this patch
> =======================================================
> 
> So with Node / PA mapping (happens because people want low memory addresses
> near to CPUs in different sockets - I simplified it here somewhat).
> 
> | Node 0 | Node 1 | Node 0 | Node 2|
> 
> 1. Hole skipping approach
> Have control parameters default to
> |       Node 0             |
>          | Node 1 |
>                            | Node 2|
> 2. Present part of range that is at least not including a hole where it might
> look like we were controlling memory scrub that we were not.
> | Node 0 |
>          | Node 1 |
>                            | Node 2|
> 
> 3. Just don't present anything and leave it up to general mm interfaces
>    to provide what 'should' be set.
> |
> All 3 here, 0 size.
> 
> Whilst nice to support, I'm not seeing 'default' as a key use case and
> changing scrub from a kernel driver without policy from userspace to
> me would be a wrong thing to do. (I'm not sure if you are suggesting this)
> 
> The scrub should always be running pre linux (true today on all systems that I'm
> aware of, but maybe not as universal as I think?).  If there is a need for
> a default scrub setting on boot of Linux, then sure we can add it.
> 
> This interface all about tweaking the settings not defaults (unlike the CXL
> case which does need the setting of defaults as well because of hotplug of
> the devices and lack of firmware involvement in that).
> 
> We are fine with any of the options above.
> 
> This was an attempt to respond to review feedback from Daniel - it was not
> something Huawei needs.
> 
> https://lore.kernel.org/all/547ed8fb-d6b7-4b6b-a38b-bf13223971b1@os.amperecomputing.com/
> 
> After a discussion of why 0, 0 defaults give an unexpected result...
> 
> "Proposed Solution:
> What we propose, is to instead of zeroing out the base and size after an error,
> use the full range of the current NUMA node. We believe that a superset of a
> currently active scrub range can properly report all the relevant and correct
> information."
> 
> The above Numa node pattern in PA space |0|1|0|2| etc is a thing
> that happens on real systems so if was the best that had come up in earlier
> discussion as an approximation of what Daniel asked for that should allow the
> right values to be queried.
> 
> I'm not entirely sure this even matters now they have resolved the shared
> PCC interface issue on their platforms.  Felt nice to provide meaningful
> defaults but maybe this is a problem we don't need to solve and can go back to
> just using 0, 0 until told to do something else.
> 
> Daniel, perhaps you can provide more info?

Thanks for pointing out the limitation of using NUMA base address + size as the
Requested Address Range to determine the status of the scrubber for that NUMA
domain.

> go back to just using 0, 0 until told to do something else.

Here is why I think (0,0) is not scalable:
Not every architecture would consider 0 to be an invalid DRAM address. For a
system where 0 is a valid base address, size 0 is an invalid argument. The
firmware would be correct to return "Status - 0001b" for this Requested Address
Range. This would be compliant to the ACPI spec as defined today.


Based on the discussion so far on all the threads, here's my attempt at
summarizing the spec gap. Please add if I'm missing anything:
The OS can't determine if the component is busy in a single RAS2 scrub call if
the NUMA range has holes. To correctly determine status of the component, the OS
would have to make multiple calls to cover all the disjoint memory ranges in the
NUMA domain.


Could this gap be addressed as follows?:
The patrol scrubber can only be busy due to a previously *successful* request
from the OS. The OS can keep track of the address range it submitted for this
request. The next time we try to submit a GET_PATROL_PARAMETERS command, we use
the previously requested range to check if the scrubber is busy. Additionally,
if you know the previous START_PATROL_SCRUBBER wasn't successful, then there is
no need to call GET_PATROL_PARAMETERS to determine if the scrubber is running.

At driver init time, use the first page of the NUMA range as the Requested
Address Range to invoke GET_PATROL_PARAMETERS - to get parameters that apply to
all addresses in the NUMA domain, i.e., "Scrub Parameters" and the "Extended
Data Region". The "Actual Address Range", "Flags" and "Status" fields will
change when the "Requested Address Range" changes but not "Scrub Parameters" and
"Extended Data Region" [1].

[1] - https://github.com/tianocore/edk2/issues/11353

Thanks,
~Daniel

> 
> Thanks,
> 
> Jonathan


  reply	other threads:[~2025-09-18 20:22 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-02 17:30 [PATCH v12 0/2] ACPI: Add support for ACPI RAS2 feature table shiju.jose
2025-09-02 17:30 ` [PATCH v12 1/2] ACPI:RAS2: Add ACPI RAS2 driver shiju.jose
2025-09-09 16:24   ` Yazen Ghannam
2025-09-10  8:38     ` Shiju Jose
2025-09-10 19:27   ` Borislav Petkov
2025-09-12 12:04     ` Shiju Jose
2025-09-12 14:11       ` Borislav Petkov
2025-09-15 11:50         ` Shiju Jose
2025-09-17 16:22           ` Borislav Petkov
2025-09-17 17:36             ` Jonathan Cameron
2025-09-18 20:22               ` Daniel Ferguson [this message]
2025-09-19 10:39               ` Borislav Petkov
2025-10-06 10:37                 ` Shiju Jose
2025-10-16 10:30                   ` Borislav Petkov
2025-10-17 12:54                     ` Shiju Jose
2025-10-24 18:13                       ` Daniel Ferguson
2025-11-03 13:19                       ` Borislav Petkov
2025-11-04 12:55                         ` Shiju Jose
2025-11-22 11:36                           ` Borislav Petkov
2025-09-02 17:30 ` [PATCH v12 2/2] ras: mem: Add memory " shiju.jose

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7a211c5c-174c-438b-9a98-fd47b057ea4a@os.amperecomputing.com \
    --to=danielf@os.amperecomputing.com \
    --cc=Jon.Grimm@amd.com \
    --cc=Yazen.Ghannam@amd.com \
    --cc=akpm@linux-foundation.org \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=dferguson@amperecomputing.com \
    --cc=duenwen@google.com \
    --cc=erdemaktas@google.com \
    --cc=gthelen@google.com \
    --cc=james.morse@arm.com \
    --cc=jiaqiyan@google.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=jthoughton@google.com \
    --cc=kangkang.shen@futurewei.com \
    --cc=lenb@kernel.org \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linuxarm@huawei.com \
    --cc=mchehab@kernel.org \
    --cc=naoya.horiguchi@nec.com \
    --cc=nifan.cxl@gmail.com \
    --cc=pgonda@google.com \
    --cc=prime.zeng@hisilicon.com \
    --cc=rafael@kernel.org \
    --cc=rientjes@google.com \
    --cc=roberto.sassu@huawei.com \
    --cc=rppt@kernel.org \
    --cc=shiju.jose@huawei.com \
    --cc=somasundaram.a@hpe.com \
    --cc=tanxiaofei@huawei.com \
    --cc=tony.luck@intel.com \
    --cc=vanshikonda@os.amperecomputing.com \
    --cc=wanghuiqiang@huawei.com \
    --cc=wbs@os.amperecomputing.com \
    --cc=wschwartz@amperecomputing.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox