From: Shiju Jose <shiju.jose@huawei.com>
To: Daniel Ferguson <danielf@os.amperecomputing.com>,
"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
"linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>
Cc: "bp@alien8.de" <bp@alien8.de>,
"rafael@kernel.org" <rafael@kernel.org>,
"tony.luck@intel.com" <tony.luck@intel.com>,
"lenb@kernel.org" <lenb@kernel.org>,
"leo.duran@amd.com" <leo.duran@amd.com>,
"Yazen.Ghannam@amd.com" <Yazen.Ghannam@amd.com>,
"mchehab@kernel.org" <mchehab@kernel.org>,
Jonathan Cameron <jonathan.cameron@huawei.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
Linuxarm <linuxarm@huawei.com>,
"rientjes@google.com" <rientjes@google.com>,
"jiaqiyan@google.com" <jiaqiyan@google.com>,
"Jon.Grimm@amd.com" <Jon.Grimm@amd.com>,
"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
"naoya.horiguchi@nec.com" <naoya.horiguchi@nec.com>,
"james.morse@arm.com" <james.morse@arm.com>,
"jthoughton@google.com" <jthoughton@google.com>,
"somasundaram.a@hpe.com" <somasundaram.a@hpe.com>,
"erdemaktas@google.com" <erdemaktas@google.com>,
"pgonda@google.com" <pgonda@google.com>,
"duenwen@google.com" <duenwen@google.com>,
"gthelen@google.com" <gthelen@google.com>,
"wschwartz@amperecomputing.com" <wschwartz@amperecomputing.com>,
"dferguson@amperecomputing.com" <dferguson@amperecomputing.com>,
"wbs@os.amperecomputing.com" <wbs@os.amperecomputing.com>,
"nifan.cxl@gmail.com" <nifan.cxl@gmail.com>,
tanxiaofei <tanxiaofei@huawei.com>,
"Zengtao (B)" <prime.zeng@hisilicon.com>,
Roberto Sassu <roberto.sassu@huawei.com>,
"kangkang.shen@futurewei.com" <kangkang.shen@futurewei.com>,
wanghuiqiang <wanghuiqiang@huawei.com>
Subject: RE: [PATCH v5 1/2] ACPI:RAS2: Add ACPI RAS2 driver
Date: Fri, 16 May 2025 13:36:07 +0000 [thread overview]
Message-ID: <fd2e3f1bf37d45b59daa883f0b5c96d5@huawei.com> (raw)
In-Reply-To: <2dda5ebd-3bd1-4ab3-9722-02590094d6ac@os.amperecomputing.com>
>-----Original Message-----
>From: Daniel Ferguson <danielf@os.amperecomputing.com>
>Sent: 16 May 2025 02:49
>To: Shiju Jose <shiju.jose@huawei.com>; linux-edac@vger.kernel.org; linux-
>acpi@vger.kernel.org; linux-doc@vger.kernel.org
>Cc: bp@alien8.de; rafael@kernel.org; tony.luck@intel.com; lenb@kernel.org;
>leo.duran@amd.com; Yazen.Ghannam@amd.com; mchehab@kernel.org;
>Jonathan Cameron <jonathan.cameron@huawei.com>; linux-mm@kvack.org;
>Linuxarm <linuxarm@huawei.com>; rientjes@google.com;
>jiaqiyan@google.com; Jon.Grimm@amd.com; dave.hansen@linux.intel.com;
>naoya.horiguchi@nec.com; james.morse@arm.com; jthoughton@google.com;
>somasundaram.a@hpe.com; erdemaktas@google.com; pgonda@google.com;
>duenwen@google.com; gthelen@google.com;
>wschwartz@amperecomputing.com; dferguson@amperecomputing.com;
>wbs@os.amperecomputing.com; nifan.cxl@gmail.com; tanxiaofei
><tanxiaofei@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Roberto
>Sassu <roberto.sassu@huawei.com>; kangkang.shen@futurewei.com;
>wanghuiqiang <wanghuiqiang@huawei.com>
>Subject: Re: [PATCH v5 1/2] ACPI:RAS2: Add ACPI RAS2 driver
>
>
>
>On 5/14/2025 4:31 AM, Shiju Jose wrote:
>>> -----Original Message-----
>>> From: Daniel Ferguson <danielf@os.amperecomputing.com>
>>> Sent: 14 May 2025 03:55
>>> To: Shiju Jose <shiju.jose@huawei.com>; linux-edac@vger.kernel.org;
>>> linux- acpi@vger.kernel.org; linux-doc@vger.kernel.org
>>> Cc: bp@alien8.de; rafael@kernel.org; tony.luck@intel.com;
>>> lenb@kernel.org; leo.duran@amd.com; Yazen.Ghannam@amd.com;
>>> mchehab@kernel.org; Jonathan Cameron
><jonathan.cameron@huawei.com>;
>>> linux-mm@kvack.org; Linuxarm <linuxarm@huawei.com>;
>>> rientjes@google.com; jiaqiyan@google.com; Jon.Grimm@amd.com;
>>> dave.hansen@linux.intel.com; naoya.horiguchi@nec.com;
>>> james.morse@arm.com; jthoughton@google.com;
>somasundaram.a@hpe.com;
>>> erdemaktas@google.com; pgonda@google.com; duenwen@google.com;
>>> gthelen@google.com; wschwartz@amperecomputing.com;
>>> dferguson@amperecomputing.com; wbs@os.amperecomputing.com;
>>> nifan.cxl@gmail.com; tanxiaofei <tanxiaofei@huawei.com>; Zengtao (B)
>>> <prime.zeng@hisilicon.com>; Roberto Sassu <roberto.sassu@huawei.com>;
>>> kangkang.shen@futurewei.com; wanghuiqiang
><wanghuiqiang@huawei.com>
>>> Subject: Re: [PATCH v5 1/2] ACPI:RAS2: Add ACPI RAS2 driver
>>>
>>>> +static int ras2_report_cap_error(u32 cap_status) {
>>>> + switch (cap_status) {
>>>> + case ACPI_RAS2_NOT_VALID:
>>>> + case ACPI_RAS2_NOT_SUPPORTED:
>>>> + return -EPERM;
>>>> + case ACPI_RAS2_BUSY:
>>>> + return -EBUSY;
>>>> + case ACPI_RAS2_FAILED:
>>>> + case ACPI_RAS2_ABORTED:
>>>> + case ACPI_RAS2_INVALID_DATA:
>>>> + return -EINVAL;
>>>> + default: /* 0 or other, Success */
>>>> + return 0;
>>>> + }
>>>> +}
>>>> +
>>>> +static int ras2_check_pcc_chan(struct ras2_pcc_subspace
>>>> +*pcc_subspace) {
>>>> + struct acpi_ras2_shmem __iomem *gen_comm_base = pcc_subspace-
>>>> comm_addr;
>>>> + u32 cap_status;
>>>> + u16 status;
>>>> + u32 rc;
>>>> +
>>>> + /*
>>>> + * As per ACPI spec, the PCC space will be initialized by
>>>> + * platform and should have set the command completion bit when
>>>> + * PCC can be used by OSPM.
>>>> + *
>>>> + * Poll PCC status register every 3us(delay_us) for maximum of
>>>> + * deadline_us(timeout_us) until PCC command complete bit is
>>> set(cond).
>>>> + */
>>>> + rc = readw_relaxed_poll_timeout(&gen_comm_base->status, status,
>>>> + status &
>>> PCC_STATUS_CMD_COMPLETE, 3,
>>>> + pcc_subspace->deadline_us);
>>>> + if (rc) {
>>>> + pr_warn("PCC check channel failed for : %d rc=%d\n",
>>>> + pcc_subspace->pcc_id, rc);
>>>> + return rc;
>>>> + }
>>>> +
>>>> + if (status & PCC_STATUS_ERROR) {
>>>> + cap_status = readw_relaxed(&gen_comm_base-
>>>> set_caps_status);
>>>> + rc = ras2_report_cap_error(cap_status);
>>>> +
>>>> + status &= ~PCC_STATUS_ERROR;
>>>> + writew_relaxed(status, &gen_comm_base->status);
>>>> + return rc;
>>>> + }
>>>> +
>>>> + if (status & PCC_STATUS_CMD_COMPLETE)
>>>> + return 0;
>>>> +
>>>> + return -EIO;
>>>> +}
>>>
>>> We still have an outstanding problem. This may sound familiar.
>>>
>>> If a user specifies an invalid address, our firmware will set an
>>> error code in the set_caps_status field of the acpi_ras2_shmem
>>> structure. In our case, the error code is ACPI_RAS2_INVALID_DATA, and
>>> the user will observe an EINVAL. This is expected.
>>>
>>> However, if the user then subsequently attempts to write a VALID
>>> address, ras2_get_patrol_scrub_running will indirectly call
>>> ras2_check_pcc_chan using the previously INVALID address to determine if
>the scrubber is still running.
>>> Unfortunately, the INVALID address causes
>>> ras2_get_patrol_scrub_running to fail, therefore preventing the user
>>> from specifying a VALID address after specifying an INVALID address.
>>>
>>> The only way to move forward from this inescapable condition is to
>>> reboot the system.
>>>
>>> Here is a demo of the problem as I roughly see it on our system (I've
>>> labeled the line numbers for sake of discussion):
>>> 1 [root@myhost scrub0]# echo 0x100000000 > size
>>> 2 [root@myhost scrub0]# echo 0x1f00000000 > addr
>>> 3 [root@myhost scrub0]# echo 0xcf00000000 > addr
>>> 4 write error: Invalid argument
>>> 5 [ 214.446338] PCCT PCCT: Failed to start demand scrubbing
>>> 6 [root@myhost scrub0]# echo 0x1f00000000 > addr
>>> 7 write error: Invalid argument
>>> 8 [ 242.263909] PCCT PCCT: failed to read parameters
>>> 9 [root@myhost scrub0]# echo 0x100000000 > size
>>> 10 write error: Invalid argument
>>> 11 [ 246.190196] PCCT PCCT: failed to read parameters
>>>
>>> The upper most memory address on this system is 0xbf00000000. Line 1
>>> and 2 use valid values, and line 2 produces the expected results. On
>>> line 3, I've specified an INVALID address (outside of valid range).
>>> The error on line 5 is expected after executing the
>>> START_PATROL_SCRUBBER command with an INVALID address.
>>>
>>> Line 6 show how I attempt to specify a VALID address. Unfortunately,
>>> ras2_get_patrol_scrub_running encounters and error after executing
>>> GET_PATROL_PARAMETERS because it used the OLD INVALID values in
>>> ps_sm-
>>>> params.req_addr_range. Line 7 and 8 are the result. Since the flow
>>>> of
>>> execution if aborted at this point, you can never rectify the
>>> situation and insert a valid value into ps_sm->params.req_addr_range, unless
>you reboot the system.
>>>
>>> One half baked solution to this problem, is to modify
>>> ras2_get_patrol_scrub_running so that if there is a non-zero address
>>> or size specified, AND the last error code we received was INVALID
>>> DATA, then assume the scrubber is NOT running.
>> Hi Daniel,
>>
>> Thanks for reporting the issue.
>> Can you check whether following change fix the issue in your test setup?
>> =============================================
>> diff --git a/drivers/ras/acpi_ras2.c b/drivers/ras/acpi_ras2.c index
>> 4d9cfd3bdf45..ff4aa1b75860 100644
>> --- a/drivers/ras/acpi_ras2.c
>> +++ b/drivers/ras/acpi_ras2.c
>> @@ -255,6 +255,13 @@ static int ras2_hw_scrub_write_addr(struct device
>*dev, void *drv_data, u64 base
>> ret = ras2_send_pcc_cmd(ras2_ctx, PCC_CMD_EXEC_RAS2);
>> if (ret) {
>> dev_err(ras2_ctx->dev, "Failed to start demand
>> scrubbing\n");
>> + if (ret == -EPERM || ret == -EINVAL) {
>> + ps_sm->params.req_addr_range[0] = 0;
>> + ps_sm->params.req_addr_range[1] = 0;
>> + ras2_ctx->base = 0;
>> + ras2_ctx->size = 0;
>> + ras2_ctx->od_scrub_sts = OD_SCRUB_STS_IDLE;
>> + }
>> return ret;
>> }
>> ras2_ctx->od_scrub_sts = OD_SCRUB_STS_ACTIVE;
>> ==============================================
>> Thanks,
>> Shiju
>
>We're happy! with this fix.
>
>For this to work, we had to no-op the START_PATROL_SCRUBBER and
>GET_PATROL_PARAMETERS commands when base and size are equal to zero.
>Previously, we returned INVALID DATA when base and size were zero.
Thanks Daniel for verifying the changes.
For demand scrubbing, kernel does not send START_PATROL_SCRUBBER command
when size is zero.
However GET_PATROL_PARAMETERS command from ras2_update_patrol_scrub_params_cache()
is valid case when base and size are equal to zero.
>
>Maybe we should amend the ACPI spec with this special knowledge.
>
>Anyways; as of now, with this fix, this driver will work out of the box on our
>systems.
Please tag V6 sent with fix.
https://lore.kernel.org/all/20250516132205.789-1-shiju.jose@huawei.com/
>
>Thanks a lot,
>~Daniel
>>
>>>
>>> Regards,
>>> ~Daniel
>
Thanks,
Shiju
next prev parent reply other threads:[~2025-05-16 13:36 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-05-07 21:43 [PATCH v5 0/2] ACPI: Add support for ACPI RAS2 feature table shiju.jose
2025-05-07 21:43 ` [PATCH v5 1/2] ACPI:RAS2: Add ACPI RAS2 driver shiju.jose
2025-05-14 2:55 ` Daniel Ferguson
2025-05-14 11:31 ` Shiju Jose
2025-05-16 1:49 ` Daniel Ferguson
2025-05-16 13:36 ` Shiju Jose [this message]
2025-05-07 21:43 ` [PATCH v5 2/2] ras: mem: Add memory " shiju.jose
2025-05-12 8:16 ` [PATCH v5 0/2] ACPI: Add support for ACPI RAS2 feature table Jonathan Cameron
2025-05-12 8:38 ` Borislav Petkov
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=fd2e3f1bf37d45b59daa883f0b5c96d5@huawei.com \
--to=shiju.jose@huawei.com \
--cc=Jon.Grimm@amd.com \
--cc=Yazen.Ghannam@amd.com \
--cc=bp@alien8.de \
--cc=danielf@os.amperecomputing.com \
--cc=dave.hansen@linux.intel.com \
--cc=dferguson@amperecomputing.com \
--cc=duenwen@google.com \
--cc=erdemaktas@google.com \
--cc=gthelen@google.com \
--cc=james.morse@arm.com \
--cc=jiaqiyan@google.com \
--cc=jonathan.cameron@huawei.com \
--cc=jthoughton@google.com \
--cc=kangkang.shen@futurewei.com \
--cc=lenb@kernel.org \
--cc=leo.duran@amd.com \
--cc=linux-acpi@vger.kernel.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-edac@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linuxarm@huawei.com \
--cc=mchehab@kernel.org \
--cc=naoya.horiguchi@nec.com \
--cc=nifan.cxl@gmail.com \
--cc=pgonda@google.com \
--cc=prime.zeng@hisilicon.com \
--cc=rafael@kernel.org \
--cc=rientjes@google.com \
--cc=roberto.sassu@huawei.com \
--cc=somasundaram.a@hpe.com \
--cc=tanxiaofei@huawei.com \
--cc=tony.luck@intel.com \
--cc=wanghuiqiang@huawei.com \
--cc=wbs@os.amperecomputing.com \
--cc=wschwartz@amperecomputing.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox