linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
To: Shiju Jose <shiju.jose@huawei.com>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
	"linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
	"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"bp@alien8.de" <bp@alien8.de>,
	"tony.luck@intel.com" <tony.luck@intel.com>,
	"rafael@kernel.org" <rafael@kernel.org>,
	"lenb@kernel.org" <lenb@kernel.org>,
	"mchehab@kernel.org" <mchehab@kernel.org>,
	"dan.j.williams@intel.com" <dan.j.williams@intel.com>,
	"dave@stgolabs.net" <dave@stgolabs.net>,
	"Jonathan Cameron" <jonathan.cameron@huawei.com>,
	"dave.jiang@intel.com" <dave.jiang@intel.com>,
	"alison.schofield@intel.com" <alison.schofield@intel.com>,
	"vishal.l.verma@intel.com" <vishal.l.verma@intel.com>,
	"ira.weiny@intel.com" <ira.weiny@intel.com>,
	"david@redhat.com" <david@redhat.com>,
	"Vilas.Sridharan@amd.com" <Vilas.Sridharan@amd.com>,
	"leo.duran@amd.com" <leo.duran@amd.com>,
	"Yazen.Ghannam@amd.com" <Yazen.Ghannam@amd.com>,
	"rientjes@google.com" <rientjes@google.com>,
	"jiaqiyan@google.com" <jiaqiyan@google.com>,
	"Jon.Grimm@amd.com" <Jon.Grimm@amd.com>,
	"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
	"naoya.horiguchi@nec.com" <naoya.horiguchi@nec.com>,
	"james.morse@arm.com" <james.morse@arm.com>,
	"jthoughton@google.com" <jthoughton@google.com>,
	"somasundaram.a@hpe.com" <somasundaram.a@hpe.com>,
	"erdemaktas@google.com" <erdemaktas@google.com>,
	"pgonda@google.com" <pgonda@google.com>,
	"duenwen@google.com" <duenwen@google.com>,
	"gthelen@google.com" <gthelen@google.com>,
	"wschwartz@amperecomputing.com" <wschwartz@amperecomputing.com>,
	"dferguson@amperecomputing.com" <dferguson@amperecomputing.com>,
	"wbs@os.amperecomputing.com" <wbs@os.amperecomputing.com>,
	"nifan.cxl@gmail.com" <nifan.cxl@gmail.com>,
	tanxiaofei <tanxiaofei@huawei.com>,
	"Zengtao (B)" <prime.zeng@hisilicon.com>,
	"Roberto Sassu" <roberto.sassu@huawei.com>,
	"kangkang.shen@futurewei.com" <kangkang.shen@futurewei.com>,
	wanghuiqiang <wanghuiqiang@huawei.com>,
	Linuxarm <linuxarm@huawei.com>
Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
Date: Tue, 14 Jan 2025 15:26:17 +0100	[thread overview]
Message-ID: <20250114152617.14eb41b5@foz.lan> (raw)
In-Reply-To: <df8b3c3bffd24e1e8eb05b2ec53b3c58@huawei.com>

Em Tue, 14 Jan 2025 12:31:44 +0000
Shiju Jose <shiju.jose@huawei.com> escreveu:

> Hi Mauro,
> 
> Thanks for the comments.
> 
> >-----Original Message-----
> >From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> >Sent: 14 January 2025 11:48
> >To: Shiju Jose <shiju.jose@huawei.com>
> >Cc: linux-edac@vger.kernel.org; linux-cxl@vger.kernel.org; linux-
> >acpi@vger.kernel.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org;
> >bp@alien8.de; tony.luck@intel.com; rafael@kernel.org; lenb@kernel.org;
> >mchehab@kernel.org; dan.j.williams@intel.com; dave@stgolabs.net; Jonathan
> >Cameron <jonathan.cameron@huawei.com>; dave.jiang@intel.com;
> >alison.schofield@intel.com; vishal.l.verma@intel.com; ira.weiny@intel.com;
> >david@redhat.com; Vilas.Sridharan@amd.com; leo.duran@amd.com;
> >Yazen.Ghannam@amd.com; rientjes@google.com; jiaqiyan@google.com;
> >Jon.Grimm@amd.com; dave.hansen@linux.intel.com;
> >naoya.horiguchi@nec.com; james.morse@arm.com; jthoughton@google.com;
> >somasundaram.a@hpe.com; erdemaktas@google.com; pgonda@google.com;
> >duenwen@google.com; gthelen@google.com;
> >wschwartz@amperecomputing.com; dferguson@amperecomputing.com;
> >wbs@os.amperecomputing.com; nifan.cxl@gmail.com; tanxiaofei
> ><tanxiaofei@huawei.com>; Zengtao (B) <prime.zeng@hisilicon.com>; Roberto
> >Sassu <roberto.sassu@huawei.com>; kangkang.shen@futurewei.com;
> >wanghuiqiang <wanghuiqiang@huawei.com>; Linuxarm
> ><linuxarm@huawei.com>
> >Subject: Re: [PATCH v18 04/19] EDAC: Add memory repair control feature
> >
> >Em Mon, 6 Jan 2025 12:10:00 +0000
> ><shiju.jose@huawei.com> escreveu:
> >  
> >> From: Shiju Jose <shiju.jose@huawei.com>
> >>
> >> Add a generic EDAC memory repair control driver to manage memory repairs
> >> in the system, such as CXL Post Package Repair (PPR) and CXL memory sparing
> >> features.
> >>
> >> For example, a CXL device with DRAM components that support PPR features
> >> may implement PPR maintenance operations. DRAM components may support  
> >two  
> >> types of PPR, hard PPR, for a permanent row repair, and soft PPR,  for a
> >> temporary row repair. Soft PPR is much faster than hard PPR, but the repair
> >> is lost with a power cycle.
> >> Similarly a CXL memory device may support soft and hard memory sparing at
> >> cacheline, row, bank and rank granularities. Memory sparing is defined as
> >> a repair function that replaces a portion of memory with a portion of
> >> functional memory at that same granularity.
> >> When a CXL device detects an error in a memory, it may report the host of
> >> the need for a repair maintenance operation by using an event record where
> >> the "maintenance needed" flag is set. The event records contains the device
> >> physical address(DPA) and other attributes of the memory to repair (such as
> >> channel, sub-channel, bank group, bank, rank, row, column etc). The kernel
> >> will report the corresponding CXL general media or DRAM trace event to
> >> userspace, and userspace tools (e.g. rasdaemon) will initiate a repair
> >> operation in response to the device request via the sysfs repair control.
> >>
> >> Device with memory repair features registers with EDAC device driver,
> >> which retrieves memory repair descriptor from EDAC memory repair driver
> >> and exposes the sysfs repair control attributes to userspace in
> >> /sys/bus/edac/devices/<dev-name>/mem_repairX/.
> >>
> >> The common memory repair control interface abstracts the control of
> >> arbitrary memory repair functionality into a standardized set of functions.
> >> The sysfs memory repair attribute nodes are only available if the client
> >> driver has implemented the corresponding attribute callback function and
> >> provided operations to the EDAC device driver during registration.
> >>
> >> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> >> ---
> >>  .../ABI/testing/sysfs-edac-memory-repair      | 244 +++++++++
> >>  Documentation/edac/features.rst               |   3 +
> >>  Documentation/edac/index.rst                  |   1 +
> >>  Documentation/edac/memory_repair.rst          | 101 ++++
> >>  drivers/edac/Makefile                         |   2 +-
> >>  drivers/edac/edac_device.c                    |  33 ++
> >>  drivers/edac/mem_repair.c                     | 492 ++++++++++++++++++
> >>  include/linux/edac.h                          | 139 +++++
> >>  8 files changed, 1014 insertions(+), 1 deletion(-)
> >>  create mode 100644 Documentation/ABI/testing/sysfs-edac-memory-repair
> >>  create mode 100644 Documentation/edac/memory_repair.rst
> >>  create mode 100755 drivers/edac/mem_repair.c
> >>
> >> diff --git a/Documentation/ABI/testing/sysfs-edac-memory-repair  
> >b/Documentation/ABI/testing/sysfs-edac-memory-repair  
> >> new file mode 100644
> >> index 000000000000..e9268f3780ed
> >> --- /dev/null
> >> +++ b/Documentation/ABI/testing/sysfs-edac-memory-repair
> >> @@ -0,0 +1,244 @@
> >> +What:		/sys/bus/edac/devices/<dev-name>/mem_repairX
> >> +Date:		Jan 2025
> >> +KernelVersion:	6.14
> >> +Contact:	linux-edac@vger.kernel.org
> >> +Description:
> >> +		The sysfs EDAC bus devices /<dev-name>/mem_repairX  
> >subdirectory  
> >> +		pertains to the memory media repair features control, such as
> >> +		PPR (Post Package Repair), memory sparing etc, where<dev-
> >name>
> >> +		directory corresponds to a device registered with the EDAC
> >> +		device driver for the memory repair features.
> >> +
> >> +		Post Package Repair is a maintenance operation requests the  
> >memory  
> >> +		device to perform a repair operation on its media, in detail is a
> >> +		memory self-healing feature that fixes a failing memory  
> >location by  
> >> +		replacing it with a spare row in a DRAM device. For example, a
> >> +		CXL memory device with DRAM components that support PPR  
> >features may  
> >> +		implement PPR maintenance operations. DRAM components  
> >may support  
> >> +		two types of PPR functions: hard PPR, for a permanent row  
> >repair, and  
> >> +		soft PPR, for a temporary row repair. soft PPR is much faster  
> >than  
> >> +		hard PPR, but the repair is lost with a power cycle.
> >> +
> >> +		Memory sparing is a repair function that replaces a portion
> >> +		of memory with a portion of functional memory at that same
> >> +		sparing granularity. Memory sparing has  
> >cacheline/row/bank/rank  
> >> +		sparing granularities. For example, in memory-sparing mode,
> >> +		one memory rank serves as a spare for other ranks on the same
> >> +		channel in case they fail. The spare rank is held in reserve and
> >> +		not used as active memory until a failure is indicated, with
> >> +		reserved capacity subtracted from the total available memory
> >> +		in the system.The DIMM installation order for memory sparing
> >> +		varies based on the number of processors and memory modules
> >> +		installed in the server. After an error threshold is surpassed
> >> +		in a system protected by memory sparing, the content of a  
> >failing  
> >> +		rank of DIMMs is copied to the spare rank. The failing rank is
> >> +		then taken offline and the spare rank placed online for use as
> >> +		active memory in place of the failed rank.
> >> +
> >> +		The sysfs attributes nodes for a repair feature are only
> >> +		present if the parent driver has implemented the corresponding
> >> +		attr callback function and provided the necessary operations
> >> +		to the EDAC device driver during registration.
> >> +
> >> +		In some states of system configuration (e.g. before address
> >> +		decoders have been configured), memory devices (e.g. CXL)
> >> +		may not have an active mapping in the main host address
> >> +		physical address map. As such, the memory to repair must be
> >> +		identified by a device specific physical addressing scheme
> >> +		using a device physical address(DPA). The DPA and other control
> >> +		attributes to use will be presented in related error records.
> >> +
> >> +What:		/sys/bus/edac/devices/<dev-
> >name>/mem_repairX/repair_function
> >> +Date:		Jan 2025
> >> +KernelVersion:	6.14
> >> +Contact:	linux-edac@vger.kernel.org
> >> +Description:
> >> +		(RO) Memory repair function type. For eg. post package repair,
> >> +		memory sparing etc.
> >> +		EDAC_SOFT_PPR - Soft post package repair
> >> +		EDAC_HARD_PPR - Hard post package repair
> >> +		EDAC_CACHELINE_MEM_SPARING - Cacheline memory sparing
> >> +		EDAC_ROW_MEM_SPARING - Row memory sparing
> >> +		EDAC_BANK_MEM_SPARING - Bank memory sparing
> >> +		EDAC_RANK_MEM_SPARING - Rank memory sparing
> >> +		All other values are reserved.  
> >
> >Too big strings. Why are them in upper cases? IMO:
> >
> >	soft-ppr, hard-ppr, ... would be enough.
> >  
> Here return repair type (single value, such as 0, 1, or 2 etc not as decoded string  for eg."EDAC_SOFT_PPR")
> of the memory repair instance, which is  defined as enums (EDAC_SOFT_PPR, EDAC_HARD_PPR, ... etc) 
> for the memory repair interface in the include/linux/edac.h.
> 
> enum edac_mem_repair_function {
> 	EDAC_SOFT_PPR,
> 	EDAC_HARD_PPR,
> 	EDAC_CACHELINE_MEM_SPARING,
> 	EDAC_ROW_MEM_SPARING,
> 	EDAC_BANK_MEM_SPARING,
> 	EDAC_RANK_MEM_SPARING,
> };
>   
> I documented return value in terms of the above enums.

The ABI documentation describes exactly what numeric/strings values will be there.
So, if you place:

	EDAC_SOFT_PPR

It means a string with EDAC_SOFT_PPR, not a numeric zero value.

Also, as I explained at:
	 https://lore.kernel.org/linux-edac/1bf421f9d1924d68860d08c70829a705@huawei.com/T/#m1e60da13198b47701a4c2f740d4b78701f912d2d

it doesn't make sense to report soft/hard PPR, as the persist mode
is designed to be on a different sysfs devnode (/persist_mode on your
proposal).

So, here you need to fold EDAC_SOFT_PPR and EDAC_HARD_PPR into a single
value ("ppr").

-

Btw, very few sysfs nodes use numbers for things that can be mapped with 
enums:

	$ git grep -l "\- 0" Documentation/ABI|wc -l
	20
	(several of those are actually false-positives)

and this is done mostly when it reports what the hardware actually
outputs when reading some register.

Thanks,
Mauro


  reply	other threads:[~2025-01-14 14:26 UTC|newest]

Thread overview: 87+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-06 12:09 [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers shiju.jose
2025-01-06 12:09 ` [PATCH v18 01/19] EDAC: Add support for EDAC device features control shiju.jose
2025-01-06 13:37   ` Borislav Petkov
2025-01-06 14:48     ` Shiju Jose
2025-01-13 15:06   ` Mauro Carvalho Chehab
2025-01-14  9:55     ` Jonathan Cameron
2025-01-14 10:08     ` Shiju Jose
2025-01-14 11:33       ` Mauro Carvalho Chehab
2025-01-30 19:18   ` Daniel Ferguson
2025-01-06 12:09 ` [PATCH v18 02/19] EDAC: Add scrub control feature shiju.jose
2025-01-06 15:57   ` Borislav Petkov
2025-01-06 19:34     ` Shiju Jose
2025-01-07  7:32       ` Borislav Petkov
2025-01-07  9:23         ` Shiju Jose
2025-01-08 15:47         ` Shiju Jose
2025-01-13 15:50   ` Mauro Carvalho Chehab
2025-01-30 19:18   ` Daniel Ferguson
2025-01-06 12:09 ` [PATCH v18 03/19] EDAC: Add ECS " shiju.jose
2025-01-13 16:09   ` Mauro Carvalho Chehab
2025-01-06 12:10 ` [PATCH v18 04/19] EDAC: Add memory repair " shiju.jose
2025-01-09  9:19   ` Borislav Petkov
2025-01-09 11:00     ` Shiju Jose
2025-01-09 12:32       ` Borislav Petkov
2025-01-09 14:24         ` Jonathan Cameron
2025-01-09 15:18           ` Borislav Petkov
2025-01-09 16:01             ` Jonathan Cameron
2025-01-09 16:19               ` Borislav Petkov
2025-01-09 18:34                 ` Jonathan Cameron
2025-01-09 23:51                   ` Dan Williams
2025-01-10 11:01                     ` Jonathan Cameron
2025-01-10 22:49                       ` Dan Williams
2025-01-13 11:40                         ` Jonathan Cameron
2025-01-14 19:35                           ` Dan Williams
2025-01-15 10:07                             ` Jonathan Cameron
2025-01-15 11:35                             ` Mauro Carvalho Chehab
2025-01-11 17:12                   ` Borislav Petkov
2025-01-13 11:07                     ` Jonathan Cameron
2025-01-21 16:16                       ` Borislav Petkov
2025-01-21 18:16                         ` Jonathan Cameron
2025-01-22 19:09                           ` Borislav Petkov
2025-02-06 13:39                             ` Jonathan Cameron
2025-02-17 13:23                               ` Borislav Petkov
2025-02-18 16:51                                 ` Jonathan Cameron
2025-02-19 18:45                                   ` Borislav Petkov
2025-02-20 12:19                                     ` Jonathan Cameron
2025-01-14 13:10                   ` Mauro Carvalho Chehab
2025-01-14 12:57               ` Mauro Carvalho Chehab
2025-01-14 12:38           ` Mauro Carvalho Chehab
2025-01-14 13:05             ` Jonathan Cameron
2025-01-14 14:39               ` Mauro Carvalho Chehab
2025-01-14 11:47   ` Mauro Carvalho Chehab
2025-01-14 12:31     ` Shiju Jose
2025-01-14 14:26       ` Mauro Carvalho Chehab [this message]
2025-01-14 13:47   ` Mauro Carvalho Chehab
2025-01-14 14:30     ` Shiju Jose
2025-01-15 12:03       ` Mauro Carvalho Chehab
2025-01-06 12:10 ` [PATCH v18 05/19] ACPI:RAS2: Add ACPI RAS2 driver shiju.jose
2025-01-21 23:01   ` Daniel Ferguson
2025-01-22 15:38     ` Shiju Jose
2025-01-30 19:19   ` Daniel Ferguson
2025-01-06 12:10 ` [PATCH v18 06/19] ras: mem: Add memory " shiju.jose
2025-01-21 23:01   ` Daniel Ferguson
2025-01-30 19:19   ` Daniel Ferguson
2025-01-06 12:10 ` [PATCH v18 07/19] cxl: Refactor user ioctl command path from mds to mailbox shiju.jose
2025-01-06 12:10 ` [PATCH v18 08/19] cxl: Add skeletal features driver shiju.jose
2025-01-06 12:10 ` [PATCH v18 09/19] cxl: Enumerate feature commands shiju.jose
2025-01-06 12:10 ` [PATCH v18 10/19] cxl: Add Get Supported Features command for kernel usage shiju.jose
2025-01-06 12:10 ` [PATCH v18 11/19] cxl: Add features driver attribute to emit number of features supported shiju.jose
2025-01-06 12:10 ` [PATCH v18 12/19] cxl/mbox: Add GET_FEATURE mailbox command shiju.jose
2025-01-06 12:10 ` [PATCH v18 13/19] cxl/mbox: Add SET_FEATURE " shiju.jose
2025-01-06 12:10 ` [PATCH v18 14/19] cxl: Setup exclusive CXL features that are reserved for the kernel shiju.jose
2025-01-06 12:10 ` [PATCH v18 15/19] cxl/memfeature: Add CXL memory device patrol scrub control feature shiju.jose
2025-01-24 20:38   ` Dan Williams
2025-01-27 10:06     ` Jonathan Cameron
2025-01-27 12:53     ` Shiju Jose
2025-01-27 23:17       ` Dan Williams
2025-01-29 12:28         ` Shiju Jose
2025-01-06 12:10 ` [PATCH v18 16/19] cxl/memfeature: Add CXL memory device ECS " shiju.jose
2025-01-06 12:10 ` [PATCH v18 17/19] cxl/mbox: Add support for PERFORM_MAINTENANCE mailbox command shiju.jose
2025-01-06 12:10 ` [PATCH v18 18/19] cxl/memfeature: Add CXL memory device soft PPR control feature shiju.jose
2025-01-06 12:10 ` [PATCH v18 19/19] cxl/memfeature: Add CXL memory device memory sparing " shiju.jose
2025-01-13 14:46 ` [PATCH v18 00/19] EDAC: Scrub: introduce generic EDAC RAS control feature driver + CXL/ACPI-RAS2 drivers Mauro Carvalho Chehab
2025-01-13 15:36   ` Jonathan Cameron
2025-01-14 14:06     ` Mauro Carvalho Chehab
2025-01-13 18:15   ` Shiju Jose
2025-01-30 19:18 ` Daniel Ferguson
2025-02-03  9:25   ` Shiju Jose

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250114152617.14eb41b5@foz.lan \
    --to=mchehab+huawei@kernel.org \
    --cc=Jon.Grimm@amd.com \
    --cc=Vilas.Sridharan@amd.com \
    --cc=Yazen.Ghannam@amd.com \
    --cc=alison.schofield@intel.com \
    --cc=bp@alien8.de \
    --cc=dan.j.williams@intel.com \
    --cc=dave.hansen@linux.intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=david@redhat.com \
    --cc=dferguson@amperecomputing.com \
    --cc=duenwen@google.com \
    --cc=erdemaktas@google.com \
    --cc=gthelen@google.com \
    --cc=ira.weiny@intel.com \
    --cc=james.morse@arm.com \
    --cc=jiaqiyan@google.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=jthoughton@google.com \
    --cc=kangkang.shen@futurewei.com \
    --cc=lenb@kernel.org \
    --cc=leo.duran@amd.com \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linuxarm@huawei.com \
    --cc=mchehab@kernel.org \
    --cc=naoya.horiguchi@nec.com \
    --cc=nifan.cxl@gmail.com \
    --cc=pgonda@google.com \
    --cc=prime.zeng@hisilicon.com \
    --cc=rafael@kernel.org \
    --cc=rientjes@google.com \
    --cc=roberto.sassu@huawei.com \
    --cc=shiju.jose@huawei.com \
    --cc=somasundaram.a@hpe.com \
    --cc=tanxiaofei@huawei.com \
    --cc=tony.luck@intel.com \
    --cc=vishal.l.verma@intel.com \
    --cc=wanghuiqiang@huawei.com \
    --cc=wbs@os.amperecomputing.com \
    --cc=wschwartz@amperecomputing.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox