linux-mm.kvack.org archive mirror
* [RFC PATCH 1/2] ARM: mm: support memory-failure
@ 2025-09-22  2:14 Xie Yuanbin
  2025-09-22  2:14 ` [RFC PATCH 2/2] ARM: memory-failure: not select RAS and MEMORY_ISOLATION Xie Yuanbin
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Xie Yuanbin @ 2025-09-22  2:14 UTC (permalink / raw)
  To: linux, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, linmiaohe, nao.horiguchi, rmk+kernel, ardb,
	nathan, ebiggers, arnd, rostedt, kees, dave, peterz
  Cc: will, linux-arm-kernel, linux-kernel, linux-mm, liaohua4,
	lilinjie8, xieyuanbin1

Memory-failure provides the ability to soft-offline pages, which is very
useful for handling memory errors such as correctable errors (CE) in ECC.

Although ARM does not have a user interface like
`/sys/devices/system/memory/soft_offline_page`, memory-failure still
exports some functions that can be used by a driver built as a loadable
module (ko).

Memory-failure uses one page flag (PG_hwpoison). On older kernel
versions, this caused the page flags to exceed the 32-bit limit
(when CONFIG_SPARSEMEM and CONFIG_HIGHMEM are both enabled),
and therefore it could not be enabled.

The following commit:
commit 09022bc196d2 ("mm: remove PG_error")
removed a page flag, so memory-failure can now be enabled on ARM.

The core code of memory-failure is architecture-independent, and it has
performed well in current testing. Perhaps it can also be enabled on
other 32-bit architectures (like x86; it seems that it can already be
enabled on 32-bit parisc), but I haven't tested that yet.

Signed-off-by: Xie Yuanbin <xieyuanbin1@huawei.com>
---
 arch/arm/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 5527935fd15a..b38c194a5cc4 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -149,20 +149,21 @@ config ARM
 	select PCI_SYSCALL if PCI
 	select PERF_USE_VMALLOC
 	select RTC_LIB
 	select SPARSE_IRQ if !(ARCH_FOOTBRIDGE || ARCH_RPC)
 	select SYS_SUPPORTS_APM_EMULATION
 	select THREAD_INFO_IN_TASK
 	select TIMER_OF if OF
 	select HAVE_ARCH_VMAP_STACK if MMU && ARM_HAS_GROUP_RELOCS
 	select TRACE_IRQFLAGS_SUPPORT if !CPU_V7M
 	select USE_OF if !(ARCH_FOOTBRIDGE || ARCH_RPC || ARCH_SA1100)
+	select ARCH_SUPPORTS_MEMORY_FAILURE
 	# Above selects are sorted alphabetically; please add new ones
 	# according to that.  Thanks.
 	help
 	  The ARM series is a line of low-power-consumption RISC chip designs
 	  licensed by ARM Ltd and targeted at embedded applications and
 	  handhelds such as the Compaq IPAQ.  ARM-based PCs are no longer
 	  manufactured, but legacy ARM-based PC hardware remains popular in
 	  Europe.  There is an ARM Linux project with a web page at
 	  <http://www.arm.linux.org.uk/>.
 
-- 
2.48.1




* [RFC PATCH 2/2] ARM: memory-failure: not select RAS and MEMORY_ISOLATION
  2025-09-22  2:14 [RFC PATCH 1/2] ARM: mm: support memory-failure Xie Yuanbin
@ 2025-09-22  2:14 ` Xie Yuanbin
  2025-09-22  8:15   ` David Hildenbrand
  2025-09-22  6:37 ` Arnd Bergmann
  2025-10-22  3:58 ` Xie Yuanbin
  2 siblings, 1 reply; 11+ messages in thread
From: Xie Yuanbin @ 2025-09-22  2:14 UTC (permalink / raw)
  To: linux, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, linmiaohe, nao.horiguchi, rmk+kernel, ardb,
	nathan, ebiggers, arnd, rostedt, kees, dave, peterz
  Cc: will, linux-arm-kernel, linux-kernel, linux-mm, liaohua4,
	lilinjie8, xieyuanbin1

For memory-failure on ARM, these features do not seem necessary.

Signed-off-by: Xie Yuanbin <xieyuanbin1@huawei.com>
---
 mm/Kconfig          | 4 ++--
 mm/memory-failure.c | 2 ++
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 034a1662d8c1..22eefc4747d5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -742,22 +742,22 @@ config DEFAULT_MMAP_MIN_ADDR
 	  This value can be changed after boot using the
 	  /proc/sys/vm/mmap_min_addr tunable.
 
 config ARCH_SUPPORTS_MEMORY_FAILURE
 	bool
 
 config MEMORY_FAILURE
 	depends on MMU
 	depends on ARCH_SUPPORTS_MEMORY_FAILURE
 	bool "Enable recovery from hardware memory errors"
-	select MEMORY_ISOLATION
-	select RAS
+	select MEMORY_ISOLATION if !ARM
+	select RAS if !ARM
 	help
 	  Enables code to recover from some memory failures on systems
 	  with MCA recovery. This allows a system to continue running
 	  even when some of its memory has uncorrected errors. This requires
 	  special hardware support and typically ECC memory.
 
 config HWPOISON_INJECT
 	tristate "HWPoison pages injector"
 	depends on MEMORY_FAILURE && DEBUG_KERNEL && PROC_FS
 	select PROC_PAGE_MONITOR
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a24806bb8e82..83b77caf41a1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1271,21 +1271,23 @@ static void update_per_node_mf_stats(unsigned long pfn,
 	++mf_stats->total;
 }
 
 /*
  * "Dirty/Clean" indication is not 100% accurate due to the possibility of
  * setting PG_dirty outside page lock. See also comment above set_page_dirty().
  */
 static int action_result(unsigned long pfn, enum mf_action_page_type type,
 			 enum mf_result result)
 {
+#ifdef CONFIG_RAS
 	trace_memory_failure_event(pfn, type, result);
+#endif
 
 	if (type != MF_MSG_ALREADY_POISONED) {
 		num_poisoned_pages_inc(pfn);
 		update_per_node_mf_stats(pfn, result);
 	}
 
 	pr_err("%#lx: recovery action for %s: %s\n",
 		pfn, action_page_types[type], action_name[result]);
 
 	return (result == MF_RECOVERED || result == MF_DELAYED) ? 0 : -EBUSY;
-- 
2.48.1




* Re: [RFC PATCH 1/2] ARM: mm: support memory-failure
  2025-09-22  2:14 [RFC PATCH 1/2] ARM: mm: support memory-failure Xie Yuanbin
  2025-09-22  2:14 ` [RFC PATCH 2/2] ARM: memory-failure: not select RAS and MEMORY_ISOLATION Xie Yuanbin
@ 2025-09-22  6:37 ` Arnd Bergmann
  2025-09-22  8:28   ` Xie Yuanbin
  2025-10-22  3:58 ` Xie Yuanbin
  2 siblings, 1 reply; 11+ messages in thread
From: Arnd Bergmann @ 2025-09-22  6:37 UTC (permalink / raw)
  To: Xie Yuanbin, Russell King, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, linmiaohe, nao.horiguchi,
	Russell King, Ard Biesheuvel, Nathan Chancellor, Eric Biggers,
	Steven Rostedt, Kees Cook, Dave Vasilevsky, Peter Zijlstra
  Cc: Will Deacon, linux-arm-kernel, linux-kernel, linux-mm, liaohua4,
	lilinjie8

On Mon, Sep 22, 2025, at 04:14, Xie Yuanbin wrote:
> Memory failure provides the ability of soft offline pages,
> which is very useful to handle the memory errors such as CE in ECC.
>
> Although ARM does not have a user interface like
> `/sys/devices/system/memory/soft_offline_page`, memory-failure still
> provides some exported func that can be used by some module ko driver.

It would be helpful to be more specific about what you
want to do with this.

Are you working on a driver that would actually make use of
the exported interface? I see only a very small number of
drivers that call memory_failure(), and none of them are
usable on Arm.

     Arnd



* Re: [RFC PATCH 2/2] ARM: memory-failure: not select RAS and MEMORY_ISOLATION
  2025-09-22  2:14 ` [RFC PATCH 2/2] ARM: memory-failure: not select RAS and MEMORY_ISOLATION Xie Yuanbin
@ 2025-09-22  8:15   ` David Hildenbrand
  2025-09-22  8:47     ` [RFC PATCH 1/2] ARM: mm: support memory-failure Xie Yuanbin
  0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2025-09-22  8:15 UTC (permalink / raw)
  To: Xie Yuanbin, linux, akpm, lorenzo.stoakes, Liam.Howlett, vbabka,
	rppt, surenb, mhocko, linmiaohe, nao.horiguchi, rmk+kernel, ardb,
	nathan, ebiggers, arnd, rostedt, kees, dave, peterz, Minchan Kim
  Cc: will, linux-arm-kernel, linux-kernel, linux-mm, liaohua4, lilinjie8

On 22.09.25 04:14, Xie Yuanbin wrote:
> For memory-failure on ARM, these features do not seem necessary.
> 
> Signed-off-by: Xie Yuanbin <xieyuanbin1@huawei.com>
> ---
>   mm/Kconfig          | 4 ++--
>   mm/memory-failure.c | 2 ++
>   2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 034a1662d8c1..22eefc4747d5 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -742,22 +742,22 @@ config DEFAULT_MMAP_MIN_ADDR
>   	  This value can be changed after boot using the
>   	  /proc/sys/vm/mmap_min_addr tunable.
>   
>   config ARCH_SUPPORTS_MEMORY_FAILURE
>   	bool
>   
>   config MEMORY_FAILURE
>   	depends on MMU
>   	depends on ARCH_SUPPORTS_MEMORY_FAILURE
>   	bool "Enable recovery from hardware memory errors"
> -	select MEMORY_ISOLATION
> -	select RAS
> +	select MEMORY_ISOLATION if !ARM
> +	select RAS if !ARM

I'm trying to figure out why we need MEMORY_ISOLATION at all.

MEMORY_ISOLATION is mostly required for memory offlining and 
alloc_contig_range()/cma -- it controls the availability of the 
"isolate" bit in the pageblock.

What CONFIG_MEMORY_FAILURE soft-offline support wants is migrate_pages() 
support. But that comes with CONFIG_MIGRATION.

And isolate_folio_to_list() has nothing to do with CONFIG_MEMORY_ISOLATION.

We added that "select MEMORY_ISOLATION" in commit ee6f509c3274 ("mm: 
factor out memory isolate functions").

Turns out we removed the need for that in add05cecef80 ("mm: 
soft-offline: don't free target page in successful page migration") 
where we removed the calls to set_migratetype_isolate() etc.

Can you send a patch to remove the "select MEMORY_ISOLATION" independent 
of any arm changes?

-- 
Cheers

David / dhildenb




* Re: [RFC PATCH 1/2] ARM: mm: support memory-failure
  2025-09-22  6:37 ` Arnd Bergmann
@ 2025-09-22  8:28   ` Xie Yuanbin
  2025-09-22 12:51     ` Arnd Bergmann
  0 siblings, 1 reply; 11+ messages in thread
From: Xie Yuanbin @ 2025-09-22  8:28 UTC (permalink / raw)
  To: arnd
  Cc: Liam.Howlett, akpm, ardb, dave, david, ebiggers, kees, liaohua4,
	lilinjie8, linmiaohe, linux-arm-kernel, linux-kernel, linux-mm,
	linux, lorenzo.stoakes, mhocko, nao.horiguchi, nathan, peterz,
	rmk+kernel, rostedt, rppt, surenb, vbabka, will, xieyuanbin1

> It would be helpful to be more specific about what you
> want to do with this.
> 
> Are you working on a driver that would actually make use of
> the exported interface?

Thanks for your reply.

Yes. In fact, we have developed a hardware component that detects DDR bit
flips (the detection itself is transparent to software). Once a bit flip
is detected, an interrupt is reported to the CPU.

On the software side, we have developed a driver module (ko) that
registers an interrupt callback to soft-offline the corresponding
physical pages.

In fact, we will export `soft_offline_page` for the module to use (we can
ensure that it is not called in interrupt context), but I have looked at
the code and found that `memory_failure_queue` and `memory_failure`,
which are already exported, can also be used.

> I see only a very small number of
> drivers that call memory_failure(), and none of them are
> usable on Arm.

I think that not all drivers are in the open-source kernel code.
As far as I know, there are similar third-party drivers on other
architectures, such as x86 or arm64, that use the memory-failure
functions. I am not a driver specialist, so if I have made any mistakes,
please correct me.

Xie Yuanbin



* Re: [RFC PATCH 1/2] ARM: mm: support memory-failure
  2025-09-22  8:15   ` David Hildenbrand
@ 2025-09-22  8:47     ` Xie Yuanbin
  0 siblings, 0 replies; 11+ messages in thread
From: Xie Yuanbin @ 2025-09-22  8:47 UTC (permalink / raw)
  To: david
  Cc: Liam.Howlett, akpm, ardb, arnd, dave, ebiggers, kees, liaohua4,
	lilinjie8, linmiaohe, linux-arm-kernel, linux-kernel, linux-mm,
	linux, lorenzo.stoakes, mhocko, minchan, nao.horiguchi, nathan,
	peterz, rmk+kernel, rostedt, rppt, surenb, vbabka, will,
	xieyuanbin1

David/dhildenb wrote:
> Can you send a patch to remove the "select MEMORY_ISOLATION" independent 
> of any arm changes?

With pleasure, I will send a patch later.

Xie Yuanbin



* Re: [RFC PATCH 1/2] ARM: mm: support memory-failure
  2025-09-22  8:28   ` Xie Yuanbin
@ 2025-09-22 12:51     ` Arnd Bergmann
  2025-09-23  4:10       ` Xie Yuanbin
  0 siblings, 1 reply; 11+ messages in thread
From: Arnd Bergmann @ 2025-09-22 12:51 UTC (permalink / raw)
  To: Xie Yuanbin
  Cc: Liam R. Howlett, Andrew Morton, Ard Biesheuvel, Dave Vasilevsky,
	David Hildenbrand, Eric Biggers, Kees Cook, liaohua4, lilinjie8,
	linmiaohe, linux-arm-kernel, linux-kernel, linux-mm,
	Russell King, Lorenzo Stoakes, Michal Hocko, nao.horiguchi,
	Nathan Chancellor, Peter Zijlstra, Russell King, Steven Rostedt,
	Mike Rapoport, Suren Baghdasaryan, Vlastimil Babka, Will Deacon

On Mon, Sep 22, 2025, at 10:28, Xie Yuanbin wrote:
>> It would be helpful to be more specific about what you
>> want to do with this.
>> 
>> Are you working on a driver that would actually make use of
>> the exported interface?
>
> Thanks for your reply.
>
> Yes, In fact, we have developed a hardware component to detect DDR bit
> transitions (software does not sense the detection behavior). Once a bit
> transition is detected, an interrupt is reported to the CPU.
>
> On the software side, we have developed a driver module ko to register
> the interrupt callback to perform soft page offline to the corresponding
> physical pages.
>
> In fact, we will export `soft_offline_page` for ko to use (we can ensure
> that it is not called in the interrupt context), but I have looked at the
> code and found that `memory_failure_queue` and `memory_failure` can also
> be used, which are already exported.

Ok

>> I see only a very small number of
>> drivers that call memory_failure(), and none of them are
>> usable on Arm.
>
> I think that not all drivers are in the open source kernel code.
> As far as I know, there should be similar third-party drivers in other
> architectures that use memory-failure functions, like x86 or arm64.
> I am not a specialist in drivers, so if I have made any mistakes,
> please correct me.

I'm not familiar with the memory-failure support, but this sounds
like something that is usually done with a drivers/edac/ driver.
There are many SoC specific drivers, including for 32-bit Arm
SoCs.

Have you considered adding an EDAC driver first? I don't know
how the other platforms that have EDAC drivers handle failures,
but I would assume that either that subsystem already contains
functionality for taking pages offline, or this is something
that should be done in a way that works for all of them without
requiring an extra driver.

      Arnd



* Re: [RFC PATCH 1/2] ARM: mm: support memory-failure
  2025-09-22 12:51     ` Arnd Bergmann
@ 2025-09-23  4:10       ` Xie Yuanbin
  2025-11-03 16:53         ` David Hildenbrand (Red Hat)
  0 siblings, 1 reply; 11+ messages in thread
From: Xie Yuanbin @ 2025-09-23  4:10 UTC (permalink / raw)
  To: linux, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, linmiaohe, nao.horiguchi, rmk+kernel, ardb,
	nathan, ebiggers, arnd, rostedt, kees, dave, peterz
  Cc: will, linux-arm-kernel, linux-kernel, linux-mm, liaohua4,
	lilinjie8, xieyuanbin1

Arnd Bergmann wrote:
>>> It would be helpful to be more specific about what you
>>> want to do with this.
>>> 
>>> Are you working on a driver that would actually make use of
>>> the exported interface?
>>
>> Thanks for your reply.
>>
>> Yes, In fact, we have developed a hardware component to detect DDR bit
>> transitions (software does not sense the detection behavior). Once a bit
>> transition is detected, an interrupt is reported to the CPU.
>>
>> On the software side, we have developed a driver module ko to register
>> the interrupt callback to perform soft page offline to the corresponding
>> physical pages.
>>
>> In fact, we will export `soft_offline_page` for ko to use (we can ensure
>> that it is not called in the interrupt context), but I have looked at the
>> code and found that `memory_failure_queue` and `memory_failure` can also
>> be used, which are already exported.
>
> Ok
>
>>> I see only a very small number of
>>> drivers that call memory_failure(), and none of them are
>>> usable on Arm.
>>
>> I think that not all drivers are in the open source kernel code.
>> As far as I know, there should be similar third-party drivers in other
>> architectures that use memory-failure functions, like x86 or arm64.
>> I am not a specialist in drivers, so if I have made any mistakes,
>> please correct me.
>
> I'm not familiar with the memory-failure support, but this sounds
> like something that is usually done with a drivers/edac/ driver.
> There are many SoC specific drivers, including for 32-bit Arm
> SoCs.
>
> Have you considered adding an EDAC driver first? I don't know
> how the other platforms that have EDAC drivers handle failures,
> but I would assume that either that subsystem already contains
> functionality for taking pages offline,

I'm very sorry, I tried my best to do this,
but it seems impossible to achieve.
I am a kernel developer rather than a driver developer. I have tried to
communicate with the driver developers, but open-sourcing the driver is
very difficult due to the involvement of proprietary hardware and
algorithms.

> or this is something
> that should be done in a way that works for all of them without
> requiring an extra driver.

Yes, I think that the memory-failure feature should not be associated with
specific architectures or drivers.

I have read the memory-failure documentation and code, and found the
following user-accessible features that are not tied to specific
drivers:

1. `/sys/devices/system/memory/soft_offline_page`:
see https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-memory-page-offline

This interface only exists when CONFIG_MEMORY_HOTPLUG is enabled, but
ARM cannot enable it.
However, I have read the code and believe that it should not require a
lot of effort to decouple these two, allowing the interface to exist
even if mem-hotplug is disabled.

2. The madvise() syscall with the `MADV_SOFT_OFFLINE`/`MADV_HWPOISON` advice:

According to the documentation, this interface is currently only used for
testing. However, if the user program can map the specified physical
address, it can actually be used for memory-failure.

3. CONFIG_HWPOISON_INJECT, which depends on CONFIG_MEMORY_FAILURE:
see https://docs.kernel.org/mm/hwpoison.html

It seems to allow input of physical addresses and trigger memory-failure,
but according to the doc, it seems to be used only for testing.
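
For interface 2 above, a minimal userspace sketch in C could look like
the following. It assumes a Linux system. The helper name
`try_soft_offline_one_page` is hypothetical, and the fallback value 101
for `MADV_SOFT_OFFLINE` is taken from the kernel's generic uapi header,
so it is only an assumption for architectures that define their own
values. The call is expected to fail with EPERM without CAP_SYS_ADMIN,
and typically with EINVAL on a kernel built without
CONFIG_MEMORY_FAILURE. On success the backing physical page is taken out
of use, so this should only be run on a test machine.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE		/* for MAP_ANONYMOUS and madvise() */
#endif
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Assumption: value from the kernel's generic uapi header, used only if
 * the C library headers do not already provide the constant.
 */
#ifndef MADV_SOFT_OFFLINE
#define MADV_SOFT_OFFLINE 101
#endif

/*
 * Hypothetical helper: map one anonymous page, touch it so that it is
 * backed by a physical page, then ask the kernel to soft-offline it.
 * Returns 0 on success or a negative errno value on failure.
 */
static int try_soft_offline_one_page(void)
{
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	int ret = 0;

	assert(page_size > 0);

	p = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -errno;

	/* Write to the page so that a physical page is actually allocated. */
	memset(p, 0x5a, (size_t)page_size);

	if (madvise(p, (size_t)page_size, MADV_SOFT_OFFLINE) != 0)
		ret = -errno;

	munmap(p, (size_t)page_size);
	return ret;
}
```

A caller would check the return value: 0 means the kernel accepted the
request, while -EPERM or -EINVAL indicates missing privileges or missing
kernel support. Note that this operates on a virtual address in the
caller's own address space, which is why it is of limited use for
handling an error reported at an arbitrary physical address.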


Additionally, I noticed that in the memory-failure doc
https://docs.kernel.org/mm/hwpoison.html, it mentions that
"The main target right now is KVM guests, but it works for all kinds of
applications." This seems to confirm my speculation that the
memory-failure feature should not be associated with specific
architectures or drivers.

Xie Yuanbin



* Re: [RFC PATCH 1/2] ARM: mm: support memory-failure
  2025-09-22  2:14 [RFC PATCH 1/2] ARM: mm: support memory-failure Xie Yuanbin
  2025-09-22  2:14 ` [RFC PATCH 2/2] ARM: memory-failure: not select RAS and MEMORY_ISOLATION Xie Yuanbin
  2025-09-22  6:37 ` Arnd Bergmann
@ 2025-10-22  3:58 ` Xie Yuanbin
  2 siblings, 0 replies; 11+ messages in thread
From: Xie Yuanbin @ 2025-10-22  3:58 UTC (permalink / raw)
  To: linux, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
	surenb, mhocko, linmiaohe, nao.horiguchi, rmk+kernel, ardb,
	nathan, ebiggers, arnd, rostedt, kees, dave, peterz
  Cc: will, linux-arm-kernel, linux-kernel, linux-mm, liaohua4,
	lilinjie8, xieyuanbin1

Hello

There seems to have been no discussion of these patches for a month.
Does anyone have any other opinions?

https://lore.kernel.org/all/20250923041005.9831-1-xieyuanbin1@huawei.com/t/#u

Xie Yuanbin



* Re: [RFC PATCH 1/2] ARM: mm: support memory-failure
  2025-09-23  4:10       ` Xie Yuanbin
@ 2025-11-03 16:53         ` David Hildenbrand (Red Hat)
  2025-11-04 13:48           ` Xie Yuanbin
  0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-03 16:53 UTC (permalink / raw)
  To: Xie Yuanbin, linux, akpm, david, lorenzo.stoakes, Liam.Howlett,
	vbabka, rppt, surenb, mhocko, linmiaohe, nao.horiguchi,
	rmk+kernel, ardb, nathan, ebiggers, arnd, rostedt, kees, dave,
	peterz
  Cc: will, linux-arm-kernel, linux-kernel, linux-mm, liaohua4, lilinjie8

On 23.09.25 06:10, Xie Yuanbin wrote:
> Arnd Bergmann wrote:
>>>> It would be helpful to be more specific about what you
>>>> want to do with this.
>>>>
>>>> Are you working on a driver that would actually make use of
>>>> the exported interface?
>>>
>>> Thanks for your reply.
>>>
>>> Yes, In fact, we have developed a hardware component to detect DDR bit
>>> transitions (software does not sense the detection behavior). Once a bit
>>> transition is detected, an interrupt is reported to the CPU.
>>>
>>> On the software side, we have developed a driver module ko to register
>>> the interrupt callback to perform soft page offline to the corresponding
>>> physical pages.
>>>
>>> In fact, we will export `soft_offline_page` for ko to use (we can ensure
>>> that it is not called in the interrupt context), but I have looked at the
>>> code and found that `memory_failure_queue` and `memory_failure` can also
>>> be used, which are already exported.
>>
>> Ok
>>
>>>> I see only a very small number of
>>>> drivers that call memory_failure(), and none of them are
>>>> usable on Arm.
>>>
>>> I think that not all drivers are in the open source kernel code.
>>> As far as I know, there should be similar third-party drivers in other
>>> architectures that use memory-failure functions, like x86 or arm64.
>>> I am not a specialist in drivers, so if I have made any mistakes,
>>> please correct me.
>>
>> I'm not familiar with the memory-failure support, but this sounds
>> like something that is usually done with a drivers/edac/ driver.
>> There are many SoC specific drivers, including for 32-bit Arm
>> SoCs.
>>
>> Have you considered adding an EDAC driver first? I don't know
>> how the other platforms that have EDAC drivers handle failures,
>> but I would assume that either that subsystem already contains
>> functionality for taking pages offline,
> 
> I'm very sorry, I tried my best to do this,
> but it seems impossible to achieve.
> I am a kernel developer rather than a driver developer. I have tried to
> communicate with driver developers, but open source is very difficult due
> to the involvement of proprietary hardware and algorithms.
> 
>> or this is something
>> that should be done in a way that works for all of them without
>> requiring an extra driver.
> 
> Yes, I think that the memory-failure feature should not be associated with
> specific architectures or drivers.
> 
> I have read the memory-failure's doc and code,
> and found the following features, which are user useable,
> are not associated with specific drivers:
> 
> 1. `/sys/devices/system/memory/soft_offline_page`:
> see https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-memory-page-offline
> 
> This interface only exists when CONFIG_MEMORY_HOTPLUG is enabled, but
> ARM cannot enable it.
> However, I have read the code and believe that it should not require a
> lot of effort to decouple these two, allowing the interface to exist
> even if mem-hotplug is disabled.

It's all about the /sys/devices/system/memory/ directory, which 
traditionally only made sense for memory hotplug. Well, still does to 
most degree.

Not sure whether some user space (chmem?) senses for 
/sys/devices/system/memory/ to detect memory hotplug capabilities.

But given soft_offline_page is a pure testing mechanism, I wouldn't be 
too concerned about that for now.

> 
> 2. The syscall madvise with `MADV_SOFT_OFFLINE/MADV_HWPOISON` flags:
> 
> According to the documentation, this interface is currently only used for
> testing. However, if the user program can map the specified physical
> address, it can actually be used for memory-failure.

It's mostly a testing-only interface. It could be used for other things, 
but really detecting MCE and handling it properly is kernel responsibility.

> 
> 3. The CONFIG_HWPOISON_INJECT which depends on CONFIG_MEMORY_FAILURE:
> see https://docs.kernel.org/mm/hwpoison.html
> 
> It seems to allow input of physical addresses and trigger memory-failure,
> but according to the doc, it seems to be used only for testing.

Right, all these interfaces are testing only.

> 
> 
> Additionally, I noticed that in the memory-failure doc
> https://docs.kernel.org/mm/hwpoison.html, it mentions that
> "The main target right now is KVM guests, but it works for all kinds of
> applications." This seems to confirm my speculation that the
> memory-failure feature should not be associated with specific
> architectures or drivers.

Can you go into more details which exact functionality in 
memory-failure.c you would be interested in using?

Only soft-offlining or also the other (possibly architecture-specific) 
handling?

-- 
Cheers

David



* Re: [RFC PATCH 1/2] ARM: mm: support memory-failure
  2025-11-03 16:53         ` David Hildenbrand (Red Hat)
@ 2025-11-04 13:48           ` Xie Yuanbin
  0 siblings, 0 replies; 11+ messages in thread
From: Xie Yuanbin @ 2025-11-04 13:48 UTC (permalink / raw)
  To: david
  Cc: Liam.Howlett, akpm, ardb, arnd, dave, david, ebiggers, kees,
	liaohua4, lilinjie8, linmiaohe, linux-arm-kernel, linux-kernel,
	linux-mm, linux, lorenzo.stoakes, mhocko, nao.horiguchi, nathan,
	peterz, rmk+kernel, rostedt, rppt, surenb, vbabka, will,
	xieyuanbin1

On Mon, 3 Nov 2025 17:53:18 +0100, David Hildenbrand wrote:
> Can you go into more details which exact functionality in
> memory-failure.c you would be interested in using?
>
> Only soft-offlining or also the other (possibly architecture-specific)
> handling?

Thanks! Let me describe it in as much detail as possible.

The functions in memory-failure.c are currently used in three ways:
1. When an application is using memory and ECC detects a UE
(uncorrectable error) bit flip in DRAM (the detection is performed by
hardware and is transparent to software), the hardware reports an
interrupt to the CPU. The relevant driver (a third-party module) has
already registered the interrupt callback function.
Depending on the configuration, the driver either calls
`memory_failure_queue()` inside the callback function, or wakes up the
related kthread to call `soft_offline_page()`/`memory_failure()`, in
order to take the affected memory offline or kill the process.

2. Hardware memory scanning: The hardware periodically performs
read/write tests on some memory. (This hardware is non-standard, so it is
not included in the ARM spec. The scanning is transparent to software.)
If a bit flip is detected during the test, an interrupt is reported to
the operating system to trigger memory-failure handling, as described
above.

3. Software memory scanning: The software (such as a kthread or
workqueue) periodically uses `soft_offline_page()` to isolate some free
memory and performs read/write tests on it. If a bit flip is detected
during the test, the memory is considered faulty and is not returned to
use. Otherwise, `unpoison_memory()` is used to recover the memory.

Unfortunately, the driver code for these three methods is difficult to
open-source. I have also been thinking about whether there is a
general-purpose function that could use memory-failure, but I haven't
come up with a good idea yet.
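
To illustrate the read/write test in way 3, here is a minimal sketch in
userspace C. It is not the driver code, which is not public. The function
names `count_flipped_bits` and `scan_buffer` and the choice of test
patterns are assumptions made for illustration. A real in-kernel scanner
would also have to make sure the accesses reach DRAM rather than being
served from CPU caches, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the bits that differ between the written and the read-back word. */
static unsigned int count_flipped_bits(uint64_t expected, uint64_t actual)
{
	uint64_t diff = expected ^ actual;
	unsigned int n = 0;

	while (diff) {
		diff &= diff - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Write each test pattern across the buffer, read it back, and return
 * the total number of flipped bits observed. The previous contents of
 * the buffer are destroyed, so this must only run on memory that is not
 * otherwise in use.
 */
static unsigned long scan_buffer(volatile uint64_t *buf, size_t nwords)
{
	static const uint64_t patterns[] = {
		0x0000000000000000ULL, 0xffffffffffffffffULL,
		0xaaaaaaaaaaaaaaaaULL, 0x5555555555555555ULL,
	};
	unsigned long flips = 0;
	size_t p, i;

	assert(buf != NULL);

	for (p = 0; p < sizeof(patterns) / sizeof(patterns[0]); p++) {
		for (i = 0; i < nwords; i++)
			buf[i] = patterns[p];
		for (i = 0; i < nwords; i++)
			flips += count_flipped_bits(patterns[p], buf[i]);
	}
	return flips;
}
```

In the flow described above, a nonzero result would mean the page stays
offline, and a zero result would be followed by `unpoison_memory()` to
return the page to use.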

> Cheers
>
> David

Thanks!

Xie Yuanbin


