* [RFC PATCH 1/2] ARM: mm: support memory-failure
@ 2025-09-22 2:14 Xie Yuanbin
2025-09-22 2:14 ` [RFC PATCH 2/2] ARM: memory-failure: not select RAS and MEMORY_ISOLATION Xie Yuanbin
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Xie Yuanbin @ 2025-09-22 2:14 UTC (permalink / raw)
To: linux, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, mhocko, linmiaohe, nao.horiguchi, rmk+kernel, ardb,
nathan, ebiggers, arnd, rostedt, kees, dave, peterz
Cc: will, linux-arm-kernel, linux-kernel, linux-mm, liaohua4,
lilinjie8, xieyuanbin1
Memory-failure provides the ability to soft-offline pages,
which is very useful for handling memory errors such as ECC
correctable errors (CE).

Although ARM does not have a user interface like
`/sys/devices/system/memory/soft_offline_page`, memory-failure still
exports some functions that can be used by loadable driver modules (.ko).

Memory-failure uses one page flag (PG_hwpoison). In older kernel
versions, this caused the page flags to exceed the 32-bit limit
(when CONFIG_SPARSEMEM and CONFIG_HIGHMEM were both enabled),
so it could not be enabled.

The following commit:
commit 09022bc196d2 ("mm: remove PG_error")
removed a page flag, so memory-failure can now be enabled on ARM.

The core code of memory-failure is architecture-independent and, in fact,
it has performed well in current testing. Perhaps it can also be enabled
on other 32-bit architectures (like x86; it seems that it can already
be enabled on the 32-bit parisc architecture), but I haven't tested that yet.

Signed-off-by: Xie Yuanbin <xieyuanbin1@huawei.com>
---
arch/arm/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 5527935fd15a..b38c194a5cc4 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -149,20 +149,21 @@ config ARM
select PCI_SYSCALL if PCI
select PERF_USE_VMALLOC
select RTC_LIB
select SPARSE_IRQ if !(ARCH_FOOTBRIDGE || ARCH_RPC)
select SYS_SUPPORTS_APM_EMULATION
select THREAD_INFO_IN_TASK
select TIMER_OF if OF
select HAVE_ARCH_VMAP_STACK if MMU && ARM_HAS_GROUP_RELOCS
select TRACE_IRQFLAGS_SUPPORT if !CPU_V7M
select USE_OF if !(ARCH_FOOTBRIDGE || ARCH_RPC || ARCH_SA1100)
+ select ARCH_SUPPORTS_MEMORY_FAILURE
# Above selects are sorted alphabetically; please add new ones
# according to that. Thanks.
help
The ARM series is a line of low-power-consumption RISC chip designs
licensed by ARM Ltd and targeted at embedded applications and
handhelds such as the Compaq IPAQ. ARM-based PCs are no longer
manufactured, but legacy ARM-based PC hardware remains popular in
Europe. There is an ARM Linux project with a web page at
<http://www.arm.linux.org.uk/>.
--
2.48.1
^ permalink raw reply [flat|nested] 11+ messages in thread
* [RFC PATCH 2/2] ARM: memory-failure: not select RAS and MEMORY_ISOLATION
2025-09-22 2:14 [RFC PATCH 1/2] ARM: mm: support memory-failure Xie Yuanbin
@ 2025-09-22 2:14 ` Xie Yuanbin
2025-09-22 8:15 ` David Hildenbrand
2025-09-22 6:37 ` Arnd Bergmann
2025-10-22 3:58 ` Xie Yuanbin
2 siblings, 1 reply; 11+ messages in thread
From: Xie Yuanbin @ 2025-09-22 2:14 UTC (permalink / raw)
To: linux, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, mhocko, linmiaohe, nao.horiguchi, rmk+kernel, ardb,
nathan, ebiggers, arnd, rostedt, kees, dave, peterz
Cc: will, linux-arm-kernel, linux-kernel, linux-mm, liaohua4,
lilinjie8, xieyuanbin1
For memory-failure on ARM, these features do not seem necessary.
Signed-off-by: Xie Yuanbin <xieyuanbin1@huawei.com>
---
mm/Kconfig | 4 ++--
mm/memory-failure.c | 2 ++
2 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index 034a1662d8c1..22eefc4747d5 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -742,22 +742,22 @@ config DEFAULT_MMAP_MIN_ADDR
This value can be changed after boot using the
/proc/sys/vm/mmap_min_addr tunable.
config ARCH_SUPPORTS_MEMORY_FAILURE
bool
config MEMORY_FAILURE
depends on MMU
depends on ARCH_SUPPORTS_MEMORY_FAILURE
bool "Enable recovery from hardware memory errors"
- select MEMORY_ISOLATION
- select RAS
+ select MEMORY_ISOLATION if !ARM
+ select RAS if !ARM
help
Enables code to recover from some memory failures on systems
with MCA recovery. This allows a system to continue running
even when some of its memory has uncorrected errors. This requires
special hardware support and typically ECC memory.
config HWPOISON_INJECT
tristate "HWPoison pages injector"
depends on MEMORY_FAILURE && DEBUG_KERNEL && PROC_FS
select PROC_PAGE_MONITOR
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a24806bb8e82..83b77caf41a1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1271,21 +1271,23 @@ static void update_per_node_mf_stats(unsigned long pfn,
++mf_stats->total;
}
/*
* "Dirty/Clean" indication is not 100% accurate due to the possibility of
* setting PG_dirty outside page lock. See also comment above set_page_dirty().
*/
static int action_result(unsigned long pfn, enum mf_action_page_type type,
enum mf_result result)
{
+#ifdef CONFIG_RAS
trace_memory_failure_event(pfn, type, result);
+#endif
if (type != MF_MSG_ALREADY_POISONED) {
num_poisoned_pages_inc(pfn);
update_per_node_mf_stats(pfn, result);
}
pr_err("%#lx: recovery action for %s: %s\n",
pfn, action_page_types[type], action_name[result]);
return (result == MF_RECOVERED || result == MF_DELAYED) ? 0 : -EBUSY;
--
2.48.1
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 2/2] ARM: memory-failure: not select RAS and MEMORY_ISOLATION
2025-09-22 2:14 ` [RFC PATCH 2/2] ARM: memory-failure: not select RAS and MEMORY_ISOLATION Xie Yuanbin
@ 2025-09-22 8:15 ` David Hildenbrand
2025-09-22 8:47 ` [RFC PATCH 1/2] ARM: mm: support memory-failure Xie Yuanbin
0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2025-09-22 8:15 UTC (permalink / raw)
To: Xie Yuanbin, linux, akpm, lorenzo.stoakes, Liam.Howlett, vbabka,
rppt, surenb, mhocko, linmiaohe, nao.horiguchi, rmk+kernel, ardb,
nathan, ebiggers, arnd, rostedt, kees, dave, peterz, Minchan Kim
Cc: will, linux-arm-kernel, linux-kernel, linux-mm, liaohua4,
lilinjie8
On 22.09.25 04:14, Xie Yuanbin wrote:
> For memory-failure on ARM, these features do not seem necessary.
>
> Signed-off-by: Xie Yuanbin <xieyuanbin1@huawei.com>
> ---
> mm/Kconfig | 4 ++--
> mm/memory-failure.c | 2 ++
> 2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 034a1662d8c1..22eefc4747d5 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -742,22 +742,22 @@ config DEFAULT_MMAP_MIN_ADDR
> This value can be changed after boot using the
> /proc/sys/vm/mmap_min_addr tunable.
>
> config ARCH_SUPPORTS_MEMORY_FAILURE
> bool
>
> config MEMORY_FAILURE
> depends on MMU
> depends on ARCH_SUPPORTS_MEMORY_FAILURE
> bool "Enable recovery from hardware memory errors"
> - select MEMORY_ISOLATION
> - select RAS
> + select MEMORY_ISOLATION if !ARM
> + select RAS if !ARM

I'm trying to figure out why we need MEMORY_ISOLATION at all.

MEMORY_ISOLATION is mostly required for memory offlining and
alloc_contig_range()/cma -- it controls the availability of the
"isolate" bit in the pageblock.

What CONFIG_MEMORY_FAILURE soft-offline support wants is migrate_pages()
support. But that comes with CONFIG_MIGRATION.

And isolate_folio_to_list() has nothing to do with CONFIG_MEMORY_ISOLATION.

We added that "select MEMORY_ISOLATION" in commit ee6f509c3274 ("mm:
factor out memory isolate functions"). Turns out we removed the need for
that in add05cecef80 ("mm: soft-offline: don't free target page in
successful page migration") where we removed the calls to
set_migratetype_isolate() etc.

Can you send a patch to remove the "select MEMORY_ISOLATION" independent
of any arm changes?

--
Cheers

David / dhildenb
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 1/2] ARM: mm: support memory-failure
2025-09-22 8:15 ` David Hildenbrand
@ 2025-09-22 8:47 ` Xie Yuanbin
0 siblings, 0 replies; 11+ messages in thread
From: Xie Yuanbin @ 2025-09-22 8:47 UTC (permalink / raw)
To: david
Cc: Liam.Howlett, akpm, ardb, arnd, dave, ebiggers, kees, liaohua4,
lilinjie8, linmiaohe, linux-arm-kernel, linux-kernel, linux-mm,
linux, lorenzo.stoakes, mhocko, minchan, nao.horiguchi, nathan,
peterz, rmk+kernel, rostedt, rppt, surenb, vbabka, will,
xieyuanbin1
David/dhildenb wrote:
> Can you send a patch to remove the "select MEMORY_ISOLATION" independent
> of any arm changes?

With pleasure, I will send a patch later.

Xie Yuanbin
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 1/2] ARM: mm: support memory-failure
2025-09-22 2:14 [RFC PATCH 1/2] ARM: mm: support memory-failure Xie Yuanbin
2025-09-22 2:14 ` [RFC PATCH 2/2] ARM: memory-failure: not select RAS and MEMORY_ISOLATION Xie Yuanbin
@ 2025-09-22 6:37 ` Arnd Bergmann
2025-09-22 8:28 ` Xie Yuanbin
2025-10-22 3:58 ` Xie Yuanbin
2 siblings, 1 reply; 11+ messages in thread
From: Arnd Bergmann @ 2025-09-22 6:37 UTC (permalink / raw)
To: Xie Yuanbin, Russell King, Andrew Morton, David Hildenbrand,
Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
Suren Baghdasaryan, Michal Hocko, linmiaohe, nao.horiguchi,
Russell King, Ard Biesheuvel, Nathan Chancellor, Eric Biggers,
Steven Rostedt, Kees Cook, Dave Vasilevsky, Peter Zijlstra
Cc: Will Deacon, linux-arm-kernel, linux-kernel, linux-mm, liaohua4,
lilinjie8
On Mon, Sep 22, 2025, at 04:14, Xie Yuanbin wrote:
> Memory failure provides the ability of soft offline pages,
> which is very useful to handle the memory errors such as CE in ECC.
>
> Although ARM does not have a user interface like
> `/sys/devices/system/memory/soft_offline_page`, memory-failure still
> provides some exported func that can be used by some module ko driver.

It would be helpful to be more specific about what you
want to do with this.

Are you working on a driver that would actually make use of
the exported interface? I see only a very small number of
drivers that call memory_failure(), and none of them are
usable on Arm.

Arnd
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 1/2] ARM: mm: support memory-failure
2025-09-22 6:37 ` Arnd Bergmann
@ 2025-09-22 8:28 ` Xie Yuanbin
2025-09-22 12:51 ` Arnd Bergmann
0 siblings, 1 reply; 11+ messages in thread
From: Xie Yuanbin @ 2025-09-22 8:28 UTC (permalink / raw)
To: arnd
Cc: Liam.Howlett, akpm, ardb, dave, david, ebiggers, kees, liaohua4,
lilinjie8, linmiaohe, linux-arm-kernel, linux-kernel, linux-mm,
linux, lorenzo.stoakes, mhocko, nao.horiguchi, nathan, peterz,
rmk+kernel, rostedt, rppt, surenb, vbabka, will, xieyuanbin1
> It would be helpful to be more specific about what you
> want to do with this.
>
> Are you working on a driver that would actually make use of
> the exported interface?

Thanks for your reply.

Yes. In fact, we have developed a hardware component to detect DDR bit
flips (the detection itself is invisible to software). Once a bit
flip is detected, an interrupt is reported to the CPU.

On the software side, we have developed a driver module (.ko) that
registers an interrupt callback to soft-offline the corresponding
physical pages.

In fact, we will export `soft_offline_page` for the module to use (we can
ensure that it is not called in interrupt context), but I have looked at
the code and found that `memory_failure_queue` and `memory_failure`, which
are already exported, can also be used.

> I see only a very small number of
> drivers that call memory_failure(), and none of them are
> usable on Arm.

I think that not all drivers are in the open source kernel code.
As far as I know, there should be similar third-party drivers on other
architectures, such as x86 or arm64, that use the memory-failure functions.
I am not a specialist in drivers, so if I have made any mistakes,
please correct me.

Xie Yuanbin
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 1/2] ARM: mm: support memory-failure
2025-09-22 8:28 ` Xie Yuanbin
@ 2025-09-22 12:51 ` Arnd Bergmann
2025-09-23 4:10 ` Xie Yuanbin
0 siblings, 1 reply; 11+ messages in thread
From: Arnd Bergmann @ 2025-09-22 12:51 UTC (permalink / raw)
To: Xie Yuanbin
Cc: Liam R. Howlett, Andrew Morton, Ard Biesheuvel, Dave Vasilevsky,
David Hildenbrand, Eric Biggers, Kees Cook, liaohua4, lilinjie8,
linmiaohe, linux-arm-kernel, linux-kernel, linux-mm, Russell King,
Lorenzo Stoakes, Michal Hocko, nao.horiguchi, Nathan Chancellor,
Peter Zijlstra, Russell King, Steven Rostedt, Mike Rapoport,
Suren Baghdasaryan, Vlastimil Babka, Will Deacon
On Mon, Sep 22, 2025, at 10:28, Xie Yuanbin wrote:
>> It would be helpful to be more specific about what you
>> want to do with this.
>>
>> Are you working on a driver that would actually make use of
>> the exported interface?
>
> Thanks for your reply.
>
> Yes, In fact, we have developed a hardware component to detect DDR bit
> transitions (software does not sense the detection behavior). Once a bit
> transition is detected, an interrupt is reported to the CPU.
>
> On the software side, we have developed a driver module ko to register
> the interrupt callback to perform soft page offline to the corresponding
> physical pages.
>
> In fact, we will export `soft_offline_page` for ko to use (we can ensure
> that it is not called in the interrupt context), but I have looked at the
> code and found that `memory_failure_queue` and `memory_failure` can also
> be used, which are already exported.

Ok

>> I see only a very small number of
>> drivers that call memory_failure(), and none of them are
>> usable on Arm.
>
> I think that not all drivers are in the open source kernel code.
> As far as I know, there should be similar third-party drivers in other
> architectures that use memory-failure functions, like x86 or arm64.
> I am not a specialist in drivers, so if I have made any mistakes,
> please correct me.

I'm not familiar with the memory-failure support, but this sounds
like something that is usually done with a drivers/edac/ driver.
There are many SoC specific drivers, including for 32-bit Arm
SoCs.

Have you considered adding an EDAC driver first? I don't know
how the other platforms that have EDAC drivers handle failures,
but I would assume that either that subsystem already contains
functionality for taking pages offline, or this is something
that should be done in a way that works for all of them without
requiring an extra driver.

Arnd
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 1/2] ARM: mm: support memory-failure
2025-09-22 12:51 ` Arnd Bergmann
@ 2025-09-23 4:10 ` Xie Yuanbin
2025-11-03 16:53 ` David Hildenbrand (Red Hat)
0 siblings, 1 reply; 11+ messages in thread
From: Xie Yuanbin @ 2025-09-23 4:10 UTC (permalink / raw)
To: linux, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, mhocko, linmiaohe, nao.horiguchi, rmk+kernel, ardb,
nathan, ebiggers, arnd, rostedt, kees, dave, peterz
Cc: will, linux-arm-kernel, linux-kernel, linux-mm, liaohua4,
lilinjie8, xieyuanbin1
Arnd Bergmann wrote:
>>> It would be helpful to be more specific about what you
>>> want to do with this.
>>>
>>> Are you working on a driver that would actually make use of
>>> the exported interface?
>>
>> Thanks for your reply.
>>
>> Yes, In fact, we have developed a hardware component to detect DDR bit
>> transitions (software does not sense the detection behavior). Once a bit
>> transition is detected, an interrupt is reported to the CPU.
>>
>> On the software side, we have developed a driver module ko to register
>> the interrupt callback to perform soft page offline to the corresponding
>> physical pages.
>>
>> In fact, we will export `soft_offline_page` for ko to use (we can ensure
>> that it is not called in the interrupt context), but I have looked at the
>> code and found that `memory_failure_queue` and `memory_failure` can also
>> be used, which are already exported.
>
> Ok
>
>>> I see only a very small number of
>>> drivers that call memory_failure(), and none of them are
>>> usable on Arm.
>>
>> I think that not all drivers are in the open source kernel code.
>> As far as I know, there should be similar third-party drivers in other
>> architectures that use memory-failure functions, like x86 or arm64.
>> I am not a specialist in drivers, so if I have made any mistakes,
>> please correct me.
>
> I'm not familiar with the memory-failure support, but this sounds
> like something that is usually done with a drivers/edac/ driver.
> There are many SoC specific drivers, including for 32-bit Arm
> SoCs.
>
> Have you considered adding an EDAC driver first? I don't know
> how the other platforms that have EDAC drivers handle failures,
> but I would assume that either that subsystem already contains
> functionality for taking pages offline,

I'm very sorry, I tried my best to do this,
but it seems impossible to achieve.
I am a kernel developer rather than a driver developer. I have tried to
communicate with the driver developers, but open-sourcing the driver is
very difficult because proprietary hardware and algorithms are involved.

> or this is something
> that should be done in a way that works for all of them without
> requiring an extra driver.

Yes, I think that the memory-failure feature should not be tied to
specific architectures or drivers.

I have read the memory-failure documentation and code,
and found that the following user-visible features
are not tied to specific drivers:

1. `/sys/devices/system/memory/soft_offline_page`:
see https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-memory-page-offline

This interface only exists when CONFIG_MEMORY_HOTPLUG is enabled, but
ARM cannot enable it.
However, I have read the code and believe that it should not require a
lot of effort to decouple the two, allowing the interface to exist
even if memory hotplug is disabled.

2. The madvise() syscall with the `MADV_SOFT_OFFLINE`/`MADV_HWPOISON` flags:

According to the documentation, this interface is currently only used for
testing. However, if the user program can map the specified physical
address, it can actually be used for memory-failure.

3. CONFIG_HWPOISON_INJECT, which depends on CONFIG_MEMORY_FAILURE:
see https://docs.kernel.org/mm/hwpoison.html

It seems to accept physical addresses as input and trigger memory-failure,
but according to the documentation, it seems to be used only for testing.


Additionally, I noticed that the memory-failure documentation at
https://docs.kernel.org/mm/hwpoison.html mentions that
"The main target right now is KVM guests, but it works for all kinds of
applications." This seems to confirm my speculation that the
memory-failure feature should not be tied to specific
architectures or drivers.

Xie Yuanbin
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 1/2] ARM: mm: support memory-failure
2025-09-23 4:10 ` Xie Yuanbin
@ 2025-11-03 16:53 ` David Hildenbrand (Red Hat)
2025-11-04 13:48 ` Xie Yuanbin
0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand (Red Hat) @ 2025-11-03 16:53 UTC (permalink / raw)
To: Xie Yuanbin, linux, akpm, david, lorenzo.stoakes, Liam.Howlett,
vbabka, rppt, surenb, mhocko, linmiaohe, nao.horiguchi, rmk+kernel,
ardb, nathan, ebiggers, arnd, rostedt, kees, dave, peterz
Cc: will, linux-arm-kernel, linux-kernel, linux-mm, liaohua4,
lilinjie8
On 23.09.25 06:10, Xie Yuanbin wrote:
> Arnd Bergmann wrote:
>>>> It would be helpful to be more specific about what you
>>>> want to do with this.
>>>>
>>>> Are you working on a driver that would actually make use of
>>>> the exported interface?
>>>
>>> Thanks for your reply.
>>>
>>> Yes, In fact, we have developed a hardware component to detect DDR bit
>>> transitions (software does not sense the detection behavior). Once a bit
>>> transition is detected, an interrupt is reported to the CPU.
>>>
>>> On the software side, we have developed a driver module ko to register
>>> the interrupt callback to perform soft page offline to the corresponding
>>> physical pages.
>>>
>>> In fact, we will export `soft_offline_page` for ko to use (we can ensure
>>> that it is not called in the interrupt context), but I have looked at the
>>> code and found that `memory_failure_queue` and `memory_failure` can also
>>> be used, which are already exported.
>>
>> Ok
>>
>>>> I see only a very small number of
>>>> drivers that call memory_failure(), and none of them are
>>>> usable on Arm.
>>>
>>> I think that not all drivers are in the open source kernel code.
>>> As far as I know, there should be similar third-party drivers in other
>>> architectures that use memory-failure functions, like x86 or arm64.
>>> I am not a specialist in drivers, so if I have made any mistakes,
>>> please correct me.
>>
>> I'm not familiar with the memory-failure support, but this sounds
>> like something that is usually done with a drivers/edac/ driver.
>> There are many SoC specific drivers, including for 32-bit Arm
>> SoCs.
>>
>> Have you considered adding an EDAC driver first? I don't know
>> how the other platforms that have EDAC drivers handle failures,
>> but I would assume that either that subsystem already contains
>> functionality for taking pages offline,
>
> I'm very sorry, I tried my best to do this,
> but it seems impossible to achieve.
> I am a kernel developer rather than a driver developer. I have tried to
> communicate with driver developers, but open source is very difficult due
> to the involvement of proprietary hardware and algorithms.
>
>> or this is something
>> that should be done in a way that works for all of them without
>> requiring an extra driver.
>
> Yes, I think that the memory-failure feature should not be associated with
> specific architectures or drivers.
>
> I have read the memory-failure's doc and code,
> and found the following features, which are user useable,
> are not associated with specific drivers:
>
> 1. `/sys/devices/system/memory/soft_offline_page`:
> see https://www.kernel.org/doc/Documentation/ABI/testing/sysfs-memory-page-offline
>
> This interface only exists when CONFIG_MEMORY_HOTPLUG is enabled, but
> ARM cannot enable it.
> However, I have read the code and believe that it should not require a
> lot of effort to decouple these two, allowing the interface to exist
> even if mem-hotplug is disabled.

It's all about the /sys/devices/system/memory/ directory, which
traditionally only made sense for memory hotplug. Well, still does to
most degree.

Not sure whether some user space (chmem?) senses for
/sys/devices/system/memory/ to detect memory hotplug capabilities.

But given soft_offline_page is a pure testing mechanism, I wouldn't be
too concerned about that for now.

>
> 2. The syscall madvise with `MADV_SOFT_OFFLINE/MADV_HWPOISON` flags:
>
> According to the documentation, this interface is currently only used for
> testing. However, if the user program can map the specified physical
> address, it can actually be used for memory-failure.

It's mostly a testing-only interface. It could be used for other things,
but really detecting MCE and handling it properly is kernel responsibility.

>
> 3. The CONFIG_HWPOISON_INJECT which depends on CONFIG_MEMORY_FAILURE:
> see https://docs.kernel.org/mm/hwpoison.html
>
> It seems to allow input of physical addresses and trigger memory-failure,
> but according to the doc, it seems to be used only for testing.

Right, all these interfaces are testing only.

>
>
> Additionally, I noticed that in the memory-failure doc
> https://docs.kernel.org/mm/hwpoison.html, it mentions that
> "The main target right now is KVM guests, but it works for all kinds of
> applications." This seems to confirm my speculation that the
> memory-failure feature should not be associated with specific
> architectures or drivers.

Can you go into more details which exact functionality in
memory-failure.c you would be interested in using?

Only soft-offlining or also the other (possibly architecture-specific)
handling?

--
Cheers

David
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 1/2] ARM: mm: support memory-failure
2025-11-03 16:53 ` David Hildenbrand (Red Hat)
@ 2025-11-04 13:48 ` Xie Yuanbin
0 siblings, 0 replies; 11+ messages in thread
From: Xie Yuanbin @ 2025-11-04 13:48 UTC (permalink / raw)
To: david
Cc: Liam.Howlett, akpm, ardb, arnd, dave, david, ebiggers, kees,
liaohua4, lilinjie8, linmiaohe, linux-arm-kernel, linux-kernel,
linux-mm, linux, lorenzo.stoakes, mhocko, nao.horiguchi, nathan,
peterz, rmk+kernel, rostedt, rppt, surenb, vbabka, will,
xieyuanbin1
On Mon, 3 Nov 2025 17:53:18 +0100, David Hildenbrand wrote:
> Can you go into more details which exact functionality in
> memory-failure.c you would be interested in using?
>
> Only soft-offlining or also the other (possibly architecture-specific)
> handling?

Thanks! Let me describe it in as much detail as possible.

The functions in memory-failure.c are currently used in three ways:

1. When an application is using memory and ECC detects a UE
(Uncorrectable Error) bit flip in DRAM (the detection is performed by
hardware and is not visible to software), the hardware reports an
interrupt to the CPU. The relevant driver (a third-party module) has
already registered an interrupt callback function. Depending on the
configuration, the driver either calls `memory_failure_queue()` inside the
callback function, or wakes up a related kthread that calls
`soft_offline_page()`/`memory_failure()` to take the affected memory
offline or kill the process.

2. Hardware memory scanning: the hardware periodically performs
read/write tests on some memory. (This is not standard hardware, so it
is not covered by the ARM specification. The scanning is not visible to
software.) If a bit flip is detected during the test, an interrupt is
reported to the operating system to trigger memory-failure handling, as
described in item 1.

3. Software memory scanning: software (such as a kthread or work queue)
periodically uses `soft_offline_page()` to isolate some free memory and
performs read/write tests on it. If a bit flip is detected during the
test, the test is considered failed and the memory is not returned to
the system. Otherwise, `unpoison_memory()` is used to recover the memory.

Unfortunately, the driver code for these three methods is difficult to
open-source. I have also been thinking about whether there is a
general-purpose function that could use memory-failure, but I haven't
come up with a good idea yet.

> Cheers
>
> David

Thanks!

Xie Yuanbin
^ permalink raw reply [flat|nested] 11+ messages in thread
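The read/write test mentioned in item 3 of the message above could look roughly like the following. This is a userspace sketch under stated assumptions, not the proprietary driver discussed in the thread: the function names `count_flipped_bits()` and `pattern_test()` and the choice of test patterns are hypothetical. In the kernel the buffer would be a page already isolated with `soft_offline_page()`, and a real scanner would also have to make sure the accesses reach DRAM rather than the CPU cache, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the bits that differ between the value written and the value read. */
static unsigned int count_flipped_bits(uint32_t expected, uint32_t observed)
{
	uint32_t diff = expected ^ observed;
	unsigned int n = 0;

	while (diff) {
		diff &= diff - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Hypothetical scan step: write each pattern to every word of the buffer,
 * read it back, and return the total number of bits that did not match.
 * A result of 0 means no bit flip was observed.
 */
static size_t pattern_test(volatile uint32_t *buf, size_t nwords)
{
	static const uint32_t patterns[] = {
		0x00000000u, 0xFFFFFFFFu, 0xAAAAAAAAu, 0x55555555u,
	};
	size_t flipped = 0;
	size_t p, i;

	assert(buf != NULL);

	for (p = 0; p < sizeof(patterns) / sizeof(patterns[0]); p++) {
		for (i = 0; i < nwords; i++)
			buf[i] = patterns[p];
		for (i = 0; i < nwords; i++)
			flipped += count_flipped_bits(patterns[p], buf[i]);
	}
	return flipped;
}
```

In the flow described in the message, a non-zero result would correspond to keeping the page offline, while a zero result would correspond to handing the page back with `unpoison_memory()`.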
* Re: [RFC PATCH 1/2] ARM: mm: support memory-failure
2025-09-22 2:14 [RFC PATCH 1/2] ARM: mm: support memory-failure Xie Yuanbin
2025-09-22 2:14 ` [RFC PATCH 2/2] ARM: memory-failure: not select RAS and MEMORY_ISOLATION Xie Yuanbin
2025-09-22 6:37 ` Arnd Bergmann
@ 2025-10-22 3:58 ` Xie Yuanbin
2 siblings, 0 replies; 11+ messages in thread
From: Xie Yuanbin @ 2025-10-22 3:58 UTC (permalink / raw)
To: linux, akpm, david, lorenzo.stoakes, Liam.Howlett, vbabka, rppt,
surenb, mhocko, linmiaohe, nao.horiguchi, rmk+kernel, ardb,
nathan, ebiggers, arnd, rostedt, kees, dave, peterz
Cc: will, linux-arm-kernel, linux-kernel, linux-mm, liaohua4,
lilinjie8, xieyuanbin1
Hello,

These patches seem to have not been discussed for a month.
Does anyone have any other opinions?
https://lore.kernel.org/all/20250923041005.9831-1-xieyuanbin1@huawei.com/t/#u

Xie Yuanbin
^ permalink raw reply [flat|nested] 11+ messages in thread