linux-mm.kvack.org archive mirror
* [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages
@ 2026-04-15 12:54 Breno Leitao
  2026-04-15 12:55 ` [PATCH v4 1/3] mm/memory-failure: report MF_MSG_KERNEL for reserved pages Breno Leitao
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Breno Leitao @ 2026-04-15 12:54 UTC (permalink / raw)
  To: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
	Shuah Khan, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko
  Cc: linux-mm, linux-kernel, linux-doc, Breno Leitao, kernel-team

When the memory failure handler encounters an in-use kernel page that it
cannot recover (slab, page tables, kernel stacks, vmalloc, etc.), it
currently logs the error as "Ignored" and continues operation.

This leaves corrupted data accessible to the kernel, which will inevitably
cause either silent data corruption or a delayed crash when the poisoned memory
is next accessed.

This is a common problem on large fleets. We frequently observe multi-bit ECC
errors hitting kernel slab pages, where memory_failure() fails to recover them
and the system crashes later at an unrelated code path, making root cause
analysis unnecessarily difficult.

Here is one specific example from production on an arm64 server: a multi-bit
ECC error hit a dentry cache slab page, memory_failure() failed to recover it
(slab pages are not supported by the hwpoison recovery mechanism), and 67
seconds later d_lookup() accessed the poisoned cache line causing
a synchronous external abort:

    [88690.479680] [Hardware Error]: error_type: 3, multi-bit ECC
    [88690.498473] Memory failure: 0x40272d: unhandlable page.
    [88690.498619] Memory failure: 0x40272d: recovery action for
                   get hwpoison page: Ignored
    ...
    [88757.847126] Internal error: synchronous external abort:
                   0000000096000410 [#1] SMP
    [88758.061075] pc : d_lookup+0x5c/0x220

This series adds a new sysctl vm.panic_on_unrecoverable_memory_failure
(default 0) that, when enabled, panics immediately on unrecoverable
memory failures. This provides a clean crash dump at the time of the
error, which is far more useful for diagnosis than a random crash later
at an unrelated code path.

This also categorizes reserved pages as MF_MSG_KERNEL, and panics on
unknown page types (MF_MSG_UNKNOWN).

Note that dynamically allocated kernel memory (SLAB/SLUB, vmalloc,
kernel stacks, page tables) shares the MF_MSG_GET_HWPOISON return path
with transient refcount races, so it is intentionally excluded from the
panic conditions to avoid false positives.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v4:
- Drop CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option.
- Split the reserved page classification (MF_MSG_KERNEL) into its own
  patch, separate from the panic mechanism.
- Document why the buddy allocator TOCTOU race (between
  get_hwpoison_page() and is_free_buddy_page()) cannot cause false
  positives: PG_hwpoison is set beforehand and check_new_page() in the
  page allocator rejects hwpoisoned pages.
- Document the narrow LRU isolation race window for MF_MSG_UNKNOWN and
  its mitigation via identify_page_state()'s two-pass design.
- Explicitly document why MF_MSG_GET_HWPOISON is excluded from the
  panic conditions (shared path with transient races and non-reserved
  kernel memory).
- Link to v3: https://patch.msgid.link/20260413-ecc_panic-v3-0-1dcbb2f12bc4@debian.org

Changes in v3:
- Rename is_unrecoverable_memory_failure() to panic_on_unrecoverable_mf()
  as suggested by maintainer.
- Add CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option,
  similar to CONFIG_BOOTPARAM_HARDLOCKUP_PANIC.
- Add documentation for the sysctl and CONFIG option.
- Add code comments documenting the panic condition design rationale and
  how the retry mechanism mitigates false positives from buddy allocator
  races.
- Link to v2: https://patch.msgid.link/20260331-ecc_panic-v2-0-9e40d0f64f7a@debian.org

Changes in v2:
- Panic on MF_MSG_KERNEL, MF_MSG_KERNEL_HIGH_ORDER and MF_MSG_UNKNOWN
  instead of MF_MSG_GET_HWPOISON.
- Report MF_MSG_KERNEL for reserved pages when get_hwpoison_page() fails
  instead of MF_MSG_GET_HWPOISON.
- Link to v1: https://patch.msgid.link/20260323-ecc_panic-v1-0-72a1921726c5@debian.org

---
Breno Leitao (3):
      mm/memory-failure: report MF_MSG_KERNEL for reserved pages
      mm/memory-failure: add panic option for unrecoverable pages
      Documentation: document panic_on_unrecoverable_memory_failure sysctl

 Documentation/admin-guide/sysctl/vm.rst | 37 +++++++++++++
 mm/memory-failure.c                     | 92 ++++++++++++++++++++++++++++++++-
 2 files changed, 128 insertions(+), 1 deletion(-)
---
base-commit: e6efabc0afca02efa263aba533f35d90117ab283
change-id: 20260323-ecc_panic-4e473b83087c

Best regards,
--  
Breno Leitao <leitao@debian.org>



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v4 1/3] mm/memory-failure: report MF_MSG_KERNEL for reserved pages
  2026-04-15 12:54 [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
@ 2026-04-15 12:55 ` Breno Leitao
  2026-04-15 12:55 ` [PATCH v4 2/3] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Breno Leitao @ 2026-04-15 12:55 UTC (permalink / raw)
  To: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
	Shuah Khan, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko
  Cc: linux-mm, linux-kernel, linux-doc, Breno Leitao, kernel-team

When get_hwpoison_page() returns a negative value, distinguish
reserved pages from other failure cases by reporting MF_MSG_KERNEL
instead of MF_MSG_GET_HWPOISON. Reserved pages belong to the kernel
and should be classified accordingly for proper handling.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ee42d43613097..7b67e43dafbd1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2432,7 +2432,16 @@ int memory_failure(unsigned long pfn, int flags)
 		}
 		goto unlock_mutex;
 	} else if (res < 0) {
-		res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
+		/*
+		 * PageReserved is stable here: reserved pages have
+		 * PG_reserved set at boot or by drivers and are never
+		 * freed through the page allocator.
+		 */
+		if (PageReserved(p))
+			res = action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
+		else
+			res = action_result(pfn, MF_MSG_GET_HWPOISON,
+					    MF_IGNORED);
 		goto unlock_mutex;
 	}
 

-- 
2.52.0




* [PATCH v4 2/3] mm/memory-failure: add panic option for unrecoverable pages
  2026-04-15 12:54 [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
  2026-04-15 12:55 ` [PATCH v4 1/3] mm/memory-failure: report MF_MSG_KERNEL for reserved pages Breno Leitao
@ 2026-04-15 12:55 ` Breno Leitao
  2026-04-15 12:55 ` [PATCH v4 3/3] Documentation: document panic_on_unrecoverable_memory_failure sysctl Breno Leitao
  2026-04-15 20:56 ` [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages Jiaqi Yan
  3 siblings, 0 replies; 8+ messages in thread
From: Breno Leitao @ 2026-04-15 12:55 UTC (permalink / raw)
  To: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
	Shuah Khan, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko
  Cc: linux-mm, linux-kernel, linux-doc, Breno Leitao, kernel-team

Add a sysctl panic_on_unrecoverable_memory_failure that triggers a
kernel panic when memory_failure() encounters pages that cannot be
recovered. This provides a clean crash with useful debug information
rather than allowing silent data corruption.

The panic is triggered for three categories of unrecoverable failures,
all requiring result == MF_IGNORED:

- MF_MSG_KERNEL: reserved pages identified via PageReserved.

- MF_MSG_KERNEL_HIGH_ORDER: pages with refcount 0 that are not in the
  buddy allocator (e.g., tail pages of high-order kernel allocations).
  A TOCTOU race between get_hwpoison_page() and is_free_buddy_page()
  is possible when CONFIG_DEBUG_VM is disabled, since check_new_pages()
  is gated by is_check_pages_enabled() and becomes a no-op. Panicking
  is still correct: the physical memory has a hardware error regardless
  of who allocated the page.

- MF_MSG_UNKNOWN: pages that do not match any known recoverable state
  in error_states[]. A theoretical false positive from concurrent LRU
  isolation is mitigated by identify_page_state()'s two-pass design
  which rechecks using saved page_flags.

MF_MSG_GET_HWPOISON is intentionally excluded: it covers both
non-reserved kernel memory (SLAB/SLUB, vmalloc, kernel stacks, page
tables) and transient refcount races, so panicking would risk false
positives.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 mm/memory-failure.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 81 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7b67e43dafbd1..311344f332449 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -74,6 +74,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;
 
 static int sysctl_enable_soft_offline __read_mostly = 1;
 
+static int sysctl_panic_on_unrecoverable_mf __read_mostly;
+
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
 static bool hw_memory_failure __read_mostly = false;
@@ -155,6 +157,15 @@ static const struct ctl_table memory_failure_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
+	},
+	{
+		.procname	= "panic_on_unrecoverable_memory_failure",
+		.data		= &sysctl_panic_on_unrecoverable_mf,
+		.maxlen		= sizeof(sysctl_panic_on_unrecoverable_mf),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
 	}
 };
 
@@ -1281,6 +1292,59 @@ static void update_per_node_mf_stats(unsigned long pfn,
 	++mf_stats->total;
 }
 
+/*
+ * Determine whether to panic on an unrecoverable memory failure.
+ *
+ * Design rationale: opt for an immediate panic on kernel memory failures,
+ * capturing a clean crash at the error site rather than a random crash
+ * later, when an MF_IGNORED page is eventually accessed.
+ *
+ * This panics on three categories of failures (all requiring result ==
+ * MF_IGNORED, meaning the page was not recovered):
+ *
+ * - MF_MSG_KERNEL: Reserved pages (identified via PageReserved) that belong
+ *   to the kernel and cannot be recovered.
+ *
+ * - MF_MSG_KERNEL_HIGH_ORDER: Pages that get_hwpoison_page() observed as free
+ *   (refcount 0) but are not in the buddy allocator. These are kernel pages
+ *   in a transient state between allocation and freeing. A TOCTOU race
+ *   (page allocated between get_hwpoison_page() and is_free_buddy_page())
+ *   is possible when CONFIG_DEBUG_VM is disabled, since check_new_pages()
+ *   is gated by is_check_pages_enabled() and becomes a no-op. However,
+ *   panicking is still correct in this case: the physical memory has a
+ *   hardware error, so an allocated hwpoisoned page is unrecoverable.
+ *
+ * - MF_MSG_UNKNOWN: Pages that reached identify_page_state() but did not
+ *   match any known recoverable state in error_states[]. This is the
+ *   catch-all for pages whose flags do not indicate a recoverable user or
+ *   cache page (no LRU, no swapcache, no mlock, etc). A theoretical false
+ *   positive exists if concurrent LRU isolation clears PG_lru between
+ *   folio_lock() and saving page_flags, but this window is very narrow and
+ *   mitigated by identify_page_state()'s two-pass design which rechecks
+ *   using saved page_flags.
+ *
+ * Pages intentionally NOT included:
+ * - MF_MSG_GET_HWPOISON: get_hwpoison_page() failure on non-reserved pages.
+ *   This includes dynamically allocated kernel memory (SLAB/SLUB, vmalloc,
+ *   kernel stacks, page tables) which are not PageReserved and fail
+ *   get_hwpoison_page() with -EBUSY/-EIO. These share the return path with
+ *   transient refcount races, so panicking here would risk false positives.
+ *
+ * Note: Some transient races in the buddy allocator path are mitigated by
+ * memory_failure()'s retry mechanism. When take_page_off_buddy() fails,
+ * the code clears PageHWPoison and retries the entire memory_failure()
+ * flow, allowing pages to be properly reclassified with updated flags.
+ */
+static bool panic_on_unrecoverable_mf(enum mf_action_page_type type,
+				      enum mf_result result)
+{
+	return sysctl_panic_on_unrecoverable_mf &&
+	       result == MF_IGNORED &&
+	       (type == MF_MSG_KERNEL ||
+		type == MF_MSG_KERNEL_HIGH_ORDER ||
+		type == MF_MSG_UNKNOWN);
+}
+
 /*
  * "Dirty/Clean" indication is not 100% accurate due to the possibility of
  * setting PG_dirty outside page lock. See also comment above set_page_dirty().
@@ -1298,6 +1362,9 @@ static int action_result(unsigned long pfn, enum mf_action_page_type type,
 	pr_err("%#lx: recovery action for %s: %s\n",
 		pfn, action_page_types[type], action_name[result]);
 
+	if (panic_on_unrecoverable_mf(type, result))
+		panic("Memory failure: %#lx: unrecoverable page", pfn);
+
 	return (result == MF_RECOVERED || result == MF_DELAYED) ? 0 : -EBUSY;
 }
 
@@ -2428,6 +2495,20 @@ int memory_failure(unsigned long pfn, int flags)
 			}
 			res = action_result(pfn, MF_MSG_BUDDY, res);
 		} else {
+			/*
+			 * The page has refcount 0 but is not in the buddy
+			 * allocator — it is a non-compound high-order kernel
+			 * page (e.g., a tail page of a high-order allocation).
+			 *
+			 * A TOCTOU race where the page transitions from
+			 * free-buddy to allocated between get_hwpoison_page()
+			 * and is_free_buddy_page() is possible when
+			 * CONFIG_DEBUG_VM is disabled (check_new_pages() is
+			 * gated by is_check_pages_enabled() and becomes a
+			 * no-op). Panicking is still correct: the physical
+			 * memory has a hardware error regardless of who
+			 * allocated the page.
+			 */
 			res = action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
 		}
 		goto unlock_mutex;

-- 
2.52.0




* [PATCH v4 3/3] Documentation: document panic_on_unrecoverable_memory_failure sysctl
  2026-04-15 12:54 [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
  2026-04-15 12:55 ` [PATCH v4 1/3] mm/memory-failure: report MF_MSG_KERNEL for reserved pages Breno Leitao
  2026-04-15 12:55 ` [PATCH v4 2/3] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
@ 2026-04-15 12:55 ` Breno Leitao
  2026-04-15 20:56 ` [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages Jiaqi Yan
  3 siblings, 0 replies; 8+ messages in thread
From: Breno Leitao @ 2026-04-15 12:55 UTC (permalink / raw)
  To: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
	Shuah Khan, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko
  Cc: linux-mm, linux-kernel, linux-doc, Breno Leitao, kernel-team

Add documentation for the new vm.panic_on_unrecoverable_memory_failure
sysctl, describing the three categories of failures that trigger a
panic and noting which kernel page types are not yet covered.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 Documentation/admin-guide/sysctl/vm.rst | 37 +++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 97e12359775c9..592ce9ec38c4b 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -67,6 +67,7 @@ Currently, these files are in /proc/sys/vm:
 - page-cluster
 - page_lock_unfairness
 - panic_on_oom
+- panic_on_unrecoverable_memory_failure
 - percpu_pagelist_high_fraction
 - stat_interval
 - stat_refresh
@@ -925,6 +926,42 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
 why oom happens. You can get snapshot.
 
 
+panic_on_unrecoverable_memory_failure
+======================================
+
+When a hardware memory error (e.g. multi-bit ECC) hits a kernel page
+that cannot be recovered by the memory failure handler, the default
+behaviour is to ignore the error and continue operation.  This is
+dangerous because the corrupted data remains accessible to the kernel,
+risking silent data corruption or a delayed crash when the poisoned
+memory is next accessed.
+
+When enabled, this sysctl triggers a panic on three categories of
+unrecoverable failures: reserved kernel pages, non-buddy kernel pages
+with zero refcount (e.g. tail pages of high-order allocations), and
+pages whose state cannot be classified as recoverable.
+
+Note that some kernel page types — such as slab objects, vmalloc
+allocations, kernel stacks, and page tables — share a failure path
+with transient refcount races and are not currently covered by this
+option. That is, do not panic when not confident of the page status.
+
+For many environments it is preferable to panic immediately with a clean
+crash dump that captures the original error context, rather than to
+continue and face a random crash later whose cause is difficult to
+diagnose.
+
+= =====================================================================
+0 Try to continue operation (default).
+1 Panic immediately.  If the ``panic`` sysctl is also non-zero then the
+  machine will be rebooted.
+= =====================================================================
+
+Example::
+
+     echo 1 > /proc/sys/vm/panic_on_unrecoverable_memory_failure
+
+
 percpu_pagelist_high_fraction
 =============================
 

-- 
2.52.0




* Re: [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages
  2026-04-15 12:54 [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
                   ` (2 preceding siblings ...)
  2026-04-15 12:55 ` [PATCH v4 3/3] Documentation: document panic_on_unrecoverable_memory_failure sysctl Breno Leitao
@ 2026-04-15 20:56 ` Jiaqi Yan
  2026-04-16 15:32   ` Breno Leitao
  3 siblings, 1 reply; 8+ messages in thread
From: Jiaqi Yan @ 2026-04-15 20:56 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
	Shuah Khan, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel, linux-doc, kernel-team

Hi Breno,

On Wed, Apr 15, 2026 at 5:55 AM Breno Leitao <leitao@debian.org> wrote:
>
> When the memory failure handler encounters an in-use kernel page that it
> cannot recover (slab, page tables, kernel stacks, vmalloc, etc.), it
> currently logs the error as "Ignored" and continues operation.
>
> This leaves corrupted data accessible to the kernel, which will inevitably
> cause either silent data corruption or a delayed crash when the poisoned memory
> is next accessed.
>
> This is a common problem on large fleets. We frequently observe multi-bit ECC
> errors hitting kernel slab pages, where memory_failure() fails to recover them
> and the system crashes later at an unrelated code path, making root cause
> analysis unnecessarily difficult.
>
> Here is one specific example from production on an arm64 server: a multi-bit
> ECC error hit a dentry cache slab page, memory_failure() failed to recover it
> (slab pages are not supported by the hwpoison recovery mechanism), and 67
> seconds later d_lookup() accessed the poisoned cache line causing
> a synchronous external abort:
>
>     [88690.479680] [Hardware Error]: error_type: 3, multi-bit ECC
>     [88690.498473] Memory failure: 0x40272d: unhandlable page.
>     [88690.498619] Memory failure: 0x40272d: recovery action for
>                    get hwpoison page: Ignored
>     ...
>     [88757.847126] Internal error: synchronous external abort:
>                    0000000096000410 [#1] SMP
>     [88758.061075] pc : d_lookup+0x5c/0x220
>
> This series adds a new sysctl vm.panic_on_unrecoverable_memory_failure
> (default 0) that, when enabled, panics immediately on unrecoverable
> memory failures. This provides a clean crash dump at the time of the

I get the fail-fast part, but I wonder whether the kernel will really be
able to provide a clean crash dump that is useful for diagnosis.

In your example at 88757.847126, the kernel was handling an SEA and,
because it was in kernel context, eventually had to die(). Neither your
patch nor memory-failure has any role to play there. But at least the
SEA handling tried its best to show the kernel code that consumed the
memory error.

So your code should apply to the memory failure handling at
88690.498473, which was likely triggered from APEI GHES poison
detection (I guess the example is from arm64). Anything except SEA is
considered non-synchronous (per APEI's is_hest_sync_notify()). If the
kernel panics there, I guess it will be in a random process context or
a kworker thread? How useful is that for diagnosis? Just the exact
time the error was detected (which the kernel already logs)?

On x86, for UCNA or SRAO type machine check exceptions, I think the
panic introduced by your patch would also happen in a random process
context or a kworker thread.

Can you share some clean crash dumps from your testing that show they
are more useful than the crash at SEA? Thanks!

> error, which is far more useful for diagnosis than a random crash later
> at an unrelated code path.
>
> This also categorizes reserved pages as MF_MSG_KERNEL, and panics on
> unknown page types (MF_MSG_UNKNOWN).
>
> Note that dynamically allocated kernel memory (SLAB/SLUB, vmalloc,
> kernel stacks, page tables) shares the MF_MSG_GET_HWPOISON return path
> with transient refcount races, so it is intentionally excluded from the
> panic conditions to avoid false positives.
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> Changes in v4:
> - Drop CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option.
> - Split the reserved page classification (MF_MSG_KERNEL) into its own
>   patch, separate from the panic mechanism.
> - Document why the buddy allocator TOCTOU race (between
>   get_hwpoison_page() and is_free_buddy_page()) cannot cause false
>   positives: PG_hwpoison is set beforehand and check_new_page() in the
>   page allocator rejects hwpoisoned pages.
> - Document the narrow LRU isolation race window for MF_MSG_UNKNOWN and
>   its mitigation via identify_page_state()'s two-pass design.
> - Explicitly document why MF_MSG_GET_HWPOISON is excluded from the
>   panic conditions (shared path with transient races and non-reserved
>   kernel memory).
> - Link to v3: https://patch.msgid.link/20260413-ecc_panic-v3-0-1dcbb2f12bc4@debian.org
>
> Changes in v3:
> - Rename is_unrecoverable_memory_failure() to panic_on_unrecoverable_mf()
>   as suggested by maintainer.
> - Add CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option,
>   similar to CONFIG_BOOTPARAM_HARDLOCKUP_PANIC.
> - Add documentation for the sysctl and CONFIG option.
> - Add code comments documenting the panic condition design rationale and
>   how the retry mechanism mitigates false positives from buddy allocator
>   races.
> - Link to v2: https://patch.msgid.link/20260331-ecc_panic-v2-0-9e40d0f64f7a@debian.org
>
> Changes in v2:
> - Panic on MF_MSG_KERNEL, MF_MSG_KERNEL_HIGH_ORDER and MF_MSG_UNKNOWN
>   instead of MF_MSG_GET_HWPOISON.
> - Report MF_MSG_KERNEL for reserved pages when get_hwpoison_page() fails
>   instead of MF_MSG_GET_HWPOISON.
> - Link to v1: https://patch.msgid.link/20260323-ecc_panic-v1-0-72a1921726c5@debian.org
>
> ---
> Breno Leitao (3):
>       mm/memory-failure: report MF_MSG_KERNEL for reserved pages
>       mm/memory-failure: add panic option for unrecoverable pages
>       Documentation: document panic_on_unrecoverable_memory_failure sysctl
>
>  Documentation/admin-guide/sysctl/vm.rst | 37 +++++++++++++
>  mm/memory-failure.c                     | 92 ++++++++++++++++++++++++++++++++-
>  2 files changed, 128 insertions(+), 1 deletion(-)
> ---
> base-commit: e6efabc0afca02efa263aba533f35d90117ab283
> change-id: 20260323-ecc_panic-4e473b83087c
>
> Best regards,
> --
> Breno Leitao <leitao@debian.org>
>
>



* Re: [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages
  2026-04-15 20:56 ` [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages Jiaqi Yan
@ 2026-04-16 15:32   ` Breno Leitao
  2026-04-16 16:26     ` Jiaqi Yan
  0 siblings, 1 reply; 8+ messages in thread
From: Breno Leitao @ 2026-04-16 15:32 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
	Shuah Khan, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel, linux-doc, kernel-team

Hi Jiaqi,

On Wed, Apr 15, 2026 at 01:56:35PM -0700, Jiaqi Yan wrote:
> On Wed, Apr 15, 2026 at 5:55 AM Breno Leitao <leitao@debian.org> wrote:
> >
> > When the memory failure handler encounters an in-use kernel page that it
> > cannot recover (slab, page tables, kernel stacks, vmalloc, etc.), it
> > currently logs the error as "Ignored" and continues operation.
> >
> > This leaves corrupted data accessible to the kernel, which will inevitably
> > cause either silent data corruption or a delayed crash when the poisoned memory
> > is next accessed.
> >
> > This is a common problem on large fleets. We frequently observe multi-bit ECC
> > errors hitting kernel slab pages, where memory_failure() fails to recover them
> > and the system crashes later at an unrelated code path, making root cause
> > analysis unnecessarily difficult.
> >
> > Here is one specific example from production on an arm64 server: a multi-bit
> > ECC error hit a dentry cache slab page, memory_failure() failed to recover it
> > (slab pages are not supported by the hwpoison recovery mechanism), and 67
> > seconds later d_lookup() accessed the poisoned cache line causing
> > a synchronous external abort:
> >
> >     [88690.479680] [Hardware Error]: error_type: 3, multi-bit ECC
> >     [88690.498473] Memory failure: 0x40272d: unhandlable page.
> >     [88690.498619] Memory failure: 0x40272d: recovery action for
> >                    get hwpoison page: Ignored
> >     ...
> >     [88757.847126] Internal error: synchronous external abort:
> >                    0000000096000410 [#1] SMP
> >     [88758.061075] pc : d_lookup+0x5c/0x220
> >
> > This series adds a new sysctl vm.panic_on_unrecoverable_memory_failure
> > (default 0) that, when enabled, panics immediately on unrecoverable
> > memory failures. This provides a clean crash dump at the time of the
>
> I get the fail-fast part, but wonder will kernel really be able to
> provide clean crash dump useful for diagnosis?

Yes, the kernel does provide a useful crash dump. With the sysctl enabled,
here's what I observe:

	Kernel panic - not syncing: Memory failure: 0x1: unrecoverable page
	CPU: 40 UID: 0 PID: 682 Comm: bash Tainted: G B  7.0.0-next-20260414-upstream-00004-gcbb3af7bfd3b #93
	Tainted: [B]=BAD_PAGE

	Call Trace:
	 <TASK>
	 vpanic+0x399/0x700
	 panic+0xb4/0xc0
	 action_result+0x278/0x340          ← your new panic call site
	 memory_failure+0x152b/0x1c80


Without the patch (or with the sysctl disabled), you only get:

	Memory failure: 0x1: unhandlable page.
	Memory failure: 0x1: recovery action for reserved kernel page: Ignored

Then the host continues running until it eventually accesses that poisoned
memory, triggering a generic error similar to the d_lookup() case mentioned
above.

> In your example at 88757.847126, kernel was handling SEA and because
> we are under kernel context, eventually has to die(). Apparently not
> only your patch, but also memory-failure has no role to play there.
> But at least SEA handling tried its best to show the kernel code that
> consumed the memory error.
>
> So your code should apply to the memory failure handling at
> 88690.498473, which is likely triggered from APEI GHES for poison
> detection (I guess the example is from ARM64). Anything except SEA is
> considered not synchronous (by APEI is_hest_sync_notify()). If kernel
> panics there, I guess it will be in a random process context or a
> kworker thread? How useful is it for diagnosis? Just the exact time an
> error detected (which is already logged by kernel)?

The kernel panics with a clear stack trace and explicit reason, making it
straightforward to correlate and analyze the failure.

My objective is to have a clean, immediate crash rather than allowing the
system to continue running and potentially crash later (if at all).

Working at a hyperscaler, I regularly see thousands of these "unhandlable
page" messages, followed by later kernel crashes when the corrupted memory
is eventually accessed.

> On X86, for UCNA or SRAO type machine check exceptions, I think with
> your patch the panic would also happen in random process context or
> kworker thread,
>
> Can you share some clean crash dumps from your testing that show they
> are more useful than the crash at SEA? Thanks!

Certainly, here is the complete crash dump from the example above. This
happened on real production hardware:

	[88690.478913] [ T593001] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 784
	[88690.479097] [ T593001] {1}[Hardware Error]: event severity: recoverable
	[88690.479184] [ T593001] {1}[Hardware Error]:  imprecise tstamp: 2026-03-20 13:13:08
	[88690.479282] [ T593001] {1}[Hardware Error]:  Error 0, type: recoverable
	[88690.479359] [ T593001] {1}[Hardware Error]:   section_type: memory error
	[88690.479424] [ T593001] {1}[Hardware Error]:   physical_address: 0x00000040272d5080
	[88690.479503] [ T593001] {1}[Hardware Error]:   physical_address_mask: 0xfffffffffffff000
	[88690.479606] [ T593001] {1}[Hardware Error]:   node:0 card:0 module:1 rank:1 bank:13 device:6 row:64114 column:832 requestor_id:0x0000000000000027 
	[88690.479680] [ T593001] {1}[Hardware Error]:   error_type: 3, multi-bit ECC
	[88690.479754] [ T593001] {1}[Hardware Error]:   DIMM location: not present. DMI handle: 0x000e 
	[88690.479882] [ T593001] EDAC MC0: 1 UE multi-bit ECC on unknown memory (node:0 card:0 module:1 rank:1 bank:13 device:6 row:64114 column:832 requestor_id:0x0000000000000027 DIMM location: not present. DMI handle: 0x000e page:0x40272d offset:0x5080 grain:4096 - APEI location: node:0 card:0 module:1 rank:1 bank:13 device:6 row:64114 column:832 requestor_id:0x0000000000000027 DIMM location: not present. DMI handle: 0x000e)
	[88690.498473] [ T593001] Memory failure: 0x40272d: unhandlable page.
	[88690.498619] [ T593001] Memory failure: 0x40272d: recovery action for get hwpoison page: Ignored
	[88757.847126] [ T640437] Internal error: synchronous external abort: 0000000096000410 [#1]  SMP
	[88757.867131] [ T640437] Modules linked in: ghes_edac(E) act_gact(E) sch_fq(E) tcp_diag(E) inet_diag(E) cls_bpf(E) mlx5_ib(E) sm3_ce(E) sha3_ce(E) sha512_ce(E) ipmi_ssif(E) ipmi_devintf(E) nvidia_cspmu(E) ib_uverbs(E) cppc_cpufreq(E) coresight_etm4x(E) coresight_stm(E) ipmi_msghandler(E) coresight_trbe(E) arm_cspmu_module(E) arm_smmuv3_pmu(E) arm_spe_pmu(E) stm_core(E) coresight_tmc(E) coresight_funnel(E) coresight(E) bpf_preload(E) sch_fq_codel(E) ip_tables(E) ip6_tables(E) vhost_net(E) tun(E) vhost(E) vhost_iotlb(E) tap(E) tls(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) acpi_power_meter(E) loop(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) autofs4(E) raid0(E) efivarfs(E) dm_crypt(E)
	[88757.991191] [ T640437] CPU: 70 UID: 34133 PID: 640437 Comm: Collection-20 Kdump: loaded Tainted: G   M        E       6.16.1-0_fbk2_0_gf40efc324cc8 #1 NONE 
	[88758.017569] [ T640437] Tainted: [M]=MACHINE_CHECK, [E]=UNSIGNED_MODULE
	[88758.028860] [ T640437] Hardware name: ....
	[88758.046969] [ T640437] pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
	[88758.061075] [ T640437] pc : d_lookup+0x5c/0x220
	[88758.068392] [ T640437] lr : try_lookup_noperm+0x30/0x50
	[88758.077088] [ T640437] sp : ffff800138cafc30
	[88758.083827] [ T640437] x29: ffff800138cafc40 x28: ffff0001dcfe8bc0 x27: 00000000bc0a11f7
	[88758.098321] [ T640437] x26: 00000000000ee00c x25: ffffffffffffffff x24: 0000000000000001
	[88758.112807] [ T640437] x23: ffff003fa14d0000 x22: ffff8000828d3740 x21: ffff800138cafde8
	[88758.127281] [ T640437] x20: ffff0000d0316fc0 x19: ffff800138cafce0 x18: 0001000000000000
	[88758.141753] [ T640437] x17: 0000000000000001 x16: 0000000001ffffff x15: dfc038a300003936
	[88758.156226] [ T640437] x14: 00000000fffffffa x13: ffffffffffffffff x12: ffff0000d0316fc0
	[88758.170695] [ T640437] x11: 61c8864680b583eb x10: 0000000000000039 x9 : ffff800080fcfd68
	[88758.185170] [ T640437] x8 : ffff003fa72d5088 x7 : 0000000000000000 x6 : ffff800138cafd58
	[88758.199645] [ T640437] x5 : ffff0001dcfe8bc0 x4 : ffff80008104a330 x3 : 0000000000000002
	[88758.214111] [ T640437] x2 : ffff800138cafd4d x1 : ffff800138cafce0 x0 : ffff0000d0316fc0
	[88758.228579] [ T640437] Call trace:
	[88758.233565] [ T640437]  d_lookup+0x5c/0x220 (P)
	[88758.240864] [ T640437]  try_lookup_noperm+0x30/0x50
	[88758.248868] [ T640437]  proc_fill_cache+0x54/0x140
	[88758.256696] [ T640437]  proc_readfd_common+0x138/0x1e8
	[88758.265222] [ T640437]  proc_fd_iterate.llvm.7260857650841435759+0x1c/0x30
	[88758.277248] [ T640437]  iterate_dir+0x84/0x228
	[88758.284354] [ T640437]  __arm64_sys_getdents64+0x5c/0x110
	[88758.293383] [ T640437]  invoke_syscall+0x4c/0xd0
	[88758.300843] [ T640437]  do_el0_svc+0x80/0xb8
	[88758.307599] [ T640437]  el0_svc+0x30/0xf0
	[88758.313820] [ T640437]  el0t_64_sync_handler+0x70/0x100
	[88758.322497] [ T640437]  el0t_64_sync+0x17c/0x180
	...

And my clean crash would look like the following:

	[ 1096.480523] Memory failure: 0x2: recovery action for reserved kernel page: Ignored
	[ 1096.480751] Kernel panic - not syncing: Memory failure: 0x2: unrecoverable page
	[ 1096.480760] CPU: 5 UID: 0 PID: 683 Comm: bash Tainted: G    B               7.0.0-next-20260414-upstream-00004-gcbb3af7bfd3b #93 PREEMPTLAZY
	[ 1096.480768] Tainted: [B]=BAD_PAGE
	[ 1096.480774] Call Trace:
	[ 1096.480778]  <TASK>
	[ 1096.480782]  vpanic+0x399/0x700
	[ 1096.480821]  panic+0xb4/0xc0
	[ 1096.480849]  action_result+0x278/0x340
	[ 1096.480857]  memory_failure+0x152b/0x1c80
	[ 1096.480925]  hwpoison_inject+0x3a6/0x3f0 [hwpoison_inject]
	....


Isn't the clean approach way better than the random, delayed one?

For testing, I use this simple procedure, in case you want to play with
it:
	# modprobe hwpoison-inject
	# sysctl -w vm.panic_on_unrecoverable_memory_failure=1
	# echo 1 > /sys/kernel/debug/hwpoison/corrupt-pfn


Thanks for the review and good discussion,
--breno



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages
  2026-04-16 15:32   ` Breno Leitao
@ 2026-04-16 16:26     ` Jiaqi Yan
  2026-04-17  9:10       ` Breno Leitao
  0 siblings, 1 reply; 8+ messages in thread
From: Jiaqi Yan @ 2026-04-16 16:26 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
	Shuah Khan, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel, linux-doc, kernel-team

On Thu, Apr 16, 2026 at 8:32 AM Breno Leitao <leitao@debian.org> wrote:
>
> Hi Jiaqi,
>
> On Wed, Apr 15, 2026 at 01:56:35PM -0700, Jiaqi Yan wrote:
> > On Wed, Apr 15, 2026 at 5:55 AM Breno Leitao <leitao@debian.org> wrote:
> > >
> > > When the memory failure handler encounters an in-use kernel page that it
> > > cannot recover (slab, page tables, kernel stacks, vmalloc, etc.), it
> > > currently logs the error as "Ignored" and continues operation.
> > >
> > > This leaves corrupted data accessible to the kernel, which will inevitably
> > > cause either silent data corruption or a delayed crash when the poisoned memory
> > > is next accessed.
> > >
> > > This is a common problem on large fleets. We frequently observe multi-bit ECC
> > > errors hitting kernel slab pages, where memory_failure() fails to recover them
> > > and the system crashes later at an unrelated code path, making root cause
> > > analysis unnecessarily difficult.
> > >
> > > Here is one specific example from production on an arm64 server: a multi-bit
> > > ECC error hit a dentry cache slab page, memory_failure() failed to recover it
> > > (slab pages are not supported by the hwpoison recovery mechanism), and 67
> > > seconds later d_lookup() accessed the poisoned cache line causing
> > > a synchronous external abort:
> > >
> > >     [88690.479680] [Hardware Error]: error_type: 3, multi-bit ECC
> > >     [88690.498473] Memory failure: 0x40272d: unhandlable page.
> > >     [88690.498619] Memory failure: 0x40272d: recovery action for
> > >                    get hwpoison page: Ignored
> > >     ...
> > >     [88757.847126] Internal error: synchronous external abort:
> > >                    0000000096000410 [#1] SMP
> > >     [88758.061075] pc : d_lookup+0x5c/0x220
> > >
> > > This series adds a new sysctl vm.panic_on_unrecoverable_memory_failure
> > > (default 0) that, when enabled, panics immediately on unrecoverable
> > > memory failures. This provides a clean crash dump at the time of the
> >
> > I get the fail-fast part, but wonder will kernel really be able to
> > provide clean crash dump useful for diagnosis?
>
> Yes, the kernel does provide a useful crash dump. With the sysctl enabled,
> here's what I observe:
>
>         Kernel panic - not syncing: Memory failure: 0x1: unrecoverable page
>         CPU: 40 UID: 0 PID: 682 Comm: bash Tainted: G B  7.0.0-next-20260414-upstream-00004-gcbb3af7bfd3b #93
>         Tainted: [B]=BAD_PAGE
>
>         Call Trace:
>          <TASK>
>          vpanic+0x399/0x700
>          panic+0xb4/0xc0
>          action_result+0x278/0x340          ← your new panic call site
>          memory_failure+0x152b/0x1c80
>
>
> Without the patch (or with the sysctl disabled), you only get:
>
>         Memory failure: 0x1: unhandlable page.
>         Memory failure: 0x1: recovery action for reserved kernel page: Ignored
>
> Then the host continues running until it eventually accesses that poisoned
> memory, triggering a generic error similar to the d_lookup() case mentioned
> above.
>
> > In your example at 88757.847126, kernel was handling SEA and because
> > we are under kernel context, eventually has to die(). Apparently not
> > only your patch, but also memory-failure has no role to play there.
> > But at least SEA handling tried its best to show the kernel code that
> > consumed the memory error.
> >
> > So your code should apply to the memory failure handling at
> > 88690.498473, which is likely triggered from APEI GHES for poison
> > detection (I guess the example is from ARM64). Anything except SEA is
> > considered not synchronous (by APEI is_hest_sync_notify()). If kernel
> > panics there, I guess it will be in a random process context or a
> > kworker thread? How useful is it for diagnosis? Just the exact time an
> > error detected (which is already logged by kernel)?
>
> The kernel panics with a clear stack trace and explicit reason, making it
> straightforward to correlate and analyze the failure.

So we will always get the same stack trace below, right?

          panic+0xb4/0xc0
          action_result+0x278/0x340
          memory_failure+0x152b/0x1c80

IIUC, this stack trace itself doesn't provide any useful information
about the memory error, right? What exactly can we use from the stack
trace? It is just a side-effect that we failed immediately.

You can still correlate the failure with "Memory failure: 0x1: unhandlable
page" and keep running until the actual fatal poison consumption takes
down the system. The drawback is that these will be cascading events that
can be "noisy". What I see here is a choice between failing fast and
failing safe.

>
> My objective is to have a clean, immediate crash rather than allowing the
> system to continue running and potentially crash later (if at all).
>
> Working at a hyperscaler, I regularly see thousands of these "unhandlable
> page" messages, followed by later kernel crashes when the corrupted memory
> is eventually accessed.
>
> > On X86, for UCNA or SRAO type machine check exceptions, I think with
> > your patch the panic would also happen in random process context or
> > kworker thread,
> >
> > Can you share some clean crash dumps from your testing that show they
> > are more useful than the crash at SEA? Thanks!
>
> Certainly, here is the complete crash dump from the example above. This
> happened on a real production hardware:
>
>         [88690.478913] [ T593001] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 784
>         [88690.479097] [ T593001] {1}[Hardware Error]: event severity: recoverable
>         [88690.479184] [ T593001] {1}[Hardware Error]:  imprecise tstamp: 2026-03-20 13:13:08
>         [88690.479282] [ T593001] {1}[Hardware Error]:  Error 0, type: recoverable
>         [88690.479359] [ T593001] {1}[Hardware Error]:   section_type: memory error
>         [88690.479424] [ T593001] {1}[Hardware Error]:   physical_address: 0x00000040272d5080
>         [88690.479503] [ T593001] {1}[Hardware Error]:   physical_address_mask: 0xfffffffffffff000
>         [88690.479606] [ T593001] {1}[Hardware Error]:   node:0 card:0 module:1 rank:1 bank:13 device:6 row:64114 column:832 requestor_id:0x0000000000000027
>         [88690.479680] [ T593001] {1}[Hardware Error]:   error_type: 3, multi-bit ECC
>         [88690.479754] [ T593001] {1}[Hardware Error]:   DIMM location: not present. DMI handle: 0x000e
>         [88690.479882] [ T593001] EDAC MC0: 1 UE multi-bit ECC on unknown memory (node:0 card:0 module:1 rank:1 bank:13 device:6 row:64114 column:832 requestor_id:0x0000000000000027 DIMM location: not present. DMI handle: 0x000e page:0x40272d offset:0x5080 grain:4096 - APEI location: node:0 card:0 module:1 rank:1 bank:13 device:6 row:64114 column:832 requestor_id:0x0000000000000027 DIMM location: not present. DMI handle: 0x000e)
>         [88690.498473] [ T593001] Memory failure: 0x40272d: unhandlable page.
>         [88690.498619] [ T593001] Memory failure: 0x40272d: recovery action for get hwpoison page: Ignored
>         [88757.847126] [ T640437] Internal error: synchronous external abort: 0000000096000410 [#1]  SMP
>         [88757.867131] [ T640437] Modules linked in: ghes_edac(E) act_gact(E) sch_fq(E) tcp_diag(E) inet_diag(E) cls_bpf(E) mlx5_ib(E) sm3_ce(E) sha3_ce(E) sha512_ce(E) ipmi_ssif(E) ipmi_devintf(E) nvidia_cspmu(E) ib_uverbs(E) cppc_cpufreq(E) coresight_etm4x(E) coresight_stm(E) ipmi_msghandler(E) coresight_trbe(E) arm_cspmu_module(E) arm_smmuv3_pmu(E) arm_spe_pmu(E) stm_core(E) coresight_tmc(E) coresight_funnel(E) coresight(E) bpf_preload(E) sch_fq_codel(E) ip_tables(E) ip6_tables(E) vhost_net(E) tun(E) vhost(E) vhost_iotlb(E) tap(E) tls(E) mpls_gso(E) mpls_iptunnel(E) mpls_router(E) fou(E) acpi_power_meter(E) loop(E) drm(E) backlight(E) drm_panel_orientation_quirks(E) autofs4(E) raid0(E) efivarfs(E) dm_crypt(E)
>         [88757.991191] [ T640437] CPU: 70 UID: 34133 PID: 640437 Comm: Collection-20 Kdump: loaded Tainted: G   M        E       6.16.1-0_fbk2_0_gf40efc324cc8 #1 NONE
>         [88758.017569] [ T640437] Tainted: [M]=MACHINE_CHECK, [E]=UNSIGNED_MODULE
>         [88758.028860] [ T640437] Hardware name: ....
>         [88758.046969] [ T640437] pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
>         [88758.061075] [ T640437] pc : d_lookup+0x5c/0x220
>         [88758.068392] [ T640437] lr : try_lookup_noperm+0x30/0x50
>         [88758.077088] [ T640437] sp : ffff800138cafc30
>         [88758.083827] [ T640437] x29: ffff800138cafc40 x28: ffff0001dcfe8bc0 x27: 00000000bc0a11f7
>         [88758.098321] [ T640437] x26: 00000000000ee00c x25: ffffffffffffffff x24: 0000000000000001
>         [88758.112807] [ T640437] x23: ffff003fa14d0000 x22: ffff8000828d3740 x21: ffff800138cafde8
>         [88758.127281] [ T640437] x20: ffff0000d0316fc0 x19: ffff800138cafce0 x18: 0001000000000000
>         [88758.141753] [ T640437] x17: 0000000000000001 x16: 0000000001ffffff x15: dfc038a300003936
>         [88758.156226] [ T640437] x14: 00000000fffffffa x13: ffffffffffffffff x12: ffff0000d0316fc0
>         [88758.170695] [ T640437] x11: 61c8864680b583eb x10: 0000000000000039 x9 : ffff800080fcfd68
>         [88758.185170] [ T640437] x8 : ffff003fa72d5088 x7 : 0000000000000000 x6 : ffff800138cafd58
>         [88758.199645] [ T640437] x5 : ffff0001dcfe8bc0 x4 : ffff80008104a330 x3 : 0000000000000002
>         [88758.214111] [ T640437] x2 : ffff800138cafd4d x1 : ffff800138cafce0 x0 : ffff0000d0316fc0
>         [88758.228579] [ T640437] Call trace:
>         [88758.233565] [ T640437]  d_lookup+0x5c/0x220 (P)
>         [88758.240864] [ T640437]  try_lookup_noperm+0x30/0x50
>         [88758.248868] [ T640437]  proc_fill_cache+0x54/0x140
>         [88758.256696] [ T640437]  proc_readfd_common+0x138/0x1e8
>         [88758.265222] [ T640437]  proc_fd_iterate.llvm.7260857650841435759+0x1c/0x30
>         [88758.277248] [ T640437]  iterate_dir+0x84/0x228
>         [88758.284354] [ T640437]  __arm64_sys_getdents64+0x5c/0x110
>         [88758.293383] [ T640437]  invoke_syscall+0x4c/0xd0
>         [88758.300843] [ T640437]  do_el0_svc+0x80/0xb8
>         [88758.307599] [ T640437]  el0_svc+0x30/0xf0
>         [88758.313820] [ T640437]  el0t_64_sync_handler+0x70/0x100
>         [88758.322497] [ T640437]  el0t_64_sync+0x17c/0x180
>         ...
>
> And my clean crash would look like the following:
>
>         [ 1096.480523] Memory failure: 0x2: recovery action for reserved kernel page: Ignored
>         [ 1096.480751] Kernel panic - not syncing: Memory failure: 0x2: unrecoverable page
>         [ 1096.480760] CPU: 5 UID: 0 PID: 683 Comm: bash Tainted: G    B               7.0.0-next-20260414-upstream-00004-gcbb3af7bfd3b #93 PREEMPTLAZY
>         [ 1096.480768] Tainted: [B]=BAD_PAGE
>         [ 1096.480774] Call Trace:
>         [ 1096.480778]  <TASK>
>         [ 1096.480782]  vpanic+0x399/0x700
>         [ 1096.480821]  panic+0xb4/0xc0
>         [ 1096.480849]  action_result+0x278/0x340
>         [ 1096.480857]  memory_failure+0x152b/0x1c80
>         [ 1096.480925]  hwpoison_inject+0x3a6/0x3f0 [hwpoison_inject]
>         ....
>
>
> Isn't the clean approach way better than the random, delayed one?

I don't fully agree. In the past, upstream has enhanced many kernel mm
services (e.g. khugepaged, page migration, dump_user_range()) to
recover from memory errors in order to improve system availability,
given that these services or tools can fail safe. Seeing many crashes
pointing to a certain in-kernel service at consumption time helped us
decide which services we should enhance, and which we should
prioritize. Of course not all kernel code can recover from memory
errors, but that doesn't mean knowing which kernel code often caused
the crash isn't useful.

>
> For testing, I use this simple procedure, in case you want to play with
> it:
>         # modprobe hwpoison-inject
>         # sysctl -w vm.panic_on_unrecoverable_memory_failure=1
>         # echo 1 > /sys/kernel/debug/hwpoison/corrupt-pfn
>
>
> Thanks for the review and good discussion,

Anyway, I only have a differing opinion on the usefulness of a static
stack trace. This fail-fast option is good to have. Thanks!

> --breno
>


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages
  2026-04-16 16:26     ` Jiaqi Yan
@ 2026-04-17  9:10       ` Breno Leitao
  0 siblings, 0 replies; 8+ messages in thread
From: Breno Leitao @ 2026-04-17  9:10 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: Miaohe Lin, Naoya Horiguchi, Andrew Morton, Jonathan Corbet,
	Shuah Khan, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko,
	linux-mm, linux-kernel, linux-doc, kernel-team

On Thu, Apr 16, 2026 at 09:26:08AM -0700, Jiaqi Yan wrote:

> So we will always get the same stack trace below, right?
> 
>           panic+0xb4/0xc0
>           action_result+0x278/0x340
>           memory_failure+0x152b/0x1c80
> 
> IIUC, this stack trace itself doesn't provide any useful information
> about the memory error, right? What exactly can we use from the stack
> trace? It is just a side-effect that we failed immediately.

We can use it to correlate problems across a fleet of machines. Let me
share how crash dump analysis works in large datacenters.

There are thousands of crashes a day (and that is a conservative
estimate), and different services try to correlate and categorize them
into a few buckets, something like:

	1. New crash — needs investigation
	2. Known issue — fix is being rolled out
	3. Hardware problem — do not spend engineering time on it

When a machine crashes at a random code path like d_lookup() 67 seconds
after the memory error, the automated triage classifies it as a kernel
bug in VFS/dcache and assigns it to the filesystem team for
investigation. Engineers spend time chasing a bug that doesn't exist in
software — it's a hardware problem.

With the immediate panic at memory_failure(), the stack trace is always
recognizable and can be automatically classified as category 3 (hardware
problem). The static stack trace is the feature, not a limitation: it
gives triage automation a stable signature to match on.

The value isn't in what the stack trace and the panic() tell a human reading
one crash — it's in what they tell automated systems processing thousands of
them.
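
The classification flow described above could be sketched as follows. This is
a toy illustration only: the function name, the buckets, and the signature
strings (apart from the "Memory failure:" panic prefix this series produces)
are assumptions, not any real fleet-triage tooling:

```python
# Toy sketch of fleet-wide crash triage: bucket a crash report by
# matching its panic string against known signatures. All names and
# most signature strings here are illustrative assumptions.

# Signatures that identify a hardware-induced crash. With the proposed
# sysctl, every unrecoverable memory failure panics with a stable
# "Memory failure: ..." string, giving automation one signature to match.
HARDWARE_SIGNATURES = (
    "Memory failure:",
    "Machine check:",
)

# Hypothetical database of already-diagnosed software issues.
KNOWN_ISSUES = (
    "some known panic string",
)

def classify(panic_line: str) -> str:
    """Return the triage bucket for one panic line."""
    if any(sig in panic_line for sig in HARDWARE_SIGNATURES):
        return "hardware problem"   # bucket 3: replace the machine
    if any(sig in panic_line for sig in KNOWN_ISSUES):
        return "known issue"        # bucket 2: fix is being rolled out
    return "new crash"              # bucket 1: needs investigation

# The stable panic string from this series is classified as hardware...
print(classify("Kernel panic - not syncing: Memory failure: 0x2: unrecoverable page"))
# ...while the delayed d_lookup() abort looks like a brand-new kernel bug.
print(classify("Internal error: synchronous external abort: 0000000096000410"))
```

Without the stable panic string, the second case above is exactly what lands
in the "new crash" bucket and gets routed to the wrong team.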

> You can still correlate failure with "Memory failure: 0x1: unhandlable
> page" and keep running until the actual fatal poison consumption takes
> down the system. Drawback is that these will be cascading events that
> can be "noisy". What I see is the choice between failing fast versus
> failing safe.

Correlating the "unhandlable page" log with a later crash is
theoretically possible but breaks down in practice at scale:

- The crash may happen seconds, minutes, or hours later — or never, if
the page isn't accessed again before a reboot.

- The crash happens on a different CPU, in a different task, in a
different context; there's no breadcrumb linking it back to the memory
error.

- Automated triage systems work on stack traces and panic strings, not
by correlating dmesg lines across time with later crashes.

- The later crash looks completely different depending on the
architecture. On arm64, you get a "synchronous external abort". On
x86, it's a machine check exception. On some platforms, it might be a
generic page fault or a BUG_ON in a subsystem that found inconsistent
data. There is no single signature to match — every architecture and
every consumption path produces a different crash, making automated
correlation essentially impossible.

- Worse, the crash may never happen at all. If the corrupted memory is
read but the corruption doesn't trigger a fault — say, a flipped bit
in a permission field, a size, a pointer that still maps to valid
memory, or a data buffer — the result is silent data corruption with
no crash to correlate against. The system continues operating on wrong
data with no indication anything went wrong.

Also, I wouldn't call continuing with known-corrupted kernel memory
"failing safe" — it's the opposite. The kernel has no mechanism to
fence off a poisoned slab page or page table from future access.
Continuing is failing unsafely with a delayed, unpredictable
consequence.


> > Isn't the clean approach way better than the random one?
> 
> I don't fully agree. In the past upstream has enhanced many kernel mm
> services (e.g. khugepaged, page migration, dump_user_range()) to
> recover from memory error in order to improve system availability,
> given these service or tools can fail safe. Seeing many crashes
> pointing to a certain in-kernel service at consumption time helped us
> decide what services we should enhance, and which service we should
> prioritize. Of course not all kernel code can be recovered from memory
> error, but that doesn't mean knowing what kernel code often caused
> crash isn't useful.


That's a fair point — consumption-time crashes have historically been
useful for identifying which kernel services to harden. But I'd argue
this patch doesn't prevent that analysis, it complements it.

The sysctl defaults to off. Operators who want to observe where poison
is consumed — to prioritize which services to enhance — can leave it
disabled and get exactly the behavior they have today.

But for operators running large fleets where the priority is fast
diagnosis and machine replacement rather than kernel hardening research,
the immediate panic is what they need. They already know the memory is
bad, they don't need the kernel to keep running to find out which
subsystem hits it first.

Also, the services you mention — khugepaged, page migration,
dump_user_range() — were enhanced to handle errors in user pages,
where recovery is possible (kill the process, fail the migration). The
pages this patch panics on — reserved pages, unknown page types — are
kernel memory where _no_ recovery mechanism exists or is likely to exist.
There's no service to enhance for those; the only options are crashing
now or crashing later, now that a critical kernel page has been lost.

> Anyway, I only have a second opinion on the usefulness of a static
> stack trace. This fail-fast option is good to have. Thanks!

Thanks for the review! Just to make sure I understand your position correctly —
are you saying you'd like changes to the patch, or is this more of a general
observation about the tradeoff?

--breno


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2026-04-17  9:11 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-15 12:54 [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
2026-04-15 12:55 ` [PATCH v4 1/3] mm/memory-failure: report MF_MSG_KERNEL for reserved pages Breno Leitao
2026-04-15 12:55 ` [PATCH v4 2/3] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
2026-04-15 12:55 ` [PATCH v4 3/3] Documentation: document panic_on_unrecoverable_memory_failure sysctl Breno Leitao
2026-04-15 20:56 ` [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages Jiaqi Yan
2026-04-16 15:32   ` Breno Leitao
2026-04-16 16:26     ` Jiaqi Yan
2026-04-17  9:10       ` Breno Leitao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox