From: Breno Leitao <leitao@debian.org>
To: Miaohe Lin <linmiaohe@huawei.com>,
Naoya Horiguchi <nao.horiguchi@gmail.com>,
Andrew Morton <akpm@linux-foundation.org>,
Jonathan Corbet <corbet@lwn.net>,
Shuah Khan <skhan@linuxfoundation.org>,
David Hildenbrand <david@kernel.org>,
Lorenzo Stoakes <ljs@kernel.org>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@kernel.org>,
Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org,
linux-doc@vger.kernel.org, Breno Leitao <leitao@debian.org>,
kernel-team@meta.com
Subject: [PATCH v3 0/3] mm/memory-failure: add panic option for unrecoverable pages
Date: Mon, 13 Apr 2026 06:26:32 -0700
Message-ID: <20260413-ecc_panic-v3-0-1dcbb2f12bc4@debian.org>
When the memory failure handler encounters an in-use kernel page that it
cannot recover (slab, page tables, kernel stacks, vmalloc, etc.), it
currently logs the error as "Ignored" and continues operation.
This leaves the corrupted data accessible to the kernel, which then either
corrupts data silently or crashes later, when the poisoned memory is next
accessed.
This is a common problem on large fleets. We frequently observe multi-bit ECC
errors hitting kernel slab pages, where memory_failure() fails to recover them
and the system crashes later at an unrelated code path, making root cause
analysis unnecessarily difficult.
Here is one specific example from production on an arm64 server: a multi-bit
ECC error hit a dentry cache slab page, memory_failure() failed to recover it
(slab pages are not supported by the hwpoison recovery mechanism), and 67
seconds later d_lookup() accessed the poisoned cache line causing a synchronous
external abort:
  [88690.479680] [Hardware Error]: error_type: 3, multi-bit ECC
  [88690.498473] Memory failure: 0x40272d: unhandlable page.
  [88690.498619] Memory failure: 0x40272d: recovery action for get hwpoison page: Ignored
  ...
  [88757.847126] Internal error: synchronous external abort: 0000000096000410 [#1] SMP
  [88758.061075] pc : d_lookup+0x5c/0x220
This series adds a new sysctl vm.panic_on_unrecoverable_memory_failure
(default 0) that, when enabled, panics immediately on unrecoverable
memory failures. This provides a clean crash dump at the time of the
error, which is far more useful for diagnosis than a random crash later
at an unrelated code path.
The series also reports reserved pages as MF_MSG_KERNEL and panics on
unknown page types (MF_MSG_UNKNOWN), so that all unrecoverable failure
cases are covered.
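For readers who want the panic condition in one place, here is a minimal
userspace sketch of the decision described above. It is not the code from
the patches: the enum is a reduced stand-in for the kernel's
enum mf_action_page_type, the signature of panic_on_unrecoverable_mf() is
an assumption, the variable name backing the sysctl is hypothetical, and
the recovery result is not modeled.

```c
#include <stdbool.h>

/*
 * Reduced stand-in for the kernel's enum mf_action_page_type. Only the
 * values mentioned in this cover letter are modeled; the numeric values
 * are arbitrary.
 */
enum mf_action_page_type {
	MF_MSG_KERNEL,
	MF_MSG_KERNEL_HIGH_ORDER,
	MF_MSG_GET_HWPOISON,
	MF_MSG_UNKNOWN,
};

/*
 * Hypothetical variable modeling vm.panic_on_unrecoverable_memory_failure.
 * Zero-initialized, matching the documented default of 0.
 */
static int sysctl_panic_on_unrecoverable_mf;

/*
 * Assumed signature. Returns true when the handler should panic: the
 * sysctl is enabled and the page type is one of the unrecoverable kernel
 * cases listed in the v2 changelog.
 */
static bool panic_on_unrecoverable_mf(enum mf_action_page_type type)
{
	if (!sysctl_panic_on_unrecoverable_mf)
		return false;

	switch (type) {
	case MF_MSG_KERNEL:
	case MF_MSG_KERNEL_HIGH_ORDER:
	case MF_MSG_UNKNOWN:
		return true;
	default:
		/* e.g. MF_MSG_GET_HWPOISON no longer triggers a panic */
		return false;
	}
}
```

With the sysctl at its default of 0 the helper never requests a panic, so
existing behavior is unchanged unless the administrator opts in.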
A CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option is
also provided, similar to CONFIG_BOOTPARAM_HARDLOCKUP_PANIC, allowing
the sysctl to be enabled at build time for systems that always want to
panic on unrecoverable memory failures without requiring runtime
configuration.
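As a rough illustration of how a build-time option can seed the sysctl
default, consider the sketch below. It assumes the Kconfig symbol only
selects the initial value and that the sysctl can still be changed at
runtime. The macro, variable, and helper names are hypothetical, and a
plain #ifdef stands in for the kernel's IS_ENABLED() so the fragment
compiles in userspace.

```c
/*
 * Pick the initial value at build time. In this userspace model the
 * symbol would be supplied with -DCONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC.
 */
#ifdef CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC
#define MF_PANIC_DEFAULT 1
#else
#define MF_PANIC_DEFAULT 0
#endif

/* Hypothetical variable backing the sysctl; writable at runtime. */
static int mf_panic_sysctl_value = MF_PANIC_DEFAULT;

/* Hypothetical helper: reports whether panicking is currently enabled. */
static int mf_panic_enabled(void)
{
	return mf_panic_sysctl_value != 0;
}
```

Built without the symbol, mf_panic_enabled() returns 0 until the value is
set at runtime, which mirrors the "default 0" behavior described earlier.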
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v3:
- Rename is_unrecoverable_memory_failure() to panic_on_unrecoverable_mf()
  as suggested by maintainer.
- Add CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC kernel configuration option,
  similar to CONFIG_BOOTPARAM_HARDLOCKUP_PANIC.
- Add documentation for the sysctl and CONFIG option.
- Add code comments documenting the panic condition design rationale and
  how the retry mechanism mitigates false positives from buddy allocator
  races.
- Link to v2: https://patch.msgid.link/20260331-ecc_panic-v2-0-9e40d0f64f7a@debian.org
Changes in v2:
- Panic on MF_MSG_KERNEL, MF_MSG_KERNEL_HIGH_ORDER and MF_MSG_UNKNOWN
  instead of MF_MSG_GET_HWPOISON.
- Report MF_MSG_KERNEL for reserved pages when get_hwpoison_page() fails
  instead of MF_MSG_GET_HWPOISON.
- Link to v1: https://patch.msgid.link/20260323-ecc_panic-v1-0-72a1921726c5@debian.org
---
Breno Leitao (3):
      mm/memory-failure: report MF_MSG_KERNEL for reserved pages
      mm/memory-failure: add CONFIG_BOOTPARAM_MEMORY_FAILURE_PANIC option
      Documentation: document panic_on_unrecoverable_memory_failure sysctl

 Documentation/admin-guide/sysctl/vm.rst | 46 ++++++++++++++++++++++++++++++
 mm/Kconfig                              |  9 ++++++
 mm/memory-failure.c                     | 50 ++++++++++++++++++++++++++++++++-
 3 files changed, 104 insertions(+), 1 deletion(-)
---
base-commit: 028ef9c96e96197026887c0f092424679298aae8
change-id: 20260323-ecc_panic-4e473b83087c
Best regards,
--
Breno Leitao <leitao@debian.org>