linux-mm.kvack.org archive mirror
From: Jiaqi Yan <jiaqiyan@google.com>
To: Breno Leitao <leitao@debian.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>,
	Naoya Horiguchi <nao.horiguchi@gmail.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Jonathan Corbet <corbet@lwn.net>,
	 Shuah Khan <skhan@linuxfoundation.org>,
	David Hildenbrand <david@kernel.org>,
	 Lorenzo Stoakes <ljs@kernel.org>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	 Vlastimil Babka <vbabka@kernel.org>,
	Mike Rapoport <rppt@kernel.org>,
	 Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	linux-mm@kvack.org,  linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages
Date: Fri, 17 Apr 2026 17:18:16 -0700	[thread overview]
Message-ID: <CACw3F516bGtU3Qs57wV6K4vCu0O9ir0s7LJnduFq2aA=ivAbug@mail.gmail.com> (raw)
In-Reply-To: <aeHy3-vQTQYJlGw5@gmail.com>

On Fri, Apr 17, 2026 at 2:11 AM Breno Leitao <leitao@debian.org> wrote:
>
> On Thu, Apr 16, 2026 at 09:26:08AM -0700, Jiaqi Yan wrote:
>
> > So we will always get the same stack trace below, right?
> >
> >           panic+0xb4/0xc0
> >           action_result+0x278/0x340
> >           memory_failure+0x152b/0x1c80
> >
> > IIUC, this stack trace itself doesn't provide any useful information
> > about the memory error, right? What exactly can we use from the stack
> > trace? It is just a side-effect that we failed immediately.
>
> We can use it to correlate problems across a fleet of machines. Let me
> share how crash dump analysis works in large datacenters.
>
> There are thousands of crashes a day (and that's a conservative
> estimate), and different automated services try to correlate and
> categorize them into a few buckets, something like:
>
>         1. New crash — needs investigation
>         2. Known issue — fix is being rolled out
>         3. Hardware problem — do not spend engineering time on it
>
> When a machine crashes at a random code path like d_lookup() 67 seconds
> after the memory error, the automated triage classifies it as a kernel
> bug in VFS/dcache and assigns it to the filesystem team for
> investigation. Engineers spend time chasing a bug that doesn't exist in
> software — it's a hardware problem.
>
> With the immediate panic at memory_failure(), the stack trace is always
> recognizable and can be automatically classified as category 3 (hardware
> problem). The static stack trace is the feature, not a limitation: it
> gives triage automation a stable signature to match on.
>
> The value isn't in what the stack trace and the panic() tell a human reading
> one crash — it's in what they tell automated systems processing thousands of
> them.

Yeah, in this setting, a crash dump with a fixed signature totally makes sense.
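
To make sure I follow: the kind of signature matching described above could
be sketched roughly like this (purely illustrative; the patterns, bucket
names, and the idea that triage is a simple regex match are my assumptions,
not a description of any real fleet tooling):

```python
# Sketch of signature-based crash triage. All patterns and bucket names
# are hypothetical; real systems are far more elaborate.
import re

# Known stack-trace signatures mapped to triage buckets.
SIGNATURES = [
    # The fixed panic site from this patch gives hardware errors a stable match.
    (re.compile(r"memory_failure\+0x"), "hardware -- replace machine"),
    # A consumption-time crash in dcache code looks like a VFS bug instead.
    (re.compile(r"d_lookup\+0x"), "kernel bug -- assign to fs team"),
]

def classify(stack_trace: str) -> str:
    """Return the triage bucket whose signature first matches the trace."""
    for pattern, bucket in SIGNATURES:
        if pattern.search(stack_trace):
            return bucket
    return "new crash -- needs investigation"
```

With the panic moved into memory_failure(), every such crash hits the first
pattern, instead of scattering across per-subsystem signatures.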

>
> > You can still correlate failures with "Memory failure: 0x1: unhandlable
> > page" and keep running until the actual fatal poison consumption takes
> > down the system. The drawback is that these will be cascading events that
> > can be "noisy". What I see is a choice between failing fast and
> > failing safe.
>
> Correlating the "unhandlable page" log with a later crash is
> theoretically possible but breaks down in practice at scale:
>
> - The crash may happen seconds, minutes, or hours later — or never, if
> the page isn't accessed again before a reboot.
>
> - The crash happens on a different CPU, in a different task, in a
> different context — there's no breadcrumb linking it back to the memory
> error.
>
> - Automated triage systems work on stack traces and panic strings, not
> by correlating dmesg lines across time with later crashes.
>
> - The later crash looks completely different depending on the
> architecture. On arm64, you get a "synchronous external abort". On
> x86, it's a machine check exception. On some platforms, it might be a
> generic page fault or a BUG_ON in a subsystem that found inconsistent
> data. There is no single signature to match — every architecture and
> every consumption path produces a different crash, making automated
> correlation essentially impossible.
>
> - Worse, the crash may never happen at all. If the corrupted memory is
> read but the corruption doesn't trigger a fault — say, a flipped bit
> in a permission field, a size, a pointer that still maps to valid
> memory, or a data buffer — the result is silent data corruption with
> no crash to correlate against. The system continues operating on wrong
> data with no indication anything went wrong.
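
To make the silent-corruption case concrete, here is a toy illustration of
a single flipped bit in a size field producing a wrong answer with no fault
at all (userspace Python purely for illustration; nothing here models
actual kernel data structures):

```python
# A single flipped bit in a length field silently truncates data:
# no crash, no fault, just a wrong result.
payload = b"hello world"
length = len(payload)           # 11 == 0b1011
corrupted = length ^ (1 << 3)   # bit 3 flips: 0b1011 -> 0b0011 == 3

truncated = payload[:corrupted]  # b"hel" -- still a perfectly valid slice
```

Every operation above is legal and nothing traps, which is exactly the case
where there is no later crash to correlate against.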
>
> Also, I wouldn't call continuing with known-corrupted kernel memory
> "failing safe" — it's the opposite. The kernel has no mechanism to
> fence off a poisoned slab page or page table from future access.
> Continuing is failing unsafely with a delayed, unpredictable
> consequence.
>
>
> > > Isn't the clean approach way better than the random one?
> >
> > I don't fully agree. In the past, upstream has enhanced many kernel mm
> > services (e.g. khugepaged, page migration, dump_user_range()) to
> > recover from memory errors in order to improve system availability,
> > given that these services or tools can fail safely. Seeing many crashes
> > pointing to a certain in-kernel service at consumption time helped us
> > decide which services we should enhance, and which we should
> > prioritize. Of course not all kernel code can recover from memory
> > errors, but that doesn't mean knowing what kernel code often caused
> > crashes isn't useful.
>
>
> That's a fair point — consumption-time crashes have historically been
> useful for identifying which kernel services to harden. But I'd argue
> this patch doesn't prevent that analysis; it complements it.
>
> The sysctl defaults to off. Operators who want to observe where poison
> is consumed — to prioritize which services to enhance — can leave it
> disabled and get exactly the behavior they have today.
>
> But for operators running large fleets where the priority is fast
> diagnosis and machine replacement rather than kernel hardening research,
> the immediate panic is what they need. They already know the memory is
> bad, they don't need the kernel to keep running to find out which
> subsystem hits it first.
>
> Also, the services you mention — khugepaged, page migration,
> dump_user_range() — were enhanced to handle errors in user pages,
> where recovery is possible (kill the process, fail the migration). The
> pages this patch panics on — reserved pages, unknown page types — are
> kernel memory where _no_ recovery mechanism exists or is likely to exist.

Maybe, but I wouldn't be surprised if someone comes up with an idea one day.

> There's no service to enhance for those; the only options are crash now
> or crash later, given that a crucial memory page has been lost.
>
> > Anyway, my only reservation is about the usefulness of a static
> > stack trace. This fail-fast option is good to have. Thanks!
>
> Thanks for the review! Just to make sure I understand your position correctly —
> are you saying you'd like changes to the patch, or is this more of a general
> observation about the tradeoff?

No change needed. I just hoped to get more clarification from you on
the usefulness of the stack trace, and I did get it. Thanks!

>
> --breno


Thread overview: 9+ messages
2026-04-15 12:54 Breno Leitao
2026-04-15 12:55 ` [PATCH v4 1/3] mm/memory-failure: report MF_MSG_KERNEL for reserved pages Breno Leitao
2026-04-15 12:55 ` [PATCH v4 2/3] mm/memory-failure: add panic option for unrecoverable pages Breno Leitao
2026-04-15 12:55 ` [PATCH v4 3/3] Documentation: document panic_on_unrecoverable_memory_failure sysctl Breno Leitao
2026-04-15 20:56 ` [PATCH v4 0/3] mm/memory-failure: add panic option for unrecoverable pages Jiaqi Yan
2026-04-16 15:32   ` Breno Leitao
2026-04-16 16:26     ` Jiaqi Yan
2026-04-17  9:10       ` Breno Leitao
2026-04-18  0:18         ` Jiaqi Yan [this message]
