From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: Alexey Dobriyan <adobriyan@gmail.com>
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, David Hildenbrand <david@redhat.com>,
"Liam R. Howlett" <Liam.Howlett@oracle.com>,
Vlastimil Babka <vbabka@suse.cz>, Mike Rapoport <rppt@kernel.org>,
Suren Baghdasaryan <surenb@google.com>,
Michal Hocko <mhocko@suse.com>
Subject: Re: [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option
Date: Thu, 10 Jul 2025 18:02:07 +0100 [thread overview]
Message-ID: <5242fca8-4a17-4ff8-a624-08778fc64f19@lucifer.local> (raw)
In-Reply-To: <72956765-39e0-445a-b381-6bbc54046544@p183>
OK I wasn't clear enough I guess - NAK.
This is not upstreamable, nor anything like it.
On Thu, Jul 10, 2025 at 07:57:00PM +0300, Alexey Dobriyan wrote:
> On Thu, Jul 10, 2025 at 05:16:52PM +0100, Lorenzo Stoakes wrote:
> > Sorry but no - this seems to me to just be a hack. And it also appears to
> > violate the rules on BUG_ON() (see [0]) so this is just a no.
> >
> > [0]:https://lore.kernel.org/linux-mm/CAHk-=wjO1xL_ZRKUG_SJuh6sPTQ-6Lem3a3pGoo26CXEsx_w0g@mail.gmail.com/
> >
> > On Wed, Jul 09, 2025 at 09:10:59PM +0300, Alexey Dobriyan wrote:
> > > Implement
> > >
> > > memory.oops_if_bad_pte=1
> >
> > This is a totally new paradigm afaict - introducing an oops based on user
> > input, I really don't think that's sensible.
> >
> > Unless kernel.panic_on_oops is set this won't necessarily cause anything to
> > halt. Really you want a panic_on_bad_pte here, but that would be way way
> > too specific.
> >
> > So it seems like a hack just so you can get a vmcore?
> >
> > You seem to be using BUG_ON() to _maybe_ cause a panic, maybe not, but by
> > doing this you're inferring that there's unrecoverable system instability,
> > which isf clearly not the case.
> >
> > All of the bad PTE handling seems to be intended to be recoverable and
> > handled by calling code.
> >
> > Additionally we have uses like zap_present_folio_ptes() which use it to
> > output PTE state in the instance of an invalid mapcount value - I don't
> > think oopsing there would really be what you wanted right?
> >
> > >
> > > boot option which oopses the machine instead of dreadful
> > >
> > > BUG: Bad page map in process
> > >
> > > message.
> >
> > I'm not sure what's so dreadful about it?
>
> Because the root cause is unknown, happened at unknown time, dmesg
> rotated away and nobody bothered to coredump the machine because it
> didn't oops!
>
> > And why an oops is better?
>
> I apologize for stating the obvious but the less time between the bug
> and coredump collection the better.
>
> > > This is intended
> > > for people who want to panic at the slightest provocation and
> > > for people who ruled out hardware problems which in turn means that
> > > delaying vmcore collection is counter-productive.
> >
> > Seems to be a specific edge case.
>
> Yes, but the option is not enabled by default and costs 2 instructions
> on the coldest code path, so...
>
> > > Linux doesn't (never?) panicked on PTE corruption and even implemented
> > > ratelimited version of the message meaning it can go for minutes and
> > > even hours without anyone noticing which is exactly the opposite of what
> > > should be done to facilitate debugging.
> >
> > But are there real situations you can cite where this has been problematic?
> >
> > >
> > > Not enabled by default.
> >
> > Yeah, obviously.
> >
> > >
> > > Not advertised.
> >
> > Umm why? Seems like you just want to add this for your own very specific
> > purpose?
>
> Sort of, I don't want to patch and unpatch things every time.
>
> > > +/*
> > > + * Oops instead of printing "Bad page map in process" message and
> > > + * trying to continue.
> > > + */
> > > +static bool oops_if_bad_pte __ro_after_init = false;
> > > +module_param(oops_if_bad_pte, bool, 0444);
> > > +
> > > /*
> > > * This function is called to print an error when a bad pte
> > > * is found. For example, we might have a PFN-mapped pte in
> > > @@ -490,6 +498,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
> > > static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > > pte_t pte, struct page *page)
> > > {
> > > + /*
> > > + * This line is a formality to collect vmcore ASAP. Real bug
> > > + * (hardware or software) happened earlier, current registers and
> > > + * backtrace aren't interesting.
> > > + */
> > > + BUG_ON(oops_if_bad_pte);
> >
> > Except that it won't without panic_on_oops?
>
> Yes, I'll update the comment. it is supposed to be used with
> panic_on_oops=1 for maximum effect.
>
> > I mean we can't just go around putting arbitrary BUG_ON()'s like this for
> > cases we want data on.
>
> Yes, we can!
>
> > And far worse here - this is a print_xxx() function, and you're making it
> > oops? That's really bad.
>
> It's fine because, it is conditional BUG_ON.
>
> > Note that other page table levels can be 'bad' as well, see pgd_bad() et
> > al. - none of these will be caught.
>
> Sure, I didn't think much about spreading this option to other places.
> It can be spread independently.
>
> > Overall I suspect there's one single case you're worried about, that really
> > you want to put a WARN_ON_ONCE() against - then you can panic_on_warn and
> > get what you want.
>
> Ehh, no. WARN is for home users who can maybe photo the oops and fish it
> out of dmesg and make bug report -- so that system survives until their
> data are flushed to disk.
>
> I suspect users are very bifurcated: some want to panic always, some
> want to panic during QA but not in the field, and then there are users
> whose only hope is cellphone camera.
>
> > If you can make an argument in favour of this that's convincing then that
> > would be a potentially upstreamable patch, but this one isn't, in my view.
next prev parent reply other threads:[~2025-07-10 17:02 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-07-09 18:10 Alexey Dobriyan
2025-07-09 22:37 ` Andrew Morton
2025-07-10 16:46 ` Alexey Dobriyan
2025-07-10 7:35 ` Vlastimil Babka
2025-07-10 16:47 ` Alexey Dobriyan
2025-07-10 16:16 ` Lorenzo Stoakes
2025-07-10 16:57 ` Alexey Dobriyan
2025-07-10 17:02 ` Lorenzo Stoakes [this message]
2025-07-10 18:29 ` Michal Hocko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=5242fca8-4a17-4ff8-a624-08778fc64f19@lucifer.local \
--to=lorenzo.stoakes@oracle.com \
--cc=Liam.Howlett@oracle.com \
--cc=adobriyan@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=david@redhat.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.com \
--cc=rppt@kernel.org \
--cc=surenb@google.com \
--cc=vbabka@suse.cz \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox