From: "Luck, Tony" <tony.luck@intel.com>
To: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>,
Borislav Petkov <bp@alien8.de>,
Yazen Ghannam <yazen.ghannam@amd.com>,
Miaohe Lin <linmiaohe@huawei.com>,
Naoya Horiguchi <naoya.horiguchi@nec.com>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: RE: Machine check recovery broken in v6.9-rc1
Date: Sun, 7 Apr 2024 00:08:30 +0000
Message-ID: <SJ1PR11MB608323D7E6113B78A35F4999FC012@SJ1PR11MB6083.namprd11.prod.outlook.com>
In-Reply-To: <ZhDMBZ2I9M72D87F@localhost.localdomain>
> This one is against 6.1 (previous one was against v6.9-rc2):
> Again, compile tested only
Oscar.
Both the 6.1 and 6.9-rc2 patches make the BUG (and subsequent issues) go away.
Here's what's happening.
When the machine check occurs there's a scramble from various subsystems
to report the memory error.
ghes_do_memory_failure() calls memory_failure_queue() which later
calls memory_failure() from a kernel thread. Side note: this happens TWICE
for each error. Not sure yet if this is a BIOS issue logging more than once,
or some Linux issue in the acpi/apei/ghes.c code.
uc_decode_notifier() [called from a different kernel thread] also calls
memory_failure().
Finally kill_me_maybe() [called from task_work on return to the application
when returning from the machine check handler] also calls memory_failure().
memory_failure() is somewhat prepared for multiple reports of the same
error. It uses an atomic test-and-set operation to mark the page as poisoned.
The first caller to report the error does all the real work. Late arrivals take a
shorter path, but may still take some action(s) depending on the "flags"
passed in:
	if (TestSetPageHWPoison(p)) {
		pr_err("%#lx: already hardware poisoned\n", pfn);
		res = -EHWPOISON;
		if (flags & MF_ACTION_REQUIRED)
			res = kill_accessing_process(current, pfn, flags);
		if (flags & MF_COUNT_INCREASED)
			put_page(p);
		goto unlock_mutex;
	}
In this case the last to arrive has MF_ACTION_REQUIRED set, so calls
kill_accessing_process() ... which is in the stack trace that led to the:
kernel BUG at include/linux/swapops.h:88!
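For anyone who wants to play with the "first reporter does the work, late
arrivals take the short path" pattern outside the kernel, here is a rough
user-space sketch. It is not the kernel code: every sketch_* function and
SKETCH_* constant is a hypothetical stand-in, C11 atomic_flag stands in for
the HWPoison page flag, and the "kill" step is only a counter.

```c
#include <stdatomic.h>

/* Hypothetical flag bits, loosely modelled on the MF_* flags above. */
#define SKETCH_ACTION_REQUIRED 0x1
#define SKETCH_COUNT_INCREASED 0x2

/* Hypothetical return codes for the sketch. */
enum {
	SKETCH_HANDLED = 0,           /* first reporter, did the full work */
	SKETCH_ALREADY_POISONED = -1, /* late arrival, nothing more to do */
	SKETCH_KILL_ATTEMPTED = -2,   /* late arrival with ACTION_REQUIRED */
};

/* Hypothetical stand-in for the per-page state. */
struct sketch_page {
	atomic_flag poisoned;          /* stands in for the HWPoison flag */
	atomic_int refcount;           /* stands in for the page refcount */
	atomic_int full_handling_runs; /* times the full path ran */
	atomic_int kill_attempts;      /* times the late "kill" path ran */
};

static void sketch_page_init(struct sketch_page *p, int refs)
{
	atomic_flag_clear(&p->poisoned);
	atomic_init(&p->refcount, refs);
	atomic_init(&p->full_handling_runs, 0);
	atomic_init(&p->kill_attempts, 0);
}

static int sketch_report_error(struct sketch_page *p, int flags)
{
	/*
	 * atomic_flag_test_and_set() returns the previous value, so "true"
	 * means another reporter already marked the page.
	 */
	if (atomic_flag_test_and_set(&p->poisoned)) {
		int res = SKETCH_ALREADY_POISONED;

		if (flags & SKETCH_ACTION_REQUIRED) {
			/* Late arrival still has to act on the current task. */
			atomic_fetch_add(&p->kill_attempts, 1);
			res = SKETCH_KILL_ATTEMPTED;
		}
		if (flags & SKETCH_COUNT_INCREASED)
			atomic_fetch_sub(&p->refcount, 1); /* drop caller's ref */
		return res;
	}

	/* First reporter: the only one that runs the full handling path. */
	atomic_fetch_add(&p->full_handling_runs, 1);
	return SKETCH_HANDLED;
}
```

Calling sketch_report_error() three times on one page, with
SKETCH_ACTION_REQUIRED set only on the last call, mirrors the ordering
described above: the first call runs the full path once, the second returns
early, and the third still goes down the "kill" branch even though the page
was marked long before.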
I'm not sure that I fully understand your patch. I guess that it is making sure to
handle the case that the page has already been marked as poisoned?
Anyway ... thanks for the quick fix. I hope the above helps write a good
commit message to get this applied and backported to stable.
Tested-by: Tony Luck <tony.luck@intel.com>
-Tony