From: Naoya Horiguchi <naoya.horiguchi@linux.dev>
To: linux-mm@kvack.org
Cc: Andrew Morton <akpm@linux-foundation.org>,
Tony Luck <tony.luck@intel.com>,
Youquan Song <youquan.song@intel.com>,
Naoya Horiguchi <naoya.horiguchi@nec.com>,
linux-kernel@vger.kernel.org
Subject: [PATCH v2] mm/hwpoison: Fix error page recovered but reported "not recovered"
Date: Fri, 14 Jan 2022 08:11:17 +0900 [thread overview]
Message-ID: <20220113231117.1021405-1-naoya.horiguchi@linux.dev> (raw)
From: Naoya Horiguchi <naoya.horiguchi@nec.com>
When an uncorrected memory error is consumed there is a race between
the CMCI from the memory controller reporting an uncorrected error
with a UCNA signature, and the core reporting and SRAR signature
machine check when the data is about to be consumed.
If the CMCI wins that race, the page is marked poisoned when
uc_decode_notifier() calls memory_failure() and the machine
check processing code finds the page already poisoned. It calls
kill_accessing_process() to make sure a SIGBUS is sent. But
returns the wrong error code.
Console log looks like this:
[34775.674296] mce: Uncorrected hardware memory error in user-access at 3710b3400
[34775.675413] Memory failure: 0x3710b3: recovery action for dirty LRU page: Recovered
[34775.690310] Memory failure: 0x3710b3: already hardware poisoned
[34775.696247] Memory failure: 0x3710b3: Sending SIGBUS to einj_mem_uc:361438 due to hardware memory corruption
[34775.706072] mce: Memory error not recovered
kill_accessing_process() is supposed to return -EHWPOISON to notify that
SIGBUS is already set to the process and kill_me_maybe() doesn't have to
send it again. But current code simply fails to do this, so fix it to
make sure to work as intended. This change avoids the noise message
"Memory error not recovered" and skips duplicate SIGBUSs.
[Tony: Reworded some parts of commit message]
Fixes: a3f5d80ea401 ("mm,hwpoison: send SIGBUS with error virutal address")
Reported-by: Youquan Song <youquan.song@intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
mm/memory-failure.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 14ae5c18e776..4c9bd1d37301 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -707,8 +707,10 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
(void *)&priv);
if (ret == 1 && priv.tk.addr)
kill_proc(&priv.tk, pfn, flags);
+ else
+ ret = 0;
mmap_read_unlock(p->mm);
- return ret ? -EFAULT : -EHWPOISON;
+ return ret > 0 ? -EHWPOISON : -EFAULT;
}
static const char *action_name[] = {
--
2.25.1
reply other threads:[~2022-01-13 23:11 UTC|newest]
Thread overview: [no followups] expand[flat|nested] mbox.gz Atom feed
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220113231117.1021405-1-naoya.horiguchi@linux.dev \
--to=naoya.horiguchi@linux.dev \
--cc=akpm@linux-foundation.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=naoya.horiguchi@nec.com \
--cc=tony.luck@intel.com \
--cc=youquan.song@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox