From: Naoya Horiguchi <naoya.horiguchi@linux.dev>
To: Tony Luck <tony.luck@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
Youquan Song <youquan.song@intel.com>,
Naoya Horiguchi <naoya.horiguchi@nec.com>
Subject: Re: [PATCH] mm/hwpoison: Fix error page recovered but reported "not recovered"
Date: Wed, 12 Jan 2022 21:11:45 +0900 [thread overview]
Message-ID: <20220112121145.GA889650@u2004> (raw)
In-Reply-To: <20220107194450.1687264-1-tony.luck@intel.com>
On Fri, Jan 07, 2022 at 11:44:50AM -0800, Tony Luck wrote:
> From: Youquan Song <youquan.song@intel.com>
>
> When an uncorrected memory error is consumed there is a race between
> the CMCI from the memory controller reporting an uncorrected error
> with a UCNA signature, and the core reporting and SRAR signature
> machine check when the data is about to be consumed.
>
> If the CMCI wins that race, the page is marked poisoned when
> uc_decode_notifier() calls memory_failure() and the machine
> check processing code finds the page already poisoned. It calls
> kill_accessing_process() to make sure a SIGBUS is sent. But
> returns the wrong error code.
>
> Console log looks like this:
>
> [34775.674296] mce: Uncorrected hardware memory error in user-access at 3710b3400
> [34775.675413] Memory failure: 0x3710b3: recovery action for dirty LRU page: Recovered
> [34775.690310] Memory failure: 0x3710b3: already hardware poisoned
> [34775.696247] Memory failure: 0x3710b3: Sending SIGBUS to einj_mem_uc:361438 due to hardware memory corruption
> [34775.706072] mce: Memory error not recovered
>
> Fix kill_accessing_process() to return -EHWPOISON to avoid the noise
> message "Memory error not recovered" and skip duplicate SIGBUS.
>
> [Tony: Reworded some parts of commit message]
>
> Fixes: a3f5d80ea401 ("mm,hwpoison: send SIGBUS with error virutal address")
> Signed-off-by: Youquan Song <youquan.song@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>
> This code is very subtle ... the fix makes the "not recovered" message
> go away ... but I'm not more than 75% confident that this is the right
> fix. Please check carefully :-)
>
> mm/memory-failure.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3a274468f193..a67f558b08ea 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -707,7 +707,8 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
> if (ret == 1 && priv.tk.addr)
> kill_proc(&priv.tk, pfn, flags);
> mmap_read_unlock(p->mm);
> - return ret ? -EFAULT : -EHWPOISON;
Thank you for reporting, the original code was wrong and the trinary operation
returns opposite code. -EHWPOISON here is to notify kill_me_maybe() that it
does not have to send SIGBUS in its own because kill_accessing_process() already
sent SIGBUS with proper virtual address info.
> +
> + return (ret < 0) ? -EFAULT : -EHWPOISON;
It seems to me that this returns -EHWPOISON whether any hwpoison entry is
found or not, so this fix can cause another issue.
We want to return -EHWPOISON only when kill_accessing_process() sent SIGBUS,
so I think that the following diff should be what we want.
Could you check this fix works for you?
Thanks,
Naoya Horiguchi
---
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 14ae5c18e776..4c9bd1d37301 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -707,8 +707,10 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
(void *)&priv);
if (ret == 1 && priv.tk.addr)
kill_proc(&priv.tk, pfn, flags);
+ else
+ ret = 0;
mmap_read_unlock(p->mm);
- return ret ? -EFAULT : -EHWPOISON;
+ return ret > 0 ? -EHWPOISON : -EFAULT;
}
static const char *action_name[] = {
next prev parent reply other threads:[~2022-01-12 12:11 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-01-07 19:44 Tony Luck
2022-01-12 12:11 ` Naoya Horiguchi [this message]
2022-01-13 2:00 ` Luck, Tony
2022-01-13 6:52 ` HORIGUCHI NAOYA(堀口 直也)
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20220112121145.GA889650@u2004 \
--to=naoya.horiguchi@linux.dev \
--cc=akpm@linux-foundation.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=naoya.horiguchi@nec.com \
--cc=tony.luck@intel.com \
--cc=youquan.song@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox