linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Tony Luck <tony.luck@intel.com>
To: Naoya Horiguchi <naoya.horiguchi@nec.com>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>,
	Matthew Wilcox <willy@infradead.org>,
	Shuai Xue <xueshuai@linux.alibaba.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Nicholas Piggin <npiggin@gmail.com>,
	Christophe Leroy <christophe.leroy@csgroup.eu>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, Tony Luck <tony.luck@intel.com>
Subject: [PATCH v3 0/2] Copy-on-write poison recovery
Date: Fri, 21 Oct 2022 13:01:18 -0700	[thread overview]
Message-ID: <20221021200120.175753-1-tony.luck@intel.com> (raw)
In-Reply-To: <20221019170835.155381-1-tony.luck@intel.com>

Part 1 deals with the process that triggered the copy on write
fault with a store to a shared read-only page. That process is
send a SIGBUS with the usual machine check decoration to specify
the virtual address of the lost page, together with the scope.

Part 2 sets up to asynchronously take the page with the uncorrected
error offline to prevent additional machine check faults. H/t to
Miaohe Lin <linmiaohe@huawei.com> and Shuai Xue <xueshuai@linux.alibaba.com>
for pointing me to the existing function to queue a call to
memory_failure().

On x86 there is some duplicate reporting (because the error is
also signalled by the memory controller as well as by the core
that triggered the machine check). Console logs look like this:

[ 1647.723403] mce: [Hardware Error]: Machine check events logged
	Machine check from kernel copy routine

[ 1647.723414] MCE: Killing einj_mem_uc:3600 due to hardware memory corruption fault at 7f3309503400
	x86 fault handler sends SIGBUS to child process

[ 1647.735183] Memory failure: 0x905b92d: recovery action for dirty LRU page: Recovered
	Async call to memory_failure() from copy on write path

[ 1647.748397] Memory failure: 0x905b92d: already hardware poisoned
	uc_decode_notifier() processes memory controller report

[ 1647.761313] MCE: Killing einj_mem_uc:3599 due to hardware memory corruption fault at 7f3309503400
	Parent process tries to read poisoned page. Page has been unmapped, so
	#PF handler sends SIGBUS


Tony Luck (2):
  mm, hwpoison: Try to recover from copy-on write faults
  mm, hwpoison: When copy-on-write hits poison, take page offline

 include/linux/highmem.h | 24 ++++++++++++++++++++++++
 include/linux/mm.h      |  5 ++++-
 mm/memory.c             | 32 ++++++++++++++++++++++----------
 3 files changed, 50 insertions(+), 11 deletions(-)

-- 
2.37.3



  parent reply	other threads:[~2022-10-21 20:01 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-17 23:42 [RFC PATCH] mm, hwpoison: Recover from copy-on-write machine checks Tony Luck
2022-10-18  8:43 ` HORIGUCHI NAOYA(堀口 直也)
2022-10-18 17:52   ` Luck, Tony
2022-10-19 17:08     ` [PATCH v2] mm, hwpoison: Try to recover from copy-on write faults Tony Luck
2022-10-19 17:45       ` Dan Williams
2022-10-19 20:30         ` Luck, Tony
2022-10-20  1:57       ` Shuai Xue
2022-10-20 20:05         ` Tony Luck
2022-10-21  1:38           ` Miaohe Lin
2022-10-21  3:57             ` Luck, Tony
2022-10-21  1:52           ` Shuai Xue
2022-10-21  4:08             ` Tony Luck
2022-10-21  4:11               ` David Laight
2022-10-21  4:41                 ` Luck, Tony
2022-10-21  9:29                   ` Shuai Xue
2022-10-21 16:30                     ` Luck, Tony
2022-10-23 15:04                       ` Shuai Xue
2022-10-21  6:57               ` Shuai Xue
2022-10-21 20:01       ` Tony Luck [this message]
2022-10-21 20:01         ` [PATCH v3 1/2] " Tony Luck
2022-10-25  5:46           ` HORIGUCHI NAOYA(堀口 直也)
2022-10-28  2:11           ` Miaohe Lin
2022-10-28 16:09             ` Luck, Tony
2022-11-02 14:27               ` Alexander Potapenko
2022-11-02 14:30                 ` Alexander Potapenko
2022-10-21 20:01         ` [PATCH v3 2/2] mm, hwpoison: When copy-on-write hits poison, take page offline Tony Luck
2022-10-28  2:28           ` Miaohe Lin
2022-10-28 16:13             ` Luck, Tony
2022-10-29  1:55               ` Miaohe Lin
2022-10-23 15:52         ` [PATCH v3 0/2] Copy-on-write poison recovery Shuai Xue
2022-10-26  5:19           ` Shuai Xue
2022-10-31 20:10         ` [PATCH v4 " Tony Luck
2022-10-31 20:10           ` [PATCH v4 1/2] mm, hwpoison: Try to recover from copy-on write faults Tony Luck
2022-10-31 20:10           ` [PATCH v4 2/2] mm, hwpoison: When copy-on-write hits poison, take page offline Tony Luck
2023-05-18 21:49             ` Jane Chu
2023-05-18 22:10               ` Luck, Tony
2023-05-19  7:28               ` Greg Kroah-Hartman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20221021200120.175753-1-tony.luck@intel.com \
    --to=tony.luck@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=christophe.leroy@csgroup.eu \
    --cc=dan.j.williams@intel.com \
    --cc=linmiaohe@huawei.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mpe@ellerman.id.au \
    --cc=naoya.horiguchi@nec.com \
    --cc=npiggin@gmail.com \
    --cc=willy@infradead.org \
    --cc=xueshuai@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox