[PATCH v3 0/5] Enhance soft hwpoison handling and injection

linux-mm.kvack.org archive mirror

* [PATCH v3 0/5] Enhance soft hwpoison handling and injection
@ 2024-05-21 23:54 Jane Chu
  2024-05-21 23:54 ` [PATCH v3 1/5] mm/memory-failure: try to send SIGBUS even if unmap failed Jane Chu
                   ` (4 more replies)
  0 siblings, 5 replies; 16+ messages in thread
From: Jane Chu @ 2024-05-21 23:54 UTC (permalink / raw)
  To: linmiaohe, nao.horiguchi, akpm, osalvador, linux-mm, linux-kernel

Changes in v3:
  - rebased to mainline as of 5/20/2024
  - added an acked-by from Miaohe Lin
  - picked up a R-B from Oscar Salvador
  - fixed/clarified comments about MF_IGNORED/MF_FAILED definition and
    usage. - Oscar Salvador
  - invoke hwpoison_filter slightly earlier to avoid unnecessary THP split,
    and with refcount held. - Miaohe Lin
  - added comments to try_to_split_thp_page() on when not to release page
    refcount.  - Oscar Salvador
  - added action_result() in a couple cases, but take care not to overwrite
    the intended returns.  - Oscar Salvador

Changes in v2:
  - rebased to mm-stable as of 5/8/2024
  - added RB by Oscar Salvador
  - comments from Oscar on patch 1-of-3: clarify changelog
  - comments from Miaohe Lin on patch 3-of-3: remove unnecessary user page
    checking and remove incorrect put_page() in kill_procs_now().
    Invoke kill_procs_now() regardless of whether MF_ACTION_REQUIRED is
    set, and move hwpoison_filter() higher up.
  - added two patches, 3-of-5 and 4-of-5

This series aims at the following enhancements:
- Let one hwpoison injector, madvise(MADV_HWPOISON), behave more as if
  a real UE had occurred. The other two injectors, hwpoison-inject and
  'einj' on x86, can't do that, and it seems to me we need a better
  simulation of the real UE scenario.
- For years, if the kernel was unable to unmap a hwpoisoned page, it sent
  a SIGKILL instead of a SIGBUS to prevent the user process from
  potentially accessing the page again. But in doing so, the user process
  also loses important information needed for recovery: the vaddr.
  Fortunately, the kernel already has code to kill a process that
  re-accesses a hwpoisoned page, so remove the '!unmap_success' check.
- Right now, if a THP page under a GUP longterm pin is hwpoisoned and the
  kernel cannot split the THP page, memory-failure simply ignores
  the UE and returns.  That's not ideal; it could deliver a SIGBUS with
  useful information for userspace recovery.
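
To illustrate why the vaddr matters for recovery, here is a minimal
userspace sketch of how a SIGBUS handler might recognize a memory-error
SIGBUS and pull out the poisoned address. It is not part of this series:
classify_sigbus() and enum mce_kind are hypothetical names, and the
fallback #defines assume the Linux values of the two si_code constants.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>

/* Fallbacks in case the libc headers predate these si_code values. */
#ifndef BUS_MCEERR_AR
#define BUS_MCEERR_AR 4	/* action required: the access hit poison */
#endif
#ifndef BUS_MCEERR_AO
#define BUS_MCEERR_AO 5	/* action optional: poison found asynchronously */
#endif

/* Hypothetical classification, not a kernel or libc type. */
enum mce_kind { MCE_NONE, MCE_ACTION_REQUIRED, MCE_ACTION_OPTIONAL };

/*
 * Classify a SIGBUS siginfo. For memory errors, report the poisoned
 * virtual address through *vaddr (if vaddr is non-NULL).
 */
static enum mce_kind classify_sigbus(const siginfo_t *si, void **vaddr)
{
	if (si->si_signo != SIGBUS)
		return MCE_NONE;
	if (si->si_code != BUS_MCEERR_AR && si->si_code != BUS_MCEERR_AO)
		return MCE_NONE;
	if (vaddr)
		*vaddr = si->si_addr;
	return si->si_code == BUS_MCEERR_AR ? MCE_ACTION_REQUIRED
					    : MCE_ACTION_OPTIONAL;
}
```

A handler installed with sigaction() and SA_SIGINFO would pass its
siginfo_t pointer to classify_sigbus() and then decide whether the data
at that address can be rebuilt. With SIGKILL there is no handler and no
address, which is the information loss described above.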


Jane Chu (5):
  mm/memory-failure: try to send SIGBUS even if unmap failed
  mm/madvise: Add MF_ACTION_REQUIRED to madvise(MADV_HWPOISON)
  mm/memory-failure: improve memory failure action_result messages
  mm/memory-failure: move hwpoison_filter() higher up
  mm/memory-failure: send SIGBUS in the event of thp split fail

 include/linux/mm.h      |   2 +
 include/ras/ras_event.h |   2 +
 mm/madvise.c            |   2 +-
 mm/memory-failure.c     | 108 +++++++++++++++++++++++++++++-----------
 4 files changed, 84 insertions(+), 30 deletions(-)

-- 
2.39.3




* [PATCH v3 1/5] mm/memory-failure: try to send SIGBUS even if unmap failed
  2024-05-21 23:54 [PATCH v3 0/5] Enhance soft hwpoison handling and injection Jane Chu
@ 2024-05-21 23:54 ` Jane Chu
  2024-05-21 23:54 ` [PATCH v3 2/5] mm/madvise: Add MF_ACTION_REQUIRED to madvise(MADV_HWPOISON) Jane Chu
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Jane Chu @ 2024-05-21 23:54 UTC (permalink / raw)
  To: linmiaohe, nao.horiguchi, akpm, osalvador, linux-mm, linux-kernel

For years, when it comes to killing a process due to hwpoison,
a SIGBUS has been delivered only if the unmap was successful.
Otherwise, a SIGKILL is delivered. The reason for that is
to prevent the involved process from accessing the hwpoisoned
page again.

A lot has changed since then: a hwpoisoned page is marked, and
upon being re-accessed, the memory-failure handler invokes
kill_accessing_process() to kill the process immediately.
So let's take out the '!unmap_success' factor and try to deliver
SIGBUS if possible.
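
The change in signal choice can be sketched as a small userspace model.
This is a simplification for illustration only, not the kernel code:
signal_before() and signal_after() are hypothetical names, and
'addr_unknown' stands in for the tk->addr == -EFAULT test in
kill_procs().

```c
#include <signal.h>
#include <stdbool.h>

/*
 * Simplified model of the signal picked for each task in kill_procs().
 * Returns the signal number, or 0 when no signal is sent.
 */

/* Before this patch: any unmap failure ('fail') forced SIGKILL. */
static int signal_before(bool forcekill, bool fail, bool addr_unknown)
{
	if (!forcekill)
		return 0;
	return (fail || addr_unknown) ? SIGKILL : SIGBUS;
}

/* After this patch: SIGKILL only when no user address is available. */
static int signal_after(bool forcekill, bool addr_unknown)
{
	if (!forcekill)
		return 0;
	return addr_unknown ? SIGKILL : SIGBUS;
}
```

In the model, an unmap failure with a known address yields SIGKILL from
signal_before() and SIGBUS from signal_after(), which is the behavior
change this patch is after.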

Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/memory-failure.c | 15 ++++-----------
 1 file changed, 4 insertions(+), 11 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 16ada4fb02b7..739311e121af 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -514,22 +514,15 @@ void add_to_kill_ksm(struct task_struct *tsk, struct page *p,
  *
  * Only do anything when FORCEKILL is set, otherwise just free the
  * list (this is used for clean pages which do not need killing)
- * Also when FAIL is set do a force kill because something went
- * wrong earlier.
  */
-static void kill_procs(struct list_head *to_kill, int forcekill, bool fail,
+static void kill_procs(struct list_head *to_kill, int forcekill,
 		unsigned long pfn, int flags)
 {
 	struct to_kill *tk, *next;
 
 	list_for_each_entry_safe(tk, next, to_kill, nd) {
 		if (forcekill) {
-			/*
-			 * In case something went wrong with munmapping
-			 * make sure the process doesn't catch the
-			 * signal and then access the memory. Just kill it.
-			 */
-			if (fail || tk->addr == -EFAULT) {
+			if (tk->addr == -EFAULT) {
 				pr_err("%#lx: forcibly killing %s:%d because of failure to unmap corrupted page\n",
 				       pfn, tk->tsk->comm, tk->tsk->pid);
 				do_send_sig_info(SIGKILL, SEND_SIG_PRIV,
@@ -1660,7 +1653,7 @@ static bool hwpoison_user_mappings(struct folio *folio, struct page *p,
 	 */
 	forcekill = folio_test_dirty(folio) || (flags & MF_MUST_KILL) ||
 		    !unmap_success;
-	kill_procs(&tokill, forcekill, !unmap_success, pfn, flags);
+	kill_procs(&tokill, forcekill, pfn, flags);
 
 	return unmap_success;
 }
@@ -1724,7 +1717,7 @@ static void unmap_and_kill(struct list_head *to_kill, unsigned long pfn,
 		unmap_mapping_range(mapping, start, size, 0);
 	}
 
-	kill_procs(to_kill, flags & MF_MUST_KILL, false, pfn, flags);
+	kill_procs(to_kill, flags & MF_MUST_KILL, pfn, flags);
 }
 
 /*
-- 
2.39.3




* [PATCH v3 2/5] mm/madvise: Add MF_ACTION_REQUIRED to madvise(MADV_HWPOISON)
  2024-05-21 23:54 [PATCH v3 0/5] Enhance soft hwpoison handling and injection Jane Chu
  2024-05-21 23:54 ` [PATCH v3 1/5] mm/memory-failure: try to send SIGBUS even if unmap failed Jane Chu
@ 2024-05-21 23:54 ` Jane Chu
  2024-05-23  1:54   ` Miaohe Lin
  2024-05-21 23:54 ` [PATCH v3 3/5] mm/memory-failure: improve memory failure action_result messages Jane Chu
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 16+ messages in thread
From: Jane Chu @ 2024-05-21 23:54 UTC (permalink / raw)
  To: linmiaohe, nao.horiguchi, akpm, osalvador, linux-mm, linux-kernel

The soft hwpoison injector via madvise(MADV_HWPOISON) operates
synchronously in the sense that the injector is also a process under
test, and should it have the poisoned page mapped in its address
space, it should get killed just as it would in a real UE situation.
Doing so aligns with what the madvise(2) man page says:
"This operation may result in the calling process receiving a SIGBUS
and the page being unmapped."
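
For reference, a minimal sketch of what such an injector might look like
from userspace. page_start() and inject_hwpoison() are hypothetical
helper names; the call needs CAP_SYS_ADMIN and a kernel built with
CONFIG_MEMORY_FAILURE, and the fallback #define assumes the Linux uapi
value of MADV_HWPOISON.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_HWPOISON
#define MADV_HWPOISON 100	/* value from the Linux uapi headers */
#endif

/* Round an address down to the start of its page (hypothetical helper). */
static void *page_start(const void *addr, size_t page_size)
{
	return (void *)((uintptr_t)addr & ~((uintptr_t)page_size - 1));
}

/*
 * Inject a simulated memory error into the page containing 'addr'.
 * Returns madvise()'s result: 0 on success, -1 with errno set otherwise
 * (for example EPERM without CAP_SYS_ADMIN). With this patch applied,
 * the caller should expect that it may itself receive a SIGBUS.
 */
static int inject_hwpoison(void *addr)
{
	size_t page_size = (size_t)sysconf(_SC_PAGESIZE);

	return madvise(page_start(addr, page_size), page_size, MADV_HWPOISON);
}
```

A test program would typically install a SIGBUS handler first, touch the
page so that it is mapped, and only then call inject_hwpoison() on it.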

Signed-off-by: Jane Chu <jane.chu@oracle.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
---
 mm/madvise.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index c8ba3f3eb54d..d8a01d7b2860 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1147,7 +1147,7 @@ static int madvise_inject_error(int behavior,
 		} else {
 			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
 				 pfn, start);
-			ret = memory_failure(pfn, MF_COUNT_INCREASED | MF_SW_SIMULATED);
+			ret = memory_failure(pfn, MF_ACTION_REQUIRED | MF_COUNT_INCREASED | MF_SW_SIMULATED);
 			if (ret == -EOPNOTSUPP)
 				ret = 0;
 		}
-- 
2.39.3




* [PATCH v3 3/5] mm/memory-failure: improve memory failure action_result messages
  2024-05-21 23:54 [PATCH v3 0/5] Enhance soft hwpoison handling and injection Jane Chu
  2024-05-21 23:54 ` [PATCH v3 1/5] mm/memory-failure: try to send SIGBUS even if unmap failed Jane Chu
  2024-05-21 23:54 ` [PATCH v3 2/5] mm/madvise: Add MF_ACTION_REQUIRED to madvise(MADV_HWPOISON) Jane Chu
@ 2024-05-21 23:54 ` Jane Chu
  2024-05-22 20:37   ` Oscar Salvador
  2024-05-23  2:31   ` Miaohe Lin
  2024-05-21 23:54 ` [PATCH v3 4/5] mm/memory-failure: move hwpoison_filter() higher up Jane Chu
  2024-05-21 23:54 ` [PATCH v3 5/5] mm/memory-failure: send SIGBUS in the event of thp split fail Jane Chu
  4 siblings, 2 replies; 16+ messages in thread
From: Jane Chu @ 2024-05-21 23:54 UTC (permalink / raw)
  To: linmiaohe, nao.horiguchi, akpm, osalvador, linux-mm, linux-kernel

Added two explicit MF_MSG messages describing failure in get_hwpoison_page().
Attempted to document the definition of various action names, and made a few
adjustments to the action_result() calls.
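
As a rough illustration of how a page type and an outcome combine into
one report line, here is a userspace model. The enums are trimmed copies
with their own numbering, format_action_result() is a hypothetical name,
and the line layout is an assumption modeled on the kernel's "recovery
action for" message rather than a copy of action_result().

```c
#include <stddef.h>
#include <stdio.h>

/* Trimmed userspace copies; values do not match the kernel enums. */
enum mf_result { MF_IGNORED, MF_FAILED, MF_DELAYED, MF_RECOVERED };
enum mf_msg {
	MF_MSG_GET_HWPOISON,
	MF_MSG_UNMAP_FAILED,
	MF_MSG_ALREADY_POISONED,
	MF_MSG_UNKNOWN,
};

static const char * const action_name[] = {
	[MF_IGNORED]	= "Ignored",
	[MF_FAILED]	= "Failed",
	[MF_DELAYED]	= "Delayed",
	[MF_RECOVERED]	= "Recovered",
};

static const char * const action_page_types[] = {
	[MF_MSG_GET_HWPOISON]		= "get hwpoison page",
	[MF_MSG_UNMAP_FAILED]		= "unmapping failed page",
	[MF_MSG_ALREADY_POISONED]	= "already poisoned",
	[MF_MSG_UNKNOWN]		= "unknown page",
};

/* Format one report line into buf; returns snprintf()'s result. */
static int format_action_result(char *buf, size_t len, unsigned long pfn,
				enum mf_msg type, enum mf_result result)
{
	return snprintf(buf, len, "%#lx: recovery action for %s: %s",
			pfn, action_page_types[type], action_name[result]);
}
```

The point of the two new message types is that a log line can name the
actual condition (a failed get_hwpoison_page() or an already poisoned
page) instead of falling back to "unknown page".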

Signed-off-by: Jane Chu <jane.chu@oracle.com>
---
 include/linux/mm.h      |  2 ++
 include/ras/ras_event.h |  2 ++
 mm/memory-failure.c     | 38 +++++++++++++++++++++++++++++++++-----
 3 files changed, 37 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9849dfda44d4..b4598c6a393a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -4111,6 +4111,7 @@ enum mf_action_page_type {
 	MF_MSG_DIFFERENT_COMPOUND,
 	MF_MSG_HUGE,
 	MF_MSG_FREE_HUGE,
+	MF_MSG_GET_HWPOISON,
 	MF_MSG_UNMAP_FAILED,
 	MF_MSG_DIRTY_SWAPCACHE,
 	MF_MSG_CLEAN_SWAPCACHE,
@@ -4124,6 +4125,7 @@ enum mf_action_page_type {
 	MF_MSG_BUDDY,
 	MF_MSG_DAX,
 	MF_MSG_UNSPLIT_THP,
+	MF_MSG_ALREADY_POISONED,
 	MF_MSG_UNKNOWN,
 };
 
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index c011ea236e9b..b3f6832a94fe 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -360,6 +360,7 @@ TRACE_EVENT(aer_event,
 	EM ( MF_MSG_DIFFERENT_COMPOUND, "different compound page after locking" ) \
 	EM ( MF_MSG_HUGE, "huge page" )					\
 	EM ( MF_MSG_FREE_HUGE, "free huge page" )			\
+	EM ( MF_MSG_GET_HWPOISON, "get hwpoison page" )			\
 	EM ( MF_MSG_UNMAP_FAILED, "unmapping failed page" )		\
 	EM ( MF_MSG_DIRTY_SWAPCACHE, "dirty swapcache page" )		\
 	EM ( MF_MSG_CLEAN_SWAPCACHE, "clean swapcache page" )		\
@@ -373,6 +374,7 @@ TRACE_EVENT(aer_event,
 	EM ( MF_MSG_BUDDY, "free buddy page" )				\
 	EM ( MF_MSG_DAX, "dax page" )					\
 	EM ( MF_MSG_UNSPLIT_THP, "unsplit thp" )			\
+	EM ( MF_MSG_ALREADY_POISONED, "already poisoned" )		\
 	EMe ( MF_MSG_UNKNOWN, "unknown page" )
 
 /*
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 739311e121af..1e22d73c9329 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -879,6 +879,28 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
 	return ret > 0 ? -EHWPOISON : -EFAULT;
 }
 
+/*
+ * MF_IGNORED - The m-f() handler marks the page as PG_hwpoisoned'ed.
+ * But it could not do more to isolate the page from being accessed again,
+ * nor does it kill the process. This is extremely rare and one of the
+ * potential causes is that the page state has been changed due to an
+ * underlying race condition. This is the most severe outcome.
+ *
+ * MF_FAILED - The m-f() handler marks the page as PG_hwpoisoned'ed. It
+ * should have killed the process, but it can't isolate the page, due to
+ * conditions such as extra pin, unmap failure, etc. Accessing the page
+ * again will trigger another MCE and the process will be killed by the
+ * m-f() handler immediately.
+ *
+ * MF_DELAYED - The m-f() handler marks the page as PG_hwpoisoned'ed. The
+ * page is unmapped, but perhaps remains in LRU or file mapping. An attempt
+ * to access the page again will trigger page fault and the PF handler
+ * will kill the process.
+ *
+ * MF_RECOVERED - The m-f() handler marks the page as PG_hwpoisoned'ed.
+ * The page has been completely isolated, that is, unmapped, taken out of
+ * the buddy system, or hole-punched out of the file mapping.
+ */
 static const char *action_name[] = {
 	[MF_IGNORED] = "Ignored",
 	[MF_FAILED] = "Failed",
@@ -893,6 +915,7 @@ static const char * const action_page_types[] = {
 	[MF_MSG_DIFFERENT_COMPOUND]	= "different compound page after locking",
 	[MF_MSG_HUGE]			= "huge page",
 	[MF_MSG_FREE_HUGE]		= "free huge page",
+	[MF_MSG_GET_HWPOISON]		= "get hwpoison page",
 	[MF_MSG_UNMAP_FAILED]		= "unmapping failed page",
 	[MF_MSG_DIRTY_SWAPCACHE]	= "dirty swapcache page",
 	[MF_MSG_CLEAN_SWAPCACHE]	= "clean swapcache page",
@@ -906,6 +929,7 @@ static const char * const action_page_types[] = {
 	[MF_MSG_BUDDY]			= "free buddy page",
 	[MF_MSG_DAX]			= "dax page",
 	[MF_MSG_UNSPLIT_THP]		= "unsplit thp",
+	[MF_MSG_ALREADY_POISONED]	= "already poisoned",
 	[MF_MSG_UNKNOWN]		= "unknown page",
 };
 
@@ -1013,12 +1037,13 @@ static int me_kernel(struct page_state *ps, struct page *p)
 
 /*
  * Page in unknown state. Do nothing.
+ * This is a catch-all in case we fail to make sense of the page state.
  */
 static int me_unknown(struct page_state *ps, struct page *p)
 {
 	pr_err("%#lx: Unknown page state\n", page_to_pfn(p));
 	unlock_page(p);
-	return MF_FAILED;
+	return MF_IGNORED;
 }
 
 /*
@@ -2055,6 +2080,8 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
 		if (flags & MF_ACTION_REQUIRED) {
 			folio = page_folio(p);
 			res = kill_accessing_process(current, folio_pfn(folio), flags);
+			action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
+			return res;
 		}
 		return res;
 	} else if (res == -EBUSY) {
@@ -2062,7 +2089,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
 			flags |= MF_NO_RETRY;
 			goto retry;
 		}
-		return action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
+		return action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
 	}
 
 	folio = page_folio(p);
@@ -2097,7 +2124,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
 
 	if (!hwpoison_user_mappings(folio, p, pfn, flags)) {
 		folio_unlock(folio);
-		return action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
+		return action_result(pfn, MF_MSG_UNMAP_FAILED, MF_FAILED);
 	}
 
 	return identify_page_state(pfn, p, page_flags);
@@ -2231,6 +2258,7 @@ int memory_failure(unsigned long pfn, int flags)
 			res = kill_accessing_process(current, pfn, flags);
 		if (flags & MF_COUNT_INCREASED)
 			put_page(p);
+		action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
 		goto unlock_mutex;
 	}
 
@@ -2267,7 +2295,7 @@ int memory_failure(unsigned long pfn, int flags)
 			}
 			goto unlock_mutex;
 		} else if (res < 0) {
-			res = action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
+			res = action_result(pfn, MF_MSG_GET_HWPOISON, MF_IGNORED);
 			goto unlock_mutex;
 		}
 	}
@@ -2363,7 +2391,7 @@ int memory_failure(unsigned long pfn, int flags)
 	 * Abort on fail: __filemap_remove_folio() assumes unmapped page.
 	 */
 	if (!hwpoison_user_mappings(folio, p, pfn, flags)) {
-		res = action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
+		res = action_result(pfn, MF_MSG_UNMAP_FAILED, MF_FAILED);
 		goto unlock_page;
 	}
 
-- 
2.39.3




* [PATCH v3 4/5] mm/memory-failure: move hwpoison_filter() higher up
  2024-05-21 23:54 [PATCH v3 0/5] Enhance soft hwpoison handling and injection Jane Chu
                   ` (2 preceding siblings ...)
  2024-05-21 23:54 ` [PATCH v3 3/5] mm/memory-failure: improve memory failure action_result messages Jane Chu
@ 2024-05-21 23:54 ` Jane Chu
  2024-05-22 20:50   ` Oscar Salvador
  2024-05-23  2:37   ` Miaohe Lin
  2024-05-21 23:54 ` [PATCH v3 5/5] mm/memory-failure: send SIGBUS in the event of thp split fail Jane Chu
  4 siblings, 2 replies; 16+ messages in thread
From: Jane Chu @ 2024-05-21 23:54 UTC (permalink / raw)
  To: linmiaohe, nao.horiguchi, akpm, osalvador, linux-mm, linux-kernel

Move hwpoison_filter() higher up, as there is no need to spend a lot of
cycles only to find out later that the page is supposed to be skipped
by hwpoison handling.

Signed-off-by: Jane Chu <jane.chu@oracle.com>
---
 mm/memory-failure.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 1e22d73c9329..794196951a04 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2301,6 +2301,18 @@ int memory_failure(unsigned long pfn, int flags)
 	}
 
 	folio = page_folio(p);
+
+	/* filter pages that are protected from hwpoison test by users */
+	folio_lock(folio);
+	if (hwpoison_filter(p)) {
+		ClearPageHWPoison(p);
+		folio_unlock(folio);
+		folio_put(folio);
+		res = -EOPNOTSUPP;
+		goto unlock_mutex;
+	}
+	folio_unlock(folio);
+
 	if (folio_test_large(folio)) {
 		/*
 		 * The flag must be set after the refcount is bumped
@@ -2364,14 +2376,6 @@ int memory_failure(unsigned long pfn, int flags)
 	 */
 	page_flags = folio->flags;
 
-	if (hwpoison_filter(p)) {
-		ClearPageHWPoison(p);
-		folio_unlock(folio);
-		folio_put(folio);
-		res = -EOPNOTSUPP;
-		goto unlock_mutex;
-	}
-
 	/*
 	 * __munlock_folio() may clear a writeback folio's LRU flag without
 	 * the folio lock. We need to wait for writeback completion for this
-- 
2.39.3




* [PATCH v3 5/5] mm/memory-failure: send SIGBUS in the event of thp split fail
  2024-05-21 23:54 [PATCH v3 0/5] Enhance soft hwpoison handling and injection Jane Chu
                   ` (3 preceding siblings ...)
  2024-05-21 23:54 ` [PATCH v3 4/5] mm/memory-failure: move hwpoison_filter() higher up Jane Chu
@ 2024-05-21 23:54 ` Jane Chu
  2024-05-22 20:57   ` Oscar Salvador
  2024-05-23  3:02   ` Miaohe Lin
  4 siblings, 2 replies; 16+ messages in thread
From: Jane Chu @ 2024-05-21 23:54 UTC (permalink / raw)
  To: linmiaohe, nao.horiguchi, akpm, osalvador, linux-mm, linux-kernel

While handling hwpoison in a THP page, it is possible that
try_to_split_thp_page() fails, for example when the THP page has
been RDMA pinned. At this point, the kernel cannot isolate the
poisoned THP page; all it can do is send a SIGBUS to the user
process with a meaningful payload to give user-level recovery a chance.
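
The refcount contract behind the new 'release' argument can be sketched
with a toy userspace model. struct toy_page, toy_split() and
toy_try_to_split() are hypothetical stand-ins; only the
"if (ret && release)" condition is taken from the patch.

```c
#include <stdbool.h>

/* Toy stand-in for struct page: a reference count and a pin flag. */
struct toy_page {
	int refcount;
	bool pinned;	/* e.g. a long-term pin that makes the split fail */
};

/* Pretend split: fails with -1 when the page is pinned. */
static int toy_split(struct toy_page *page)
{
	return page->pinned ? -1 : 0;
}

/*
 * Models the new contract: on failure, the caller's reference is
 * dropped only when 'release' is true. With 'release' false the caller
 * keeps its reference because it still has work to do on the page.
 */
static int toy_try_to_split(struct toy_page *page, bool release)
{
	int ret = toy_split(page);

	if (ret && release)
		page->refcount--;
	return ret;
}
```

In this model memory_failure() corresponds to the release=false caller:
it keeps the reference across the kill and drops it itself afterwards,
while soft offline keeps the old behavior with release=true.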

Signed-off-by: Jane Chu <jane.chu@oracle.com>
---
 mm/memory-failure.c | 35 ++++++++++++++++++++++++++++++-----
 1 file changed, 30 insertions(+), 5 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 794196951a04..a14d56e66902 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1706,7 +1706,12 @@ static int identify_page_state(unsigned long pfn, struct page *p,
 	return page_action(ps, p, pfn);
 }
 
-static int try_to_split_thp_page(struct page *page)
+/*
+ * When 'release' is 'false', it means that if thp split has failed,
+ * there is still more to do, hence the page refcount we took earlier
+ * is still needed.
+ */
+static int try_to_split_thp_page(struct page *page, bool release)
 {
 	int ret;
 
@@ -1714,7 +1719,7 @@ static int try_to_split_thp_page(struct page *page)
 	ret = split_huge_page(page);
 	unlock_page(page);
 
-	if (unlikely(ret))
+	if (ret && release)
 		put_page(page);
 
 	return ret;
@@ -2187,6 +2192,24 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 	return rc;
 }
 
+/*
+ * The calling condition is as such: thp split failed, page might have
+ * been RDMA pinned, not much can be done for recovery.
+ * But a SIGBUS should be delivered with vaddr provided so that the user
+ * application has a chance to recover. Also, application processes'
+ * election for MCE early killed will be honored.
+ */
+static int kill_procs_now(struct page *p, unsigned long pfn, int flags,
+				struct folio *folio)
+{
+	LIST_HEAD(tokill);
+
+	collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
+	kill_procs(&tokill, true, pfn, flags);
+
+	return -EHWPOISON;
+}
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -2328,8 +2351,10 @@ int memory_failure(unsigned long pfn, int flags)
 		 * page is a valid handlable page.
 		 */
 		folio_set_has_hwpoisoned(folio);
-		if (try_to_split_thp_page(p) < 0) {
-			res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
+		if (try_to_split_thp_page(p, false) < 0) {
+			res = kill_procs_now(p, pfn, flags, folio);
+			put_page(p);
+			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_FAILED);
 			goto unlock_mutex;
 		}
 		VM_BUG_ON_PAGE(!page_count(p), p);
@@ -2703,7 +2728,7 @@ static int soft_offline_in_use_page(struct page *page)
 	};
 
 	if (!huge && folio_test_large(folio)) {
-		if (try_to_split_thp_page(page)) {
+		if (try_to_split_thp_page(page, true)) {
 			pr_info("soft offline: %#lx: thp split failed\n", pfn);
 			return -EBUSY;
 		}
-- 
2.39.3




* Re: [PATCH v3 3/5] mm/memory-failure: improve memory failure action_result messages
  2024-05-21 23:54 ` [PATCH v3 3/5] mm/memory-failure: improve memory failure action_result messages Jane Chu
@ 2024-05-22 20:37   ` Oscar Salvador
  2024-05-23 17:38     ` Jane Chu
  2024-05-23  2:31   ` Miaohe Lin
  1 sibling, 1 reply; 16+ messages in thread
From: Oscar Salvador @ 2024-05-22 20:37 UTC (permalink / raw)
  To: Jane Chu; +Cc: linmiaohe, nao.horiguchi, akpm, linux-mm, linux-kernel

On Tue, May 21, 2024 at 05:54:27PM -0600, Jane Chu wrote:
> Added two explicit MF_MSG messages describing failure in get_hwpoison_page().
> Attempted to document the definition of various action names, and made a few
> adjustments to the action_result() calls.
> 
> Signed-off-by: Jane Chu <jane.chu@oracle.com>

This looks much better, thanks:

Reviewed-by: Oscar Salvador <osalvador@suse.de>

By the way, I was checking the block in memory_failure() that handles
refcount=0 pages, concretely the piece of code that handles buddy pages.

In there, if we fail to take the page off the buddy lists, we return
MF_FAILED, but I really think we should be returning MF_IGNORED.

Thoughts?
 

-- 
Oscar Salvador
SUSE Labs



* Re: [PATCH v3 4/5] mm/memory-failure: move hwpoison_filter() higher up
  2024-05-21 23:54 ` [PATCH v3 4/5] mm/memory-failure: move hwpoison_filter() higher up Jane Chu
@ 2024-05-22 20:50   ` Oscar Salvador
  2024-05-23  2:37   ` Miaohe Lin
  1 sibling, 0 replies; 16+ messages in thread
From: Oscar Salvador @ 2024-05-22 20:50 UTC (permalink / raw)
  To: Jane Chu; +Cc: linmiaohe, nao.horiguchi, akpm, linux-mm, linux-kernel

On Tue, May 21, 2024 at 05:54:28PM -0600, Jane Chu wrote:
> Move hwpoison_filter() higher up, as there is no need to spend a lot of
> cycles only to find out later that the page is supposed to be skipped
> by hwpoison handling.
> 
> Signed-off-by: Jane Chu <jane.chu@oracle.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

I was about to raise the point that prior to this change hwpoison_filter()
would be called after shake_folio(), which should turn some pages LRU, but
I see that shake_page() is also called in hwpoison-inject code.

 

-- 
Oscar Salvador
SUSE Labs



* Re: [PATCH v3 5/5] mm/memory-failure: send SIGBUS in the event of thp split fail
  2024-05-21 23:54 ` [PATCH v3 5/5] mm/memory-failure: send SIGBUS in the event of thp split fail Jane Chu
@ 2024-05-22 20:57   ` Oscar Salvador
  2024-05-23  3:02   ` Miaohe Lin
  1 sibling, 0 replies; 16+ messages in thread
From: Oscar Salvador @ 2024-05-22 20:57 UTC (permalink / raw)
  To: Jane Chu; +Cc: linmiaohe, nao.horiguchi, akpm, linux-mm, linux-kernel

On Tue, May 21, 2024 at 05:54:29PM -0600, Jane Chu wrote:
> While handling hwpoison in a THP page, it is possible that
> try_to_split_thp_page() fails, for example when the THP page has
> been RDMA pinned. At this point, the kernel cannot isolate the
> poisoned THP page; all it can do is send a SIGBUS to the user
> process with a meaningful payload to give user-level recovery a chance.
> 
> Signed-off-by: Jane Chu <jane.chu@oracle.com>

Reviewed-by: Oscar Salvador <osalvador@suse.de>


-- 
Oscar Salvador
SUSE Labs



* Re: [PATCH v3 2/5] mm/madvise: Add MF_ACTION_REQUIRED to madvise(MADV_HWPOISON)
  2024-05-21 23:54 ` [PATCH v3 2/5] mm/madvise: Add MF_ACTION_REQUIRED to madvise(MADV_HWPOISON) Jane Chu
@ 2024-05-23  1:54   ` Miaohe Lin
  0 siblings, 0 replies; 16+ messages in thread
From: Miaohe Lin @ 2024-05-23  1:54 UTC (permalink / raw)
  To: Jane Chu, nao.horiguchi, akpm, osalvador, linux-mm, linux-kernel

On 2024/5/22 7:54, Jane Chu wrote:
> The soft hwpoison injector via madvise(MADV_HWPOISON) operates
> synchronously in the sense that the injector is also a process under
> test, and should it have the poisoned page mapped in its address
> space, it should get killed just as it would in a real UE situation.
> Doing so aligns with what the madvise(2) man page says:
> "This operation may result in the calling process receiving a SIGBUS
> and the page being unmapped."
> 
> Signed-off-by: Jane Chu <jane.chu@oracle.com>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>

Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Thanks.
.



* Re: [PATCH v3 3/5] mm/memory-failure: improve memory failure action_result messages
  2024-05-21 23:54 ` [PATCH v3 3/5] mm/memory-failure: improve memory failure action_result messages Jane Chu
  2024-05-22 20:37   ` Oscar Salvador
@ 2024-05-23  2:31   ` Miaohe Lin
  2024-05-23 19:58     ` Jane Chu
  1 sibling, 1 reply; 16+ messages in thread
From: Miaohe Lin @ 2024-05-23  2:31 UTC (permalink / raw)
  To: Jane Chu, nao.horiguchi, akpm, osalvador, linux-mm, linux-kernel

On 2024/5/22 7:54, Jane Chu wrote:
> Added two explicit MF_MSG messages describing failure in get_hwpoison_page().
> Attempted to document the definition of various action names, and made a few
> adjustments to the action_result() calls.
> 
> Signed-off-by: Jane Chu <jane.chu@oracle.com>

Thanks for your patch. This really improves the code.

> ---
>  include/linux/mm.h      |  2 ++
>  include/ras/ras_event.h |  2 ++
>  mm/memory-failure.c     | 38 +++++++++++++++++++++++++++++++++-----
>  3 files changed, 37 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 9849dfda44d4..b4598c6a393a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -4111,6 +4111,7 @@ enum mf_action_page_type {
>  	MF_MSG_DIFFERENT_COMPOUND,
>  	MF_MSG_HUGE,
>  	MF_MSG_FREE_HUGE,
> +	MF_MSG_GET_HWPOISON,
>  	MF_MSG_UNMAP_FAILED,
>  	MF_MSG_DIRTY_SWAPCACHE,
>  	MF_MSG_CLEAN_SWAPCACHE,
> @@ -4124,6 +4125,7 @@ enum mf_action_page_type {
>  	MF_MSG_BUDDY,
>  	MF_MSG_DAX,
>  	MF_MSG_UNSPLIT_THP,
> +	MF_MSG_ALREADY_POISONED,
>  	MF_MSG_UNKNOWN,
>  };
>  
> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
> index c011ea236e9b..b3f6832a94fe 100644
> --- a/include/ras/ras_event.h
> +++ b/include/ras/ras_event.h
> @@ -360,6 +360,7 @@ TRACE_EVENT(aer_event,
>  	EM ( MF_MSG_DIFFERENT_COMPOUND, "different compound page after locking" ) \
>  	EM ( MF_MSG_HUGE, "huge page" )					\
>  	EM ( MF_MSG_FREE_HUGE, "free huge page" )			\
> +	EM ( MF_MSG_GET_HWPOISON, "get hwpoison page" )			\
>  	EM ( MF_MSG_UNMAP_FAILED, "unmapping failed page" )		\
>  	EM ( MF_MSG_DIRTY_SWAPCACHE, "dirty swapcache page" )		\
>  	EM ( MF_MSG_CLEAN_SWAPCACHE, "clean swapcache page" )		\
> @@ -373,6 +374,7 @@ TRACE_EVENT(aer_event,
>  	EM ( MF_MSG_BUDDY, "free buddy page" )				\
>  	EM ( MF_MSG_DAX, "dax page" )					\
>  	EM ( MF_MSG_UNSPLIT_THP, "unsplit thp" )			\
> +	EM ( MF_MSG_ALREADY_POISONED, "already poisoned" )		\
>  	EMe ( MF_MSG_UNKNOWN, "unknown page" )
>  
>  /*
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 739311e121af..1e22d73c9329 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -879,6 +879,28 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
>  	return ret > 0 ? -EHWPOISON : -EFAULT;
>  }
>  
> +/*
> + * MF_IGNORED - The m-f() handler marks the page as PG_hwpoisoned'ed.
> + * But it could not do more to isolate the page from being accessed again,
> + * nor does it kill the process. This is extremely rare and one of the
> + * potential causes is that the page state has been changed due to
> + * underlying race condition. This is the most severe outcomes.
> + *
> + * MF_FAILED - The m-f() handler marks the page as PG_hwpoisoned'ed. It
> + * should have killed the process, but it can't isolate the page, due to
> + * conditions such as extra pin, unmap failure, etc. Accessing the page
> + * again will trigger another MCE and the process will be killed by the
> + * m-f() handler immediately.
> + *
> + * MF_DELAYED - The m-f() handler marks the page as PG_hwpoisoned'ed. The
> + * page is unmapped, but perhaps remains in LRU or file mapping. An attempt

Would the page remain in the LRU or file mapping? IIUC, MF_DELAYED is returned from two functions:
1. me_swapcache_dirty(). The page lives in the swap cache and is removed from the LRU.
2. kvm_gmem_error_folio(). The page range is unmapped. It seems the page won't be in the LRU or page cache.
Or am I missing something?

> + * to access the page again will trigger page fault and the PF handler
> + * will kill the process.
> + *
> + * MF_RECOVERED - The m-f() handler marks the page as PG_hwpoisoned'ed.
> + * The page has been completely isolated, that is, unmapped, taken out of
> + * the buddy system, or hole-punnched out of the file mapping.
> + */
>  static const char *action_name[] = {
>  	[MF_IGNORED] = "Ignored",
>  	[MF_FAILED] = "Failed",
> @@ -893,6 +915,7 @@ static const char * const action_page_types[] = {
>  	[MF_MSG_DIFFERENT_COMPOUND]	= "different compound page after locking",
>  	[MF_MSG_HUGE]			= "huge page",
>  	[MF_MSG_FREE_HUGE]		= "free huge page",
> +	[MF_MSG_GET_HWPOISON]		= "get hwpoison page",
>  	[MF_MSG_UNMAP_FAILED]		= "unmapping failed page",
>  	[MF_MSG_DIRTY_SWAPCACHE]	= "dirty swapcache page",
>  	[MF_MSG_CLEAN_SWAPCACHE]	= "clean swapcache page",
> @@ -906,6 +929,7 @@ static const char * const action_page_types[] = {
>  	[MF_MSG_BUDDY]			= "free buddy page",
>  	[MF_MSG_DAX]			= "dax page",
>  	[MF_MSG_UNSPLIT_THP]		= "unsplit thp",
> +	[MF_MSG_ALREADY_POISONED]	= "already poisoned",
>  	[MF_MSG_UNKNOWN]		= "unknown page",
>  };
>  
> @@ -1013,12 +1037,13 @@ static int me_kernel(struct page_state *ps, struct page *p)
>  
>  /*
>   * Page in unknown state. Do nothing.
> + * This is a catch-all in case we fail to make sense of the page state.
>   */
>  static int me_unknown(struct page_state *ps, struct page *p)
>  {
>  	pr_err("%#lx: Unknown page state\n", page_to_pfn(p));
>  	unlock_page(p);
> -	return MF_FAILED;
> +	return MF_IGNORED;
>  }
>  
>  /*
> @@ -2055,6 +2080,8 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
>  		if (flags & MF_ACTION_REQUIRED) {
>  			folio = page_folio(p);
>  			res = kill_accessing_process(current, folio_pfn(folio), flags);
> +			action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
> +			return res;

We might reuse the below "return res;"?

>  		}
>  		return res;

Apart from the above possible nits, this patch looks good to me.
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Thanks.
.



* Re: [PATCH v3 4/5] mm/memory-failure: move hwpoison_filter() higher up
  2024-05-21 23:54 ` [PATCH v3 4/5] mm/memory-failure: move hwpoison_filter() higher up Jane Chu
  2024-05-22 20:50   ` Oscar Salvador
@ 2024-05-23  2:37   ` Miaohe Lin
  1 sibling, 0 replies; 16+ messages in thread
From: Miaohe Lin @ 2024-05-23  2:37 UTC (permalink / raw)
  To: Jane Chu, nao.horiguchi, akpm, osalvador, linux-mm, linux-kernel

On 2024/5/22 7:54, Jane Chu wrote:
> Move hwpoison_filter() higher up as there is no need to spend a lot of
> cycles only to find out later that the page is supposed to be skipped
> from hwpoison handling.
> 
> Signed-off-by: Jane Chu <jane.chu@oracle.com>

Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Thanks.
.




* Re: [PATCH v3 5/5] mm/memory-failure: send SIGBUS in the event of thp split fail
  2024-05-21 23:54 ` [PATCH v3 5/5] mm/memory-failure: send SIGBUS in the event of thp split fail Jane Chu
  2024-05-22 20:57   ` Oscar Salvador
@ 2024-05-23  3:02   ` Miaohe Lin
  2024-05-23 20:01     ` Jane Chu
  1 sibling, 1 reply; 16+ messages in thread
From: Miaohe Lin @ 2024-05-23  3:02 UTC (permalink / raw)
  To: Jane Chu, nao.horiguchi, akpm, osalvador, linux-mm, linux-kernel

On 2024/5/22 7:54, Jane Chu wrote:
> While handling hwpoison in a THP page, it is possible that
> try_to_split_thp_page() fails. For example, when the THP page has
> been RDMA pinned. At this point, the kernel cannot isolate the
> poisoned THP page, all it could do is to send a SIGBUS to the user
> process with meaningful payload to give user-level recovery a chance.
> 

Thanks for your patch.

> Signed-off-by: Jane Chu <jane.chu@oracle.com>
> ---
>  mm/memory-failure.c | 35 ++++++++++++++++++++++++++++++-----
>  1 file changed, 30 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 794196951a04..a14d56e66902 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1706,7 +1706,12 @@ static int identify_page_state(unsigned long pfn, struct page *p,
>  	return page_action(ps, p, pfn);
>  }
>  
> -static int try_to_split_thp_page(struct page *page)
> +/*
> + * When 'release' is 'false', it means that if thp split has failed,
> + * there is still more to do, hence the page refcount we took earlier
> + * is still needed.
> + */
> +static int try_to_split_thp_page(struct page *page, bool release)
>  {
>  	int ret;
>  
> @@ -1714,7 +1719,7 @@ static int try_to_split_thp_page(struct page *page)
>  	ret = split_huge_page(page);
>  	unlock_page(page);
>  
> -	if (unlikely(ret))
> +	if (ret && release)
>  		put_page(page);

Is "unlikely" still needed?

>  
>  	return ret;
> @@ -2187,6 +2192,24 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>  	return rc;
>  }
>  
> +/*
> + * The calling condition is as such: thp split failed, page might have
> + * been RDMA pinned, not much can be done for recovery.
> + * But a SIGBUS should be delivered with vaddr provided so that the user
> + * application has a chance to recover. Also, application processes'
> + * election for MCE early killed will be honored.
> + */
> +static int kill_procs_now(struct page *p, unsigned long pfn, int flags,
> +				struct folio *folio)
> +{
> +	LIST_HEAD(tokill);
> +
> +	collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
> +	kill_procs(&tokill, true, pfn, flags);
> +
> +	return -EHWPOISON;
> +}
> +
>  /**
>   * memory_failure - Handle memory failure of a page.
>   * @pfn: Page Number of the corrupted page
> @@ -2328,8 +2351,10 @@ int memory_failure(unsigned long pfn, int flags)
>  		 * page is a valid handlable page.
>  		 */
>  		folio_set_has_hwpoisoned(folio);
> -		if (try_to_split_thp_page(p) < 0) {
> -			res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
> +		if (try_to_split_thp_page(p, false) < 0) {
> +			res = kill_procs_now(p, pfn, flags, folio);

No strong opinion, but we might remove the return value of kill_procs_now, as
it always returns -EHWPOISON? We could simply set res to -EHWPOISON here.

Apart from the possible nits above, this patch looks good to me.
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Thanks.
.




* Re: [PATCH v3 3/5] mm/memory-failure: improve memory failure action_result messages
  2024-05-22 20:37   ` Oscar Salvador
@ 2024-05-23 17:38     ` Jane Chu
  0 siblings, 0 replies; 16+ messages in thread
From: Jane Chu @ 2024-05-23 17:38 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: linmiaohe, nao.horiguchi, akpm, linux-mm, linux-kernel

On 5/22/2024 1:37 PM, Oscar Salvador wrote:

> On Tue, May 21, 2024 at 05:54:27PM -0600, Jane Chu wrote:
>> Added two explicit MF_MSG messages describing failure in get_hwpoison_page.
>> Attempted to document the definition of various action names, and made a few
>> adjustments to the action_result() calls.
>>
>> Signed-off-by: Jane Chu <jane.chu@oracle.com>
> This looks much better, thanks:
>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
>
> By the way, I was checking the block in memory_failure() that handles
> refcount=0 pages, concretely the piece of code that handles buddy pages.
>
> In there, if we fail to take the page off the buddy lists, we return
> MF_FAILED, but I really think we should be returning MF_IGNORED.

I guess you mean this code -
         if (has_extra_refcount(ps, p, false))
                 ret = MF_FAILED;
?

It appears in the code paths below -
     hwpoison_user_mappings
       identify_page_state
         me_huge_page || me_swapcache_dirty || me_swapcache_clean
for LRU pages.

And for non-LRU
     if (!folio_test_lru(folio) && !folio_test_writeback(folio))
             goto identify_page_state;

My hunch is that the most common calling path is: hwpoison_user_mappings
has unmapped the page, then identify_page_state is called, but for some
reason it fails to take the page off the LRU. The m-f() handler has
isolated the page to avoid further MCEs, so I think in general returning
MF_FAILED is okay.

That said, the line is not always clear. For example, in the non-LRU
case the m-f() handler may have done only a little. I guess I just
need to let the case rest.

thanks,

-jane

>
> Thoughts?
>   
>
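
To make the has_extra_refcount() discussion above concrete, here is a
small userspace model of the check. It is only a sketch under stated
assumptions: struct mock_page and the mock_* functions are hypothetical,
and the rule "discount the handler's own reference, and optionally one
expected pin per page" is my reading of the kernel helper rather than
something stated in this thread.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only what the sketch needs. */
struct mock_page {
	int refcount;	/* total references currently held */
	int nr_pages;	/* pages in the folio (1 for a base page) */
};

enum mf_result { MF_IGNORED, MF_FAILED, MF_DELAYED, MF_RECOVERED };

/* True if someone other than the handler still holds a reference. */
static bool mock_has_extra_refcount(const struct mock_page *p, bool extra_pins)
{
	int count = p->refcount - 1;	/* discount the handler's reference */

	if (extra_pins)
		count -= p->nr_pages;	/* discount the expected per-page pins */

	return count > 0;
}

/* Downgrade the handler's result when an unexpected reference remains. */
static enum mf_result mock_finish(const struct mock_page *p, enum mf_result ret)
{
	if (mock_has_extra_refcount(p, false))
		ret = MF_FAILED;
	return ret;
}
```

In this model a page whose only reference is the handler's keeps its
result, while one extra reference turns the result into MF_FAILED.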


* Re: [PATCH v3 3/5] mm/memory-failure: improve memory failure action_result messages
  2024-05-23  2:31   ` Miaohe Lin
@ 2024-05-23 19:58     ` Jane Chu
  0 siblings, 0 replies; 16+ messages in thread
From: Jane Chu @ 2024-05-23 19:58 UTC (permalink / raw)
  To: Miaohe Lin, nao.horiguchi, akpm, osalvador, linux-mm, linux-kernel

On 5/22/2024 7:31 PM, Miaohe Lin wrote:

> [..]
>> +/*
>> + * MF_IGNORED - The m-f() handler marks the page as PG_hwpoisoned'ed.
>> + * But it could not do more to isolate the page from being accessed again,
>> + * nor does it kill the process. This is extremely rare and one of the
>> + * potential causes is that the page state has been changed due to
>> + * underlying race condition. This is the most severe outcome.
>> + *
>> + * MF_FAILED - The m-f() handler marks the page as PG_hwpoisoned'ed. It
>> + * should have killed the process, but it can't isolate the page, due to
>> + * conditions such as extra pin, unmap failure, etc. Accessing the page
>> + * again will trigger another MCE and the process will be killed by the
>> + * m-f() handler immediately.
>> + *
>> + * MF_DELAYED - The m-f() handler marks the page as PG_hwpoisoned'ed. The
>> + * page is unmapped, but perhaps remains in LRU or file mapping. An attempt
> Would the page remain in LRU or file mapping? IIUC, MF_DELAYED is returned from two functions:
> 1. me_swapcache_dirty. Page lives in swap cache and removed from LRU.
> 2. kvm_gmem_error_folio. Page range is unmapped. It seems page won't be in the LRU or page cache.
> Or am I missing something?
Agreed, I'll fix the comment.
>> + * to access the page again will trigger page fault and the PF handler
>> + * will kill the process.
>> + *
>> + * MF_RECOVERED - The m-f() handler marks the page as PG_hwpoisoned'ed.
>> + * The page has been completely isolated, that is, unmapped, taken out of
>> + * the buddy system, or hole-punched out of the file mapping.
>> + */
>>   static const char *action_name[] = {
>>   	[MF_IGNORED] = "Ignored",
>>   	[MF_FAILED] = "Failed",
>> @@ -893,6 +915,7 @@ static const char * const action_page_types[] = {
>>   	[MF_MSG_DIFFERENT_COMPOUND]	= "different compound page after locking",
>>   	[MF_MSG_HUGE]			= "huge page",
>>   	[MF_MSG_FREE_HUGE]		= "free huge page",
>> +	[MF_MSG_GET_HWPOISON]		= "get hwpoison page",
>>   	[MF_MSG_UNMAP_FAILED]		= "unmapping failed page",
>>   	[MF_MSG_DIRTY_SWAPCACHE]	= "dirty swapcache page",
>>   	[MF_MSG_CLEAN_SWAPCACHE]	= "clean swapcache page",
>> @@ -906,6 +929,7 @@ static const char * const action_page_types[] = {
>>   	[MF_MSG_BUDDY]			= "free buddy page",
>>   	[MF_MSG_DAX]			= "dax page",
>>   	[MF_MSG_UNSPLIT_THP]		= "unsplit thp",
>> +	[MF_MSG_ALREADY_POISONED]	= "already poisoned",
>>   	[MF_MSG_UNKNOWN]		= "unknown page",
>>   };
>>   
>> @@ -1013,12 +1037,13 @@ static int me_kernel(struct page_state *ps, struct page *p)
>>   
>>   /*
>>    * Page in unknown state. Do nothing.
>> + * This is a catch-all in case we fail to make sense of the page state.
>>    */
>>   static int me_unknown(struct page_state *ps, struct page *p)
>>   {
>>   	pr_err("%#lx: Unknown page state\n", page_to_pfn(p));
>>   	unlock_page(p);
>> -	return MF_FAILED;
>> +	return MF_IGNORED;
>>   }
>>   
>>   /*
>> @@ -2055,6 +2080,8 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
>>   		if (flags & MF_ACTION_REQUIRED) {
>>   			folio = page_folio(p);
>>   			res = kill_accessing_process(current, folio_pfn(folio), flags);
>> +			action_result(pfn, MF_MSG_ALREADY_POISONED, MF_FAILED);
>> +			return res;
> We might reuse the below "return res;"?
Yes, will fix.
>>   		}
>>   		return res;
> Apart from the possible nits above, this patch looks good to me.
> Acked-by: Miaohe Lin <linmiaohe@huawei.com>
> Thanks.
> .

Thanks!

-jane
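
As a rough illustration of the four outcomes documented in the quoted
comment, the sketch below models the result enum, its name table, and
how a result might be folded into a return code. It is a userspace
approximation, not the kernel source: only "Ignored" and "Failed" appear
in the quoted hunk, so the other two strings, the enum order, and the
0 / -EBUSY mapping in mock_action_result() are assumptions.

```c
#include <errno.h>

/* Ordered from least to most isolation achieved (assumed order). */
enum mf_result { MF_IGNORED, MF_FAILED, MF_DELAYED, MF_RECOVERED };

/* Name table indexed by result, using designated initializers. */
static const char * const action_name[] = {
	[MF_IGNORED]	= "Ignored",
	[MF_FAILED]	= "Failed",
	[MF_DELAYED]	= "Delayed",	/* assumed, not shown in the hunk */
	[MF_RECOVERED]	= "Recovered",	/* assumed, not shown in the hunk */
};

/* Look up the printable name for a result. */
static const char *mf_result_name(enum mf_result result)
{
	return action_name[result];
}

/*
 * Assumed mapping: outcomes where the page can no longer be reached
 * directly count as success, the other two are reported as -EBUSY.
 */
static int mock_action_result(enum mf_result result)
{
	return (result == MF_RECOVERED || result == MF_DELAYED) ? 0 : -EBUSY;
}
```

Under these assumptions, changing me_unknown() from MF_FAILED to
MF_IGNORED changes the logged name but not the returned error code.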





* Re: [PATCH v3 5/5] mm/memory-failure: send SIGBUS in the event of thp split fail
  2024-05-23  3:02   ` Miaohe Lin
@ 2024-05-23 20:01     ` Jane Chu
  0 siblings, 0 replies; 16+ messages in thread
From: Jane Chu @ 2024-05-23 20:01 UTC (permalink / raw)
  To: Miaohe Lin, nao.horiguchi, akpm, osalvador, linux-mm, linux-kernel

On 5/22/2024 8:02 PM, Miaohe Lin wrote:

> On 2024/5/22 7:54, Jane Chu wrote:
>> While handling hwpoison in a THP page, it is possible that
>> try_to_split_thp_page() fails. For example, when the THP page has
>> been RDMA pinned. At this point, the kernel cannot isolate the
>> poisoned THP page, all it could do is to send a SIGBUS to the user
>> process with meaningful payload to give user-level recovery a chance.
>>
> Thanks for your patch.
>
>> Signed-off-by: Jane Chu <jane.chu@oracle.com>
>> ---
>>   mm/memory-failure.c | 35 ++++++++++++++++++++++++++++++-----
>>   1 file changed, 30 insertions(+), 5 deletions(-)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 794196951a04..a14d56e66902 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1706,7 +1706,12 @@ static int identify_page_state(unsigned long pfn, struct page *p,
>>   	return page_action(ps, p, pfn);
>>   }
>>   
>> -static int try_to_split_thp_page(struct page *page)
>> +/*
>> + * When 'release' is 'false', it means that if thp split has failed,
>> + * there is still more to do, hence the page refcount we took earlier
>> + * is still needed.
>> + */
>> +static int try_to_split_thp_page(struct page *page, bool release)
>>   {
>>   	int ret;
>>   
>> @@ -1714,7 +1719,7 @@ static int try_to_split_thp_page(struct page *page)
>>   	ret = split_huge_page(page);
>>   	unlock_page(page);
>>   
>> -	if (unlikely(ret))
>> +	if (ret && release)
>>   		put_page(page);
> Is "unlikely" still needed?
I'd say not, because this code is not on a performance-sensitive code path.
>>   
>>   	return ret;
>> @@ -2187,6 +2192,24 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>>   	return rc;
>>   }
>>   
>> +/*
>> + * The calling condition is as such: thp split failed, page might have
>> + * been RDMA pinned, not much can be done for recovery.
>> + * But a SIGBUS should be delivered with vaddr provided so that the user
>> + * application has a chance to recover. Also, application processes'
>> + * election for MCE early killed will be honored.
>> + */
>> +static int kill_procs_now(struct page *p, unsigned long pfn, int flags,
>> +				struct folio *folio)
>> +{
>> +	LIST_HEAD(tokill);
>> +
>> +	collect_procs(folio, p, &tokill, flags & MF_ACTION_REQUIRED);
>> +	kill_procs(&tokill, true, pfn, flags);
>> +
>> +	return -EHWPOISON;
>> +}
>> +
>>   /**
>>    * memory_failure - Handle memory failure of a page.
>>    * @pfn: Page Number of the corrupted page
>> @@ -2328,8 +2351,10 @@ int memory_failure(unsigned long pfn, int flags)
>>   		 * page is a valid handlable page.
>>   		 */
>>   		folio_set_has_hwpoisoned(folio);
>> -		if (try_to_split_thp_page(p) < 0) {
>> -			res = action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
>> +		if (try_to_split_thp_page(p, false) < 0) {
>> +			res = kill_procs_now(p, pfn, flags, folio);
> No strong opinion, but we might remove the return value of kill_procs_now, as
> it always returns -EHWPOISON? We could simply set res to -EHWPOISON here.
I like that, will change.
>
> Apart from the possible nits above, this patch looks good to me.
> Acked-by: Miaohe Lin <linmiaohe@huawei.com>
> Thanks.

Thanks!

-jane

> .
>
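
The refcount handling behind the new 'release' argument, together with
the agreed change to let the caller set res itself, can be modelled in
userspace as below. This is a sketch, not the patch: struct mock_page,
the mock_* functions and SKETCH_EHWPOISON are hypothetical, the split is
assumed to fail only when the page is pinned, and the final put after
signalling is an assumption about how the caller eventually drops its
reference.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_EHWPOISON 133	/* stand-in for the kernel's EHWPOISON */

struct mock_page {
	int refcount;	/* references held on the page */
	bool pinned;	/* e.g. RDMA pinned, so the split fails */
};

static int signals_sent;	/* counts mock SIGBUS deliveries */

static void mock_put_page(struct mock_page *p)
{
	assert(p->refcount > 0);
	p->refcount--;
}

/* Hypothetical split: fails with a negative value when pinned. */
static int mock_split_huge_page(struct mock_page *p)
{
	return p->pinned ? -1 : 0;
}

/* On failure, drop the caller's reference only if 'release' is true. */
static int mock_try_to_split_thp_page(struct mock_page *p, bool release)
{
	int ret = mock_split_huge_page(p);

	if (ret && release)
		mock_put_page(p);
	return ret;
}

/* No return value, as suggested in review; it only signals. */
static void mock_kill_procs_now(struct mock_page *p)
{
	(void)p;
	signals_sent++;
}

static int mock_memory_failure_thp(struct mock_page *p)
{
	int res = 0;

	if (mock_try_to_split_thp_page(p, false) < 0) {
		mock_kill_procs_now(p);	/* reference is still held here */
		res = -SKETCH_EHWPOISON;
		mock_put_page(p);	/* assumed: dropped once we are done */
	}
	return res;
}
```

The point of release == false is that the page reference stays valid
while the processes are collected and signalled; with release == true
the reference is gone as soon as the split fails.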



Thread overview: 16+ messages
2024-05-21 23:54 [PATCH v3 0/5] Enhance soft hwpoison handling and injection Jane Chu
2024-05-21 23:54 ` [PATCH v3 1/5] mm/memory-failure: try to send SIGBUS even if unmap failed Jane Chu
2024-05-21 23:54 ` [PATCH v3 2/5] mm/madvise: Add MF_ACTION_REQUIRED to madvise(MADV_HWPOISON) Jane Chu
2024-05-23  1:54   ` Miaohe Lin
2024-05-21 23:54 ` [PATCH v3 3/5] mm/memory-failure: improve memory failure action_result messages Jane Chu
2024-05-22 20:37   ` Oscar Salvador
2024-05-23 17:38     ` Jane Chu
2024-05-23  2:31   ` Miaohe Lin
2024-05-23 19:58     ` Jane Chu
2024-05-21 23:54 ` [PATCH v3 4/5] mm/memory-failure: move hwpoison_filter() higher up Jane Chu
2024-05-22 20:50   ` Oscar Salvador
2024-05-23  2:37   ` Miaohe Lin
2024-05-21 23:54 ` [PATCH v3 5/5] mm/memory-failure: send SIGBUS in the event of thp split fail Jane Chu
2024-05-22 20:57   ` Oscar Salvador
2024-05-23  3:02   ` Miaohe Lin
2024-05-23 20:01     ` Jane Chu
