linux-mm.kvack.org archive mirror
* [PATCH v3 0/4] Improve hugetlbfs read on HWPOISON hugepages
@ 2023-07-07 20:19 Jiaqi Yan
  2023-07-07 20:19 ` [PATCH v3 1/4] mm/hwpoison: delete all entries before traversal in __folio_free_raw_hwp Jiaqi Yan
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Jiaqi Yan @ 2023-07-07 20:19 UTC (permalink / raw)
  To: akpm, mike.kravetz, naoya.horiguchi
  Cc: songmuchun, shy828301, linmiaohe, linux-mm, linux-kernel,
	duenwen, axelrasmussen, jthoughton, Jiaqi Yan

Today when hardware memory is corrupted in a hugetlb hugepage, the
kernel leaves the hugepage in the pagecache [1]; otherwise a future mmap
or read would be subject to silent data corruption. This is implemented
by returning -EIO from hugetlbfs_read_iter immediately if the hugepage
has the HWPOISON flag set.

Since memory_failure already tracks the raw HWPOISON subpages in a
hugepage, a natural improvement is possible: if userspace only asks for
healthy subpages in the pagecache, the kernel can return their data.

This patchset implements this improvement. It consists of four commits.
The 1st commit fixes how __folio_free_raw_hwp deletes and frees the
entries in raw_hwp_list. The 2nd commit exports the functionality to
tell if a subpage inside a hugetlb hugepage is a raw HWPOISON page.
The 3rd commit teaches hugetlbfs_read_iter to return as many healthy
bytes as possible, and the 4th commit properly tests this new feature.

[1] commit 8625147cafaa ("hugetlbfs: don't delete error page from pagecache")

Changelog

v2 => v3
* Updates commit messages for future reader to know background and
  code details
* v3 is based on commit 5bb367dca2b9 ("Merge branch 'master' into
  mm-stable")

v1 => v2
* __folio_free_raw_hwp deletes all entries in raw_hwp_list before it
  traverses and frees raw_hwp_page.
* find_raw_hwp_page => __is_raw_hwp_subpage and __is_raw_hwp_subpage
  only returns bool instead of a raw_hwp_page entry.
* is_raw_hwp_subpage holds hugetlb_lock while checking
  __is_raw_hwp_subpage.
* No need to do folio_lock in adjust_range_hwpoison.
* v2 is based on commit a6e79df92e4a ("mm/gup: disallow FOLL_LONGTERM
  GUP-fast writing to file-backed mappings")

Jiaqi Yan (4):
  mm/hwpoison: delete all entries before traversal in
    __folio_free_raw_hwp
  mm/hwpoison: check if a subpage of a hugetlb folio is raw HWPOISON
  hugetlbfs: improve read HWPOISON hugepage
  selftests/mm: add tests for HWPOISON hugetlbfs read

 fs/hugetlbfs/inode.c                          |  58 +++-
 include/linux/hugetlb.h                       |  19 ++
 include/linux/mm.h                            |   7 +
 mm/hugetlb.c                                  |  10 +
 mm/memory-failure.c                           |  42 ++-
 tools/testing/selftests/mm/.gitignore         |   1 +
 tools/testing/selftests/mm/Makefile           |   1 +
 .../selftests/mm/hugetlb-read-hwpoison.c      | 322 ++++++++++++++++++
 8 files changed, 439 insertions(+), 21 deletions(-)
 create mode 100644 tools/testing/selftests/mm/hugetlb-read-hwpoison.c

-- 
2.41.0.255.g8b1d071c50-goog



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v3 1/4] mm/hwpoison: delete all entries before traversal in  __folio_free_raw_hwp
  2023-07-07 20:19 [PATCH v3 0/4] Improve hugetlbfs read on HWPOISON hugepages Jiaqi Yan
@ 2023-07-07 20:19 ` Jiaqi Yan
  2023-07-08  2:40   ` Miaohe Lin
  2023-07-07 20:19 ` [PATCH v3 2/4] mm/hwpoison: check if a subpage of a hugetlb folio is raw HWPOISON Jiaqi Yan
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Jiaqi Yan @ 2023-07-07 20:19 UTC (permalink / raw)
  To: akpm, mike.kravetz, naoya.horiguchi
  Cc: songmuchun, shy828301, linmiaohe, linux-mm, linux-kernel,
	duenwen, axelrasmussen, jthoughton, Jiaqi Yan

Traversal on llist (e.g. llist_for_each_safe) is only safe AFTER entries
are deleted from the llist. Correct the way __folio_free_raw_hwp deletes
and frees raw_hwp_page entries in raw_hwp_list: first llist_del_all, then
kfree within llist_for_each_safe.

As of today, concurrent adding, deleting, and traversal of raw_hwp_list
from hugetlb.c and/or memory-failure.c are safe with respect to each
other. Note this is guaranteed partly by the lock-free nature of llist,
and partly by holding hugetlb_lock and/or mf_mutex. For example, since
llist_del_all is lock-free with itself, the
folio_clear_hugetlb_hwpoison() calls from
__update_and_free_hugetlb_folio and memory_failure don't need explicit
locking when freeing the raw_hwp_list. New code that manipulates
raw_hwp_list must take care to preserve this concurrency correctness.

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 mm/memory-failure.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e245191e6b04..a08677dcf953 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1829,12 +1829,11 @@ static inline struct llist_head *raw_hwp_list_head(struct folio *folio)
 
 static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
 {
-	struct llist_head *head;
-	struct llist_node *t, *tnode;
+	struct llist_node *t, *tnode, *head;
 	unsigned long count = 0;
 
-	head = raw_hwp_list_head(folio);
-	llist_for_each_safe(tnode, t, head->first) {
+	head = llist_del_all(raw_hwp_list_head(folio));
+	llist_for_each_safe(tnode, t, head) {
 		struct raw_hwp_page *p = container_of(tnode, struct raw_hwp_page, node);
 
 		if (move_flag)
@@ -1844,7 +1843,6 @@ static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
 		kfree(p);
 		count++;
 	}
-	llist_del_all(head);
 	return count;
 }
 
-- 
2.41.0.255.g8b1d071c50-goog




* [PATCH v3 2/4] mm/hwpoison: check if a subpage of a hugetlb folio is raw HWPOISON
  2023-07-07 20:19 [PATCH v3 0/4] Improve hugetlbfs read on HWPOISON hugepages Jiaqi Yan
  2023-07-07 20:19 ` [PATCH v3 1/4] mm/hwpoison: delete all entries before traversal in __folio_free_raw_hwp Jiaqi Yan
@ 2023-07-07 20:19 ` Jiaqi Yan
  2023-07-07 20:31   ` Matthew Wilcox
  2023-07-08  2:57   ` Miaohe Lin
  2023-07-07 20:19 ` [PATCH v3 3/4] hugetlbfs: improve read HWPOISON hugepage Jiaqi Yan
  2023-07-07 20:19 ` [PATCH v3 4/4] selftests/mm: add tests for HWPOISON hugetlbfs read Jiaqi Yan
  3 siblings, 2 replies; 17+ messages in thread
From: Jiaqi Yan @ 2023-07-07 20:19 UTC (permalink / raw)
  To: akpm, mike.kravetz, naoya.horiguchi
  Cc: songmuchun, shy828301, linmiaohe, linux-mm, linux-kernel,
	duenwen, axelrasmussen, jthoughton, Jiaqi Yan

Add the functionality, is_raw_hwp_subpage, to tell if a subpage of a
hugetlb folio is a raw HWPOISON page. This functionality relies on
RawHwpUnreliable not being set; otherwise the hugepage's raw HWPOISON
list becomes meaningless.

is_raw_hwp_subpage needs to hold hugetlb_lock in order to synchronize
with __get_huge_page_for_hwpoison, which iterates over and inserts
entries into raw_hwp_list. llist itself doesn't ensure insertion is
synchronized with the iteration used by __is_raw_hwp_subpage. Callers
can minimize the overhead of lock cycles by first checking if the
folio's / head page's HWPOISON flag is set.

Export this functionality to be immediately used in the read operation
for hugetlbfs.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 include/linux/hugetlb.h | 19 +++++++++++++++++++
 include/linux/mm.h      |  7 +++++++
 mm/hugetlb.c            | 10 ++++++++++
 mm/memory-failure.c     | 34 ++++++++++++++++++++++++----------
 4 files changed, 60 insertions(+), 10 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ca3c8e10f24a..4a745af98525 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -1007,6 +1007,25 @@ void hugetlb_register_node(struct node *node);
 void hugetlb_unregister_node(struct node *node);
 #endif
 
+/*
+ * Struct raw_hwp_page represents information about "raw error page",
+ * constructing singly linked list from ->_hugetlb_hwpoison field of folio.
+ */
+struct raw_hwp_page {
+	struct llist_node node;
+	struct page *page;
+};
+
+static inline struct llist_head *raw_hwp_list_head(struct folio *folio)
+{
+	return (struct llist_head *)&folio->_hugetlb_hwpoison;
+}
+
+/*
+ * Check if a given raw @subpage in a hugepage @folio is HWPOISON.
+ */
+bool is_raw_hwp_subpage(struct folio *folio, struct page *subpage);
+
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 74f1be743ba2..edaa18b6f731 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3680,6 +3680,7 @@ extern const struct attribute_group memory_failure_attr_group;
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
 					bool *migratable_cleared);
+extern bool __is_raw_hwp_subpage(struct folio *folio, struct page *subpage);
 void num_poisoned_pages_inc(unsigned long pfn);
 void num_poisoned_pages_sub(unsigned long pfn, long i);
 struct task_struct *task_early_kill(struct task_struct *tsk, int force_early);
@@ -3694,6 +3695,12 @@ static inline int __get_huge_page_for_hwpoison(unsigned long pfn, int flags,
 	return 0;
 }
 
+static inline bool __is_raw_hwp_subpage(struct folio *folio,
+					struct page *subpage)
+{
+	return false;
+}
+
 static inline void num_poisoned_pages_inc(unsigned long pfn)
 {
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bce28cca73a1..9c608d2f6630 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7373,6 +7373,16 @@ int get_huge_page_for_hwpoison(unsigned long pfn, int flags,
 	return ret;
 }
 
+bool is_raw_hwp_subpage(struct folio *folio, struct page *subpage)
+{
+	bool ret;
+
+	spin_lock_irq(&hugetlb_lock);
+	ret = __is_raw_hwp_subpage(folio, subpage);
+	spin_unlock_irq(&hugetlb_lock);
+	return ret;
+}
+
 void folio_putback_active_hugetlb(struct folio *folio)
 {
 	spin_lock_irq(&hugetlb_lock);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a08677dcf953..5b6c8ceb13c0 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1813,18 +1813,32 @@ EXPORT_SYMBOL_GPL(mf_dax_kill_procs);
 #endif /* CONFIG_FS_DAX */
 
 #ifdef CONFIG_HUGETLB_PAGE
-/*
- * Struct raw_hwp_page represents information about "raw error page",
- * constructing singly linked list from ->_hugetlb_hwpoison field of folio.
- */
-struct raw_hwp_page {
-	struct llist_node node;
-	struct page *page;
-};
 
-static inline struct llist_head *raw_hwp_list_head(struct folio *folio)
+bool __is_raw_hwp_subpage(struct folio *folio, struct page *subpage)
 {
-	return (struct llist_head *)&folio->_hugetlb_hwpoison;
+	struct llist_head *raw_hwp_head;
+	struct raw_hwp_page *p, *tmp;
+	bool ret = false;
+
+	if (!folio_test_hwpoison(folio))
+		return false;
+
+	/*
+	 * When RawHwpUnreliable is set, kernel lost track of which subpages
+	 * are HWPOISON. So return as if ALL subpages are HWPOISONed.
+	 */
+	if (folio_test_hugetlb_raw_hwp_unreliable(folio))
+		return true;
+
+	raw_hwp_head = raw_hwp_list_head(folio);
+	llist_for_each_entry_safe(p, tmp, raw_hwp_head->first, node) {
+		if (subpage == p->page) {
+			ret = true;
+			break;
+		}
+	}
+
+	return ret;
 }
 
 static unsigned long __folio_free_raw_hwp(struct folio *folio, bool move_flag)
-- 
2.41.0.255.g8b1d071c50-goog




* [PATCH v3 3/4] hugetlbfs: improve read HWPOISON hugepage
  2023-07-07 20:19 [PATCH v3 0/4] Improve hugetlbfs read on HWPOISON hugepages Jiaqi Yan
  2023-07-07 20:19 ` [PATCH v3 1/4] mm/hwpoison: delete all entries before traversal in __folio_free_raw_hwp Jiaqi Yan
  2023-07-07 20:19 ` [PATCH v3 2/4] mm/hwpoison: check if a subpage of a hugetlb folio is raw HWPOISON Jiaqi Yan
@ 2023-07-07 20:19 ` Jiaqi Yan
  2023-07-07 20:19 ` [PATCH v3 4/4] selftests/mm: add tests for HWPOISON hugetlbfs read Jiaqi Yan
  3 siblings, 0 replies; 17+ messages in thread
From: Jiaqi Yan @ 2023-07-07 20:19 UTC (permalink / raw)
  To: akpm, mike.kravetz, naoya.horiguchi
  Cc: songmuchun, shy828301, linmiaohe, linux-mm, linux-kernel,
	duenwen, axelrasmussen, jthoughton, Jiaqi Yan

When a hugepage contains HWPOISON pages, read() fails to read any byte
of the hugepage and returns -EIO, although many bytes in the HWPOISON
hugepage are readable.

Improve this by allowing hugetlbfs_read_iter to return as many bytes as
possible. For a requested range [offset, offset + len) that contains a
HWPOISON page, return [offset, first HWPOISON page addr); the next read
attempt will fail and return -EIO.

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 fs/hugetlbfs/inode.c | 58 +++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 52 insertions(+), 6 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 7b17ccfa039d..c2b807d37f85 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -282,6 +282,42 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 }
 #endif
 
+/*
+ * Someone wants to read @bytes from a HWPOISON hugetlb @page from @offset.
+ * Returns the maximum number of bytes one can read without touching the 1st raw
+ * HWPOISON subpage.
+ *
+ * The implementation borrows the iteration logic from copy_page_to_iter*.
+ */
+static size_t adjust_range_hwpoison(struct page *page, size_t offset, size_t bytes)
+{
+	size_t n = 0;
+	size_t res = 0;
+	struct folio *folio = page_folio(page);
+
+	/* First subpage to start the loop. */
+	page += offset / PAGE_SIZE;
+	offset %= PAGE_SIZE;
+	while (1) {
+		if (is_raw_hwp_subpage(folio, page))
+			break;
+
+		/* Safe to read n bytes without touching HWPOISON subpage. */
+		n = min(bytes, (size_t)PAGE_SIZE - offset);
+		res += n;
+		bytes -= n;
+		if (!bytes || !n)
+			break;
+		offset += n;
+		if (offset == PAGE_SIZE) {
+			page++;
+			offset = 0;
+		}
+	}
+
+	return res;
+}
+
 /*
  * Support for read() - Find the page attached to f_mapping and copy out the
  * data. This provides functionality similar to filemap_read().
@@ -300,7 +336,7 @@ static ssize_t hugetlbfs_read_iter(struct kiocb *iocb, struct iov_iter *to)
 
 	while (iov_iter_count(to)) {
 		struct page *page;
-		size_t nr, copied;
+		size_t nr, copied, want;
 
 		/* nr is the maximum number of bytes to copy from this page */
 		nr = huge_page_size(h);
@@ -328,16 +364,26 @@ static ssize_t hugetlbfs_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		} else {
 			unlock_page(page);
 
-			if (PageHWPoison(page)) {
-				put_page(page);
-				retval = -EIO;
-				break;
+			if (!PageHWPoison(page))
+				want = nr;
+			else {
+				/*
+				 * Adjust how many bytes safe to read without
+				 * touching the 1st raw HWPOISON subpage after
+				 * offset.
+				 */
+				want = adjust_range_hwpoison(page, offset, nr);
+				if (want == 0) {
+					put_page(page);
+					retval = -EIO;
+					break;
+				}
 			}
 
 			/*
 			 * We have the page, copy it to user space buffer.
 			 */
-			copied = copy_page_to_iter(page, offset, nr, to);
+			copied = copy_page_to_iter(page, offset, want, to);
 			put_page(page);
 		}
 		offset += copied;
-- 
2.41.0.255.g8b1d071c50-goog




* [PATCH v3 4/4] selftests/mm: add tests for HWPOISON hugetlbfs read
  2023-07-07 20:19 [PATCH v3 0/4] Improve hugetlbfs read on HWPOISON hugepages Jiaqi Yan
                   ` (2 preceding siblings ...)
  2023-07-07 20:19 ` [PATCH v3 3/4] hugetlbfs: improve read HWPOISON hugepage Jiaqi Yan
@ 2023-07-07 20:19 ` Jiaqi Yan
  3 siblings, 0 replies; 17+ messages in thread
From: Jiaqi Yan @ 2023-07-07 20:19 UTC (permalink / raw)
  To: akpm, mike.kravetz, naoya.horiguchi
  Cc: songmuchun, shy828301, linmiaohe, linux-mm, linux-kernel,
	duenwen, axelrasmussen, jthoughton, Jiaqi Yan

Add tests for the improvement made to the read operation on HWPOISON
hugetlb pages with different read granularities. For each chunk size,
three read scenarios are tested:
1. Simple regression test on read without HWPOISON.
2. Sequential read page by page should succeed until it encounters the
   1st raw HWPOISON subpage.
3. After skipping a raw HWPOISON subpage via lseek, read()s always succeed.

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 tools/testing/selftests/mm/.gitignore         |   1 +
 tools/testing/selftests/mm/Makefile           |   1 +
 .../selftests/mm/hugetlb-read-hwpoison.c      | 322 ++++++++++++++++++
 3 files changed, 324 insertions(+)
 create mode 100644 tools/testing/selftests/mm/hugetlb-read-hwpoison.c

diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
index 7e2a982383c0..cdc9ce4426b9 100644
--- a/tools/testing/selftests/mm/.gitignore
+++ b/tools/testing/selftests/mm/.gitignore
@@ -5,6 +5,7 @@ hugepage-mremap
 hugepage-shm
 hugepage-vmemmap
 hugetlb-madvise
+hugetlb-read-hwpoison
 khugepaged
 map_hugetlb
 map_populate
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index 66d7c07dc177..b7fce9073279 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -41,6 +41,7 @@ TEST_GEN_PROGS += gup_longterm
 TEST_GEN_PROGS += gup_test
 TEST_GEN_PROGS += hmm-tests
 TEST_GEN_PROGS += hugetlb-madvise
+TEST_GEN_PROGS += hugetlb-read-hwpoison
 TEST_GEN_PROGS += hugepage-mmap
 TEST_GEN_PROGS += hugepage-mremap
 TEST_GEN_PROGS += hugepage-shm
diff --git a/tools/testing/selftests/mm/hugetlb-read-hwpoison.c b/tools/testing/selftests/mm/hugetlb-read-hwpoison.c
new file mode 100644
index 000000000000..ba6cc6f9cabc
--- /dev/null
+++ b/tools/testing/selftests/mm/hugetlb-read-hwpoison.c
@@ -0,0 +1,322 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <stdlib.h>
+#include <stdio.h>
+#include <string.h>
+
+#include <linux/magic.h>
+#include <sys/mman.h>
+#include <sys/statfs.h>
+#include <errno.h>
+#include <stdbool.h>
+
+#include "../kselftest.h"
+
+#define PREFIX " ... "
+#define ERROR_PREFIX " !!! "
+
+#define MAX_WRITE_READ_CHUNK_SIZE (getpagesize() * 16)
+#define MAX(a, b) (((a) > (b)) ? (a) : (b))
+
+enum test_status {
+	TEST_PASSED = 0,
+	TEST_FAILED = 1,
+	TEST_SKIPPED = 2,
+};
+
+static char *status_to_str(enum test_status status)
+{
+	switch (status) {
+	case TEST_PASSED:
+		return "TEST_PASSED";
+	case TEST_FAILED:
+		return "TEST_FAILED";
+	case TEST_SKIPPED:
+		return "TEST_SKIPPED";
+	default:
+		return "TEST_???";
+	}
+}
+
+static int setup_filemap(char *filemap, size_t len, size_t wr_chunk_size)
+{
+	char iter = 0;
+
+	for (size_t offset = 0; offset < len;
+	     offset += wr_chunk_size) {
+		iter++;
+		memset(filemap + offset, iter, wr_chunk_size);
+	}
+
+	return 0;
+}
+
+static bool verify_chunk(char *buf, size_t len, char val)
+{
+	size_t i;
+
+	for (i = 0; i < len; ++i) {
+		if (buf[i] != val) {
+			printf(PREFIX ERROR_PREFIX "check fail: buf[%lu] = %u != %u\n",
+				i, buf[i], val);
+			return false;
+		}
+	}
+
+	return true;
+}
+
+static bool seek_read_hugepage_filemap(int fd, size_t len, size_t wr_chunk_size,
+				       off_t offset, size_t expected)
+{
+	char buf[MAX_WRITE_READ_CHUNK_SIZE];
+	ssize_t ret_count = 0;
+	ssize_t total_ret_count = 0;
+	char val = offset / wr_chunk_size + offset % wr_chunk_size;
+
+	printf(PREFIX PREFIX "init val=%u with offset=0x%lx\n", val, offset);
+	printf(PREFIX PREFIX "expect to read 0x%lx bytes of data in total\n",
+	       expected);
+	if (lseek(fd, offset, SEEK_SET) < 0) {
+		perror(PREFIX ERROR_PREFIX "seek failed");
+		return false;
+	}
+
+	while (offset + total_ret_count < len) {
+		ret_count = read(fd, buf, wr_chunk_size);
+		if (ret_count == 0) {
+			printf(PREFIX PREFIX "read reach end of the file\n");
+			break;
+		} else if (ret_count < 0) {
+			perror(PREFIX ERROR_PREFIX "read failed");
+			break;
+		}
+		++val;
+		if (!verify_chunk(buf, ret_count, val))
+			return false;
+
+		total_ret_count += ret_count;
+	}
+	printf(PREFIX PREFIX "actually read 0x%lx bytes of data in total\n",
+	       total_ret_count);
+
+	return total_ret_count == expected;
+}
+
+static bool read_hugepage_filemap(int fd, size_t len,
+				  size_t wr_chunk_size, size_t expected)
+{
+	char buf[MAX_WRITE_READ_CHUNK_SIZE];
+	ssize_t ret_count = 0;
+	ssize_t total_ret_count = 0;
+	char val = 0;
+
+	printf(PREFIX PREFIX "expect to read 0x%lx bytes of data in total\n",
+	       expected);
+	while (total_ret_count < len) {
+		ret_count = read(fd, buf, wr_chunk_size);
+		if (ret_count == 0) {
+			printf(PREFIX PREFIX "read reach end of the file\n");
+			break;
+		} else if (ret_count < 0) {
+			perror(PREFIX ERROR_PREFIX "read failed");
+			break;
+		}
+		++val;
+		if (!verify_chunk(buf, ret_count, val))
+			return false;
+
+		total_ret_count += ret_count;
+	}
+	printf(PREFIX PREFIX "actually read 0x%lx bytes of data in total\n",
+	       total_ret_count);
+
+	return total_ret_count == expected;
+}
+
+static enum test_status
+test_hugetlb_read(int fd, size_t len, size_t wr_chunk_size)
+{
+	enum test_status status = TEST_SKIPPED;
+	char *filemap = NULL;
+
+	if (ftruncate(fd, len) < 0) {
+		perror(PREFIX ERROR_PREFIX "ftruncate failed");
+		return status;
+	}
+
+	filemap = mmap(NULL, len, PROT_READ | PROT_WRITE,
+		       MAP_SHARED | MAP_POPULATE, fd, 0);
+	if (filemap == MAP_FAILED) {
+		perror(PREFIX ERROR_PREFIX "mmap for primary mapping failed");
+		goto done;
+	}
+
+	setup_filemap(filemap, len, wr_chunk_size);
+	status = TEST_FAILED;
+
+	if (read_hugepage_filemap(fd, len, wr_chunk_size, len))
+		status = TEST_PASSED;
+
+	munmap(filemap, len);
+done:
+	if (ftruncate(fd, 0) < 0) {
+		perror(PREFIX ERROR_PREFIX "ftruncate back to 0 failed");
+		status = TEST_FAILED;
+	}
+
+	return status;
+}
+
+static enum test_status
+test_hugetlb_read_hwpoison(int fd, size_t len, size_t wr_chunk_size,
+			   bool skip_hwpoison_page)
+{
+	enum test_status status = TEST_SKIPPED;
+	char *filemap = NULL;
+	char *hwp_addr = NULL;
+	const unsigned long pagesize = getpagesize();
+
+	if (ftruncate(fd, len) < 0) {
+		perror(PREFIX ERROR_PREFIX "ftruncate failed");
+		return status;
+	}
+
+	filemap = mmap(NULL, len, PROT_READ | PROT_WRITE,
+		       MAP_SHARED | MAP_POPULATE, fd, 0);
+	if (filemap == MAP_FAILED) {
+		perror(PREFIX ERROR_PREFIX "mmap for primary mapping failed");
+		goto done;
+	}
+
+	setup_filemap(filemap, len, wr_chunk_size);
+	status = TEST_FAILED;
+
+	/*
+	 * Poisoned hugetlb page layout (assume hugepagesize=2MB):
+	 * |<---------------------- 1MB ---------------------->|
+	 * |<---- healthy page ---->|<---- HWPOISON page ----->|
+	 * |<------------------- (1MB - 8KB) ----------------->|
+	 */
+	hwp_addr = filemap + len / 2 + pagesize;
+	if (madvise(hwp_addr, pagesize, MADV_HWPOISON) < 0) {
+		perror(PREFIX ERROR_PREFIX "MADV_HWPOISON failed");
+		goto unmap;
+	}
+
+	if (!skip_hwpoison_page) {
+		/*
+		 * Userspace should be able to read (1MB + 1 page) from
+		 * the beginning of the HWPOISONed hugepage.
+		 */
+		if (read_hugepage_filemap(fd, len, wr_chunk_size,
+					  len / 2 + pagesize))
+			status = TEST_PASSED;
+	} else {
+		/*
+		 * Userspace should be able to read (1MB - 2 pages) from
+		 * HWPOISONed hugepage.
+		 */
+		if (seek_read_hugepage_filemap(fd, len, wr_chunk_size,
+					       len / 2 + MAX(2 * pagesize, wr_chunk_size),
+					       len / 2 - MAX(2 * pagesize, wr_chunk_size)))
+			status = TEST_PASSED;
+	}
+
+unmap:
+	munmap(filemap, len);
+done:
+	if (ftruncate(fd, 0) < 0) {
+		perror(PREFIX ERROR_PREFIX "ftruncate back to 0 failed");
+		status = TEST_FAILED;
+	}
+
+	return status;
+}
+
+static int create_hugetlbfs_file(struct statfs *file_stat)
+{
+	int fd;
+
+	fd = memfd_create("hugetlb_tmp", MFD_HUGETLB);
+	if (fd < 0) {
+		perror(PREFIX ERROR_PREFIX "could not open hugetlbfs file");
+		return -1;
+	}
+
+	memset(file_stat, 0, sizeof(*file_stat));
+	if (fstatfs(fd, file_stat)) {
+		perror(PREFIX ERROR_PREFIX "fstatfs failed");
+		goto close;
+	}
+	if (file_stat->f_type != HUGETLBFS_MAGIC) {
+		printf(PREFIX ERROR_PREFIX "not hugetlbfs file\n");
+		goto close;
+	}
+
+	return fd;
+close:
+	close(fd);
+	return -1;
+}
+
+int main(void)
+{
+	int fd;
+	struct statfs file_stat;
+	enum test_status status;
+	/* Test read() in different granularity. */
+	size_t wr_chunk_sizes[] = {
+		getpagesize() / 2, getpagesize(),
+		getpagesize() * 2, getpagesize() * 4
+	};
+	size_t i;
+
+	for (i = 0; i < ARRAY_SIZE(wr_chunk_sizes); ++i) {
+		printf("Write/read chunk size=0x%lx\n",
+		       wr_chunk_sizes[i]);
+
+		fd = create_hugetlbfs_file(&file_stat);
+		if (fd < 0)
+			goto create_failure;
+		printf(PREFIX "HugeTLB read regression test...\n");
+		status = test_hugetlb_read(fd, file_stat.f_bsize,
+					   wr_chunk_sizes[i]);
+		printf(PREFIX "HugeTLB read regression test...%s\n",
+		       status_to_str(status));
+		close(fd);
+		if (status == TEST_FAILED)
+			return -1;
+
+		fd = create_hugetlbfs_file(&file_stat);
+		if (fd < 0)
+			goto create_failure;
+		printf(PREFIX "HugeTLB read HWPOISON test...\n");
+		status = test_hugetlb_read_hwpoison(fd, file_stat.f_bsize,
+						    wr_chunk_sizes[i], false);
+		printf(PREFIX "HugeTLB read HWPOISON test...%s\n",
+		       status_to_str(status));
+		close(fd);
+		if (status == TEST_FAILED)
+			return -1;
+
+		fd = create_hugetlbfs_file(&file_stat);
+		if (fd < 0)
+			goto create_failure;
+		printf(PREFIX "HugeTLB seek then read HWPOISON test...\n");
+		status = test_hugetlb_read_hwpoison(fd, file_stat.f_bsize,
+						    wr_chunk_sizes[i], true);
+		printf(PREFIX "HugeTLB seek then read HWPOISON test...%s\n",
+		       status_to_str(status));
+		close(fd);
+		if (status == TEST_FAILED)
+			return -1;
+	}
+
+	return 0;
+
+create_failure:
+	printf(ERROR_PREFIX "Abort test: failed to create hugetlbfs file\n");
+	return -1;
+}
-- 
2.41.0.255.g8b1d071c50-goog




* Re: [PATCH v3 2/4] mm/hwpoison: check if a subpage of a hugetlb folio is raw HWPOISON
  2023-07-07 20:19 ` [PATCH v3 2/4] mm/hwpoison: check if a subpage of a hugetlb folio is raw HWPOISON Jiaqi Yan
@ 2023-07-07 20:31   ` Matthew Wilcox
  2023-07-07 21:05     ` Andrew Morton
  2023-07-10  0:21     ` Naoya Horiguchi
  2023-07-08  2:57   ` Miaohe Lin
  1 sibling, 2 replies; 17+ messages in thread
From: Matthew Wilcox @ 2023-07-07 20:31 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: akpm, mike.kravetz, naoya.horiguchi, songmuchun, shy828301,
	linmiaohe, linux-mm, linux-kernel, duenwen, axelrasmussen,
	jthoughton

On Fri, Jul 07, 2023 at 08:19:02PM +0000, Jiaqi Yan wrote:
> Add the functionality, is_raw_hwp_subpage, to tell if a subpage of a

This is incorrect naming.  "subpage" was needed before we had the
folio concept, but it should not be used any more.  We have folios
and pages now.

Also, abbreviating "hwpoison" as "hwp" seems like a bad idea to me.
hwp is already used as an acronym by acpi, intel_pstate, some clock
drivers, an ethernet driver, and a scsi driver.




* Re: [PATCH v3 2/4] mm/hwpoison: check if a subpage of a hugetlb folio is raw HWPOISON
  2023-07-07 20:31   ` Matthew Wilcox
@ 2023-07-07 21:05     ` Andrew Morton
  2023-07-10  0:21     ` Naoya Horiguchi
  1 sibling, 0 replies; 17+ messages in thread
From: Andrew Morton @ 2023-07-07 21:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jiaqi Yan, mike.kravetz, naoya.horiguchi, songmuchun, shy828301,
	linmiaohe, linux-mm, linux-kernel, duenwen, axelrasmussen,
	jthoughton

On Fri, 7 Jul 2023 21:31:39 +0100 Matthew Wilcox <willy@infradead.org> wrote:

> On Fri, Jul 07, 2023 at 08:19:02PM +0000, Jiaqi Yan wrote:
> > Add the functionality, is_raw_hwp_subpage, to tell if a subpage of a
> 
> This is incorrect naming.  "subpage" was needed before we had the
> folio concept, but it should not be used any more.  We have folios
> and pages now.
> 
> Also, abbreviating "hwpoison" as "hwp" seems like a bad idea to me.
> hwp is already used as an acronym by acpi, intel_pstate, some clock
> drivers, an ethernet driver, and a scsi driver.

Thanks, Matthew.  I'll merge v3 so we hit next -next (which actually
won't be until Monday), and add a note that v4 is expected.




* Re: [PATCH v3 1/4] mm/hwpoison: delete all entries before traversal in __folio_free_raw_hwp
  2023-07-07 20:19 ` [PATCH v3 1/4] mm/hwpoison: delete all entries before traversal in __folio_free_raw_hwp Jiaqi Yan
@ 2023-07-08  2:40   ` Miaohe Lin
  0 siblings, 0 replies; 17+ messages in thread
From: Miaohe Lin @ 2023-07-08  2:40 UTC (permalink / raw)
  To: Jiaqi Yan, mike.kravetz, naoya.horiguchi, Andrew Morton
  Cc: songmuchun, shy828301, linux-mm, linux-kernel, duenwen,
	axelrasmussen, jthoughton

On 2023/7/8 4:19, Jiaqi Yan wrote:
> Traversal on llist (e.g. llist_for_each_safe) is only safe AFTER entries
> are deleted from the llist. Correct the way __folio_free_raw_hwp deletes
> and frees raw_hwp_page entries in raw_hwp_list: first llist_del_all, then
> kfree within llist_for_each_safe.
> 
> As of today, concurrent adding, deleting, and traversal on raw_hwp_list
> from hugetlb.c and/or memory-failure.c are fine with each other. Note

I think there's a race on freeing the raw_hwp_list between unpoison_memory and __update_and_free_hugetlb_folio:

  unpoison_memory		__update_and_free_hugetlb_folio
    				if (folio_test_hwpoison)
    	  			    folio_clear_hugetlb_hwpoison
    folio_free_raw_hwp		      folio_free_raw_hwp
    folio_test_clear_hwpoison

unpoison_memory and __update_and_free_hugetlb_folio can traverse and free the raw_hwp_list
at the same time. And I believe your patch will fix the problem. Thanks.

> this is guaranteed partly by the lock-free nature of llist, and partly
> by holding hugetlb_lock and/or mf_mutex. For example, as llist_del_all
> is lock-free with itself, folio_clear_hugetlb_hwpoison()s from
> __update_and_free_hugetlb_folio and memory_failure won't need explicit
> locking when freeing the raw_hwp_list. New code that manipulates
> raw_hwp_list must be careful to ensure the concurrency correctness.
> 
> Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>

Anyway, this patch looks good to me.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Thanks.




* Re: [PATCH v3 2/4] mm/hwpoison: check if a subpage of a hugetlb folio is raw HWPOISON
  2023-07-07 20:19 ` [PATCH v3 2/4] mm/hwpoison: check if a subpage of a hugetlb folio is raw HWPOISON Jiaqi Yan
  2023-07-07 20:31   ` Matthew Wilcox
@ 2023-07-08  2:57   ` Miaohe Lin
  2023-07-10 15:16     ` Jiaqi Yan
  1 sibling, 1 reply; 17+ messages in thread
From: Miaohe Lin @ 2023-07-08  2:57 UTC (permalink / raw)
  To: Jiaqi Yan, akpm, mike.kravetz, naoya.horiguchi
  Cc: songmuchun, shy828301, linux-mm, linux-kernel, duenwen,
	axelrasmussen, jthoughton

On 2023/7/8 4:19, Jiaqi Yan wrote:
> Add the functionality, is_raw_hwp_subpage, to tell if a subpage of a
> hugetlb folio is a raw HWPOISON page. This functionality relies on
> RawHwpUnreliable to be not set; otherwise hugepage's raw HWPOISON list
> becomes meaningless.
> 
> is_raw_hwp_subpage needs to hold hugetlb_lock in order to synchronize
> with __get_huge_page_for_hwpoison, who iterates and inserts an entry to
> raw_hwp_list. llist itself doesn't ensure insertion is synchornized with
> the iterating used by __is_raw_hwp_list. Caller can minimize the
> overhead of lock cycles by first checking if folio / head page's
> HWPOISON flag is set.
> 
> Exports this functionality to be immediately used in the read operation
> for hugetlbfs.
> 
> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> ---
>  include/linux/hugetlb.h | 19 +++++++++++++++++++
>  include/linux/mm.h      |  7 +++++++
>  mm/hugetlb.c            | 10 ++++++++++
>  mm/memory-failure.c     | 34 ++++++++++++++++++++++++----------
>  4 files changed, 60 insertions(+), 10 deletions(-)
> ...
> -static inline struct llist_head *raw_hwp_list_head(struct folio *folio)
> +bool __is_raw_hwp_subpage(struct folio *folio, struct page *subpage)
>  {
> -	return (struct llist_head *)&folio->_hugetlb_hwpoison;
> +	struct llist_head *raw_hwp_head;
> +	struct raw_hwp_page *p, *tmp;
> +	bool ret = false;
> +
> +	if (!folio_test_hwpoison(folio))
> +		return false;
> +
> +	/*
> +	 * When RawHwpUnreliable is set, kernel lost track of which subpages
> +	 * are HWPOISON. So return as if ALL subpages are HWPOISONed.
> +	 */
> +	if (folio_test_hugetlb_raw_hwp_unreliable(folio))
> +		return true;
> +
> +	raw_hwp_head = raw_hwp_list_head(folio);
> +	llist_for_each_entry_safe(p, tmp, raw_hwp_head->first, node) {

Since we don't free the raw_hwp_list here, does llist_for_each_entry work the same as llist_for_each_entry_safe?

> +		if (subpage == p->page) {
> +			ret = true;
> +			break;
> +		}
> +	}
> +
> +	return ret;
>  }

It seems there's a race between __is_raw_hwp_subpage and unpoison_memory:
  unpoison_memory		__is_raw_hwp_subpage
   				  if (!folio_test_hwpoison(folio)) -- hwpoison is set
    folio_free_raw_hwp            llist_for_each_entry_safe raw_hwp_list
      llist_del_all                 ..
    folio_test_clear_hwpoison

But __is_raw_hwp_subpage is used in hugetlbfs, so unpoison_memory couldn't reach here: there's a
folio_mapping == NULL check before folio_free_raw_hwp.

Anyway, this patch looks good to me.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Thanks.




* Re: [PATCH v3 2/4] mm/hwpoison: check if a subpage of a hugetlb folio is raw HWPOISON
  2023-07-07 20:31   ` Matthew Wilcox
  2023-07-07 21:05     ` Andrew Morton
@ 2023-07-10  0:21     ` Naoya Horiguchi
  2023-07-10 15:11       ` Jiaqi Yan
  1 sibling, 1 reply; 17+ messages in thread
From: Naoya Horiguchi @ 2023-07-10  0:21 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jiaqi Yan, akpm, mike.kravetz, naoya.horiguchi, songmuchun,
	shy828301, linmiaohe, linux-mm, linux-kernel, duenwen,
	axelrasmussen, jthoughton

On Fri, Jul 07, 2023 at 09:31:39PM +0100, Matthew Wilcox wrote:
> On Fri, Jul 07, 2023 at 08:19:02PM +0000, Jiaqi Yan wrote:
> > Add the functionality, is_raw_hwp_subpage, to tell if a subpage of a
> 
> This is incorrect naming.  "subpage" was needed before we had the
> folio concept, but it should not be used any more.  We have folios
> and pages now.

I think we can address the raw hwpoison page by the offset in folio/hugepage
to eliminate the concept of "subpage".

> 
> Also, abbreviating "hwpoison" as "hwp" seems like a bad idea to me.
> hwp is already used as an acronym by acpi, intel_pstate, some clock
> drivers, an ethernet driver, and a scsi driver.

I originally introduced the abbreviation "hwp" to avoid using a lengthy
function name such as "folio_test_hugetlb_raw_hwpoison_unreliable()."
Therefore, I prefer using "rawhwp" instead of a longer form like
"raw_hwpoison," although I don't expect any confusion between "hwp" and
"rawhwp."
As for "hwp_walk", another user of "hwp" in mm/memory-failure.c,
we can easily substitute it with "hwpoison_walk."

Thanks,
Naoya Horiguchi



* Re: [PATCH v3 2/4] mm/hwpoison: check if a subpage of a hugetlb folio is raw HWPOISON
  2023-07-10  0:21     ` Naoya Horiguchi
@ 2023-07-10 15:11       ` Jiaqi Yan
  2023-07-11  8:59         ` Naoya Horiguchi
  0 siblings, 1 reply; 17+ messages in thread
From: Jiaqi Yan @ 2023-07-10 15:11 UTC (permalink / raw)
  To: Naoya Horiguchi, Matthew Wilcox
  Cc: akpm, mike.kravetz, naoya.horiguchi, songmuchun, shy828301,
	linmiaohe, linux-mm, linux-kernel, duenwen, axelrasmussen,
	jthoughton

On Sun, Jul 9, 2023 at 5:21 PM Naoya Horiguchi
<naoya.horiguchi@linux.dev> wrote:
>
> On Fri, Jul 07, 2023 at 09:31:39PM +0100, Matthew Wilcox wrote:
> > On Fri, Jul 07, 2023 at 08:19:02PM +0000, Jiaqi Yan wrote:
> > > Add the functionality, is_raw_hwp_subpage, to tell if a subpage of a
> >
> > This is incorrect naming.  "subpage" was needed before we had the
> > folio concept, but it should not be used any more.  We have folios
> > and pages now.
>

Thanks for your comment, Matthew.

> I think we can address the raw hwpoison page by the offset in folio/hugepage
> to eliminate the concept of "subpage".
>
> >
> > Also, abbreviating "hwpoison" as "hwp" seems like a bad idea to me.
> > hwp is already used as an acronym by acpi, intel_pstate, some clock
> > drivers, an ethernet driver, and a scsi driver.
>
> I originally introduced the abbreviation "hwp" to avoid using a lengthy
> function name such as "folio_test_hugetlb_raw_hwpoison_unreliable()."
> Therefore, I prefer using "rawhwp" instead of a longer form like
> "raw_hwpoison," although I don't expect any confusion between "hwp" and
> "rawhwp."

These are the names I have in mind; what do you think?
* is_rawhwp_page_in_hugepage
* is_raw_hwpoison_page_in_hugepage // I prefer this one
* folio_test_hugetlb_raw_hwpoison_page

> As for "hwp_walk", another user of "hwp" in mm/memory-failure.c,
> we can easily substitute it with "hwpoison_walk."

In this "hwp_walk" case, I also prefer "hwpoison" over "hwp". I can
create a separate renaming patch.

>
> Thanks,
> Naoya Horiguchi



* Re: [PATCH v3 2/4] mm/hwpoison: check if a subpage of a hugetlb folio is raw HWPOISON
  2023-07-08  2:57   ` Miaohe Lin
@ 2023-07-10 15:16     ` Jiaqi Yan
  2023-07-11 17:05       ` Jiaqi Yan
  0 siblings, 1 reply; 17+ messages in thread
From: Jiaqi Yan @ 2023-07-10 15:16 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: akpm, mike.kravetz, naoya.horiguchi, songmuchun, shy828301,
	linux-mm, linux-kernel, duenwen, axelrasmussen, jthoughton

On Fri, Jul 7, 2023 at 7:57 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2023/7/8 4:19, Jiaqi Yan wrote:
> > Add the functionality, is_raw_hwp_subpage, to tell if a subpage of a
> > hugetlb folio is a raw HWPOISON page. This functionality relies on
> > RawHwpUnreliable not being set; otherwise the hugepage's raw HWPOISON
> > list becomes meaningless.
> >
> > is_raw_hwp_subpage needs to hold hugetlb_lock in order to synchronize
> > with __get_huge_page_for_hwpoison, which iterates and inserts entries
> > into raw_hwp_list. llist itself doesn't ensure insertion is synchronized
> > with the iteration done by __is_raw_hwp_subpage. Callers can minimize
> > the overhead of lock cycles by first checking whether the folio / head
> > page's HWPOISON flag is set.
> >
> > Export this functionality to be immediately used in the read operation
> > for hugetlbfs.
> >
> > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> > Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > ---
> >  include/linux/hugetlb.h | 19 +++++++++++++++++++
> >  include/linux/mm.h      |  7 +++++++
> >  mm/hugetlb.c            | 10 ++++++++++
> >  mm/memory-failure.c     | 34 ++++++++++++++++++++++++----------
> >  4 files changed, 60 insertions(+), 10 deletions(-)
> > ...
> > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio)
> > +bool __is_raw_hwp_subpage(struct folio *folio, struct page *subpage)
> >  {
> > -     return (struct llist_head *)&folio->_hugetlb_hwpoison;
> > +     struct llist_head *raw_hwp_head;
> > +     struct raw_hwp_page *p, *tmp;
> > +     bool ret = false;
> > +
> > +     if (!folio_test_hwpoison(folio))
> > +             return false;
> > +
> > +     /*
> > +      * When RawHwpUnreliable is set, kernel lost track of which subpages
> > +      * are HWPOISON. So return as if ALL subpages are HWPOISONed.
> > +      */
> > +     if (folio_test_hugetlb_raw_hwp_unreliable(folio))
> > +             return true;
> > +
> > +     raw_hwp_head = raw_hwp_list_head(folio);
> > +     llist_for_each_entry_safe(p, tmp, raw_hwp_head->first, node) {
>
> Since we don't free the raw_hwp_list here, does llist_for_each_entry work the same as llist_for_each_entry_safe?
>
> > +             if (subpage == p->page) {
> > +                     ret = true;
> > +                     break;
> > +             }
> > +     }
> > +
> > +     return ret;
> >  }
>
> It seems there's a race between __is_raw_hwp_subpage and unpoison_memory:
>   unpoison_memory               __is_raw_hwp_subpage
>                                   if (!folio_test_hwpoison(folio)) -- hwpoison is set
>     folio_free_raw_hwp            llist_for_each_entry_safe raw_hwp_list
>       llist_del_all                 ..
>     folio_test_clear_hwpoison
>

Thanks Miaohe for raising this concern.

> But __is_raw_hwp_subpage is used in hugetlbfs, so unpoison_memory couldn't reach here: there's a
> folio_mapping == NULL check before folio_free_raw_hwp.

I agree. But in the near future I do want to make __is_raw_hwp_subpage
work for shared-mapping hugetlb, so it would be nice for it to work with
unpoison_memory. It doesn't seem to me that holding mf_mutex in
__is_raw_hwp_subpage is nice or even absolutely correct. Let me think
if I can come up with something in v4.

>
> Anyway, this patch looks good to me.
>
> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
> Thanks.
>



* Re: [PATCH v3 2/4] mm/hwpoison: check if a subpage of a hugetlb folio is raw HWPOISON
  2023-07-10 15:11       ` Jiaqi Yan
@ 2023-07-11  8:59         ` Naoya Horiguchi
  0 siblings, 0 replies; 17+ messages in thread
From: Naoya Horiguchi @ 2023-07-11  8:59 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: Matthew Wilcox, akpm, mike.kravetz, naoya.horiguchi, songmuchun,
	shy828301, linmiaohe, linux-mm, linux-kernel, duenwen,
	axelrasmussen, jthoughton

On Mon, Jul 10, 2023 at 08:11:48AM -0700, Jiaqi Yan wrote:
> On Sun, Jul 9, 2023 at 5:21 PM Naoya Horiguchi
> <naoya.horiguchi@linux.dev> wrote:
> >
> > On Fri, Jul 07, 2023 at 09:31:39PM +0100, Matthew Wilcox wrote:
> > > On Fri, Jul 07, 2023 at 08:19:02PM +0000, Jiaqi Yan wrote:
> > > > Add the functionality, is_raw_hwp_subpage, to tell if a subpage of a
> > >
> > > This is incorrect naming.  "subpage" was needed before we had the
> > > folio concept, but it should not be used any more.  We have folios
> > > and pages now.
> >
> 
> Thanks for your comment, Matthew.
> 
> > I think we can address the raw hwpoison page by the offset in folio/hugepage
> > to eliminate the concept of "subpage".
> >
> > >
> > > Also, abbreviating "hwpoison" as "hwp" seems like a bad idea to me.
> > > hwp is already used as an acronym by acpi, intel_pstate, some clock
> > > drivers, an ethernet driver, and a scsi driver.
> >
> > I originally introduced the abbreviation "hwp" to avoid using a lengthy
> > function name such as "folio_test_hugetlb_raw_hwpoison_unreliable()."
> > Therefore, I prefer using "rawhwp" instead of a longer form like
> > "raw_hwpoison," although I don't expect any confusion between "hwp" and
> > "rawhwp."
> 
> These are the names I have in mind; what do you think?
> * is_rawhwp_page_in_hugepage
> * is_raw_hwpoison_page_in_hugepage // I prefer this one

This one is fine to me.

> * folio_test_hugetlb_raw_hwpoison_page
> 
> > As for "hwp_walk", another user of "hwp" in mm/memory-failure.c,
> > we can easily substitute it with "hwpoison_walk."
> 
> In this "hwp_walk" case, I also prefer "hwpoison" over "hwp". I can
> create a separate renaming patch.

Great, thank you.

- Naoya Horiguchi



* Re: [PATCH v3 2/4] mm/hwpoison: check if a subpage of a hugetlb folio is raw HWPOISON
  2023-07-10 15:16     ` Jiaqi Yan
@ 2023-07-11 17:05       ` Jiaqi Yan
  2023-07-11 18:01         ` Mike Kravetz
  0 siblings, 1 reply; 17+ messages in thread
From: Jiaqi Yan @ 2023-07-11 17:05 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: akpm, mike.kravetz, naoya.horiguchi, songmuchun, shy828301,
	linux-mm, linux-kernel, duenwen, axelrasmussen, jthoughton

On Mon, Jul 10, 2023 at 8:16 AM Jiaqi Yan <jiaqiyan@google.com> wrote:
>
> On Fri, Jul 7, 2023 at 7:57 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
> >
> > On 2023/7/8 4:19, Jiaqi Yan wrote:
> > > Add the functionality, is_raw_hwp_subpage, to tell if a subpage of a
> > > hugetlb folio is a raw HWPOISON page. This functionality relies on
> > > RawHwpUnreliable not being set; otherwise the hugepage's raw HWPOISON
> > > list becomes meaningless.
> > >
> > > is_raw_hwp_subpage needs to hold hugetlb_lock in order to synchronize
> > > with __get_huge_page_for_hwpoison, which iterates and inserts entries
> > > into raw_hwp_list. llist itself doesn't ensure insertion is synchronized
> > > with the iteration done by __is_raw_hwp_subpage. Callers can minimize
> > > the overhead of lock cycles by first checking whether the folio / head
> > > page's HWPOISON flag is set.
> > >
> > > Export this functionality to be immediately used in the read operation
> > > for hugetlbfs.
> > >
> > > Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
> > > Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > > Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
> > > ---
> > >  include/linux/hugetlb.h | 19 +++++++++++++++++++
> > >  include/linux/mm.h      |  7 +++++++
> > >  mm/hugetlb.c            | 10 ++++++++++
> > >  mm/memory-failure.c     | 34 ++++++++++++++++++++++++----------
> > >  4 files changed, 60 insertions(+), 10 deletions(-)
> > > ...
> > > -static inline struct llist_head *raw_hwp_list_head(struct folio *folio)
> > > +bool __is_raw_hwp_subpage(struct folio *folio, struct page *subpage)
> > >  {
> > > -     return (struct llist_head *)&folio->_hugetlb_hwpoison;
> > > +     struct llist_head *raw_hwp_head;
> > > +     struct raw_hwp_page *p, *tmp;
> > > +     bool ret = false;
> > > +
> > > +     if (!folio_test_hwpoison(folio))
> > > +             return false;
> > > +
> > > +     /*
> > > +      * When RawHwpUnreliable is set, kernel lost track of which subpages
> > > +      * are HWPOISON. So return as if ALL subpages are HWPOISONed.
> > > +      */
> > > +     if (folio_test_hugetlb_raw_hwp_unreliable(folio))
> > > +             return true;
> > > +
> > > +     raw_hwp_head = raw_hwp_list_head(folio);
> > > +     llist_for_each_entry_safe(p, tmp, raw_hwp_head->first, node) {
> >
> > Since we don't free the raw_hwp_list here, does llist_for_each_entry work the same as llist_for_each_entry_safe?

Sorry, I missed this comment. Yes, they are the same, but
llist_for_each_entry doesn't need "tmp". I will switch to
llist_for_each_entry in v4.

>
> >
> > > +             if (subpage == p->page) {
> > > +                     ret = true;
> > > +                     break;
> > > +             }
> > > +     }
> > > +
> > > +     return ret;
> > >  }
> >
> > It seems there's a race between __is_raw_hwp_subpage and unpoison_memory:
> >   unpoison_memory               __is_raw_hwp_subpage
> >                                   if (!folio_test_hwpoison(folio)) -- hwpoison is set
> >     folio_free_raw_hwp            llist_for_each_entry_safe raw_hwp_list
> >       llist_del_all                 ..
> >     folio_test_clear_hwpoison
> >
>
> Thanks Miaohe for raising this concern.
>
> > But __is_raw_hwp_subpage is used in hugetlbfs, so unpoison_memory couldn't reach here: there's a
> > folio_mapping == NULL check before folio_free_raw_hwp.
>
> I agree. But in near future I do want to make __is_raw_hwp_subpage
> work for shared-mapping hugetlb, so it would be nice to work with
> unpoison_memory. It doesn't seem to me that holding mf_mutex in
> __is_raw_hwp_subpage is nice or even absolutely correct. Let me think
> if I can come up with something in v4.

On second thought, if __is_raw_hwp_subpage simply takes mf_mutex
before llist_for_each_entry, it will introduce a deadlock:

unpoison_memory                       __is_raw_hwp_subpage
  held mf_mutex                         held hugetlb_lock
  get_hwpoison_hugetlb_folio            attempts mf_mutex
    attempts hugetlb lock

Not for this patch series, but for the future: is it a good idea to make
mf_mutex available to hugetlb code? Then enforce the order of locking
to be mf_mutex first, hugetlb_lock second? I believe this is the
current locking pattern / order for try_memory_failure_hugetlb.


>
> >
> > Anyway, this patch looks good to me.
> >
> > Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
> > Thanks.
> >



* Re: [PATCH v3 2/4] mm/hwpoison: check if a subpage of a hugetlb folio is raw HWPOISON
  2023-07-11 17:05       ` Jiaqi Yan
@ 2023-07-11 18:01         ` Mike Kravetz
  2023-07-11 22:22           ` Jiaqi Yan
  2023-07-12  2:25           ` Miaohe Lin
  0 siblings, 2 replies; 17+ messages in thread
From: Mike Kravetz @ 2023-07-11 18:01 UTC (permalink / raw)
  To: Jiaqi Yan
  Cc: Miaohe Lin, akpm, naoya.horiguchi, songmuchun, shy828301,
	linux-mm, linux-kernel, duenwen, axelrasmussen, jthoughton

On 07/11/23 10:05, Jiaqi Yan wrote:
> On Mon, Jul 10, 2023 at 8:16 AM Jiaqi Yan <jiaqiyan@google.com> wrote:
> > On Fri, Jul 7, 2023 at 7:57 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
> > > On 2023/7/8 4:19, Jiaqi Yan wrote:
> > >
> > > > +             if (subpage == p->page) {
> > > > +                     ret = true;
> > > > +                     break;
> > > > +             }
> > > > +     }
> > > > +
> > > > +     return ret;
> > > >  }
> > >
> > > It seems there's a race between __is_raw_hwp_subpage and unpoison_memory:
> > >   unpoison_memory               __is_raw_hwp_subpage
> > >                                   if (!folio_test_hwpoison(folio)) -- hwpoison is set
> > >     folio_free_raw_hwp            llist_for_each_entry_safe raw_hwp_list
> > >       llist_del_all                 ..
> > >     folio_test_clear_hwpoison
> > >
> >
> > Thanks Miaohe for raising this concern.
> >
> > > But __is_raw_hwp_subpage is used in hugetlbfs, unpoison_memory couldn't reach here because there's a
> > > folio_mapping == NULL check before folio_free_raw_hwp.
> >
> > I agree. But in the near future I do want to make __is_raw_hwp_subpage
> > work for shared-mapping hugetlb, so it would be nice for it to work with
> > unpoison_memory. It doesn't seem to me that holding mf_mutex in
> > __is_raw_hwp_subpage is nice or even absolutely correct. Let me think
> > if I can come up with something in v4.
> 
> On second thought, if __is_raw_hwp_subpage simply takes mf_mutex
> before llist_for_each_entry, it will introduce a deadlock:
> 
> unpoison_memory                       __is_raw_hwp_subpage
>   held mf_mutex                         held hugetlb_lock
>   get_hwpoison_hugetlb_folio            attempts mf_mutex
>     attempts hugetlb lock
> 
> Not for this patch series, but for the future: is it a good idea to make
> mf_mutex available to hugetlb code? Then enforce the order of locking
> to be mf_mutex first, hugetlb_lock second? I believe this is the
> current locking pattern / order for try_memory_failure_hugetlb.

I think only holding mf_mutex in __is_raw_hwp_subpage would be sufficient
to prevent races with unpoison_memory.  The memory failure code needs to take
both mf_mutex and hugetlb_lock.  The hugetlb lock is to prevent hugetlb
page state changes.  IIUC, __is_raw_hwp_subpage is only taking hugetlb_lock
to prevent races with memory failure code.

Of course, I could be missing something as there are subtle issues with
locking in the memory failure code.
-- 
Mike Kravetz



* Re: [PATCH v3 2/4] mm/hwpoison: check if a subpage of a hugetlb folio is raw HWPOISON
  2023-07-11 18:01         ` Mike Kravetz
@ 2023-07-11 22:22           ` Jiaqi Yan
  2023-07-12  2:25           ` Miaohe Lin
  1 sibling, 0 replies; 17+ messages in thread
From: Jiaqi Yan @ 2023-07-11 22:22 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Miaohe Lin, akpm, naoya.horiguchi, songmuchun, shy828301,
	linux-mm, linux-kernel, duenwen, axelrasmussen, jthoughton

On Tue, Jul 11, 2023 at 11:02 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 07/11/23 10:05, Jiaqi Yan wrote:
> > On Mon, Jul 10, 2023 at 8:16 AM Jiaqi Yan <jiaqiyan@google.com> wrote:
> > > On Fri, Jul 7, 2023 at 7:57 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
> > > > On 2023/7/8 4:19, Jiaqi Yan wrote:
> > > >
> > > > > +             if (subpage == p->page) {
> > > > > +                     ret = true;
> > > > > +                     break;
> > > > > +             }
> > > > > +     }
> > > > > +
> > > > > +     return ret;
> > > > >  }
> > > >
> > > > It seems there's a race between __is_raw_hwp_subpage and unpoison_memory:
> > > >   unpoison_memory               __is_raw_hwp_subpage
> > > >                                   if (!folio_test_hwpoison(folio)) -- hwpoison is set
> > > >     folio_free_raw_hwp            llist_for_each_entry_safe raw_hwp_list
> > > >       llist_del_all                 ..
> > > >     folio_test_clear_hwpoison
> > > >
> > >
> > > Thanks Miaohe for raising this concern.
> > >
> > > > But __is_raw_hwp_subpage is used in hugetlbfs, so unpoison_memory couldn't reach here: there's a
> > > > folio_mapping == NULL check before folio_free_raw_hwp.
> > >
> > > I agree. But in the near future I do want to make __is_raw_hwp_subpage
> > > work for shared-mapping hugetlb, so it would be nice for it to work with
> > > unpoison_memory. It doesn't seem to me that holding mf_mutex in
> > > __is_raw_hwp_subpage is nice or even absolutely correct. Let me think
> > > if I can come up with something in v4.
> >
> > On second thought, if __is_raw_hwp_subpage simply takes mf_mutex
> > before llist_for_each_entry, it will introduce a deadlock:
> >
> > unpoison_memory                       __is_raw_hwp_subpage
> >   held mf_mutex                         held hugetlb_lock
> >   get_hwpoison_hugetlb_folio            attempts mf_mutex
> >     attempts hugetlb lock
> >
> > Not for this patch series, but for the future: is it a good idea to make
> > mf_mutex available to hugetlb code? Then enforce the order of locking
> > to be mf_mutex first, hugetlb_lock second? I believe this is the
> > current locking pattern / order for try_memory_failure_hugetlb.
>
> I think only holding mf_mutex in __is_raw_hwp_subpage would be sufficient
> to prevent races with unpoison_memory.  The memory failure code needs to take
> both mf_mutex and hugetlb_lock.  The hugetlb lock is to prevent hugetlb
> page state changes.  IIUC, __is_raw_hwp_subpage is only taking hugetlb_lock
> to prevent races with memory failure code.

Thanks, Mike. I think mf_mutex is a simple and correct solution. I will
drop hugetlb_lock from __is_raw_hwp_subpage in v4.

>
> Of course, I could be missing something as there are subtle issues with
> locking in the memory failure code.
> --
> Mike Kravetz



* Re: [PATCH v3 2/4] mm/hwpoison: check if a subpage of a hugetlb folio is raw HWPOISON
  2023-07-11 18:01         ` Mike Kravetz
  2023-07-11 22:22           ` Jiaqi Yan
@ 2023-07-12  2:25           ` Miaohe Lin
  1 sibling, 0 replies; 17+ messages in thread
From: Miaohe Lin @ 2023-07-12  2:25 UTC (permalink / raw)
  To: Mike Kravetz, Jiaqi Yan
  Cc: akpm, naoya.horiguchi, songmuchun, shy828301, linux-mm,
	linux-kernel, duenwen, axelrasmussen, jthoughton

On 2023/7/12 2:01, Mike Kravetz wrote:
> On 07/11/23 10:05, Jiaqi Yan wrote:
>> On Mon, Jul 10, 2023 at 8:16 AM Jiaqi Yan <jiaqiyan@google.com> wrote:
>>> On Fri, Jul 7, 2023 at 7:57 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>>> On 2023/7/8 4:19, Jiaqi Yan wrote:
>>>>
>>>>> +             if (subpage == p->page) {
>>>>> +                     ret = true;
>>>>> +                     break;
>>>>> +             }
>>>>> +     }
>>>>> +
>>>>> +     return ret;
>>>>>  }
>>>>
>>>> It seems there's a race between __is_raw_hwp_subpage and unpoison_memory:
>>>>   unpoison_memory               __is_raw_hwp_subpage
>>>>                                   if (!folio_test_hwpoison(folio)) -- hwpoison is set
>>>>     folio_free_raw_hwp            llist_for_each_entry_safe raw_hwp_list
>>>>       llist_del_all                 ..
>>>>     folio_test_clear_hwpoison
>>>>
>>>
>>> Thanks Miaohe for raising this concern.
>>>
>>>> But __is_raw_hwp_subpage is used in hugetlbfs, so unpoison_memory couldn't reach here: there's a
>>>> folio_mapping == NULL check before folio_free_raw_hwp.
>>>
>>> I agree. But in the near future I do want to make __is_raw_hwp_subpage
>>> work for shared-mapping hugetlb, so it would be nice for it to work with
>>> unpoison_memory. It doesn't seem to me that holding mf_mutex in
>>> __is_raw_hwp_subpage is nice or even absolutely correct. Let me think
>>> if I can come up with something in v4.
>>
>> On second thought, if __is_raw_hwp_subpage simply takes mf_mutex
>> before llist_for_each_entry, it will introduce a deadlock:
>>
>> unpoison_memory                       __is_raw_hwp_subpage
>>   held mf_mutex                         held hugetlb_lock
>>   get_hwpoison_hugetlb_folio            attempts mf_mutex
>>     attempts hugetlb lock
>>
>> Not for this patch series, but for the future: is it a good idea to make
>> mf_mutex available to hugetlb code? Then enforce the order of locking
>> to be mf_mutex first, hugetlb_lock second? I believe this is the
>> current locking pattern / order for try_memory_failure_hugetlb.
> 
> I think only holding mf_mutex in __is_raw_hwp_subpage would be sufficient
> to prevent races with unpoison_memory.  The memory failure code needs to take

Since soft_offline_page, memory_failure, and unpoison_memory all hold mf_mutex,
I think this should be enough to prevent races among them too.

Thanks.




end of thread, other threads:[~2023-07-12  2:25 UTC | newest]

Thread overview: 17+ messages
2023-07-07 20:19 [PATCH v3 0/4] Improve hugetlbfs read on HWPOISON hugepages Jiaqi Yan
2023-07-07 20:19 ` [PATCH v3 1/4] mm/hwpoison: delete all entries before traversal in __folio_free_raw_hwp Jiaqi Yan
2023-07-08  2:40   ` Miaohe Lin
2023-07-07 20:19 ` [PATCH v3 2/4] mm/hwpoison: check if a subpage of a hugetlb folio is raw HWPOISON Jiaqi Yan
2023-07-07 20:31   ` Matthew Wilcox
2023-07-07 21:05     ` Andrew Morton
2023-07-10  0:21     ` Naoya Horiguchi
2023-07-10 15:11       ` Jiaqi Yan
2023-07-11  8:59         ` Naoya Horiguchi
2023-07-08  2:57   ` Miaohe Lin
2023-07-10 15:16     ` Jiaqi Yan
2023-07-11 17:05       ` Jiaqi Yan
2023-07-11 18:01         ` Mike Kravetz
2023-07-11 22:22           ` Jiaqi Yan
2023-07-12  2:25           ` Miaohe Lin
2023-07-07 20:19 ` [PATCH v3 3/4] hugetlbfs: improve read HWPOISON hugepage Jiaqi Yan
2023-07-07 20:19 ` [PATCH v3 4/4] selftests/mm: add tests for HWPOISON hugetlbfs read Jiaqi Yan
