From: Jiaqi Yan <jiaqiyan@google.com>
To: naoya.horiguchi@nec.com, muchun.song@linux.dev, linmiaohe@huawei.com
Cc: akpm@linux-foundation.org, mike.kravetz@oracle.com,
	shuah@kernel.org,  corbet@lwn.net, osalvador@suse.de,
	rientjes@google.com, duenwen@google.com,  fvdl@google.com,
	linux-mm@kvack.org, linux-kselftest@vger.kernel.org,
	 linux-doc@vger.kernel.org, Jiaqi Yan <jiaqiyan@google.com>
Subject: [PATCH v1 1/3] mm/memory-failure: userspace controls soft-offlining hugetlb pages
Date: Fri, 31 May 2024 21:34:37 +0000
Message-ID: <20240531213439.2958891-2-jiaqiyan@google.com>
In-Reply-To: <20240531213439.2958891-1-jiaqiyan@google.com>

Correctable memory errors are very common on servers with large
amounts of memory, and are corrected by ECC. Soft offline is the
kernel's additional recovery handling for memory pages having
(excessive) corrected memory errors. The impacted page is migrated to
a healthy page if it is mapped or in use; the original page is then
discarded from any future use.

The actual policy on whether (and when) to soft offline should be
maintained by userspace, especially in the case of HugeTLB hugepages.
Soft offline dissolves a hugepage, either in-use or free, into
chunks of 4K pages, reducing HugeTLB pool capacity by 1 hugepage.
If userspace has not acknowledged such behavior, it may be surprised
when a later mmap of hugepages returns MAP_FAILED due to a lack of
hugepages. In addition, discarding an entire 1G memory page only
because of corrected memory errors is very costly, and the kernel
had better not do it under the hood. But today there are at least 2
such cases:
1. GHES driver sees both GHES_SEV_CORRECTED and
   CPER_SEC_ERROR_THRESHOLD_EXCEEDED after parsing CPER.
2. RAS Correctable Errors Collector counts correctable errors per
   PFN and soft offlines the page when the counter for a PFN reaches
   the threshold.
In both cases, userspace has no control over the soft offline
performed by the kernel's memory failure recovery.

This commit gives userspace control over soft-offlining HugeTLB
pages: the kernel only soft offlines a hugepage if userspace has
opted in for that specific hugepage size. The interface to userspace
is a new sysfs entry called softoffline_corrected_errors under the
/sys/kernel/mm/hugepages/hugepages-${size}kB directory:
* When softoffline_corrected_errors=0, skip soft offlining for all
  hugepages of size ${size}kB.
* When softoffline_corrected_errors=1, soft offline as before this
  patch series.

So the granularity of the control is per hugepage size, and the
setting is kept in the corresponding hstate. By default
softoffline_corrected_errors is 1 to preserve the existing behavior
of the kernel.
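
For illustration only, a userspace agent could opt a hugepage size
out of soft offline along the lines of the sketch below. The helper
names softoffline_path() and set_softoffline() are hypothetical and
are not part of this patch; the sketch assumes the sysfs layout
described above and that the caller is allowed to write the file.

```c
#include <stdio.h>

/*
 * Hypothetical helper: build the sysfs path of the
 * softoffline_corrected_errors entry for a hugepage size in kB.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int softoffline_path(char *buf, size_t len, unsigned long size_kb)
{
	int n = snprintf(buf, len,
			 "/sys/kernel/mm/hugepages/hugepages-%lukB/"
			 "softoffline_corrected_errors", size_kb);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: write 0 (skip soft offline) or 1 (soft offline
 * as before) to the given sysfs entry. Returns 0 on success, -1 on
 * any failure, including a value other than 0 or 1.
 */
static int set_softoffline(const char *path, int enable)
{
	FILE *f;
	int ret = 0;

	/* Reject locally what the store handler would reject. */
	if (enable != 0 && enable != 1)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	if (fprintf(f, "%d\n", enable) < 0)
		ret = -1;
	/* The write may only be flushed (and fail) at close time. */
	if (fclose(f) != 0)
		ret = -1;

	return ret;
}
```

To disable soft offline for 1G hugepages, a caller would build the
path with softoffline_path(buf, sizeof(buf), 1048576) and then call
set_softoffline(buf, 0). Checking the fclose() result matters here
because stdio buffering can defer the actual write, and thus any
error from the store handler, until the stream is closed.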

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 include/linux/hugetlb.h | 17 +++++++++++++++++
 mm/hugetlb.c            | 34 ++++++++++++++++++++++++++++++++++
 mm/memory-failure.c     |  7 +++++++
 3 files changed, 58 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 2b3c3a404769..55f9e9593cce 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -685,6 +685,7 @@ struct hstate {
 	int next_nid_to_free;
 	unsigned int order;
 	unsigned int demote_order;
+	unsigned int softoffline_corrected_errors;
 	unsigned long mask;
 	unsigned long max_huge_pages;
 	unsigned long nr_huge_pages;
@@ -1029,6 +1030,16 @@ void hugetlb_unregister_node(struct node *node);
  */
 bool is_raw_hwpoison_page_in_hugepage(struct page *page);
 
+/*
+ * For certain hugepage size, when a hugepage has corrected memory error(s):
+ * - Return 0 if userspace wants to disable soft offlining the hugepage.
+ * - Return > 0 if userspace allows soft offlining the hugepage.
+ */
+static inline int hugetlb_softoffline_corrected_errors(struct folio *folio)
+{
+	return folio_hstate(folio)->softoffline_corrected_errors;
+}
+
 #else	/* CONFIG_HUGETLB_PAGE */
 struct hstate {};
 
@@ -1226,6 +1237,12 @@ static inline bool hugetlbfs_pagecache_present(
 {
 	return false;
 }
+
+static inline int hugetlb_softoffline_corrected_errors(struct folio *folio)
+{
+	return 1;
+}
+
 #endif	/* CONFIG_HUGETLB_PAGE */
 
 static inline spinlock_t *huge_pte_lock(struct hstate *h,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6be78e7d4f6e..a184e28ce592 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4325,6 +4325,38 @@ static ssize_t demote_size_store(struct kobject *kobj,
 }
 HSTATE_ATTR(demote_size);
 
+static ssize_t softoffline_corrected_errors_show(struct kobject *kobj,
+						 struct kobj_attribute *attr,
+						 char *buf)
+{
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
+
+	return sysfs_emit(buf, "%d\n", h->softoffline_corrected_errors);
+}
+
+static ssize_t softoffline_corrected_errors_store(struct kobject *kobj,
+						  struct kobj_attribute *attr,
+						  const char *buf,
+						  size_t count)
+{
+	int err;
+	unsigned long input;
+	struct hstate *h = kobj_to_hstate(kobj, NULL);
+
+	err = kstrtoul(buf, 10, &input);
+	if (err)
+		return err;
+
+	/* softoffline_corrected_errors is either 0 or 1. */
+	if (input > 1)
+		return -EINVAL;
+
+	h->softoffline_corrected_errors = input;
+
+	return count;
+}
+HSTATE_ATTR(softoffline_corrected_errors);
+
 static struct attribute *hstate_attrs[] = {
 	&nr_hugepages_attr.attr,
 	&nr_overcommit_hugepages_attr.attr,
@@ -4334,6 +4366,7 @@ static struct attribute *hstate_attrs[] = {
 #ifdef CONFIG_NUMA
 	&nr_hugepages_mempolicy_attr.attr,
 #endif
+	&softoffline_corrected_errors_attr.attr,
 	NULL,
 };
 
@@ -4655,6 +4688,7 @@ void __init hugetlb_add_hstate(unsigned int order)
 	h = &hstates[hugetlb_max_hstate++];
 	mutex_init(&h->resize_lock);
 	h->order = order;
+	h->softoffline_corrected_errors = 1;
 	h->mask = ~(huge_page_size(h) - 1);
 	for (i = 0; i < MAX_NUMNODES; ++i)
 		INIT_LIST_HEAD(&h->hugepage_freelists[i]);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 16ada4fb02b7..7094fc4c62e2 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2776,6 +2776,13 @@ int soft_offline_page(unsigned long pfn, int flags)
 		return -EIO;
 	}
 
+	if (PageHuge(page) &&
+	    !hugetlb_softoffline_corrected_errors(page_folio(page))) {
+		pr_info("soft offline: %#lx: hugetlb page is ignored\n", pfn);
+		put_ref_page(pfn, flags);
+		return -EINVAL;
+	}
+
 	mutex_lock(&mf_mutex);
 
 	if (PageHWPoison(page)) {
-- 
2.45.1.288.g0e0cd299f1-goog



