[RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy

From: Jiaqi Yan <jiaqiyan@google.com>
To: nao.horiguchi@gmail.com, linmiaohe@huawei.com
Cc: tony.luck@intel.com, wangkefeng.wang@huawei.com,
	jane.chu@oracle.com, akpm@linux-foundation.org,
	osalvador@suse.de, rientjes@google.com, duenwen@google.com,
	jthoughton@google.com, jgg@nvidia.com, ankita@nvidia.com,
	peterx@redhat.com, linux-mm@kvack.org,
	Jiaqi Yan <jiaqiyan@google.com>
Subject: [RFC PATCH v1 1/2] mm/memory-failure: introduce global MFR policy
Date: Tue, 24 Sep 2024 04:39:19 +0000
Message-ID: <20240924043924.3562257-2-jiaqiyan@google.com>
In-Reply-To: <20240924043924.3562257-1-jiaqiyan@google.com>

Give userspace control over whether to HARD_OFFLINE an error folio
(either a raw page or a hugepage). By default, HARD_OFFLINE is enabled,
consistent with existing memory_failure behavior.
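
As a rough illustration of how userspace might flip this control, here is a
minimal sketch in C. The sysctl path is the one printed by the patch's
pr_info_once() message; the helper name set_hard_offline() is hypothetical,
and the path is passed as a parameter so the helper can also be pointed at
an ordinary file.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Write "1" (any nonzero enable) or "0" to a sysctl file.
 * Returns 0 on success or a negative errno-style value on failure.
 * Hypothetical helper, not part of this patch.
 */
int set_hard_offline(const char *path, int enable)
{
	FILE *f = fopen(path, "w");
	int err = 0;

	if (!f)
		return -errno;
	if (fprintf(f, "%d\n", enable ? 1 : 0) < 0)
		err = -EIO;
	/* fclose() flushes the buffer, so a failed write can surface here. */
	if (fclose(f) != 0 && !err)
		err = -errno;
	return err;
}
```

On a kernel carrying this patch, a privileged caller would presumably use
set_hard_offline("/proc/sys/vm/enable_hard_offline", 0) to disable
HARD_OFFLINE and pass 1 to restore the default.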

Userspace should be able to control whether to keep or discard a large chunk
of memory in the event of uncorrectable memory errors. There are two major
use cases in cloud environments.

The 1st case is a 1G HugeTLB-backed database workload. Compared to discarding
the hugepage when only a single PFN is impacted by an uncorrectable memory
error, if the kernel simply leaves the 1G hugepage mapped, access to the
majority of clean PFNs within the poisoned 1G region still works well for the
VM and workload.
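
To put a number on that trade-off, the arithmetic can be sketched as below.
The 4 KiB base page size is an assumption (it is the common x86-64 default,
but other configurations differ), and the function names are hypothetical.

```c
#include <stdint.h>

#define BASE_PAGE_SIZE	(4ULL << 10)	/* assumption: 4 KiB base pages */
#define HUGEPAGE_1G	(1ULL << 30)

/* Number of base pages (PFNs) covered by one hugepage. */
uint64_t pfns_per_hugepage(uint64_t hugepage_size, uint64_t base_page_size)
{
	return hugepage_size / base_page_size;
}

/* PFNs that stay usable if the hugepage is left mapped. */
uint64_t clean_pfns(uint64_t hugepage_size, uint64_t base_page_size,
		    uint64_t poisoned_pfns)
{
	return pfns_per_hugepage(hugepage_size, base_page_size) - poisoned_pfns;
}
```

Under that assumption a 1G hugepage spans 262144 PFNs, so a single poisoned
PFN leaves 262143 clean ones that would otherwise be discarded with it.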

The 2nd case is MMIO device memory or EGM [1] mapped to userspace via huge
VM_PFNMAP [2]. If the kernel does not zap the PUD or PMD, there is no need for
the VFIO drivers that manage the memory to intercept page faults for clean
PFNs and to reinstall PTEs.

In addition, in both cases there is no EPT or stage-2 (S2) violation, so there
is no performance cost for accessing clean guest pages already mapped in EPT
or S2.

See the cover letter for more details on why userspace needs such control, and
the implications when userspace chooses to disable HARD_OFFLINE.

If this RFC receives generally positive feedback, I will add a selftest in v2.

[1] https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/#extended_gpu_memory
[2] https://lore.kernel.org/linux-mm/20240828234958.GE3773488@nvidia.com/T/#m413a61acaf1fc60e65ee7968ab0ae3093f7b1ea3

Signed-off-by: Jiaqi Yan <jiaqiyan@google.com>
---
 mm/memory-failure.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)
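
For readers skimming the hunks below, the ordering of the two new checks in
memory_failure() can be modelled in plain userspace C roughly as follows.
This is only a sketch: struct mfr_model, enum mfr_outcome and mfr_decide()
are hypothetical names, not kernel API, and the model leaves out every other
check that memory_failure() performs.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel state the new checks look at. */
struct mfr_model {
	bool pfn_is_online_page;	/* pfn_to_online_page(pfn) != NULL */
	bool enable_hard_offline;	/* sysctl_enable_hard_offline */
};

enum mfr_outcome {
	MFR_POLICY_NOT_APPLIED,	/* not an online page: policy is only logged */
	MFR_SIGNAL_ONLY,	/* kill_procs_now(), res = -EOPNOTSUPP */
	MFR_HARD_OFFLINE,	/* existing recovery path continues */
};

/* Mirror the order in which the patch evaluates its two conditions. */
enum mfr_outcome mfr_decide(const struct mfr_model *m)
{
	/* ZONE_DEVICE and similar memory: the global policy does not apply. */
	if (!m->pfn_is_online_page)
		return MFR_POLICY_NOT_APPLIED;

	/* Policy disabled: notify processes but leave the folio in place. */
	if (!m->enable_hard_offline)
		return MFR_SIGNAL_ONLY;

	return MFR_HARD_OFFLINE;
}
```

Note that the sysctl is consulted only after the online-page check, so in
this model disabling HARD_OFFLINE has no effect on memory that is not an
online page.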

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7066fc84f351..a7b85b98d61e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -70,6 +70,8 @@ static int sysctl_memory_failure_recovery __read_mostly = 1;
 
 static int sysctl_enable_soft_offline __read_mostly = 1;
 
+static int sysctl_enable_hard_offline __read_mostly = 1;
+
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
 static bool hw_memory_failure __read_mostly = false;
@@ -151,6 +153,15 @@ static struct ctl_table memory_failure_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 		.extra2		= SYSCTL_ONE,
+	},
+	{
+		.procname	= "enable_hard_offline",
+		.data		= &sysctl_enable_hard_offline,
+		.maxlen		= sizeof(sysctl_enable_hard_offline),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
 	}
 };
 
@@ -2223,6 +2234,14 @@ int memory_failure(unsigned long pfn, int flags)
 
 	p = pfn_to_online_page(pfn);
 	if (!p) {
+		/*
+		 * For ZONE_DEVICE memory and memory on special architectures,
+		 * assume it has opted out of the core kernel's MFR. Since this
+		 * memory can still be mapped to userspace, let userspace
+		 * know MFR doesn't apply.
+		 */
+		pr_info_once("%#lx: can't apply global MFR policy\n", pfn);
+
 		res = arch_memory_failure(pfn, flags);
 		if (res == 0)
 			goto unlock_mutex;
@@ -2241,6 +2260,20 @@ int memory_failure(unsigned long pfn, int flags)
 		goto unlock_mutex;
 	}
 
+	/*
+	 * On ARM64, if APEI fails to claim SEA (e.g. GHES driver doesn't
+	 * register for SEA notifications from firmware), memory_failure will
+	 * never be synchronous to the error consumption thread. Notifying
+	 * it via SIGBUS synchronously has to be done by either core kernel in
+	 * do_mem_abort, or KVM in kvm_handle_guest_abort.
+	 */
+	if (!sysctl_enable_hard_offline) {
+		pr_info_once("%#lx: disabled by /proc/sys/vm/enable_hard_offline\n", pfn);
+		kill_procs_now(p, pfn, flags, page_folio(p));
+		res = -EOPNOTSUPP;
+		goto unlock_mutex;
+	}
+
 try_again:
 	res = try_memory_failure_hugetlb(pfn, flags, &hugetlb);
 	if (hugetlb)
-- 
2.46.0.792.g87dc391469-goog
