linux-mm.kvack.org archive mirror
From: Davidlohr Bueso <dave@stgolabs.net>
To: akpm@linux-foundation.org
Cc: mhocko@kernel.org, hannes@cmpxchg.org, roman.gushchin@linux.dev,
	shakeel.butt@linux.dev, yosryahmed@google.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	dave@stgolabs.net
Subject: [PATCH 4/4] mm: introduce per-node proactive reclaim interface
Date: Mon, 23 Jun 2025 11:58:51 -0700	[thread overview]
Message-ID: <20250623185851.830632-5-dave@stgolabs.net> (raw)
In-Reply-To: <20250623185851.830632-1-dave@stgolabs.net>

This adds support for proactive reclaim in general on a NUMA
system. A per-node interface extends support beyond the
memcg-specific interface, while respecting the current semantics
of memory.reclaim: honoring LRU aging and not allowing eviction
to be artificially triggered on nodes belonging to non-bottom
tiers.

This patch allows userspace to do:

     echo "512M swappiness=10" > /sys/devices/system/node/nodeX/reclaim
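
A minimal wrapper around that write could look like the sketch below.
This is not part of the patch; the helper name and the DRYRUN knob are
hypothetical, and the swappiness bounds (0-200) follow the range the
memcg memory.reclaim interface accepts:

```shell
#!/bin/sh
# Hypothetical helper (not part of this patch): build and issue the
# string written to /sys/devices/system/node/nodeX/reclaim, checking
# the swappiness range first. DRYRUN=1 prints the command instead of
# performing the (root-only) sysfs write.
node_reclaim() {
	nid="$1"; size="$2"; swappiness="$3"
	if [ "$swappiness" -lt 0 ] || [ "$swappiness" -gt 200 ]; then
		echo "invalid swappiness: $swappiness" >&2
		return 1
	fi
	req="$size swappiness=$swappiness"
	if [ "${DRYRUN:-0}" = 1 ]; then
		echo "echo \"$req\" > /sys/devices/system/node/node$nid/reclaim"
	else
		echo "$req" > "/sys/devices/system/node/node$nid/reclaim"
	fi
}

DRYRUN=1 node_reclaim 0 512M 10
```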

One of the premises for this is to align semantically as closely
as possible with memory.reclaim. For a brief time memcg did
support a nodemask, until 55ab834a86a9 (Revert "mm: add nodes=
arg to memory.reclaim"), because the semantics around reclaim
(eviction) vs demotion were not clear, breaking charging
expectations.

With this approach:

1. Users who do not use memcg can benefit from proactive reclaim.
The memcg interface is not NUMA aware, and there are use cases that
focus on NUMA balancing rather than on workload memory footprint.

2. Proactive reclaim on top tiers will trigger demotion, after which
the memory is still byte-addressable. Reclaiming on bottom-tier nodes
will trigger eviction to swap (the traditional sense of reclaim).
This follows the semantics of what is today part of the aging process
on tiered memory, mirroring what every other form of reclaim does
(reactive and memcg proactive reclaim). Furthermore, per-node proactive
reclaim is not as susceptible to the memcg charging problem mentioned
above.
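
To see which nodes demote and which evict, one could inspect the
memory-tiering sysfs layout. The sketch below assumes the
/sys/devices/virtual/memory_tiering/memory_tierN/nodelist files exist
(tiers with larger abstract distance, typically higher numbers, are
the slower, "bottom" ones); the expand_nodelist helper is illustrative:

```shell
#!/bin/sh
# Sketch, assuming the memory_tiering sysfs layout is present.
# Nodes in the slowest (bottom) tier are where proactive reclaim
# means eviction to swap; nodes in faster tiers demote instead.
expand_nodelist() {
	# Expand a kernel nodelist string such as "0,2-4" into "0 2 3 4".
	echo "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
		seq "$lo" "${hi:-$lo}"
	done | xargs
}

for tier in /sys/devices/virtual/memory_tiering/memory_tier*; do
	[ -d "$tier" ] || continue
	echo "$(basename "$tier"): nodes $(expand_nodelist "$(cat "$tier/nodelist")")"
done
```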

3. Unlike the nodes= arg, this interface avoids confusing semantics,
such as what exactly the user wants when mixing top-tier and low-tier
nodes in the nodemask. Furthermore, a per-node interface is less exposed
to "free up memory in my container" use cases, where eviction is intended.

4. Users that *really* want to free up memory can use proactive reclaim
on nodes known to be in the bottom tiers to force eviction in a
natural way - higher access latencies are still better than swap.
If compelled, while with no guarantees and perhaps not worth the effort,
users could also potentially follow a ladder-like approach to
eventually free up the memory. Alternatively, an 'evict' option
could be added to the parameters of both the memory.reclaim and per-node
interfaces to force this action unconditionally.
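
The ladder-like approach above could be sketched as follows. This is a
hypothetical illustration only (the helper name and DRYRUN knob are made
up, and no guarantees apply, as noted): walk nodes from top tier to
bottom, reclaiming on each in turn so pages demote tier by tier and are
finally evicted from the bottom-tier node. The node order must be
supplied by the admin from their platform's tiering layout:

```shell
#!/bin/sh
# Hypothetical "ladder" sketch: reclaim the same amount on each node
# in the given order (fastest tier first, bottom tier last), so that
# demoted pages eventually reach a node where reclaim evicts to swap.
# DRYRUN=1 prints the planned writes instead of performing them.
ladder_reclaim() {
	size="$1"; shift
	for nid in "$@"; do
		if [ "${DRYRUN:-0}" = 1 ]; then
			echo "node$nid: reclaim $size"
		else
			echo "$size" > "/sys/devices/system/node/node$nid/reclaim" || return 1
		fi
	done
}

DRYRUN=1 ladder_reclaim 256M 0 2 3
```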

Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
---
 Documentation/ABI/stable/sysfs-devices-node |  9 ++++
 drivers/base/node.c                         |  2 +
 include/linux/swap.h                        | 16 +++++++
 mm/vmscan.c                                 | 53 ++++++++++++++++++---
 4 files changed, 74 insertions(+), 6 deletions(-)

diff --git a/Documentation/ABI/stable/sysfs-devices-node b/Documentation/ABI/stable/sysfs-devices-node
index a02707cb7cbc..2d0e023f22a7 100644
--- a/Documentation/ABI/stable/sysfs-devices-node
+++ b/Documentation/ABI/stable/sysfs-devices-node
@@ -227,3 +227,12 @@ Contact:	Jiaqi Yan <jiaqiyan@google.com>
 Description:
 		Of the raw poisoned pages on a NUMA node, how many pages are
 		recovered by memory error recovery attempt.
+
+What:		/sys/devices/system/node/nodeX/reclaim
+Date:		June 2025
+Contact:	Linux Memory Management list <linux-mm@kvack.org>
+Description:
+		Perform user-triggered proactive reclaim on a NUMA node.
+		This interface is equivalent to the memcg variant.
+
+		See Documentation/admin-guide/cgroup-v2.rst
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 6d66382dae65..548b532a2129 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -659,6 +659,7 @@ static int register_node(struct node *node, int num)
 	} else {
 		hugetlb_register_node(node);
 		compaction_register_node(node);
+		reclaim_register_node(node);
 	}
 
 	return error;
@@ -675,6 +676,7 @@ void unregister_node(struct node *node)
 {
 	hugetlb_unregister_node(node);
 	compaction_unregister_node(node);
+	reclaim_unregister_node(node);
 	node_remove_accesses(node);
 	node_remove_caches(node);
 	device_unregister(&node->dev);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index bc0e1c275fc0..dac7ba98783d 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -431,6 +431,22 @@ extern unsigned long shrink_all_memory(unsigned long nr_pages);
 extern int vm_swappiness;
 long remove_mapping(struct address_space *mapping, struct folio *folio);
 
+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+extern int reclaim_register_node(struct node *node);
+extern void reclaim_unregister_node(struct node *node);
+
+#else
+
+static inline int reclaim_register_node(struct node *node)
+{
+	return 0;
+}
+
+static inline void reclaim_unregister_node(struct node *node)
+{
+}
+#endif /* CONFIG_SYSFS && CONFIG_NUMA */
+
 #ifdef CONFIG_NUMA
 extern int sysctl_min_unmapped_ratio;
 extern int sysctl_min_slab_ratio;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cdd9cb97fb79..f77feb75c678 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -94,10 +94,8 @@ struct scan_control {
 	unsigned long	anon_cost;
 	unsigned long	file_cost;
 
-#ifdef CONFIG_MEMCG
 	/* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
 	int *proactive_swappiness;
-#endif
 
 	/* Can active folios be deactivated as part of reclaim? */
 #define DEACTIVATE_ANON 1
@@ -121,7 +119,7 @@ struct scan_control {
 	/* Has cache_trim_mode failed at least once? */
 	unsigned int cache_trim_mode_failed:1;
 
-	/* Proactive reclaim invoked by userspace through memory.reclaim */
+	/* Proactive reclaim invoked by userspace */
 	unsigned int proactive:1;
 
 	/*
@@ -7732,13 +7730,15 @@ static const match_table_t tokens = {
 	{ MEMORY_RECLAIM_NULL, NULL },
 };
 
-int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat)
+int user_proactive_reclaim(char *buf,
+			   struct mem_cgroup *memcg, pg_data_t *pgdat)
 {
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	unsigned long nr_to_reclaim, nr_reclaimed = 0;
 	int swappiness = -1;
 	char *old_buf, *start;
 	substring_t args[MAX_OPT_ARGS];
+	gfp_t gfp_mask = GFP_KERNEL;
 
 	if (!buf || (!memcg && !pgdat))
 		return -EINVAL;
@@ -7792,11 +7792,29 @@ int user_proactive_reclaim(char *buf, struct mem_cgroup *memcg, pg_data_t *pgdat
 			reclaim_options = MEMCG_RECLAIM_MAY_SWAP |
 					  MEMCG_RECLAIM_PROACTIVE;
 			reclaimed = try_to_free_mem_cgroup_pages(memcg,
-						 batch_size, GFP_KERNEL,
+						 batch_size, gfp_mask,
 						 reclaim_options,
 						 swappiness == -1 ? NULL : &swappiness);
 		} else {
-			return -EINVAL;
+			struct scan_control sc = {
+				.gfp_mask = current_gfp_context(gfp_mask),
+				.reclaim_idx = gfp_zone(gfp_mask),
+				.proactive_swappiness = swappiness == -1 ? NULL : &swappiness,
+				.priority = DEF_PRIORITY,
+				.may_writepage = !laptop_mode,
+				.nr_to_reclaim = max(batch_size, SWAP_CLUSTER_MAX),
+				.may_unmap = 1,
+				.may_swap = 1,
+				.proactive = 1,
+			};
+
+			if (test_and_set_bit_lock(PGDAT_RECLAIM_LOCKED,
+						  &pgdat->flags))
+				return -EAGAIN;
+
+			reclaimed = __node_reclaim(pgdat, gfp_mask,
+						   batch_size, &sc);
+			clear_bit_unlock(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
 		}
 
 		if (!reclaimed && !nr_retries--)
@@ -7855,3 +7873,26 @@ void check_move_unevictable_folios(struct folio_batch *fbatch)
 	}
 }
 EXPORT_SYMBOL_GPL(check_move_unevictable_folios);
+
+#if defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+static ssize_t reclaim_store(struct device *dev,
+			     struct device_attribute *attr,
+			     const char *buf, size_t count)
+{
+	int ret, nid = dev->id;
+
+	ret = user_proactive_reclaim((char *)buf, NULL, NODE_DATA(nid));
+	return ret ? -EAGAIN : count;
+}
+
+static DEVICE_ATTR_WO(reclaim);
+int reclaim_register_node(struct node *node)
+{
+	return device_create_file(&node->dev, &dev_attr_reclaim);
+}
+
+void reclaim_unregister_node(struct node *node)
+{
+	return device_remove_file(&node->dev, &dev_attr_reclaim);
+}
+#endif
-- 
2.39.5



Thread overview: 28+ messages
2025-06-23 18:58 [PATCH -next v2 0/4] mm: per-node proactive reclaim Davidlohr Bueso
2025-06-23 18:58 ` [PATCH 1/4] mm/vmscan: respect psi_memstall region in node reclaim Davidlohr Bueso
2025-06-25 17:08   ` Shakeel Butt
2025-07-17  1:44   ` Roman Gushchin
2025-06-23 18:58 ` [PATCH 2/4] mm/memcg: make memory.reclaim interface generic Davidlohr Bueso
2025-06-23 21:45   ` Andrew Morton
2025-06-23 23:36     ` Davidlohr Bueso
2025-06-24 18:26   ` Klara Modin
2025-07-17  1:58   ` Roman Gushchin
2025-07-17 16:35     ` Davidlohr Bueso
2025-07-17 22:17   ` Shakeel Butt
2025-07-17 22:52     ` Andrew Morton
2025-07-17 23:56       ` Davidlohr Bueso
2025-07-18  0:17         ` Shakeel Butt
2025-06-23 18:58 ` [PATCH 3/4] mm/vmscan: make __node_reclaim() more generic Davidlohr Bueso
2025-07-17  2:03   ` Roman Gushchin
2025-07-17 22:25   ` Shakeel Butt
2025-06-23 18:58 ` Davidlohr Bueso [this message]
2025-06-25 23:10   ` [PATCH 4/4] mm: introduce per-node proactive reclaim interface Shakeel Butt
2025-06-27 19:07     ` SeongJae Park
2025-07-17  2:46   ` Roman Gushchin
2025-07-17 16:26     ` Davidlohr Bueso
2025-07-17 22:46       ` Andrew Morton
     [not found]   ` <20250717064925.2304-1-hdanton@sina.com>
2025-07-17  7:39     ` Michal Hocko
2025-07-17 22:28   ` Shakeel Butt
2025-06-23 21:50 ` [PATCH -next v2 0/4] mm: per-node proactive reclaim Andrew Morton
2025-07-16  0:24 ` Andrew Morton
2025-07-16 15:15   ` Shakeel Butt
