* [RFC LPC2025 PATCH 0/4] Deprecate zone_reclaim_mode
From: Joshua Hahn @ 2025-12-05 23:32 UTC (permalink / raw)
To: willy, david
Cc: linux-doc, linux-kernel, linux-mm, linux-trace-kernel,
linuxppc-dev, kernel-team
Hello folks,
This is a code RFC for my upcoming discussion at LPC 2025 in Tokyo [1].
<preface>
You might notice that the RFC that I'm sending out is different from the
proposed abstract. Initially when I submitted my proposal, I was interested
in addressing how fallback allocations work under pressure for
NUMA-restricted allocations. Soon after, Johannes proposed a patch [2] which
addressed the problem I was investigating, so I wanted to explore a different
direction in the same area of fallback allocations.
At the same time, I was also thinking about zone_reclaim_mode [3]. I thought
that LPC would be a good opportunity to discuss deprecating zone_reclaim_mode,
so I hope to discuss this topic at LPC during my presentation slot.
Sorry for the patch submission so close to the conference as well. I thought
it would still be better to send this RFC out late, instead of just presenting
the topic at the conference without giving folks some time to think about it.
</preface>
zone_reclaim_mode was introduced in 2005 to prevent the kernel from facing
the high remote access latency associated with NUMA systems. With it enabled,
when the kernel sees that the local node is full, it will stall allocations and
trigger direct reclaim locally, instead of making a remote allocation, even
when there may still be free memory. This is the preferred way to consume memory
if remote memory access is more expensive than performing direct reclaim.
The choice is made on a system-wide basis, but can be toggled at runtime.
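For reference, toggling the knob at runtime is just a sysctl write. Below is a
minimal sketch; the bit values 1 / 2 / 4 are the ones documented in
Documentation/admin-guide/sysctl/vm.rst, root privileges are assumed, and the
error handling is kept minimal:

#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "w");

        if (!f) {
                perror("zone_reclaim_mode");
                return 1;
        }
        /* 1 = reclaim locally; OR in 2 (write dirty pages) or 4 (unmap/swap) */
        fprintf(f, "1\n");
        return fclose(f) ? 1 : 0;
}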
This series deprecates the zone_reclaim_mode sysctl in favor of other
NUMA-aware mechanisms, such as NUMA balancing, memory.reclaim, membind, and
tiering / promotion / demotion. Let's break down how these mechanisms differ,
based on workload characteristics.
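Before going through the scenarios, here is a minimal sketch of what a
proactive reclaim request via memory.reclaim could look like on cgroup v2.
The cgroup path "/sys/fs/cgroup/workload" is just a placeholder; the file
takes a byte count (optionally followed by "swappiness=<n>") and reclaims
from that group only, instead of changing allocator behavior system-wide:

#include <stdio.h>

int main(void)
{
        /* Placeholder cgroup; adjust to the actual cgroup v2 path. */
        FILE *f = fopen("/sys/fs/cgroup/workload/memory.reclaim", "w");

        if (!f) {
                perror("memory.reclaim");
                return 1;
        }
        /* Ask the kernel to try reclaiming ~512M from this group. */
        fprintf(f, "%lu", 512UL << 20);
        return fclose(f) ? 1 : 0;
}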
Scenario 1) Workload fits in a single NUMA node
In this case, if the rest of the NUMA node is unused, zone_reclaim_mode
does nothing. On the other hand, if there are several workloads competing
for memory in the same NUMA node, with sum(workload_mem) > mem_capacity(node),
then zone_reclaim_mode is actively harmful. Direct reclaim is aggressively
triggered whenever one workload makes an allocation that goes over the limit,
and there is no fairness mechanism to prevent one workload from completely
blocking the other workload from making progress.
Scenario 2) Workload does not fit in a single NUMA node
Again, in this case, zone_reclaim_mode is actively harmful. Direct reclaim
will constantly be triggered whenever memory goes above the limit, leading
to memory thrashing. Moreover, even if the user really wants to avoid remote
allocations, membind is a better alternative here; zone_reclaim_mode
forces the user to make the decision for all workloads on the system, whereas
membind gives per-process granularity.
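As a rough illustration of the per-process granularity point, binding a single
workload to one node could look like the sketch below, using the
set_mempolicy() wrapper from libnuma's <numaif.h> (link with -lnuma). Node 0
is an arbitrary choice and error handling is kept minimal:

#include <numaif.h>
#include <stdio.h>

int main(void)
{
        /* Restrict all future allocations of this process to node 0. */
        unsigned long nodemask = 1UL << 0;

        if (set_mempolicy(MPOL_BIND, &nodemask, 8 * sizeof(nodemask))) {
                perror("set_mempolicy");
                return 1;
        }
        /* ... run the workload; its allocations are now confined to node 0 ... */
        return 0;
}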
Scenario 3) Workload size is approximately the same as the NUMA capacity
This is probably the case for most workloads. When it is uncertain whether
memory consumption will exceed the node's capacity, it does not make much
sense to make a system-wide bet on whether direct reclaim is better or
worse than remote allocations. In other words, it might make more sense to
allow memory to spill over to remote nodes, and let the kernel handle the
NUMA balancing depending on how cold or hot the newly allocated memory is.
These examples might make it seem like zone_reclaim_mode is harmful for
all scenarios. But that is not the case:
Scenario 4) Newly allocated memory is going to be hot
This is probably the scenario that makes zone_reclaim_mode shine the most.
If the newly allocated memory is going to be hot, then it makes much more
sense to try and reclaim locally, which would evict cold(er) memory and
avoid paying remote memory access latency on frequent accesses.
Scenario 5) Tiered NUMA system makes remote access latency higher
In some tiered memory scenarios, remote access latency can be higher for
lower memory tiers. In these scenarios, direct reclaim may be cheaper
relative to placing hot memory on a remote node with high access
latency.
Now, let me try and present a case for deprecating zone_reclaim_mode, despite
these two scenarios where it performs as intended.
In scenario 4, the catch is that the system is not an oracle that can predict
that newly allocated memory is going to be hot. In fact, a lot of the kernel
assumes that newly allocated memory is cold, and it has to "prove" that it
is hot through accesses. In a perfect world, the kernel would be able to
selectively trigger direct reclaim or allocate remotely, based on whether the
current allocation will be cold or hot in the future.
But without these insights, it is difficult to make a system-wide bet and
always trigger direct reclaim locally, when we might be reclaiming or
evicting relatively hotter memory from the local node in order to make room.
In scenario 5, remote access latency is higher, which means the cost of
placing hot memory in remote nodes is higher. But today, we have many
strategies that can help offset that cost. If the system has tiered memory
with different memory
access characteristics per-node, then the user is probably already enabling
promotion and demotion mechanisms that can quickly correct the placement of
hot pages in lower tiers. In these systems, it might make more sense to allow
the kernel to naturally consume all of the memory it can (whether it is local
or on a lower-tier remote node), and then take corrective action based on
what it finds to be hot or cold memory.
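For what it's worth, enabling that corrective action is already just a couple
of knob writes on recent kernels. A rough sketch, assuming the usual locations
for demotion and tiering-aware NUMA balancing (exact paths and accepted values
may vary by kernel version):

#include <stdio.h>

static int write_knob(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return -1;
        }
        fprintf(f, "%s\n", val);
        return fclose(f);
}

int main(void)
{
        /* Let reclaim demote cold pages to a lower tier instead of discarding them. */
        write_knob("/sys/kernel/mm/numa/demotion_enabled", "1");
        /* Mode 2: NUMA balancing with memory tiering (hot-page promotion). */
        write_knob("/proc/sys/kernel/numa_balancing", "2");
        return 0;
}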
Of course, demonstrating that there are alternatives is not enough to warrant
a deprecation. I think the real benefit of this series comes from reduced
sysctl maintenance and code that is much easier to read.
This series, which has 466 deletions and 9 insertions:
- Deprecates the zone_reclaim_mode sysctl (patch 4)
- Deprecates the min_slab_ratio sysctl (patch 3)
- Deprecates the min_unmapped_ratio sysctl (patch 3)
- Removes the node_reclaim() function and simplifies the watermark checks in
  get_page_from_freelist, which is already a very large function (patch 2)
- Simplifies hpage_collapse_scan_{pmd, file} (patch 1).
- There are also more opportunities for future cleanup, like removing
__node_reclaim and converting its last caller to use try_to_free_pages
(suggested by Johannes Weiner)
Here are some points that I hope to discuss at LPC:
- For workloads that are assumed to fit in a NUMA node, is membind really
enough to achieve the same effect?
- Is NUMA balancing good enough to take corrective action when memory spills
  over to remote nodes and ends up being accessed frequently?
- How widely is zone_reclaim_mode currently being used?
- Are there usecases for zone_reclaim_mode that cannot be replaced by any
of the mentioned alternatives?
- Now that node_reclaim() is removed in patch 2, patch 3 deprecates
min_slab_ratio and min_unmapped_ratio. Does this change make sense?
IOW, should proactive reclaim via memory.reclaim still care about
these thresholds before making a decision to reclaim?
- If we agree that there are better alternatives to zone_reclaim_mode, how
should we make the transition to deprecate it, along with the other
sysctls that are deprecated in this series (min_{slab, unmapped}_ratio)?
Please also note that I've left individual email addresses off the Cc list
for this cover letter. It was ~30 addresses, and I wanted to avoid spamming
maintainers and reviewers, so I've only kept the mailing list targets.
The individuals are Cc-ed in the relevant patches, though.
Thank you everyone. I'm looking forward to discussing this idea with you all!
Joshua
[1] https://lpc.events/event/19/contributions/2142/
[2] https://lore.kernel.org/linux-mm/20250919162134.1098208-1-hannes@cmpxchg.org/
[3] https://lore.kernel.org/all/20250805205048.1518453-1-joshua.hahnjy@gmail.com/
Joshua Hahn (4):
mm/khugepaged: Remove hpage_collapse_scan_abort
mm/vmscan/page_alloc: Remove node_reclaim
mm/vmscan/page_alloc: Deprecate min_{slab, unmapped}_ratio
mm/vmscan: Deprecate zone_reclaim_mode
Documentation/admin-guide/sysctl/vm.rst | 78 ---------
Documentation/mm/physical_memory.rst | 9 -
.../translations/zh_CN/mm/physical_memory.rst | 8 -
arch/powerpc/include/asm/topology.h | 4 -
include/linux/mmzone.h | 8 -
include/linux/swap.h | 5 -
include/linux/topology.h | 6 -
include/linux/vm_event_item.h | 4 -
include/trace/events/huge_memory.h | 1 -
include/uapi/linux/mempolicy.h | 14 --
mm/internal.h | 22 ---
mm/khugepaged.c | 34 ----
mm/page_alloc.c | 120 +------------
mm/vmscan.c | 158 +-----------------
mm/vmstat.c | 4 -
15 files changed, 9 insertions(+), 466 deletions(-)
base-commit: e4c4d9892021888be6d874ec1be307e80382f431
--
2.47.3
* [RFC LPC2025 PATCH 1/4] mm/khugepaged: Remove hpage_collapse_scan_abort
From: Joshua Hahn @ 2025-12-05 23:32 UTC (permalink / raw)
Cc: Liam R. Howlett, Andrew Morton, Baolin Wang, Barry Song,
David Hildenbrand, Dev Jain, Lance Yang, Lorenzo Stoakes,
Masami Hiramatsu, Mathieu Desnoyers, Nico Pache, Ryan Roberts,
Steven Rostedt, Zi Yan, linux-kernel, linux-mm,
linux-trace-kernel
Commit 14a4e2141e24 ("mm, thp: only collapse hugepages to nodes with
affinity for zone_reclaim_mode") introduced khugepaged_scan_abort,
which was later renamed to hpage_collapse_scan_abort. It prevents
collapsing hugepages to remote nodes when zone_reclaim_mode is enabled
so as to prefer reclaiming and allocating locally instead of allocating on a
far-away remote node (distance > RECLAIM_DISTANCE).
With the zone_reclaim_mode sysctl being deprecated later in the series,
remove hpage_collapse_scan_abort, its callers, and its associated value
in the scan_result enum.
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
include/trace/events/huge_memory.h | 1 -
mm/khugepaged.c | 34 ------------------------------
2 files changed, 35 deletions(-)
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 4cde53b45a85..1c0b146d1286 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -20,7 +20,6 @@
EM( SCAN_PTE_MAPPED_HUGEPAGE, "pte_mapped_hugepage") \
EM( SCAN_LACK_REFERENCED_PAGE, "lack_referenced_page") \
EM( SCAN_PAGE_NULL, "page_null") \
- EM( SCAN_SCAN_ABORT, "scan_aborted") \
EM( SCAN_PAGE_COUNT, "not_suitable_page_count") \
EM( SCAN_PAGE_LRU, "page_not_in_lru") \
EM( SCAN_PAGE_LOCK, "page_locked") \
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 97d1b2824386..a93228a53ee4 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -40,7 +40,6 @@ enum scan_result {
SCAN_PTE_MAPPED_HUGEPAGE,
SCAN_LACK_REFERENCED_PAGE,
SCAN_PAGE_NULL,
- SCAN_SCAN_ABORT,
SCAN_PAGE_COUNT,
SCAN_PAGE_LRU,
SCAN_PAGE_LOCK,
@@ -830,30 +829,6 @@ struct collapse_control khugepaged_collapse_control = {
.is_khugepaged = true,
};
-static bool hpage_collapse_scan_abort(int nid, struct collapse_control *cc)
-{
- int i;
-
- /*
- * If node_reclaim_mode is disabled, then no extra effort is made to
- * allocate memory locally.
- */
- if (!node_reclaim_enabled())
- return false;
-
- /* If there is a count for this node already, it must be acceptable */
- if (cc->node_load[nid])
- return false;
-
- for (i = 0; i < MAX_NUMNODES; i++) {
- if (!cc->node_load[i])
- continue;
- if (node_distance(nid, i) > node_reclaim_distance)
- return true;
- }
- return false;
-}
-
#define khugepaged_defrag() \
(transparent_hugepage_flags & \
(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG))
@@ -1355,10 +1330,6 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
* hit record.
*/
node = folio_nid(folio);
- if (hpage_collapse_scan_abort(node, cc)) {
- result = SCAN_SCAN_ABORT;
- goto out_unmap;
- }
cc->node_load[node]++;
if (!folio_test_lru(folio)) {
result = SCAN_PAGE_LRU;
@@ -2342,11 +2313,6 @@ static int hpage_collapse_scan_file(struct mm_struct *mm, unsigned long addr,
}
node = folio_nid(folio);
- if (hpage_collapse_scan_abort(node, cc)) {
- result = SCAN_SCAN_ABORT;
- folio_put(folio);
- break;
- }
cc->node_load[node]++;
if (!folio_test_lru(folio)) {
--
2.47.3
* [RFC LPC2025 PATCH 2/4] mm/vmscan/page_alloc: Remove node_reclaim
From: Joshua Hahn @ 2025-12-05 23:32 UTC (permalink / raw)
Cc: Liam R. Howlett, Andrew Morton, Axel Rasmussen, Brendan Jackman,
David Hildenbrand, Johannes Weiner, Lorenzo Stoakes,
Michal Hocko, Mike Rapoport, Qi Zheng, Shakeel Butt,
Suren Baghdasaryan, Vlastimil Babka, Wei Xu, Yuanchu Xie, Zi Yan,
linux-kernel, linux-mm
node_reclaim() is currently only called from get_page_from_freelist(),
when the zone_reclaim_mode sysctl is set and the current node is full.
With the zone_reclaim_mode sysctl being deprecated later in the series,
there are no more callsites for node_reclaim. Remove node_reclaim and
its associated return values NODE_RECLAIM_{NOSCAN, FULL, SOME, SUCCESS},
as well as the zone_reclaim_{success, failed} vmstat items.
We can also remove zone_allows_reclaim: with node_reclaim_enabled always
returning false, it would never be evaluated.
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
include/linux/vm_event_item.h | 4 ---
mm/internal.h | 11 ------
mm/page_alloc.c | 34 ------------------
mm/vmscan.c | 67 -----------------------------------
mm/vmstat.c | 4 ---
5 files changed, 120 deletions(-)
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 92f80b4d69a6..2520200b65f0 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -53,10 +53,6 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
PGSCAN_FILE,
PGSTEAL_ANON,
PGSTEAL_FILE,
-#ifdef CONFIG_NUMA
- PGSCAN_ZONE_RECLAIM_SUCCESS,
- PGSCAN_ZONE_RECLAIM_FAILED,
-#endif
PGINODESTEAL, SLABS_SCANNED, KSWAPD_INODESTEAL,
KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
PAGEOUTRUN, PGROTATED,
diff --git a/mm/internal.h b/mm/internal.h
index 04c307ee33ae..743fcebe53a8 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1196,24 +1196,13 @@ static inline void mminit_verify_zonelist(void)
}
#endif /* CONFIG_DEBUG_MEMORY_INIT */
-#define NODE_RECLAIM_NOSCAN -2
-#define NODE_RECLAIM_FULL -1
-#define NODE_RECLAIM_SOME 0
-#define NODE_RECLAIM_SUCCESS 1
-
#ifdef CONFIG_NUMA
extern int node_reclaim_mode;
-extern int node_reclaim(struct pglist_data *, gfp_t, unsigned int);
extern int find_next_best_node(int node, nodemask_t *used_node_mask);
#else
#define node_reclaim_mode 0
-static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
- unsigned int order)
-{
- return NODE_RECLAIM_NOSCAN;
-}
static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
{
return NUMA_NO_NODE;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d0f026ec10b6..010a035e81bd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3684,17 +3684,6 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
#ifdef CONFIG_NUMA
int __read_mostly node_reclaim_distance = RECLAIM_DISTANCE;
-
-static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
-{
- return node_distance(zone_to_nid(local_zone), zone_to_nid(zone)) <=
- node_reclaim_distance;
-}
-#else /* CONFIG_NUMA */
-static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
-{
- return true;
-}
#endif /* CONFIG_NUMA */
/*
@@ -3868,8 +3857,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
if (!zone_watermark_fast(zone, order, mark,
ac->highest_zoneidx, alloc_flags,
gfp_mask)) {
- int ret;
-
if (cond_accept_memory(zone, order, alloc_flags))
goto try_this_zone;
@@ -3885,27 +3872,6 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
if (alloc_flags & ALLOC_NO_WATERMARKS)
goto try_this_zone;
-
- if (!node_reclaim_enabled() ||
- !zone_allows_reclaim(zonelist_zone(ac->preferred_zoneref), zone))
- continue;
-
- ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
- switch (ret) {
- case NODE_RECLAIM_NOSCAN:
- /* did not scan */
- continue;
- case NODE_RECLAIM_FULL:
- /* scanned but unreclaimable */
- continue;
- default:
- /* did we reclaim enough */
- if (zone_watermark_ok(zone, order, mark,
- ac->highest_zoneidx, alloc_flags))
- goto try_this_zone;
-
- continue;
- }
}
try_this_zone:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3b85652a42b9..d07acd76fdea 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7537,13 +7537,6 @@ module_init(kswapd_init)
*/
int node_reclaim_mode __read_mostly;
-/*
- * Priority for NODE_RECLAIM. This determines the fraction of pages
- * of a node considered for each zone_reclaim. 4 scans 1/16th of
- * a zone.
- */
-#define NODE_RECLAIM_PRIORITY 4
-
/*
* Percentage of pages in a zone that must be unmapped for node_reclaim to
* occur.
@@ -7646,66 +7639,6 @@ static unsigned long __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask,
return sc->nr_reclaimed;
}
-int node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask, unsigned int order)
-{
- int ret;
- /* Minimum pages needed in order to stay on node */
- const unsigned long nr_pages = 1 << order;
- struct scan_control sc = {
- .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
- .gfp_mask = current_gfp_context(gfp_mask),
- .order = order,
- .priority = NODE_RECLAIM_PRIORITY,
- .may_writepage = !!(node_reclaim_mode & RECLAIM_WRITE),
- .may_unmap = !!(node_reclaim_mode & RECLAIM_UNMAP),
- .may_swap = 1,
- .reclaim_idx = gfp_zone(gfp_mask),
- };
-
- /*
- * Node reclaim reclaims unmapped file backed pages and
- * slab pages if we are over the defined limits.
- *
- * A small portion of unmapped file backed pages is needed for
- * file I/O otherwise pages read by file I/O will be immediately
- * thrown out if the node is overallocated. So we do not reclaim
- * if less than a specified percentage of the node is used by
- * unmapped file backed pages.
- */
- if (node_pagecache_reclaimable(pgdat) <= pgdat->min_unmapped_pages &&
- node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B) <=
- pgdat->min_slab_pages)
- return NODE_RECLAIM_FULL;
-
- /*
- * Do not scan if the allocation should not be delayed.
- */
- if (!gfpflags_allow_blocking(gfp_mask) || (current->flags & PF_MEMALLOC))
- return NODE_RECLAIM_NOSCAN;
-
- /*
- * Only run node reclaim on the local node or on nodes that do not
- * have associated processors. This will favor the local processor
- * over remote processors and spread off node memory allocations
- * as wide as possible.
- */
- if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id != numa_node_id())
- return NODE_RECLAIM_NOSCAN;
-
- if (test_and_set_bit_lock(PGDAT_RECLAIM_LOCKED, &pgdat->flags))
- return NODE_RECLAIM_NOSCAN;
-
- ret = __node_reclaim(pgdat, gfp_mask, nr_pages, &sc) >= nr_pages;
- clear_bit_unlock(PGDAT_RECLAIM_LOCKED, &pgdat->flags);
-
- if (ret)
- count_vm_event(PGSCAN_ZONE_RECLAIM_SUCCESS);
- else
- count_vm_event(PGSCAN_ZONE_RECLAIM_FAILED);
-
- return ret;
-}
-
enum {
MEMORY_RECLAIM_SWAPPINESS = 0,
MEMORY_RECLAIM_SWAPPINESS_MAX,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 65de88cdf40e..3564bc62325a 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1349,10 +1349,6 @@ const char * const vmstat_text[] = {
[I(PGSTEAL_ANON)] = "pgsteal_anon",
[I(PGSTEAL_FILE)] = "pgsteal_file",
-#ifdef CONFIG_NUMA
- [I(PGSCAN_ZONE_RECLAIM_SUCCESS)] = "zone_reclaim_success",
- [I(PGSCAN_ZONE_RECLAIM_FAILED)] = "zone_reclaim_failed",
-#endif
[I(PGINODESTEAL)] = "pginodesteal",
[I(SLABS_SCANNED)] = "slabs_scanned",
[I(KSWAPD_INODESTEAL)] = "kswapd_inodesteal",
--
2.47.3
* [RFC LPC2025 PATCH 3/4] mm/vmscan/page_alloc: Deprecate min_{slab, unmapped}_ratio
From: Joshua Hahn @ 2025-12-05 23:32 UTC (permalink / raw)
Cc: Liam R. Howlett, Alex Shi, Andrew Morton, Axel Rasmussen,
Baoquan He, Barry Song, Brendan Jackman, Chris Li,
David Hildenbrand, Dongliang Mu, Johannes Weiner,
Jonathan Corbet, Kairui Song, Kemeng Shi, Lorenzo Stoakes,
Michal Hocko, Mike Rapoport, Nhat Pham, Qi Zheng, Shakeel Butt,
Suren Baghdasaryan, Vlastimil Babka, Wei Xu, Yanteng Si,
Yuanchu Xie, Zi Yan, linux-doc, linux-kernel, linux-mm
The min_slab_ratio and min_unmapped_ratio sysctls allow the user to tune
how much reclaimable slab or reclaimable pagecache a node must have before
__node_reclaim will shrink it. Prior to this series, these checks were
reached in two ways:
1. When zone_reclaim_mode is enabled, the local node is full, and
node_reclaim is called to shrink the current node
2. When the user directly asks to shrink a node by writing to the
memory.reclaim file (i.e. proactive reclaim)
In the first scenario, the two parameters ensure that node reclaim is
only performed when the cost of reclaiming is outweighed by the amount of
memory that can easily be freed. In other words, they throttle node reclaim
when the local node runs out of memory, and instead resort to fallback
allocations on a remote node.
With the zone_reclaim_mode sysctl being deprecated later in the series,
only the second scenario remains in the system. The implications here
are slightly different. Now, __node_reclaim is only called when the user
explicitly asks for it. In this case, it might make less sense to try
and throttle this behavior. In fact, it might feel counterintuitive from
the user's perspective if triggering proactive reclaim leads to no memory
being reclaimed, even when there is reclaimable memory (albeit a small amount).
Deprecate the min_{slab, unmapped}_ratio sysctls now that node_reclaim
no longer needs to be throttled. This leads to fewer sysctls needing to
be maintained, and a more intuitive __node_reclaim.
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
Documentation/admin-guide/sysctl/vm.rst | 37 ---------
Documentation/mm/physical_memory.rst | 9 --
.../translations/zh_CN/mm/physical_memory.rst | 8 --
include/linux/mmzone.h | 8 --
include/linux/swap.h | 5 --
mm/page_alloc.c | 82 -------------------
mm/vmscan.c | 73 ++---------------
7 files changed, 7 insertions(+), 215 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 4d71211fdad8..ea2fd3feb9c6 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -49,8 +49,6 @@ Currently, these files are in /proc/sys/vm:
- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
-- min_slab_ratio
-- min_unmapped_ratio
- mmap_min_addr
- mmap_rnd_bits
- mmap_rnd_compat_bits
@@ -549,41 +547,6 @@ become subtly broken, and prone to deadlock under high loads.
Setting this too high will OOM your machine instantly.
-min_slab_ratio
-==============
-
-This is available only on NUMA kernels.
-
-A percentage of the total pages in each zone. On Zone reclaim
-(fallback from the local zone occurs) slabs will be reclaimed if more
-than this percentage of pages in a zone are reclaimable slab pages.
-This insures that the slab growth stays under control even in NUMA
-systems that rarely perform global reclaim.
-
-The default is 5 percent.
-
-Note that slab reclaim is triggered in a per zone / node fashion.
-The process of reclaiming slab memory is currently not node specific
-and may not be fast.
-
-
-min_unmapped_ratio
-==================
-
-This is available only on NUMA kernels.
-
-This is a percentage of the total pages in each zone. Zone reclaim will
-only occur if more than this percentage of pages are in a state that
-zone_reclaim_mode allows to be reclaimed.
-
-If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared
-against all file-backed unmapped pages including swapcache pages and tmpfs
-files. Otherwise, only unmapped pages backed by normal files but not tmpfs
-files and similar are considered.
-
-The default is 1 percent.
-
-
mmap_min_addr
=============
diff --git a/Documentation/mm/physical_memory.rst b/Documentation/mm/physical_memory.rst
index b76183545e5b..ee8fd939020d 100644
--- a/Documentation/mm/physical_memory.rst
+++ b/Documentation/mm/physical_memory.rst
@@ -296,15 +296,6 @@ See also Documentation/mm/page_reclaim.rst.
``kswapd_failures``
Number of runs kswapd was unable to reclaim any pages
-``min_unmapped_pages``
- Minimal number of unmapped file backed pages that cannot be reclaimed.
- Determined by ``vm.min_unmapped_ratio`` sysctl. Only defined when
- ``CONFIG_NUMA`` is enabled.
-
-``min_slab_pages``
- Minimal number of SLAB pages that cannot be reclaimed. Determined by
- ``vm.min_slab_ratio sysctl``. Only defined when ``CONFIG_NUMA`` is enabled
-
``flags``
Flags controlling reclaim behavior.
diff --git a/Documentation/translations/zh_CN/mm/physical_memory.rst b/Documentation/translations/zh_CN/mm/physical_memory.rst
index 4594d15cefec..670bd8103c3b 100644
--- a/Documentation/translations/zh_CN/mm/physical_memory.rst
+++ b/Documentation/translations/zh_CN/mm/physical_memory.rst
@@ -280,14 +280,6 @@ kswapd线程可以回收的最高区域索引。
``kswapd_failures``
kswapd无法回收任何页面的运行次数。
-``min_unmapped_pages``
-无法回收的未映射文件支持的最小页面数量。由 ``vm.min_unmapped_ratio``
-系统控制台(sysctl)参数决定。在开启 ``CONFIG_NUMA`` 配置时定义。
-
-``min_slab_pages``
-无法回收的SLAB页面的最少数量。由 ``vm.min_slab_ratio`` 系统控制台
-(sysctl)参数决定。在开启 ``CONFIG_NUMA`` 时定义。
-
``flags``
控制回收行为的标志位。
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 75ef7c9f9307..4be84764d097 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1451,14 +1451,6 @@ typedef struct pglist_data {
*/
unsigned long totalreserve_pages;
-#ifdef CONFIG_NUMA
- /*
- * node reclaim becomes active if more unmapped pages exist.
- */
- unsigned long min_unmapped_pages;
- unsigned long min_slab_pages;
-#endif /* CONFIG_NUMA */
-
/* Write-intensive fields used by page reclaim */
CACHELINE_PADDING(_pad1_);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 38ca3df68716..c5915d787852 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -411,11 +411,6 @@ static inline void reclaim_unregister_node(struct node *node)
}
#endif /* CONFIG_SYSFS && CONFIG_NUMA */
-#ifdef CONFIG_NUMA
-extern int sysctl_min_unmapped_ratio;
-extern int sysctl_min_slab_ratio;
-#endif
-
void check_move_unevictable_folios(struct folio_batch *fbatch);
extern void __meminit kswapd_run(int nid);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 010a035e81bd..9524713c81b7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5676,8 +5676,6 @@ int local_memory_node(int node)
}
#endif
-static void setup_min_unmapped_ratio(void);
-static void setup_min_slab_ratio(void);
#else /* CONFIG_NUMA */
static void build_zonelists(pg_data_t *pgdat)
@@ -6487,11 +6485,6 @@ int __meminit init_per_zone_wmark_min(void)
refresh_zone_stat_thresholds();
setup_per_zone_lowmem_reserve();
-#ifdef CONFIG_NUMA
- setup_min_unmapped_ratio();
- setup_min_slab_ratio();
-#endif
-
khugepaged_min_free_kbytes_update();
return 0;
@@ -6534,63 +6527,6 @@ static int watermark_scale_factor_sysctl_handler(const struct ctl_table *table,
return 0;
}
-#ifdef CONFIG_NUMA
-static void setup_min_unmapped_ratio(void)
-{
- pg_data_t *pgdat;
- struct zone *zone;
-
- for_each_online_pgdat(pgdat)
- pgdat->min_unmapped_pages = 0;
-
- for_each_zone(zone)
- zone->zone_pgdat->min_unmapped_pages += (zone_managed_pages(zone) *
- sysctl_min_unmapped_ratio) / 100;
-}
-
-
-static int sysctl_min_unmapped_ratio_sysctl_handler(const struct ctl_table *table, int write,
- void *buffer, size_t *length, loff_t *ppos)
-{
- int rc;
-
- rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
- if (rc)
- return rc;
-
- setup_min_unmapped_ratio();
-
- return 0;
-}
-
-static void setup_min_slab_ratio(void)
-{
- pg_data_t *pgdat;
- struct zone *zone;
-
- for_each_online_pgdat(pgdat)
- pgdat->min_slab_pages = 0;
-
- for_each_zone(zone)
- zone->zone_pgdat->min_slab_pages += (zone_managed_pages(zone) *
- sysctl_min_slab_ratio) / 100;
-}
-
-static int sysctl_min_slab_ratio_sysctl_handler(const struct ctl_table *table, int write,
- void *buffer, size_t *length, loff_t *ppos)
-{
- int rc;
-
- rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
- if (rc)
- return rc;
-
- setup_min_slab_ratio();
-
- return 0;
-}
-#endif
-
/*
* lowmem_reserve_ratio_sysctl_handler - just a wrapper around
* proc_dointvec() so that we can call setup_per_zone_lowmem_reserve()
@@ -6720,24 +6656,6 @@ static const struct ctl_table page_alloc_sysctl_table[] = {
.mode = 0644,
.proc_handler = numa_zonelist_order_handler,
},
- {
- .procname = "min_unmapped_ratio",
- .data = &sysctl_min_unmapped_ratio,
- .maxlen = sizeof(sysctl_min_unmapped_ratio),
- .mode = 0644,
- .proc_handler = sysctl_min_unmapped_ratio_sysctl_handler,
- .extra1 = SYSCTL_ZERO,
- .extra2 = SYSCTL_ONE_HUNDRED,
- },
- {
- .procname = "min_slab_ratio",
- .data = &sysctl_min_slab_ratio,
- .maxlen = sizeof(sysctl_min_slab_ratio),
- .mode = 0644,
- .proc_handler = sysctl_min_slab_ratio_sysctl_handler,
- .extra1 = SYSCTL_ZERO,
- .extra2 = SYSCTL_ONE_HUNDRED,
- },
#endif
};
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d07acd76fdea..4e23289efba4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7537,62 +7537,6 @@ module_init(kswapd_init)
*/
int node_reclaim_mode __read_mostly;
-/*
- * Percentage of pages in a zone that must be unmapped for node_reclaim to
- * occur.
- */
-int sysctl_min_unmapped_ratio = 1;
-
-/*
- * If the number of slab pages in a zone grows beyond this percentage then
- * slab reclaim needs to occur.
- */
-int sysctl_min_slab_ratio = 5;
-
-static inline unsigned long node_unmapped_file_pages(struct pglist_data *pgdat)
-{
- unsigned long file_mapped = node_page_state(pgdat, NR_FILE_MAPPED);
- unsigned long file_lru = node_page_state(pgdat, NR_INACTIVE_FILE) +
- node_page_state(pgdat, NR_ACTIVE_FILE);
-
- /*
- * It's possible for there to be more file mapped pages than
- * accounted for by the pages on the file LRU lists because
- * tmpfs pages accounted for as ANON can also be FILE_MAPPED
- */
- return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0;
-}
-
-/* Work out how many page cache pages we can reclaim in this reclaim_mode */
-static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
-{
- unsigned long nr_pagecache_reclaimable;
- unsigned long delta = 0;
-
- /*
- * If RECLAIM_UNMAP is set, then all file pages are considered
- * potentially reclaimable. Otherwise, we have to worry about
- * pages like swapcache and node_unmapped_file_pages() provides
- * a better estimate
- */
- if (node_reclaim_mode & RECLAIM_UNMAP)
- nr_pagecache_reclaimable = node_page_state(pgdat, NR_FILE_PAGES);
- else
- nr_pagecache_reclaimable = node_unmapped_file_pages(pgdat);
-
- /*
- * Since we can't clean folios through reclaim, remove dirty file
- * folios from consideration.
- */
- delta += node_page_state(pgdat, NR_FILE_DIRTY);
-
- /* Watch for any possible underflows due to delta */
- if (unlikely(delta > nr_pagecache_reclaimable))
- delta = nr_pagecache_reclaimable;
-
- return nr_pagecache_reclaimable - delta;
-}
-
/*
* Try to free up some pages from this node through reclaim.
*/
@@ -7617,16 +7561,13 @@ static unsigned long __node_reclaim(struct pglist_data *pgdat, gfp_t gfp_mask,
noreclaim_flag = memalloc_noreclaim_save();
set_task_reclaim_state(p, &sc->reclaim_state);
- if (node_pagecache_reclaimable(pgdat) > pgdat->min_unmapped_pages ||
- node_page_state_pages(pgdat, NR_SLAB_RECLAIMABLE_B) > pgdat->min_slab_pages) {
- /*
- * Free memory by calling shrink node with increasing
- * priorities until we have enough memory freed.
- */
- do {
- shrink_node(pgdat, sc);
- } while (sc->nr_reclaimed < nr_pages && --sc->priority >= 0);
- }
+ /*
+ * Free memory by calling shrink node with increasing priorities until
+ * we have enough memory freed.
+ */
+ do {
+ shrink_node(pgdat, sc);
+ } while (sc->nr_reclaimed < nr_pages && --sc->priority >= 0);
set_task_reclaim_state(p, NULL);
memalloc_noreclaim_restore(noreclaim_flag);
--
2.47.3
* [RFC LPC2025 PATCH 4/4] mm/vmscan: Deprecate zone_reclaim_mode
From: Joshua Hahn @ 2025-12-05 23:32 UTC (permalink / raw)
Cc: Liam R. Howlett, Alistair Popple, Andrew Morton, Axel Rasmussen,
Brendan Jackman, Byungchul Park, Christophe Leroy,
David Hildenbrand, Gregory Price, Johannes Weiner,
Jonathan Corbet, Lorenzo Stoakes, Madhavan Srinivasan,
Matthew Brost, Michael Ellerman, Michal Hocko, Mike Rapoport,
Nicholas Piggin, Qi Zheng, Rakie Kim, Shakeel Butt,
Suren Baghdasaryan, Vlastimil Babka, Wei Xu, Ying Huang,
Yuanchu Xie, Zi Yan, linux-doc, linux-kernel, linux-mm,
linuxppc-dev
zone_reclaim_mode was introduced in 2005 to work around the NUMA
penalties associated with allocating memory on remote nodes. It changed
the page allocator's behavior to prefer stalling and performing direct
reclaim locally over allocating on a remote node.
In 2014, zone_reclaim_mode was disabled by default, as it was deemed
unsuitable for most workloads [1]. Since then, and especially since 2005,
a lot has changed. NUMA penalties are lower than they used to be, and we
now have much more extensive infrastructure to control NUMA spillage
(NUMA balancing, memory.reclaim, tiering / promotion / demotion). Together,
these changes make remote memory access a much more appealing alternative
to stalling the system when there might be free memory on other nodes.
This is not to say that there are no workloads that perform better with
NUMA locality. However, zone_reclaim_mode is a system-wide setting that
makes this bet for all running workloads on the machine. Today, we have
alternatives that provide more fine-grained control over allocation
strategy, such as mbind or set_mempolicy.
Deprecate zone_reclaim_mode in favor of modern alternatives, such as
NUMA balancing, membinding, and promotion/demotion mechanisms. This
improves code readability and maintainability, especially in the page
allocation code.
[1] Commit 4f9b16a64753 ("mm: disable zone_reclaim_mode by default")
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
Documentation/admin-guide/sysctl/vm.rst | 41 -------------------------
arch/powerpc/include/asm/topology.h | 4 ---
include/linux/topology.h | 6 ----
include/uapi/linux/mempolicy.h | 14 ---------
mm/internal.h | 11 -------
mm/page_alloc.c | 4 +--
mm/vmscan.c | 18 -----------
7 files changed, 2 insertions(+), 96 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index ea2fd3feb9c6..635b16c1867e 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -76,7 +76,6 @@ Currently, these files are in /proc/sys/vm:
- vfs_cache_pressure_denom
- watermark_boost_factor
- watermark_scale_factor
-- zone_reclaim_mode
admin_reserve_kbytes
@@ -1046,43 +1045,3 @@ going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
that the number of free pages kswapd maintains for latency reasons is
too small for the allocation bursts occurring in the system. This knob
can then be used to tune kswapd aggressiveness accordingly.
-
-
-zone_reclaim_mode
-=================
-
-Zone_reclaim_mode allows someone to set more or less aggressive approaches to
-reclaim memory when a zone runs out of memory. If it is set to zero then no
-zone reclaim occurs. Allocations will be satisfied from other zones / nodes
-in the system.
-
-This is value OR'ed together of
-
-= ===================================
-1 Zone reclaim on
-2 Zone reclaim writes dirty pages out
-4 Zone reclaim swaps pages
-= ===================================
-
-zone_reclaim_mode is disabled by default. For file servers or workloads
-that benefit from having their data cached, zone_reclaim_mode should be
-left disabled as the caching effect is likely to be more important than
-data locality.
-
-Consider enabling one or more zone_reclaim mode bits if it's known that the
-workload is partitioned such that each partition fits within a NUMA node
-and that accessing remote memory would cause a measurable performance
-reduction. The page allocator will take additional actions before
-allocating off node pages.
-
-Allowing zone reclaim to write out pages stops processes that are
-writing large amounts of data from dirtying pages on other nodes. Zone
-reclaim will write out dirty pages if a zone fills up and so effectively
-throttle the process. This may decrease the performance of a single process
-since it cannot use all of system memory to buffer the outgoing writes
-anymore but it preserve the memory on other nodes so that the performance
-of other processes running on other nodes will not be affected.
-
-Allowing regular swap effectively restricts allocations to the local
-node unless explicitly overridden by memory policies or cpuset
-configurations.
diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index f19ca44512d1..49015b2b0d8d 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -10,10 +10,6 @@ struct drmem_lmb;
#ifdef CONFIG_NUMA
-/*
- * If zone_reclaim_mode is enabled, a RECLAIM_DISTANCE of 10 will mean that
- * all zones on all nodes will be eligible for zone_reclaim().
- */
#define RECLAIM_DISTANCE 10
#include <asm/mmzone.h>
diff --git a/include/linux/topology.h b/include/linux/topology.h
index 6575af39fd10..37018264ca1e 100644
--- a/include/linux/topology.h
+++ b/include/linux/topology.h
@@ -50,12 +50,6 @@ int arch_update_cpu_topology(void);
#define node_distance(from,to) ((from) == (to) ? LOCAL_DISTANCE : REMOTE_DISTANCE)
#endif
#ifndef RECLAIM_DISTANCE
-/*
- * If the distance between nodes in a system is larger than RECLAIM_DISTANCE
- * (in whatever arch specific measurement units returned by node_distance())
- * and node_reclaim_mode is enabled then the VM will only call node_reclaim()
- * on nodes within this distance.
- */
#define RECLAIM_DISTANCE 30
#endif
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 8fbbe613611a..194f922dad9b 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -65,18 +65,4 @@ enum {
#define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */
#define MPOL_F_MORON (1 << 4) /* Migrate On protnone Reference On Node */
-/*
- * Enabling zone reclaim means the page allocator will attempt to fulfill
- * the allocation request on the current node by triggering reclaim and
- * trying to shrink the current node.
- * Fallback allocations on the next candidates in the zonelist are considered
- * when reclaim fails to free up enough memory in the current node/zone.
- *
- * These bit locations are exposed in the vm.zone_reclaim_mode sysctl.
- * New bits are OK, but existing bits should not be changed.
- */
-#define RECLAIM_ZONE (1<<0) /* Enable zone reclaim */
-#define RECLAIM_WRITE (1<<1) /* Writeout pages during reclaim */
-#define RECLAIM_UNMAP (1<<2) /* Unmap pages during reclaim */
-
#endif /* _UAPI_LINUX_MEMPOLICY_H */
diff --git a/mm/internal.h b/mm/internal.h
index 743fcebe53a8..a2df0bf3f458 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1197,24 +1197,13 @@ static inline void mminit_verify_zonelist(void)
#endif /* CONFIG_DEBUG_MEMORY_INIT */
#ifdef CONFIG_NUMA
-extern int node_reclaim_mode;
-
extern int find_next_best_node(int node, nodemask_t *used_node_mask);
#else
-#define node_reclaim_mode 0
-
static inline int find_next_best_node(int node, nodemask_t *used_node_mask)
{
return NUMA_NO_NODE;
}
#endif
-
-static inline bool node_reclaim_enabled(void)
-{
- /* Is any node_reclaim_mode bit set? */
- return node_reclaim_mode & (RECLAIM_ZONE|RECLAIM_WRITE|RECLAIM_UNMAP);
-}
-
/*
* mm/memory-failure.c
*/
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9524713c81b7..bf4faec4ebe6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3823,8 +3823,8 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
* If kswapd is already active on a node, keep looking
* for other nodes that might be idle. This can happen
* if another process has NUMA bindings and is causing
- * kswapd wakeups on only some nodes. Avoid accidental
- * "node_reclaim_mode"-like behavior in this case.
+ * kswapd wakeups on only some nodes. Avoid accidentally
+ * overpressuring the local node when remote nodes are free.
*/
if (skip_kswapd_nodes &&
!waitqueue_active(&zone->zone_pgdat->kswapd_wait)) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4e23289efba4..f480a395df65 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -7503,16 +7503,6 @@ static const struct ctl_table vmscan_sysctl_table[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_TWO_HUNDRED,
},
-#ifdef CONFIG_NUMA
- {
- .procname = "zone_reclaim_mode",
- .data = &node_reclaim_mode,
- .maxlen = sizeof(node_reclaim_mode),
- .mode = 0644,
- .proc_handler = proc_dointvec_minmax,
- .extra1 = SYSCTL_ZERO,
- }
-#endif
};
static int __init kswapd_init(void)
@@ -7529,14 +7519,6 @@ static int __init kswapd_init(void)
module_init(kswapd_init)
#ifdef CONFIG_NUMA
-/*
- * Node reclaim mode
- *
- * If non-zero call node_reclaim when the number of free pages falls below
- * the watermarks.
- */
-int node_reclaim_mode __read_mostly;
-
/*
* Try to free up some pages from this node through reclaim.
*/
--
2.47.3