linux-mm.kvack.org archive mirror
* [PATCH v6 0/2] Add swappiness argument to memory.reclaim
@ 2024-01-03 16:48 Dan Schatzberg
  2024-01-03 16:48 ` [PATCH v6 1/2] mm: add defines for min/max swappiness Dan Schatzberg
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Dan Schatzberg @ 2024-01-03 16:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, cgroups, linux-mm, Tejun Heo, Zefan Li,
	Johannes Weiner, Jonathan Corbet, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, David Hildenbrand, Chris Li,
	Matthew Wilcox, Kefeng Wang, Dan Schatzberg, Yosry Ahmed,
	Yue Zhao, Hugh Dickins

Changes since V5:
  * Made the scan_control behavior limited to proactive reclaim explicitly
  * Created sc_swappiness helper to reduce chance of misuse

Changes since V4:
  * Fixed some initialization bugs by reverting to a pointer for swappiness
  * Added some more caveats to the behavior of swappiness in documentation

Changes since V3:
  * Added #define for MIN_SWAPPINESS and MAX_SWAPPINESS
  * Added explicit calls to mem_cgroup_swappiness

Changes since V2:
  * No functional change
  * Used int consistently rather than a pointer

Changes since V1:
  * Added documentation

This series proposes augmenting the memory.reclaim interface with a
swappiness=<val> argument that overrides the swappiness value for that instance
of proactive reclaim.

Userspace proactive reclaimers use the memory.reclaim interface to trigger
reclaim. The memory.reclaim interface does not provide any way to affect the
balance of file vs anon during proactive reclaim. The only approach is to adjust
the vm.swappiness setting. However, there are a few reasons to control the
balance of file vs anon during proactive reclaim separately from reactive
reclaim:

* Swapout should be limited to manage SSD write endurance. In near-OOM
  situations we are fine with lots of swap-out to avoid OOMs. As these are
  typically rare events, they have relatively little impact on write endurance.
  However, proactive reclaim runs continuously and so its impact on SSD write
  endurance is more significant. Therefore it is desirable to control swap-out
  for proactive reclaim separately from reactive reclaim.

* Some userspace OOM killers like systemd-oomd[1] support OOM killing on swap
  exhaustion. This makes sense if the swap exhaustion is triggered due to
  reactive reclaim but less so if it is triggered due to proactive reclaim (e.g.
  one could see OOMs when free memory is ample but anon is just particularly
  cold). Therefore, it's desirable to have proactive reclaim reduce or stop
  swap-out before the threshold at which OOM killing occurs.
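
With the proposed interface, a reclaimer can express this bias per request
instead of globally. A minimal userspace sketch (the helper name and cgroup
path below are hypothetical; only the write format comes from this series):

```python
def reclaim_request(nr_bytes, swappiness=None):
    """Build the string written to memory.reclaim (illustrative sketch).

    swappiness=None omits the nested key entirely, so the kernel falls
    back to the cgroup's effective swappiness, mirroring the optional
    argument this series adds.
    """
    if swappiness is not None and not 0 <= swappiness <= 200:
        raise ValueError("swappiness must be within [0, 200]")
    req = str(nr_bytes)
    if swappiness is not None:
        req += f" swappiness={swappiness}"
    return req

# A proactive reclaimer would then do something like:
#   with open("/sys/fs/cgroup/workload.slice/memory.reclaim", "w") as f:
#       f.write(reclaim_request(512 << 20, swappiness=0))
print(reclaim_request(512 << 20, swappiness=0))
```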

In the case of Meta's Senpai proactive reclaimer, we adjust vm.swappiness before
writes to memory.reclaim[2]. This has been in production for nearly two years
and has addressed our needs to control proactive vs reactive reclaim behavior
but is still not ideal for a number of reasons:

* vm.swappiness is a global setting; adjusting it can race with or interfere
  with other system administration that wishes to control vm.swappiness. In our
  case, we need to disable Senpai before adjusting vm.swappiness.

* vm.swappiness is stateful - so a crash or restart of Senpai can leave a
  misconfigured setting. This requires some additional management to record the
  "desired" setting and ensure Senpai always adjusts to it.
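
The save-and-restore dance this requires can be sketched as follows (the
global knob is modeled by a plain variable standing in for
/proc/sys/vm/swappiness; illustrative only):

```python
vm_swappiness = 60  # stand-in for /proc/sys/vm/swappiness

def proactive_reclaim_pass(issue_reclaim):
    """One pre-patch proactive reclaim pass (illustrative sketch)."""
    global vm_swappiness
    saved = vm_swappiness      # the "desired" setting must be remembered
    vm_swappiness = 0          # globally visible: races with other admins
    try:
        issue_reclaim()        # the write to memory.reclaim goes here
    finally:
        # A SIGKILL or crash before this line leaves the global knob
        # misconfigured at 0 -- the statefulness problem above.
        vm_swappiness = saved

proactive_reclaim_pass(lambda: None)
print(vm_swappiness)
```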

With this patch, we avoid these downsides of adjusting vm.swappiness globally.

Previously, this exact interface addition was proposed by Yosry[3]. In response,
Roman instead proposed an interface to specify precise file/anon/slab reclaim
amounts[4]. More recently, Huan proposed this interface again[5], and others
similarly questioned whether this was the proper interface.

Previous proposals sought to use this to allow proactive reclaimers to
effectively perform a custom reclaim algorithm by issuing proactive reclaim with
different settings to control file vs anon reclaim (e.g. to only reclaim anon
from some applications). Responses argued that adjusting swappiness is a poor
interface for custom reclaim.

In contrast, I argue in favor of a swappiness setting not as a way to implement
custom reclaim algorithms but rather to bias the balance of anon vs file due to
differences of proactive vs reactive reclaim. In this context, swappiness is the
existing interface for controlling this balance and this patch simply allows for
it to be configured differently for proactive vs reactive reclaim.

Specifying explicit amounts of anon vs file pages to reclaim feels inappropriate
for this purpose. Proactive reclaimers are unaware of the relative age of file
vs anon for a cgroup, which makes it difficult to manage proactive reclaim of
different memory pools. A proactive reclaimer would need to tune the amount of
anon reclaim attempts separately from the amount of file reclaim attempts, which
seems brittle given that it's difficult to observe the impact of either.

[1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
[2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
[3]https://lore.kernel.org/linux-mm/CAJD7tkbDpyoODveCsnaqBBMZEkDvshXJmNdbk51yKSNgD7aGdg@mail.gmail.com/
[4]https://lore.kernel.org/linux-mm/YoPHtHXzpK51F%2F1Z@carbon/
[5]https://lore.kernel.org/lkml/20231108065818.19932-1-link@vivo.com/

Dan Schatzberg (2):
  mm: add defines for min/max swappiness
  mm: add swappiness= arg to memory.reclaim

 Documentation/admin-guide/cgroup-v2.rst | 18 +++++---
 include/linux/swap.h                    |  5 ++-
 mm/memcontrol.c                         | 58 ++++++++++++++++++++-----
 mm/vmscan.c                             | 39 ++++++++++++-----
 4 files changed, 90 insertions(+), 30 deletions(-)

-- 
2.39.3



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v6 1/2] mm: add defines for min/max swappiness
  2024-01-03 16:48 [PATCH v6 0/2] Add swappiness argument to memory.reclaim Dan Schatzberg
@ 2024-01-03 16:48 ` Dan Schatzberg
  2024-01-03 16:48 ` [PATCH v6 2/2] mm: add swappiness= arg to memory.reclaim Dan Schatzberg
  2024-06-11 19:25 ` [PATCH v6 0/2] Add swappiness argument " Shakeel Butt
  2 siblings, 0 replies; 17+ messages in thread
From: Dan Schatzberg @ 2024-01-03 16:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, cgroups, linux-mm, David Rientjes, Chris Li,
	Nhat Pham, Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	David Hildenbrand, Matthew Wilcox, Kefeng Wang, Dan Schatzberg,
	Yue Zhao, Yosry Ahmed, Hugh Dickins

We use the constants 0 and 200 in a few places in the mm code when
referring to the min and max swappiness. This patch adds MIN_SWAPPINESS
and MAX_SWAPPINESS #defines to improve clarity. There are no functional
changes.

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Chris Li <chrisl@kernel.org>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
---
 include/linux/swap.h |  2 ++
 mm/memcontrol.c      |  2 +-
 mm/vmscan.c          | 14 +++++++-------
 3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index f6dd6575b905..e2ab76c25b4a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -407,6 +407,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 
 #define MEMCG_RECLAIM_MAY_SWAP (1 << 1)
 #define MEMCG_RECLAIM_PROACTIVE (1 << 2)
+#define MIN_SWAPPINESS 0
+#define MAX_SWAPPINESS 200
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b226090fd906..fbe9f02dd206 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4337,7 +4337,7 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
 
-	if (val > 200)
+	if (val > MAX_SWAPPINESS)
 		return -EINVAL;
 
 	if (!mem_cgroup_is_root(memcg))
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9dd8977de5a2..d91963e2d47f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -183,7 +183,7 @@ struct scan_control {
 #endif
 
 /*
- * From 0 .. 200.  Higher means more swappy.
+ * From 0 .. MAX_SWAPPINESS.  Higher means more swappy.
  */
 int vm_swappiness = 60;
 
@@ -2403,7 +2403,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	ap = swappiness * (total_cost + 1);
 	ap /= anon_cost + 1;
 
-	fp = (200 - swappiness) * (total_cost + 1);
+	fp = (MAX_SWAPPINESS - swappiness) * (total_cost + 1);
 	fp /= file_cost + 1;
 
 	fraction[0] = ap;
@@ -4400,7 +4400,7 @@ static int get_type_to_scan(struct lruvec *lruvec, int swappiness, int *tier_idx
 {
 	int type, tier;
 	struct ctrl_pos sp, pv;
-	int gain[ANON_AND_FILE] = { swappiness, 200 - swappiness };
+	int gain[ANON_AND_FILE] = { swappiness, MAX_SWAPPINESS - swappiness };
 
 	/*
 	 * Compare the first tier of anon with that of file to determine which
@@ -4436,7 +4436,7 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
 	/*
 	 * Try to make the obvious choice first. When anon and file are both
 	 * available from the same generation, interpret swappiness 1 as file
-	 * first and 200 as anon first.
+	 * first and MAX_SWAPPINESS as anon first.
 	 */
 	if (!swappiness)
 		type = LRU_GEN_FILE;
@@ -4444,7 +4444,7 @@ static int isolate_folios(struct lruvec *lruvec, struct scan_control *sc, int sw
 		type = LRU_GEN_ANON;
 	else if (swappiness == 1)
 		type = LRU_GEN_FILE;
-	else if (swappiness == 200)
+	else if (swappiness == MAX_SWAPPINESS)
 		type = LRU_GEN_ANON;
 	else
 		type = get_type_to_scan(lruvec, swappiness, &tier);
@@ -5398,9 +5398,9 @@ static int run_cmd(char cmd, int memcg_id, int nid, unsigned long seq,
 
 	lruvec = get_lruvec(memcg, nid);
 
-	if (swappiness < 0)
+	if (swappiness < MIN_SWAPPINESS)
 		swappiness = get_swappiness(lruvec, sc);
-	else if (swappiness > 200)
+	else if (swappiness > MAX_SWAPPINESS)
 		goto done;
 
 	switch (cmd) {
-- 
2.39.3




* [PATCH v6 2/2] mm: add swappiness= arg to memory.reclaim
  2024-01-03 16:48 [PATCH v6 0/2] Add swappiness argument to memory.reclaim Dan Schatzberg
  2024-01-03 16:48 ` [PATCH v6 1/2] mm: add defines for min/max swappiness Dan Schatzberg
@ 2024-01-03 16:48 ` Dan Schatzberg
  2024-01-03 17:19   ` Yu Zhao
  2024-01-04 10:09   ` Michal Hocko
  2024-06-11 19:25 ` [PATCH v6 0/2] Add swappiness argument " Shakeel Butt
  2 siblings, 2 replies; 17+ messages in thread
From: Dan Schatzberg @ 2024-01-03 16:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, cgroups, linux-mm, Yosry Ahmed, Michal Hocko,
	David Rientjes, Chris Li, Tejun Heo, Zefan Li, Johannes Weiner,
	Jonathan Corbet, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, David Hildenbrand, Matthew Wilcox, Kefeng Wang,
	Dan Schatzberg, Yue Zhao, Hugh Dickins

Allow proactive reclaimers to submit an additional swappiness=<val>
argument to memory.reclaim. This overrides the global or per-memcg
swappiness setting for that reclaim attempt.

For example:

echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim

will perform reclaim on the root cgroup with a swappiness setting of 0 (no
swap), regardless of the vm.swappiness sysctl setting.
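
The accepted format can be modeled in userspace as follows (a sketch, not
the kernel code: memparse also accepts further suffixes, and the parsing
details may differ):

```python
import re

_SUFFIX = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

def parse_reclaim(buf):
    """Model of the memory.reclaim argument format (illustrative sketch).

    Returns (nr_bytes, swappiness); swappiness is None when the nested
    key is absent, mirroring the NULL swappiness pointer in the patch.
    """
    parts = buf.strip().split()
    m = re.fullmatch(r"(\d+)([KkMmGg]?)", parts[0]) if parts else None
    if not m:
        raise ValueError("missing or malformed size")
    nr_bytes = int(m.group(1)) * _SUFFIX[m.group(2).upper()]
    swappiness = None
    for tok in parts[1:]:
        key, sep, val = tok.partition("=")
        if key != "swappiness" or not sep:
            raise ValueError("unknown key: " + tok)
        swappiness = int(val)
        if not 0 <= swappiness <= 200:
            raise ValueError("swappiness out of range")
    return nr_bytes, swappiness

print(parse_reclaim("2M swappiness=0"))
```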

Userspace proactive reclaimers use the memory.reclaim interface to
trigger reclaim. The memory.reclaim interface does not provide any way
to affect the balance of file vs anon during proactive reclaim. The only
approach is to adjust the vm.swappiness setting. However, there are a
few reasons to control the balance of file vs anon during proactive
reclaim, separately from reactive reclaim:

* Swapout should be limited to manage SSD write endurance. In near-OOM
situations we are fine with lots of swap-out to avoid OOMs. As these are
typically rare events, they have relatively little impact on write
endurance. However, proactive reclaim runs continuously and so its
impact on SSD write endurance is more significant. Therefore it is
desirable to control swap-out for proactive reclaim separately from
reactive reclaim.

* Some userspace OOM killers like systemd-oomd[1] support OOM killing on
swap exhaustion. This makes sense if the swap exhaustion is triggered
due to reactive reclaim but less so if it is triggered due to proactive
reclaim (e.g. one could see OOMs when free memory is ample but anon is
just particularly cold). Therefore, it's desirable to have proactive
reclaim reduce or stop swap-out before the threshold at which OOM
killing occurs.

In the case of Meta's Senpai proactive reclaimer, we adjust
vm.swappiness before writes to memory.reclaim[2]. This has been in
production for nearly two years and has addressed our needs to control
proactive vs reactive reclaim behavior but is still not ideal for a
number of reasons:

* vm.swappiness is a global setting; adjusting it can race with or
interfere with other system administration that wishes to control
vm.swappiness. In our case, we need to disable Senpai before adjusting
vm.swappiness.

* vm.swappiness is stateful - so a crash or restart of Senpai can leave
a misconfigured setting. This requires some additional management to
record the "desired" setting and ensure Senpai always adjusts to it.

With this patch, we avoid these downsides of adjusting vm.swappiness
globally.

[1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
[2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598

Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: David Rientjes <rientjes@google.com>
Acked-by: Chris Li <chrisl@kernel.org>
---
 Documentation/admin-guide/cgroup-v2.rst | 18 ++++----
 include/linux/swap.h                    |  3 +-
 mm/memcontrol.c                         | 56 ++++++++++++++++++++-----
 mm/vmscan.c                             | 25 +++++++++--
 4 files changed, 80 insertions(+), 22 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3f85254f3cef..ee42f74e0765 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1282,17 +1282,10 @@ PAGE_SIZE multiple when read back.
 	This is a simple interface to trigger memory reclaim in the
 	target cgroup.
 
-	This file accepts a single key, the number of bytes to reclaim.
-	No nested keys are currently supported.
-
 	Example::
 
 	  echo "1G" > memory.reclaim
 
-	The interface can be later extended with nested keys to
-	configure the reclaim behavior. For example, specify the
-	type of memory to reclaim from (anon, file, ..).
-
 	Please note that the kernel can over or under reclaim from
 	the target cgroup. If less bytes are reclaimed than the
 	specified amount, -EAGAIN is returned.
@@ -1304,6 +1297,17 @@ PAGE_SIZE multiple when read back.
 	This means that the networking layer will not adapt based on
 	reclaim induced by memory.reclaim.
 
+The following nested keys are defined.
+
+	  ==========            ================================
+	  swappiness            Swappiness value to reclaim with
+	  ==========            ================================
+
+	Specifying a swappiness value instructs the kernel to perform
+	the reclaim with that swappiness value. Note that this has the
+	same semantics as vm.swappiness applied to memcg reclaim with
+	all the existing limitations and potential future extensions.
+
   memory.peak
 	A read-only single value file which exists on non-root
 	cgroups.
diff --git a/include/linux/swap.h b/include/linux/swap.h
index e2ab76c25b4a..8afdec40efe3 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -412,7 +412,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 						  unsigned long nr_pages,
 						  gfp_t gfp_mask,
-						  unsigned int reclaim_options);
+						  unsigned int reclaim_options,
+						  int *swappiness);
 extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
 						gfp_t gfp_mask, bool noswap,
 						pg_data_t *pgdat,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fbe9f02dd206..6d627a754851 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -52,6 +52,7 @@
 #include <linux/sort.h>
 #include <linux/fs.h>
 #include <linux/seq_file.h>
+#include <linux/parser.h>
 #include <linux/vmpressure.h>
 #include <linux/memremap.h>
 #include <linux/mm_inline.h>
@@ -2449,7 +2450,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
 		psi_memstall_enter(&pflags);
 		nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
 							gfp_mask,
-							MEMCG_RECLAIM_MAY_SWAP);
+							MEMCG_RECLAIM_MAY_SWAP,
+							NULL);
 		psi_memstall_leave(&pflags);
 	} while ((memcg = parent_mem_cgroup(memcg)) &&
 		 !mem_cgroup_is_root(memcg));
@@ -2740,7 +2742,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	psi_memstall_enter(&pflags);
 	nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
-						    gfp_mask, reclaim_options);
+						    gfp_mask, reclaim_options, NULL);
 	psi_memstall_leave(&pflags);
 
 	if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
@@ -3660,7 +3662,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 		}
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
+					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
 			ret = -EBUSY;
 			break;
 		}
@@ -3774,7 +3776,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
 			return -EINTR;
 
 		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-						  MEMCG_RECLAIM_MAY_SWAP))
+						  MEMCG_RECLAIM_MAY_SWAP, NULL))
 			nr_retries--;
 	}
 
@@ -6720,7 +6722,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
 		}
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL);
 
 		if (!reclaimed && !nr_retries--)
 			break;
@@ -6769,7 +6771,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 
 		if (nr_reclaims) {
 			if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
-					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
+					GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL))
 				nr_reclaims--;
 			continue;
 		}
@@ -6895,19 +6897,50 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+enum {
+	MEMORY_RECLAIM_SWAPPINESS = 0,
+	MEMORY_RECLAIM_NULL,
+};
+
+static const match_table_t tokens = {
+	{ MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
+	{ MEMORY_RECLAIM_NULL, NULL },
+};
+
 static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			      size_t nbytes, loff_t off)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
 	unsigned int nr_retries = MAX_RECLAIM_RETRIES;
 	unsigned long nr_to_reclaim, nr_reclaimed = 0;
+	int swappiness = -1;
 	unsigned int reclaim_options;
-	int err;
+	char *old_buf, *start;
+	substring_t args[MAX_OPT_ARGS];
 
 	buf = strstrip(buf);
-	err = page_counter_memparse(buf, "", &nr_to_reclaim);
-	if (err)
-		return err;
+
+	old_buf = buf;
+	nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
+	if (buf == old_buf)
+		return -EINVAL;
+
+	buf = strstrip(buf);
+
+	while ((start = strsep(&buf, " ")) != NULL) {
+		if (!strlen(start))
+			continue;
+		switch (match_token(start, tokens, args)) {
+		case MEMORY_RECLAIM_SWAPPINESS:
+			if (match_int(&args[0], &swappiness))
+				return -EINVAL;
+			if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
+				return -EINVAL;
+			break;
+		default:
+			return -EINVAL;
+		}
+	}
 
 	reclaim_options	= MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
 	while (nr_reclaimed < nr_to_reclaim) {
@@ -6926,7 +6959,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 
 		reclaimed = try_to_free_mem_cgroup_pages(memcg,
 					min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
-					GFP_KERNEL, reclaim_options);
+					GFP_KERNEL, reclaim_options,
+					swappiness == -1 ? NULL : &swappiness);
 
 		if (!reclaimed && !nr_retries--)
 			return -EAGAIN;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d91963e2d47f..394e0dd46b2e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -92,6 +92,11 @@ struct scan_control {
 	unsigned long	anon_cost;
 	unsigned long	file_cost;
 
+#ifdef CONFIG_MEMCG
+	/* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
+	int *proactive_swappiness;
+#endif
+
 	/* Can active folios be deactivated as part of reclaim? */
 #define DEACTIVATE_ANON 1
 #define DEACTIVATE_FILE 2
@@ -227,6 +232,13 @@ static bool writeback_throttling_sane(struct scan_control *sc)
 #endif
 	return false;
 }
+
+static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
+{
+	if (sc->proactive && sc->proactive_swappiness)
+		return *sc->proactive_swappiness;
+	return mem_cgroup_swappiness(memcg);
+}
 #else
 static bool cgroup_reclaim(struct scan_control *sc)
 {
@@ -242,6 +254,11 @@ static bool writeback_throttling_sane(struct scan_control *sc)
 {
 	return true;
 }
+
+static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
+{
+	return READ_ONCE(vm_swappiness);
+}
 #endif
 
 static void set_task_reclaim_state(struct task_struct *task,
@@ -2327,7 +2344,7 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
 	struct pglist_data *pgdat = lruvec_pgdat(lruvec);
 	struct mem_cgroup *memcg = lruvec_memcg(lruvec);
 	unsigned long anon_cost, file_cost, total_cost;
-	int swappiness = mem_cgroup_swappiness(memcg);
+	int swappiness = sc_swappiness(sc, memcg);
 	u64 fraction[ANON_AND_FILE];
 	u64 denominator = 0;	/* gcc */
 	enum scan_balance scan_balance;
@@ -2608,7 +2625,7 @@ static int get_swappiness(struct lruvec *lruvec, struct scan_control *sc)
 	    mem_cgroup_get_nr_swap_pages(memcg) < MIN_LRU_BATCH)
 		return 0;
 
-	return mem_cgroup_swappiness(memcg);
+	return sc_swappiness(sc, memcg);
 }
 
 static int get_nr_gens(struct lruvec *lruvec, int type)
@@ -6463,12 +6480,14 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 					   unsigned long nr_pages,
 					   gfp_t gfp_mask,
-					   unsigned int reclaim_options)
+					   unsigned int reclaim_options,
+					   int *swappiness)
 {
 	unsigned long nr_reclaimed;
 	unsigned int noreclaim_flag;
 	struct scan_control sc = {
 		.nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
+		.proactive_swappiness = swappiness,
 		.gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
 				(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
 		.reclaim_idx = MAX_NR_ZONES - 1,
-- 
2.39.3




* Re: [PATCH v6 2/2] mm: add swappiness= arg to memory.reclaim
  2024-01-03 16:48 ` [PATCH v6 2/2] mm: add swappiness= arg to memory.reclaim Dan Schatzberg
@ 2024-01-03 17:19   ` Yu Zhao
  2024-01-03 18:19     ` Dan Schatzberg
  2024-01-04 10:09   ` Michal Hocko
  1 sibling, 1 reply; 17+ messages in thread
From: Yu Zhao @ 2024-01-03 17:19 UTC (permalink / raw)
  To: Dan Schatzberg
  Cc: Andrew Morton, linux-kernel, cgroups, linux-mm, Yosry Ahmed,
	Michal Hocko, David Rientjes, Chris Li, Tejun Heo, Zefan Li,
	Johannes Weiner, Jonathan Corbet, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, David Hildenbrand, Matthew Wilcox,
	Kefeng Wang, Yue Zhao, Hugh Dickins

On Wed, Jan 3, 2024 at 9:49 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
>
> Allow proactive reclaimers to submit an additional swappiness=<val>
> argument to memory.reclaim. This overrides the global or per-memcg
> swappiness setting for that reclaim attempt.
>
> For example:
>
> echo "2M swappiness=0" > /sys/fs/cgroup/memory.reclaim
>
> will perform reclaim on the rootcg with a swappiness setting of 0 (no
> swap) regardless of the vm.swappiness sysctl setting.
>
> Userspace proactive reclaimers use the memory.reclaim interface to
> trigger reclaim. The memory.reclaim interface does not allow for any way
> to affect the balance of file vs anon during proactive reclaim. The only
> approach is to adjust the vm.swappiness setting. However, there are a
> few reasons we look to control the balance of file vs anon during
> proactive reclaim, separately from reactive reclaim:
>
> * Swapout should be limited to manage SSD write endurance. In near-OOM
> situations we are fine with lots of swap-out to avoid OOMs. As these are
> typically rare events, they have relatively little impact on write
> endurance. However, proactive reclaim runs continuously and so its
> impact on SSD write endurance is more significant. Therefore it is
> desirable to control swap-out for proactive reclaim separately from
> reactive reclaim
>
> * Some userspace OOM killers like systemd-oomd[1] support OOM killing on
> swap exhaustion. This makes sense if the swap exhaustion is triggered
> due to reactive reclaim but less so if it is triggered due to proactive
> reclaim (e.g. one could see OOMs when free memory is ample but anon is
> just particularly cold). Therefore, it's desirable to have proactive
> reclaim reduce or stop swap-out before the threshold at which OOM
> killing occurs.
>
> In the case of Meta's Senpai proactive reclaimer, we adjust
> vm.swappiness before writes to memory.reclaim[2]. This has been in
> production for nearly two years and has addressed our needs to control
> proactive vs reactive reclaim behavior but is still not ideal for a
> number of reasons:
>
> * vm.swappiness is a global setting, adjusting it can race/interfere
> with other system administration that wishes to control vm.swappiness.
> In our case, we need to disable Senpai before adjusting vm.swappiness.
>
> * vm.swappiness is stateful - so a crash or restart of Senpai can leave
> a misconfigured setting. This requires some additional management to
> record the "desired" setting and ensure Senpai always adjusts to it.
>
> With this patch, we avoid these downsides of adjusting vm.swappiness
> globally.
>
> [1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
> [2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
>
> Signed-off-by: Dan Schatzberg <schatzberg.dan@gmail.com>
> Suggested-by: Yosry Ahmed <yosryahmed@google.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Acked-by: David Rientjes <rientjes@google.com>
> Acked-by: Chris Li <chrisl@kernel.org>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 18 ++++----
>  include/linux/swap.h                    |  3 +-
>  mm/memcontrol.c                         | 56 ++++++++++++++++++++-----
>  mm/vmscan.c                             | 25 +++++++++--
>  4 files changed, 80 insertions(+), 22 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 3f85254f3cef..ee42f74e0765 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1282,17 +1282,10 @@ PAGE_SIZE multiple when read back.
>         This is a simple interface to trigger memory reclaim in the
>         target cgroup.
>
> -       This file accepts a single key, the number of bytes to reclaim.
> -       No nested keys are currently supported.
> -
>         Example::
>
>           echo "1G" > memory.reclaim
>
> -       The interface can be later extended with nested keys to
> -       configure the reclaim behavior. For example, specify the
> -       type of memory to reclaim from (anon, file, ..).
> -
>         Please note that the kernel can over or under reclaim from
>         the target cgroup. If less bytes are reclaimed than the
>         specified amount, -EAGAIN is returned.
> @@ -1304,6 +1297,17 @@ PAGE_SIZE multiple when read back.
>         This means that the networking layer will not adapt based on
>         reclaim induced by memory.reclaim.
>
> +The following nested keys are defined.
> +
> +         ==========            ================================
> +         swappiness            Swappiness value to reclaim with
> +         ==========            ================================
> +
> +       Specifying a swappiness value instructs the kernel to perform
> +       the reclaim with that swappiness value. Note that this has the
> +       same semantics as vm.swappiness applied to memcg reclaim with
> +       all the existing limitations and potential future extensions.
> +
>    memory.peak
>         A read-only single value file which exists on non-root
>         cgroups.
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index e2ab76c25b4a..8afdec40efe3 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -412,7 +412,8 @@ extern unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
>  extern unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>                                                   unsigned long nr_pages,
>                                                   gfp_t gfp_mask,
> -                                                 unsigned int reclaim_options);
> +                                                 unsigned int reclaim_options,
> +                                                 int *swappiness);
>  extern unsigned long mem_cgroup_shrink_node(struct mem_cgroup *mem,
>                                                 gfp_t gfp_mask, bool noswap,
>                                                 pg_data_t *pgdat,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index fbe9f02dd206..6d627a754851 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -52,6 +52,7 @@
>  #include <linux/sort.h>
>  #include <linux/fs.h>
>  #include <linux/seq_file.h>
> +#include <linux/parser.h>
>  #include <linux/vmpressure.h>
>  #include <linux/memremap.h>
>  #include <linux/mm_inline.h>
> @@ -2449,7 +2450,8 @@ static unsigned long reclaim_high(struct mem_cgroup *memcg,
>                 psi_memstall_enter(&pflags);
>                 nr_reclaimed += try_to_free_mem_cgroup_pages(memcg, nr_pages,
>                                                         gfp_mask,
> -                                                       MEMCG_RECLAIM_MAY_SWAP);
> +                                                       MEMCG_RECLAIM_MAY_SWAP,
> +                                                       NULL);
>                 psi_memstall_leave(&pflags);
>         } while ((memcg = parent_mem_cgroup(memcg)) &&
>                  !mem_cgroup_is_root(memcg));
> @@ -2740,7 +2742,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>
>         psi_memstall_enter(&pflags);
>         nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
> -                                                   gfp_mask, reclaim_options);
> +                                                   gfp_mask, reclaim_options, NULL);
>         psi_memstall_leave(&pflags);
>
>         if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
> @@ -3660,7 +3662,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
>                 }
>
>                 if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> -                                       memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP)) {
> +                                       memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
>                         ret = -EBUSY;
>                         break;
>                 }
> @@ -3774,7 +3776,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
>                         return -EINTR;
>
>                 if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> -                                                 MEMCG_RECLAIM_MAY_SWAP))
> +                                                 MEMCG_RECLAIM_MAY_SWAP, NULL))
>                         nr_retries--;
>         }
>
> @@ -6720,7 +6722,7 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
>                 }
>
>                 reclaimed = try_to_free_mem_cgroup_pages(memcg, nr_pages - high,
> -                                       GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP);
> +                                       GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL);
>
>                 if (!reclaimed && !nr_retries--)
>                         break;
> @@ -6769,7 +6771,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
>
>                 if (nr_reclaims) {
>                         if (!try_to_free_mem_cgroup_pages(memcg, nr_pages - max,
> -                                       GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP))
> +                                       GFP_KERNEL, MEMCG_RECLAIM_MAY_SWAP, NULL))
>                                 nr_reclaims--;
>                         continue;
>                 }
> @@ -6895,19 +6897,50 @@ static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
>         return nbytes;
>  }
>
> +enum {
> +       MEMORY_RECLAIM_SWAPPINESS = 0,
> +       MEMORY_RECLAIM_NULL,
> +};
> +
> +static const match_table_t tokens = {
> +       { MEMORY_RECLAIM_SWAPPINESS, "swappiness=%d"},
> +       { MEMORY_RECLAIM_NULL, NULL },
> +};
> +
>  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>                               size_t nbytes, loff_t off)
>  {
>         struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
>         unsigned int nr_retries = MAX_RECLAIM_RETRIES;
>         unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +       int swappiness = -1;
>         unsigned int reclaim_options;
> -       int err;
> +       char *old_buf, *start;
> +       substring_t args[MAX_OPT_ARGS];
>
>         buf = strstrip(buf);
> -       err = page_counter_memparse(buf, "", &nr_to_reclaim);
> -       if (err)
> -               return err;
> +
> +       old_buf = buf;
> +       nr_to_reclaim = memparse(buf, &buf) / PAGE_SIZE;
> +       if (buf == old_buf)
> +               return -EINVAL;
> +
> +       buf = strstrip(buf);
> +
> +       while ((start = strsep(&buf, " ")) != NULL) {
> +               if (!strlen(start))
> +                       continue;
> +               switch (match_token(start, tokens, args)) {
> +               case MEMORY_RECLAIM_SWAPPINESS:
> +                       if (match_int(&args[0], &swappiness))
> +                               return -EINVAL;
> +                       if (swappiness < MIN_SWAPPINESS || swappiness > MAX_SWAPPINESS)
> +                               return -EINVAL;
> +                       break;
> +               default:
> +                       return -EINVAL;
> +               }
> +       }
>
>         reclaim_options = MEMCG_RECLAIM_MAY_SWAP | MEMCG_RECLAIM_PROACTIVE;
>         while (nr_reclaimed < nr_to_reclaim) {
> @@ -6926,7 +6959,8 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>
>                 reclaimed = try_to_free_mem_cgroup_pages(memcg,
>                                         min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> -                                       GFP_KERNEL, reclaim_options);
> +                                       GFP_KERNEL, reclaim_options,
> +                                       swappiness == -1 ? NULL : &swappiness);
>
>                 if (!reclaimed && !nr_retries--)
>                         return -EAGAIN;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d91963e2d47f..394e0dd46b2e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -92,6 +92,11 @@ struct scan_control {
>         unsigned long   anon_cost;
>         unsigned long   file_cost;
>
> +#ifdef CONFIG_MEMCG
> +       /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> +       int *proactive_swappiness;
> +#endif

Why is proactive_swappiness still a pointer? The whole point of the
previous conversation is that sc->proactive can tell whether
sc->swappiness is valid or not, and that's less awkward than using a
pointer.

Also why the #ifdef here? I don't see the point for a small stack
variable. Otherwise wouldn't we want to do this for sc->proactive as
well?

If you really want it to be explicit, you could do
  struct scan_control {
    ...
    struct {
      bool is_set;
      int swappiness;
    } proactive;
  };

But I think even this is too much.
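For context, the memory_reclaim() parsing quoted above accepts an optional swappiness= token after the reclaim size. A hypothetical invocation sketch (the cgroup name and cgroup2 mount point here are assumptions for illustration, not taken from this thread):

```shell
# Reclaim 1G from the cgroup, overriding swappiness to 0 for this
# invocation only (prefer file pages, avoid swap-out).
echo "1G swappiness=0" > /sys/fs/cgroup/workload.slice/memory.reclaim

# Without the token, reclaim keeps the existing behavior and uses the
# cgroup's memory.swappiness / vm.swappiness value.
echo "512M" > /sys/fs/cgroup/workload.slice/memory.reclaim
```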


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 2/2] mm: add swapiness= arg to memory.reclaim
  2024-01-03 17:19   ` Yu Zhao
@ 2024-01-03 18:19     ` Dan Schatzberg
  2024-01-04  1:07       ` Yu Zhao
  2024-01-04  1:17       ` Yu Zhao
  0 siblings, 2 replies; 17+ messages in thread
From: Dan Schatzberg @ 2024-01-03 18:19 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Andrew Morton, linux-kernel, cgroups, linux-mm, Yosry Ahmed,
	Michal Hocko, David Rientjes, Chris Li, Tejun Heo, Zefan Li,
	Johannes Weiner, Jonathan Corbet, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, David Hildenbrand, Matthew Wilcox,
	Kefeng Wang, Yue Zhao, Hugh Dickins

On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote:
[...]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index d91963e2d47f..394e0dd46b2e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -92,6 +92,11 @@ struct scan_control {
> >         unsigned long   anon_cost;
> >         unsigned long   file_cost;
> >
> > +#ifdef CONFIG_MEMCG
> > +       /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> > +       int *proactive_swappiness;
> > +#endif
> 
> Why is proactive_swappiness still a pointer? The whole point of the
> previous conversation is that sc->proactive can tell whether
> sc->swappiness is valid or not, and that's less awkward than using a
> pointer.

It's the same reason as before - zero initialization ensures that the
pointer is NULL, which tells us whether it's valid or not. Proactive reclaim
might not set swappiness, and you need to distinguish between a swappiness
of 0 and not-set. See this discussion with Michal:

https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/

> Also why the #ifdef here? I don't see the point for a small stack
> variable. Otherwise wouldn't we want to do this for sc->proactive as
> well?

This was Michal's request and it feels similar to your rationale for
naming it proactive_swappiness - it's just restricting the interface
down to the only use-cases. I'd be fine with doing the same in
sc->proactive as a subsequent patch.

See https://lore.kernel.org/linux-mm/ZZUhBoTNgL3AUK3f@tiehlicka/



* Re: [PATCH v6 2/2] mm: add swapiness= arg to memory.reclaim
  2024-01-03 18:19     ` Dan Schatzberg
@ 2024-01-04  1:07       ` Yu Zhao
  2024-01-04  8:48         ` Michal Hocko
  2024-01-04  1:17       ` Yu Zhao
  1 sibling, 1 reply; 17+ messages in thread
From: Yu Zhao @ 2024-01-04  1:07 UTC (permalink / raw)
  To: Dan Schatzberg
  Cc: Andrew Morton, linux-kernel, cgroups, linux-mm, Yosry Ahmed,
	Michal Hocko, David Rientjes, Chris Li, Tejun Heo, Zefan Li,
	Johannes Weiner, Jonathan Corbet, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, David Hildenbrand, Matthew Wilcox,
	Kefeng Wang, Yue Zhao, Hugh Dickins

On Wed, Jan 03, 2024 at 01:19:59PM -0500, Dan Schatzberg wrote:
> On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote:
> [...]
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index d91963e2d47f..394e0dd46b2e 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -92,6 +92,11 @@ struct scan_control {
> > >         unsigned long   anon_cost;
> > >         unsigned long   file_cost;
> > >
> > > +#ifdef CONFIG_MEMCG
> > > +       /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> > > +       int *proactive_swappiness;
> > > +#endif
> > 
> > Why is proactive_swappiness still a pointer? The whole point of the
> > previous conversation is that sc->proactive can tell whether
> > sc->swappiness is valid or not, and that's less awkward than using a
> > pointer.
> 
> It's the same reason as before - zero initialization ensures that the
> pointer is NULL which tells us if it's valid or not. Proactive reclaim
> might not set swappiness and you need to distinguish swappiness of 0
> and not-set. See this discussion with Michal:
> 
> https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/

 static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
                              size_t nbytes, loff_t off)
 {
        struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
        unsigned int nr_retries = MAX_RECLAIM_RETRIES;
        unsigned long nr_to_reclaim, nr_reclaimed = 0;
+       int swappiness = -1;
...
                reclaimed = try_to_free_mem_cgroup_pages(memcg,
                                        min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
-                                       GFP_KERNEL, reclaim_options);
+                                       GFP_KERNEL, reclaim_options,
+                                       swappiness);

...

+static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
+{
+       return sc->proactive && sc->proactive_swappiness > -1 ?
+              sc->proactive_swappiness : mem_cgroup_swappiness(memcg);
+}



* Re: [PATCH v6 2/2] mm: add swapiness= arg to memory.reclaim
  2024-01-03 18:19     ` Dan Schatzberg
  2024-01-04  1:07       ` Yu Zhao
@ 2024-01-04  1:17       ` Yu Zhao
  1 sibling, 0 replies; 17+ messages in thread
From: Yu Zhao @ 2024-01-04  1:17 UTC (permalink / raw)
  To: Dan Schatzberg
  Cc: Andrew Morton, linux-kernel, cgroups, linux-mm, Yosry Ahmed,
	Michal Hocko, David Rientjes, Chris Li, Tejun Heo, Zefan Li,
	Johannes Weiner, Jonathan Corbet, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, David Hildenbrand, Matthew Wilcox,
	Kefeng Wang, Yue Zhao, Hugh Dickins

On Wed, Jan 3, 2024 at 11:20 AM Dan Schatzberg <schatzberg.dan@gmail.com> wrote:
>
> On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote:
> [...]
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index d91963e2d47f..394e0dd46b2e 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -92,6 +92,11 @@ struct scan_control {
> > >         unsigned long   anon_cost;
> > >         unsigned long   file_cost;
> > >
> > > +#ifdef CONFIG_MEMCG
> > > +       /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> > > +       int *proactive_swappiness;
> > > +#endif
> >
> > Why is proactive_swappiness still a pointer? The whole point of the
> > previous conversation is that sc->proactive can tell whether
> > sc->swappiness is valid or not, and that's less awkward than using a
> > pointer.
>
> It's the same reason as before - zero initialization ensures that the
> pointer is NULL which tells us if it's valid or not. Proactive reclaim
> might not set swappiness and you need to distinguish swappiness of 0
> and not-set. See this discussion with Michal:
>
> https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/
>
> > Also why the #ifdef here? I don't see the point for a small stack
> > variable. Otherwise wouldn't we want to do this for sc->proactive as
> > well?
>
> This was Michal's request and it feels similar to your rationale for
> naming it proactive_swappiness - it's just restricting the interface
> down to the only use-cases. I'd be fine with doing the same in
> sc->proactive as a subsequent patch.
>
> See https://lore.kernel.org/linux-mm/ZZUhBoTNgL3AUK3f@tiehlicka/

Also regarding #ifdef, quoting Documentation/process/4.Coding.rst:
"As a general rule, #ifdef use should be confined to header files
whenever possible."



* Re: [PATCH v6 2/2] mm: add swapiness= arg to memory.reclaim
  2024-01-04  1:07       ` Yu Zhao
@ 2024-01-04  8:48         ` Michal Hocko
  2024-01-09 23:54           ` Yu Zhao
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2024-01-04  8:48 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Dan Schatzberg, Andrew Morton, linux-kernel, cgroups, linux-mm,
	Yosry Ahmed, David Rientjes, Chris Li, Tejun Heo, Zefan Li,
	Johannes Weiner, Jonathan Corbet, Roman Gushchin, Shakeel Butt,
	Muchun Song, David Hildenbrand, Matthew Wilcox, Kefeng Wang,
	Yue Zhao, Hugh Dickins

On Wed 03-01-24 18:07:43, Yu Zhao wrote:
> On Wed, Jan 03, 2024 at 01:19:59PM -0500, Dan Schatzberg wrote:
> > On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote:
> > [...]
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index d91963e2d47f..394e0dd46b2e 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -92,6 +92,11 @@ struct scan_control {
> > > >         unsigned long   anon_cost;
> > > >         unsigned long   file_cost;
> > > >
> > > > +#ifdef CONFIG_MEMCG
> > > > +       /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> > > > +       int *proactive_swappiness;
> > > > +#endif
> > > 
> > > Why is proactive_swappiness still a pointer? The whole point of the
> > > previous conversation is that sc->proactive can tell whether
> > > sc->swappiness is valid or not, and that's less awkward than using a
> > > pointer.
> > 
> > It's the same reason as before - zero initialization ensures that the
> > pointer is NULL which tells us if it's valid or not. Proactive reclaim
> > might not set swappiness and you need to distinguish swappiness of 0
> > and not-set. See this discussion with Michal:
> > 
> > https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/
> 
>  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
>                               size_t nbytes, loff_t off)
>  {
>         struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
>         unsigned int nr_retries = MAX_RECLAIM_RETRIES;
>         unsigned long nr_to_reclaim, nr_reclaimed = 0;
> +       int swappiness = -1;
> ...
>                 reclaimed = try_to_free_mem_cgroup_pages(memcg,
>                                         min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> -                                       GFP_KERNEL, reclaim_options);
> +                                       GFP_KERNEL, reclaim_options,
> +                                       swappiness);
> 
> ...
> 
> +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
> +{
> +       return sc->proactive && sc->proactive_swappiness > -1 ?
> +              sc->proactive_swappiness : mem_cgroup_swappiness(memcg);
> +}

To be completely honest, I really fail to see why this is such a hot
discussion point. To be completely clear, both approaches are feasible.

The main argument for the NULL check based approach is that it is less error
prone against incorrect usage because any bug becomes obvious. If we
use any other special constant, a missing initialization would be much
harder to spot because it would be a subtle behavior change.

Are there really any strong arguments to go against this "default
initialization is safe" policy?
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v6 2/2] mm: add swapiness= arg to memory.reclaim
  2024-01-03 16:48 ` [PATCH v6 2/2] mm: add swapiness= arg to memory.reclaim Dan Schatzberg
  2024-01-03 17:19   ` Yu Zhao
@ 2024-01-04 10:09   ` Michal Hocko
  2024-01-09 23:57     ` Yu Zhao
  1 sibling, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2024-01-04 10:09 UTC (permalink / raw)
  To: Dan Schatzberg
  Cc: Andrew Morton, linux-kernel, cgroups, linux-mm, Yosry Ahmed,
	David Rientjes, Chris Li, Tejun Heo, Zefan Li, Johannes Weiner,
	Jonathan Corbet, Roman Gushchin, Shakeel Butt, Muchun Song,
	David Hildenbrand, Matthew Wilcox, Kefeng Wang, Yue Zhao,
	Hugh Dickins

On Wed 03-01-24 08:48:37, Dan Schatzberg wrote:
[...]
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d91963e2d47f..394e0dd46b2e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -92,6 +92,11 @@ struct scan_control {
>  	unsigned long	anon_cost;
>  	unsigned long	file_cost;
>  
> +#ifdef CONFIG_MEMCG
> +	/* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> +	int *proactive_swappiness;
> +#endif
> +
>  	/* Can active folios be deactivated as part of reclaim? */
>  #define DEACTIVATE_ANON 1
>  #define DEACTIVATE_FILE 2
> @@ -227,6 +232,13 @@ static bool writeback_throttling_sane(struct scan_control *sc)
>  #endif
>  	return false;
>  }
> +
> +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
> +{
> +	if (sc->proactive && sc->proactive_swappiness)
> +		return *sc->proactive_swappiness;
> +	return mem_cgroup_swappiness(memcg);
> +}

If you really want to make this bound to sc->proactive then do not use
CONFIG_MEMCG, as sc->proactive is not guarded either.

I do not think that the sc->proactive check is really necessary. A pure NULL
check is sufficient for valid and self-evident code that is future
proof. But TBH this is not the most important aspect of the patch to
spend much more time discussing. Either go with sc->proactive but make
it consistent across the config space, or simply rely on the NULL check
(with or without the MEMCG guard, as both are valid options).

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v6 2/2] mm: add swapiness= arg to memory.reclaim
  2024-01-04  8:48         ` Michal Hocko
@ 2024-01-09 23:54           ` Yu Zhao
  2024-01-10 10:32             ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Yu Zhao @ 2024-01-09 23:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dan Schatzberg, Andrew Morton, linux-kernel, cgroups, linux-mm,
	Yosry Ahmed, David Rientjes, Chris Li, Tejun Heo, Zefan Li,
	Johannes Weiner, Jonathan Corbet, Roman Gushchin, Shakeel Butt,
	Muchun Song, David Hildenbrand, Matthew Wilcox, Kefeng Wang,
	Yue Zhao, Hugh Dickins

On Thu, Jan 4, 2024 at 1:48 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 03-01-24 18:07:43, Yu Zhao wrote:
> > On Wed, Jan 03, 2024 at 01:19:59PM -0500, Dan Schatzberg wrote:
> > > On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote:
> > > [...]
> > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > index d91963e2d47f..394e0dd46b2e 100644
> > > > > --- a/mm/vmscan.c
> > > > > +++ b/mm/vmscan.c
> > > > > @@ -92,6 +92,11 @@ struct scan_control {
> > > > >         unsigned long   anon_cost;
> > > > >         unsigned long   file_cost;
> > > > >
> > > > > +#ifdef CONFIG_MEMCG
> > > > > +       /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> > > > > +       int *proactive_swappiness;
> > > > > +#endif
> > > >
> > > > Why is proactive_swappiness still a pointer? The whole point of the
> > > > previous conversation is that sc->proactive can tell whether
> > > > sc->swappiness is valid or not, and that's less awkward than using a
> > > > pointer.
> > >
> > > It's the same reason as before - zero initialization ensures that the
> > > pointer is NULL which tells us if it's valid or not. Proactive reclaim
> > > might not set swappiness and you need to distinguish swappiness of 0
> > > and not-set. See this discussion with Michal:
> > >
> > > https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/
> >
> >  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> >                               size_t nbytes, loff_t off)
> >  {
> >         struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> >         unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> >         unsigned long nr_to_reclaim, nr_reclaimed = 0;
> > +       int swappiness = -1;
> > ...
> >                 reclaimed = try_to_free_mem_cgroup_pages(memcg,
> >                                         min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > -                                       GFP_KERNEL, reclaim_options);
> > +                                       GFP_KERNEL, reclaim_options,
> > +                                       swappiness);
> >
> > ...
> >
> > +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
> > +{
> > +       return sc->proactive && sc->proactive_swappiness > -1 ?
> > +              sc->proactive_swappiness : mem_cgroup_swappiness(memcg);
> > +}
>
> To be completely honest, I really fail to see why this is such a hot
> discussion point. To be completely clear, both approaches are feasible.

Feasible but not equal.

> The main argument for the NULL check based approach is that it is less error
> prone against incorrect usage because any bug becomes obvious.

Any bug becomes *fatal*, and fatal is not only obvious but also hurts
in production systems.

This was the reason for going through the trouble of switching from
VM_BUG_ON() to VM_WARN_ON() and documenting it in
Documentation/process/coding-style.rst:

22) Do not crash the kernel
---------------------------

In general, the decision to crash the kernel belongs to the user, rather
than to the kernel developer.

Isn't it?

> If we
> use any other special constant a missing initialization would be much
> harder to spot because they would be subtle behavior change.
>
> Are there really any strong arguments to go against this "default
> initialization is safe" policy?

Just wanted to point out an alternative. Fine details (best practices)
matter to me.



* Re: [PATCH v6 2/2] mm: add swapiness= arg to memory.reclaim
  2024-01-04 10:09   ` Michal Hocko
@ 2024-01-09 23:57     ` Yu Zhao
  0 siblings, 0 replies; 17+ messages in thread
From: Yu Zhao @ 2024-01-09 23:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dan Schatzberg, Andrew Morton, linux-kernel, cgroups, linux-mm,
	Yosry Ahmed, David Rientjes, Chris Li, Tejun Heo, Zefan Li,
	Johannes Weiner, Jonathan Corbet, Roman Gushchin, Shakeel Butt,
	Muchun Song, David Hildenbrand, Matthew Wilcox, Kefeng Wang,
	Yue Zhao, Hugh Dickins

On Thu, Jan 4, 2024 at 3:09 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Wed 03-01-24 08:48:37, Dan Schatzberg wrote:
> [...]
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index d91963e2d47f..394e0dd46b2e 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -92,6 +92,11 @@ struct scan_control {
> >       unsigned long   anon_cost;
> >       unsigned long   file_cost;
> >
> > +#ifdef CONFIG_MEMCG
> > +     /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> > +     int *proactive_swappiness;
> > +#endif
> > +
> >       /* Can active folios be deactivated as part of reclaim? */
> >  #define DEACTIVATE_ANON 1
> >  #define DEACTIVATE_FILE 2
> > @@ -227,6 +232,13 @@ static bool writeback_throttling_sane(struct scan_control *sc)
> >  #endif
> >       return false;
> >  }
> > +
> > +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
> > +{
> > +     if (sc->proactive && sc->proactive_swappiness)
> > +             return *sc->proactive_swappiness;
> > +     return mem_cgroup_swappiness(memcg);
> > +}
>
> If you really want to make this sc->proactive bound then do not use
> CONFIG_MEMCG as sc->proactive is not guarded either.
>
> I do not think that sc->proactive check is really necessary. A pure NULL
> check is sufficient to have a valid and self evident code that is future
> proof. But TBH this is not the most important aspect of the patch to
> spend much more time discussing. Either go with sc->proactive but make
> it config space consistent or simply rely on NULL check (with or without
> MEMCG guard as both are valid options).

Now you see why I replied. That "hybrid" if statement matches neither
of the approaches that were suggested.



* Re: [PATCH v6 2/2] mm: add swapiness= arg to memory.reclaim
  2024-01-09 23:54           ` Yu Zhao
@ 2024-01-10 10:32             ` Michal Hocko
  0 siblings, 0 replies; 17+ messages in thread
From: Michal Hocko @ 2024-01-10 10:32 UTC (permalink / raw)
  To: Yu Zhao
  Cc: Dan Schatzberg, Andrew Morton, linux-kernel, cgroups, linux-mm,
	Yosry Ahmed, David Rientjes, Chris Li, Tejun Heo, Zefan Li,
	Johannes Weiner, Jonathan Corbet, Roman Gushchin, Shakeel Butt,
	Muchun Song, David Hildenbrand, Matthew Wilcox, Kefeng Wang,
	Yue Zhao, Hugh Dickins

On Tue 09-01-24 16:54:15, Yu Zhao wrote:
> On Thu, Jan 4, 2024 at 1:48 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Wed 03-01-24 18:07:43, Yu Zhao wrote:
> > > On Wed, Jan 03, 2024 at 01:19:59PM -0500, Dan Schatzberg wrote:
> > > > On Wed, Jan 03, 2024 at 10:19:40AM -0700, Yu Zhao wrote:
> > > > [...]
> > > > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > > > index d91963e2d47f..394e0dd46b2e 100644
> > > > > > --- a/mm/vmscan.c
> > > > > > +++ b/mm/vmscan.c
> > > > > > @@ -92,6 +92,11 @@ struct scan_control {
> > > > > >         unsigned long   anon_cost;
> > > > > >         unsigned long   file_cost;
> > > > > >
> > > > > > +#ifdef CONFIG_MEMCG
> > > > > > +       /* Swappiness value for proactive reclaim. Always use sc_swappiness()! */
> > > > > > +       int *proactive_swappiness;
> > > > > > +#endif
> > > > >
> > > > > Why is proactive_swappiness still a pointer? The whole point of the
> > > > > previous conversation is that sc->proactive can tell whether
> > > > > sc->swappiness is valid or not, and that's less awkward than using a
> > > > > pointer.
> > > >
> > > > It's the same reason as before - zero initialization ensures that the
> > > > pointer is NULL which tells us if it's valid or not. Proactive reclaim
> > > > might not set swappiness and you need to distinguish swappiness of 0
> > > > and not-set. See this discussion with Michal:
> > > >
> > > > https://lore.kernel.org/linux-mm/ZZUizpTWOt3gNeqR@tiehlicka/
> > >
> > >  static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
> > >                               size_t nbytes, loff_t off)
> > >  {
> > >         struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > >         unsigned int nr_retries = MAX_RECLAIM_RETRIES;
> > >         unsigned long nr_to_reclaim, nr_reclaimed = 0;
> > > +       int swappiness = -1;
> > > ...
> > >                 reclaimed = try_to_free_mem_cgroup_pages(memcg,
> > >                                         min(nr_to_reclaim - nr_reclaimed, SWAP_CLUSTER_MAX),
> > > -                                       GFP_KERNEL, reclaim_options);
> > > +                                       GFP_KERNEL, reclaim_options,
> > > +                                       swappiness);
> > >
> > > ...
> > >
> > > +static int sc_swappiness(struct scan_control *sc, struct mem_cgroup *memcg)
> > > +{
> > > +       return sc->proactive && sc->proactive_swappiness > -1 ?
> > > +              sc->proactive_swappiness : mem_cgroup_swappiness(memcg);
> > > +}
> >
> > To be completely honest, I really fail to see why this is such a hot
> > discussion point. To be completely clear, both approaches are feasible.
> 
> Feasible but not equal.
> 
> > The main argument for the NULL check based approach is that it is less error
> > prone against incorrect usage because any bug becomes obvious.
> 
> Any bug becomes *fatal*, and fatal isn't only obvious but also hurts
> in production systems.
> 
> This was the reason for going through the trouble switching from
> VM_BUG_ON() to VM_WARN_ON() and documenting it in
> Documentation/process/coding-style.rst:
> 
> 22) Do not crash the kernel
> ---------------------------
> 
> In general, the decision to crash the kernel belongs to the user, rather
> than to the kernel developer.
> 
> Isn't?

I do agree with this general statement but I do not think it is
applicable in this context.

This is not an explicit BUG() where the kernel deliberately panics the
system. We are talking about subtle misbehavior which might be
non-trivial to debug (there are other reasons not to swap at all) vs. a
potential NULL ptr dereference which will kill the userspace task in a
very obvious way. Sure there are risks with that, but checking for a
potential NULL ptr dereference is easier than spotting a forgotten
explicit initialization. There are clear pros and cons to both
approaches. NULL default-initialized structure members that allow for a
behavior override are a general kernel pattern, so I do not really see
this going way off the rails.
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH v6 0/2] Add swappiness argument to memory.reclaim
  2024-01-03 16:48 [PATCH v6 0/2] Add swappiness argument to memory.reclaim Dan Schatzberg
  2024-01-03 16:48 ` [PATCH v6 1/2] mm: add defines for min/max swappiness Dan Schatzberg
  2024-01-03 16:48 ` [PATCH v6 2/2] mm: add swapiness= arg to memory.reclaim Dan Schatzberg
@ 2024-06-11 19:25 ` Shakeel Butt
  2024-06-11 19:31   ` Yosry Ahmed
  2024-06-11 19:48   ` Andrew Morton
  2 siblings, 2 replies; 17+ messages in thread
From: Shakeel Butt @ 2024-06-11 19:25 UTC (permalink / raw)
  To: Yue Zhao, Andrew Morton
  Cc: linux-kernel, cgroups, linux-mm, Tejun Heo, Zefan Li,
	Johannes Weiner, Jonathan Corbet, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, David Hildenbrand, Chris Li,
	Matthew Wilcox, Kefeng Wang, Yosry Ahmed, Yue Zhao, Hugh Dickins,
	Dan Schatzberg

Hi folks,

This series has been in mm-unstable for several months. Are there
any remaining concerns here? Otherwise, can we please put this in the
mm-stable branch to be merged in the next Linux release?

On Wed, Jan 03, 2024 at 08:48:35AM GMT, Dan Schatzberg wrote:
> Changes since V5:
>   * Made the scan_control behavior limited to proactive reclaim explicitly
>   * created sc_swappiness helper to reduce chance of mis-use
> 
> Changes since V4:
>   * Fixed some initialization bugs by reverting back to a pointer for swappiness
>   * Added some more caveats to the behavior of swappiness in documentation
> 
> Changes since V3:
>   * Added #define for MIN_SWAPPINESS and MAX_SWAPPINESS
>   * Added explicit calls to mem_cgroup_swappiness
> 
> Changes since V2:
>   * No functional change
>   * Used int consistently rather than a pointer
> 
> Changes since V1:
>   * Added documentation
> 
> This patch proposes augmenting the memory.reclaim interface with a
> swappiness=<val> argument that overrides the swappiness value for that instance
> of proactive reclaim.
> 
> Userspace proactive reclaimers use the memory.reclaim interface to trigger
> reclaim. The memory.reclaim interface does not allow for any way to affect the
> balance of file vs anon during proactive reclaim. The only approach is to adjust
> the vm.swappiness setting. However, there are a few reasons we look to control
> the balance of file vs anon during proactive reclaim, separately from reactive
> reclaim:
> 
> * Swapout should be limited to manage SSD write endurance. In near-OOM
>   situations we are fine with lots of swap-out to avoid OOMs. As these are
>   typically rare events, they have relatively little impact on write endurance.
>   However, proactive reclaim runs continuously and so its impact on SSD write
>   endurance is more significant. Therefore it is desirable to control swap-out
>   for proactive reclaim separately from reactive reclaim.
> 
> * Some userspace OOM killers like systemd-oomd[1] support OOM killing on swap
>   exhaustion. This makes sense if the swap exhaustion is triggered due to
>   reactive reclaim but less so if it is triggered due to proactive reclaim (e.g.
>   one could see OOMs when free memory is ample but anon is just particularly
>   cold). Therefore, it's desirable to have proactive reclaim reduce or stop
>   swap-out before the threshold at which OOM killing occurs.
> 
> In the case of Meta's Senpai proactive reclaimer, we adjust vm.swappiness before
> writes to memory.reclaim[2]. This has been in production for nearly two years
> and has addressed our needs to control proactive vs reactive reclaim behavior
> but is still not ideal for a number of reasons:
> 
> * vm.swappiness is a global setting, adjusting it can race/interfere with other
>   system administration that wishes to control vm.swappiness. In our case, we
>   need to disable Senpai before adjusting vm.swappiness.
> 
> * vm.swappiness is stateful - so a crash or restart of Senpai can leave a
>   misconfigured setting. This requires some additional management to record the
>   "desired" setting and ensure Senpai always adjusts to it.
> 
> With this patch, we avoid these downsides of adjusting vm.swappiness globally.
> 
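For contrast, the workaround described above can be sketched as follows. The
paths and helper names are hypothetical, not taken from Senpai itself: every
reclaim request must save, override, and restore the global knob, and a crash
between the override and the restore leaves vm.swappiness misconfigured.

```shell
# Sketch of the global-knob dance this patch makes unnecessary.
# SWAPPINESS_CTL is a variable only so the logic is self-contained;
# on a real system it is /proc/sys/vm/swappiness.
SWAPPINESS_CTL=/proc/sys/vm/swappiness

reclaim_with_global_swappiness() {
    local cgroup="$1" amount="$2" swappiness="$3" saved
    saved=$(cat "$SWAPPINESS_CTL")
    echo "$swappiness" > "$SWAPPINESS_CTL"   # races with other admins
    echo "$amount" > "${cgroup}/memory.reclaim"
    echo "$saved" > "$SWAPPINESS_CTL"        # never runs if we crash here
}
```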
> Previously, this exact interface addition was proposed by Yosry[3]. In response,
> Roman instead proposed an interface to specify precise file/anon/slab reclaim
> amounts[4]. More recently Huan proposed this as well[5], and others
> similarly questioned whether this was the proper interface.
> 
> Previous proposals sought to use this to allow proactive reclaimers to
> effectively perform a custom reclaim algorithm by issuing proactive reclaim with
> different settings to control file vs anon reclaim (e.g. to only reclaim anon
> from some applications). Responses argued that adjusting swappiness is a poor
> interface for custom reclaim.
> 
> In contrast, I argue in favor of a swappiness setting not as a way to implement
> custom reclaim algorithms but rather to bias the balance of anon vs file due to
> differences of proactive vs reactive reclaim. In this context, swappiness is the
> existing interface for controlling this balance and this patch simply allows for
> it to be configured differently for proactive vs reactive reclaim.
> 
> Specifying explicit amounts of anon vs file pages to reclaim feels inappropriate
> for this purpose. Proactive reclaimers are unaware of the relative age of file
> vs anon for a cgroup, which makes it difficult to manage proactive reclaim of
> different memory pools. A proactive reclaimer would need to issue some amount of
> anon reclaim attempts separately from its file reclaim attempts, which seems
> brittle given that it's difficult to observe the impact.
> 
> [1]https://www.freedesktop.org/software/systemd/man/latest/systemd-oomd.service.html
> [2]https://github.com/facebookincubator/oomd/blob/main/src/oomd/plugins/Senpai.cpp#L585-L598
> [3]https://lore.kernel.org/linux-mm/CAJD7tkbDpyoODveCsnaqBBMZEkDvshXJmNdbk51yKSNgD7aGdg@mail.gmail.com/
> [4]https://lore.kernel.org/linux-mm/YoPHtHXzpK51F%2F1Z@carbon/
> [5]https://lore.kernel.org/lkml/20231108065818.19932-1-link@vivo.com/
> 
> Dan Schatzberg (2):
>   mm: add defines for min/max swappiness
>   mm: add swapiness= arg to memory.reclaim
> 
>  Documentation/admin-guide/cgroup-v2.rst | 18 +++++---
>  include/linux/swap.h                    |  5 ++-
>  mm/memcontrol.c                         | 58 ++++++++++++++++++++-----
>  mm/vmscan.c                             | 39 ++++++++++++-----
>  4 files changed, 90 insertions(+), 30 deletions(-)
> 
> -- 
> 2.39.3
> 


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 0/2] Add swappiness argument to memory.reclaim
  2024-06-11 19:25 ` [PATCH v6 0/2] Add swappiness argument " Shakeel Butt
@ 2024-06-11 19:31   ` Yosry Ahmed
  2024-06-11 19:48   ` Andrew Morton
  1 sibling, 0 replies; 17+ messages in thread
From: Yosry Ahmed @ 2024-06-11 19:31 UTC (permalink / raw)
  To: Shakeel Butt, Yu Zhao
  Cc: Yue Zhao, Andrew Morton, linux-kernel, cgroups, linux-mm,
	Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	David Hildenbrand, Chris Li, Matthew Wilcox, Kefeng Wang,
	Hugh Dickins, Dan Schatzberg

On Tue, Jun 11, 2024 at 12:25 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> Hi folks,
>
> This series has been in mm-unstable for several months. Are there any
> remaining concerns here? Otherwise, can we please put this in the
> mm-stable branch to be merged in the next Linux release?

+Yu Zhao

I don't think Yu Zhao was correctly CC'd on this :)

>
> On Wed, Jan 03, 2024 at 08:48:35AM GMT, Dan Schatzberg wrote:
> > [...]


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 0/2] Add swappiness argument to memory.reclaim
  2024-06-11 19:25 ` [PATCH v6 0/2] Add swappiness argument " Shakeel Butt
  2024-06-11 19:31   ` Yosry Ahmed
@ 2024-06-11 19:48   ` Andrew Morton
  2024-06-11 22:50     ` Shakeel Butt
  1 sibling, 1 reply; 17+ messages in thread
From: Andrew Morton @ 2024-06-11 19:48 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Yue Zhao, linux-kernel, cgroups, linux-mm, Tejun Heo, Zefan Li,
	Johannes Weiner, Jonathan Corbet, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, David Hildenbrand, Chris Li,
	Matthew Wilcox, Kefeng Wang, Yosry Ahmed, Hugh Dickins,
	Dan Schatzberg, Yu Zhao

On Tue, 11 Jun 2024 12:25:24 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:

> Hi folks,
> 
> This series has been in mm-unstable for several months. Are there any
> remaining concerns here? Otherwise, can we please put this in the
> mm-stable branch to be merged in the next Linux release?

The review didn't go terribly well so I parked the series awaiting more
clarity.  Although on rereading, it seems that Yu Zhao isn't seeing any
blocking issues?



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 0/2] Add swappiness argument to memory.reclaim
  2024-06-11 19:48   ` Andrew Morton
@ 2024-06-11 22:50     ` Shakeel Butt
  2024-06-11 23:10       ` Yu Zhao
  0 siblings, 1 reply; 17+ messages in thread
From: Shakeel Butt @ 2024-06-11 22:50 UTC (permalink / raw)
  To: Andrew Morton, Yu Zhao
  Cc: Yue Zhao, linux-kernel, cgroups, linux-mm, Tejun Heo, Zefan Li,
	Johannes Weiner, Jonathan Corbet, Michal Hocko, Roman Gushchin,
	Shakeel Butt, Muchun Song, David Hildenbrand, Chris Li,
	Matthew Wilcox, Kefeng Wang, Yosry Ahmed, Hugh Dickins,
	Dan Schatzberg, Yu Zhao

On Tue, Jun 11, 2024 at 12:48:07PM GMT, Andrew Morton wrote:
> On Tue, 11 Jun 2024 12:25:24 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> 
> > Hi folks,
> > 
> > This series has been in mm-unstable for several months. Are there any
> > remaining concerns here? Otherwise, can we please put this in the
> > mm-stable branch to be merged in the next Linux release?
> 
> The review didn't go terribly well so I parked the series awaiting more
> clarity.  Although on rereading, it seems that Yu Zhao isn't seeing any
> blocking issues?
> 

Yu, please share if you have any strong concerns about merging this series.


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v6 0/2] Add swappiness argument to memory.reclaim
  2024-06-11 22:50     ` Shakeel Butt
@ 2024-06-11 23:10       ` Yu Zhao
  0 siblings, 0 replies; 17+ messages in thread
From: Yu Zhao @ 2024-06-11 23:10 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Yue Zhao, linux-kernel, cgroups, linux-mm,
	Tejun Heo, Zefan Li, Johannes Weiner, Jonathan Corbet,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	David Hildenbrand, Chris Li, Matthew Wilcox, Kefeng Wang,
	Yosry Ahmed, Hugh Dickins, Dan Schatzberg

On Tue, Jun 11, 2024 at 4:50 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Tue, Jun 11, 2024 at 12:48:07PM GMT, Andrew Morton wrote:
> > On Tue, 11 Jun 2024 12:25:24 -0700 Shakeel Butt <shakeel.butt@linux.dev> wrote:
> >
> > > Hi folks,
> > >
> > > This series has been in mm-unstable for several months. Are there any
> > > remaining concerns here? Otherwise, can we please put this in the
> > > mm-stable branch to be merged in the next Linux release?
> >
> > The review didn't go terribly well so I parked the series awaiting more
> > clarity.  Although on rereading, it seems that Yu Zhao isn't seeing any
> > blocking issues?
> >
>
> Yu, please share if you have any strong concern in merging this series?

I don't remember having any strong concerns. In fact, I don't remember
what I commented on.

Let me go back to the previous discussion and see why it was stalled.
Will get back to you soon.


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2024-06-11 23:10 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-03 16:48 [PATCH v6 0/2] Add swappiness argument to memory.reclaim Dan Schatzberg
2024-01-03 16:48 ` [PATCH v6 1/2] mm: add defines for min/max swappiness Dan Schatzberg
2024-01-03 16:48 ` [PATCH v6 2/2] mm: add swapiness= arg to memory.reclaim Dan Schatzberg
2024-01-03 17:19   ` Yu Zhao
2024-01-03 18:19     ` Dan Schatzberg
2024-01-04  1:07       ` Yu Zhao
2024-01-04  8:48         ` Michal Hocko
2024-01-09 23:54           ` Yu Zhao
2024-01-10 10:32             ` Michal Hocko
2024-01-04  1:17       ` Yu Zhao
2024-01-04 10:09   ` Michal Hocko
2024-01-09 23:57     ` Yu Zhao
2024-06-11 19:25 ` [PATCH v6 0/2] Add swappiness argument " Shakeel Butt
2024-06-11 19:31   ` Yosry Ahmed
2024-06-11 19:48   ` Andrew Morton
2024-06-11 22:50     ` Shakeel Butt
2024-06-11 23:10       ` Yu Zhao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox