linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [RFC 0/5] add option to restore swap account to cgroupv1 mode
@ 2025-03-19  6:41 Jingxiang Zeng
  2025-03-19  6:41 ` [RFC 1/5] Kconfig: add SWAP_CHARGE_V1_MODE config Jingxiang Zeng
                   ` (6 more replies)
  0 siblings, 7 replies; 28+ messages in thread
From: Jingxiang Zeng @ 2025-03-19  6:41 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, cgroups, linux-kernel, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, kasong, Zeng Jingxiang

From: Zeng Jingxiang <linuszeng@tencent.com>

memsw account is a very useful knob for container memory
overcommitting: It's a great abstraction of the "expected total
memory usage" of a container, so containers can't allocate too
much memory using SWAP, but still be able to SWAP out.

For a simple example, with memsw.limit == memory.limit, containers
can't exceed their original memory limit, even with SWAP enabled, they
get OOM killed as how they used to, but the host is now able to
offload cold pages.

Similar ability seems absent with V2: With memory.swap.max == 0, the
host can't use SWAP to reclaim container memory at all. But with a
value larger than that, containers are able to overuse memory, causing
delayed OOM kill, thrashing, CPU/Memory usage ratio could be heavily
out of balance, especially with compress SWAP backends.

This patch set adds two interfaces to control the behavior of the
memory.swap.max/current in cgroupv2:

CONFIG_MEMSW_ACCOUNT_ON_DFL
cgroup.memsw_account_on_dfl={0, 1}

When one of the interfaces is enabled: memory.swap.current and
memory.swap.max represents the usage/limit of swap.
When neither is enabled (default behavior),memory.swap.current and
memory.swap.max represents the usage/limit of memory+swap.

As discussed in [1], this patch set can change the semantics of
of memory.swap.max/current to the v1-like behavior.

Link:
https://lore.kernel.org/all/Zk-fQtFrj-2YDJOo@P9FQF9L96D.corp.robot.car/ [1]

linuszeng (5):
  Kconfig: add SWAP_CHARGE_V1_MODE config
  memcontrol: add boot option to enable memsw account on dfl
  mm/memcontrol: do not scan anon pages if memsw limit is hit
  mm/memcontrol: allow memsw account in cgroup v2
  Docs/cgroup-v2: add cgroup.memsw_account_on_dfl Documentation

 Documentation/admin-guide/cgroup-v2.rst       | 21 +++++--
 .../admin-guide/kernel-parameters.txt         |  7 +++
 include/linux/memcontrol.h                    |  8 +++
 init/Kconfig                                  | 16 ++++++
 mm/memcontrol-v1.c                            |  2 +-
 mm/memcontrol-v1.h                            |  4 +-
 mm/memcontrol.c                               | 55 ++++++++++++++-----
 7 files changed, 93 insertions(+), 20 deletions(-)

-- 
2.41.1



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC 1/5] Kconfig: add SWAP_CHARGE_V1_MODE config
  2025-03-19  6:41 [RFC 0/5] add option to restore swap account to cgroupv1 mode Jingxiang Zeng
@ 2025-03-19  6:41 ` Jingxiang Zeng
  2025-03-19 19:29   ` Shakeel Butt
  2025-03-19  6:41 ` [RFC 2/5] memcontrol: add boot option to enable memsw account on dfl Jingxiang Zeng
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Jingxiang Zeng @ 2025-03-19  6:41 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, cgroups, linux-kernel, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, kasong, Zeng Jingxiang

From: Zeng Jingxiang <linuszeng@tencent.com>

Added SWAP_CHARGE_V1_MODE config, which is disabled by default.
When enabled in cgroupv2 mode, the memory accounting method of
swap will be restored to cgroupv1 mode.

Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
---
 include/linux/memcontrol.h |  6 ++++++
 init/Kconfig               | 16 ++++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 53364526d877..dcb087ee6e8d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -62,6 +62,12 @@ struct mem_cgroup_reclaim_cookie {
 
 #ifdef CONFIG_MEMCG
 
+/* Whether enable memory+swap account in cgroupv2 */
+static inline bool do_memsw_account_on_dfl(void)
+{
+	return IS_ENABLED(CONFIG_MEMSW_ACCOUNT_ON_DFL);
+}
+
 #define MEM_CGROUP_ID_SHIFT	16
 
 struct mem_cgroup_id {
diff --git a/init/Kconfig b/init/Kconfig
index 7f67d8942a09..669e39214244 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1012,6 +1012,22 @@ config MEMCG_V1
 
 	  Say N if unsure.
 
+config MEMSW_ACCOUNT_ON_DFL
+	bool "Whether enable memory+swap account in cgroup v2"
+	depends on MEMCG && MEMCG_V1
+	default n
+	help
+	  Say Y here to enable memory+swap account in cgroup v2. Enabling this
+	  option means that the semantics of memory.swap.max will align with
+	  memory.memsw.limit_in_bytes, and memory.swap.current will align with
+	  memory.memsw.usage_in_bytes.
+	  This is particularly useful for workloads that require strict memory
+	  and swap limits.
+
+	  If you are unsure whether to enable this option, it is recommended
+	  to leave it disabled (N) unless you specifically need memory and swap
+	  accounting features in your cgroup v2 setup.
+
 config BLK_CGROUP
 	bool "IO controller"
 	depends on BLOCK
-- 
2.41.1



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC 2/5] memcontrol: add boot option to enable memsw account on dfl
  2025-03-19  6:41 [RFC 0/5] add option to restore swap account to cgroupv1 mode Jingxiang Zeng
  2025-03-19  6:41 ` [RFC 1/5] Kconfig: add SWAP_CHARGE_V1_MODE config Jingxiang Zeng
@ 2025-03-19  6:41 ` Jingxiang Zeng
  2025-03-19 19:34   ` Shakeel Butt
  2025-03-19  6:41 ` [RFC 3/5] mm/memcontrol: do not scan anon pages if memsw limit is hit Jingxiang Zeng
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Jingxiang Zeng @ 2025-03-19  6:41 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, cgroups, linux-kernel, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, kasong, Zeng Jingxiang

From: Zeng Jingxiang <linuszeng@tencent.com>

Added cgroup.memsw_account_on_dfl startup parameter, which
is off by default. When enabled in cgroupv2 mode, the memory
accounting mode of swap will be reverted to cgroupv1 mode.

Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
---
 include/linux/memcontrol.h |  4 +++-
 mm/memcontrol.c            | 11 +++++++++++
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index dcb087ee6e8d..96f2fad1c351 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -62,10 +62,12 @@ struct mem_cgroup_reclaim_cookie {
 
 #ifdef CONFIG_MEMCG
 
+DECLARE_STATIC_KEY_FALSE(memsw_account_on_dfl);
 /* Whether enable memory+swap account in cgroupv2 */
 static inline bool do_memsw_account_on_dfl(void)
 {
-	return IS_ENABLED(CONFIG_MEMSW_ACCOUNT_ON_DFL);
+	return IS_ENABLED(CONFIG_MEMSW_ACCOUNT_ON_DFL)
+				|| static_branch_unlikely(&memsw_account_on_dfl);
 }
 
 #define MEM_CGROUP_ID_SHIFT	16
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 768d6b15dbfa..c1171fb2bfd6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5478,3 +5478,14 @@ static int __init mem_cgroup_swap_init(void)
 subsys_initcall(mem_cgroup_swap_init);
 
 #endif /* CONFIG_SWAP */
+
+DEFINE_STATIC_KEY_FALSE(memsw_account_on_dfl);
+static int __init memsw_account_on_dfl_setup(char *s)
+{
+	if (!strcmp(s, "1"))
+		static_branch_enable(&memsw_account_on_dfl);
+	else if (!strcmp(s, "0"))
+		static_branch_disable(&memsw_account_on_dfl);
+	return 1;
+}
+__setup("cgroup.memsw_account_on_dfl=", memsw_account_on_dfl_setup);
+
\ No newline at end of file
-- 
2.41.1



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC 3/5] mm/memcontrol: do not scan anon pages if memsw limit is hit
  2025-03-19  6:41 [RFC 0/5] add option to restore swap account to cgroupv1 mode Jingxiang Zeng
  2025-03-19  6:41 ` [RFC 1/5] Kconfig: add SWAP_CHARGE_V1_MODE config Jingxiang Zeng
  2025-03-19  6:41 ` [RFC 2/5] memcontrol: add boot option to enable memsw account on dfl Jingxiang Zeng
@ 2025-03-19  6:41 ` Jingxiang Zeng
  2025-03-19 19:36   ` Shakeel Butt
  2025-03-19  6:41 ` [RFC 4/5] mm/memcontrol: allow memsw account in cgroup v2 Jingxiang Zeng
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 28+ messages in thread
From: Jingxiang Zeng @ 2025-03-19  6:41 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, cgroups, linux-kernel, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, kasong, Zeng Jingxiang

From: Zeng Jingxiang <linuszeng@tencent.com>

When memory recycling is triggered by the hard watermark of
memsw, anonymous pages do not want to be recycled any further.
This is consistent with the processing method of cgroup v2.

Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
---
 mm/memcontrol.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c1171fb2bfd6..623ebf610946 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5072,14 +5072,21 @@ void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
 
 long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
 {
+	struct page_counter *pg_counter;
 	long nr_swap_pages = get_nr_swap_pages();
 
-	if (mem_cgroup_disabled() || do_memsw_account())
+	if (mem_cgroup_disabled())
 		return nr_swap_pages;
-	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg))
+	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
+		if (do_memsw_account())
+			pg_counter = &memcg->memsw;
+		else
+			pg_counter = &memcg->swap;
+
 		nr_swap_pages = min_t(long, nr_swap_pages,
-				      READ_ONCE(memcg->swap.max) -
-				      page_counter_read(&memcg->swap));
+				      READ_ONCE(pg_counter->max) -
+				      page_counter_read(pg_counter));
+	}
 	return nr_swap_pages;
 }
 
-- 
2.41.1



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC 4/5] mm/memcontrol: allow memsw account in cgroup v2
  2025-03-19  6:41 [RFC 0/5] add option to restore swap account to cgroupv1 mode Jingxiang Zeng
                   ` (2 preceding siblings ...)
  2025-03-19  6:41 ` [RFC 3/5] mm/memcontrol: do not scan anon pages if memsw limit is hit Jingxiang Zeng
@ 2025-03-19  6:41 ` Jingxiang Zeng
  2025-03-19  6:41 ` [RFC 5/5] Docs/cgroup-v2: add cgroup.memsw_account_on_dfl Documentation Jingxiang Zeng
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 28+ messages in thread
From: Jingxiang Zeng @ 2025-03-19  6:41 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, cgroups, linux-kernel, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, kasong, Zeng Jingxiang

From: Zeng Jingxiang <linuszeng@tencent.com>

memsw account is a very useful knob for container memory
overcommitting: It's a great abstraction of the "expected total
memory usage" of a container, so containers can't allocate too
much memory using SWAP, but still be able to SWAP out.

For a simple example, with memsw.limit == memory.limit, containers
can't exceed their original memory limit, even with SWAP enabled, they
get OOM killed as how they used to, but the host is now able to
offload cold pages.

Similar ability seems absent with V2: With memory.swap.max == 0, the
host can't use SWAP to reclaim container memory at all. But with a
value larger than that, containers are able to overuse memory, causing
delayed OOM kill, thrashing, CPU/Memory usage ratio could be heavily
out of balance, especially with compress SWAP backends.

This patch restores the semantics of memory.swap.max to be consistent
with memory.memsw.limit_in_bytes and the semantics of
memory.swap.current to be consistent with memory.memsw.usage_in_bytes
when MEMSW_ACCOUNT_ON_DFL config or cgroup.memsw_account_on_dfl
startup parameter is enabled.

Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
---
 mm/memcontrol-v1.c |  2 +-
 mm/memcontrol-v1.h |  4 +++-
 mm/memcontrol.c    | 29 +++++++++++++++++++----------
 3 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index c1feb3945350..3344d5e25822 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -1436,7 +1436,7 @@ void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked)
 
 static DEFINE_MUTEX(memcg_max_mutex);
 
-static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
+int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 				 unsigned long max, bool memsw)
 {
 	bool enlarge = false;
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index 6358464bb416..7f7ef9f6d03e 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -36,10 +36,12 @@ struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg);
 /* Cgroup v1-specific declarations */
 #ifdef CONFIG_MEMCG_V1
 
+int mem_cgroup_resize_max(struct mem_cgroup *memcg,
+				 unsigned long max, bool memsw);
 /* Whether legacy memory+swap accounting is active */
 static inline bool do_memsw_account(void)
 {
-	return !cgroup_subsys_on_dfl(memory_cgrp_subsys);
+	return !cgroup_subsys_on_dfl(memory_cgrp_subsys) || do_memsw_account_on_dfl();
 }
 
 unsigned long memcg_events_local(struct mem_cgroup *memcg, int event);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 623ebf610946..d85699fa8a90 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5205,9 +5205,12 @@ static ssize_t swap_max_write(struct kernfs_open_file *of,
 	if (err)
 		return err;
 
-	xchg(&memcg->swap.max, max);
+	if (do_memsw_account_on_dfl())
+		err = mem_cgroup_resize_max(memcg, max, true);
+	else
+		xchg(&memcg->swap.max, max);
 
-	return nbytes;
+	return err ?: nbytes;
 }
 
 static int swap_events_show(struct seq_file *m, void *v)
@@ -5224,24 +5227,28 @@ static int swap_events_show(struct seq_file *m, void *v)
 	return 0;
 }
 
-static struct cftype swap_files[] = {
+static struct cftype swap_files_v1[] = {
 	{
 		.name = "swap.current",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.read_u64 = swap_current_read,
 	},
-	{
-		.name = "swap.high",
-		.flags = CFTYPE_NOT_ON_ROOT,
-		.seq_show = swap_high_show,
-		.write = swap_high_write,
-	},
 	{
 		.name = "swap.max",
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = swap_max_show,
 		.write = swap_max_write,
 	},
+	{ }	/* terminate */
+};
+
+static struct cftype swap_files[] = {
+	{
+		.name = "swap.high",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = swap_high_show,
+		.write = swap_high_write,
+	},
 	{
 		.name = "swap.max.effective",
 		.flags = CFTYPE_NOT_ON_ROOT,
@@ -5473,7 +5480,9 @@ static int __init mem_cgroup_swap_init(void)
 	if (mem_cgroup_disabled())
 		return 0;
 
-	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files));
+	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files_v1));
+	if (!do_memsw_account_on_dfl())
+		WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files));
 #ifdef CONFIG_MEMCG_V1
 	WARN_ON(cgroup_add_legacy_cftypes(&memory_cgrp_subsys, memsw_files));
 #endif
-- 
2.41.1



^ permalink raw reply	[flat|nested] 28+ messages in thread

* [RFC 5/5] Docs/cgroup-v2: add cgroup.memsw_account_on_dfl Documentation
  2025-03-19  6:41 [RFC 0/5] add option to restore swap account to cgroupv1 mode Jingxiang Zeng
                   ` (3 preceding siblings ...)
  2025-03-19  6:41 ` [RFC 4/5] mm/memcontrol: allow memsw account in cgroup v2 Jingxiang Zeng
@ 2025-03-19  6:41 ` Jingxiang Zeng
  2025-03-19 19:27 ` [RFC 0/5] add option to restore swap account to cgroupv1 mode Shakeel Butt
  2025-03-19 19:38 ` Johannes Weiner
  6 siblings, 0 replies; 28+ messages in thread
From: Jingxiang Zeng @ 2025-03-19  6:41 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, cgroups, linux-kernel, hannes, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, kasong, Zeng Jingxiang

From: Zeng Jingxiang <linuszeng@tencent.com>

Add documentation descriptions for cgroup.memsw_account_on_dfl and
CONFIG_MEMSW_ACCOUNT_ON_DFL.

Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
---
 Documentation/admin-guide/cgroup-v2.rst       | 21 +++++++++++++++----
 .../admin-guide/kernel-parameters.txt         |  7 +++++++
 2 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 53ada5c2620a..58869279f0a9 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1704,8 +1704,14 @@ The following nested keys are defined.
 	A read-only single value file which exists on non-root
 	cgroups.
 
-	The total amount of swap currently being used by the cgroup
-	and its descendants.
+	If memsw_account_on_dfl and CONFIG_MEMSW_ACCOUNT_ON_DFL is not
+	enabled (the default), this interface represents the total
+	amount of swap currently being used by the cgroup and its
+	descendants.
+
+	If memsw_account_on_dfl or CONFIG_MEMSW_ACCOUNT_ON_DFL is enabled,
+	this interface represents the total amount of memory+swap
+	currently being used by the cgroup and its descendants.
 
   memory.swap.high
 	A read-write single value file which exists on non-root
@@ -1737,8 +1743,15 @@ The following nested keys are defined.
 	A read-write single value file which exists on non-root
 	cgroups.  The default is "max".
 
-	Swap usage hard limit.  If a cgroup's swap usage reaches this
-	limit, anonymous memory of the cgroup will not be swapped out.
+	If memsw_account_on_dfl and CONFIG_MEMSW_ACCOUNT_ON_DFL is not
+	enabled (the default), this interface represents the hard limit
+	of swap usage. If a cgroup's swap usage reaches this limit,
+	anonymous memory of the cgroup will not be swapped out.
+
+	If memsw_account_on_dfl or CONFIG_MEMSW_ACCOUNT_ON_DFL is enabled,
+	this interface represents the hard limit of memory+swap usage.
+	If a cgroup's memory+swap usage reaches this limit, anonymous
+	memory of the cgroup will not be swapped out.
 
   memory.swap.max.effective
 	A read-only single value file which exists on non-root cgroups.
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 2758bc124f16..0aa9e4f85b51 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -644,6 +644,13 @@
 			nokmem -- Disable kernel memory accounting.
 			nobpf -- Disable BPF memory accounting.
 
+	cgroup.memsw_account_on_dfl= [KNL] Enable memory+swap account on cgroupv2.
+			Format: { "0" | "1" }
+			0 - memory.swap.current and memory.swap.max represents the
+				usage/limit of swap.
+			1 - memory.swap.current and memory.swap.max represents the
+				usage/limit of memory+swap.
+
 	checkreqprot=	[SELINUX] Set initial checkreqprot flag value.
 			Format: { "0" | "1" }
 			See security/selinux/Kconfig help text.
-- 
2.41.1



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/5] add option to restore swap account to cgroupv1 mode
  2025-03-19  6:41 [RFC 0/5] add option to restore swap account to cgroupv1 mode Jingxiang Zeng
                   ` (4 preceding siblings ...)
  2025-03-19  6:41 ` [RFC 5/5] Docs/cgroup-v2: add cgroup.memsw_account_on_dfl Documentation Jingxiang Zeng
@ 2025-03-19 19:27 ` Shakeel Butt
  2025-03-19 19:38 ` Johannes Weiner
  6 siblings, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2025-03-19 19:27 UTC (permalink / raw)
  To: Jingxiang Zeng
  Cc: akpm, linux-mm, cgroups, linux-kernel, hannes, mhocko,
	roman.gushchin, muchun.song, kasong, Tejun Heo

+Tejun

Hi Zeng,

On Wed, Mar 19, 2025 at 02:41:43PM +0800, Jingxiang Zeng wrote:
> From: Zeng Jingxiang <linuszeng@tencent.com>
> 
> memsw account is a very useful knob for container memory
> overcommitting: It's a great abstraction of the "expected total
> memory usage" of a container, so containers can't allocate too
> much memory using SWAP, but still be able to SWAP out.
> 
> For a simple example, with memsw.limit == memory.limit, containers
> can't exceed their original memory limit, even with SWAP enabled, they
> get OOM killed as how they used to, but the host is now able to
> offload cold pages.
> 
> Similar ability seems absent with V2: With memory.swap.max == 0, the
> host can't use SWAP to reclaim container memory at all. But with a
> value larger than that, containers are able to overuse memory, causing
> delayed OOM kill, thrashing, CPU/Memory usage ratio could be heavily
> out of balance, especially with compress SWAP backends.
> 
> This patch set adds two interfaces to control the behavior of the
> memory.swap.max/current in cgroupv2:
> 
> CONFIG_MEMSW_ACCOUNT_ON_DFL
> cgroup.memsw_account_on_dfl={0, 1}
> 
> When one of the interfaces is enabled: memory.swap.current and
> memory.swap.max represents the usage/limit of swap.
> When neither is enabled (default behavior),memory.swap.current and
> memory.swap.max represents the usage/limit of memory+swap.
> 
> As discussed in [1], this patch set can change the semantics of
> of memory.swap.max/current to the v1-like behavior.
> 
> Link:
> https://lore.kernel.org/all/Zk-fQtFrj-2YDJOo@P9FQF9L96D.corp.robot.car/ [1]

Overall I don't have objection but I would like to keep the changes
separate from v2 code as much as possible.

More specifically:

1. Keep CONFIG_MEMSW_ACCOUNT_ON_DFL dependent on CONFIG_MEMCG_V1 and
   disabled by default (as you already did).

2. Keep the changes in memcontrol-v1.[h|c] as much as possible.

I will go over the patches but let's see what others have to say.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 1/5] Kconfig: add SWAP_CHARGE_V1_MODE config
  2025-03-19  6:41 ` [RFC 1/5] Kconfig: add SWAP_CHARGE_V1_MODE config Jingxiang Zeng
@ 2025-03-19 19:29   ` Shakeel Butt
  2025-03-19 19:31     ` Shakeel Butt
  0 siblings, 1 reply; 28+ messages in thread
From: Shakeel Butt @ 2025-03-19 19:29 UTC (permalink / raw)
  To: Jingxiang Zeng
  Cc: akpm, linux-mm, cgroups, linux-kernel, hannes, mhocko,
	roman.gushchin, muchun.song, kasong

On Wed, Mar 19, 2025 at 02:41:44PM +0800, Jingxiang Zeng wrote:
> From: Zeng Jingxiang <linuszeng@tencent.com>
> 
> Added SWAP_CHARGE_V1_MODE config, which is disabled by default.
> When enabled in cgroupv2 mode, the memory accounting method of
> swap will be restored to cgroupv1 mode.
> 
> Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> ---
>  include/linux/memcontrol.h |  6 ++++++
>  init/Kconfig               | 16 ++++++++++++++++
>  2 files changed, 22 insertions(+)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 53364526d877..dcb087ee6e8d 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -62,6 +62,12 @@ struct mem_cgroup_reclaim_cookie {
>  
>  #ifdef CONFIG_MEMCG
>  
> +/* Whether enable memory+swap account in cgroupv2 */
> +static inline bool do_memsw_account_on_dfl(void)
> +{
> +	return IS_ENABLED(CONFIG_MEMSW_ACCOUNT_ON_DFL);
> +}
> +

Please move the above to memcontrol-v1.h file.

>  #define MEM_CGROUP_ID_SHIFT	16
>  
>  struct mem_cgroup_id {
> diff --git a/init/Kconfig b/init/Kconfig
> index 7f67d8942a09..669e39214244 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1012,6 +1012,22 @@ config MEMCG_V1
>  
>  	  Say N if unsure.
>  
> +config MEMSW_ACCOUNT_ON_DFL
> +	bool "Whether enable memory+swap account in cgroup v2"
> +	depends on MEMCG && MEMCG_V1
> +	default n
> +	help
> +	  Say Y here to enable memory+swap account in cgroup v2. Enabling this
> +	  option means that the semantics of memory.swap.max will align with
> +	  memory.memsw.limit_in_bytes, and memory.swap.current will align with
> +	  memory.memsw.usage_in_bytes.
> +	  This is particularly useful for workloads that require strict memory
> +	  and swap limits.
> +
> +	  If you are unsure whether to enable this option, it is recommended
> +	  to leave it disabled (N) unless you specifically need memory and swap
> +	  accounting features in your cgroup v2 setup.
> +
>  config BLK_CGROUP
>  	bool "IO controller"
>  	depends on BLOCK
> -- 
> 2.41.1
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 1/5] Kconfig: add SWAP_CHARGE_V1_MODE config
  2025-03-19 19:29   ` Shakeel Butt
@ 2025-03-19 19:31     ` Shakeel Butt
  0 siblings, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2025-03-19 19:31 UTC (permalink / raw)
  To: Jingxiang Zeng
  Cc: akpm, linux-mm, cgroups, linux-kernel, hannes, mhocko,
	roman.gushchin, muchun.song, kasong

On Wed, Mar 19, 2025 at 12:29:11PM -0700, Shakeel Butt wrote:
> On Wed, Mar 19, 2025 at 02:41:44PM +0800, Jingxiang Zeng wrote:
> > From: Zeng Jingxiang <linuszeng@tencent.com>
> > 
> > Added SWAP_CHARGE_V1_MODE config, which is disabled by default.
> > When enabled in cgroupv2 mode, the memory accounting method of
> > swap will be restored to cgroupv1 mode.
> > 
> > Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> > ---
> >  include/linux/memcontrol.h |  6 ++++++
> >  init/Kconfig               | 16 ++++++++++++++++
> >  2 files changed, 22 insertions(+)
> > 
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 53364526d877..dcb087ee6e8d 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -62,6 +62,12 @@ struct mem_cgroup_reclaim_cookie {
> >  
> >  #ifdef CONFIG_MEMCG
> >  
> > +/* Whether enable memory+swap account in cgroupv2 */
> > +static inline bool do_memsw_account_on_dfl(void)
> > +{
> > +	return IS_ENABLED(CONFIG_MEMSW_ACCOUNT_ON_DFL);
> > +}
> > +
> 
> Please move the above to memcontrol-v1.h file.
> 

And under CONFIG_MEMCG_V1 similar to do_memsw_account().


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 2/5] memcontrol: add boot option to enable memsw account on dfl
  2025-03-19  6:41 ` [RFC 2/5] memcontrol: add boot option to enable memsw account on dfl Jingxiang Zeng
@ 2025-03-19 19:34   ` Shakeel Butt
  2025-03-19 22:30     ` Roman Gushchin
  2025-03-20  8:51     ` jingxiang zeng
  0 siblings, 2 replies; 28+ messages in thread
From: Shakeel Butt @ 2025-03-19 19:34 UTC (permalink / raw)
  To: Jingxiang Zeng
  Cc: akpm, linux-mm, cgroups, linux-kernel, hannes, mhocko,
	roman.gushchin, muchun.song, kasong

On Wed, Mar 19, 2025 at 02:41:45PM +0800, Jingxiang Zeng wrote:
> From: Zeng Jingxiang <linuszeng@tencent.com>
> 
> Added cgroup.memsw_account_on_dfl startup parameter, which
> is off by default. When enabled in cgroupv2 mode, the memory
> accounting mode of swap will be reverted to cgroupv1 mode.
> 
> Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> ---
>  include/linux/memcontrol.h |  4 +++-
>  mm/memcontrol.c            | 11 +++++++++++
>  2 files changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index dcb087ee6e8d..96f2fad1c351 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -62,10 +62,12 @@ struct mem_cgroup_reclaim_cookie {
>  
>  #ifdef CONFIG_MEMCG
>  
> +DECLARE_STATIC_KEY_FALSE(memsw_account_on_dfl);
>  /* Whether enable memory+swap account in cgroupv2 */
>  static inline bool do_memsw_account_on_dfl(void)
>  {
> -	return IS_ENABLED(CONFIG_MEMSW_ACCOUNT_ON_DFL);
> +	return IS_ENABLED(CONFIG_MEMSW_ACCOUNT_ON_DFL)
> +				|| static_branch_unlikely(&memsw_account_on_dfl);

Why || in above condition? Shouldn't it be && ?

>  }
>  
>  #define MEM_CGROUP_ID_SHIFT	16
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 768d6b15dbfa..c1171fb2bfd6 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5478,3 +5478,14 @@ static int __init mem_cgroup_swap_init(void)
>  subsys_initcall(mem_cgroup_swap_init);
>  
>  #endif /* CONFIG_SWAP */
> +
> +DEFINE_STATIC_KEY_FALSE(memsw_account_on_dfl);
> +static int __init memsw_account_on_dfl_setup(char *s)
> +{
> +	if (!strcmp(s, "1"))
> +		static_branch_enable(&memsw_account_on_dfl);
> +	else if (!strcmp(s, "0"))
> +		static_branch_disable(&memsw_account_on_dfl);
> +	return 1;
> +}
> +__setup("cgroup.memsw_account_on_dfl=", memsw_account_on_dfl_setup);

Please keep the above in memcontrol-v1.c

> +
> \ No newline at end of file
> -- 
> 2.41.1
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 3/5] mm/memcontrol: do not scan anon pages if memsw limit is hit
  2025-03-19  6:41 ` [RFC 3/5] mm/memcontrol: do not scan anon pages if memsw limit is hit Jingxiang Zeng
@ 2025-03-19 19:36   ` Shakeel Butt
  2025-03-20  8:40     ` jingxiang zeng
  0 siblings, 1 reply; 28+ messages in thread
From: Shakeel Butt @ 2025-03-19 19:36 UTC (permalink / raw)
  To: Jingxiang Zeng
  Cc: akpm, linux-mm, cgroups, linux-kernel, hannes, mhocko,
	roman.gushchin, muchun.song, kasong

On Wed, Mar 19, 2025 at 02:41:46PM +0800, Jingxiang Zeng wrote:
> From: Zeng Jingxiang <linuszeng@tencent.com>
> 
> When memory recycling is triggered by the hard watermark of

What is hard watermark?

> memsw, anonymous pages do not want to be recycled any further.
> This is consistent with the processing method of cgroup v2.
> 
> Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>

Is this patch orthogonal to the series or is it needed for v1 as well?

> ---
>  mm/memcontrol.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c1171fb2bfd6..623ebf610946 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5072,14 +5072,21 @@ void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
>  
>  long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
>  {
> +	struct page_counter *pg_counter;
>  	long nr_swap_pages = get_nr_swap_pages();
>  
> -	if (mem_cgroup_disabled() || do_memsw_account())
> +	if (mem_cgroup_disabled())
>  		return nr_swap_pages;
> -	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg))
> +	for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
> +		if (do_memsw_account())
> +			pg_counter = &memcg->memsw;
> +		else
> +			pg_counter = &memcg->swap;
> +
>  		nr_swap_pages = min_t(long, nr_swap_pages,
> -				      READ_ONCE(memcg->swap.max) -
> -				      page_counter_read(&memcg->swap));
> +				      READ_ONCE(pg_counter->max) -
> +				      page_counter_read(pg_counter));
> +	}
>  	return nr_swap_pages;
>  }
>  
> -- 
> 2.41.1
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/5] add option to restore swap account to cgroupv1 mode
  2025-03-19  6:41 [RFC 0/5] add option to restore swap account to cgroupv1 mode Jingxiang Zeng
                   ` (5 preceding siblings ...)
  2025-03-19 19:27 ` [RFC 0/5] add option to restore swap account to cgroupv1 mode Shakeel Butt
@ 2025-03-19 19:38 ` Johannes Weiner
  2025-03-19 19:51   ` Shakeel Butt
  2025-03-20  8:09   ` jingxiang zeng
  6 siblings, 2 replies; 28+ messages in thread
From: Johannes Weiner @ 2025-03-19 19:38 UTC (permalink / raw)
  To: Jingxiang Zeng
  Cc: akpm, linux-mm, cgroups, linux-kernel, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, kasong, Zeng Jingxiang

On Wed, Mar 19, 2025 at 02:41:43PM +0800, Jingxiang Zeng wrote:
> From: Zeng Jingxiang <linuszeng@tencent.com>
> 
> memsw account is a very useful knob for container memory
> overcommitting: It's a great abstraction of the "expected total
> memory usage" of a container, so containers can't allocate too
> much memory using SWAP, but still be able to SWAP out.
> 
> For a simple example, with memsw.limit == memory.limit, containers
> can't exceed their original memory limit, even with SWAP enabled, they
> get OOM killed as how they used to, but the host is now able to
> offload cold pages.
> 
> Similar ability seems absent with V2: With memory.swap.max == 0, the
> host can't use SWAP to reclaim container memory at all. But with a
> value larger than that, containers are able to overuse memory, causing
> delayed OOM kill, thrashing, CPU/Memory usage ratio could be heavily
> out of balance, especially with compress SWAP backends.
> 
> This patch set adds two interfaces to control the behavior of the
> memory.swap.max/current in cgroupv2:
> 
> CONFIG_MEMSW_ACCOUNT_ON_DFL
> cgroup.memsw_account_on_dfl={0, 1}
> 
> When one of the interfaces is enabled: memory.swap.current and
> memory.swap.max represents the usage/limit of swap.
> When neither is enabled (default behavior),memory.swap.current and
> memory.swap.max represents the usage/limit of memory+swap.

This should be new knobs, e.g. memory.memsw.current, memory.memsw.max.

Overloading the existing swap knobs is confusing.

And there doesn't seem to be a good reason to make the behavior
either-or anyway. If memory.swap.max=max (default), it won't interfere
with the memsw operation. And it's at least conceivable somebody might
want to set both, memsw.max > swap.max, to get some flexibility while
excluding the craziest edge cases.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/5] add option to restore swap account to cgroupv1 mode
  2025-03-19 19:38 ` Johannes Weiner
@ 2025-03-19 19:51   ` Shakeel Butt
  2025-03-20  8:09   ` jingxiang zeng
  1 sibling, 0 replies; 28+ messages in thread
From: Shakeel Butt @ 2025-03-19 19:51 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Jingxiang Zeng, akpm, linux-mm, cgroups, linux-kernel, mhocko,
	roman.gushchin, muchun.song, kasong, Zeng Jingxiang

On Wed, Mar 19, 2025 at 03:38:38PM -0400, Johannes Weiner wrote:
> On Wed, Mar 19, 2025 at 02:41:43PM +0800, Jingxiang Zeng wrote:
> > From: Zeng Jingxiang <linuszeng@tencent.com>
> > 
> > memsw account is a very useful knob for container memory
> > overcommitting: It's a great abstraction of the "expected total
> > memory usage" of a container, so containers can't allocate too
> > much memory using SWAP, but still be able to SWAP out.
> > 
> > For a simple example, with memsw.limit == memory.limit, containers
> > can't exceed their original memory limit, even with SWAP enabled, they
> > get OOM killed as how they used to, but the host is now able to
> > offload cold pages.
> > 
> > Similar ability seems absent with V2: With memory.swap.max == 0, the
> > host can't use SWAP to reclaim container memory at all. But with a
> > value larger than that, containers are able to overuse memory, causing
> > delayed OOM kill, thrashing, CPU/Memory usage ratio could be heavily
> > out of balance, especially with compress SWAP backends.
> > 
> > This patch set adds two interfaces to control the behavior of the
> > memory.swap.max/current in cgroupv2:
> > 
> > CONFIG_MEMSW_ACCOUNT_ON_DFL
> > cgroup.memsw_account_on_dfl={0, 1}
> > 
> > When one of the interfaces is enabled: memory.swap.current and
> > memory.swap.max represents the usage/limit of swap.
> > When neither is enabled (default behavior),memory.swap.current and
> > memory.swap.max represents the usage/limit of memory+swap.
> 
> This should be new knobs, e.g. memory.memsw.current, memory.memsw.max.
> 
> Overloading the existing swap knobs is confusing.
> 
> And there doesn't seem to be a good reason to make the behavior
> either-or anyway. If memory.swap.max=max (default), it won't interfere
> with the memsw operation. And it's at least conceivable somebody might
> want to set both, memsw.max > swap.max, to get some flexibility while
> excluding the craziest edge cases.

At the moment memsw and swap shares the underlying page_counter. This
would require having explicit page_counter for memsw.

What's your take on memsw interfaces still behind
CONFIG_MEMSW_ACCOUNT_ON_DFL?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 2/5] memcontrol: add boot option to enable memsw account on dfl
  2025-03-19 19:34   ` Shakeel Butt
@ 2025-03-19 22:30     ` Roman Gushchin
  2025-03-20  8:43       ` jingxiang zeng
  2025-03-20 14:28       ` Johannes Weiner
  2025-03-20  8:51     ` jingxiang zeng
  1 sibling, 2 replies; 28+ messages in thread
From: Roman Gushchin @ 2025-03-19 22:30 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Jingxiang Zeng, akpm, linux-mm, cgroups, linux-kernel, hannes,
	mhocko, muchun.song, kasong

Shakeel Butt <shakeel.butt@linux.dev> writes:

> On Wed, Mar 19, 2025 at 02:41:45PM +0800, Jingxiang Zeng wrote:
>> From: Zeng Jingxiang <linuszeng@tencent.com>
>> 
>> Added cgroup.memsw_account_on_dfl startup parameter, which
>> is off by default. When enabled in cgroupv2 mode, the memory
>> accounting mode of swap will be reverted to cgroupv1 mode.
>> 
>> Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
>> ---
>>  include/linux/memcontrol.h |  4 +++-
>>  mm/memcontrol.c            | 11 +++++++++++
>>  2 files changed, 14 insertions(+), 1 deletion(-)
>> 
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index dcb087ee6e8d..96f2fad1c351 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -62,10 +62,12 @@ struct mem_cgroup_reclaim_cookie {
>>  
>>  #ifdef CONFIG_MEMCG
>>  
>> +DECLARE_STATIC_KEY_FALSE(memsw_account_on_dfl);
>>  /* Whether enable memory+swap account in cgroupv2 */
>>  static inline bool do_memsw_account_on_dfl(void)
>>  {
>> -	return IS_ENABLED(CONFIG_MEMSW_ACCOUNT_ON_DFL);
>> +	return IS_ENABLED(CONFIG_MEMSW_ACCOUNT_ON_DFL)
>> +				|| static_branch_unlikely(&memsw_account_on_dfl);
>
> Why || in above condition? Shouldn't it be && ?
>
>>  }
>>  
>>  #define MEM_CGROUP_ID_SHIFT	16
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index 768d6b15dbfa..c1171fb2bfd6 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -5478,3 +5478,14 @@ static int __init mem_cgroup_swap_init(void)
>>  subsys_initcall(mem_cgroup_swap_init);
>>  
>>  #endif /* CONFIG_SWAP */
>> +
>> +DEFINE_STATIC_KEY_FALSE(memsw_account_on_dfl);
>> +static int __init memsw_account_on_dfl_setup(char *s)
>> +{
>> +	if (!strcmp(s, "1"))
>> +		static_branch_enable(&memsw_account_on_dfl);
>> +	else if (!strcmp(s, "0"))
>> +		static_branch_disable(&memsw_account_on_dfl);
>> +	return 1;
>> +}
>> +__setup("cgroup.memsw_account_on_dfl=", memsw_account_on_dfl_setup);
>
> Please keep the above in memcontrol-v1.c

Hm, I'm not sure about this. This feature might be actually useful with
cgroup v2, as some companies are dependent on the old cgroup v1
semantics here but otherwise would prefer to move to v2.
In other words, I see it as a cgroup v2 feature, not as a cgroup v1.
So there is no reason to move it into the cgroup v1 code.

I think it deserves a separate config option (if we're really concerned
about the memory overhead in struct mem_cgroup) or IMO better a
boot/mount time option.

Thanks!


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/5] add option to restore swap account to cgroupv1 mode
  2025-03-19 19:38 ` Johannes Weiner
  2025-03-19 19:51   ` Shakeel Butt
@ 2025-03-20  8:09   ` jingxiang zeng
  2025-03-20 15:08     ` Johannes Weiner
  1 sibling, 1 reply; 28+ messages in thread
From: jingxiang zeng @ 2025-03-20  8:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: akpm, linux-mm, cgroups, linux-kernel, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, kasong, Zeng Jingxiang

On Thu, 20 Mar 2025 at 03:38, Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Mar 19, 2025 at 02:41:43PM +0800, Jingxiang Zeng wrote:
> > From: Zeng Jingxiang <linuszeng@tencent.com>
> >
> > memsw account is a very useful knob for container memory
> > overcommitting: It's a great abstraction of the "expected total
> > memory usage" of a container, so containers can't allocate too
> > much memory using SWAP, but still be able to SWAP out.
> >
> > For a simple example, with memsw.limit == memory.limit, containers
> > can't exceed their original memory limit, even with SWAP enabled, they
> > get OOM killed as how they used to, but the host is now able to
> > offload cold pages.
> >
> > Similar ability seems absent with V2: With memory.swap.max == 0, the
> > host can't use SWAP to reclaim container memory at all. But with a
> > value larger than that, containers are able to overuse memory, causing
> > delayed OOM kill, thrashing, CPU/Memory usage ratio could be heavily
> > out of balance, especially with compress SWAP backends.
> >
> > This patch set adds two interfaces to control the behavior of the
> > memory.swap.max/current in cgroupv2:
> >
> > CONFIG_MEMSW_ACCOUNT_ON_DFL
> > cgroup.memsw_account_on_dfl={0, 1}
> >
> > When one of the interfaces is enabled: memory.swap.current and
> > memory.swap.max represents the usage/limit of swap.
> > When neither is enabled (default behavior),memory.swap.current and
> > memory.swap.max represents the usage/limit of memory+swap.
>
> This should be new knobs, e.g. memory.memsw.current, memory.memsw.max.
>
> Overloading the existing swap knobs is confusing.
>
> And there doesn't seem to be a good reason to make the behavior
> either-or anyway. If memory.swap.max=max (default), it won't interfere
> with the memsw operation. And it's at least conceivable somebody might
> want to set both, memsw.max > swap.max, to get some flexibility while
> excluding the craziest edge cases.

Hi Johannes,

If both memsw.max and swap.max are provided in cgroupv2, there will be some
issues as follows:
(1. As Shakeel Butt mentioned, currently memsw and swap share the page_counter,
and we need to provide a separate page_counter for memsw.
(2. Currently, the statistics for memsw and swap are mutually
exclusive. For example,
during uncharging, both memsw and swap call the __mem_cgroup_uncharge_swap
function together, and this function currently only selects a single
counter for statistics
based on the static do_memsw_account.

As mentioned above, this patch set considers the approach suggested by Roman
Gushchin[1], which involves switching to cgroupv1 behavior through a
configuration
option, making it easier to implement.

Link: https://lore.kernel.org/all/Zk-fQtFrj-2YDJOo@P9FQF9L96D.corp.robot.car/
[1]


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 3/5] mm/memcontrol: do not scan anon pages if memsw limit is hit
  2025-03-19 19:36   ` Shakeel Butt
@ 2025-03-20  8:40     ` jingxiang zeng
  0 siblings, 0 replies; 28+ messages in thread
From: jingxiang zeng @ 2025-03-20  8:40 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Jingxiang Zeng, akpm, linux-mm, cgroups, linux-kernel, hannes,
	mhocko, roman.gushchin, muchun.song, kasong

On Thu, 20 Mar 2025 at 03:36, Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Wed, Mar 19, 2025 at 02:41:46PM +0800, Jingxiang Zeng wrote:
> > From: Zeng Jingxiang <linuszeng@tencent.com>
> >
> > When memory recycling is triggered by the hard watermark of
>
> What is hard watermark?

memory.memsw.limit_in_bytes.
>
> > memsw, anonymous pages do not want to be recycled any further.
> > This is consistent with the processing method of cgroup v2.
> >
> > Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
>
> Is this patch orthogonal to the series or is it needed for v1 as well?

Yes, it is needed for cgroupv1 as well
>
> > ---
> >  mm/memcontrol.c | 15 +++++++++++----
> >  1 file changed, 11 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c1171fb2bfd6..623ebf610946 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5072,14 +5072,21 @@ void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
> >
> >  long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
> >  {
> > +     struct page_counter *pg_counter;
> >       long nr_swap_pages = get_nr_swap_pages();
> >
> > -     if (mem_cgroup_disabled() || do_memsw_account())
> > +     if (mem_cgroup_disabled())
> >               return nr_swap_pages;
> > -     for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg))
> > +     for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) {
> > +             if (do_memsw_account())
> > +                     pg_counter = &memcg->memsw;
> > +             else
> > +                     pg_counter = &memcg->swap;
> > +
> >               nr_swap_pages = min_t(long, nr_swap_pages,
> > -                                   READ_ONCE(memcg->swap.max) -
> > -                                   page_counter_read(&memcg->swap));
> > +                                   READ_ONCE(pg_counter->max) -
> > +                                   page_counter_read(pg_counter));
> > +     }
> >       return nr_swap_pages;
> >  }
> >
> > --
> > 2.41.1
> >
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 2/5] memcontrol: add boot option to enable memsw account on dfl
  2025-03-19 22:30     ` Roman Gushchin
@ 2025-03-20  8:43       ` jingxiang zeng
  2025-03-20 14:28       ` Johannes Weiner
  1 sibling, 0 replies; 28+ messages in thread
From: jingxiang zeng @ 2025-03-20  8:43 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Shakeel Butt, Jingxiang Zeng, akpm, linux-mm, cgroups,
	linux-kernel, hannes, mhocko, muchun.song, kasong

On Thu, 20 Mar 2025 at 06:32, Roman Gushchin <roman.gushchin@linux.dev> wrote:
>
> Shakeel Butt <shakeel.butt@linux.dev> writes:
>
> > On Wed, Mar 19, 2025 at 02:41:45PM +0800, Jingxiang Zeng wrote:
> >> From: Zeng Jingxiang <linuszeng@tencent.com>
> >>
> >> Added cgroup.memsw_account_on_dfl startup parameter, which
> >> is off by default. When enabled in cgroupv2 mode, the memory
> >> accounting mode of swap will be reverted to cgroupv1 mode.
> >>
> >> Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> >> ---
> >>  include/linux/memcontrol.h |  4 +++-
> >>  mm/memcontrol.c            | 11 +++++++++++
> >>  2 files changed, 14 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> >> index dcb087ee6e8d..96f2fad1c351 100644
> >> --- a/include/linux/memcontrol.h
> >> +++ b/include/linux/memcontrol.h
> >> @@ -62,10 +62,12 @@ struct mem_cgroup_reclaim_cookie {
> >>
> >>  #ifdef CONFIG_MEMCG
> >>
> >> +DECLARE_STATIC_KEY_FALSE(memsw_account_on_dfl);
> >>  /* Whether enable memory+swap account in cgroupv2 */
> >>  static inline bool do_memsw_account_on_dfl(void)
> >>  {
> >> -    return IS_ENABLED(CONFIG_MEMSW_ACCOUNT_ON_DFL);
> >> +    return IS_ENABLED(CONFIG_MEMSW_ACCOUNT_ON_DFL)
> >> +                            || static_branch_unlikely(&memsw_account_on_dfl);
> >
> > Why || in above condition? Shouldn't it be && ?
> >
> >>  }
> >>
> >>  #define MEM_CGROUP_ID_SHIFT 16
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 768d6b15dbfa..c1171fb2bfd6 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -5478,3 +5478,14 @@ static int __init mem_cgroup_swap_init(void)
> >>  subsys_initcall(mem_cgroup_swap_init);
> >>
> >>  #endif /* CONFIG_SWAP */
> >> +
> >> +DEFINE_STATIC_KEY_FALSE(memsw_account_on_dfl);
> >> +static int __init memsw_account_on_dfl_setup(char *s)
> >> +{
> >> +    if (!strcmp(s, "1"))
> >> +            static_branch_enable(&memsw_account_on_dfl);
> >> +    else if (!strcmp(s, "0"))
> >> +            static_branch_disable(&memsw_account_on_dfl);
> >> +    return 1;
> >> +}
> >> +__setup("cgroup.memsw_account_on_dfl=", memsw_account_on_dfl_setup);
> >
> > Please keep the above in memcontrol-v1.c
>
> Hm, I'm not sure about this. This feature might be actually useful with
> cgroup v2, as some companies are dependent on the old cgroup v1
> semantics here but otherwise would prefer to move to v2.
> In other words, I see it as a cgroup v2 feature, not as a cgroup v1.
> So there is no reason to move it into the cgroup v1 code.

Yes, this is mainly intended for use with v2.
>
> I think it deserves a separate config option (if we're really concerned
> about the memory overhead in struct mem_cgroup) or IMO better a
> boot/mount time option.
>
> Thanks!
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 2/5] memcontrol: add boot option to enable memsw account on dfl
  2025-03-19 19:34   ` Shakeel Butt
  2025-03-19 22:30     ` Roman Gushchin
@ 2025-03-20  8:51     ` jingxiang zeng
  1 sibling, 0 replies; 28+ messages in thread
From: jingxiang zeng @ 2025-03-20  8:51 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Jingxiang Zeng, akpm, linux-mm, cgroups, linux-kernel, hannes,
	mhocko, roman.gushchin, muchun.song, kasong

On Thu, 20 Mar 2025 at 03:34, Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> On Wed, Mar 19, 2025 at 02:41:45PM +0800, Jingxiang Zeng wrote:
> > From: Zeng Jingxiang <linuszeng@tencent.com>
> >
> > Added cgroup.memsw_account_on_dfl startup parameter, which
> > is off by default. When enabled in cgroupv2 mode, the memory
> > accounting mode of swap will be reverted to cgroupv1 mode.
> >
> > Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> > ---
> >  include/linux/memcontrol.h |  4 +++-
> >  mm/memcontrol.c            | 11 +++++++++++
> >  2 files changed, 14 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index dcb087ee6e8d..96f2fad1c351 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -62,10 +62,12 @@ struct mem_cgroup_reclaim_cookie {
> >
> >  #ifdef CONFIG_MEMCG
> >
> > +DECLARE_STATIC_KEY_FALSE(memsw_account_on_dfl);
> >  /* Whether enable memory+swap account in cgroupv2 */
> >  static inline bool do_memsw_account_on_dfl(void)
> >  {
> > -     return IS_ENABLED(CONFIG_MEMSW_ACCOUNT_ON_DFL);
> > +     return IS_ENABLED(CONFIG_MEMSW_ACCOUNT_ON_DFL)
> > +                             || static_branch_unlikely(&memsw_account_on_dfl);
>
> Why || in above condition? Shouldn't it be && ?

It seems that changing it to && is better, Thanks.
>
> >  }
> >
> >  #define MEM_CGROUP_ID_SHIFT  16
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 768d6b15dbfa..c1171fb2bfd6 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5478,3 +5478,14 @@ static int __init mem_cgroup_swap_init(void)
> >  subsys_initcall(mem_cgroup_swap_init);
> >
> >  #endif /* CONFIG_SWAP */
> > +
> > +DEFINE_STATIC_KEY_FALSE(memsw_account_on_dfl);
> > +static int __init memsw_account_on_dfl_setup(char *s)
> > +{
> > +     if (!strcmp(s, "1"))
> > +             static_branch_enable(&memsw_account_on_dfl);
> > +     else if (!strcmp(s, "0"))
> > +             static_branch_disable(&memsw_account_on_dfl);
> > +     return 1;
> > +}
> > +__setup("cgroup.memsw_account_on_dfl=", memsw_account_on_dfl_setup);
>
> Please keep the above in memcontrol-v1.c
>
> > +
> > \ No newline at end of file
> > --
> > 2.41.1
> >
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 2/5] memcontrol: add boot option to enable memsw account on dfl
  2025-03-19 22:30     ` Roman Gushchin
  2025-03-20  8:43       ` jingxiang zeng
@ 2025-03-20 14:28       ` Johannes Weiner
  2025-03-20 15:16         ` Roman Gushchin
  2025-03-20 15:33         ` Shakeel Butt
  1 sibling, 2 replies; 28+ messages in thread
From: Johannes Weiner @ 2025-03-20 14:28 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Shakeel Butt, Jingxiang Zeng, akpm, linux-mm, cgroups,
	linux-kernel, mhocko, muchun.song, kasong

On Wed, Mar 19, 2025 at 10:30:20PM +0000, Roman Gushchin wrote:
> Shakeel Butt <shakeel.butt@linux.dev> writes:
> 
> > On Wed, Mar 19, 2025 at 02:41:45PM +0800, Jingxiang Zeng wrote:
> >> From: Zeng Jingxiang <linuszeng@tencent.com>
> >> 
> >> Added cgroup.memsw_account_on_dfl startup parameter, which
> >> is off by default. When enabled in cgroupv2 mode, the memory
> >> accounting mode of swap will be reverted to cgroupv1 mode.
> >> 
> >> Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> >> ---
> >>  include/linux/memcontrol.h |  4 +++-
> >>  mm/memcontrol.c            | 11 +++++++++++
> >>  2 files changed, 14 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> >> index dcb087ee6e8d..96f2fad1c351 100644
> >> --- a/include/linux/memcontrol.h
> >> +++ b/include/linux/memcontrol.h
> >> @@ -62,10 +62,12 @@ struct mem_cgroup_reclaim_cookie {
> >>  
> >>  #ifdef CONFIG_MEMCG
> >>  
> >> +DECLARE_STATIC_KEY_FALSE(memsw_account_on_dfl);
> >>  /* Whether enable memory+swap account in cgroupv2 */
> >>  static inline bool do_memsw_account_on_dfl(void)
> >>  {
> >> -	return IS_ENABLED(CONFIG_MEMSW_ACCOUNT_ON_DFL);
> >> +	return IS_ENABLED(CONFIG_MEMSW_ACCOUNT_ON_DFL)
> >> +				|| static_branch_unlikely(&memsw_account_on_dfl);
> >
> > Why || in above condition? Shouldn't it be && ?
> >
> >>  }
> >>  
> >>  #define MEM_CGROUP_ID_SHIFT	16
> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >> index 768d6b15dbfa..c1171fb2bfd6 100644
> >> --- a/mm/memcontrol.c
> >> +++ b/mm/memcontrol.c
> >> @@ -5478,3 +5478,14 @@ static int __init mem_cgroup_swap_init(void)
> >>  subsys_initcall(mem_cgroup_swap_init);
> >>  
> >>  #endif /* CONFIG_SWAP */
> >> +
> >> +DEFINE_STATIC_KEY_FALSE(memsw_account_on_dfl);
> >> +static int __init memsw_account_on_dfl_setup(char *s)
> >> +{
> >> +	if (!strcmp(s, "1"))
> >> +		static_branch_enable(&memsw_account_on_dfl);
> >> +	else if (!strcmp(s, "0"))
> >> +		static_branch_disable(&memsw_account_on_dfl);
> >> +	return 1;
> >> +}
> >> +__setup("cgroup.memsw_account_on_dfl=", memsw_account_on_dfl_setup);
> >
> > Please keep the above in memcontrol-v1.c
> 
> Hm, I'm not sure about this. This feature might be actually useful with
> cgroup v2, as some companies are dependent on the old cgroup v1
> semantics here but otherwise would prefer to move to v2.
> In other words, I see it as a cgroup v2 feature, not as a cgroup v1.
> So there is no reason to move it into the cgroup v1 code.

Agreed. Let's think of this proposal as making memsw tracking and
control a full-fledged v2 feature.

> I think it deserves a separate config option (if we're really concerned
> about the memory overhead in struct mem_cgroup) or IMO better a
> boot/mount time option.

Yeah, a config option forces distros to enable it :/

I'm hesitant to agree with making it optional in any manner. If you
consider the functionality that is implemented, the overhead should be
fairly minimal. It isn't right now, because page_counter contains a
ton of stuff that isn't applicable to this new user. That overhead is
still paid for unnecessarily by users who _do_ need to enable it.

It seems like a good opportunity to refactor struct page_counter.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 0/5] add option to restore swap account to cgroupv1 mode
  2025-03-20  8:09   ` jingxiang zeng
@ 2025-03-20 15:08     ` Johannes Weiner
  0 siblings, 0 replies; 28+ messages in thread
From: Johannes Weiner @ 2025-03-20 15:08 UTC (permalink / raw)
  To: jingxiang zeng
  Cc: akpm, linux-mm, cgroups, linux-kernel, mhocko, roman.gushchin,
	shakeel.butt, muchun.song, kasong, Zeng Jingxiang

On Thu, Mar 20, 2025 at 04:09:29PM +0800, jingxiang zeng wrote:
> If both memsw.max and swap.max are provided in cgroupv2, there will be some
> issues as follows:
> (1. As Shakeel Butt mentioned, currently memsw and swap share the page_counter,
> and we need to provide a separate page_counter for memsw.
> (2. Currently, the statistics for memsw and swap are mutually
> exclusive. For example,
> during uncharging, both memsw and swap call the __mem_cgroup_uncharge_swap
> function together, and this function currently only selects a single
> counter for statistics
> based on the static do_memsw_account.

My suggestion is to factor out from struct page_counter all the stuff
that is not necessary for all users, and then have separate counters
for swap and memsw.

The protection stuff is long overdue for this. It makes up nearly half
of the struct's members, but is only used by the memory counter. Even
before your patches this is unnecessary bloat in the swap/memsw, kmem
and tcpmem counters.

Fix that and having separate counters is a non-issue.

Overloading the memory.swap.* controls to mean "combined memory and
swap" is unacceptable to me. They have established meaning on v2. It
may well become a common setting, and there might well be usecases
where people want one setting for one cgroup and another setting for
another cgroup. Or maybe even both controls in one group. And why not?

This is a user-visible interface that we'll have to support forever,
and deployments will be forced to use forever. Tech debt in the
current implementation is not a convincing argument to forever trap us
all with a suboptimal choice.

Nacked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 2/5] memcontrol: add boot option to enable memsw account on dfl
  2025-03-20 14:28       ` Johannes Weiner
@ 2025-03-20 15:16         ` Roman Gushchin
  2025-03-20 15:33         ` Shakeel Butt
  1 sibling, 0 replies; 28+ messages in thread
From: Roman Gushchin @ 2025-03-20 15:16 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Shakeel Butt, Jingxiang Zeng, akpm, linux-mm, cgroups,
	linux-kernel, mhocko, muchun.song, kasong

Johannes Weiner <hannes@cmpxchg.org> writes:

> On Wed, Mar 19, 2025 at 10:30:20PM +0000, Roman Gushchin wrote:
>> Shakeel Butt <shakeel.butt@linux.dev> writes:
>> 
>> > On Wed, Mar 19, 2025 at 02:41:45PM +0800, Jingxiang Zeng wrote:
>> >> From: Zeng Jingxiang <linuszeng@tencent.com>
>> >> 
>> >> Added cgroup.memsw_account_on_dfl startup parameter, which
>> >> is off by default. When enabled in cgroupv2 mode, the memory
>> >> accounting mode of swap will be reverted to cgroupv1 mode.
>> >> 
>> >> Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
>> >> ---
>> >>  include/linux/memcontrol.h |  4 +++-
>> >>  mm/memcontrol.c            | 11 +++++++++++
>> >>  2 files changed, 14 insertions(+), 1 deletion(-)
>> >> 
>> >> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> >> index dcb087ee6e8d..96f2fad1c351 100644
>> >> --- a/include/linux/memcontrol.h
>> >> +++ b/include/linux/memcontrol.h
>> >> @@ -62,10 +62,12 @@ struct mem_cgroup_reclaim_cookie {
>> >>  
>> >>  #ifdef CONFIG_MEMCG
>> >>  
>> >> +DECLARE_STATIC_KEY_FALSE(memsw_account_on_dfl);
>> >>  /* Whether enable memory+swap account in cgroupv2 */
>> >>  static inline bool do_memsw_account_on_dfl(void)
>> >>  {
>> >> -	return IS_ENABLED(CONFIG_MEMSW_ACCOUNT_ON_DFL);
>> >> +	return IS_ENABLED(CONFIG_MEMSW_ACCOUNT_ON_DFL)
>> >> +				|| static_branch_unlikely(&memsw_account_on_dfl);
>> >
>> > Why || in above condition? Shouldn't it be && ?
>> >
>> >>  }
>> >>  
>> >>  #define MEM_CGROUP_ID_SHIFT	16
>> >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> >> index 768d6b15dbfa..c1171fb2bfd6 100644
>> >> --- a/mm/memcontrol.c
>> >> +++ b/mm/memcontrol.c
>> >> @@ -5478,3 +5478,14 @@ static int __init mem_cgroup_swap_init(void)
>> >>  subsys_initcall(mem_cgroup_swap_init);
>> >>  
>> >>  #endif /* CONFIG_SWAP */
>> >> +
>> >> +DEFINE_STATIC_KEY_FALSE(memsw_account_on_dfl);
>> >> +static int __init memsw_account_on_dfl_setup(char *s)
>> >> +{
>> >> +	if (!strcmp(s, "1"))
>> >> +		static_branch_enable(&memsw_account_on_dfl);
>> >> +	else if (!strcmp(s, "0"))
>> >> +		static_branch_disable(&memsw_account_on_dfl);
>> >> +	return 1;
>> >> +}
>> >> +__setup("cgroup.memsw_account_on_dfl=", memsw_account_on_dfl_setup);
>> >
>> > Please keep the above in memcontrol-v1.c
>> 
>> Hm, I'm not sure about this. This feature might be actually useful with
>> cgroup v2, as some companies are dependent on the old cgroup v1
>> semantics here but otherwise would prefer to move to v2.
>> In other words, I see it as a cgroup v2 feature, not as a cgroup v1.
>> So there is no reason to move it into the cgroup v1 code.
>
> Agreed. Let's think of this proposal as making memsw tracking and
> control a full-fledged v2 feature.
>
>> I think it deserves a separate config option (if we're really concerned
>> about the memory overhead in struct mem_cgroup) or IMO better a
>> boot/mount time option.
>
> Yeah, a config option forces distros to enable it :/
>
> I'm hesitant to agree with making it optional in any manner. If you
> consider the functionality that is implemented, the overhead should be
> fairly minimal. It isn't right now, because page_counter contains a
> ton of stuff that isn't applicable to this new user. That overhead is
> still paid for unnecessarily by users who _do_ need to enable it.

Agree. Memcg is already huge, so another page_counter won't add a lot
percentage-wise.

>
> It seems like a good opportunity to refactor struct page_counter.

I don't think it's a hard dependency here, but otherwise fully agree.

Thanks!


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 2/5] memcontrol: add boot option to enable memsw account on dfl
  2025-03-20 14:28       ` Johannes Weiner
  2025-03-20 15:16         ` Roman Gushchin
@ 2025-03-20 15:33         ` Shakeel Butt
  2025-04-02 13:40           ` Michal Koutný
  1 sibling, 1 reply; 28+ messages in thread
From: Shakeel Butt @ 2025-03-20 15:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Roman Gushchin, Jingxiang Zeng, akpm, linux-mm, cgroups,
	linux-kernel, mhocko, muchun.song, kasong

On Thu, Mar 20, 2025 at 10:28:46AM -0400, Johannes Weiner wrote:
> On Wed, Mar 19, 2025 at 10:30:20PM +0000, Roman Gushchin wrote:
> > Shakeel Butt <shakeel.butt@linux.dev> writes:
> > 
> > > On Wed, Mar 19, 2025 at 02:41:45PM +0800, Jingxiang Zeng wrote:
> > >> From: Zeng Jingxiang <linuszeng@tencent.com>
> > >> 
> > >> Added cgroup.memsw_account_on_dfl startup parameter, which
> > >> is off by default. When enabled in cgroupv2 mode, the memory
> > >> accounting mode of swap will be reverted to cgroupv1 mode.
> > >> 
> > >> Signed-off-by: Zeng Jingxiang <linuszeng@tencent.com>
> > >> ---
> > >>  include/linux/memcontrol.h |  4 +++-
> > >>  mm/memcontrol.c            | 11 +++++++++++
> > >>  2 files changed, 14 insertions(+), 1 deletion(-)
> > >> 
> > >> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > >> index dcb087ee6e8d..96f2fad1c351 100644
> > >> --- a/include/linux/memcontrol.h
> > >> +++ b/include/linux/memcontrol.h
> > >> @@ -62,10 +62,12 @@ struct mem_cgroup_reclaim_cookie {
> > >>  
> > >>  #ifdef CONFIG_MEMCG
> > >>  
> > >> +DECLARE_STATIC_KEY_FALSE(memsw_account_on_dfl);
> > >>  /* Whether enable memory+swap account in cgroupv2 */
> > >>  static inline bool do_memsw_account_on_dfl(void)
> > >>  {
> > >> -	return IS_ENABLED(CONFIG_MEMSW_ACCOUNT_ON_DFL);
> > >> +	return IS_ENABLED(CONFIG_MEMSW_ACCOUNT_ON_DFL)
> > >> +				|| static_branch_unlikely(&memsw_account_on_dfl);
> > >
> > > Why || in above condition? Shouldn't it be && ?
> > >
> > >>  }
> > >>  
> > >>  #define MEM_CGROUP_ID_SHIFT	16
> > >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > >> index 768d6b15dbfa..c1171fb2bfd6 100644
> > >> --- a/mm/memcontrol.c
> > >> +++ b/mm/memcontrol.c
> > >> @@ -5478,3 +5478,14 @@ static int __init mem_cgroup_swap_init(void)
> > >>  subsys_initcall(mem_cgroup_swap_init);
> > >>  
> > >>  #endif /* CONFIG_SWAP */
> > >> +
> > >> +DEFINE_STATIC_KEY_FALSE(memsw_account_on_dfl);
> > >> +static int __init memsw_account_on_dfl_setup(char *s)
> > >> +{
> > >> +	if (!strcmp(s, "1"))
> > >> +		static_branch_enable(&memsw_account_on_dfl);
> > >> +	else if (!strcmp(s, "0"))
> > >> +		static_branch_disable(&memsw_account_on_dfl);
> > >> +	return 1;
> > >> +}
> > >> +__setup("cgroup.memsw_account_on_dfl=", memsw_account_on_dfl_setup);
> > >
> > > Please keep the above in memcontrol-v1.c
> > 
> > Hm, I'm not sure about this. This feature might be actually useful with
> > cgroup v2, as some companies are dependent on the old cgroup v1
> > semantics here but otherwise would prefer to move to v2.
> > In other words, I see it as a cgroup v2 feature, not as a cgroup v1.
> > So there is no reason to move it into the cgroup v1 code.
> 
> Agreed. Let's think of this proposal as making memsw tracking and
> control a full-fledged v2 feature.
> 

Sounds good. However I want us to discuss and decide the semantics of
memsw from scratch rather than adopting v1 semantics. Particularly I
don't like the failure of setting memsw limit on v1. Also we should
discuss how memsw and swap limits would interact and what would be the
appropriate default.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC 2/5] memcontrol: add boot option to enable memsw account on dfl
  2025-03-20 15:33         ` Shakeel Butt
@ 2025-04-02 13:40           ` Michal Koutný
  2025-04-03  7:47             ` [External] " Zhongkun He
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Koutný @ 2025-04-02 13:40 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Johannes Weiner, Roman Gushchin, Jingxiang Zeng, akpm, linux-mm,
	cgroups, linux-kernel, mhocko, muchun.song, kasong

[-- Attachment #1: Type: text/plain, Size: 580 bytes --]

On Thu, Mar 20, 2025 at 08:33:09AM -0700, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> However I want us to discuss and decide the semantics of
> memsw from scratch rather than adopting v1 semantics.

+1

> Also we should discuss how memsw and swap limits would interact and
> what would be the appropriate default.

Besides more complicated implementation, merged memsw won't represent an
actual resource.

So I'd be interested in use cases (other than "used to it from v1") that
cannot be controlled with separate memory. and swap. limits.


0.02€,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [External] Re: [RFC 2/5] memcontrol: add boot option to enable memsw account on dfl
  2025-04-02 13:40           ` Michal Koutný
@ 2025-04-03  7:47             ` Zhongkun He
  2025-04-03  9:16               ` jingxiang zeng
  0 siblings, 1 reply; 28+ messages in thread
From: Zhongkun He @ 2025-04-03  7:47 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Shakeel Butt, Johannes Weiner, Roman Gushchin, Jingxiang Zeng,
	akpm, linux-mm, cgroups, linux-kernel, mhocko, muchun.song,
	kasong

On Wed, Apr 2, 2025 at 9:42 PM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Thu, Mar 20, 2025 at 08:33:09AM -0700, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > However I want us to discuss and decide the semantics of
> > memsw from scratch rather than adopting v1 semantics.
>
> +1
>
> > Also we should discuss how memsw and swap limits would interact and
> > what would be the appropriate default.
>
> Besides more complicated implementation, merged memsw won't represent an
> actual resource.
>
> So I'd be interested in use cases (other than "used to it from v1") that
> cannot be controlled with separate memory. and swap. limits.
>

Hi Michal

We encountered an issue, which is also a real use case. With memory offloading,
we can move some cold pages to swap. Suppose an application’s peak memory
usage at certain times is 10GB, while at other times, it exists in a
combination of
memory and swap. If we set limits on memory or swap separately, it would lack
flexibility—sometimes it needs 1GB memory + 9GB swap, sometimes 5GB
memory + 5GB swap, or even 10GB memory + 0GB swap. Therefore, we strongly
hope to use the mem+swap charging method in cgroupv2

>
> 0.02€,
> Michal


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [External] Re: [RFC 2/5] memcontrol: add boot option to enable memsw account on dfl
  2025-04-03  7:47             ` [External] " Zhongkun He
@ 2025-04-03  9:16               ` jingxiang zeng
  2025-04-11 16:57                 ` Michal Koutný
  0 siblings, 1 reply; 28+ messages in thread
From: jingxiang zeng @ 2025-04-03  9:16 UTC (permalink / raw)
  To: Zhongkun He
  Cc: Michal Koutný,
	Shakeel Butt, Johannes Weiner, Roman Gushchin, Jingxiang Zeng,
	akpm, linux-mm, cgroups, linux-kernel, mhocko, muchun.song,
	kasong

On Thu, 3 Apr 2025 at 15:47, Zhongkun He <hezhongkun.hzk@bytedance.com> wrote:
>
> On Wed, Apr 2, 2025 at 9:42 PM Michal Koutný <mkoutny@suse.com> wrote:
> >
> > On Thu, Mar 20, 2025 at 08:33:09AM -0700, Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > However I want us to discuss and decide the semantics of
> > > memsw from scratch rather than adopting v1 semantics.
> >
> > +1
> >
> > > Also we should discuss how memsw and swap limits would interact and
> > > what would be the appropriate default.
> >
> > Besides more complicated implementation, merged memsw won't represent an
> > actual resource.
> >
> > So I'd be interested in use cases (other than "used to it from v1") that
> > cannot be controlled with separate memory. and swap. limits.
> >
>
> Hi Michal
>
> We encountered an issue, which is also a real use case. With memory offloading,
> we can move some cold pages to swap. Suppose an application’s peak memory
> usage at certain times is 10GB, while at other times, it exists in a
> combination of
> memory and swap. If we set limits on memory or swap separately, it would lack
> flexibility—sometimes it needs 1GB memory + 9GB swap, sometimes 5GB
> memory + 5GB swap, or even 10GB memory + 0GB swap. Therefore, we strongly
> hope to use the mem+swap charging method in cgroupv2
>
> >
> > 0.02€,
> > Michal

Yes, in the container scenario, if swap is enabled on the server and
the customer's
container requires 10GB of memory, we only need to set
memory.memsw.limit_in_bytes=10GB, and the kernel can automatically swap out
part of the business container's memory to swap according to the server's memory
pressure, and it can be fully guaranteed that the customer's container
will not use
more memory because swap is enabled on the server.
>


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [External] Re: [RFC 2/5] memcontrol: add boot option to enable memsw account on dfl
  2025-04-03  9:16               ` jingxiang zeng
@ 2025-04-11 16:57                 ` Michal Koutný
  2025-04-16  8:29                   ` jingxiang zeng
  0 siblings, 1 reply; 28+ messages in thread
From: Michal Koutný @ 2025-04-11 16:57 UTC (permalink / raw)
  To: jingxiang zeng, Zhongkun He
  Cc: Shakeel Butt, Johannes Weiner, Roman Gushchin, Jingxiang Zeng,
	akpm, linux-mm, cgroups, linux-kernel, mhocko, muchun.song,
	kasong

[-- Attachment #1: Type: text/plain, Size: 2742 bytes --]

On Thu, Apr 03, 2025 at 05:16:45PM +0800, jingxiang zeng <jingxiangzeng.cas@gmail.com> wrote:
> > We encountered an issue, which is also a real use case. With memory offloading,
> > we can move some cold pages to swap. Suppose an application’s peak memory
> > usage at certain times is 10GB, while at other times, it exists in a
> > combination of
> > memory and swap. If we set limits on memory or swap separately, it would lack
> > flexibility—sometimes it needs 1GB memory + 9GB swap, sometimes 5GB
> > memory + 5GB swap, or even 10GB memory + 0GB swap. Therefore, we strongly
> > hope to use the mem+swap charging method in cgroupv2

App's peak need determines memory.max=10G.
The apparent flexibility is dependency on how much competitors the app
has. It can run 5GB memory + 5GB swap with some competition or 1GB
memory + 9 GB with different competition (more memory demanding).
If you want to prevent faulty app to eating up all of swap for itself
(like it's possible with memsw), you may define some memory.swap.max.
(There's no unique correspondence between this and original memsw value
since the cost of mem<->swap is variable.)


> Yes, in the container scenario, if swap is enabled on the server and
> the customer's container requires 10GB of memory, we only need to set
> memory.memsw.limit_in_bytes=10GB, and the kernel can automatically
> swap out part of the business container's memory to swap according to
> the server's memory pressure, and it can be fully guaranteed that the
> customer's container will not use more memory because swap is enabled
> on the server.

This made me consider various causes of the pressure:

- global pressure
  - it doesn't change memcg's total consuption (memsw.usage=const)
  - memsw limit does nothing
- self-memcg pressure
  - new allocations against own limit and memsw.usage hits memsw.limit
  - memsw.limit prevents new allocations that would extend swap
  - achievable with memory.swap.max=0
- ancestral pressure 
  - when sibling needs to allocate but limit is on ancestor
  - similar to global pressure (memsw.usage=const), self memsw.limit
    does nothing

- or there is no outer pressure but you want to prevent new allocations
  when something has been swapped out already
  - swapped out amount is a debt
    - memsw.limit behavior is suboptimal until the debt needs to be
      repaid
      - repay is when someone else needs the swap space

The above is a free flow of thoughts but I'd condense such conversions:
- memory.max := memory.memsw.limit_in_bytes
- memory.swap.max := anything between 0 and memory.memsw.limit_in_bytes

Did I fail to capture some mode where memsw limits were superior?

Thanks,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [External] Re: [RFC 2/5] memcontrol: add boot option to enable memsw account on dfl
  2025-04-11 16:57                 ` Michal Koutný
@ 2025-04-16  8:29                   ` jingxiang zeng
  2025-05-05 18:29                     ` Michal Koutný
  0 siblings, 1 reply; 28+ messages in thread
From: jingxiang zeng @ 2025-04-16  8:29 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Zhongkun He, Shakeel Butt, Johannes Weiner, Roman Gushchin,
	Jingxiang Zeng, akpm, linux-mm, cgroups, linux-kernel, mhocko,
	muchun.song, kasong

On Sat, 12 Apr 2025 at 00:57, Michal Koutný <mkoutny@suse.com> wrote:
>
> On Thu, Apr 03, 2025 at 05:16:45PM +0800, jingxiang zeng <jingxiangzeng.cas@gmail.com> wrote:
> > > We encountered an issue, which is also a real use case. With memory offloading,
> > > we can move some cold pages to swap. Suppose an application’s peak memory
> > > usage at certain times is 10GB, while at other times, it exists in a
> > > combination of
> > > memory and swap. If we set limits on memory or swap separately, it would lack
> > > flexibility—sometimes it needs 1GB memory + 9GB swap, sometimes 5GB
> > > memory + 5GB swap, or even 10GB memory + 0GB swap. Therefore, we strongly
> > > hope to use the mem+swap charging method in cgroupv2
>
> App's peak need determines memory.max=10G.
> The apparent flexibility is dependency on how much competitors the app
> has. It can run 5GB memory + 5GB swap with some competition or 1GB
> memory + 9 GB with different competition (more memory demanding).
> If you want to prevent faulty app to eating up all of swap for itself
> (like it's possible with memsw), you may define some memory.swap.max.
> (There's no unique correspondence between this and original memsw value
> since the cost of mem<->swap is variable.)
>
>
> > Yes, in the container scenario, if swap is enabled on the server and
> > the customer's container requires 10GB of memory, we only need to set
> > memory.memsw.limit_in_bytes=10GB, and the kernel can automatically
> > swap out part of the business container's memory to swap according to
> > the server's memory pressure, and it can be fully guaranteed that the
> > customer's container will not use more memory because swap is enabled
> > on the server.
>
> This made me consider various causes of the pressure:
>
> - global pressure
>   - it doesn't change memcg's total consuption (memsw.usage=const)
>   - memsw limit does nothing
> - self-memcg pressure
>   - new allocations against own limit and memsw.usage hits memsw.limit
>   - memsw.limit prevents new allocations that would extend swap
>   - achievable with memory.swap.max=0
> - ancestral pressure
>   - when sibling needs to allocate but limit is on ancestor
>   - similar to global pressure (memsw.usage=const), self memsw.limit
>     does nothing
>
> - or there is no outer pressure but you want to prevent new allocations
>   when something has been swapped out already
>   - swapped out amount is a debt
>     - memsw.limit behavior is suboptimal until the debt needs to be
>       repaid
>       - repay is when someone else needs the swap space
>
> The above is a free flow of thoughts but I'd condense such conversions:
> - memory.max := memory.memsw.limit_in_bytes
> - memory.swap.max := anything between 0 and memory.memsw.limit_in_bytes
>
> Did I fail to capture some mode where memsw limits were superior?
>
Hi, Michal

In fact, the memsw counter is mainly effective in proactive memory offload
scenarios.

For example, the current container memory usage is as follows:
memory.limit_in_bytes = 10GB
memory.usage_in_bytes = 9GB

Theoretically, through the memory.reclaim proactive reclaim interface, the
memory usage of [0GB, 9GB] can be reclaimed to the swap, so:
memory.limit_in_bytes = 10GB
memory.usage_in_bytes = 9GB - [0GB, 9GB]

In the case of proactive memory offload, the amount of memory that can be
reclaimed is determined by the container's PSI and other indicators. It is
difficult to set an accurate memory.swap.max value.
memory.swap.current = [0GB, 9GB]
memory.swap.max = ?

The memory space saved by swapping out to swap can continue to load
the operation of system components or more workloads.
memory.limit_in_bytes = 10GB
memory.usage_in_bytes = 9GB - [0GB, 9GB]
memory.swap.current = [0GB, 9GB]

The memory usage of memory.usage_in_bytes is reduced due to proactive
offload to swap, which will cause additional problems, such as:
1. There may be some memory leaks or abnormal imported network traffic
in the container, which may cause OOM to fail to trigger or be triggered late;
2. In the oversold scenario, if the container's memory requirement is 10GB,
the container's memory+swap should only use 10GB.

In the above scenario, the memsw counter is very useful:
memory.limit_in_bytes = 10GB
memory.usage_in_bytes = 9GB - [0GB, 9GB]

memory.memsw.limit_in_bytes = 10GB
memory.memsw.usage_in_bytes = 9GB

Above, thanks.
> Thanks,
> Michal


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [External] Re: [RFC 2/5] memcontrol: add boot option to enable memsw account on dfl
  2025-04-16  8:29                   ` jingxiang zeng
@ 2025-05-05 18:29                     ` Michal Koutný
  0 siblings, 0 replies; 28+ messages in thread
From: Michal Koutný @ 2025-05-05 18:29 UTC (permalink / raw)
  To: jingxiang zeng
  Cc: Zhongkun He, Shakeel Butt, Johannes Weiner, Roman Gushchin,
	Jingxiang Zeng, akpm, linux-mm, cgroups, linux-kernel, mhocko,
	muchun.song, kasong

[-- Attachment #1: Type: text/plain, Size: 1567 bytes --]

(Excuse my long turnarounds, I assume this needs time.)

On Wed, Apr 16, 2025 at 04:29:13PM +0800, jingxiang zeng <jingxiangzeng.cas@gmail.com> wrote:
> ...
> In fact, the memsw counter is mainly effective in proactive memory offload
> scenarios.

Is that some downstream experience or aspiration? Because there's
no kernel where memsw and proactive reclaim coexist.

> It is difficult to set an accurate memory.swap.max value.
> memory.swap.current = [0GB, 9GB]
> memory.swap.max = ?

I likely don't understand, I'd consider the value of 10GB in this
case...

> The memory space saved by swapping out to swap can continue to load
> the operation of system components or more workloads.
> memory.limit_in_bytes = 10GB
> memory.usage_in_bytes = 9GB - [0GB, 9GB]
> memory.swap.current = [0GB, 9GB]
> 
> The memory usage of memory.usage_in_bytes is reduced due to proactive
> offload to swap, which will cause additional problems, such as:
> 1. There may be some memory leaks or abnormal imported network traffic
> in the container, which may cause OOM to fail to trigger or be triggered late;
> 2. In the oversold scenario, if the container's memory requirement is 10GB,
> the container's memory+swap should only use 10GB.

...which would mean per-container usage doesn't exceed 20GB (leaks
remain bound).
Apparently, you could fit less containers but would they be actually
running (memsw doesn't guarantee that). Or what is problematic with
different treatment of overcommit between memsw vs .max and .swap.max?

Regards,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2025-05-05 18:29 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-03-19  6:41 [RFC 0/5] add option to restore swap account to cgroupv1 mode Jingxiang Zeng
2025-03-19  6:41 ` [RFC 1/5] Kconfig: add SWAP_CHARGE_V1_MODE config Jingxiang Zeng
2025-03-19 19:29   ` Shakeel Butt
2025-03-19 19:31     ` Shakeel Butt
2025-03-19  6:41 ` [RFC 2/5] memcontrol: add boot option to enable memsw account on dfl Jingxiang Zeng
2025-03-19 19:34   ` Shakeel Butt
2025-03-19 22:30     ` Roman Gushchin
2025-03-20  8:43       ` jingxiang zeng
2025-03-20 14:28       ` Johannes Weiner
2025-03-20 15:16         ` Roman Gushchin
2025-03-20 15:33         ` Shakeel Butt
2025-04-02 13:40           ` Michal Koutný
2025-04-03  7:47             ` [External] " Zhongkun He
2025-04-03  9:16               ` jingxiang zeng
2025-04-11 16:57                 ` Michal Koutný
2025-04-16  8:29                   ` jingxiang zeng
2025-05-05 18:29                     ` Michal Koutný
2025-03-20  8:51     ` jingxiang zeng
2025-03-19  6:41 ` [RFC 3/5] mm/memcontrol: do not scan anon pages if memsw limit is hit Jingxiang Zeng
2025-03-19 19:36   ` Shakeel Butt
2025-03-20  8:40     ` jingxiang zeng
2025-03-19  6:41 ` [RFC 4/5] mm/memcontrol: allow memsw account in cgroup v2 Jingxiang Zeng
2025-03-19  6:41 ` [RFC 5/5] Docs/cgroup-v2: add cgroup.memsw_account_on_dfl Documentation Jingxiang Zeng
2025-03-19 19:27 ` [RFC 0/5] add option to restore swap account to cgroupv1 mode Shakeel Butt
2025-03-19 19:38 ` Johannes Weiner
2025-03-19 19:51   ` Shakeel Butt
2025-03-20  8:09   ` jingxiang zeng
2025-03-20 15:08     ` Johannes Weiner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox