* [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option
@ 2024-06-25  0:58 Roman Gushchin
  2024-06-25  0:58 ` [PATCH v2 01/14] mm: memcg: introduce memcontrol-v1.c Roman Gushchin
                   ` (14 more replies)
  0 siblings, 15 replies; 31+ messages in thread
From: Roman Gushchin @ 2024-06-25  0:58 UTC
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm, Roman Gushchin, Matthew Wilcox

Cgroups v2 have been around for a while and many users have fully adopted them,
so they never use cgroup v1 features and functionality. Yet they have to "pay"
for cgroup v1 support anyway:
1) the kernel binary contains unused cgroup v1 code,
2) some code paths have additional checks which are not needed,
3) some common structures like task_struct and mem_cgroup contain unused
   cgroup v1-specific members.

Cgroup v1's memory controller has a number of features that are not supported
by cgroup v2, and their implementation is pretty much self-contained. Most
notably, these features are: soft limit reclaim, OOM handling in userspace,
a complicated event notification system, and charge migration. The cgroup
v1-specific code in memcontrol.c is close to 4k lines in size and is
intertwined with generic and cgroup v2-specific code. It's a burden on
developers and maintainers.

This patchset aims to solve these problems by:
1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file,
2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the
   mm/memcontrol-v1.h header,
3) introducing the CONFIG_MEMCG_V1 config option, turned off by default,
4) making memcontrol-v1.c compile only if CONFIG_MEMCG_V1 is set.

If CONFIG_MEMCG_V1 is not set, the cgroup v1 memory controller is still
available for mounting; however, no memory-specific control knobs are present.
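
To illustrate points 3) and 4) above, here is a minimal sketch of what the
Kconfig entry and build rule can look like by the end of the series (the
dependencies and help text below are illustrative rather than quoted from
the patches):

  config MEMCG_V1
  	bool "Legacy cgroup v1 memory controller"
  	depends on MEMCG
  	default n
  	help
  	  Legacy cgroup v1 memory controller which has been deprecated
  	  in favor of the cgroup v2 implementation.

  # mm/Makefile: compile the legacy code only when the option is set
  obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o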

This patchset is based on the mm-unstable tree (b610f75d19a34);
a version based on mm-stable can be found here:
  https://github.com/rgushchin/linux/tree/memcontrol_v1.1-stable .

v2:
  - minor compilation fix
  - #else/#endif comments fix (Lance Yang)

v1:
  - switched to CONFIG_MEMCG_V1 being off by default based on LSFMMBPF
    discussion [1]
  - switched to memcg1_ prefix (Johannes)
  - many minor fixes
  - dropped patches which put struct memcg members under CONFIG_MEMCG_V1
    (will post as a separate patchset)

rfc:
  https://lwn.net/Articles/973082/

[1]: https://lwn.net/Articles/974575/

MAINTAINERS                |    2 +
include/linux/memcontrol.h |  156 +-
init/Kconfig               |    9 +
mm/Makefile                |    2 +
mm/memcontrol-v1.c         | 2933 ++++++++++++++++++++++++++++++++++++++++
mm/memcontrol-v1.h         |  132 ++
mm/memcontrol.c            | 4169 +++++++++++-------------------------------
mm/vmscan.c                |   10 +-
8 files changed, 3794 insertions(+), 3619 deletions(-)

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>


Roman Gushchin (14):
  mm: memcg: introduce memcontrol-v1.c
  mm: memcg: move soft limit reclaim code to memcontrol-v1.c
  mm: memcg: rename soft limit reclaim-related functions
  mm: memcg: move charge migration code to memcontrol-v1.c
  mm: memcg: rename charge move-related functions
  mm: memcg: move legacy memcg event code into memcontrol-v1.c
  mm: memcg: rename memcg_check_events()
  mm: memcg: move cgroup v1 oom handling code into memcontrol-v1.c
  mm: memcg: rename memcg_oom_recover()
  mm: memcg: move cgroup v1 interface files to memcontrol-v1.c
  mm: memcg: make memcg1_update_tree() static
  mm: memcg: group cgroup v1 memcg related declarations
  mm: memcg: put cgroup v1-related members of task_struct under config
    option
  MAINTAINERS: add mm/memcontrol-v1.c/h to the list of maintained files

 MAINTAINERS                |    2 +
 include/linux/memcontrol.h |  156 +-
 init/Kconfig               |    9 +
 mm/Makefile                |    2 +
 mm/memcontrol-v1.c         | 2933 +++++++++++++++++++++++++
 mm/memcontrol-v1.h         |  132 ++
 mm/memcontrol.c            | 4141 ++++++------------------------------
 mm/vmscan.c                |   10 +-
 8 files changed, 3780 insertions(+), 3605 deletions(-)
 create mode 100644 mm/memcontrol-v1.c
 create mode 100644 mm/memcontrol-v1.h

-- 
2.45.2




* [PATCH v2 01/14] mm: memcg: introduce memcontrol-v1.c
  2024-06-25  0:58 [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option Roman Gushchin
@ 2024-06-25  0:58 ` Roman Gushchin
  2024-06-25  7:05   ` Michal Hocko
  2024-06-25  0:58 ` [PATCH v2 02/14] mm: memcg: move soft limit reclaim code to memcontrol-v1.c Roman Gushchin
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2024-06-25  0:58 UTC
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm, Roman Gushchin

This patch introduces the mm/memcontrol-v1.c source file, which will be used
for all legacy (cgroup v1) memory cgroup code. It also introduces
mm/memcontrol-v1.h to keep declarations shared between mm/memcontrol.c and
mm/memcontrol-v1.c.

As of now, let's compile it whenever CONFIG_MEMCG is set, just like
mm/memcontrol.c. Later on it can be switched to a separate config option, so
that the legacy code won't be compiled if not required.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/Makefile        | 3 ++-
 mm/memcontrol-v1.c | 3 +++
 mm/memcontrol-v1.h | 7 +++++++
 3 files changed, 12 insertions(+), 1 deletion(-)
 create mode 100644 mm/memcontrol-v1.c
 create mode 100644 mm/memcontrol-v1.h

diff --git a/mm/Makefile b/mm/Makefile
index 8fb85acda1b1..124d4dea2035 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -26,6 +26,7 @@ KCOV_INSTRUMENT_page_alloc.o := n
 KCOV_INSTRUMENT_debug-pagealloc.o := n
 KCOV_INSTRUMENT_kmemleak.o := n
 KCOV_INSTRUMENT_memcontrol.o := n
+KCOV_INSTRUMENT_memcontrol-v1.o := n
 KCOV_INSTRUMENT_mmzone.o := n
 KCOV_INSTRUMENT_vmstat.o := n
 KCOV_INSTRUMENT_failslab.o := n
@@ -95,7 +96,7 @@ obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
-obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
+obj-$(CONFIG_MEMCG) += memcontrol.o memcontrol-v1.o vmpressure.o
 ifdef CONFIG_SWAP
 obj-$(CONFIG_MEMCG) += swap_cgroup.o
 endif
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
new file mode 100644
index 000000000000..a941446ba575
--- /dev/null
+++ b/mm/memcontrol-v1.c
@@ -0,0 +1,3 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+
+#include "memcontrol-v1.h"
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
new file mode 100644
index 000000000000..7c5f094755ff
--- /dev/null
+++ b/mm/memcontrol-v1.h
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+
+#ifndef __MM_MEMCONTROL_V1_H
+#define __MM_MEMCONTROL_V1_H
+
+
+#endif	/* __MM_MEMCONTROL_V1_H */
-- 
2.45.2




* [PATCH v2 02/14] mm: memcg: move soft limit reclaim code to memcontrol-v1.c
  2024-06-25  0:58 [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option Roman Gushchin
  2024-06-25  0:58 ` [PATCH v2 01/14] mm: memcg: introduce memcontrol-v1.c Roman Gushchin
@ 2024-06-25  0:58 ` Roman Gushchin
  2024-06-25  7:06   ` Michal Hocko
  2024-06-25  0:58 ` [PATCH v2 03/14] mm: memcg: rename soft limit reclaim-related functions Roman Gushchin
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2024-06-25  0:58 UTC
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm, Roman Gushchin

Soft limits are cgroup v1-specific and are not supported by cgroup v2,
so let's move the corresponding code into memcontrol-v1.c.

Aside from simply moving the code, this commit introduces a trivial
memcg1_soft_limit_reset() function to reset soft limits and also moves
the global soft limit tree initialization code into a new memcg1_init()
function.

It also moves corresponding declarations shared between memcontrol.c
and memcontrol-v1.c into mm/memcontrol-v1.h.
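
For context, the soft limit is a v1-only knob: cgroups exceeding their soft
limit are reclaimed from first when the system is under global memory
pressure. A hypothetical usage (the path depends on where cgroup v1 is
mounted):

  # cgroup A may be reclaimed down towards 512M under global pressure
  echo 512M > /sys/fs/cgroup/memory/A/memory.soft_limit_in_bytes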

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/memcontrol-v1.c | 342 +++++++++++++++++++++++++++++++++++++++++++++
 mm/memcontrol-v1.h |   7 +
 mm/memcontrol.c    | 337 +-------------------------------------------
 3 files changed, 353 insertions(+), 333 deletions(-)

diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index a941446ba575..2ccb8406fa84 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -1,3 +1,345 @@
 // SPDX-License-Identifier: GPL-2.0-or-later
 
+#include <linux/memcontrol.h>
+#include <linux/swap.h>
+#include <linux/mm_inline.h>
+
 #include "memcontrol-v1.h"
+
+/*
+ * Cgroups above their limits are maintained in a RB-Tree, independent of
+ * their hierarchy representation
+ */
+
+struct mem_cgroup_tree_per_node {
+	struct rb_root rb_root;
+	struct rb_node *rb_rightmost;
+	spinlock_t lock;
+};
+
+struct mem_cgroup_tree {
+	struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
+};
+
+static struct mem_cgroup_tree soft_limit_tree __read_mostly;
+
+/*
+ * Maximum loops in mem_cgroup_soft_reclaim(), used for soft
+ * limit reclaim to prevent infinite loops, if they ever occur.
+ */
+#define	MEM_CGROUP_MAX_RECLAIM_LOOPS		100
+#define	MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS	2
+
+static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
+					 struct mem_cgroup_tree_per_node *mctz,
+					 unsigned long new_usage_in_excess)
+{
+	struct rb_node **p = &mctz->rb_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct mem_cgroup_per_node *mz_node;
+	bool rightmost = true;
+
+	if (mz->on_tree)
+		return;
+
+	mz->usage_in_excess = new_usage_in_excess;
+	if (!mz->usage_in_excess)
+		return;
+	while (*p) {
+		parent = *p;
+		mz_node = rb_entry(parent, struct mem_cgroup_per_node,
+					tree_node);
+		if (mz->usage_in_excess < mz_node->usage_in_excess) {
+			p = &(*p)->rb_left;
+			rightmost = false;
+		} else {
+			p = &(*p)->rb_right;
+		}
+	}
+
+	if (rightmost)
+		mctz->rb_rightmost = &mz->tree_node;
+
+	rb_link_node(&mz->tree_node, parent, p);
+	rb_insert_color(&mz->tree_node, &mctz->rb_root);
+	mz->on_tree = true;
+}
+
+static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
+					 struct mem_cgroup_tree_per_node *mctz)
+{
+	if (!mz->on_tree)
+		return;
+
+	if (&mz->tree_node == mctz->rb_rightmost)
+		mctz->rb_rightmost = rb_prev(&mz->tree_node);
+
+	rb_erase(&mz->tree_node, &mctz->rb_root);
+	mz->on_tree = false;
+}
+
+static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
+				       struct mem_cgroup_tree_per_node *mctz)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&mctz->lock, flags);
+	__mem_cgroup_remove_exceeded(mz, mctz);
+	spin_unlock_irqrestore(&mctz->lock, flags);
+}
+
+static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
+{
+	unsigned long nr_pages = page_counter_read(&memcg->memory);
+	unsigned long soft_limit = READ_ONCE(memcg->soft_limit);
+	unsigned long excess = 0;
+
+	if (nr_pages > soft_limit)
+		excess = nr_pages - soft_limit;
+
+	return excess;
+}
+
+void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid)
+{
+	unsigned long excess;
+	struct mem_cgroup_per_node *mz;
+	struct mem_cgroup_tree_per_node *mctz;
+
+	if (lru_gen_enabled()) {
+		if (soft_limit_excess(memcg))
+			lru_gen_soft_reclaim(memcg, nid);
+		return;
+	}
+
+	mctz = soft_limit_tree.rb_tree_per_node[nid];
+	if (!mctz)
+		return;
+	/*
+	 * Necessary to update all ancestors when hierarchy is used,
+	 * because their event counter is not touched.
+	 */
+	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+		mz = memcg->nodeinfo[nid];
+		excess = soft_limit_excess(memcg);
+		/*
+		 * We have to update the tree if mz is on RB-tree or
+		 * mem is over its softlimit.
+		 */
+		if (excess || mz->on_tree) {
+			unsigned long flags;
+
+			spin_lock_irqsave(&mctz->lock, flags);
+			/* if on-tree, remove it */
+			if (mz->on_tree)
+				__mem_cgroup_remove_exceeded(mz, mctz);
+			/*
+			 * Insert again. mz->usage_in_excess will be updated.
+			 * If excess is 0, no tree ops.
+			 */
+			__mem_cgroup_insert_exceeded(mz, mctz, excess);
+			spin_unlock_irqrestore(&mctz->lock, flags);
+		}
+	}
+}
+
+void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup_tree_per_node *mctz;
+	struct mem_cgroup_per_node *mz;
+	int nid;
+
+	for_each_node(nid) {
+		mz = memcg->nodeinfo[nid];
+		mctz = soft_limit_tree.rb_tree_per_node[nid];
+		if (mctz)
+			mem_cgroup_remove_exceeded(mz, mctz);
+	}
+}
+
+static struct mem_cgroup_per_node *
+__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
+{
+	struct mem_cgroup_per_node *mz;
+
+retry:
+	mz = NULL;
+	if (!mctz->rb_rightmost)
+		goto done;		/* Nothing to reclaim from */
+
+	mz = rb_entry(mctz->rb_rightmost,
+		      struct mem_cgroup_per_node, tree_node);
+	/*
+	 * Remove the node now but someone else can add it back,
+	 * we will add it back at the end of reclaim to its correct
+	 * position in the tree.
+	 */
+	__mem_cgroup_remove_exceeded(mz, mctz);
+	if (!soft_limit_excess(mz->memcg) ||
+	    !css_tryget(&mz->memcg->css))
+		goto retry;
+done:
+	return mz;
+}
+
+static struct mem_cgroup_per_node *
+mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
+{
+	struct mem_cgroup_per_node *mz;
+
+	spin_lock_irq(&mctz->lock);
+	mz = __mem_cgroup_largest_soft_limit_node(mctz);
+	spin_unlock_irq(&mctz->lock);
+	return mz;
+}
+
+static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
+				   pg_data_t *pgdat,
+				   gfp_t gfp_mask,
+				   unsigned long *total_scanned)
+{
+	struct mem_cgroup *victim = NULL;
+	int total = 0;
+	int loop = 0;
+	unsigned long excess;
+	unsigned long nr_scanned;
+	struct mem_cgroup_reclaim_cookie reclaim = {
+		.pgdat = pgdat,
+	};
+
+	excess = soft_limit_excess(root_memcg);
+
+	while (1) {
+		victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
+		if (!victim) {
+			loop++;
+			if (loop >= 2) {
+				/*
+				 * If we have not been able to reclaim
+				 * anything, it might be because there are
+				 * no reclaimable pages under this hierarchy
+				 */
+				if (!total)
+					break;
+				/*
+				 * We want to do more targeted reclaim.
+				 * excess >> 2 is not too excessive so as to
+				 * reclaim too much, nor too little so that we
+				 * keep coming back to reclaim from this cgroup
+				 */
+				if (total >= (excess >> 2) ||
+					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
+					break;
+			}
+			continue;
+		}
+		total += mem_cgroup_shrink_node(victim, gfp_mask, false,
+					pgdat, &nr_scanned);
+		*total_scanned += nr_scanned;
+		if (!soft_limit_excess(root_memcg))
+			break;
+	}
+	mem_cgroup_iter_break(root_memcg, victim);
+	return total;
+}
+
+unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
+					    gfp_t gfp_mask,
+					    unsigned long *total_scanned)
+{
+	unsigned long nr_reclaimed = 0;
+	struct mem_cgroup_per_node *mz, *next_mz = NULL;
+	unsigned long reclaimed;
+	int loop = 0;
+	struct mem_cgroup_tree_per_node *mctz;
+	unsigned long excess;
+
+	if (lru_gen_enabled())
+		return 0;
+
+	if (order > 0)
+		return 0;
+
+	mctz = soft_limit_tree.rb_tree_per_node[pgdat->node_id];
+
+	/*
+	 * Do not even bother to check the largest node if the root
+	 * is empty. Do it lockless to prevent lock bouncing. Races
+	 * are acceptable as soft limit is best effort anyway.
+	 */
+	if (!mctz || RB_EMPTY_ROOT(&mctz->rb_root))
+		return 0;
+
+	/*
+	 * This loop can run for a while, especially if mem_cgroups
+	 * continuously keep exceeding their soft limit and putting the
+	 * system under pressure
+	 */
+	do {
+		if (next_mz)
+			mz = next_mz;
+		else
+			mz = mem_cgroup_largest_soft_limit_node(mctz);
+		if (!mz)
+			break;
+
+		reclaimed = mem_cgroup_soft_reclaim(mz->memcg, pgdat,
+						    gfp_mask, total_scanned);
+		nr_reclaimed += reclaimed;
+		spin_lock_irq(&mctz->lock);
+
+		/*
+		 * If we failed to reclaim anything from this memory cgroup
+		 * it is time to move on to the next cgroup
+		 */
+		next_mz = NULL;
+		if (!reclaimed)
+			next_mz = __mem_cgroup_largest_soft_limit_node(mctz);
+
+		excess = soft_limit_excess(mz->memcg);
+		/*
+		 * One school of thought says that we should not add
+		 * back the node to the tree if reclaim returns 0.
+		 * But our reclaim could return 0, simply because due
+		 * to priority we are exposing a smaller subset of
+		 * memory to reclaim from. Consider this as a longer
+		 * term TODO.
+		 */
+		/* If excess == 0, no tree ops */
+		__mem_cgroup_insert_exceeded(mz, mctz, excess);
+		spin_unlock_irq(&mctz->lock);
+		css_put(&mz->memcg->css);
+		loop++;
+		/*
+		 * Could not reclaim anything and there are no more
+		 * mem cgroups to try or we seem to be looping without
+		 * reclaiming anything.
+		 */
+		if (!nr_reclaimed &&
+			(next_mz == NULL ||
+			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
+			break;
+	} while (!nr_reclaimed);
+	if (next_mz)
+		css_put(&next_mz->memcg->css);
+	return nr_reclaimed;
+}
+
+static int __init memcg1_init(void)
+{
+	int node;
+
+	for_each_node(node) {
+		struct mem_cgroup_tree_per_node *rtpn;
+
+		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, node);
+
+		rtpn->rb_root = RB_ROOT;
+		rtpn->rb_rightmost = NULL;
+		spin_lock_init(&rtpn->lock);
+		soft_limit_tree.rb_tree_per_node[node] = rtpn;
+	}
+
+	return 0;
+}
+subsys_initcall(memcg1_init);
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index 7c5f094755ff..4da6fa561c6d 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -3,5 +3,12 @@
 #ifndef __MM_MEMCONTROL_V1_H
 #define __MM_MEMCONTROL_V1_H
 
+void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid);
+void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg);
+
+static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg)
+{
+	WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
+}
 
 #endif	/* __MM_MEMCONTROL_V1_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 974bd160838c..003e944f34ea 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -72,6 +72,7 @@
 #include <net/ip.h>
 #include "slab.h"
 #include "swap.h"
+#include "memcontrol-v1.h"
 
 #include <linux/uaccess.h>
 
@@ -108,23 +109,6 @@ static bool do_memsw_account(void)
 #define THRESHOLDS_EVENTS_TARGET 128
 #define SOFTLIMIT_EVENTS_TARGET 1024
 
-/*
- * Cgroups above their limits are maintained in a RB-Tree, independent of
- * their hierarchy representation
- */
-
-struct mem_cgroup_tree_per_node {
-	struct rb_root rb_root;
-	struct rb_node *rb_rightmost;
-	spinlock_t lock;
-};
-
-struct mem_cgroup_tree {
-	struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
-};
-
-static struct mem_cgroup_tree soft_limit_tree __read_mostly;
-
 /* for OOM */
 struct mem_cgroup_eventfd_list {
 	struct list_head list;
@@ -199,13 +183,6 @@ static struct move_charge_struct {
 	.waitq = __WAIT_QUEUE_HEAD_INITIALIZER(mc.waitq),
 };
 
-/*
- * Maximum loops in mem_cgroup_soft_reclaim(), used for soft
- * limit reclaim to prevent infinite loops, if they ever occur.
- */
-#define	MEM_CGROUP_MAX_RECLAIM_LOOPS		100
-#define	MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS	2
-
 /* for encoding cft->private value on file */
 enum res_type {
 	_MEM,
@@ -413,169 +390,6 @@ ino_t page_cgroup_ino(struct page *page)
 	return ino;
 }
 
-static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
-					 struct mem_cgroup_tree_per_node *mctz,
-					 unsigned long new_usage_in_excess)
-{
-	struct rb_node **p = &mctz->rb_root.rb_node;
-	struct rb_node *parent = NULL;
-	struct mem_cgroup_per_node *mz_node;
-	bool rightmost = true;
-
-	if (mz->on_tree)
-		return;
-
-	mz->usage_in_excess = new_usage_in_excess;
-	if (!mz->usage_in_excess)
-		return;
-	while (*p) {
-		parent = *p;
-		mz_node = rb_entry(parent, struct mem_cgroup_per_node,
-					tree_node);
-		if (mz->usage_in_excess < mz_node->usage_in_excess) {
-			p = &(*p)->rb_left;
-			rightmost = false;
-		} else {
-			p = &(*p)->rb_right;
-		}
-	}
-
-	if (rightmost)
-		mctz->rb_rightmost = &mz->tree_node;
-
-	rb_link_node(&mz->tree_node, parent, p);
-	rb_insert_color(&mz->tree_node, &mctz->rb_root);
-	mz->on_tree = true;
-}
-
-static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
-					 struct mem_cgroup_tree_per_node *mctz)
-{
-	if (!mz->on_tree)
-		return;
-
-	if (&mz->tree_node == mctz->rb_rightmost)
-		mctz->rb_rightmost = rb_prev(&mz->tree_node);
-
-	rb_erase(&mz->tree_node, &mctz->rb_root);
-	mz->on_tree = false;
-}
-
-static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
-				       struct mem_cgroup_tree_per_node *mctz)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&mctz->lock, flags);
-	__mem_cgroup_remove_exceeded(mz, mctz);
-	spin_unlock_irqrestore(&mctz->lock, flags);
-}
-
-static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
-{
-	unsigned long nr_pages = page_counter_read(&memcg->memory);
-	unsigned long soft_limit = READ_ONCE(memcg->soft_limit);
-	unsigned long excess = 0;
-
-	if (nr_pages > soft_limit)
-		excess = nr_pages - soft_limit;
-
-	return excess;
-}
-
-static void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid)
-{
-	unsigned long excess;
-	struct mem_cgroup_per_node *mz;
-	struct mem_cgroup_tree_per_node *mctz;
-
-	if (lru_gen_enabled()) {
-		if (soft_limit_excess(memcg))
-			lru_gen_soft_reclaim(memcg, nid);
-		return;
-	}
-
-	mctz = soft_limit_tree.rb_tree_per_node[nid];
-	if (!mctz)
-		return;
-	/*
-	 * Necessary to update all ancestors when hierarchy is used.
-	 * because their event counter is not touched.
-	 */
-	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
-		mz = memcg->nodeinfo[nid];
-		excess = soft_limit_excess(memcg);
-		/*
-		 * We have to update the tree if mz is on RB-tree or
-		 * mem is over its softlimit.
-		 */
-		if (excess || mz->on_tree) {
-			unsigned long flags;
-
-			spin_lock_irqsave(&mctz->lock, flags);
-			/* if on-tree, remove it */
-			if (mz->on_tree)
-				__mem_cgroup_remove_exceeded(mz, mctz);
-			/*
-			 * Insert again. mz->usage_in_excess will be updated.
-			 * If excess is 0, no tree ops.
-			 */
-			__mem_cgroup_insert_exceeded(mz, mctz, excess);
-			spin_unlock_irqrestore(&mctz->lock, flags);
-		}
-	}
-}
-
-static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
-{
-	struct mem_cgroup_tree_per_node *mctz;
-	struct mem_cgroup_per_node *mz;
-	int nid;
-
-	for_each_node(nid) {
-		mz = memcg->nodeinfo[nid];
-		mctz = soft_limit_tree.rb_tree_per_node[nid];
-		if (mctz)
-			mem_cgroup_remove_exceeded(mz, mctz);
-	}
-}
-
-static struct mem_cgroup_per_node *
-__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
-{
-	struct mem_cgroup_per_node *mz;
-
-retry:
-	mz = NULL;
-	if (!mctz->rb_rightmost)
-		goto done;		/* Nothing to reclaim from */
-
-	mz = rb_entry(mctz->rb_rightmost,
-		      struct mem_cgroup_per_node, tree_node);
-	/*
-	 * Remove the node now but someone else can add it back,
-	 * we will to add it back at the end of reclaim to its correct
-	 * position in the tree.
-	 */
-	__mem_cgroup_remove_exceeded(mz, mctz);
-	if (!soft_limit_excess(mz->memcg) ||
-	    !css_tryget(&mz->memcg->css))
-		goto retry;
-done:
-	return mz;
-}
-
-static struct mem_cgroup_per_node *
-mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
-{
-	struct mem_cgroup_per_node *mz;
-
-	spin_lock_irq(&mctz->lock);
-	mz = __mem_cgroup_largest_soft_limit_node(mctz);
-	spin_unlock_irq(&mctz->lock);
-	return mz;
-}
-
 /* Subset of node_stat_item for memcg stats */
 static const unsigned int memcg_node_stat_items[] = {
 	NR_INACTIVE_ANON,
@@ -1980,56 +1794,6 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	return ret;
 }
 
-static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
-				   pg_data_t *pgdat,
-				   gfp_t gfp_mask,
-				   unsigned long *total_scanned)
-{
-	struct mem_cgroup *victim = NULL;
-	int total = 0;
-	int loop = 0;
-	unsigned long excess;
-	unsigned long nr_scanned;
-	struct mem_cgroup_reclaim_cookie reclaim = {
-		.pgdat = pgdat,
-	};
-
-	excess = soft_limit_excess(root_memcg);
-
-	while (1) {
-		victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
-		if (!victim) {
-			loop++;
-			if (loop >= 2) {
-				/*
-				 * If we have not been able to reclaim
-				 * anything, it might because there are
-				 * no reclaimable pages under this hierarchy
-				 */
-				if (!total)
-					break;
-				/*
-				 * We want to do more targeted reclaim.
-				 * excess >> 2 is not to excessive so as to
-				 * reclaim too much, nor too less that we keep
-				 * coming back to reclaim from this cgroup
-				 */
-				if (total >= (excess >> 2) ||
-					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
-					break;
-			}
-			continue;
-		}
-		total += mem_cgroup_shrink_node(victim, gfp_mask, false,
-					pgdat, &nr_scanned);
-		*total_scanned += nr_scanned;
-		if (!soft_limit_excess(root_memcg))
-			break;
-	}
-	mem_cgroup_iter_break(root_memcg, victim);
-	return total;
-}
-
 #ifdef CONFIG_LOCKDEP
 static struct lockdep_map memcg_oom_lock_dep_map = {
 	.name = "memcg_oom_lock",
@@ -3925,88 +3689,6 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 	return ret;
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
-					    gfp_t gfp_mask,
-					    unsigned long *total_scanned)
-{
-	unsigned long nr_reclaimed = 0;
-	struct mem_cgroup_per_node *mz, *next_mz = NULL;
-	unsigned long reclaimed;
-	int loop = 0;
-	struct mem_cgroup_tree_per_node *mctz;
-	unsigned long excess;
-
-	if (lru_gen_enabled())
-		return 0;
-
-	if (order > 0)
-		return 0;
-
-	mctz = soft_limit_tree.rb_tree_per_node[pgdat->node_id];
-
-	/*
-	 * Do not even bother to check the largest node if the root
-	 * is empty. Do it lockless to prevent lock bouncing. Races
-	 * are acceptable as soft limit is best effort anyway.
-	 */
-	if (!mctz || RB_EMPTY_ROOT(&mctz->rb_root))
-		return 0;
-
-	/*
-	 * This loop can run a while, specially if mem_cgroup's continuously
-	 * keep exceeding their soft limit and putting the system under
-	 * pressure
-	 */
-	do {
-		if (next_mz)
-			mz = next_mz;
-		else
-			mz = mem_cgroup_largest_soft_limit_node(mctz);
-		if (!mz)
-			break;
-
-		reclaimed = mem_cgroup_soft_reclaim(mz->memcg, pgdat,
-						    gfp_mask, total_scanned);
-		nr_reclaimed += reclaimed;
-		spin_lock_irq(&mctz->lock);
-
-		/*
-		 * If we failed to reclaim anything from this memory cgroup
-		 * it is time to move on to the next cgroup
-		 */
-		next_mz = NULL;
-		if (!reclaimed)
-			next_mz = __mem_cgroup_largest_soft_limit_node(mctz);
-
-		excess = soft_limit_excess(mz->memcg);
-		/*
-		 * One school of thought says that we should not add
-		 * back the node to the tree if reclaim returns 0.
-		 * But our reclaim could return 0, simply because due
-		 * to priority we are exposing a smaller subset of
-		 * memory to reclaim from. Consider this as a longer
-		 * term TODO.
-		 */
-		/* If excess == 0, no tree ops */
-		__mem_cgroup_insert_exceeded(mz, mctz, excess);
-		spin_unlock_irq(&mctz->lock);
-		css_put(&mz->memcg->css);
-		loop++;
-		/*
-		 * Could not reclaim anything and there are no more
-		 * mem cgroups to try or we seem to be looping without
-		 * reclaiming anything.
-		 */
-		if (!nr_reclaimed &&
-			(next_mz == NULL ||
-			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
-			break;
-	} while (!nr_reclaimed);
-	if (next_mz)
-		css_put(&next_mz->memcg->css);
-	return nr_reclaimed;
-}
-
 /*
  * Reclaims as many pages from the given memcg as possible.
  *
@@ -5784,7 +5466,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
 		return ERR_CAST(memcg);
 
 	page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
-	WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
+	memcg1_soft_limit_reset(memcg);
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
 	memcg->zswap_max = PAGE_COUNTER_MAX;
 	WRITE_ONCE(memcg->zswap_writeback,
@@ -5957,7 +5639,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
 	page_counter_set_min(&memcg->memory, 0);
 	page_counter_set_low(&memcg->memory, 0);
 	page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
-	WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
+	memcg1_soft_limit_reset(memcg);
 	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
 	memcg_wb_domain_size_changed(memcg);
 }
@@ -7984,7 +7666,7 @@ __setup("cgroup.memory=", cgroup_memory);
  */
 static int __init mem_cgroup_init(void)
 {
-	int cpu, node;
+	int cpu;
 
 	/*
 	 * Currently s32 type (can refer to struct batched_lruvec_stat) is
@@ -8001,17 +7683,6 @@ static int __init mem_cgroup_init(void)
 		INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
 			  drain_local_stock);
 
-	for_each_node(node) {
-		struct mem_cgroup_tree_per_node *rtpn;
-
-		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, node);
-
-		rtpn->rb_root = RB_ROOT;
-		rtpn->rb_rightmost = NULL;
-		spin_lock_init(&rtpn->lock);
-		soft_limit_tree.rb_tree_per_node[node] = rtpn;
-	}
-
 	return 0;
 }
 subsys_initcall(mem_cgroup_init);
-- 
2.45.2




* [PATCH v2 03/14] mm: memcg: rename soft limit reclaim-related functions
  2024-06-25  0:58 [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option Roman Gushchin
  2024-06-25  0:58 ` [PATCH v2 01/14] mm: memcg: introduce memcontrol-v1.c Roman Gushchin
  2024-06-25  0:58 ` [PATCH v2 02/14] mm: memcg: move soft limit reclaim code to memcontrol-v1.c Roman Gushchin
@ 2024-06-25  0:58 ` Roman Gushchin
  2024-06-25  7:06   ` Michal Hocko
  2024-06-25  0:58 ` [PATCH v2 04/14] mm: memcg: move charge migration code to memcontrol-v1.c Roman Gushchin
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2024-06-25  0:58 UTC
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm, Roman Gushchin

Rename the exported functions related to soft limit reclaim to use
the memcg1_ prefix.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/linux/memcontrol.h | 12 ++++++------
 mm/memcontrol-v1.c         |  6 +++---
 mm/memcontrol-v1.h         |  4 ++--
 mm/memcontrol.c            |  4 ++--
 mm/vmscan.c                | 10 +++++-----
 5 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7403dd5926eb..83c8327455d8 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1121,9 +1121,9 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
 
 void split_page_memcg(struct page *head, int old_order, int new_order);
 
-unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
-						gfp_t gfp_mask,
-						unsigned long *total_scanned);
+unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order,
+					gfp_t gfp_mask,
+					unsigned long *total_scanned);
 
 #else /* CONFIG_MEMCG */
 
@@ -1572,9 +1572,9 @@ static inline void split_page_memcg(struct page *head, int old_order, int new_or
 }
 
 static inline
-unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
-					    gfp_t gfp_mask,
-					    unsigned long *total_scanned)
+unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order,
+					gfp_t gfp_mask,
+					unsigned long *total_scanned)
 {
 	return 0;
 }
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 2ccb8406fa84..68e2f1a718d3 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -100,7 +100,7 @@ static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
 	return excess;
 }
 
-void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid)
+void memcg1_update_tree(struct mem_cgroup *memcg, int nid)
 {
 	unsigned long excess;
 	struct mem_cgroup_per_node *mz;
@@ -143,7 +143,7 @@ void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid)
 	}
 }
 
-void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
+void memcg1_remove_from_trees(struct mem_cgroup *memcg)
 {
 	struct mem_cgroup_tree_per_node *mctz;
 	struct mem_cgroup_per_node *mz;
@@ -243,7 +243,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
 	return total;
 }
 
-unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
+unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order,
 					    gfp_t gfp_mask,
 					    unsigned long *total_scanned)
 {
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index 4da6fa561c6d..e37bc7e8d955 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -3,8 +3,8 @@
 #ifndef __MM_MEMCONTROL_V1_H
 #define __MM_MEMCONTROL_V1_H
 
-void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid);
-void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg);
+void memcg1_update_tree(struct mem_cgroup *memcg, int nid);
+void memcg1_remove_from_trees(struct mem_cgroup *memcg);
 
 static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 003e944f34ea..3479e1af12d5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1012,7 +1012,7 @@ static void memcg_check_events(struct mem_cgroup *memcg, int nid)
 						MEM_CGROUP_TARGET_SOFTLIMIT);
 		mem_cgroup_threshold(memcg);
 		if (unlikely(do_softlimit))
-			mem_cgroup_update_tree(memcg, nid);
+			memcg1_update_tree(memcg, nid);
 	}
 }
 
@@ -5610,7 +5610,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
 
 	vmpressure_cleanup(&memcg->vmpressure);
 	cancel_work_sync(&memcg->high_work);
-	mem_cgroup_remove_from_trees(memcg);
+	memcg1_remove_from_trees(memcg);
 	free_shrinker_info(memcg);
 	mem_cgroup_free(memcg);
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 900bad16e506..3d4c681c6d40 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6186,9 +6186,9 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 			 * and balancing, not for a memcg's limit.
 			 */
 			nr_soft_scanned = 0;
-			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone->zone_pgdat,
-						sc->order, sc->gfp_mask,
-						&nr_soft_scanned);
+			nr_soft_reclaimed = memcg1_soft_limit_reclaim(zone->zone_pgdat,
+								      sc->order, sc->gfp_mask,
+								      &nr_soft_scanned);
 			sc->nr_reclaimed += nr_soft_reclaimed;
 			sc->nr_scanned += nr_soft_scanned;
 			/* need some check for avoid more shrink_zone() */
@@ -6952,8 +6952,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
 		/* Call soft limit reclaim before calling shrink_node. */
 		sc.nr_scanned = 0;
 		nr_soft_scanned = 0;
-		nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(pgdat, sc.order,
-						sc.gfp_mask, &nr_soft_scanned);
+		nr_soft_reclaimed = memcg1_soft_limit_reclaim(pgdat, sc.order,
+							      sc.gfp_mask, &nr_soft_scanned);
 		sc.nr_reclaimed += nr_soft_reclaimed;
 
 		/*
-- 
2.45.2




* [PATCH v2 04/14] mm: memcg: move charge migration code to memcontrol-v1.c
  2024-06-25  0:58 [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option Roman Gushchin
                   ` (2 preceding siblings ...)
  2024-06-25  0:58 ` [PATCH v2 03/14] mm: memcg: rename soft limit reclaim-related functions Roman Gushchin
@ 2024-06-25  0:58 ` Roman Gushchin
  2024-06-25  7:07   ` Michal Hocko
  2024-06-25  0:58 ` [PATCH v2 05/14] mm: memcg: rename charge move-related functions Roman Gushchin
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2024-06-25  0:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm, Roman Gushchin

Unlike the legacy cgroup v1 memory controller, the cgroup v2 memory
controller doesn't support moving charged pages between cgroups.

It's a fairly large and complicated piece of code, which has created a number
of problems in the past. Let's move this code into memcontrol-v1.c. It shaves
off 1k lines from memcontrol.c. It's also another step towards making the
legacy memory controller code optionally compiled.
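
For context, charge moving is an opt-in, v1-only interface: userspace enables
it via memory.move_charge_at_immigrate (per the MOVE_ANON/MOVE_FILE flags
below, bit 0 covers anonymous pages and bit 1 file pages), and charges then
follow a task when it is migrated into the cgroup. A hypothetical usage (the
paths depend on where cgroup v1 is mounted):

  # move both anonymous and file page charges along with migrated tasks
  echo 3 > /sys/fs/cgroup/memory/A/memory.move_charge_at_immigrate
  # migrating a task into A now also moves its charges from the old cgroup
  echo $PID > /sys/fs/cgroup/memory/A/cgroup.procs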

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/memcontrol-v1.c |  981 +++++++++++++++++++++++++++++++++++++++++++
 mm/memcontrol-v1.h |   30 ++
 mm/memcontrol.c    | 1004 +-------------------------------------------
 3 files changed, 1019 insertions(+), 996 deletions(-)

diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 68e2f1a718d3..f4c8bec5ae1b 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -3,7 +3,12 @@
 #include <linux/memcontrol.h>
 #include <linux/swap.h>
 #include <linux/mm_inline.h>
+#include <linux/pagewalk.h>
+#include <linux/backing-dev.h>
+#include <linux/swap_cgroup.h>
 
+#include "internal.h"
+#include "swap.h"
 #include "memcontrol-v1.h"
 
 /*
@@ -30,6 +35,31 @@ static struct mem_cgroup_tree soft_limit_tree __read_mostly;
 #define	MEM_CGROUP_MAX_RECLAIM_LOOPS		100
 #define	MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS	2
 
+/* Stuffs for move charges at task migration. */
+/*
+ * Types of charges to be moved.
+ */
+#define MOVE_ANON	0x1U
+#define MOVE_FILE	0x2U
+#define MOVE_MASK	(MOVE_ANON | MOVE_FILE)
+
+/* "mc" and its members are protected by cgroup_mutex */
+static struct move_charge_struct {
+	spinlock_t	  lock; /* for from, to */
+	struct mm_struct  *mm;
+	struct mem_cgroup *from;
+	struct mem_cgroup *to;
+	unsigned long flags;
+	unsigned long precharge;
+	unsigned long moved_charge;
+	unsigned long moved_swap;
+	struct task_struct *moving_task;	/* a task moving charges */
+	wait_queue_head_t waitq;		/* a waitq for other context */
+} mc = {
+	.lock = __SPIN_LOCK_UNLOCKED(mc.lock),
+	.waitq = __WAIT_QUEUE_HEAD_INITIALIZER(mc.waitq),
+};
+
 static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
 					 struct mem_cgroup_tree_per_node *mctz,
 					 unsigned long new_usage_in_excess)
@@ -325,6 +355,957 @@ unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order,
 	return nr_reclaimed;
 }
 
+/*
+ * A routine for checking whether "mem" is under move_account() or not.
+ *
+ * Checking whether a cgroup is mc.from or mc.to or under the hierarchy
+ * of moving cgroups. This is for waiting at high memory pressure
+ * caused by "move".
+ */
+static bool mem_cgroup_under_move(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *from;
+	struct mem_cgroup *to;
+	bool ret = false;
+	/*
+	 * Unlike task_move routines, we access mc.to, mc.from not under
+	 * mutual exclusion by cgroup_mutex. Here, we take spinlock instead.
+	 */
+	spin_lock(&mc.lock);
+	from = mc.from;
+	to = mc.to;
+	if (!from)
+		goto unlock;
+
+	ret = mem_cgroup_is_descendant(from, memcg) ||
+		mem_cgroup_is_descendant(to, memcg);
+unlock:
+	spin_unlock(&mc.lock);
+	return ret;
+}
+
+bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
+{
+	if (mc.moving_task && current != mc.moving_task) {
+		if (mem_cgroup_under_move(memcg)) {
+			DEFINE_WAIT(wait);
+			prepare_to_wait(&mc.waitq, &wait, TASK_INTERRUPTIBLE);
+			/* moving charge context might have finished. */
+			if (mc.moving_task)
+				schedule();
+			finish_wait(&mc.waitq, &wait);
+			return true;
+		}
+	}
+	return false;
+}
+
+/**
+ * folio_memcg_lock - Bind a folio to its memcg.
+ * @folio: The folio.
+ *
+ * This function prevents unlocked LRU folios from being moved to
+ * another cgroup.
+ *
+ * It ensures lifetime of the bound memcg.  The caller is responsible
+ * for the lifetime of the folio.
+ */
+void folio_memcg_lock(struct folio *folio)
+{
+	struct mem_cgroup *memcg;
+	unsigned long flags;
+
+	/*
+	 * The RCU lock is held throughout the transaction.  The fast
+	 * path can get away without acquiring the memcg->move_lock
+	 * because page moving starts with an RCU grace period.
+	 */
+	rcu_read_lock();
+
+	if (mem_cgroup_disabled())
+		return;
+again:
+	memcg = folio_memcg(folio);
+	if (unlikely(!memcg))
+		return;
+
+#ifdef CONFIG_PROVE_LOCKING
+	local_irq_save(flags);
+	might_lock(&memcg->move_lock);
+	local_irq_restore(flags);
+#endif
+
+	if (atomic_read(&memcg->moving_account) <= 0)
+		return;
+
+	spin_lock_irqsave(&memcg->move_lock, flags);
+	if (memcg != folio_memcg(folio)) {
+		spin_unlock_irqrestore(&memcg->move_lock, flags);
+		goto again;
+	}
+
+	/*
+	 * When charge migration first begins, we can have multiple
+	 * critical sections holding the fast-path RCU lock and one
+	 * holding the slowpath move_lock. Track the task who has the
+	 * move_lock for folio_memcg_unlock().
+	 */
+	memcg->move_lock_task = current;
+	memcg->move_lock_flags = flags;
+}
+
+static void __folio_memcg_unlock(struct mem_cgroup *memcg)
+{
+	if (memcg && memcg->move_lock_task == current) {
+		unsigned long flags = memcg->move_lock_flags;
+
+		memcg->move_lock_task = NULL;
+		memcg->move_lock_flags = 0;
+
+		spin_unlock_irqrestore(&memcg->move_lock, flags);
+	}
+
+	rcu_read_unlock();
+}
+
+/**
+ * folio_memcg_unlock - Release the binding between a folio and its memcg.
+ * @folio: The folio.
+ *
+ * This releases the binding created by folio_memcg_lock().  This does
+ * not change the accounting of this folio to its memcg, but it does
+ * permit others to change it.
+ */
+void folio_memcg_unlock(struct folio *folio)
+{
+	__folio_memcg_unlock(folio_memcg(folio));
+}
+
+#ifdef CONFIG_SWAP
+/**
+ * mem_cgroup_move_swap_account - move swap charge and swap_cgroup's record.
+ * @entry: swap entry to be moved
+ * @from:  mem_cgroup which the entry is moved from
+ * @to:  mem_cgroup which the entry is moved to
+ *
+ * It succeeds only when the swap_cgroup's record for this entry is the same
+ * as the mem_cgroup's id of @from.
+ *
+ * Returns 0 on success, -EINVAL on failure.
+ *
+ * The caller must have charged to @to, IOW, called page_counter_charge() about
+ * both res and memsw, and called css_get().
+ */
+static int mem_cgroup_move_swap_account(swp_entry_t entry,
+				struct mem_cgroup *from, struct mem_cgroup *to)
+{
+	unsigned short old_id, new_id;
+
+	old_id = mem_cgroup_id(from);
+	new_id = mem_cgroup_id(to);
+
+	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
+		mod_memcg_state(from, MEMCG_SWAP, -1);
+		mod_memcg_state(to, MEMCG_SWAP, 1);
+		return 0;
+	}
+	return -EINVAL;
+}
+#else
+static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
+				struct mem_cgroup *from, struct mem_cgroup *to)
+{
+	return -EINVAL;
+}
+#endif
+
+u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css,
+				struct cftype *cft)
+{
+	return mem_cgroup_from_css(css)->move_charge_at_immigrate;
+}
+
+#ifdef CONFIG_MMU
+int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
+				 struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	pr_warn_once("Cgroup memory moving (move_charge_at_immigrate) is deprecated. "
+		     "Please report your usecase to linux-mm@kvack.org if you "
+		     "depend on this functionality.\n");
+
+	if (val & ~MOVE_MASK)
+		return -EINVAL;
+
+	/*
+	 * No kind of locking is needed in here, because ->can_attach() will
+	 * check this value once in the beginning of the process, and then carry
+	 * on with stale data. This means that changes to this value will only
+	 * affect task migrations starting after the change.
+	 */
+	memcg->move_charge_at_immigrate = val;
+	return 0;
+}
+#else
+int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
+				 struct cftype *cft, u64 val)
+{
+	return -ENOSYS;
+}
+#endif
+
+#ifdef CONFIG_MMU
+/* Handlers for move charge at task migration. */
+static int mem_cgroup_do_precharge(unsigned long count)
+{
+	int ret;
+
+	/* Try a single bulk charge without reclaim first, kswapd may wake */
+	ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_DIRECT_RECLAIM, count);
+	if (!ret) {
+		mc.precharge += count;
+		return ret;
+	}
+
+	/* Try charges one by one with reclaim, but do not retry */
+	while (count--) {
+		ret = try_charge(mc.to, GFP_KERNEL | __GFP_NORETRY, 1);
+		if (ret)
+			return ret;
+		mc.precharge++;
+		cond_resched();
+	}
+	return 0;
+}
+
+union mc_target {
+	struct folio	*folio;
+	swp_entry_t	ent;
+};
+
+enum mc_target_type {
+	MC_TARGET_NONE = 0,
+	MC_TARGET_PAGE,
+	MC_TARGET_SWAP,
+	MC_TARGET_DEVICE,
+};
+
+static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
+						unsigned long addr, pte_t ptent)
+{
+	struct page *page = vm_normal_page(vma, addr, ptent);
+
+	if (!page)
+		return NULL;
+	if (PageAnon(page)) {
+		if (!(mc.flags & MOVE_ANON))
+			return NULL;
+	} else {
+		if (!(mc.flags & MOVE_FILE))
+			return NULL;
+	}
+	get_page(page);
+
+	return page;
+}
+
+#if defined(CONFIG_SWAP) || defined(CONFIG_DEVICE_PRIVATE)
+static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
+			pte_t ptent, swp_entry_t *entry)
+{
+	struct page *page = NULL;
+	swp_entry_t ent = pte_to_swp_entry(ptent);
+
+	if (!(mc.flags & MOVE_ANON))
+		return NULL;
+
+	/*
+	 * Handle device private pages that are not accessible by the CPU, but
+	 * stored as special swap entries in the page table.
+	 */
+	if (is_device_private_entry(ent)) {
+		page = pfn_swap_entry_to_page(ent);
+		if (!get_page_unless_zero(page))
+			return NULL;
+		return page;
+	}
+
+	if (non_swap_entry(ent))
+		return NULL;
+
+	/*
+	 * Because swap_cache_get_folio() updates some statistics counter,
+	 * we call find_get_page() with swapper_space directly.
+	 */
+	page = find_get_page(swap_address_space(ent), swap_cache_index(ent));
+	entry->val = ent.val;
+
+	return page;
+}
+#else
+static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
+			pte_t ptent, swp_entry_t *entry)
+{
+	return NULL;
+}
+#endif
+
+static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
+			unsigned long addr, pte_t ptent)
+{
+	unsigned long index;
+	struct folio *folio;
+
+	if (!vma->vm_file) /* anonymous vma */
+		return NULL;
+	if (!(mc.flags & MOVE_FILE))
+		return NULL;
+
+	/* folio is moved even if it's not RSS of this task(page-faulted). */
+	/* shmem/tmpfs may report page out on swap: account for that too. */
+	index = linear_page_index(vma, addr);
+	folio = filemap_get_incore_folio(vma->vm_file->f_mapping, index);
+	if (IS_ERR(folio))
+		return NULL;
+	return folio_file_page(folio, index);
+}
+
+/**
+ * mem_cgroup_move_account - move account of the folio
+ * @folio: The folio.
+ * @compound: charge the page as compound or small page
+ * @from: mem_cgroup which the folio is moved from.
+ * @to:	mem_cgroup which the folio is moved to. @from != @to.
+ *
+ * The folio must be locked and not on the LRU.
+ *
+ * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
+ * from old cgroup.
+ */
+static int mem_cgroup_move_account(struct folio *folio,
+				   bool compound,
+				   struct mem_cgroup *from,
+				   struct mem_cgroup *to)
+{
+	struct lruvec *from_vec, *to_vec;
+	struct pglist_data *pgdat;
+	unsigned int nr_pages = compound ? folio_nr_pages(folio) : 1;
+	int nid, ret;
+
+	VM_BUG_ON(from == to);
+	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
+	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
+	VM_BUG_ON(compound && !folio_test_large(folio));
+
+	ret = -EINVAL;
+	if (folio_memcg(folio) != from)
+		goto out;
+
+	pgdat = folio_pgdat(folio);
+	from_vec = mem_cgroup_lruvec(from, pgdat);
+	to_vec = mem_cgroup_lruvec(to, pgdat);
+
+	folio_memcg_lock(folio);
+
+	if (folio_test_anon(folio)) {
+		if (folio_mapped(folio)) {
+			__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
+			__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
+			if (folio_test_pmd_mappable(folio)) {
+				__mod_lruvec_state(from_vec, NR_ANON_THPS,
+						   -nr_pages);
+				__mod_lruvec_state(to_vec, NR_ANON_THPS,
+						   nr_pages);
+			}
+		}
+	} else {
+		__mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages);
+		__mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages);
+
+		if (folio_test_swapbacked(folio)) {
+			__mod_lruvec_state(from_vec, NR_SHMEM, -nr_pages);
+			__mod_lruvec_state(to_vec, NR_SHMEM, nr_pages);
+		}
+
+		if (folio_mapped(folio)) {
+			__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
+			__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
+		}
+
+		if (folio_test_dirty(folio)) {
+			struct address_space *mapping = folio_mapping(folio);
+
+			if (mapping_can_writeback(mapping)) {
+				__mod_lruvec_state(from_vec, NR_FILE_DIRTY,
+						   -nr_pages);
+				__mod_lruvec_state(to_vec, NR_FILE_DIRTY,
+						   nr_pages);
+			}
+		}
+	}
+
+#ifdef CONFIG_SWAP
+	if (folio_test_swapcache(folio)) {
+		__mod_lruvec_state(from_vec, NR_SWAPCACHE, -nr_pages);
+		__mod_lruvec_state(to_vec, NR_SWAPCACHE, nr_pages);
+	}
+#endif
+	if (folio_test_writeback(folio)) {
+		__mod_lruvec_state(from_vec, NR_WRITEBACK, -nr_pages);
+		__mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages);
+	}
+
+	/*
+	 * All state has been migrated, let's switch to the new memcg.
+	 *
+	 * It is safe to change page's memcg here because the page
+	 * is referenced, charged, isolated, and locked: we can't race
+	 * with (un)charging, migration, LRU putback, or anything else
+	 * that would rely on a stable page's memory cgroup.
+	 *
+	 * Note that folio_memcg_lock is a memcg lock, not a page lock,
+	 * to save space. As soon as we switch page's memory cgroup to a
+	 * new memcg that isn't locked, the above state can change
+	 * concurrently again. Make sure we're truly done with it.
+	 */
+	smp_mb();
+
+	css_get(&to->css);
+	css_put(&from->css);
+
+	folio->memcg_data = (unsigned long)to;
+
+	__folio_memcg_unlock(from);
+
+	ret = 0;
+	nid = folio_nid(folio);
+
+	local_irq_disable();
+	mem_cgroup_charge_statistics(to, nr_pages);
+	memcg_check_events(to, nid);
+	mem_cgroup_charge_statistics(from, -nr_pages);
+	memcg_check_events(from, nid);
+	local_irq_enable();
+out:
+	return ret;
+}
+
+/**
+ * get_mctgt_type - get target type of moving charge
+ * @vma: the vma the pte to be checked belongs
+ * @addr: the address corresponding to the pte to be checked
+ * @ptent: the pte to be checked
+ * @target: the pointer the target page or swap ent will be stored(can be NULL)
+ *
+ * Context: Called with pte lock held.
+ * Return:
+ * * MC_TARGET_NONE - If the pte is not a target for move charge.
+ * * MC_TARGET_PAGE - If the page corresponding to this pte is a target for
+ *   move charge. If @target is not NULL, the folio is stored in target->folio
+ *   with extra refcnt taken (Caller should release it).
+ * * MC_TARGET_SWAP - If the swap entry corresponding to this pte is a
+ *   target for charge migration.  If @target is not NULL, the entry is
+ *   stored in target->ent.
+ * * MC_TARGET_DEVICE - Like MC_TARGET_PAGE but page is device memory and
+ *   thus not on the lru.  For now such page is charged like a regular page
+ *   would be as it is just special memory taking the place of a regular page.
+ *   See Documentations/vm/hmm.txt and include/linux/hmm.h
+ */
+static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
+		unsigned long addr, pte_t ptent, union mc_target *target)
+{
+	struct page *page = NULL;
+	struct folio *folio;
+	enum mc_target_type ret = MC_TARGET_NONE;
+	swp_entry_t ent = { .val = 0 };
+
+	if (pte_present(ptent))
+		page = mc_handle_present_pte(vma, addr, ptent);
+	else if (pte_none_mostly(ptent))
+		/*
+		 * PTE markers should be treated as a none pte here, separated
+		 * from other swap handling below.
+		 */
+		page = mc_handle_file_pte(vma, addr, ptent);
+	else if (is_swap_pte(ptent))
+		page = mc_handle_swap_pte(vma, ptent, &ent);
+
+	if (page)
+		folio = page_folio(page);
+	if (target && page) {
+		if (!folio_trylock(folio)) {
+			folio_put(folio);
+			return ret;
+		}
+		/*
+		 * page_mapped() must be stable during the move. This
+		 * pte is locked, so if it's present, the page cannot
+		 * become unmapped. If it isn't, we have only partial
+		 * control over the mapped state: the page lock will
+		 * prevent new faults against pagecache and swapcache,
+		 * so an unmapped page cannot become mapped. However,
+		 * if the page is already mapped elsewhere, it can
+		 * unmap, and there is nothing we can do about it.
+		 * Alas, skip moving the page in this case.
+		 */
+		if (!pte_present(ptent) && page_mapped(page)) {
+			folio_unlock(folio);
+			folio_put(folio);
+			return ret;
+		}
+	}
+
+	if (!page && !ent.val)
+		return ret;
+	if (page) {
+		/*
+		 * Do only loose check w/o serialization.
+		 * mem_cgroup_move_account() checks the page is valid or
+		 * not under LRU exclusion.
+		 */
+		if (folio_memcg(folio) == mc.from) {
+			ret = MC_TARGET_PAGE;
+			if (folio_is_device_private(folio) ||
+			    folio_is_device_coherent(folio))
+				ret = MC_TARGET_DEVICE;
+			if (target)
+				target->folio = folio;
+		}
+		if (!ret || !target) {
+			if (target)
+				folio_unlock(folio);
+			folio_put(folio);
+		}
+	}
+	/*
+	 * There is a swap entry and a page doesn't exist or isn't charged.
+	 * But we cannot move a tail-page in a THP.
+	 */
+	if (ent.val && !ret && (!page || !PageTransCompound(page)) &&
+	    mem_cgroup_id(mc.from) == lookup_swap_cgroup_id(ent)) {
+		ret = MC_TARGET_SWAP;
+		if (target)
+			target->ent = ent;
+	}
+	return ret;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * We don't consider PMD mapped swapping or file mapped pages because THP does
+ * not support them for now.
+ * Caller should make sure that pmd_trans_huge(pmd) is true.
+ */
+static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
+		unsigned long addr, pmd_t pmd, union mc_target *target)
+{
+	struct page *page = NULL;
+	struct folio *folio;
+	enum mc_target_type ret = MC_TARGET_NONE;
+
+	if (unlikely(is_swap_pmd(pmd))) {
+		VM_BUG_ON(thp_migration_supported() &&
+				  !is_pmd_migration_entry(pmd));
+		return ret;
+	}
+	page = pmd_page(pmd);
+	VM_BUG_ON_PAGE(!page || !PageHead(page), page);
+	folio = page_folio(page);
+	if (!(mc.flags & MOVE_ANON))
+		return ret;
+	if (folio_memcg(folio) == mc.from) {
+		ret = MC_TARGET_PAGE;
+		if (target) {
+			folio_get(folio);
+			if (!folio_trylock(folio)) {
+				folio_put(folio);
+				return MC_TARGET_NONE;
+			}
+			target->folio = folio;
+		}
+	}
+	return ret;
+}
+#else
+static inline enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
+		unsigned long addr, pmd_t pmd, union mc_target *target)
+{
+	return MC_TARGET_NONE;
+}
+#endif
+
+static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
+					unsigned long addr, unsigned long end,
+					struct mm_walk *walk)
+{
+	struct vm_area_struct *vma = walk->vma;
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		/*
+		 * Note there cannot be MC_TARGET_DEVICE for now as we do not
+		 * support transparent huge pages with MEMORY_DEVICE_PRIVATE,
+		 * but this might change.
+		 */
+		if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
+			mc.precharge += HPAGE_PMD_NR;
+		spin_unlock(ptl);
+		return 0;
+	}
+
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (!pte)
+		return 0;
+	for (; addr != end; pte++, addr += PAGE_SIZE)
+		if (get_mctgt_type(vma, addr, ptep_get(pte), NULL))
+			mc.precharge++;	/* increment precharge temporarily */
+	pte_unmap_unlock(pte - 1, ptl);
+	cond_resched();
+
+	return 0;
+}
+
+static const struct mm_walk_ops precharge_walk_ops = {
+	.pmd_entry	= mem_cgroup_count_precharge_pte_range,
+	.walk_lock	= PGWALK_RDLOCK,
+};
+
+static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
+{
+	unsigned long precharge;
+
+	mmap_read_lock(mm);
+	walk_page_range(mm, 0, ULONG_MAX, &precharge_walk_ops, NULL);
+	mmap_read_unlock(mm);
+
+	precharge = mc.precharge;
+	mc.precharge = 0;
+
+	return precharge;
+}
+
+static int mem_cgroup_precharge_mc(struct mm_struct *mm)
+{
+	unsigned long precharge = mem_cgroup_count_precharge(mm);
+
+	VM_BUG_ON(mc.moving_task);
+	mc.moving_task = current;
+	return mem_cgroup_do_precharge(precharge);
+}
+
+/* cancels all extra charges on mc.from and mc.to, and wakes up all waiters. */
+static void __mem_cgroup_clear_mc(void)
+{
+	struct mem_cgroup *from = mc.from;
+	struct mem_cgroup *to = mc.to;
+
+	/* we must uncharge all the leftover precharges from mc.to */
+	if (mc.precharge) {
+		mem_cgroup_cancel_charge(mc.to, mc.precharge);
+		mc.precharge = 0;
+	}
+	/*
+	 * we didn't uncharge from mc.from at mem_cgroup_move_account(), so
+	 * we must uncharge here.
+	 */
+	if (mc.moved_charge) {
+		mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
+		mc.moved_charge = 0;
+	}
+	/* we must fixup refcnts and charges */
+	if (mc.moved_swap) {
+		/* uncharge swap account from the old cgroup */
+		if (!mem_cgroup_is_root(mc.from))
+			page_counter_uncharge(&mc.from->memsw, mc.moved_swap);
+
+		mem_cgroup_id_put_many(mc.from, mc.moved_swap);
+
+		/*
+		 * we charged both to->memory and to->memsw, so we
+		 * should uncharge to->memory.
+		 */
+		if (!mem_cgroup_is_root(mc.to))
+			page_counter_uncharge(&mc.to->memory, mc.moved_swap);
+
+		mc.moved_swap = 0;
+	}
+	memcg_oom_recover(from);
+	memcg_oom_recover(to);
+	wake_up_all(&mc.waitq);
+}
+
+static void mem_cgroup_clear_mc(void)
+{
+	struct mm_struct *mm = mc.mm;
+
+	/*
+	 * we must clear moving_task before waking up waiters at the end of
+	 * task migration.
+	 */
+	mc.moving_task = NULL;
+	__mem_cgroup_clear_mc();
+	spin_lock(&mc.lock);
+	mc.from = NULL;
+	mc.to = NULL;
+	mc.mm = NULL;
+	spin_unlock(&mc.lock);
+
+	mmput(mm);
+}
+
+int mem_cgroup_can_attach(struct cgroup_taskset *tset)
+{
+	struct cgroup_subsys_state *css;
+	struct mem_cgroup *memcg = NULL; /* unneeded init to make gcc happy */
+	struct mem_cgroup *from;
+	struct task_struct *leader, *p;
+	struct mm_struct *mm;
+	unsigned long move_flags;
+	int ret = 0;
+
+	/* charge immigration isn't supported on the default hierarchy */
+	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
+		return 0;
+
+	/*
+	 * Multi-process migrations only happen on the default hierarchy
+	 * where charge immigration is not used.  Perform charge
+	 * immigration if @tset contains a leader and whine if there are
+	 * multiple.
+	 */
+	p = NULL;
+	cgroup_taskset_for_each_leader(leader, css, tset) {
+		WARN_ON_ONCE(p);
+		p = leader;
+		memcg = mem_cgroup_from_css(css);
+	}
+	if (!p)
+		return 0;
+
+	/*
+	 * We are now committed to this value whatever it is. Changes in this
+	 * tunable will only affect upcoming migrations, not the current one.
+	 * So we need to save it, and keep it going.
+	 */
+	move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
+	if (!move_flags)
+		return 0;
+
+	from = mem_cgroup_from_task(p);
+
+	VM_BUG_ON(from == memcg);
+
+	mm = get_task_mm(p);
+	if (!mm)
+		return 0;
+	/* We move charges only when we move an owner of the mm */
+	if (mm->owner == p) {
+		VM_BUG_ON(mc.from);
+		VM_BUG_ON(mc.to);
+		VM_BUG_ON(mc.precharge);
+		VM_BUG_ON(mc.moved_charge);
+		VM_BUG_ON(mc.moved_swap);
+
+		spin_lock(&mc.lock);
+		mc.mm = mm;
+		mc.from = from;
+		mc.to = memcg;
+		mc.flags = move_flags;
+		spin_unlock(&mc.lock);
+		/* We set mc.moving_task later */
+
+		ret = mem_cgroup_precharge_mc(mm);
+		if (ret)
+			mem_cgroup_clear_mc();
+	} else {
+		mmput(mm);
+	}
+	return ret;
+}
+
+void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
+{
+	if (mc.to)
+		mem_cgroup_clear_mc();
+}
+
+static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
+				unsigned long addr, unsigned long end,
+				struct mm_walk *walk)
+{
+	int ret = 0;
+	struct vm_area_struct *vma = walk->vma;
+	pte_t *pte;
+	spinlock_t *ptl;
+	enum mc_target_type target_type;
+	union mc_target target;
+	struct folio *folio;
+
+	ptl = pmd_trans_huge_lock(pmd, vma);
+	if (ptl) {
+		if (mc.precharge < HPAGE_PMD_NR) {
+			spin_unlock(ptl);
+			return 0;
+		}
+		target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
+		if (target_type == MC_TARGET_PAGE) {
+			folio = target.folio;
+			if (folio_isolate_lru(folio)) {
+				if (!mem_cgroup_move_account(folio, true,
+							     mc.from, mc.to)) {
+					mc.precharge -= HPAGE_PMD_NR;
+					mc.moved_charge += HPAGE_PMD_NR;
+				}
+				folio_putback_lru(folio);
+			}
+			folio_unlock(folio);
+			folio_put(folio);
+		} else if (target_type == MC_TARGET_DEVICE) {
+			folio = target.folio;
+			if (!mem_cgroup_move_account(folio, true,
+						     mc.from, mc.to)) {
+				mc.precharge -= HPAGE_PMD_NR;
+				mc.moved_charge += HPAGE_PMD_NR;
+			}
+			folio_unlock(folio);
+			folio_put(folio);
+		}
+		spin_unlock(ptl);
+		return 0;
+	}
+
+retry:
+	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+	if (!pte)
+		return 0;
+	for (; addr != end; addr += PAGE_SIZE) {
+		pte_t ptent = ptep_get(pte++);
+		bool device = false;
+		swp_entry_t ent;
+
+		if (!mc.precharge)
+			break;
+
+		switch (get_mctgt_type(vma, addr, ptent, &target)) {
+		case MC_TARGET_DEVICE:
+			device = true;
+			fallthrough;
+		case MC_TARGET_PAGE:
+			folio = target.folio;
+			/*
+			 * We can have a part of the split pmd here. Moving it
+			 * can be done but it would be too convoluted so simply
+			 * ignore such a partial THP and keep it in original
+			 * memcg. There should be somebody mapping the head.
+			 */
+			if (folio_test_large(folio))
+				goto put;
+			if (!device && !folio_isolate_lru(folio))
+				goto put;
+			if (!mem_cgroup_move_account(folio, false,
+						mc.from, mc.to)) {
+				mc.precharge--;
+				/* we uncharge from mc.from later. */
+				mc.moved_charge++;
+			}
+			if (!device)
+				folio_putback_lru(folio);
+put:			/* get_mctgt_type() gets & locks the page */
+			folio_unlock(folio);
+			folio_put(folio);
+			break;
+		case MC_TARGET_SWAP:
+			ent = target.ent;
+			if (!mem_cgroup_move_swap_account(ent, mc.from, mc.to)) {
+				mc.precharge--;
+				mem_cgroup_id_get_many(mc.to, 1);
+				/* we fixup other refcnts and charges later. */
+				mc.moved_swap++;
+			}
+			break;
+		default:
+			break;
+		}
+	}
+	pte_unmap_unlock(pte - 1, ptl);
+	cond_resched();
+
+	if (addr != end) {
+		/*
+		 * We have consumed all precharges we got in can_attach().
+		 * We try to charge one by one, but don't do any additional
+		 * charges to mc.to if we have failed to charge once in the
+		 * attach() phase.
+		 */
+		ret = mem_cgroup_do_precharge(1);
+		if (!ret)
+			goto retry;
+	}
+
+	return ret;
+}
+
+static const struct mm_walk_ops charge_walk_ops = {
+	.pmd_entry	= mem_cgroup_move_charge_pte_range,
+	.walk_lock	= PGWALK_RDLOCK,
+};
+
+static void mem_cgroup_move_charge(void)
+{
+	lru_add_drain_all();
+	/*
+	 * Signal folio_memcg_lock() to take the memcg's move_lock
+	 * while we're moving its pages to another memcg. Then wait
+	 * for already started RCU-only updates to finish.
+	 */
+	atomic_inc(&mc.from->moving_account);
+	synchronize_rcu();
+retry:
+	if (unlikely(!mmap_read_trylock(mc.mm))) {
+		/*
+		 * Someone who is holding the mmap_lock might be waiting on the
+		 * waitq. So we cancel all extra charges, wake up all waiters,
+		 * and retry. Because we cancel precharges, we might not be able
+		 * to move enough charges, but moving charge is a best-effort
+		 * feature anyway, so it wouldn't be a big problem.
+		 */
+		__mem_cgroup_clear_mc();
+		cond_resched();
+		goto retry;
+	}
+	/*
+	 * When we have consumed all precharges and failed to charge
+	 * more, the page walk just aborts.
+	 */
+	walk_page_range(mc.mm, 0, ULONG_MAX, &charge_walk_ops, NULL);
+	mmap_read_unlock(mc.mm);
+	atomic_dec(&mc.from->moving_account);
+}
+
+void mem_cgroup_move_task(void)
+{
+	if (mc.to) {
+		mem_cgroup_move_charge();
+		mem_cgroup_clear_mc();
+	}
+}
+
+#else	/* !CONFIG_MMU */
+static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
+{
+	return 0;
+}
+static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
+{
+}
+static void mem_cgroup_move_task(void)
+{
+}
+#endif
+
 static int __init memcg1_init(void)
 {
 	int node;
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index e37bc7e8d955..55e7c4f90c39 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -11,4 +11,34 @@ static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg)
 	WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
 }
 
+void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages);
+void memcg_check_events(struct mem_cgroup *memcg, int nid);
+void memcg_oom_recover(struct mem_cgroup *memcg);
+int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
+		     unsigned int nr_pages);
+
+static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
+			     unsigned int nr_pages)
+{
+	if (mem_cgroup_is_root(memcg))
+		return 0;
+
+	return try_charge_memcg(memcg, gfp_mask, nr_pages);
+}
+
+void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n);
+void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n);
+
+bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg);
+struct cgroup_taskset;
+int mem_cgroup_can_attach(struct cgroup_taskset *tset);
+void mem_cgroup_cancel_attach(struct cgroup_taskset *tset);
+void mem_cgroup_move_task(void);
+
+struct cftype;
+u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css,
+				struct cftype *cft);
+int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
+				 struct cftype *cft, u64 val);
+
 #endif	/* __MM_MEMCONTROL_V1_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3479e1af12d5..3332c89cae2e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -28,7 +28,6 @@
 #include <linux/page_counter.h>
 #include <linux/memcontrol.h>
 #include <linux/cgroup.h>
-#include <linux/pagewalk.h>
 #include <linux/sched/mm.h>
 #include <linux/shmem_fs.h>
 #include <linux/hugetlb.h>
@@ -45,7 +44,6 @@
 #include <linux/mutex.h>
 #include <linux/rbtree.h>
 #include <linux/slab.h>
-#include <linux/swap.h>
 #include <linux/swapops.h>
 #include <linux/spinlock.h>
 #include <linux/eventfd.h>
@@ -71,7 +69,6 @@
 #include <net/sock.h>
 #include <net/ip.h>
 #include "slab.h"
-#include "swap.h"
 #include "memcontrol-v1.h"
 
 #include <linux/uaccess.h>
@@ -158,31 +155,6 @@ struct mem_cgroup_event {
 static void mem_cgroup_threshold(struct mem_cgroup *memcg);
 static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
 
-/* Stuffs for move charges at task migration. */
-/*
- * Types of charges to be moved.
- */
-#define MOVE_ANON	0x1U
-#define MOVE_FILE	0x2U
-#define MOVE_MASK	(MOVE_ANON | MOVE_FILE)
-
-/* "mc" and its members are protected by cgroup_mutex */
-static struct move_charge_struct {
-	spinlock_t	  lock; /* for from, to */
-	struct mm_struct  *mm;
-	struct mem_cgroup *from;
-	struct mem_cgroup *to;
-	unsigned long flags;
-	unsigned long precharge;
-	unsigned long moved_charge;
-	unsigned long moved_swap;
-	struct task_struct *moving_task;	/* a task moving charges */
-	wait_queue_head_t waitq;		/* a waitq for other context */
-} mc = {
-	.lock = __SPIN_LOCK_UNLOCKED(mc.lock),
-	.waitq = __WAIT_QUEUE_HEAD_INITIALIZER(mc.waitq),
-};
-
 /* for encoding cft->private value on file */
 enum res_type {
 	_MEM,
@@ -955,8 +927,7 @@ static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
 	return READ_ONCE(memcg->vmstats->events_local[i]);
 }
 
-static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
-					 int nr_pages)
+void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages)
 {
 	/* pagein of a big page is an event. So, ignore page size */
 	if (nr_pages > 0)
@@ -998,7 +969,7 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
  * Check events in order.
  *
  */
-static void memcg_check_events(struct mem_cgroup *memcg, int nid)
+void memcg_check_events(struct mem_cgroup *memcg, int nid)
 {
 	if (IS_ENABLED(CONFIG_PREEMPT_RT))
 		return;
@@ -1467,51 +1438,6 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
 	return margin;
 }
 
-/*
- * A routine for checking "mem" is under move_account() or not.
- *
- * Checking a cgroup is mc.from or mc.to or under hierarchy of
- * moving cgroups. This is for waiting at high-memory pressure
- * caused by "move".
- */
-static bool mem_cgroup_under_move(struct mem_cgroup *memcg)
-{
-	struct mem_cgroup *from;
-	struct mem_cgroup *to;
-	bool ret = false;
-	/*
-	 * Unlike task_move routines, we access mc.to, mc.from not under
-	 * mutual exclusion by cgroup_mutex. Here, we take spinlock instead.
-	 */
-	spin_lock(&mc.lock);
-	from = mc.from;
-	to = mc.to;
-	if (!from)
-		goto unlock;
-
-	ret = mem_cgroup_is_descendant(from, memcg) ||
-		mem_cgroup_is_descendant(to, memcg);
-unlock:
-	spin_unlock(&mc.lock);
-	return ret;
-}
-
-static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
-{
-	if (mc.moving_task && current != mc.moving_task) {
-		if (mem_cgroup_under_move(memcg)) {
-			DEFINE_WAIT(wait);
-			prepare_to_wait(&mc.waitq, &wait, TASK_INTERRUPTIBLE);
-			/* moving charge context might have finished. */
-			if (mc.moving_task)
-				schedule();
-			finish_wait(&mc.waitq, &wait);
-			return true;
-		}
-	}
-	return false;
-}
-
 struct memory_stat {
 	const char *name;
 	unsigned int idx;
@@ -1904,7 +1830,7 @@ static int memcg_oom_wake_function(wait_queue_entry_t *wait,
 	return autoremove_wake_function(wait, mode, sync, arg);
 }
 
-static void memcg_oom_recover(struct mem_cgroup *memcg)
+void memcg_oom_recover(struct mem_cgroup *memcg)
 {
 	/*
 	 * For the following lockless ->under_oom test, the only required
@@ -2093,87 +2019,6 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
 	pr_cont(" are going to be killed due to memory.oom.group set\n");
 }
 
-/**
- * folio_memcg_lock - Bind a folio to its memcg.
- * @folio: The folio.
- *
- * This function prevents unlocked LRU folios from being moved to
- * another cgroup.
- *
- * It ensures lifetime of the bound memcg.  The caller is responsible
- * for the lifetime of the folio.
- */
-void folio_memcg_lock(struct folio *folio)
-{
-	struct mem_cgroup *memcg;
-	unsigned long flags;
-
-	/*
-	 * The RCU lock is held throughout the transaction.  The fast
-	 * path can get away without acquiring the memcg->move_lock
-	 * because page moving starts with an RCU grace period.
-         */
-	rcu_read_lock();
-
-	if (mem_cgroup_disabled())
-		return;
-again:
-	memcg = folio_memcg(folio);
-	if (unlikely(!memcg))
-		return;
-
-#ifdef CONFIG_PROVE_LOCKING
-	local_irq_save(flags);
-	might_lock(&memcg->move_lock);
-	local_irq_restore(flags);
-#endif
-
-	if (atomic_read(&memcg->moving_account) <= 0)
-		return;
-
-	spin_lock_irqsave(&memcg->move_lock, flags);
-	if (memcg != folio_memcg(folio)) {
-		spin_unlock_irqrestore(&memcg->move_lock, flags);
-		goto again;
-	}
-
-	/*
-	 * When charge migration first begins, we can have multiple
-	 * critical sections holding the fast-path RCU lock and one
-	 * holding the slowpath move_lock. Track the task who has the
-	 * move_lock for folio_memcg_unlock().
-	 */
-	memcg->move_lock_task = current;
-	memcg->move_lock_flags = flags;
-}
-
-static void __folio_memcg_unlock(struct mem_cgroup *memcg)
-{
-	if (memcg && memcg->move_lock_task == current) {
-		unsigned long flags = memcg->move_lock_flags;
-
-		memcg->move_lock_task = NULL;
-		memcg->move_lock_flags = 0;
-
-		spin_unlock_irqrestore(&memcg->move_lock, flags);
-	}
-
-	rcu_read_unlock();
-}
-
-/**
- * folio_memcg_unlock - Release the binding between a folio and its memcg.
- * @folio: The folio.
- *
- * This releases the binding created by folio_memcg_lock().  This does
- * not change the accounting of this folio to its memcg, but it does
- * permit others to change it.
- */
-void folio_memcg_unlock(struct folio *folio)
-{
-	__folio_memcg_unlock(folio_memcg(folio));
-}
-
 struct memcg_stock_pcp {
 	local_lock_t stock_lock;
 	struct mem_cgroup *cached; /* this never be root cgroup */
@@ -2653,8 +2498,8 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	css_put(&memcg->css);
 }
 
-static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
-			unsigned int nr_pages)
+int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
+		     unsigned int nr_pages)
 {
 	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
 	int nr_retries = MAX_RECLAIM_RETRIES;
@@ -2849,15 +2694,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	return 0;
 }
 
-static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
-			     unsigned int nr_pages)
-{
-	if (mem_cgroup_is_root(memcg))
-		return 0;
-
-	return try_charge_memcg(memcg, gfp_mask, nr_pages);
-}
-
 /**
  * mem_cgroup_cancel_charge() - cancel an uncommitted try_charge() call.
  * @memcg: memcg previously charged.
@@ -3595,43 +3431,6 @@ void split_page_memcg(struct page *head, int old_order, int new_order)
 		css_get_many(&memcg->css, old_nr / new_nr - 1);
 }
 
-#ifdef CONFIG_SWAP
-/**
- * mem_cgroup_move_swap_account - move swap charge and swap_cgroup's record.
- * @entry: swap entry to be moved
- * @from:  mem_cgroup which the entry is moved from
- * @to:  mem_cgroup which the entry is moved to
- *
- * It succeeds only when the swap_cgroup's record for this entry is the same
- * as the mem_cgroup's id of @from.
- *
- * Returns 0 on success, -EINVAL on failure.
- *
- * The caller must have charged to @to, IOW, called page_counter_charge() about
- * both res and memsw, and called css_get().
- */
-static int mem_cgroup_move_swap_account(swp_entry_t entry,
-				struct mem_cgroup *from, struct mem_cgroup *to)
-{
-	unsigned short old_id, new_id;
-
-	old_id = mem_cgroup_id(from);
-	new_id = mem_cgroup_id(to);
-
-	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
-		mod_memcg_state(from, MEMCG_SWAP, -1);
-		mod_memcg_state(to, MEMCG_SWAP, 1);
-		return 0;
-	}
-	return -EINVAL;
-}
-#else
-static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
-				struct mem_cgroup *from, struct mem_cgroup *to)
-{
-	return -EINVAL;
-}
-#endif
 
 static DEFINE_MUTEX(memcg_max_mutex);
 
@@ -4015,42 +3814,6 @@ static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf,
 	return nbytes;
 }
 
-static u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css,
-					struct cftype *cft)
-{
-	return mem_cgroup_from_css(css)->move_charge_at_immigrate;
-}
-
-#ifdef CONFIG_MMU
-static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
-					struct cftype *cft, u64 val)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-
-	pr_warn_once("Cgroup memory moving (move_charge_at_immigrate) is deprecated. "
-		     "Please report your usecase to linux-mm@kvack.org if you "
-		     "depend on this functionality.\n");
-
-	if (val & ~MOVE_MASK)
-		return -EINVAL;
-
-	/*
-	 * No kind of locking is needed in here, because ->can_attach() will
-	 * check this value once in the beginning of the process, and then carry
-	 * on with stale data. This means that changes to this value will only
-	 * affect task migrations starting after the change.
-	 */
-	memcg->move_charge_at_immigrate = val;
-	return 0;
-}
-#else
-static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
-					struct cftype *cft, u64 val)
-{
-	return -ENOSYS;
-}
-#endif
-
 #ifdef CONFIG_NUMA
 
 #define LRU_ALL_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE))
@@ -5261,13 +5024,13 @@ static void mem_cgroup_id_remove(struct mem_cgroup *memcg)
 	}
 }
 
-static void __maybe_unused mem_cgroup_id_get_many(struct mem_cgroup *memcg,
-						  unsigned int n)
+void __maybe_unused mem_cgroup_id_get_many(struct mem_cgroup *memcg,
+					   unsigned int n)
 {
 	refcount_add(n, &memcg->id.ref);
 }
 
-static void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n)
+void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n)
 {
 	if (refcount_sub_and_test(n, &memcg->id.ref)) {
 		mem_cgroup_id_remove(memcg);
@@ -5747,757 +5510,6 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
 		atomic64_set(&memcg->vmstats->stats_updates, 0);
 }
 
-#ifdef CONFIG_MMU
-/* Handlers for move charge at task migration. */
-static int mem_cgroup_do_precharge(unsigned long count)
-{
-	int ret;
-
-	/* Try a single bulk charge without reclaim first, kswapd may wake */
-	ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_DIRECT_RECLAIM, count);
-	if (!ret) {
-		mc.precharge += count;
-		return ret;
-	}
-
-	/* Try charges one by one with reclaim, but do not retry */
-	while (count--) {
-		ret = try_charge(mc.to, GFP_KERNEL | __GFP_NORETRY, 1);
-		if (ret)
-			return ret;
-		mc.precharge++;
-		cond_resched();
-	}
-	return 0;
-}
-
-union mc_target {
-	struct folio	*folio;
-	swp_entry_t	ent;
-};
-
-enum mc_target_type {
-	MC_TARGET_NONE = 0,
-	MC_TARGET_PAGE,
-	MC_TARGET_SWAP,
-	MC_TARGET_DEVICE,
-};
-
-static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
-						unsigned long addr, pte_t ptent)
-{
-	struct page *page = vm_normal_page(vma, addr, ptent);
-
-	if (!page)
-		return NULL;
-	if (PageAnon(page)) {
-		if (!(mc.flags & MOVE_ANON))
-			return NULL;
-	} else {
-		if (!(mc.flags & MOVE_FILE))
-			return NULL;
-	}
-	get_page(page);
-
-	return page;
-}
-
-#if defined(CONFIG_SWAP) || defined(CONFIG_DEVICE_PRIVATE)
-static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
-			pte_t ptent, swp_entry_t *entry)
-{
-	struct page *page = NULL;
-	swp_entry_t ent = pte_to_swp_entry(ptent);
-
-	if (!(mc.flags & MOVE_ANON))
-		return NULL;
-
-	/*
-	 * Handle device private pages that are not accessible by the CPU, but
-	 * stored as special swap entries in the page table.
-	 */
-	if (is_device_private_entry(ent)) {
-		page = pfn_swap_entry_to_page(ent);
-		if (!get_page_unless_zero(page))
-			return NULL;
-		return page;
-	}
-
-	if (non_swap_entry(ent))
-		return NULL;
-
-	/*
-	 * Because swap_cache_get_folio() updates some statistics counter,
-	 * we call find_get_page() with swapper_space directly.
-	 */
-	page = find_get_page(swap_address_space(ent), swap_cache_index(ent));
-	entry->val = ent.val;
-
-	return page;
-}
-#else
-static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
-			pte_t ptent, swp_entry_t *entry)
-{
-	return NULL;
-}
-#endif
-
-static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
-			unsigned long addr, pte_t ptent)
-{
-	unsigned long index;
-	struct folio *folio;
-
-	if (!vma->vm_file) /* anonymous vma */
-		return NULL;
-	if (!(mc.flags & MOVE_FILE))
-		return NULL;
-
-	/* folio is moved even if it's not RSS of this task(page-faulted). */
-	/* shmem/tmpfs may report page out on swap: account for that too. */
-	index = linear_page_index(vma, addr);
-	folio = filemap_get_incore_folio(vma->vm_file->f_mapping, index);
-	if (IS_ERR(folio))
-		return NULL;
-	return folio_file_page(folio, index);
-}
-
-/**
- * mem_cgroup_move_account - move account of the folio
- * @folio: The folio.
- * @compound: charge the page as compound or small page
- * @from: mem_cgroup which the folio is moved from.
- * @to:	mem_cgroup which the folio is moved to. @from != @to.
- *
- * The folio must be locked and not on the LRU.
- *
- * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
- * from old cgroup.
- */
-static int mem_cgroup_move_account(struct folio *folio,
-				   bool compound,
-				   struct mem_cgroup *from,
-				   struct mem_cgroup *to)
-{
-	struct lruvec *from_vec, *to_vec;
-	struct pglist_data *pgdat;
-	unsigned int nr_pages = compound ? folio_nr_pages(folio) : 1;
-	int nid, ret;
-
-	VM_BUG_ON(from == to);
-	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
-	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
-	VM_BUG_ON(compound && !folio_test_large(folio));
-
-	ret = -EINVAL;
-	if (folio_memcg(folio) != from)
-		goto out;
-
-	pgdat = folio_pgdat(folio);
-	from_vec = mem_cgroup_lruvec(from, pgdat);
-	to_vec = mem_cgroup_lruvec(to, pgdat);
-
-	folio_memcg_lock(folio);
-
-	if (folio_test_anon(folio)) {
-		if (folio_mapped(folio)) {
-			__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
-			__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
-			if (folio_test_pmd_mappable(folio)) {
-				__mod_lruvec_state(from_vec, NR_ANON_THPS,
-						   -nr_pages);
-				__mod_lruvec_state(to_vec, NR_ANON_THPS,
-						   nr_pages);
-			}
-		}
-	} else {
-		__mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages);
-		__mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages);
-
-		if (folio_test_swapbacked(folio)) {
-			__mod_lruvec_state(from_vec, NR_SHMEM, -nr_pages);
-			__mod_lruvec_state(to_vec, NR_SHMEM, nr_pages);
-		}
-
-		if (folio_mapped(folio)) {
-			__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
-			__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
-		}
-
-		if (folio_test_dirty(folio)) {
-			struct address_space *mapping = folio_mapping(folio);
-
-			if (mapping_can_writeback(mapping)) {
-				__mod_lruvec_state(from_vec, NR_FILE_DIRTY,
-						   -nr_pages);
-				__mod_lruvec_state(to_vec, NR_FILE_DIRTY,
-						   nr_pages);
-			}
-		}
-	}
-
-#ifdef CONFIG_SWAP
-	if (folio_test_swapcache(folio)) {
-		__mod_lruvec_state(from_vec, NR_SWAPCACHE, -nr_pages);
-		__mod_lruvec_state(to_vec, NR_SWAPCACHE, nr_pages);
-	}
-#endif
-	if (folio_test_writeback(folio)) {
-		__mod_lruvec_state(from_vec, NR_WRITEBACK, -nr_pages);
-		__mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages);
-	}
-
-	/*
-	 * All state has been migrated, let's switch to the new memcg.
-	 *
-	 * It is safe to change page's memcg here because the page
-	 * is referenced, charged, isolated, and locked: we can't race
-	 * with (un)charging, migration, LRU putback, or anything else
-	 * that would rely on a stable page's memory cgroup.
-	 *
-	 * Note that folio_memcg_lock is a memcg lock, not a page lock,
-	 * to save space. As soon as we switch page's memory cgroup to a
-	 * new memcg that isn't locked, the above state can change
-	 * concurrently again. Make sure we're truly done with it.
-	 */
-	smp_mb();
-
-	css_get(&to->css);
-	css_put(&from->css);
-
-	folio->memcg_data = (unsigned long)to;
-
-	__folio_memcg_unlock(from);
-
-	ret = 0;
-	nid = folio_nid(folio);
-
-	local_irq_disable();
-	mem_cgroup_charge_statistics(to, nr_pages);
-	memcg_check_events(to, nid);
-	mem_cgroup_charge_statistics(from, -nr_pages);
-	memcg_check_events(from, nid);
-	local_irq_enable();
-out:
-	return ret;
-}
-
-/**
- * get_mctgt_type - get target type of moving charge
- * @vma: the vma the pte to be checked belongs
- * @addr: the address corresponding to the pte to be checked
- * @ptent: the pte to be checked
- * @target: the pointer the target page or swap ent will be stored(can be NULL)
- *
- * Context: Called with pte lock held.
- * Return:
- * * MC_TARGET_NONE - If the pte is not a target for move charge.
- * * MC_TARGET_PAGE - If the page corresponding to this pte is a target for
- *   move charge. If @target is not NULL, the folio is stored in target->folio
- *   with extra refcnt taken (Caller should release it).
- * * MC_TARGET_SWAP - If the swap entry corresponding to this pte is a
- *   target for charge migration.  If @target is not NULL, the entry is
- *   stored in target->ent.
- * * MC_TARGET_DEVICE - Like MC_TARGET_PAGE but page is device memory and
- *   thus not on the lru.  For now such page is charged like a regular page
- *   would be as it is just special memory taking the place of a regular page.
- *   See Documentations/vm/hmm.txt and include/linux/hmm.h
- */
-static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
-		unsigned long addr, pte_t ptent, union mc_target *target)
-{
-	struct page *page = NULL;
-	struct folio *folio;
-	enum mc_target_type ret = MC_TARGET_NONE;
-	swp_entry_t ent = { .val = 0 };
-
-	if (pte_present(ptent))
-		page = mc_handle_present_pte(vma, addr, ptent);
-	else if (pte_none_mostly(ptent))
-		/*
-		 * PTE markers should be treated as a none pte here, separated
-		 * from other swap handling below.
-		 */
-		page = mc_handle_file_pte(vma, addr, ptent);
-	else if (is_swap_pte(ptent))
-		page = mc_handle_swap_pte(vma, ptent, &ent);
-
-	if (page)
-		folio = page_folio(page);
-	if (target && page) {
-		if (!folio_trylock(folio)) {
-			folio_put(folio);
-			return ret;
-		}
-		/*
-		 * page_mapped() must be stable during the move. This
-		 * pte is locked, so if it's present, the page cannot
-		 * become unmapped. If it isn't, we have only partial
-		 * control over the mapped state: the page lock will
-		 * prevent new faults against pagecache and swapcache,
-		 * so an unmapped page cannot become mapped. However,
-		 * if the page is already mapped elsewhere, it can
-		 * unmap, and there is nothing we can do about it.
-		 * Alas, skip moving the page in this case.
-		 */
-		if (!pte_present(ptent) && page_mapped(page)) {
-			folio_unlock(folio);
-			folio_put(folio);
-			return ret;
-		}
-	}
-
-	if (!page && !ent.val)
-		return ret;
-	if (page) {
-		/*
-		 * Do only loose check w/o serialization.
-		 * mem_cgroup_move_account() checks the page is valid or
-		 * not under LRU exclusion.
-		 */
-		if (folio_memcg(folio) == mc.from) {
-			ret = MC_TARGET_PAGE;
-			if (folio_is_device_private(folio) ||
-			    folio_is_device_coherent(folio))
-				ret = MC_TARGET_DEVICE;
-			if (target)
-				target->folio = folio;
-		}
-		if (!ret || !target) {
-			if (target)
-				folio_unlock(folio);
-			folio_put(folio);
-		}
-	}
-	/*
-	 * There is a swap entry and a page doesn't exist or isn't charged.
-	 * But we cannot move a tail-page in a THP.
-	 */
-	if (ent.val && !ret && (!page || !PageTransCompound(page)) &&
-	    mem_cgroup_id(mc.from) == lookup_swap_cgroup_id(ent)) {
-		ret = MC_TARGET_SWAP;
-		if (target)
-			target->ent = ent;
-	}
-	return ret;
-}
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-/*
- * We don't consider PMD mapped swapping or file mapped pages because THP does
- * not support them for now.
- * Caller should make sure that pmd_trans_huge(pmd) is true.
- */
-static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
-		unsigned long addr, pmd_t pmd, union mc_target *target)
-{
-	struct page *page = NULL;
-	struct folio *folio;
-	enum mc_target_type ret = MC_TARGET_NONE;
-
-	if (unlikely(is_swap_pmd(pmd))) {
-		VM_BUG_ON(thp_migration_supported() &&
-				  !is_pmd_migration_entry(pmd));
-		return ret;
-	}
-	page = pmd_page(pmd);
-	VM_BUG_ON_PAGE(!page || !PageHead(page), page);
-	folio = page_folio(page);
-	if (!(mc.flags & MOVE_ANON))
-		return ret;
-	if (folio_memcg(folio) == mc.from) {
-		ret = MC_TARGET_PAGE;
-		if (target) {
-			folio_get(folio);
-			if (!folio_trylock(folio)) {
-				folio_put(folio);
-				return MC_TARGET_NONE;
-			}
-			target->folio = folio;
-		}
-	}
-	return ret;
-}
-#else
-static inline enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
-		unsigned long addr, pmd_t pmd, union mc_target *target)
-{
-	return MC_TARGET_NONE;
-}
-#endif
-
-static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
-					unsigned long addr, unsigned long end,
-					struct mm_walk *walk)
-{
-	struct vm_area_struct *vma = walk->vma;
-	pte_t *pte;
-	spinlock_t *ptl;
-
-	ptl = pmd_trans_huge_lock(pmd, vma);
-	if (ptl) {
-		/*
-		 * Note their can not be MC_TARGET_DEVICE for now as we do not
-		 * support transparent huge page with MEMORY_DEVICE_PRIVATE but
-		 * this might change.
-		 */
-		if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
-			mc.precharge += HPAGE_PMD_NR;
-		spin_unlock(ptl);
-		return 0;
-	}
-
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	if (!pte)
-		return 0;
-	for (; addr != end; pte++, addr += PAGE_SIZE)
-		if (get_mctgt_type(vma, addr, ptep_get(pte), NULL))
-			mc.precharge++;	/* increment precharge temporarily */
-	pte_unmap_unlock(pte - 1, ptl);
-	cond_resched();
-
-	return 0;
-}
-
-static const struct mm_walk_ops precharge_walk_ops = {
-	.pmd_entry	= mem_cgroup_count_precharge_pte_range,
-	.walk_lock	= PGWALK_RDLOCK,
-};
-
-static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
-{
-	unsigned long precharge;
-
-	mmap_read_lock(mm);
-	walk_page_range(mm, 0, ULONG_MAX, &precharge_walk_ops, NULL);
-	mmap_read_unlock(mm);
-
-	precharge = mc.precharge;
-	mc.precharge = 0;
-
-	return precharge;
-}
-
-static int mem_cgroup_precharge_mc(struct mm_struct *mm)
-{
-	unsigned long precharge = mem_cgroup_count_precharge(mm);
-
-	VM_BUG_ON(mc.moving_task);
-	mc.moving_task = current;
-	return mem_cgroup_do_precharge(precharge);
-}
-
-/* cancels all extra charges on mc.from and mc.to, and wakes up all waiters. */
-static void __mem_cgroup_clear_mc(void)
-{
-	struct mem_cgroup *from = mc.from;
-	struct mem_cgroup *to = mc.to;
-
-	/* we must uncharge all the leftover precharges from mc.to */
-	if (mc.precharge) {
-		mem_cgroup_cancel_charge(mc.to, mc.precharge);
-		mc.precharge = 0;
-	}
-	/*
-	 * we didn't uncharge from mc.from at mem_cgroup_move_account(), so
-	 * we must uncharge here.
-	 */
-	if (mc.moved_charge) {
-		mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
-		mc.moved_charge = 0;
-	}
-	/* we must fixup refcnts and charges */
-	if (mc.moved_swap) {
-		/* uncharge swap account from the old cgroup */
-		if (!mem_cgroup_is_root(mc.from))
-			page_counter_uncharge(&mc.from->memsw, mc.moved_swap);
-
-		mem_cgroup_id_put_many(mc.from, mc.moved_swap);
-
-		/*
-		 * we charged both to->memory and to->memsw, so we
-		 * should uncharge to->memory.
-		 */
-		if (!mem_cgroup_is_root(mc.to))
-			page_counter_uncharge(&mc.to->memory, mc.moved_swap);
-
-		mc.moved_swap = 0;
-	}
-	memcg_oom_recover(from);
-	memcg_oom_recover(to);
-	wake_up_all(&mc.waitq);
-}
-
-static void mem_cgroup_clear_mc(void)
-{
-	struct mm_struct *mm = mc.mm;
-
-	/*
-	 * we must clear moving_task before waking up waiters at the end of
-	 * task migration.
-	 */
-	mc.moving_task = NULL;
-	__mem_cgroup_clear_mc();
-	spin_lock(&mc.lock);
-	mc.from = NULL;
-	mc.to = NULL;
-	mc.mm = NULL;
-	spin_unlock(&mc.lock);
-
-	mmput(mm);
-}
-
-static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
-{
-	struct cgroup_subsys_state *css;
-	struct mem_cgroup *memcg = NULL; /* unneeded init to make gcc happy */
-	struct mem_cgroup *from;
-	struct task_struct *leader, *p;
-	struct mm_struct *mm;
-	unsigned long move_flags;
-	int ret = 0;
-
-	/* charge immigration isn't supported on the default hierarchy */
-	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
-		return 0;
-
-	/*
-	 * Multi-process migrations only happen on the default hierarchy
-	 * where charge immigration is not used.  Perform charge
-	 * immigration if @tset contains a leader and whine if there are
-	 * multiple.
-	 */
-	p = NULL;
-	cgroup_taskset_for_each_leader(leader, css, tset) {
-		WARN_ON_ONCE(p);
-		p = leader;
-		memcg = mem_cgroup_from_css(css);
-	}
-	if (!p)
-		return 0;
-
-	/*
-	 * We are now committed to this value whatever it is. Changes in this
-	 * tunable will only affect upcoming migrations, not the current one.
-	 * So we need to save it, and keep it going.
-	 */
-	move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
-	if (!move_flags)
-		return 0;
-
-	from = mem_cgroup_from_task(p);
-
-	VM_BUG_ON(from == memcg);
-
-	mm = get_task_mm(p);
-	if (!mm)
-		return 0;
-	/* We move charges only when we move a owner of the mm */
-	if (mm->owner == p) {
-		VM_BUG_ON(mc.from);
-		VM_BUG_ON(mc.to);
-		VM_BUG_ON(mc.precharge);
-		VM_BUG_ON(mc.moved_charge);
-		VM_BUG_ON(mc.moved_swap);
-
-		spin_lock(&mc.lock);
-		mc.mm = mm;
-		mc.from = from;
-		mc.to = memcg;
-		mc.flags = move_flags;
-		spin_unlock(&mc.lock);
-		/* We set mc.moving_task later */
-
-		ret = mem_cgroup_precharge_mc(mm);
-		if (ret)
-			mem_cgroup_clear_mc();
-	} else {
-		mmput(mm);
-	}
-	return ret;
-}
-
-static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
-{
-	if (mc.to)
-		mem_cgroup_clear_mc();
-}
-
-static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
-				unsigned long addr, unsigned long end,
-				struct mm_walk *walk)
-{
-	int ret = 0;
-	struct vm_area_struct *vma = walk->vma;
-	pte_t *pte;
-	spinlock_t *ptl;
-	enum mc_target_type target_type;
-	union mc_target target;
-	struct folio *folio;
-
-	ptl = pmd_trans_huge_lock(pmd, vma);
-	if (ptl) {
-		if (mc.precharge < HPAGE_PMD_NR) {
-			spin_unlock(ptl);
-			return 0;
-		}
-		target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
-		if (target_type == MC_TARGET_PAGE) {
-			folio = target.folio;
-			if (folio_isolate_lru(folio)) {
-				if (!mem_cgroup_move_account(folio, true,
-							     mc.from, mc.to)) {
-					mc.precharge -= HPAGE_PMD_NR;
-					mc.moved_charge += HPAGE_PMD_NR;
-				}
-				folio_putback_lru(folio);
-			}
-			folio_unlock(folio);
-			folio_put(folio);
-		} else if (target_type == MC_TARGET_DEVICE) {
-			folio = target.folio;
-			if (!mem_cgroup_move_account(folio, true,
-						     mc.from, mc.to)) {
-				mc.precharge -= HPAGE_PMD_NR;
-				mc.moved_charge += HPAGE_PMD_NR;
-			}
-			folio_unlock(folio);
-			folio_put(folio);
-		}
-		spin_unlock(ptl);
-		return 0;
-	}
-
-retry:
-	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
-	if (!pte)
-		return 0;
-	for (; addr != end; addr += PAGE_SIZE) {
-		pte_t ptent = ptep_get(pte++);
-		bool device = false;
-		swp_entry_t ent;
-
-		if (!mc.precharge)
-			break;
-
-		switch (get_mctgt_type(vma, addr, ptent, &target)) {
-		case MC_TARGET_DEVICE:
-			device = true;
-			fallthrough;
-		case MC_TARGET_PAGE:
-			folio = target.folio;
-			/*
-			 * We can have a part of the split pmd here. Moving it
-			 * can be done but it would be too convoluted so simply
-			 * ignore such a partial THP and keep it in original
-			 * memcg. There should be somebody mapping the head.
-			 */
-			if (folio_test_large(folio))
-				goto put;
-			if (!device && !folio_isolate_lru(folio))
-				goto put;
-			if (!mem_cgroup_move_account(folio, false,
-						mc.from, mc.to)) {
-				mc.precharge--;
-				/* we uncharge from mc.from later. */
-				mc.moved_charge++;
-			}
-			if (!device)
-				folio_putback_lru(folio);
-put:			/* get_mctgt_type() gets & locks the page */
-			folio_unlock(folio);
-			folio_put(folio);
-			break;
-		case MC_TARGET_SWAP:
-			ent = target.ent;
-			if (!mem_cgroup_move_swap_account(ent, mc.from, mc.to)) {
-				mc.precharge--;
-				mem_cgroup_id_get_many(mc.to, 1);
-				/* we fixup other refcnts and charges later. */
-				mc.moved_swap++;
-			}
-			break;
-		default:
-			break;
-		}
-	}
-	pte_unmap_unlock(pte - 1, ptl);
-	cond_resched();
-
-	if (addr != end) {
-		/*
-		 * We have consumed all precharges we got in can_attach().
-		 * We try charge one by one, but don't do any additional
-		 * charges to mc.to if we have failed in charge once in attach()
-		 * phase.
-		 */
-		ret = mem_cgroup_do_precharge(1);
-		if (!ret)
-			goto retry;
-	}
-
-	return ret;
-}
-
-static const struct mm_walk_ops charge_walk_ops = {
-	.pmd_entry	= mem_cgroup_move_charge_pte_range,
-	.walk_lock	= PGWALK_RDLOCK,
-};
-
-static void mem_cgroup_move_charge(void)
-{
-	lru_add_drain_all();
-	/*
-	 * Signal folio_memcg_lock() to take the memcg's move_lock
-	 * while we're moving its pages to another memcg. Then wait
-	 * for already started RCU-only updates to finish.
-	 */
-	atomic_inc(&mc.from->moving_account);
-	synchronize_rcu();
-retry:
-	if (unlikely(!mmap_read_trylock(mc.mm))) {
-		/*
-		 * Someone who are holding the mmap_lock might be waiting in
-		 * waitq. So we cancel all extra charges, wake up all waiters,
-		 * and retry. Because we cancel precharges, we might not be able
-		 * to move enough charges, but moving charge is a best-effort
-		 * feature anyway, so it wouldn't be a big problem.
-		 */
-		__mem_cgroup_clear_mc();
-		cond_resched();
-		goto retry;
-	}
-	/*
-	 * When we have consumed all precharges and failed in doing
-	 * additional charge, the page walk just aborts.
-	 */
-	walk_page_range(mc.mm, 0, ULONG_MAX, &charge_walk_ops, NULL);
-	mmap_read_unlock(mc.mm);
-	atomic_dec(&mc.from->moving_account);
-}
-
-static void mem_cgroup_move_task(void)
-{
-	if (mc.to) {
-		mem_cgroup_move_charge();
-		mem_cgroup_clear_mc();
-	}
-}
-
-#else	/* !CONFIG_MMU */
-static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
-{
-	return 0;
-}
-static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
-{
-}
-static void mem_cgroup_move_task(void)
-{
-}
-#endif
-
 #ifdef CONFIG_MEMCG_KMEM
 static void mem_cgroup_fork(struct task_struct *task)
 {
-- 
2.45.2



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v2 05/14] mm: memcg: rename charge move-related functions
  2024-06-25  0:58 [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option Roman Gushchin
                   ` (3 preceding siblings ...)
  2024-06-25  0:58 ` [PATCH v2 04/14] mm: memcg: move charge migration code to memcontrol-v1.c Roman Gushchin
@ 2024-06-25  0:58 ` Roman Gushchin
  2024-06-25  7:07   ` Michal Hocko
  2024-06-25  0:58 ` [PATCH v2 06/14] mm: memcg: move legacy memcg event code into memcontrol-v1.c Roman Gushchin
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2024-06-25  0:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm, Roman Gushchin

Rename the exported functions related to charge moving to use
the memcg1_ prefix.
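
For context, the user-visible knob behind this machinery is the
v1-only memory.move_charge_at_immigrate file; the renames don't
change it. A rough userspace sketch of how the renamed hooks get
exercised (the mount point and group name are illustrative
assumptions, and a mounted legacy v1 memory hierarchy is assumed):

/*
 * Rough sketch, not from the tree: enable charge moving on a v1
 * memcg and attach a task to it. Paths are assumptions; "3" is
 * MOVE_ANON (0x1) | MOVE_FILE (0x2).
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void write_file(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fputs(val, f) == EOF || fclose(f) == EOF) {
		perror(path);
		exit(1);
	}
}

int main(void)
{
	const char *grp = "/sys/fs/cgroup/memory/dst";	/* assumed v1 mount */
	char path[256], pid[32];

	snprintf(path, sizeof(path), "%s/memory.move_charge_at_immigrate", grp);
	write_file(path, "3");	/* move both anon and file charges */

	snprintf(pid, sizeof(pid), "%d", (int)getpid());
	snprintf(path, sizeof(path), "%s/cgroup.procs", grp);
	write_file(path, pid);	/* kicks off can_attach -> post_attach */
	return 0;
}

The second write is what ends up calling memcg1_can_attach() (the
precharge) and later memcg1_move_task() (the actual move, wired up
as ->post_attach below).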

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/memcontrol-v1.c | 14 +++++++-------
 mm/memcontrol-v1.h |  8 ++++----
 mm/memcontrol.c    |  8 ++++----
 3 files changed, 15 insertions(+), 15 deletions(-)

diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index f4c8bec5ae1b..c25e038ac874 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -384,7 +384,7 @@ static bool mem_cgroup_under_move(struct mem_cgroup *memcg)
 	return ret;
 }
 
-bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
+bool memcg1_wait_acct_move(struct mem_cgroup *memcg)
 {
 	if (mc.moving_task && current != mc.moving_task) {
 		if (mem_cgroup_under_move(memcg)) {
@@ -1056,7 +1056,7 @@ static void mem_cgroup_clear_mc(void)
 	mmput(mm);
 }
 
-int mem_cgroup_can_attach(struct cgroup_taskset *tset)
+int memcg1_can_attach(struct cgroup_taskset *tset)
 {
 	struct cgroup_subsys_state *css;
 	struct mem_cgroup *memcg = NULL; /* unneeded init to make gcc happy */
@@ -1126,7 +1126,7 @@ int mem_cgroup_can_attach(struct cgroup_taskset *tset)
 	return ret;
 }
 
-void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
+void memcg1_cancel_attach(struct cgroup_taskset *tset)
 {
 	if (mc.to)
 		mem_cgroup_clear_mc();
@@ -1285,7 +1285,7 @@ static void mem_cgroup_move_charge(void)
 	atomic_dec(&mc.from->moving_account);
 }
 
-void mem_cgroup_move_task(void)
+void memcg1_move_task(void)
 {
 	if (mc.to) {
 		mem_cgroup_move_charge();
@@ -1294,14 +1294,14 @@ void mem_cgroup_move_task(void)
 }
 
 #else	/* !CONFIG_MMU */
-static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
+int memcg1_can_attach(struct cgroup_taskset *tset)
 {
 	return 0;
 }
-static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
+void memcg1_cancel_attach(struct cgroup_taskset *tset)
 {
 }
-static void mem_cgroup_move_task(void)
+void memcg1_move_task(void)
 {
 }
 #endif
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index 55e7c4f90c39..d377c0be9880 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -29,11 +29,11 @@ static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n);
 void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n);
 
-bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg);
+bool memcg1_wait_acct_move(struct mem_cgroup *memcg);
 struct cgroup_taskset;
-int mem_cgroup_can_attach(struct cgroup_taskset *tset);
-void mem_cgroup_cancel_attach(struct cgroup_taskset *tset);
-void mem_cgroup_move_task(void);
+int memcg1_can_attach(struct cgroup_taskset *tset);
+void memcg1_cancel_attach(struct cgroup_taskset *tset);
+void memcg1_move_task(void);
 
 struct cftype;
 u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3332c89cae2e..da2c0fa0de1b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2582,7 +2582,7 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	 * At task move, charge accounts can be doubly counted. So, it's
 	 * better to wait until the end of task_move if something is going on.
 	 */
-	if (mem_cgroup_wait_acct_move(mem_over_limit))
+	if (memcg1_wait_acct_move(mem_over_limit))
 		goto retry;
 
 	if (nr_retries--)
@@ -6030,12 +6030,12 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.css_free = mem_cgroup_css_free,
 	.css_reset = mem_cgroup_css_reset,
 	.css_rstat_flush = mem_cgroup_css_rstat_flush,
-	.can_attach = mem_cgroup_can_attach,
+	.can_attach = memcg1_can_attach,
 #if defined(CONFIG_LRU_GEN) || defined(CONFIG_MEMCG_KMEM)
 	.attach = mem_cgroup_attach,
 #endif
-	.cancel_attach = mem_cgroup_cancel_attach,
-	.post_attach = mem_cgroup_move_task,
+	.cancel_attach = memcg1_cancel_attach,
+	.post_attach = memcg1_move_task,
 #ifdef CONFIG_MEMCG_KMEM
 	.fork = mem_cgroup_fork,
 	.exit = mem_cgroup_exit,
-- 
2.45.2



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v2 06/14] mm: memcg: move legacy memcg event code into memcontrol-v1.c
  2024-06-25  0:58 [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option Roman Gushchin
                   ` (4 preceding siblings ...)
  2024-06-25  0:58 ` [PATCH v2 05/14] mm: memcg: rename charge move-related functions Roman Gushchin
@ 2024-06-25  0:58 ` Roman Gushchin
  2024-06-25  7:07   ` Michal Hocko
  2024-06-25  0:58 ` [PATCH v2 07/14] mm: memcg: rename memcg_check_events() Roman Gushchin
                   ` (8 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2024-06-25  0:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm, Roman Gushchin

Cgroup v1's memory controller contains a pretty complicated event
notification mechanism which is not used on cgroup v2. Let's move
the corresponding code into memcontrol-v1.c.

Please note that mem_cgroup_event_ratelimit() remains in
memcontrol.c; moving it would require exporting too many
details on memcg stats outside of memcontrol.c.
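
For a reminder of what this mechanism looks like from the user
side, here is a rough sketch (the cgroup path and the 64M threshold
are illustrative assumptions) that arms an eventfd for a usage
threshold through cgroup.event_control:

/*
 * Rough sketch, not from the tree: wait for a v1 memcg usage
 * threshold. cgroup.event_control takes "<event_fd> <control_fd>
 * <args>"; here <args> is a threshold in bytes.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
	const char *grp = "/sys/fs/cgroup/memory/demo";	/* assumed v1 group */
	char path[256], cmd[64];
	uint64_t ticks;
	int efd, cfd, ecfd;

	efd = eventfd(0, 0);
	snprintf(path, sizeof(path), "%s/memory.usage_in_bytes", grp);
	cfd = open(path, O_RDONLY);
	snprintf(path, sizeof(path), "%s/cgroup.event_control", grp);
	ecfd = open(path, O_WRONLY);
	if (efd < 0 || cfd < 0 || ecfd < 0) {
		perror("setup");
		return 1;
	}

	/* "<event_fd> <control_fd> <threshold in bytes>" */
	snprintf(cmd, sizeof(cmd), "%d %d %llu", efd, cfd, 64ULL << 20);
	if (write(ecfd, cmd, strlen(cmd)) < 0) {
		perror("cgroup.event_control");
		return 1;
	}

	/* blocks until usage crosses the threshold */
	if (read(efd, &ticks, sizeof(ticks)) == (ssize_t)sizeof(ticks))
		printf("threshold crossed %llu time(s)\n",
		       (unsigned long long)ticks);
	return 0;
}

The threshold string is parsed by __mem_cgroup_usage_register_event()
in the code being moved, and crossings are signalled from
__mem_cgroup_threshold().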

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/linux/memcontrol.h |  12 -
 mm/memcontrol-v1.c         | 653 +++++++++++++++++++++++++++++++++++
 mm/memcontrol-v1.h         |  51 +++
 mm/memcontrol.c            | 687 +------------------------------------
 4 files changed, 709 insertions(+), 694 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 83c8327455d8..588179d29849 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -69,18 +69,6 @@ struct mem_cgroup_id {
 	refcount_t ref;
 };
 
-/*
- * Per memcg event counter is incremented at every pagein/pageout. With THP,
- * it will be incremented by the number of pages. This counter is used
- * to trigger some periodic events. This is straightforward and better
- * than using jiffies etc. to handle periodic memcg event.
- */
-enum mem_cgroup_events_target {
-	MEM_CGROUP_TARGET_THRESH,
-	MEM_CGROUP_TARGET_SOFTLIMIT,
-	MEM_CGROUP_NTARGETS,
-};
-
 struct memcg_vmstats_percpu;
 struct memcg_vmstats;
 struct lruvec_stats_percpu;
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index c25e038ac874..4b2290ceace6 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -6,6 +6,10 @@
 #include <linux/pagewalk.h>
 #include <linux/backing-dev.h>
 #include <linux/swap_cgroup.h>
+#include <linux/eventfd.h>
+#include <linux/poll.h>
+#include <linux/sort.h>
+#include <linux/file.h>
 
 #include "internal.h"
 #include "swap.h"
@@ -60,6 +64,54 @@ static struct move_charge_struct {
 	.waitq = __WAIT_QUEUE_HEAD_INITIALIZER(mc.waitq),
 };
 
+/* for OOM */
+struct mem_cgroup_eventfd_list {
+	struct list_head list;
+	struct eventfd_ctx *eventfd;
+};
+
+/*
+ * cgroup_event represents events which userspace wants to receive.
+ */
+struct mem_cgroup_event {
+	/*
+	 * memcg which the event belongs to.
+	 */
+	struct mem_cgroup *memcg;
+	/*
+	 * eventfd to signal userspace about the event.
+	 */
+	struct eventfd_ctx *eventfd;
+	/*
+	 * Each of these stored in a list by the cgroup.
+	 */
+	struct list_head list;
+	/*
+	 * register_event() callback will be used to add a new userspace
+	 * waiter for changes related to this event.  Use eventfd_signal()
+	 * on the eventfd to send a notification to userspace.
+	 */
+	int (*register_event)(struct mem_cgroup *memcg,
+			      struct eventfd_ctx *eventfd, const char *args);
+	/*
+	 * unregister_event() callback will be called when userspace closes
+	 * the eventfd or on cgroup removal.  This callback must be set
+	 * if you want to provide notification functionality.
+	 */
+	void (*unregister_event)(struct mem_cgroup *memcg,
+				 struct eventfd_ctx *eventfd);
+	/*
+	 * All fields below are needed to unregister the event when
+	 * userspace closes the eventfd.
+	 */
+	poll_table pt;
+	wait_queue_head_t *wqh;
+	wait_queue_entry_t wait;
+	struct work_struct remove;
+};
+
+extern spinlock_t memcg_oom_lock;
+
 static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
 					 struct mem_cgroup_tree_per_node *mctz,
 					 unsigned long new_usage_in_excess)
@@ -1306,6 +1358,607 @@ void memcg1_move_task(void)
 }
 #endif
 
+static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
+{
+	struct mem_cgroup_threshold_ary *t;
+	unsigned long usage;
+	int i;
+
+	rcu_read_lock();
+	if (!swap)
+		t = rcu_dereference(memcg->thresholds.primary);
+	else
+		t = rcu_dereference(memcg->memsw_thresholds.primary);
+
+	if (!t)
+		goto unlock;
+
+	usage = mem_cgroup_usage(memcg, swap);
+
+	/*
+	 * current_threshold points to the threshold just below or equal to
+	 * usage. If that's not true, a threshold was crossed after the last
+	 * call of __mem_cgroup_threshold().
+	 */
+	i = t->current_threshold;
+
+	/*
+	 * Iterate backward over array of thresholds starting from
+	 * current_threshold and check if a threshold is crossed.
+	 * If none of the thresholds below usage is crossed, we read
+	 * only one element of the array here.
+	 */
+	for (; i >= 0 && unlikely(t->entries[i].threshold > usage); i--)
+		eventfd_signal(t->entries[i].eventfd);
+
+	/* i = current_threshold + 1 */
+	i++;
+
+	/*
+	 * Iterate forward over array of thresholds starting from
+	 * current_threshold+1 and check if a threshold is crossed.
+	 * If none of the thresholds above usage is crossed, we read
+	 * only one element of the array here.
+	 */
+	for (; i < t->size && unlikely(t->entries[i].threshold <= usage); i++)
+		eventfd_signal(t->entries[i].eventfd);
+
+	/* Update current_threshold */
+	t->current_threshold = i - 1;
+unlock:
+	rcu_read_unlock();
+}
+
+static void mem_cgroup_threshold(struct mem_cgroup *memcg)
+{
+	while (memcg) {
+		__mem_cgroup_threshold(memcg, false);
+		if (do_memsw_account())
+			__mem_cgroup_threshold(memcg, true);
+
+		memcg = parent_mem_cgroup(memcg);
+	}
+}
+
+/*
+ * Check events in order: thresholds first, then the v1 soft limit
+ * tree update.
+ */
+void memcg_check_events(struct mem_cgroup *memcg, int nid)
+{
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		return;
+
+	/* threshold event is triggered in finer grain than soft limit */
+	if (unlikely(mem_cgroup_event_ratelimit(memcg,
+						MEM_CGROUP_TARGET_THRESH))) {
+		bool do_softlimit;
+
+		do_softlimit = mem_cgroup_event_ratelimit(memcg,
+						MEM_CGROUP_TARGET_SOFTLIMIT);
+		mem_cgroup_threshold(memcg);
+		if (unlikely(do_softlimit))
+			memcg1_update_tree(memcg, nid);
+	}
+}
+
+static int compare_thresholds(const void *a, const void *b)
+{
+	const struct mem_cgroup_threshold *_a = a;
+	const struct mem_cgroup_threshold *_b = b;
+
+	if (_a->threshold > _b->threshold)
+		return 1;
+
+	if (_a->threshold < _b->threshold)
+		return -1;
+
+	return 0;
+}
+
+static int mem_cgroup_oom_notify_cb(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup_eventfd_list *ev;
+
+	spin_lock(&memcg_oom_lock);
+
+	list_for_each_entry(ev, &memcg->oom_notify, list)
+		eventfd_signal(ev->eventfd);
+
+	spin_unlock(&memcg_oom_lock);
+	return 0;
+}
+
+void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *iter;
+
+	for_each_mem_cgroup_tree(iter, memcg)
+		mem_cgroup_oom_notify_cb(iter);
+}
+
+static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
+	struct eventfd_ctx *eventfd, const char *args, enum res_type type)
+{
+	struct mem_cgroup_thresholds *thresholds;
+	struct mem_cgroup_threshold_ary *new;
+	unsigned long threshold;
+	unsigned long usage;
+	int i, size, ret;
+
+	ret = page_counter_memparse(args, "-1", &threshold);
+	if (ret)
+		return ret;
+
+	mutex_lock(&memcg->thresholds_lock);
+
+	if (type == _MEM) {
+		thresholds = &memcg->thresholds;
+		usage = mem_cgroup_usage(memcg, false);
+	} else if (type == _MEMSWAP) {
+		thresholds = &memcg->memsw_thresholds;
+		usage = mem_cgroup_usage(memcg, true);
+	} else
+		BUG();
+
+	/* Check if a threshold crossed before adding a new one */
+	if (thresholds->primary)
+		__mem_cgroup_threshold(memcg, type == _MEMSWAP);
+
+	size = thresholds->primary ? thresholds->primary->size + 1 : 1;
+
+	/* Allocate memory for new array of thresholds */
+	new = kmalloc(struct_size(new, entries, size), GFP_KERNEL);
+	if (!new) {
+		ret = -ENOMEM;
+		goto unlock;
+	}
+	new->size = size;
+
+	/* Copy thresholds (if any) to new array */
+	if (thresholds->primary)
+		memcpy(new->entries, thresholds->primary->entries,
+		       flex_array_size(new, entries, size - 1));
+
+	/* Add new threshold */
+	new->entries[size - 1].eventfd = eventfd;
+	new->entries[size - 1].threshold = threshold;
+
+	/* Sort thresholds. Registering of new threshold isn't time-critical */
+	sort(new->entries, size, sizeof(*new->entries),
+			compare_thresholds, NULL);
+
+	/* Find current threshold */
+	new->current_threshold = -1;
+	for (i = 0; i < size; i++) {
+		if (new->entries[i].threshold <= usage) {
+			/*
+			 * new->current_threshold will not be used until
+			 * rcu_assign_pointer(), so it's safe to increment
+			 * it here.
+			 */
+			++new->current_threshold;
+		} else
+			break;
+	}
+
+	/* Free old spare buffer and save old primary buffer as spare */
+	kfree(thresholds->spare);
+	thresholds->spare = thresholds->primary;
+
+	rcu_assign_pointer(thresholds->primary, new);
+
+	/* To be sure that nobody uses thresholds */
+	synchronize_rcu();
+
+unlock:
+	mutex_unlock(&memcg->thresholds_lock);
+
+	return ret;
+}
+
+static int mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
+	struct eventfd_ctx *eventfd, const char *args)
+{
+	return __mem_cgroup_usage_register_event(memcg, eventfd, args, _MEM);
+}
+
+static int memsw_cgroup_usage_register_event(struct mem_cgroup *memcg,
+	struct eventfd_ctx *eventfd, const char *args)
+{
+	return __mem_cgroup_usage_register_event(memcg, eventfd, args, _MEMSWAP);
+}
+
+static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
+	struct eventfd_ctx *eventfd, enum res_type type)
+{
+	struct mem_cgroup_thresholds *thresholds;
+	struct mem_cgroup_threshold_ary *new;
+	unsigned long usage;
+	int i, j, size, entries;
+
+	mutex_lock(&memcg->thresholds_lock);
+
+	if (type == _MEM) {
+		thresholds = &memcg->thresholds;
+		usage = mem_cgroup_usage(memcg, false);
+	} else if (type == _MEMSWAP) {
+		thresholds = &memcg->memsw_thresholds;
+		usage = mem_cgroup_usage(memcg, true);
+	} else
+		BUG();
+
+	if (!thresholds->primary)
+		goto unlock;
+
+	/* Check if a threshold crossed before removing */
+	__mem_cgroup_threshold(memcg, type == _MEMSWAP);
+
+	/* Calculate the new number of thresholds */
+	size = entries = 0;
+	for (i = 0; i < thresholds->primary->size; i++) {
+		if (thresholds->primary->entries[i].eventfd != eventfd)
+			size++;
+		else
+			entries++;
+	}
+
+	new = thresholds->spare;
+
+	/* If no items related to eventfd have been cleared, nothing to do */
+	if (!entries)
+		goto unlock;
+
+	/* Set thresholds array to NULL if we don't have thresholds */
+	if (!size) {
+		kfree(new);
+		new = NULL;
+		goto swap_buffers;
+	}
+
+	new->size = size;
+
+	/* Copy thresholds and find current threshold */
+	new->current_threshold = -1;
+	for (i = 0, j = 0; i < thresholds->primary->size; i++) {
+		if (thresholds->primary->entries[i].eventfd == eventfd)
+			continue;
+
+		new->entries[j] = thresholds->primary->entries[i];
+		if (new->entries[j].threshold <= usage) {
+			/*
+			 * new->current_threshold will not be used
+			 * until rcu_assign_pointer(), so it's safe to increment
+			 * it here.
+			 */
+			++new->current_threshold;
+		}
+		j++;
+	}
+
+swap_buffers:
+	/* Swap primary and spare array */
+	thresholds->spare = thresholds->primary;
+
+	rcu_assign_pointer(thresholds->primary, new);
+
+	/* To be sure that nobody uses thresholds */
+	synchronize_rcu();
+
+	/* If all events are unregistered, free the spare array */
+	if (!new) {
+		kfree(thresholds->spare);
+		thresholds->spare = NULL;
+	}
+unlock:
+	mutex_unlock(&memcg->thresholds_lock);
+}
+
+static void mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
+	struct eventfd_ctx *eventfd)
+{
+	return __mem_cgroup_usage_unregister_event(memcg, eventfd, _MEM);
+}
+
+static void memsw_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
+	struct eventfd_ctx *eventfd)
+{
+	return __mem_cgroup_usage_unregister_event(memcg, eventfd, _MEMSWAP);
+}
+
+static int mem_cgroup_oom_register_event(struct mem_cgroup *memcg,
+	struct eventfd_ctx *eventfd, const char *args)
+{
+	struct mem_cgroup_eventfd_list *event;
+
+	event = kmalloc(sizeof(*event),	GFP_KERNEL);
+	if (!event)
+		return -ENOMEM;
+
+	spin_lock(&memcg_oom_lock);
+
+	event->eventfd = eventfd;
+	list_add(&event->list, &memcg->oom_notify);
+
+	/* already in OOM? */
+	if (memcg->under_oom)
+		eventfd_signal(eventfd);
+	spin_unlock(&memcg_oom_lock);
+
+	return 0;
+}
+
+static void mem_cgroup_oom_unregister_event(struct mem_cgroup *memcg,
+	struct eventfd_ctx *eventfd)
+{
+	struct mem_cgroup_eventfd_list *ev, *tmp;
+
+	spin_lock(&memcg_oom_lock);
+
+	list_for_each_entry_safe(ev, tmp, &memcg->oom_notify, list) {
+		if (ev->eventfd == eventfd) {
+			list_del(&ev->list);
+			kfree(ev);
+		}
+	}
+
+	spin_unlock(&memcg_oom_lock);
+}
+
+/*
+ * DO NOT USE IN NEW FILES.
+ *
+ * "cgroup.event_control" implementation.
+ *
+ * This is way over-engineered.  It tries to support fully configurable
+ * events for each user.  Such level of flexibility is completely
+ * unnecessary especially in the light of the planned unified hierarchy.
+ *
+ * Please deprecate this and replace with something simpler if at all
+ * possible.
+ */
+
+/*
+ * Unregister event and free resources.
+ *
+ * Gets called from workqueue.
+ */
+static void memcg_event_remove(struct work_struct *work)
+{
+	struct mem_cgroup_event *event =
+		container_of(work, struct mem_cgroup_event, remove);
+	struct mem_cgroup *memcg = event->memcg;
+
+	remove_wait_queue(event->wqh, &event->wait);
+
+	event->unregister_event(memcg, event->eventfd);
+
+	/* Notify userspace the event is going away. */
+	eventfd_signal(event->eventfd);
+
+	eventfd_ctx_put(event->eventfd);
+	kfree(event);
+	css_put(&memcg->css);
+}
+
+/*
+ * Gets called on EPOLLHUP on eventfd when user closes it.
+ *
+ * Called with wqh->lock held and interrupts disabled.
+ */
+static int memcg_event_wake(wait_queue_entry_t *wait, unsigned mode,
+			    int sync, void *key)
+{
+	struct mem_cgroup_event *event =
+		container_of(wait, struct mem_cgroup_event, wait);
+	struct mem_cgroup *memcg = event->memcg;
+	__poll_t flags = key_to_poll(key);
+
+	if (flags & EPOLLHUP) {
+		/*
+		 * If the event has been detached at cgroup removal, we
+		 * can simply return knowing the other side will cleanup
+		 * for us.
+		 *
+		 * We can't race against event freeing since the other
+		 * side will require wqh->lock via remove_wait_queue(),
+		 * which we hold.
+		 */
+		spin_lock(&memcg->event_list_lock);
+		if (!list_empty(&event->list)) {
+			list_del_init(&event->list);
+			/*
+			 * We are in atomic context, but memcg_event_remove()
+			 * may sleep, so we have to call it via a workqueue.
+			 */
+			schedule_work(&event->remove);
+		}
+		spin_unlock(&memcg->event_list_lock);
+	}
+
+	return 0;
+}
+
+static void memcg_event_ptable_queue_proc(struct file *file,
+		wait_queue_head_t *wqh, poll_table *pt)
+{
+	struct mem_cgroup_event *event =
+		container_of(pt, struct mem_cgroup_event, pt);
+
+	event->wqh = wqh;
+	add_wait_queue(wqh, &event->wait);
+}
+
+/*
+ * DO NOT USE IN NEW FILES.
+ *
+ * Parse input and register new cgroup event handler.
+ *
+ * Input must be in format '<event_fd> <control_fd> <args>'.
+ * Interpretation of args is defined by control file implementation.
+ */
+ssize_t memcg_write_event_control(struct kernfs_open_file *of,
+				  char *buf, size_t nbytes, loff_t off)
+{
+	struct cgroup_subsys_state *css = of_css(of);
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	struct mem_cgroup_event *event;
+	struct cgroup_subsys_state *cfile_css;
+	unsigned int efd, cfd;
+	struct fd efile;
+	struct fd cfile;
+	struct dentry *cdentry;
+	const char *name;
+	char *endp;
+	int ret;
+
+	if (IS_ENABLED(CONFIG_PREEMPT_RT))
+		return -EOPNOTSUPP;
+
+	buf = strstrip(buf);
+
+	efd = simple_strtoul(buf, &endp, 10);
+	if (*endp != ' ')
+		return -EINVAL;
+	buf = endp + 1;
+
+	cfd = simple_strtoul(buf, &endp, 10);
+	if ((*endp != ' ') && (*endp != '\0'))
+		return -EINVAL;
+	buf = endp + 1;
+
+	event = kzalloc(sizeof(*event), GFP_KERNEL);
+	if (!event)
+		return -ENOMEM;
+
+	event->memcg = memcg;
+	INIT_LIST_HEAD(&event->list);
+	init_poll_funcptr(&event->pt, memcg_event_ptable_queue_proc);
+	init_waitqueue_func_entry(&event->wait, memcg_event_wake);
+	INIT_WORK(&event->remove, memcg_event_remove);
+
+	efile = fdget(efd);
+	if (!efile.file) {
+		ret = -EBADF;
+		goto out_kfree;
+	}
+
+	event->eventfd = eventfd_ctx_fileget(efile.file);
+	if (IS_ERR(event->eventfd)) {
+		ret = PTR_ERR(event->eventfd);
+		goto out_put_efile;
+	}
+
+	cfile = fdget(cfd);
+	if (!cfile.file) {
+		ret = -EBADF;
+		goto out_put_eventfd;
+	}
+
+	/* the process needs read permission on the control file */
+	/* AV: shouldn't we check that it's been opened for read instead? */
+	ret = file_permission(cfile.file, MAY_READ);
+	if (ret < 0)
+		goto out_put_cfile;
+
+	/*
+	 * The control file must be a regular cgroup1 file. As a regular cgroup
+	 * file can't be renamed, it's safe to access its name afterwards.
+	 */
+	cdentry = cfile.file->f_path.dentry;
+	if (cdentry->d_sb->s_type != &cgroup_fs_type || !d_is_reg(cdentry)) {
+		ret = -EINVAL;
+		goto out_put_cfile;
+	}
+
+	/*
+	 * Determine the event callbacks and set them in @event.  This used
+	 * to be done via struct cftype but cgroup core no longer knows
+	 * about these events.  The following is crude but the whole thing
+	 * is for compatibility anyway.
+	 *
+	 * DO NOT ADD NEW FILES.
+	 */
+	name = cdentry->d_name.name;
+
+	if (!strcmp(name, "memory.usage_in_bytes")) {
+		event->register_event = mem_cgroup_usage_register_event;
+		event->unregister_event = mem_cgroup_usage_unregister_event;
+	} else if (!strcmp(name, "memory.oom_control")) {
+		event->register_event = mem_cgroup_oom_register_event;
+		event->unregister_event = mem_cgroup_oom_unregister_event;
+	} else if (!strcmp(name, "memory.pressure_level")) {
+		event->register_event = vmpressure_register_event;
+		event->unregister_event = vmpressure_unregister_event;
+	} else if (!strcmp(name, "memory.memsw.usage_in_bytes")) {
+		event->register_event = memsw_cgroup_usage_register_event;
+		event->unregister_event = memsw_cgroup_usage_unregister_event;
+	} else {
+		ret = -EINVAL;
+		goto out_put_cfile;
+	}
+
+	/*
+	 * Verify that @cfile belongs to @css.  Also, remaining events are
+	 * automatically removed on cgroup destruction but the removal is
+	 * asynchronous, so take an extra ref on @css.
+	 */
+	cfile_css = css_tryget_online_from_dir(cdentry->d_parent,
+					       &memory_cgrp_subsys);
+	ret = -EINVAL;
+	if (IS_ERR(cfile_css))
+		goto out_put_cfile;
+	if (cfile_css != css) {
+		css_put(cfile_css);
+		goto out_put_cfile;
+	}
+
+	ret = event->register_event(memcg, event->eventfd, buf);
+	if (ret)
+		goto out_put_css;
+
+	vfs_poll(efile.file, &event->pt);
+
+	spin_lock_irq(&memcg->event_list_lock);
+	list_add(&event->list, &memcg->event_list);
+	spin_unlock_irq(&memcg->event_list_lock);
+
+	fdput(cfile);
+	fdput(efile);
+
+	return nbytes;
+
+out_put_css:
+	css_put(css);
+out_put_cfile:
+	fdput(cfile);
+out_put_eventfd:
+	eventfd_ctx_put(event->eventfd);
+out_put_efile:
+	fdput(efile);
+out_kfree:
+	kfree(event);
+
+	return ret;
+}
+
+void memcg1_css_offline(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup_event *event, *tmp;
+
+	/*
+	 * Unregister events and notify userspace.
+	 * Notify userspace about the cgroup removal only after rmdir of the
+	 * cgroup directory to avoid races between userspace and kernelspace.
+	 */
+	spin_lock_irq(&memcg->event_list_lock);
+	list_for_each_entry_safe(event, tmp, &memcg->event_list, list) {
+		list_del_init(&event->list);
+		schedule_work(&event->remove);
+	}
+	spin_unlock_irq(&memcg->event_list_lock);
+}
+
 static int __init memcg1_init(void)
 {
 	int node;
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index d377c0be9880..524a2c76ffc9 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -41,4 +41,55 @@ u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css,
 int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
 				 struct cftype *cft, u64 val);
 
+/*
+ * The per-memcg event counter is incremented on every pagein/pageout. With
+ * THP, it is incremented by the number of pages. This counter is used
+ * to trigger some periodic events. This is straightforward and better
+ * than using jiffies etc. to handle periodic memcg events.
+ */
+enum mem_cgroup_events_target {
+	MEM_CGROUP_TARGET_THRESH,
+	MEM_CGROUP_TARGET_SOFTLIMIT,
+	MEM_CGROUP_NTARGETS,
+};
+
+/* Whether legacy memory+swap accounting is active */
+static bool do_memsw_account(void)
+{
+	return !cgroup_subsys_on_dfl(memory_cgrp_subsys);
+}
+
+/*
+ * Iteration constructs for visiting all cgroups (under a tree).  If
+ * loops are exited prematurely (break), mem_cgroup_iter_break() must
+ * be used for reference counting.
+ */
+#define for_each_mem_cgroup_tree(iter, root)		\
+	for (iter = mem_cgroup_iter(root, NULL, NULL);	\
+	     iter != NULL;				\
+	     iter = mem_cgroup_iter(root, iter, NULL))
+
+#define for_each_mem_cgroup(iter)			\
+	for (iter = mem_cgroup_iter(NULL, NULL, NULL);	\
+	     iter != NULL;				\
+	     iter = mem_cgroup_iter(NULL, iter, NULL))
+
+void memcg1_css_offline(struct mem_cgroup *memcg);
+
+/* for encoding cft->private value on file */
+enum res_type {
+	_MEM,
+	_MEMSWAP,
+	_KMEM,
+	_TCP,
+};
+
+bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
+				enum mem_cgroup_events_target target);
+unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap);
+void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
+ssize_t memcg_write_event_control(struct kernfs_open_file *of,
+				  char *buf, size_t nbytes, loff_t off);
+
+
 #endif	/* __MM_MEMCONTROL_V1_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index da2c0fa0de1b..bd4b26a73596 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -46,9 +46,6 @@
 #include <linux/slab.h>
 #include <linux/swapops.h>
 #include <linux/spinlock.h>
-#include <linux/eventfd.h>
-#include <linux/poll.h>
-#include <linux/sort.h>
 #include <linux/fs.h>
 #include <linux/seq_file.h>
 #include <linux/parser.h>
@@ -59,7 +56,6 @@
 #include <linux/cpu.h>
 #include <linux/oom.h>
 #include <linux/lockdep.h>
-#include <linux/file.h>
 #include <linux/resume_user_mode.h>
 #include <linux/psi.h>
 #include <linux/seq_buf.h>
@@ -97,91 +93,13 @@ static bool cgroup_memory_nobpf __ro_after_init;
 static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
 #endif
 
-/* Whether legacy memory+swap accounting is active */
-static bool do_memsw_account(void)
-{
-	return !cgroup_subsys_on_dfl(memory_cgrp_subsys);
-}
-
 #define THRESHOLDS_EVENTS_TARGET 128
 #define SOFTLIMIT_EVENTS_TARGET 1024
 
-/* for OOM */
-struct mem_cgroup_eventfd_list {
-	struct list_head list;
-	struct eventfd_ctx *eventfd;
-};
-
-/*
- * cgroup_event represents events which userspace want to receive.
- */
-struct mem_cgroup_event {
-	/*
-	 * memcg which the event belongs to.
-	 */
-	struct mem_cgroup *memcg;
-	/*
-	 * eventfd to signal userspace about the event.
-	 */
-	struct eventfd_ctx *eventfd;
-	/*
-	 * Each of these stored in a list by the cgroup.
-	 */
-	struct list_head list;
-	/*
-	 * register_event() callback will be used to add new userspace
-	 * waiter for changes related to this event.  Use eventfd_signal()
-	 * on eventfd to send notification to userspace.
-	 */
-	int (*register_event)(struct mem_cgroup *memcg,
-			      struct eventfd_ctx *eventfd, const char *args);
-	/*
-	 * unregister_event() callback will be called when userspace closes
-	 * the eventfd or on cgroup removing.  This callback must be set,
-	 * if you want provide notification functionality.
-	 */
-	void (*unregister_event)(struct mem_cgroup *memcg,
-				 struct eventfd_ctx *eventfd);
-	/*
-	 * All fields below needed to unregister event when
-	 * userspace closes eventfd.
-	 */
-	poll_table pt;
-	wait_queue_head_t *wqh;
-	wait_queue_entry_t wait;
-	struct work_struct remove;
-};
-
-static void mem_cgroup_threshold(struct mem_cgroup *memcg);
-static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
-
-/* for encoding cft->private value on file */
-enum res_type {
-	_MEM,
-	_MEMSWAP,
-	_KMEM,
-	_TCP,
-};
-
 #define MEMFILE_PRIVATE(x, val)	((x) << 16 | (val))
 #define MEMFILE_TYPE(val)	((val) >> 16 & 0xffff)
 #define MEMFILE_ATTR(val)	((val) & 0xffff)
 
-/*
- * Iteration constructs for visiting all cgroups (under a tree).  If
- * loops are exited prematurely (break), mem_cgroup_iter_break() must
- * be used for reference counting.
- */
-#define for_each_mem_cgroup_tree(iter, root)		\
-	for (iter = mem_cgroup_iter(root, NULL, NULL);	\
-	     iter != NULL;				\
-	     iter = mem_cgroup_iter(root, iter, NULL))
-
-#define for_each_mem_cgroup(iter)			\
-	for (iter = mem_cgroup_iter(NULL, NULL, NULL);	\
-	     iter != NULL;				\
-	     iter = mem_cgroup_iter(NULL, iter, NULL))
-
 static inline bool task_is_dying(void)
 {
 	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
@@ -940,8 +858,8 @@ void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages)
 	__this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages);
 }
 
-static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
-				       enum mem_cgroup_events_target target)
+bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
+				enum mem_cgroup_events_target target)
 {
 	unsigned long val, next;
 
@@ -965,28 +883,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 	return false;
 }
 
-/*
- * Check events in order.
- *
- */
-void memcg_check_events(struct mem_cgroup *memcg, int nid)
-{
-	if (IS_ENABLED(CONFIG_PREEMPT_RT))
-		return;
-
-	/* threshold event is triggered in finer grain than soft limit */
-	if (unlikely(mem_cgroup_event_ratelimit(memcg,
-						MEM_CGROUP_TARGET_THRESH))) {
-		bool do_softlimit;
-
-		do_softlimit = mem_cgroup_event_ratelimit(memcg,
-						MEM_CGROUP_TARGET_SOFTLIMIT);
-		mem_cgroup_threshold(memcg);
-		if (unlikely(do_softlimit))
-			memcg1_update_tree(memcg, nid);
-	}
-}
-
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
 {
 	/*
@@ -1726,7 +1622,7 @@ static struct lockdep_map memcg_oom_lock_dep_map = {
 };
 #endif
 
-static DEFINE_SPINLOCK(memcg_oom_lock);
+DEFINE_SPINLOCK(memcg_oom_lock);
 
 /*
  * Check OOM-Killer is already running under our hierarchy.
@@ -3545,7 +3441,7 @@ static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css,
 	return -EINVAL;
 }
 
-static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
+unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 {
 	unsigned long val;
 
@@ -4046,331 +3942,6 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
 	return 0;
 }
 
-static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
-{
-	struct mem_cgroup_threshold_ary *t;
-	unsigned long usage;
-	int i;
-
-	rcu_read_lock();
-	if (!swap)
-		t = rcu_dereference(memcg->thresholds.primary);
-	else
-		t = rcu_dereference(memcg->memsw_thresholds.primary);
-
-	if (!t)
-		goto unlock;
-
-	usage = mem_cgroup_usage(memcg, swap);
-
-	/*
-	 * current_threshold points to threshold just below or equal to usage.
-	 * If it's not true, a threshold was crossed after last
-	 * call of __mem_cgroup_threshold().
-	 */
-	i = t->current_threshold;
-
-	/*
-	 * Iterate backward over array of thresholds starting from
-	 * current_threshold and check if a threshold is crossed.
-	 * If none of thresholds below usage is crossed, we read
-	 * only one element of the array here.
-	 */
-	for (; i >= 0 && unlikely(t->entries[i].threshold > usage); i--)
-		eventfd_signal(t->entries[i].eventfd);
-
-	/* i = current_threshold + 1 */
-	i++;
-
-	/*
-	 * Iterate forward over array of thresholds starting from
-	 * current_threshold+1 and check if a threshold is crossed.
-	 * If none of thresholds above usage is crossed, we read
-	 * only one element of the array here.
-	 */
-	for (; i < t->size && unlikely(t->entries[i].threshold <= usage); i++)
-		eventfd_signal(t->entries[i].eventfd);
-
-	/* Update current_threshold */
-	t->current_threshold = i - 1;
-unlock:
-	rcu_read_unlock();
-}
-
-static void mem_cgroup_threshold(struct mem_cgroup *memcg)
-{
-	while (memcg) {
-		__mem_cgroup_threshold(memcg, false);
-		if (do_memsw_account())
-			__mem_cgroup_threshold(memcg, true);
-
-		memcg = parent_mem_cgroup(memcg);
-	}
-}
-
-static int compare_thresholds(const void *a, const void *b)
-{
-	const struct mem_cgroup_threshold *_a = a;
-	const struct mem_cgroup_threshold *_b = b;
-
-	if (_a->threshold > _b->threshold)
-		return 1;
-
-	if (_a->threshold < _b->threshold)
-		return -1;
-
-	return 0;
-}
-
-static int mem_cgroup_oom_notify_cb(struct mem_cgroup *memcg)
-{
-	struct mem_cgroup_eventfd_list *ev;
-
-	spin_lock(&memcg_oom_lock);
-
-	list_for_each_entry(ev, &memcg->oom_notify, list)
-		eventfd_signal(ev->eventfd);
-
-	spin_unlock(&memcg_oom_lock);
-	return 0;
-}
-
-static void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
-{
-	struct mem_cgroup *iter;
-
-	for_each_mem_cgroup_tree(iter, memcg)
-		mem_cgroup_oom_notify_cb(iter);
-}
-
-static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
-	struct eventfd_ctx *eventfd, const char *args, enum res_type type)
-{
-	struct mem_cgroup_thresholds *thresholds;
-	struct mem_cgroup_threshold_ary *new;
-	unsigned long threshold;
-	unsigned long usage;
-	int i, size, ret;
-
-	ret = page_counter_memparse(args, "-1", &threshold);
-	if (ret)
-		return ret;
-
-	mutex_lock(&memcg->thresholds_lock);
-
-	if (type == _MEM) {
-		thresholds = &memcg->thresholds;
-		usage = mem_cgroup_usage(memcg, false);
-	} else if (type == _MEMSWAP) {
-		thresholds = &memcg->memsw_thresholds;
-		usage = mem_cgroup_usage(memcg, true);
-	} else
-		BUG();
-
-	/* Check if a threshold crossed before adding a new one */
-	if (thresholds->primary)
-		__mem_cgroup_threshold(memcg, type == _MEMSWAP);
-
-	size = thresholds->primary ? thresholds->primary->size + 1 : 1;
-
-	/* Allocate memory for new array of thresholds */
-	new = kmalloc(struct_size(new, entries, size), GFP_KERNEL);
-	if (!new) {
-		ret = -ENOMEM;
-		goto unlock;
-	}
-	new->size = size;
-
-	/* Copy thresholds (if any) to new array */
-	if (thresholds->primary)
-		memcpy(new->entries, thresholds->primary->entries,
-		       flex_array_size(new, entries, size - 1));
-
-	/* Add new threshold */
-	new->entries[size - 1].eventfd = eventfd;
-	new->entries[size - 1].threshold = threshold;
-
-	/* Sort thresholds. Registering of new threshold isn't time-critical */
-	sort(new->entries, size, sizeof(*new->entries),
-			compare_thresholds, NULL);
-
-	/* Find current threshold */
-	new->current_threshold = -1;
-	for (i = 0; i < size; i++) {
-		if (new->entries[i].threshold <= usage) {
-			/*
-			 * new->current_threshold will not be used until
-			 * rcu_assign_pointer(), so it's safe to increment
-			 * it here.
-			 */
-			++new->current_threshold;
-		} else
-			break;
-	}
-
-	/* Free old spare buffer and save old primary buffer as spare */
-	kfree(thresholds->spare);
-	thresholds->spare = thresholds->primary;
-
-	rcu_assign_pointer(thresholds->primary, new);
-
-	/* To be sure that nobody uses thresholds */
-	synchronize_rcu();
-
-unlock:
-	mutex_unlock(&memcg->thresholds_lock);
-
-	return ret;
-}
-
-static int mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
-	struct eventfd_ctx *eventfd, const char *args)
-{
-	return __mem_cgroup_usage_register_event(memcg, eventfd, args, _MEM);
-}
-
-static int memsw_cgroup_usage_register_event(struct mem_cgroup *memcg,
-	struct eventfd_ctx *eventfd, const char *args)
-{
-	return __mem_cgroup_usage_register_event(memcg, eventfd, args, _MEMSWAP);
-}
-
-static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
-	struct eventfd_ctx *eventfd, enum res_type type)
-{
-	struct mem_cgroup_thresholds *thresholds;
-	struct mem_cgroup_threshold_ary *new;
-	unsigned long usage;
-	int i, j, size, entries;
-
-	mutex_lock(&memcg->thresholds_lock);
-
-	if (type == _MEM) {
-		thresholds = &memcg->thresholds;
-		usage = mem_cgroup_usage(memcg, false);
-	} else if (type == _MEMSWAP) {
-		thresholds = &memcg->memsw_thresholds;
-		usage = mem_cgroup_usage(memcg, true);
-	} else
-		BUG();
-
-	if (!thresholds->primary)
-		goto unlock;
-
-	/* Check if a threshold crossed before removing */
-	__mem_cgroup_threshold(memcg, type == _MEMSWAP);
-
-	/* Calculate new number of threshold */
-	size = entries = 0;
-	for (i = 0; i < thresholds->primary->size; i++) {
-		if (thresholds->primary->entries[i].eventfd != eventfd)
-			size++;
-		else
-			entries++;
-	}
-
-	new = thresholds->spare;
-
-	/* If no items related to eventfd have been cleared, nothing to do */
-	if (!entries)
-		goto unlock;
-
-	/* Set thresholds array to NULL if we don't have thresholds */
-	if (!size) {
-		kfree(new);
-		new = NULL;
-		goto swap_buffers;
-	}
-
-	new->size = size;
-
-	/* Copy thresholds and find current threshold */
-	new->current_threshold = -1;
-	for (i = 0, j = 0; i < thresholds->primary->size; i++) {
-		if (thresholds->primary->entries[i].eventfd == eventfd)
-			continue;
-
-		new->entries[j] = thresholds->primary->entries[i];
-		if (new->entries[j].threshold <= usage) {
-			/*
-			 * new->current_threshold will not be used
-			 * until rcu_assign_pointer(), so it's safe to increment
-			 * it here.
-			 */
-			++new->current_threshold;
-		}
-		j++;
-	}
-
-swap_buffers:
-	/* Swap primary and spare array */
-	thresholds->spare = thresholds->primary;
-
-	rcu_assign_pointer(thresholds->primary, new);
-
-	/* To be sure that nobody uses thresholds */
-	synchronize_rcu();
-
-	/* If all events are unregistered, free the spare array */
-	if (!new) {
-		kfree(thresholds->spare);
-		thresholds->spare = NULL;
-	}
-unlock:
-	mutex_unlock(&memcg->thresholds_lock);
-}
-
-static void mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
-	struct eventfd_ctx *eventfd)
-{
-	return __mem_cgroup_usage_unregister_event(memcg, eventfd, _MEM);
-}
-
-static void memsw_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
-	struct eventfd_ctx *eventfd)
-{
-	return __mem_cgroup_usage_unregister_event(memcg, eventfd, _MEMSWAP);
-}
-
-static int mem_cgroup_oom_register_event(struct mem_cgroup *memcg,
-	struct eventfd_ctx *eventfd, const char *args)
-{
-	struct mem_cgroup_eventfd_list *event;
-
-	event = kmalloc(sizeof(*event),	GFP_KERNEL);
-	if (!event)
-		return -ENOMEM;
-
-	spin_lock(&memcg_oom_lock);
-
-	event->eventfd = eventfd;
-	list_add(&event->list, &memcg->oom_notify);
-
-	/* already in OOM ? */
-	if (memcg->under_oom)
-		eventfd_signal(eventfd);
-	spin_unlock(&memcg_oom_lock);
-
-	return 0;
-}
-
-static void mem_cgroup_oom_unregister_event(struct mem_cgroup *memcg,
-	struct eventfd_ctx *eventfd)
-{
-	struct mem_cgroup_eventfd_list *ev, *tmp;
-
-	spin_lock(&memcg_oom_lock);
-
-	list_for_each_entry_safe(ev, tmp, &memcg->oom_notify, list) {
-		if (ev->eventfd == eventfd) {
-			list_del(&ev->list);
-			kfree(ev);
-		}
-	}
-
-	spin_unlock(&memcg_oom_lock);
-}
-
 static int mem_cgroup_oom_control_read(struct seq_file *sf, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(sf);
@@ -4611,243 +4182,6 @@ static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg)
 
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
-/*
- * DO NOT USE IN NEW FILES.
- *
- * "cgroup.event_control" implementation.
- *
- * This is way over-engineered.  It tries to support fully configurable
- * events for each user.  Such level of flexibility is completely
- * unnecessary especially in the light of the planned unified hierarchy.
- *
- * Please deprecate this and replace with something simpler if at all
- * possible.
- */
-
-/*
- * Unregister event and free resources.
- *
- * Gets called from workqueue.
- */
-static void memcg_event_remove(struct work_struct *work)
-{
-	struct mem_cgroup_event *event =
-		container_of(work, struct mem_cgroup_event, remove);
-	struct mem_cgroup *memcg = event->memcg;
-
-	remove_wait_queue(event->wqh, &event->wait);
-
-	event->unregister_event(memcg, event->eventfd);
-
-	/* Notify userspace the event is going away. */
-	eventfd_signal(event->eventfd);
-
-	eventfd_ctx_put(event->eventfd);
-	kfree(event);
-	css_put(&memcg->css);
-}
-
-/*
- * Gets called on EPOLLHUP on eventfd when user closes it.
- *
- * Called with wqh->lock held and interrupts disabled.
- */
-static int memcg_event_wake(wait_queue_entry_t *wait, unsigned mode,
-			    int sync, void *key)
-{
-	struct mem_cgroup_event *event =
-		container_of(wait, struct mem_cgroup_event, wait);
-	struct mem_cgroup *memcg = event->memcg;
-	__poll_t flags = key_to_poll(key);
-
-	if (flags & EPOLLHUP) {
-		/*
-		 * If the event has been detached at cgroup removal, we
-		 * can simply return knowing the other side will cleanup
-		 * for us.
-		 *
-		 * We can't race against event freeing since the other
-		 * side will require wqh->lock via remove_wait_queue(),
-		 * which we hold.
-		 */
-		spin_lock(&memcg->event_list_lock);
-		if (!list_empty(&event->list)) {
-			list_del_init(&event->list);
-			/*
-			 * We are in atomic context, but cgroup_event_remove()
-			 * may sleep, so we have to call it in workqueue.
-			 */
-			schedule_work(&event->remove);
-		}
-		spin_unlock(&memcg->event_list_lock);
-	}
-
-	return 0;
-}
-
-static void memcg_event_ptable_queue_proc(struct file *file,
-		wait_queue_head_t *wqh, poll_table *pt)
-{
-	struct mem_cgroup_event *event =
-		container_of(pt, struct mem_cgroup_event, pt);
-
-	event->wqh = wqh;
-	add_wait_queue(wqh, &event->wait);
-}
-
-/*
- * DO NOT USE IN NEW FILES.
- *
- * Parse input and register new cgroup event handler.
- *
- * Input must be in format '<event_fd> <control_fd> <args>'.
- * Interpretation of args is defined by control file implementation.
- */
-static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
-					 char *buf, size_t nbytes, loff_t off)
-{
-	struct cgroup_subsys_state *css = of_css(of);
-	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-	struct mem_cgroup_event *event;
-	struct cgroup_subsys_state *cfile_css;
-	unsigned int efd, cfd;
-	struct fd efile;
-	struct fd cfile;
-	struct dentry *cdentry;
-	const char *name;
-	char *endp;
-	int ret;
-
-	if (IS_ENABLED(CONFIG_PREEMPT_RT))
-		return -EOPNOTSUPP;
-
-	buf = strstrip(buf);
-
-	efd = simple_strtoul(buf, &endp, 10);
-	if (*endp != ' ')
-		return -EINVAL;
-	buf = endp + 1;
-
-	cfd = simple_strtoul(buf, &endp, 10);
-	if ((*endp != ' ') && (*endp != '\0'))
-		return -EINVAL;
-	buf = endp + 1;
-
-	event = kzalloc(sizeof(*event), GFP_KERNEL);
-	if (!event)
-		return -ENOMEM;
-
-	event->memcg = memcg;
-	INIT_LIST_HEAD(&event->list);
-	init_poll_funcptr(&event->pt, memcg_event_ptable_queue_proc);
-	init_waitqueue_func_entry(&event->wait, memcg_event_wake);
-	INIT_WORK(&event->remove, memcg_event_remove);
-
-	efile = fdget(efd);
-	if (!efile.file) {
-		ret = -EBADF;
-		goto out_kfree;
-	}
-
-	event->eventfd = eventfd_ctx_fileget(efile.file);
-	if (IS_ERR(event->eventfd)) {
-		ret = PTR_ERR(event->eventfd);
-		goto out_put_efile;
-	}
-
-	cfile = fdget(cfd);
-	if (!cfile.file) {
-		ret = -EBADF;
-		goto out_put_eventfd;
-	}
-
-	/* the process need read permission on control file */
-	/* AV: shouldn't we check that it's been opened for read instead? */
-	ret = file_permission(cfile.file, MAY_READ);
-	if (ret < 0)
-		goto out_put_cfile;
-
-	/*
-	 * The control file must be a regular cgroup1 file. As a regular cgroup
-	 * file can't be renamed, it's safe to access its name afterwards.
-	 */
-	cdentry = cfile.file->f_path.dentry;
-	if (cdentry->d_sb->s_type != &cgroup_fs_type || !d_is_reg(cdentry)) {
-		ret = -EINVAL;
-		goto out_put_cfile;
-	}
-
-	/*
-	 * Determine the event callbacks and set them in @event.  This used
-	 * to be done via struct cftype but cgroup core no longer knows
-	 * about these events.  The following is crude but the whole thing
-	 * is for compatibility anyway.
-	 *
-	 * DO NOT ADD NEW FILES.
-	 */
-	name = cdentry->d_name.name;
-
-	if (!strcmp(name, "memory.usage_in_bytes")) {
-		event->register_event = mem_cgroup_usage_register_event;
-		event->unregister_event = mem_cgroup_usage_unregister_event;
-	} else if (!strcmp(name, "memory.oom_control")) {
-		event->register_event = mem_cgroup_oom_register_event;
-		event->unregister_event = mem_cgroup_oom_unregister_event;
-	} else if (!strcmp(name, "memory.pressure_level")) {
-		event->register_event = vmpressure_register_event;
-		event->unregister_event = vmpressure_unregister_event;
-	} else if (!strcmp(name, "memory.memsw.usage_in_bytes")) {
-		event->register_event = memsw_cgroup_usage_register_event;
-		event->unregister_event = memsw_cgroup_usage_unregister_event;
-	} else {
-		ret = -EINVAL;
-		goto out_put_cfile;
-	}
-
-	/*
-	 * Verify @cfile should belong to @css.  Also, remaining events are
-	 * automatically removed on cgroup destruction but the removal is
-	 * asynchronous, so take an extra ref on @css.
-	 */
-	cfile_css = css_tryget_online_from_dir(cdentry->d_parent,
-					       &memory_cgrp_subsys);
-	ret = -EINVAL;
-	if (IS_ERR(cfile_css))
-		goto out_put_cfile;
-	if (cfile_css != css) {
-		css_put(cfile_css);
-		goto out_put_cfile;
-	}
-
-	ret = event->register_event(memcg, event->eventfd, buf);
-	if (ret)
-		goto out_put_css;
-
-	vfs_poll(efile.file, &event->pt);
-
-	spin_lock_irq(&memcg->event_list_lock);
-	list_add(&event->list, &memcg->event_list);
-	spin_unlock_irq(&memcg->event_list_lock);
-
-	fdput(cfile);
-	fdput(efile);
-
-	return nbytes;
-
-out_put_css:
-	css_put(css);
-out_put_cfile:
-	fdput(cfile);
-out_put_eventfd:
-	eventfd_ctx_put(event->eventfd);
-out_put_efile:
-	fdput(efile);
-out_kfree:
-	kfree(event);
-
-	return ret;
-}
-
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_SLUB_DEBUG)
 static int mem_cgroup_slab_show(struct seq_file *m, void *p)
 {
@@ -5314,19 +4648,8 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
 static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-	struct mem_cgroup_event *event, *tmp;
 
-	/*
-	 * Unregister events and notify userspace.
-	 * Notify userspace about cgroup removing only after rmdir of cgroup
-	 * directory to avoid race between userspace and kernelspace.
-	 */
-	spin_lock_irq(&memcg->event_list_lock);
-	list_for_each_entry_safe(event, tmp, &memcg->event_list, list) {
-		list_del_init(&event->list);
-		schedule_work(&event->remove);
-	}
-	spin_unlock_irq(&memcg->event_list_lock);
+	memcg1_css_offline(memcg);
 
 	page_counter_set_min(&memcg->memory, 0);
 	page_counter_set_low(&memcg->memory, 0);
-- 
2.45.2



^ permalink raw reply	[flat|nested] 31+ messages in thread
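
For readers following along, the interface being moved here is driven entirely
from userspace through the '<event_fd> <control_fd> <args>' string documented
above. Below is a minimal sketch of arming a usage threshold; it is not part
of the patchset, the mount point /sys/fs/cgroup/memory and the group name
"mygroup" are assumptions, and error handling is omitted for brevity:

	#include <sys/eventfd.h>
	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		const char *base = "/sys/fs/cgroup/memory/mygroup";
		char path[256], cmd[64];
		uint64_t ticks;
		int n;

		int efd = eventfd(0, 0);		/* <event_fd> */

		snprintf(path, sizeof(path), "%s/memory.usage_in_bytes", base);
		int cfd = open(path, O_RDONLY);		/* <control_fd> */

		snprintf(path, sizeof(path), "%s/cgroup.event_control", base);
		int ecfd = open(path, O_WRONLY);

		/* '<event_fd> <control_fd> <args>': arm a 100M threshold */
		n = snprintf(cmd, sizeof(cmd), "%d %d 104857600", efd, cfd);
		write(ecfd, cmd, n);

		/* blocks until usage crosses the threshold */
		read(efd, &ticks, sizeof(ticks));
		printf("threshold crossed %llu time(s)\n",
		       (unsigned long long)ticks);
		return 0;
	}

The written string is exactly what memcg_write_event_control() parses, and the
name of the control file selects mem_cgroup_usage_register_event() as the
registration callback.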

* [PATCH v2 07/14] mm: memcg: rename memcg_check_events()
  2024-06-25  0:58 [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option Roman Gushchin
                   ` (5 preceding siblings ...)
  2024-06-25  0:58 ` [PATCH v2 06/14] mm: memcg: move legacy memcg event code into memcontrol-v1.c Roman Gushchin
@ 2024-06-25  0:58 ` Roman Gushchin
  2024-06-25  7:08   ` Michal Hocko
  2024-06-25  0:59 ` [PATCH v2 08/14] mm: memcg: move cgroup v1 oom handling code into memcontrol-v1.c Roman Gushchin
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2024-06-25  0:58 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm, Roman Gushchin

Rename memcg_check_events() to memcg1_check_events() for
consistency with other cgroup v1-specific functions.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/memcontrol-v1.c | 6 +++---
 mm/memcontrol-v1.h | 2 +-
 mm/memcontrol.c    | 8 ++++----
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 4b2290ceace6..d7b5c4c14732 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -835,9 +835,9 @@ static int mem_cgroup_move_account(struct folio *folio,
 
 	local_irq_disable();
 	mem_cgroup_charge_statistics(to, nr_pages);
-	memcg_check_events(to, nid);
+	memcg1_check_events(to, nid);
 	mem_cgroup_charge_statistics(from, -nr_pages);
-	memcg_check_events(from, nid);
+	memcg1_check_events(from, nid);
 	local_irq_enable();
 out:
 	return ret;
@@ -1424,7 +1424,7 @@ static void mem_cgroup_threshold(struct mem_cgroup *memcg)
  * Check events in order.
  *
  */
-void memcg_check_events(struct mem_cgroup *memcg, int nid)
+void memcg1_check_events(struct mem_cgroup *memcg, int nid)
 {
 	if (IS_ENABLED(CONFIG_PREEMPT_RT))
 		return;
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index 524a2c76ffc9..ef1b7037cbdc 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -12,7 +12,7 @@ static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg)
 }
 
 void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages);
-void memcg_check_events(struct mem_cgroup *memcg, int nid);
+void memcg1_check_events(struct mem_cgroup *memcg, int nid);
 void memcg_oom_recover(struct mem_cgroup *memcg);
 int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		     unsigned int nr_pages);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bd4b26a73596..92fb72bbd494 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2632,7 +2632,7 @@ void mem_cgroup_commit_charge(struct folio *folio, struct mem_cgroup *memcg)
 
 	local_irq_disable();
 	mem_cgroup_charge_statistics(memcg, folio_nr_pages(folio));
-	memcg_check_events(memcg, folio_nid(folio));
+	memcg1_check_events(memcg, folio_nid(folio));
 	local_irq_enable();
 }
 
@@ -5697,7 +5697,7 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 	local_irq_save(flags);
 	__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
 	__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
-	memcg_check_events(ug->memcg, ug->nid);
+	memcg1_check_events(ug->memcg, ug->nid);
 	local_irq_restore(flags);
 
 	/* drop reference from uncharge_folio */
@@ -5836,7 +5836,7 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
 
 	local_irq_save(flags);
 	mem_cgroup_charge_statistics(memcg, nr_pages);
-	memcg_check_events(memcg, folio_nid(new));
+	memcg1_check_events(memcg, folio_nid(new));
 	local_irq_restore(flags);
 }
 
@@ -6104,7 +6104,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
 	memcg_stats_lock();
 	mem_cgroup_charge_statistics(memcg, -nr_entries);
 	memcg_stats_unlock();
-	memcg_check_events(memcg, folio_nid(folio));
+	memcg1_check_events(memcg, folio_nid(folio));
 
 	css_put(&memcg->css);
 }
-- 
2.45.2



^ permalink raw reply	[flat|nested] 31+ messages in thread
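
The renamed function is gated by mem_cgroup_event_ratelimit(): a threshold
check fires roughly every THRESHOLDS_EVENTS_TARGET (128) page events and a
soft-limit tree update roughly every SOFTLIMIT_EVENTS_TARGET (1024), which is
why the threshold event triggers at a finer grain than the soft limit. Below
is a simplified, single-threaded analog of that target-advancing pattern; the
kernel version operates on per-cpu counters, so this is illustrative only:

	#include <stdbool.h>

	#define THRESHOLDS_EVENTS_TARGET	128

	struct event_state {
		unsigned long nr_page_events;	/* bumped on every page in/out */
		unsigned long next_target;	/* count that fires the next check */
	};

	static bool event_ratelimit(struct event_state *s)
	{
		if (s->nr_page_events < s->next_target)
			return false;

		/*
		 * Advance the target past the current count so that a
		 * burst of events still produces a single check.
		 */
		do {
			s->next_target += THRESHOLDS_EVENTS_TARGET;
		} while (s->nr_page_events >= s->next_target);

		return true;
	}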

* [PATCH v2 08/14] mm: memcg: move cgroup v1 oom handling code into memcontrol-v1.c
  2024-06-25  0:58 [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option Roman Gushchin
                   ` (6 preceding siblings ...)
  2024-06-25  0:58 ` [PATCH v2 07/14] mm: memcg: rename memcg_check_events() Roman Gushchin
@ 2024-06-25  0:59 ` Roman Gushchin
  2024-06-25  7:08   ` Michal Hocko
  2024-06-25  0:59 ` [PATCH v2 09/14] mm: memcg: rename memcg_oom_recover() Roman Gushchin
                   ` (6 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2024-06-25  0:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm, Roman Gushchin

Cgroup v1 supports a complicated mechanism for handling OOM in userspace,
which is not supported by cgroup v2. Let's move the corresponding code
into memcontrol-v1.c.

Aside from the mechanical code movement, this patch introduces two new
functions: memcg1_oom_prepare() and memcg1_oom_finish().
They implement the cgroup v1-specific parts of the common memcg
OOM handling path.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/memcontrol-v1.c | 229 ++++++++++++++++++++++++++++++++++++++++++++-
 mm/memcontrol-v1.h |   3 +-
 mm/memcontrol.c    | 216 +-----------------------------------------
 3 files changed, 231 insertions(+), 217 deletions(-)

diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index d7b5c4c14732..253d49d5fb12 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -110,7 +110,13 @@ struct mem_cgroup_event {
 	struct work_struct remove;
 };
 
-extern spinlock_t memcg_oom_lock;
+#ifdef CONFIG_LOCKDEP
+static struct lockdep_map memcg_oom_lock_dep_map = {
+	.name = "memcg_oom_lock",
+};
+#endif
+
+DEFINE_SPINLOCK(memcg_oom_lock);
 
 static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
 					 struct mem_cgroup_tree_per_node *mctz,
@@ -1469,7 +1475,7 @@ static int mem_cgroup_oom_notify_cb(struct mem_cgroup *memcg)
 	return 0;
 }
 
-void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
+static void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
 {
 	struct mem_cgroup *iter;
 
@@ -1959,6 +1965,225 @@ void memcg1_css_offline(struct mem_cgroup *memcg)
 	spin_unlock_irq(&memcg->event_list_lock);
 }
 
+/*
+ * Check whether an OOM kill is already in progress under our hierarchy;
+ * if so, return false.
+ */
+static bool mem_cgroup_oom_trylock(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *iter, *failed = NULL;
+
+	spin_lock(&memcg_oom_lock);
+
+	for_each_mem_cgroup_tree(iter, memcg) {
+		if (iter->oom_lock) {
+			/*
+			 * this subtree of our hierarchy is already locked,
+			 * so we cannot acquire the lock.
+			 */
+			failed = iter;
+			mem_cgroup_iter_break(memcg, iter);
+			break;
+		} else
+			iter->oom_lock = true;
+	}
+
+	if (failed) {
+		/*
+		 * OK, we failed to lock the whole subtree, so we have to
+		 * clean up what we already set up, up to the failing subtree.
+		 */
+		for_each_mem_cgroup_tree(iter, memcg) {
+			if (iter == failed) {
+				mem_cgroup_iter_break(memcg, iter);
+				break;
+			}
+			iter->oom_lock = false;
+		}
+	} else
+		mutex_acquire(&memcg_oom_lock_dep_map, 0, 1, _RET_IP_);
+
+	spin_unlock(&memcg_oom_lock);
+
+	return !failed;
+}
+
+static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *iter;
+
+	spin_lock(&memcg_oom_lock);
+	mutex_release(&memcg_oom_lock_dep_map, _RET_IP_);
+	for_each_mem_cgroup_tree(iter, memcg)
+		iter->oom_lock = false;
+	spin_unlock(&memcg_oom_lock);
+}
+
+static void mem_cgroup_mark_under_oom(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *iter;
+
+	spin_lock(&memcg_oom_lock);
+	for_each_mem_cgroup_tree(iter, memcg)
+		iter->under_oom++;
+	spin_unlock(&memcg_oom_lock);
+}
+
+static void mem_cgroup_unmark_under_oom(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *iter;
+
+	/*
+	 * Be careful about under_oom underflows because a child memcg
+	 * could have been added after mem_cgroup_mark_under_oom.
+	 */
+	spin_lock(&memcg_oom_lock);
+	for_each_mem_cgroup_tree(iter, memcg)
+		if (iter->under_oom > 0)
+			iter->under_oom--;
+	spin_unlock(&memcg_oom_lock);
+}
+
+static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);
+
+struct oom_wait_info {
+	struct mem_cgroup *memcg;
+	wait_queue_entry_t	wait;
+};
+
+static int memcg_oom_wake_function(wait_queue_entry_t *wait,
+	unsigned mode, int sync, void *arg)
+{
+	struct mem_cgroup *wake_memcg = (struct mem_cgroup *)arg;
+	struct mem_cgroup *oom_wait_memcg;
+	struct oom_wait_info *oom_wait_info;
+
+	oom_wait_info = container_of(wait, struct oom_wait_info, wait);
+	oom_wait_memcg = oom_wait_info->memcg;
+
+	if (!mem_cgroup_is_descendant(wake_memcg, oom_wait_memcg) &&
+	    !mem_cgroup_is_descendant(oom_wait_memcg, wake_memcg))
+		return 0;
+	return autoremove_wake_function(wait, mode, sync, arg);
+}
+
+void memcg_oom_recover(struct mem_cgroup *memcg)
+{
+	/*
+	 * For the following lockless ->under_oom test, the only required
+	 * guarantee is that it must see the state asserted by an OOM when
+	 * this function is called as a result of userland actions
+	 * triggered by the notification of the OOM.  This is trivially
+	 * achieved by invoking mem_cgroup_mark_under_oom() before
+	 * triggering notification.
+	 */
+	if (memcg && memcg->under_oom)
+		__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
+}
+
+/**
+ * mem_cgroup_oom_synchronize - complete memcg OOM handling
+ * @handle: actually kill/wait or just clean up the OOM state
+ *
+ * This has to be called at the end of a page fault if the memcg OOM
+ * handler was enabled.
+ *
+ * Memcg supports userspace OOM handling where failed allocations must
+ * sleep on a waitqueue until the userspace task resolves the
+ * situation.  Sleeping directly in the charge context with all kinds
+ * of locks held is not a good idea, instead we remember an OOM state
+ * in the task and mem_cgroup_oom_synchronize() has to be called at
+ * the end of the page fault to complete the OOM handling.
+ *
+ * Returns %true if an ongoing memcg OOM situation was detected and
+ * completed, %false otherwise.
+ */
+bool mem_cgroup_oom_synchronize(bool handle)
+{
+	struct mem_cgroup *memcg = current->memcg_in_oom;
+	struct oom_wait_info owait;
+	bool locked;
+
+	/* OOM is global, do not handle */
+	if (!memcg)
+		return false;
+
+	if (!handle)
+		goto cleanup;
+
+	owait.memcg = memcg;
+	owait.wait.flags = 0;
+	owait.wait.func = memcg_oom_wake_function;
+	owait.wait.private = current;
+	INIT_LIST_HEAD(&owait.wait.entry);
+
+	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+	mem_cgroup_mark_under_oom(memcg);
+
+	locked = mem_cgroup_oom_trylock(memcg);
+
+	if (locked)
+		mem_cgroup_oom_notify(memcg);
+
+	schedule();
+	mem_cgroup_unmark_under_oom(memcg);
+	finish_wait(&memcg_oom_waitq, &owait.wait);
+
+	if (locked)
+		mem_cgroup_oom_unlock(memcg);
+cleanup:
+	current->memcg_in_oom = NULL;
+	css_put(&memcg->css);
+	return true;
+}
+
+
+bool memcg1_oom_prepare(struct mem_cgroup *memcg, bool *locked)
+{
+	/*
+	 * We are in the middle of the charge context here, so we
+	 * don't want to block when potentially sitting on a callstack
+	 * that holds all kinds of filesystem and mm locks.
+	 *
+	 * cgroup1 allows disabling the OOM killer and waiting for outside
+	 * handling until the charge can succeed; remember the context and put
+	 * the task to sleep at the end of the page fault when all locks are
+	 * released.
+	 *
+	 * On the other hand, in-kernel OOM killer allows for an async victim
+	 * memory reclaim (oom_reaper) and that means that we are not solely
+	 * relying on the oom victim to make a forward progress and we can
+	 * invoke the oom killer here.
+	 *
+	 * Please note that mem_cgroup_out_of_memory might fail to find a
+	 * victim and then we have to bail out from the charge path.
+	 */
+	if (READ_ONCE(memcg->oom_kill_disable)) {
+		if (current->in_user_fault) {
+			css_get(&memcg->css);
+			current->memcg_in_oom = memcg;
+		}
+		return false;
+	}
+
+	mem_cgroup_mark_under_oom(memcg);
+
+	*locked = mem_cgroup_oom_trylock(memcg);
+
+	if (*locked)
+		mem_cgroup_oom_notify(memcg);
+
+	mem_cgroup_unmark_under_oom(memcg);
+
+	return true;
+}
+
+void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked)
+{
+	if (locked)
+		mem_cgroup_oom_unlock(memcg);
+}
+
 static int __init memcg1_init(void)
 {
 	int node;
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index ef1b7037cbdc..3de956b2422f 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -87,9 +87,10 @@ enum res_type {
 bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 				enum mem_cgroup_events_target target);
 unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap);
-void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
 ssize_t memcg_write_event_control(struct kernfs_open_file *of,
 				  char *buf, size_t nbytes, loff_t off);
 
+bool memcg1_oom_prepare(struct mem_cgroup *memcg, bool *locked);
+void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked);
 
 #endif	/* __MM_MEMCONTROL_V1_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 92fb72bbd494..8abd364ac837 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1616,130 +1616,6 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	return ret;
 }
 
-#ifdef CONFIG_LOCKDEP
-static struct lockdep_map memcg_oom_lock_dep_map = {
-	.name = "memcg_oom_lock",
-};
-#endif
-
-DEFINE_SPINLOCK(memcg_oom_lock);
-
-/*
- * Check OOM-Killer is already running under our hierarchy.
- * If someone is running, return false.
- */
-static bool mem_cgroup_oom_trylock(struct mem_cgroup *memcg)
-{
-	struct mem_cgroup *iter, *failed = NULL;
-
-	spin_lock(&memcg_oom_lock);
-
-	for_each_mem_cgroup_tree(iter, memcg) {
-		if (iter->oom_lock) {
-			/*
-			 * this subtree of our hierarchy is already locked
-			 * so we cannot give a lock.
-			 */
-			failed = iter;
-			mem_cgroup_iter_break(memcg, iter);
-			break;
-		} else
-			iter->oom_lock = true;
-	}
-
-	if (failed) {
-		/*
-		 * OK, we failed to lock the whole subtree so we have
-		 * to clean up what we set up to the failing subtree
-		 */
-		for_each_mem_cgroup_tree(iter, memcg) {
-			if (iter == failed) {
-				mem_cgroup_iter_break(memcg, iter);
-				break;
-			}
-			iter->oom_lock = false;
-		}
-	} else
-		mutex_acquire(&memcg_oom_lock_dep_map, 0, 1, _RET_IP_);
-
-	spin_unlock(&memcg_oom_lock);
-
-	return !failed;
-}
-
-static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
-{
-	struct mem_cgroup *iter;
-
-	spin_lock(&memcg_oom_lock);
-	mutex_release(&memcg_oom_lock_dep_map, _RET_IP_);
-	for_each_mem_cgroup_tree(iter, memcg)
-		iter->oom_lock = false;
-	spin_unlock(&memcg_oom_lock);
-}
-
-static void mem_cgroup_mark_under_oom(struct mem_cgroup *memcg)
-{
-	struct mem_cgroup *iter;
-
-	spin_lock(&memcg_oom_lock);
-	for_each_mem_cgroup_tree(iter, memcg)
-		iter->under_oom++;
-	spin_unlock(&memcg_oom_lock);
-}
-
-static void mem_cgroup_unmark_under_oom(struct mem_cgroup *memcg)
-{
-	struct mem_cgroup *iter;
-
-	/*
-	 * Be careful about under_oom underflows because a child memcg
-	 * could have been added after mem_cgroup_mark_under_oom.
-	 */
-	spin_lock(&memcg_oom_lock);
-	for_each_mem_cgroup_tree(iter, memcg)
-		if (iter->under_oom > 0)
-			iter->under_oom--;
-	spin_unlock(&memcg_oom_lock);
-}
-
-static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);
-
-struct oom_wait_info {
-	struct mem_cgroup *memcg;
-	wait_queue_entry_t	wait;
-};
-
-static int memcg_oom_wake_function(wait_queue_entry_t *wait,
-	unsigned mode, int sync, void *arg)
-{
-	struct mem_cgroup *wake_memcg = (struct mem_cgroup *)arg;
-	struct mem_cgroup *oom_wait_memcg;
-	struct oom_wait_info *oom_wait_info;
-
-	oom_wait_info = container_of(wait, struct oom_wait_info, wait);
-	oom_wait_memcg = oom_wait_info->memcg;
-
-	if (!mem_cgroup_is_descendant(wake_memcg, oom_wait_memcg) &&
-	    !mem_cgroup_is_descendant(oom_wait_memcg, wake_memcg))
-		return 0;
-	return autoremove_wake_function(wait, mode, sync, arg);
-}
-
-void memcg_oom_recover(struct mem_cgroup *memcg)
-{
-	/*
-	 * For the following lockless ->under_oom test, the only required
-	 * guarantee is that it must see the state asserted by an OOM when
-	 * this function is called as a result of userland actions
-	 * triggered by the notification of the OOM.  This is trivially
-	 * achieved by invoking mem_cgroup_mark_under_oom() before
-	 * triggering notification.
-	 */
-	if (memcg && memcg->under_oom)
-		__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
-}
-
 /*
  * Returns true if successfully killed one or more processes. Though in some
  * corner cases it can return true even without killing any process.
@@ -1753,104 +1629,16 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 
 	memcg_memory_event(memcg, MEMCG_OOM);
 
-	/*
-	 * We are in the middle of the charge context here, so we
-	 * don't want to block when potentially sitting on a callstack
-	 * that holds all kinds of filesystem and mm locks.
-	 *
-	 * cgroup1 allows disabling the OOM killer and waiting for outside
-	 * handling until the charge can succeed; remember the context and put
-	 * the task to sleep at the end of the page fault when all locks are
-	 * released.
-	 *
-	 * On the other hand, in-kernel OOM killer allows for an async victim
-	 * memory reclaim (oom_reaper) and that means that we are not solely
-	 * relying on the oom victim to make a forward progress and we can
-	 * invoke the oom killer here.
-	 *
-	 * Please note that mem_cgroup_out_of_memory might fail to find a
-	 * victim and then we have to bail out from the charge path.
-	 */
-	if (READ_ONCE(memcg->oom_kill_disable)) {
-		if (current->in_user_fault) {
-			css_get(&memcg->css);
-			current->memcg_in_oom = memcg;
-		}
+	if (!memcg1_oom_prepare(memcg, &locked))
 		return false;
-	}
-
-	mem_cgroup_mark_under_oom(memcg);
 
-	locked = mem_cgroup_oom_trylock(memcg);
-
-	if (locked)
-		mem_cgroup_oom_notify(memcg);
-
-	mem_cgroup_unmark_under_oom(memcg);
 	ret = mem_cgroup_out_of_memory(memcg, mask, order);
 
-	if (locked)
-		mem_cgroup_oom_unlock(memcg);
+	memcg1_oom_finish(memcg, locked);
 
 	return ret;
 }
 
-/**
- * mem_cgroup_oom_synchronize - complete memcg OOM handling
- * @handle: actually kill/wait or just clean up the OOM state
- *
- * This has to be called at the end of a page fault if the memcg OOM
- * handler was enabled.
- *
- * Memcg supports userspace OOM handling where failed allocations must
- * sleep on a waitqueue until the userspace task resolves the
- * situation.  Sleeping directly in the charge context with all kinds
- * of locks held is not a good idea, instead we remember an OOM state
- * in the task and mem_cgroup_oom_synchronize() has to be called at
- * the end of the page fault to complete the OOM handling.
- *
- * Returns %true if an ongoing memcg OOM situation was detected and
- * completed, %false otherwise.
- */
-bool mem_cgroup_oom_synchronize(bool handle)
-{
-	struct mem_cgroup *memcg = current->memcg_in_oom;
-	struct oom_wait_info owait;
-	bool locked;
-
-	/* OOM is global, do not handle */
-	if (!memcg)
-		return false;
-
-	if (!handle)
-		goto cleanup;
-
-	owait.memcg = memcg;
-	owait.wait.flags = 0;
-	owait.wait.func = memcg_oom_wake_function;
-	owait.wait.private = current;
-	INIT_LIST_HEAD(&owait.wait.entry);
-
-	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
-	mem_cgroup_mark_under_oom(memcg);
-
-	locked = mem_cgroup_oom_trylock(memcg);
-
-	if (locked)
-		mem_cgroup_oom_notify(memcg);
-
-	schedule();
-	mem_cgroup_unmark_under_oom(memcg);
-	finish_wait(&memcg_oom_waitq, &owait.wait);
-
-	if (locked)
-		mem_cgroup_oom_unlock(memcg);
-cleanup:
-	current->memcg_in_oom = NULL;
-	css_put(&memcg->css);
-	return true;
-}
-
 /**
  * mem_cgroup_get_oom_group - get a memory cgroup to clean up after OOM
  * @victim: task to be killed by the OOM killer
-- 
2.45.2



^ permalink raw reply	[flat|nested] 31+ messages in thread
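
The userspace OOM handling isolated by this patch works as follows: a
supervisor disables the kernel OOM killer for the group and registers an
eventfd on memory.oom_control; memcg1_oom_prepare() then records the OOM
context so that the faulting task sleeps in mem_cgroup_oom_synchronize() at
the end of the page fault until the supervisor resolves the situation. Below
is a minimal sketch of the supervisor side; it is not part of the patchset,
the paths and group name are assumptions, and error handling is omitted:

	#include <sys/eventfd.h>
	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		const char *base = "/sys/fs/cgroup/memory/mygroup";
		char path[256], cmd[32];
		uint64_t cnt;
		int n;

		/* defer OOM handling to userspace */
		snprintf(path, sizeof(path), "%s/memory.oom_control", base);
		int ofd = open(path, O_RDWR);
		write(ofd, "1", 1);		/* oom_kill_disable = 1 */

		int efd = eventfd(0, 0);
		snprintf(path, sizeof(path), "%s/cgroup.event_control", base);
		int ecfd = open(path, O_WRONLY);
		n = snprintf(cmd, sizeof(cmd), "%d %d", efd, ofd);
		write(ecfd, cmd, n);		/* oom_control takes no <args> */

		for (;;) {
			read(efd, &cnt, sizeof(cnt));	/* blocks until OOM */
			/*
			 * Resolve the OOM here, e.g. by raising
			 * memory.limit_in_bytes or killing a task; sleeping
			 * tasks are then woken via memcg_oom_recover().
			 */
			fprintf(stderr, "group hit OOM %llu time(s)\n",
				(unsigned long long)cnt);
		}
	}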

* [PATCH v2 09/14] mm: memcg: rename memcg_oom_recover()
  2024-06-25  0:58 [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option Roman Gushchin
                   ` (7 preceding siblings ...)
  2024-06-25  0:59 ` [PATCH v2 08/14] mm: memcg: move cgroup v1 oom handling code into memcontrol-v1.c Roman Gushchin
@ 2024-06-25  0:59 ` Roman Gushchin
  2024-06-25  7:08   ` Michal Hocko
  2024-06-25  0:59 ` [PATCH v2 10/14] mm: memcg: move cgroup v1 interface files to memcontrol-v1.c Roman Gushchin
                   ` (5 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2024-06-25  0:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm, Roman Gushchin

Rename memcg_oom_recover() to memcg1_oom_recover() for consistency
with other memory cgroup v1-related functions.

Move its declaration in mm/memcontrol-v1.h next to the other
memcg v1 OOM handling functions.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/memcontrol-v1.c | 6 +++---
 mm/memcontrol-v1.h | 2 +-
 mm/memcontrol.c    | 6 +++---
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 253d49d5fb12..1d5608ee1606 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -1090,8 +1090,8 @@ static void __mem_cgroup_clear_mc(void)
 
 		mc.moved_swap = 0;
 	}
-	memcg_oom_recover(from);
-	memcg_oom_recover(to);
+	memcg1_oom_recover(from);
+	memcg1_oom_recover(to);
 	wake_up_all(&mc.waitq);
 }
 
@@ -2067,7 +2067,7 @@ static int memcg_oom_wake_function(wait_queue_entry_t *wait,
 	return autoremove_wake_function(wait, mode, sync, arg);
 }
 
-void memcg_oom_recover(struct mem_cgroup *memcg)
+void memcg1_oom_recover(struct mem_cgroup *memcg)
 {
 	/*
 	 * For the following lockless ->under_oom test, the only required
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index 3de956b2422f..972c493a8ae3 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -13,7 +13,6 @@ static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg)
 
 void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages);
 void memcg1_check_events(struct mem_cgroup *memcg, int nid);
-void memcg_oom_recover(struct mem_cgroup *memcg);
 int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		     unsigned int nr_pages);
 
@@ -92,5 +91,6 @@ ssize_t memcg_write_event_control(struct kernfs_open_file *of,
 
 bool memcg1_oom_prepare(struct mem_cgroup *memcg, bool *locked);
 void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked);
+void memcg1_oom_recover(struct mem_cgroup *memcg);
 
 #endif	/* __MM_MEMCONTROL_V1_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8abd364ac837..37e0af5b26f3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3167,7 +3167,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
 	} while (true);
 
 	if (!ret && enlarge)
-		memcg_oom_recover(memcg);
+		memcg1_oom_recover(memcg);
 
 	return ret;
 }
@@ -3752,7 +3752,7 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
 
 	WRITE_ONCE(memcg->oom_kill_disable, val);
 	if (!val)
-		memcg_oom_recover(memcg);
+		memcg1_oom_recover(memcg);
 
 	return 0;
 }
@@ -5479,7 +5479,7 @@ static void uncharge_batch(const struct uncharge_gather *ug)
 			page_counter_uncharge(&ug->memcg->memsw, ug->nr_memory);
 		if (ug->nr_kmem)
 			memcg_account_kmem(ug->memcg, -ug->nr_kmem);
-		memcg_oom_recover(ug->memcg);
+		memcg1_oom_recover(ug->memcg);
 	}
 
 	local_irq_save(flags);
-- 
2.45.2



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v2 10/14] mm: memcg: move cgroup v1 interface files to memcontrol-v1.c
  2024-06-25  0:58 [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option Roman Gushchin
                   ` (8 preceding siblings ...)
  2024-06-25  0:59 ` [PATCH v2 09/14] mm: memcg: rename memcg_oom_recover() Roman Gushchin
@ 2024-06-25  0:59 ` Roman Gushchin
  2024-06-25  7:09   ` Michal Hocko
  2024-06-25  0:59 ` [PATCH v2 11/14] mm: memcg: make memcg1_update_tree() static Roman Gushchin
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2024-06-25  0:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm, Roman Gushchin

Move legacy cgroup v1 memory controller interfaces and corresponding
code into memcontrol-v1.c.

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/memcontrol-v1.c | 739 ++++++++++++++++++++++++++++++++++++++++++++-
 mm/memcontrol-v1.h |  29 +-
 mm/memcontrol.c    | 721 +------------------------------------------
 3 files changed, 767 insertions(+), 722 deletions(-)

diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 1d5608ee1606..1b7337d0170d 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -10,6 +10,7 @@
 #include <linux/poll.h>
 #include <linux/sort.h>
 #include <linux/file.h>
+#include <linux/seq_buf.h>
 
 #include "internal.h"
 #include "swap.h"
@@ -110,6 +111,18 @@ struct mem_cgroup_event {
 	struct work_struct remove;
 };
 
+#define MEMFILE_PRIVATE(x, val)	((x) << 16 | (val))
+#define MEMFILE_TYPE(val)	((val) >> 16 & 0xffff)
+#define MEMFILE_ATTR(val)	((val) & 0xffff)
+
+enum {
+	RES_USAGE,
+	RES_LIMIT,
+	RES_MAX_USAGE,
+	RES_FAILCNT,
+	RES_SOFT_LIMIT,
+};
+
 #ifdef CONFIG_LOCKDEP
 static struct lockdep_map memcg_oom_lock_dep_map = {
 	.name = "memcg_oom_lock",
@@ -577,14 +590,14 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
 }
 #endif
 
-u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css,
+static u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css,
 				struct cftype *cft)
 {
 	return mem_cgroup_from_css(css)->move_charge_at_immigrate;
 }
 
 #ifdef CONFIG_MMU
-int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
+static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
 				 struct cftype *cft, u64 val)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
@@ -606,7 +619,7 @@ int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
 	return 0;
 }
 #else
-int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
+static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
 				 struct cftype *cft, u64 val)
 {
 	return -ENOSYS;
@@ -1803,8 +1816,8 @@ static void memcg_event_ptable_queue_proc(struct file *file,
  * Input must be in format '<event_fd> <control_fd> <args>'.
  * Interpretation of args is defined by control file implementation.
  */
-ssize_t memcg_write_event_control(struct kernfs_open_file *of,
-				  char *buf, size_t nbytes, loff_t off)
+static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
+					 char *buf, size_t nbytes, loff_t off)
 {
 	struct cgroup_subsys_state *css = of_css(of);
 	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
@@ -2184,6 +2197,684 @@ void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked)
 		mem_cgroup_oom_unlock(memcg);
 }
 
+static DEFINE_MUTEX(memcg_max_mutex);
+
+static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
+				 unsigned long max, bool memsw)
+{
+	bool enlarge = false;
+	bool drained = false;
+	int ret;
+	bool limits_invariant;
+	struct page_counter *counter = memsw ? &memcg->memsw : &memcg->memory;
+
+	do {
+		if (signal_pending(current)) {
+			ret = -EINTR;
+			break;
+		}
+
+		mutex_lock(&memcg_max_mutex);
+		/*
+		 * Make sure that the new limit (memsw or memory limit) doesn't
+		 * break our basic invariant rule memory.max <= memsw.max.
+		 */
+		limits_invariant = memsw ? max >= READ_ONCE(memcg->memory.max) :
+					   max <= memcg->memsw.max;
+		if (!limits_invariant) {
+			mutex_unlock(&memcg_max_mutex);
+			ret = -EINVAL;
+			break;
+		}
+		if (max > counter->max)
+			enlarge = true;
+		ret = page_counter_set_max(counter, max);
+		mutex_unlock(&memcg_max_mutex);
+
+		if (!ret)
+			break;
+
+		if (!drained) {
+			drain_all_stock(memcg);
+			drained = true;
+			continue;
+		}
+
+		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
+						  memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
+			ret = -EBUSY;
+			break;
+		}
+	} while (true);
+
+	if (!ret && enlarge)
+		memcg1_oom_recover(memcg);
+
+	return ret;
+}
+
+/*
+ * Reclaims as many pages from the given memcg as possible.
+ *
+ * Caller is responsible for holding css reference for memcg.
+ */
+static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
+{
+	int nr_retries = MAX_RECLAIM_RETRIES;
+
+	/* we call try-to-free pages for make this cgroup empty */
+	lru_add_drain_all();
+
+	drain_all_stock(memcg);
+
+	/* try to free all pages in this cgroup */
+	while (nr_retries && page_counter_read(&memcg->memory)) {
+		if (signal_pending(current))
+			return -EINTR;
+
+		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
+						  MEMCG_RECLAIM_MAY_SWAP, NULL))
+			nr_retries--;
+	}
+
+	return 0;
+}
+
+static ssize_t mem_cgroup_force_empty_write(struct kernfs_open_file *of,
+					    char *buf, size_t nbytes,
+					    loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+
+	if (mem_cgroup_is_root(memcg))
+		return -EINVAL;
+	return mem_cgroup_force_empty(memcg) ?: nbytes;
+}
+
+static u64 mem_cgroup_hierarchy_read(struct cgroup_subsys_state *css,
+				     struct cftype *cft)
+{
+	return 1;
+}
+
+static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css,
+				      struct cftype *cft, u64 val)
+{
+	if (val == 1)
+		return 0;
+
+	pr_warn_once("Non-hierarchical mode is deprecated. "
+		     "Please report your usecase to linux-mm@kvack.org if you "
+		     "depend on this functionality.\n");
+
+	return -EINVAL;
+}
+
+static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
+			       struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	struct page_counter *counter;
+
+	switch (MEMFILE_TYPE(cft->private)) {
+	case _MEM:
+		counter = &memcg->memory;
+		break;
+	case _MEMSWAP:
+		counter = &memcg->memsw;
+		break;
+	case _KMEM:
+		counter = &memcg->kmem;
+		break;
+	case _TCP:
+		counter = &memcg->tcpmem;
+		break;
+	default:
+		BUG();
+	}
+
+	switch (MEMFILE_ATTR(cft->private)) {
+	case RES_USAGE:
+		if (counter == &memcg->memory)
+			return (u64)mem_cgroup_usage(memcg, false) * PAGE_SIZE;
+		if (counter == &memcg->memsw)
+			return (u64)mem_cgroup_usage(memcg, true) * PAGE_SIZE;
+		return (u64)page_counter_read(counter) * PAGE_SIZE;
+	case RES_LIMIT:
+		return (u64)counter->max * PAGE_SIZE;
+	case RES_MAX_USAGE:
+		return (u64)counter->watermark * PAGE_SIZE;
+	case RES_FAILCNT:
+		return counter->failcnt;
+	case RES_SOFT_LIMIT:
+		return (u64)READ_ONCE(memcg->soft_limit) * PAGE_SIZE;
+	default:
+		BUG();
+	}
+}
+
+/*
+ * This function doesn't do anything useful. Its only job is to provide a read
+ * handler for a file so that cgroup_file_mode() will add read permissions.
+ */
+static int mem_cgroup_dummy_seq_show(__always_unused struct seq_file *m,
+				     __always_unused void *v)
+{
+	return -EINVAL;
+}
+
+static int memcg_update_tcp_max(struct mem_cgroup *memcg, unsigned long max)
+{
+	int ret;
+
+	mutex_lock(&memcg_max_mutex);
+
+	ret = page_counter_set_max(&memcg->tcpmem, max);
+	if (ret)
+		goto out;
+
+	if (!memcg->tcpmem_active) {
+		/*
+		 * The active flag needs to be written after the static_key
+		 * update. This is what guarantees that the socket activation
+		 * function is the last one to run. See mem_cgroup_sk_alloc()
+		 * for details, and note that we don't mark any socket as
+		 * belonging to this memcg until that flag is up.
+		 *
+		 * We need to do this, because static_keys will span multiple
+		 * sites, but we can't control their order. If we mark a socket
+		 * as accounted, but the accounting functions are not patched in
+		 * yet, we'll lose accounting.
+		 *
+		 * We never race with the readers in mem_cgroup_sk_alloc(),
+		 * because when this value change, the code to process it is not
+		 * patched in yet.
+		 */
+		static_branch_inc(&memcg_sockets_enabled_key);
+		memcg->tcpmem_active = true;
+	}
+out:
+	mutex_unlock(&memcg_max_mutex);
+	return ret;
+}
+
+/*
+ * The user of this function is...
+ * RES_LIMIT.
+ */
+static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long nr_pages;
+	int ret;
+
+	buf = strstrip(buf);
+	ret = page_counter_memparse(buf, "-1", &nr_pages);
+	if (ret)
+		return ret;
+
+	switch (MEMFILE_ATTR(of_cft(of)->private)) {
+	case RES_LIMIT:
+		if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */
+			ret = -EINVAL;
+			break;
+		}
+		switch (MEMFILE_TYPE(of_cft(of)->private)) {
+		case _MEM:
+			ret = mem_cgroup_resize_max(memcg, nr_pages, false);
+			break;
+		case _MEMSWAP:
+			ret = mem_cgroup_resize_max(memcg, nr_pages, true);
+			break;
+		case _KMEM:
+			pr_warn_once("kmem.limit_in_bytes is deprecated and will be removed. "
+				     "Writing any value to this file has no effect. "
+				     "Please report your usecase to linux-mm@kvack.org if you "
+				     "depend on this functionality.\n");
+			ret = 0;
+			break;
+		case _TCP:
+			ret = memcg_update_tcp_max(memcg, nr_pages);
+			break;
+		}
+		break;
+	case RES_SOFT_LIMIT:
+		if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
+			ret = -EOPNOTSUPP;
+		} else {
+			WRITE_ONCE(memcg->soft_limit, nr_pages);
+			ret = 0;
+		}
+		break;
+	}
+	return ret ?: nbytes;
+}
+
+static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf,
+				size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	struct page_counter *counter;
+
+	switch (MEMFILE_TYPE(of_cft(of)->private)) {
+	case _MEM:
+		counter = &memcg->memory;
+		break;
+	case _MEMSWAP:
+		counter = &memcg->memsw;
+		break;
+	case _KMEM:
+		counter = &memcg->kmem;
+		break;
+	case _TCP:
+		counter = &memcg->tcpmem;
+		break;
+	default:
+		BUG();
+	}
+
+	switch (MEMFILE_ATTR(of_cft(of)->private)) {
+	case RES_MAX_USAGE:
+		page_counter_reset_watermark(counter);
+		break;
+	case RES_FAILCNT:
+		counter->failcnt = 0;
+		break;
+	default:
+		BUG();
+	}
+
+	return nbytes;
+}
+
+#ifdef CONFIG_NUMA
+
+#define LRU_ALL_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE))
+#define LRU_ALL_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON))
+#define LRU_ALL	     ((1 << NR_LRU_LISTS) - 1)
+
+static int memcg_numa_stat_show(struct seq_file *m, void *v)
+{
+	struct numa_stat {
+		const char *name;
+		unsigned int lru_mask;
+	};
+
+	static const struct numa_stat stats[] = {
+		{ "total", LRU_ALL },
+		{ "file", LRU_ALL_FILE },
+		{ "anon", LRU_ALL_ANON },
+		{ "unevictable", BIT(LRU_UNEVICTABLE) },
+	};
+	const struct numa_stat *stat;
+	int nid;
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+
+	mem_cgroup_flush_stats(memcg);
+
+	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
+		seq_printf(m, "%s=%lu", stat->name,
+			   mem_cgroup_nr_lru_pages(memcg, stat->lru_mask,
+						   false));
+		for_each_node_state(nid, N_MEMORY)
+			seq_printf(m, " N%d=%lu", nid,
+				   mem_cgroup_node_nr_lru_pages(memcg, nid,
+							stat->lru_mask, false));
+		seq_putc(m, '\n');
+	}
+
+	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
+
+		seq_printf(m, "hierarchical_%s=%lu", stat->name,
+			   mem_cgroup_nr_lru_pages(memcg, stat->lru_mask,
+						   true));
+		for_each_node_state(nid, N_MEMORY)
+			seq_printf(m, " N%d=%lu", nid,
+				   mem_cgroup_node_nr_lru_pages(memcg, nid,
+							stat->lru_mask, true));
+		seq_putc(m, '\n');
+	}
+
+	return 0;
+}
+#endif /* CONFIG_NUMA */
+
+static const unsigned int memcg1_stats[] = {
+	NR_FILE_PAGES,
+	NR_ANON_MAPPED,
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	NR_ANON_THPS,
+#endif
+	NR_SHMEM,
+	NR_FILE_MAPPED,
+	NR_FILE_DIRTY,
+	NR_WRITEBACK,
+	WORKINGSET_REFAULT_ANON,
+	WORKINGSET_REFAULT_FILE,
+#ifdef CONFIG_SWAP
+	MEMCG_SWAP,
+	NR_SWAPCACHE,
+#endif
+};
+
+static const char *const memcg1_stat_names[] = {
+	"cache",
+	"rss",
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	"rss_huge",
+#endif
+	"shmem",
+	"mapped_file",
+	"dirty",
+	"writeback",
+	"workingset_refault_anon",
+	"workingset_refault_file",
+#ifdef CONFIG_SWAP
+	"swap",
+	"swapcached",
+#endif
+};
+
+/* Universal VM events cgroup1 shows, original sort order */
+static const unsigned int memcg1_events[] = {
+	PGPGIN,
+	PGPGOUT,
+	PGFAULT,
+	PGMAJFAULT,
+};
+
+void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
+{
+	unsigned long memory, memsw;
+	struct mem_cgroup *mi;
+	unsigned int i;
+
+	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
+
+	mem_cgroup_flush_stats(memcg);
+
+	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
+		unsigned long nr;
+
+		nr = memcg_page_state_local_output(memcg, memcg1_stats[i]);
+		seq_buf_printf(s, "%s %lu\n", memcg1_stat_names[i], nr);
+	}
+
+	for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
+		seq_buf_printf(s, "%s %lu\n", vm_event_name(memcg1_events[i]),
+			       memcg_events_local(memcg, memcg1_events[i]));
+
+	for (i = 0; i < NR_LRU_LISTS; i++)
+		seq_buf_printf(s, "%s %lu\n", lru_list_name(i),
+			       memcg_page_state_local(memcg, NR_LRU_BASE + i) *
+			       PAGE_SIZE);
+
+	/* Hierarchical information */
+	memory = memsw = PAGE_COUNTER_MAX;
+	for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) {
+		memory = min(memory, READ_ONCE(mi->memory.max));
+		memsw = min(memsw, READ_ONCE(mi->memsw.max));
+	}
+	seq_buf_printf(s, "hierarchical_memory_limit %llu\n",
+		       (u64)memory * PAGE_SIZE);
+	seq_buf_printf(s, "hierarchical_memsw_limit %llu\n",
+		       (u64)memsw * PAGE_SIZE);
+
+	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
+		unsigned long nr;
+
+		nr = memcg_page_state_output(memcg, memcg1_stats[i]);
+		seq_buf_printf(s, "total_%s %llu\n", memcg1_stat_names[i],
+			       (u64)nr);
+	}
+
+	for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
+		seq_buf_printf(s, "total_%s %llu\n",
+			       vm_event_name(memcg1_events[i]),
+			       (u64)memcg_events(memcg, memcg1_events[i]));
+
+	for (i = 0; i < NR_LRU_LISTS; i++)
+		seq_buf_printf(s, "total_%s %llu\n", lru_list_name(i),
+			       (u64)memcg_page_state(memcg, NR_LRU_BASE + i) *
+			       PAGE_SIZE);
+
+#ifdef CONFIG_DEBUG_VM
+	{
+		pg_data_t *pgdat;
+		struct mem_cgroup_per_node *mz;
+		unsigned long anon_cost = 0;
+		unsigned long file_cost = 0;
+
+		for_each_online_pgdat(pgdat) {
+			mz = memcg->nodeinfo[pgdat->node_id];
+
+			anon_cost += mz->lruvec.anon_cost;
+			file_cost += mz->lruvec.file_cost;
+		}
+		seq_buf_printf(s, "anon_cost %lu\n", anon_cost);
+		seq_buf_printf(s, "file_cost %lu\n", file_cost);
+	}
+#endif
+}
+
+static u64 mem_cgroup_swappiness_read(struct cgroup_subsys_state *css,
+				      struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	return mem_cgroup_swappiness(memcg);
+}
+
+static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
+				       struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	if (val > MAX_SWAPPINESS)
+		return -EINVAL;
+
+	if (!mem_cgroup_is_root(memcg))
+		WRITE_ONCE(memcg->swappiness, val);
+	else
+		WRITE_ONCE(vm_swappiness, val);
+
+	return 0;
+}
+
+static int mem_cgroup_oom_control_read(struct seq_file *sf, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_seq(sf);
+
+	seq_printf(sf, "oom_kill_disable %d\n", READ_ONCE(memcg->oom_kill_disable));
+	seq_printf(sf, "under_oom %d\n", (bool)memcg->under_oom);
+	seq_printf(sf, "oom_kill %lu\n",
+		   atomic_long_read(&memcg->memory_events[MEMCG_OOM_KILL]));
+	return 0;
+}
+
+static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
+	struct cftype *cft, u64 val)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	/* cannot set to root cgroup and only 0 and 1 are allowed */
+	if (mem_cgroup_is_root(memcg) || !((val == 0) || (val == 1)))
+		return -EINVAL;
+
+	WRITE_ONCE(memcg->oom_kill_disable, val);
+	if (!val)
+		memcg1_oom_recover(memcg);
+
+	return 0;
+}
+
+#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_SLUB_DEBUG)
+static int mem_cgroup_slab_show(struct seq_file *m, void *p)
+{
+	/*
+	 * Deprecated.
+	 * Please, take a look at tools/cgroup/memcg_slabinfo.py .
+	 */
+	return 0;
+}
+#endif
+
+struct cftype mem_cgroup_legacy_files[] = {
+	{
+		.name = "usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "max_usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
+		.write = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
+		.write = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "soft_limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
+		.write = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "failcnt",
+		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
+		.write = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "stat",
+		.seq_show = memory_stat_show,
+	},
+	{
+		.name = "force_empty",
+		.write = mem_cgroup_force_empty_write,
+	},
+	{
+		.name = "use_hierarchy",
+		.write_u64 = mem_cgroup_hierarchy_write,
+		.read_u64 = mem_cgroup_hierarchy_read,
+	},
+	{
+		.name = "cgroup.event_control",		/* XXX: for compat */
+		.write = memcg_write_event_control,
+		.flags = CFTYPE_NO_PREFIX | CFTYPE_WORLD_WRITABLE,
+	},
+	{
+		.name = "swappiness",
+		.read_u64 = mem_cgroup_swappiness_read,
+		.write_u64 = mem_cgroup_swappiness_write,
+	},
+	{
+		.name = "move_charge_at_immigrate",
+		.read_u64 = mem_cgroup_move_charge_read,
+		.write_u64 = mem_cgroup_move_charge_write,
+	},
+	{
+		.name = "oom_control",
+		.seq_show = mem_cgroup_oom_control_read,
+		.write_u64 = mem_cgroup_oom_control_write,
+	},
+	{
+		.name = "pressure_level",
+		.seq_show = mem_cgroup_dummy_seq_show,
+	},
+#ifdef CONFIG_NUMA
+	{
+		.name = "numa_stat",
+		.seq_show = memcg_numa_stat_show,
+	},
+#endif
+	{
+		.name = "kmem.limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
+		.write = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "kmem.usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "kmem.failcnt",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT),
+		.write = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "kmem.max_usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE),
+		.write = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_SLUB_DEBUG)
+	{
+		.name = "kmem.slabinfo",
+		.seq_show = mem_cgroup_slab_show,
+	},
+#endif
+	{
+		.name = "kmem.tcp.limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_TCP, RES_LIMIT),
+		.write = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "kmem.tcp.usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_TCP, RES_USAGE),
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "kmem.tcp.failcnt",
+		.private = MEMFILE_PRIVATE(_TCP, RES_FAILCNT),
+		.write = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "kmem.tcp.max_usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_TCP, RES_MAX_USAGE),
+		.write = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{ },	/* terminate */
+};
+
+struct cftype memsw_files[] = {
+	{
+		.name = "memsw.usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "memsw.max_usage_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
+		.write = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "memsw.limit_in_bytes",
+		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
+		.write = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{
+		.name = "memsw.failcnt",
+		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
+		.write = mem_cgroup_reset,
+		.read_u64 = mem_cgroup_read_u64,
+	},
+	{ },	/* terminate */
+};
+
 static int __init memcg1_init(void)
 {
 	int node;
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index 972c493a8ae3..7be4670d9abb 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -3,6 +3,8 @@
 #ifndef __MM_MEMCONTROL_V1_H
 #define __MM_MEMCONTROL_V1_H
 
+#include <linux/cgroup-defs.h>
+
 void memcg1_update_tree(struct mem_cgroup *memcg, int nid);
 void memcg1_remove_from_trees(struct mem_cgroup *memcg);
 
@@ -34,12 +36,6 @@ int memcg1_can_attach(struct cgroup_taskset *tset);
 void memcg1_cancel_attach(struct cgroup_taskset *tset);
 void memcg1_move_task(void);
 
-struct cftype;
-u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css,
-				struct cftype *cft);
-int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
-				 struct cftype *cft, u64 val);
-
 /*
  * Per memcg event counter is incremented at every pagein/pageout. With THP,
  * it will be incremented by the number of pages. This counter is used
@@ -86,11 +82,28 @@ enum res_type {
 bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 				enum mem_cgroup_events_target target);
 unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap);
-ssize_t memcg_write_event_control(struct kernfs_open_file *of,
-				  char *buf, size_t nbytes, loff_t off);
 
 bool memcg1_oom_prepare(struct mem_cgroup *memcg, bool *locked);
 void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked);
 void memcg1_oom_recover(struct mem_cgroup *memcg);
 
+void drain_all_stock(struct mem_cgroup *root_memcg);
+unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg,
+				      unsigned int lru_mask, bool tree);
+unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
+					   int nid, unsigned int lru_mask,
+					   bool tree);
+
+unsigned long memcg_events(struct mem_cgroup *memcg, int event);
+unsigned long memcg_events_local(struct mem_cgroup *memcg, int event);
+unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx);
+unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
+unsigned long memcg_page_state_local_output(struct mem_cgroup *memcg, int item);
+int memory_stat_show(struct seq_file *m, void *v);
+
+void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s);
+
+extern struct cftype memsw_files[];
+extern struct cftype mem_cgroup_legacy_files[];
+
 #endif	/* __MM_MEMCONTROL_V1_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 37e0af5b26f3..c7341e811945 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -96,10 +96,6 @@ static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
 #define THRESHOLDS_EVENTS_TARGET 128
 #define SOFTLIMIT_EVENTS_TARGET 1024
 
-#define MEMFILE_PRIVATE(x, val)	((x) << 16 | (val))
-#define MEMFILE_TYPE(val)	((val) >> 16 & 0xffff)
-#define MEMFILE_ATTR(val)	((val) & 0xffff)
-
 static inline bool task_is_dying(void)
 {
 	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
@@ -676,7 +672,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
 }
 
 /* idx can be of type enum memcg_stat_item or node_stat_item. */
-static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
+unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
 {
 	long x;
 	int i = memcg_stats_index(idx);
@@ -825,7 +821,7 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
 	memcg_stats_unlock();
 }
 
-static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
+unsigned long memcg_events(struct mem_cgroup *memcg, int event)
 {
 	int i = memcg_events_index(event);
 
@@ -835,7 +831,7 @@ static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
 	return READ_ONCE(memcg->vmstats->events[i]);
 }
 
-static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
+unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
 {
 	int i = memcg_events_index(event);
 
@@ -1420,15 +1416,13 @@ static int memcg_page_state_output_unit(int item)
 	}
 }
 
-static inline unsigned long memcg_page_state_output(struct mem_cgroup *memcg,
-						    int item)
+unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item)
 {
 	return memcg_page_state(memcg, item) *
 		memcg_page_state_output_unit(item);
 }
 
-static inline unsigned long memcg_page_state_local_output(
-		struct mem_cgroup *memcg, int item)
+unsigned long memcg_page_state_local_output(struct mem_cgroup *memcg, int item)
 {
 	return memcg_page_state_local(memcg, item) *
 		memcg_page_state_output_unit(item);
@@ -1487,8 +1481,6 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	WARN_ON_ONCE(seq_buf_has_overflowed(s));
 }
 
-static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s);
-
 static void memory_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 {
 	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
@@ -1861,7 +1853,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
  * Drains all per-CPU charge caches for given root_memcg resp. subtree
  * of the hierarchy under it.
  */
-static void drain_all_stock(struct mem_cgroup *root_memcg)
+void drain_all_stock(struct mem_cgroup *root_memcg)
 {
 	int cpu, curcpu;
 
@@ -3115,120 +3107,6 @@ void split_page_memcg(struct page *head, int old_order, int new_order)
 		css_get_many(&memcg->css, old_nr / new_nr - 1);
 }
 
-
-static DEFINE_MUTEX(memcg_max_mutex);
-
-static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
-				 unsigned long max, bool memsw)
-{
-	bool enlarge = false;
-	bool drained = false;
-	int ret;
-	bool limits_invariant;
-	struct page_counter *counter = memsw ? &memcg->memsw : &memcg->memory;
-
-	do {
-		if (signal_pending(current)) {
-			ret = -EINTR;
-			break;
-		}
-
-		mutex_lock(&memcg_max_mutex);
-		/*
-		 * Make sure that the new limit (memsw or memory limit) doesn't
-		 * break our basic invariant rule memory.max <= memsw.max.
-		 */
-		limits_invariant = memsw ? max >= READ_ONCE(memcg->memory.max) :
-					   max <= memcg->memsw.max;
-		if (!limits_invariant) {
-			mutex_unlock(&memcg_max_mutex);
-			ret = -EINVAL;
-			break;
-		}
-		if (max > counter->max)
-			enlarge = true;
-		ret = page_counter_set_max(counter, max);
-		mutex_unlock(&memcg_max_mutex);
-
-		if (!ret)
-			break;
-
-		if (!drained) {
-			drain_all_stock(memcg);
-			drained = true;
-			continue;
-		}
-
-		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
-			ret = -EBUSY;
-			break;
-		}
-	} while (true);
-
-	if (!ret && enlarge)
-		memcg1_oom_recover(memcg);
-
-	return ret;
-}
-
-/*
- * Reclaims as many pages from the given memcg as possible.
- *
- * Caller is responsible for holding css reference for memcg.
- */
-static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
-{
-	int nr_retries = MAX_RECLAIM_RETRIES;
-
-	/* we call try-to-free pages for make this cgroup empty */
-	lru_add_drain_all();
-
-	drain_all_stock(memcg);
-
-	/* try to free all pages in this cgroup */
-	while (nr_retries && page_counter_read(&memcg->memory)) {
-		if (signal_pending(current))
-			return -EINTR;
-
-		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
-						  MEMCG_RECLAIM_MAY_SWAP, NULL))
-			nr_retries--;
-	}
-
-	return 0;
-}
-
-static ssize_t mem_cgroup_force_empty_write(struct kernfs_open_file *of,
-					    char *buf, size_t nbytes,
-					    loff_t off)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
-
-	if (mem_cgroup_is_root(memcg))
-		return -EINVAL;
-	return mem_cgroup_force_empty(memcg) ?: nbytes;
-}
-
-static u64 mem_cgroup_hierarchy_read(struct cgroup_subsys_state *css,
-				     struct cftype *cft)
-{
-	return 1;
-}
-
-static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css,
-				      struct cftype *cft, u64 val)
-{
-	if (val == 1)
-		return 0;
-
-	pr_warn_once("Non-hierarchical mode is deprecated. "
-		     "Please report your usecase to linux-mm@kvack.org if you "
-		     "depend on this functionality.\n");
-
-	return -EINVAL;
-}
-
 unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 {
 	unsigned long val;
@@ -3251,67 +3129,6 @@ unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 	return val;
 }
 
-enum {
-	RES_USAGE,
-	RES_LIMIT,
-	RES_MAX_USAGE,
-	RES_FAILCNT,
-	RES_SOFT_LIMIT,
-};
-
-static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
-			       struct cftype *cft)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-	struct page_counter *counter;
-
-	switch (MEMFILE_TYPE(cft->private)) {
-	case _MEM:
-		counter = &memcg->memory;
-		break;
-	case _MEMSWAP:
-		counter = &memcg->memsw;
-		break;
-	case _KMEM:
-		counter = &memcg->kmem;
-		break;
-	case _TCP:
-		counter = &memcg->tcpmem;
-		break;
-	default:
-		BUG();
-	}
-
-	switch (MEMFILE_ATTR(cft->private)) {
-	case RES_USAGE:
-		if (counter == &memcg->memory)
-			return (u64)mem_cgroup_usage(memcg, false) * PAGE_SIZE;
-		if (counter == &memcg->memsw)
-			return (u64)mem_cgroup_usage(memcg, true) * PAGE_SIZE;
-		return (u64)page_counter_read(counter) * PAGE_SIZE;
-	case RES_LIMIT:
-		return (u64)counter->max * PAGE_SIZE;
-	case RES_MAX_USAGE:
-		return (u64)counter->watermark * PAGE_SIZE;
-	case RES_FAILCNT:
-		return counter->failcnt;
-	case RES_SOFT_LIMIT:
-		return (u64)READ_ONCE(memcg->soft_limit) * PAGE_SIZE;
-	default:
-		BUG();
-	}
-}
-
-/*
- * This function doesn't do anything useful. Its only job is to provide a read
- * handler for a file so that cgroup_file_mode() will add read permissions.
- */
-static int mem_cgroup_dummy_seq_show(__always_unused struct seq_file *m,
-				     __always_unused void *v)
-{
-	return -EINVAL;
-}
-
 #ifdef CONFIG_MEMCG_KMEM
 static int memcg_online_kmem(struct mem_cgroup *memcg)
 {
@@ -3373,139 +3190,9 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
-static int memcg_update_tcp_max(struct mem_cgroup *memcg, unsigned long max)
-{
-	int ret;
-
-	mutex_lock(&memcg_max_mutex);
-
-	ret = page_counter_set_max(&memcg->tcpmem, max);
-	if (ret)
-		goto out;
-
-	if (!memcg->tcpmem_active) {
-		/*
-		 * The active flag needs to be written after the static_key
-		 * update. This is what guarantees that the socket activation
-		 * function is the last one to run. See mem_cgroup_sk_alloc()
-		 * for details, and note that we don't mark any socket as
-		 * belonging to this memcg until that flag is up.
-		 *
-		 * We need to do this, because static_keys will span multiple
-		 * sites, but we can't control their order. If we mark a socket
-		 * as accounted, but the accounting functions are not patched in
-		 * yet, we'll lose accounting.
-		 *
-		 * We never race with the readers in mem_cgroup_sk_alloc(),
-		 * because when this value change, the code to process it is not
-		 * patched in yet.
-		 */
-		static_branch_inc(&memcg_sockets_enabled_key);
-		memcg->tcpmem_active = true;
-	}
-out:
-	mutex_unlock(&memcg_max_mutex);
-	return ret;
-}
-
-/*
- * The user of this function is...
- * RES_LIMIT.
- */
-static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
-				char *buf, size_t nbytes, loff_t off)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
-	unsigned long nr_pages;
-	int ret;
-
-	buf = strstrip(buf);
-	ret = page_counter_memparse(buf, "-1", &nr_pages);
-	if (ret)
-		return ret;
-
-	switch (MEMFILE_ATTR(of_cft(of)->private)) {
-	case RES_LIMIT:
-		if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */
-			ret = -EINVAL;
-			break;
-		}
-		switch (MEMFILE_TYPE(of_cft(of)->private)) {
-		case _MEM:
-			ret = mem_cgroup_resize_max(memcg, nr_pages, false);
-			break;
-		case _MEMSWAP:
-			ret = mem_cgroup_resize_max(memcg, nr_pages, true);
-			break;
-		case _KMEM:
-			pr_warn_once("kmem.limit_in_bytes is deprecated and will be removed. "
-				     "Writing any value to this file has no effect. "
-				     "Please report your usecase to linux-mm@kvack.org if you "
-				     "depend on this functionality.\n");
-			ret = 0;
-			break;
-		case _TCP:
-			ret = memcg_update_tcp_max(memcg, nr_pages);
-			break;
-		}
-		break;
-	case RES_SOFT_LIMIT:
-		if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
-			ret = -EOPNOTSUPP;
-		} else {
-			WRITE_ONCE(memcg->soft_limit, nr_pages);
-			ret = 0;
-		}
-		break;
-	}
-	return ret ?: nbytes;
-}
-
-static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf,
-				size_t nbytes, loff_t off)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
-	struct page_counter *counter;
-
-	switch (MEMFILE_TYPE(of_cft(of)->private)) {
-	case _MEM:
-		counter = &memcg->memory;
-		break;
-	case _MEMSWAP:
-		counter = &memcg->memsw;
-		break;
-	case _KMEM:
-		counter = &memcg->kmem;
-		break;
-	case _TCP:
-		counter = &memcg->tcpmem;
-		break;
-	default:
-		BUG();
-	}
-
-	switch (MEMFILE_ATTR(of_cft(of)->private)) {
-	case RES_MAX_USAGE:
-		page_counter_reset_watermark(counter);
-		break;
-	case RES_FAILCNT:
-		counter->failcnt = 0;
-		break;
-	default:
-		BUG();
-	}
-
-	return nbytes;
-}
-
-#ifdef CONFIG_NUMA
-
-#define LRU_ALL_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE))
-#define LRU_ALL_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON))
-#define LRU_ALL	     ((1 << NR_LRU_LISTS) - 1)
-
-static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
-				int nid, unsigned int lru_mask, bool tree)
+unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
+					   int nid, unsigned int lru_mask,
+					   bool tree)
 {
 	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
 	unsigned long nr = 0;
@@ -3524,9 +3211,8 @@ static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
 	return nr;
 }
 
-static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg,
-					     unsigned int lru_mask,
-					     bool tree)
+unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg,
+				      unsigned int lru_mask, bool tree)
 {
 	unsigned long nr = 0;
 	enum lru_list lru;
@@ -3542,221 +3228,6 @@ static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg,
 	return nr;
 }
 
-static int memcg_numa_stat_show(struct seq_file *m, void *v)
-{
-	struct numa_stat {
-		const char *name;
-		unsigned int lru_mask;
-	};
-
-	static const struct numa_stat stats[] = {
-		{ "total", LRU_ALL },
-		{ "file", LRU_ALL_FILE },
-		{ "anon", LRU_ALL_ANON },
-		{ "unevictable", BIT(LRU_UNEVICTABLE) },
-	};
-	const struct numa_stat *stat;
-	int nid;
-	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
-
-	mem_cgroup_flush_stats(memcg);
-
-	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
-		seq_printf(m, "%s=%lu", stat->name,
-			   mem_cgroup_nr_lru_pages(memcg, stat->lru_mask,
-						   false));
-		for_each_node_state(nid, N_MEMORY)
-			seq_printf(m, " N%d=%lu", nid,
-				   mem_cgroup_node_nr_lru_pages(memcg, nid,
-							stat->lru_mask, false));
-		seq_putc(m, '\n');
-	}
-
-	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
-
-		seq_printf(m, "hierarchical_%s=%lu", stat->name,
-			   mem_cgroup_nr_lru_pages(memcg, stat->lru_mask,
-						   true));
-		for_each_node_state(nid, N_MEMORY)
-			seq_printf(m, " N%d=%lu", nid,
-				   mem_cgroup_node_nr_lru_pages(memcg, nid,
-							stat->lru_mask, true));
-		seq_putc(m, '\n');
-	}
-
-	return 0;
-}
-#endif /* CONFIG_NUMA */
-
-static const unsigned int memcg1_stats[] = {
-	NR_FILE_PAGES,
-	NR_ANON_MAPPED,
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	NR_ANON_THPS,
-#endif
-	NR_SHMEM,
-	NR_FILE_MAPPED,
-	NR_FILE_DIRTY,
-	NR_WRITEBACK,
-	WORKINGSET_REFAULT_ANON,
-	WORKINGSET_REFAULT_FILE,
-#ifdef CONFIG_SWAP
-	MEMCG_SWAP,
-	NR_SWAPCACHE,
-#endif
-};
-
-static const char *const memcg1_stat_names[] = {
-	"cache",
-	"rss",
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	"rss_huge",
-#endif
-	"shmem",
-	"mapped_file",
-	"dirty",
-	"writeback",
-	"workingset_refault_anon",
-	"workingset_refault_file",
-#ifdef CONFIG_SWAP
-	"swap",
-	"swapcached",
-#endif
-};
-
-/* Universal VM events cgroup1 shows, original sort order */
-static const unsigned int memcg1_events[] = {
-	PGPGIN,
-	PGPGOUT,
-	PGFAULT,
-	PGMAJFAULT,
-};
-
-static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
-{
-	unsigned long memory, memsw;
-	struct mem_cgroup *mi;
-	unsigned int i;
-
-	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
-
-	mem_cgroup_flush_stats(memcg);
-
-	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
-		unsigned long nr;
-
-		nr = memcg_page_state_local_output(memcg, memcg1_stats[i]);
-		seq_buf_printf(s, "%s %lu\n", memcg1_stat_names[i], nr);
-	}
-
-	for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
-		seq_buf_printf(s, "%s %lu\n", vm_event_name(memcg1_events[i]),
-			       memcg_events_local(memcg, memcg1_events[i]));
-
-	for (i = 0; i < NR_LRU_LISTS; i++)
-		seq_buf_printf(s, "%s %lu\n", lru_list_name(i),
-			       memcg_page_state_local(memcg, NR_LRU_BASE + i) *
-			       PAGE_SIZE);
-
-	/* Hierarchical information */
-	memory = memsw = PAGE_COUNTER_MAX;
-	for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) {
-		memory = min(memory, READ_ONCE(mi->memory.max));
-		memsw = min(memsw, READ_ONCE(mi->memsw.max));
-	}
-	seq_buf_printf(s, "hierarchical_memory_limit %llu\n",
-		       (u64)memory * PAGE_SIZE);
-	seq_buf_printf(s, "hierarchical_memsw_limit %llu\n",
-		       (u64)memsw * PAGE_SIZE);
-
-	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
-		unsigned long nr;
-
-		nr = memcg_page_state_output(memcg, memcg1_stats[i]);
-		seq_buf_printf(s, "total_%s %llu\n", memcg1_stat_names[i],
-			       (u64)nr);
-	}
-
-	for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
-		seq_buf_printf(s, "total_%s %llu\n",
-			       vm_event_name(memcg1_events[i]),
-			       (u64)memcg_events(memcg, memcg1_events[i]));
-
-	for (i = 0; i < NR_LRU_LISTS; i++)
-		seq_buf_printf(s, "total_%s %llu\n", lru_list_name(i),
-			       (u64)memcg_page_state(memcg, NR_LRU_BASE + i) *
-			       PAGE_SIZE);
-
-#ifdef CONFIG_DEBUG_VM
-	{
-		pg_data_t *pgdat;
-		struct mem_cgroup_per_node *mz;
-		unsigned long anon_cost = 0;
-		unsigned long file_cost = 0;
-
-		for_each_online_pgdat(pgdat) {
-			mz = memcg->nodeinfo[pgdat->node_id];
-
-			anon_cost += mz->lruvec.anon_cost;
-			file_cost += mz->lruvec.file_cost;
-		}
-		seq_buf_printf(s, "anon_cost %lu\n", anon_cost);
-		seq_buf_printf(s, "file_cost %lu\n", file_cost);
-	}
-#endif
-}
-
-static u64 mem_cgroup_swappiness_read(struct cgroup_subsys_state *css,
-				      struct cftype *cft)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-
-	return mem_cgroup_swappiness(memcg);
-}
-
-static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
-				       struct cftype *cft, u64 val)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-
-	if (val > MAX_SWAPPINESS)
-		return -EINVAL;
-
-	if (!mem_cgroup_is_root(memcg))
-		WRITE_ONCE(memcg->swappiness, val);
-	else
-		WRITE_ONCE(vm_swappiness, val);
-
-	return 0;
-}
-
-static int mem_cgroup_oom_control_read(struct seq_file *sf, void *v)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_seq(sf);
-
-	seq_printf(sf, "oom_kill_disable %d\n", READ_ONCE(memcg->oom_kill_disable));
-	seq_printf(sf, "under_oom %d\n", (bool)memcg->under_oom);
-	seq_printf(sf, "oom_kill %lu\n",
-		   atomic_long_read(&memcg->memory_events[MEMCG_OOM_KILL]));
-	return 0;
-}
-
-static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
-	struct cftype *cft, u64 val)
-{
-	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
-
-	/* cannot set to root cgroup and only 0 and 1 are allowed */
-	if (mem_cgroup_is_root(memcg) || !((val == 0) || (val == 1)))
-		return -EINVAL;
-
-	WRITE_ONCE(memcg->oom_kill_disable, val);
-	if (!val)
-		memcg1_oom_recover(memcg);
-
-	return 0;
-}
-
 #ifdef CONFIG_CGROUP_WRITEBACK
 
 #include <trace/events/writeback.h>
@@ -3970,147 +3441,6 @@ static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg)
 
 #endif	/* CONFIG_CGROUP_WRITEBACK */
 
-#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_SLUB_DEBUG)
-static int mem_cgroup_slab_show(struct seq_file *m, void *p)
-{
-	/*
-	 * Deprecated.
-	 * Please, take a look at tools/cgroup/memcg_slabinfo.py .
-	 */
-	return 0;
-}
-#endif
-
-static int memory_stat_show(struct seq_file *m, void *v);
-
-static struct cftype mem_cgroup_legacy_files[] = {
-	{
-		.name = "usage_in_bytes",
-		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
-		.read_u64 = mem_cgroup_read_u64,
-	},
-	{
-		.name = "max_usage_in_bytes",
-		.private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
-		.write = mem_cgroup_reset,
-		.read_u64 = mem_cgroup_read_u64,
-	},
-	{
-		.name = "limit_in_bytes",
-		.private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
-		.write = mem_cgroup_write,
-		.read_u64 = mem_cgroup_read_u64,
-	},
-	{
-		.name = "soft_limit_in_bytes",
-		.private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
-		.write = mem_cgroup_write,
-		.read_u64 = mem_cgroup_read_u64,
-	},
-	{
-		.name = "failcnt",
-		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
-		.write = mem_cgroup_reset,
-		.read_u64 = mem_cgroup_read_u64,
-	},
-	{
-		.name = "stat",
-		.seq_show = memory_stat_show,
-	},
-	{
-		.name = "force_empty",
-		.write = mem_cgroup_force_empty_write,
-	},
-	{
-		.name = "use_hierarchy",
-		.write_u64 = mem_cgroup_hierarchy_write,
-		.read_u64 = mem_cgroup_hierarchy_read,
-	},
-	{
-		.name = "cgroup.event_control",		/* XXX: for compat */
-		.write = memcg_write_event_control,
-		.flags = CFTYPE_NO_PREFIX | CFTYPE_WORLD_WRITABLE,
-	},
-	{
-		.name = "swappiness",
-		.read_u64 = mem_cgroup_swappiness_read,
-		.write_u64 = mem_cgroup_swappiness_write,
-	},
-	{
-		.name = "move_charge_at_immigrate",
-		.read_u64 = mem_cgroup_move_charge_read,
-		.write_u64 = mem_cgroup_move_charge_write,
-	},
-	{
-		.name = "oom_control",
-		.seq_show = mem_cgroup_oom_control_read,
-		.write_u64 = mem_cgroup_oom_control_write,
-	},
-	{
-		.name = "pressure_level",
-		.seq_show = mem_cgroup_dummy_seq_show,
-	},
-#ifdef CONFIG_NUMA
-	{
-		.name = "numa_stat",
-		.seq_show = memcg_numa_stat_show,
-	},
-#endif
-	{
-		.name = "kmem.limit_in_bytes",
-		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
-		.write = mem_cgroup_write,
-		.read_u64 = mem_cgroup_read_u64,
-	},
-	{
-		.name = "kmem.usage_in_bytes",
-		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
-		.read_u64 = mem_cgroup_read_u64,
-	},
-	{
-		.name = "kmem.failcnt",
-		.private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT),
-		.write = mem_cgroup_reset,
-		.read_u64 = mem_cgroup_read_u64,
-	},
-	{
-		.name = "kmem.max_usage_in_bytes",
-		.private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE),
-		.write = mem_cgroup_reset,
-		.read_u64 = mem_cgroup_read_u64,
-	},
-#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_SLUB_DEBUG)
-	{
-		.name = "kmem.slabinfo",
-		.seq_show = mem_cgroup_slab_show,
-	},
-#endif
-	{
-		.name = "kmem.tcp.limit_in_bytes",
-		.private = MEMFILE_PRIVATE(_TCP, RES_LIMIT),
-		.write = mem_cgroup_write,
-		.read_u64 = mem_cgroup_read_u64,
-	},
-	{
-		.name = "kmem.tcp.usage_in_bytes",
-		.private = MEMFILE_PRIVATE(_TCP, RES_USAGE),
-		.read_u64 = mem_cgroup_read_u64,
-	},
-	{
-		.name = "kmem.tcp.failcnt",
-		.private = MEMFILE_PRIVATE(_TCP, RES_FAILCNT),
-		.write = mem_cgroup_reset,
-		.read_u64 = mem_cgroup_read_u64,
-	},
-	{
-		.name = "kmem.tcp.max_usage_in_bytes",
-		.private = MEMFILE_PRIVATE(_TCP, RES_MAX_USAGE),
-		.write = mem_cgroup_reset,
-		.read_u64 = mem_cgroup_read_u64,
-	},
-	{ },	/* terminate */
-};
-
 /*
  * Private memory cgroup IDR
  *
@@ -4902,7 +4232,7 @@ static int memory_events_local_show(struct seq_file *m, void *v)
 	return 0;
 }
 
-static int memory_stat_show(struct seq_file *m, void *v)
+int memory_stat_show(struct seq_file *m, void *v)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 	char *buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
@@ -6133,33 +5463,6 @@ static struct cftype swap_files[] = {
 	{ }	/* terminate */
 };
 
-static struct cftype memsw_files[] = {
-	{
-		.name = "memsw.usage_in_bytes",
-		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
-		.read_u64 = mem_cgroup_read_u64,
-	},
-	{
-		.name = "memsw.max_usage_in_bytes",
-		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
-		.write = mem_cgroup_reset,
-		.read_u64 = mem_cgroup_read_u64,
-	},
-	{
-		.name = "memsw.limit_in_bytes",
-		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
-		.write = mem_cgroup_write,
-		.read_u64 = mem_cgroup_read_u64,
-	},
-	{
-		.name = "memsw.failcnt",
-		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
-		.write = mem_cgroup_reset,
-		.read_u64 = mem_cgroup_read_u64,
-	},
-	{ },	/* terminate */
-};
-
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
 /**
  * obj_cgroup_may_zswap - check if this cgroup can zswap
-- 
2.45.2



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v2 11/14] mm: memcg: make memcg1_update_tree() static
  2024-06-25  0:58 [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option Roman Gushchin
                   ` (9 preceding siblings ...)
  2024-06-25  0:59 ` [PATCH v2 10/14] mm: memcg: move cgroup v1 interface files to memcontrol-v1.c Roman Gushchin
@ 2024-06-25  0:59 ` Roman Gushchin
  2024-06-25  7:09   ` Michal Hocko
  2024-06-25  0:59 ` [PATCH v2 12/14] mm: memcg: group cgroup v1 memcg related declarations Roman Gushchin
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2024-06-25  0:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm, Roman Gushchin

memcg1_update_tree() is no longer used outside of mm/memcontrol-v1.c,
so define it as static and remove the declaration from the header
file.
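
Purely for illustration (not from the patch): once the last external
caller has moved into the same translation unit, the symbol can take
internal linkage and the header declaration becomes dead weight. A
minimal userspace analogue:

#include <stdio.h>

/*
 * 'static' makes update_tree() private to this file: no header
 * declaration is needed, and the compiler is free to inline it
 * into its remaining callers.
 */
static void update_tree(int nid)
{
	printf("updating soft limit tree for node %d\n", nid);
}

int main(void)
{
	update_tree(0);
	return 0;
}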

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/memcontrol-v1.c | 2 +-
 mm/memcontrol-v1.h | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 1b7337d0170d..f89de413004b 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -201,7 +201,7 @@ static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
 	return excess;
 }
 
-void memcg1_update_tree(struct mem_cgroup *memcg, int nid)
+static void memcg1_update_tree(struct mem_cgroup *memcg, int nid)
 {
 	unsigned long excess;
 	struct mem_cgroup_per_node *mz;
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index 7be4670d9abb..7d6ac4a4fb36 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -5,7 +5,6 @@
 
 #include <linux/cgroup-defs.h>
 
-void memcg1_update_tree(struct mem_cgroup *memcg, int nid);
 void memcg1_remove_from_trees(struct mem_cgroup *memcg);
 
 static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg)
-- 
2.45.2



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v2 12/14] mm: memcg: group cgroup v1 memcg related declarations
  2024-06-25  0:58 [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option Roman Gushchin
                   ` (10 preceding siblings ...)
  2024-06-25  0:59 ` [PATCH v2 11/14] mm: memcg: make memcg1_update_tree() static Roman Gushchin
@ 2024-06-25  0:59 ` Roman Gushchin
  2024-06-25  7:09   ` Michal Hocko
  2024-06-25  0:59 ` [PATCH v2 13/14] mm: memcg: put cgroup v1-related members of task_struct under config option Roman Gushchin
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2024-06-25  0:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm, Roman Gushchin

Group all cgroup v1-related declarations at the end of
include/linux/memcontrol.h and mm/memcontrol-v1.h, with the intention
of putting them all under a config option later on. This should also
make the code easier to follow and maintain.
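
As a hint of where this is going, here is a hedged sketch (not from
this patch) of the stub pattern the grouping enables, using
memcg1_remove_from_trees() as patch 13 below does; the mock struct
and main() exist only so the sketch builds standalone:

#include <stdio.h>

struct mem_cgroup { int id; };	/* mock; the real one is in memcontrol.h */

/* #define CONFIG_MEMCG_V1 1 -- uncomment to simulate a v1-enabled build */

#ifdef CONFIG_MEMCG_V1
void memcg1_remove_from_trees(struct mem_cgroup *memcg);	/* real definition in memcontrol-v1.o */
#else
static inline void memcg1_remove_from_trees(struct mem_cgroup *memcg) {}
#endif

int main(void)
{
	struct mem_cgroup memcg = { .id = 1 };

	/* compiles away entirely when v1 support is configured out */
	memcg1_remove_from_trees(&memcg);
	printf("ok\n");
	return 0;
}

Callers need no #ifdefs of their own; only the grouped declarations
change shape.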

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/linux/memcontrol.h | 144 +++++++++++++++++++------------------
 mm/memcontrol-v1.h         |  89 ++++++++++++-----------
 2 files changed, 123 insertions(+), 110 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 588179d29849..a70d64ed04f5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -950,39 +950,13 @@ static inline void mem_cgroup_exit_user_fault(void)
 	current->in_user_fault = 0;
 }
 
-static inline bool task_in_memcg_oom(struct task_struct *p)
-{
-	return p->memcg_in_oom;
-}
-
-bool mem_cgroup_oom_synchronize(bool wait);
 struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
 					    struct mem_cgroup *oom_domain);
 void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
 
-void folio_memcg_lock(struct folio *folio);
-void folio_memcg_unlock(struct folio *folio);
-
 void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
 		       int val);
 
-/* try to stablize folio_memcg() for all the pages in a memcg */
-static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
-{
-	rcu_read_lock();
-
-	if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account))
-		return true;
-
-	rcu_read_unlock();
-	return false;
-}
-
-static inline void mem_cgroup_unlock_pages(void)
-{
-	rcu_read_unlock();
-}
-
 /* idx can be of type enum memcg_stat_item or node_stat_item */
 static inline void mod_memcg_state(struct mem_cgroup *memcg,
 				   enum memcg_stat_item idx, int val)
@@ -1109,10 +1083,6 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
 
 void split_page_memcg(struct page *head, int old_order, int new_order);
 
-unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order,
-					gfp_t gfp_mask,
-					unsigned long *total_scanned);
-
 #else /* CONFIG_MEMCG */
 
 #define MEM_CGROUP_ID_SHIFT	0
@@ -1423,26 +1393,6 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
 {
 }
 
-static inline void folio_memcg_lock(struct folio *folio)
-{
-}
-
-static inline void folio_memcg_unlock(struct folio *folio)
-{
-}
-
-static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
-{
-	/* to match folio_memcg_rcu() */
-	rcu_read_lock();
-	return true;
-}
-
-static inline void mem_cgroup_unlock_pages(void)
-{
-	rcu_read_unlock();
-}
-
 static inline void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 {
 }
@@ -1455,16 +1405,6 @@ static inline void mem_cgroup_exit_user_fault(void)
 {
 }
 
-static inline bool task_in_memcg_oom(struct task_struct *p)
-{
-	return false;
-}
-
-static inline bool mem_cgroup_oom_synchronize(bool wait)
-{
-	return false;
-}
-
 static inline struct mem_cgroup *mem_cgroup_get_oom_group(
 	struct task_struct *victim, struct mem_cgroup *oom_domain)
 {
@@ -1558,14 +1498,6 @@ void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
 static inline void split_page_memcg(struct page *head, int old_order, int new_order)
 {
 }
-
-static inline
-unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order,
-					gfp_t gfp_mask,
-					unsigned long *total_scanned)
-{
-	return 0;
-}
 #endif /* CONFIG_MEMCG */
 
 /*
@@ -1916,4 +1848,80 @@ static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
 }
 #endif
 
+
+/* Cgroup v1-related declarations */
+
+#ifdef CONFIG_MEMCG
+unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order,
+					gfp_t gfp_mask,
+					unsigned long *total_scanned);
+
+bool mem_cgroup_oom_synchronize(bool wait);
+
+static inline bool task_in_memcg_oom(struct task_struct *p)
+{
+	return p->memcg_in_oom;
+}
+
+void folio_memcg_lock(struct folio *folio);
+void folio_memcg_unlock(struct folio *folio);
+
+/* try to stablize folio_memcg() for all the pages in a memcg */
+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
+{
+	rcu_read_lock();
+
+	if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account))
+		return true;
+
+	rcu_read_unlock();
+	return false;
+}
+
+static inline void mem_cgroup_unlock_pages(void)
+{
+	rcu_read_unlock();
+}
+
+#else /* CONFIG_MEMCG */
+static inline
+unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order,
+					gfp_t gfp_mask,
+					unsigned long *total_scanned)
+{
+	return 0;
+}
+
+static inline void folio_memcg_lock(struct folio *folio)
+{
+}
+
+static inline void folio_memcg_unlock(struct folio *folio)
+{
+}
+
+static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
+{
+	/* to match folio_memcg_rcu() */
+	rcu_read_lock();
+	return true;
+}
+
+static inline void mem_cgroup_unlock_pages(void)
+{
+	rcu_read_unlock();
+}
+
+static inline bool task_in_memcg_oom(struct task_struct *p)
+{
+	return false;
+}
+
+static inline bool mem_cgroup_oom_synchronize(bool wait)
+{
+	return false;
+}
+
+#endif /* CONFIG_MEMCG */
+
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index 7d6ac4a4fb36..89d420793048 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -5,15 +5,9 @@
 
 #include <linux/cgroup-defs.h>
 
-void memcg1_remove_from_trees(struct mem_cgroup *memcg);
-
-static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg)
-{
-	WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
-}
+/* Cgroup v1 and v2 common declarations */
 
 void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages);
-void memcg1_check_events(struct mem_cgroup *memcg, int nid);
 int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
 		     unsigned int nr_pages);
 
@@ -29,30 +23,6 @@ static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n);
 void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n);
 
-bool memcg1_wait_acct_move(struct mem_cgroup *memcg);
-struct cgroup_taskset;
-int memcg1_can_attach(struct cgroup_taskset *tset);
-void memcg1_cancel_attach(struct cgroup_taskset *tset);
-void memcg1_move_task(void);
-
-/*
- * Per memcg event counter is incremented at every pagein/pageout. With THP,
- * it will be incremented by the number of pages. This counter is used
- * to trigger some periodic events. This is straightforward and better
- * than using jiffies etc. to handle periodic memcg event.
- */
-enum mem_cgroup_events_target {
-	MEM_CGROUP_TARGET_THRESH,
-	MEM_CGROUP_TARGET_SOFTLIMIT,
-	MEM_CGROUP_NTARGETS,
-};
-
-/* Whether legacy memory+swap accounting is active */
-static bool do_memsw_account(void)
-{
-	return !cgroup_subsys_on_dfl(memory_cgrp_subsys);
-}
-
 /*
  * Iteration constructs for visiting all cgroups (under a tree).  If
  * loops are exited prematurely (break), mem_cgroup_iter_break() must
@@ -68,24 +38,28 @@ static bool do_memsw_account(void)
 	     iter != NULL;				\
 	     iter = mem_cgroup_iter(NULL, iter, NULL))
 
-void memcg1_css_offline(struct mem_cgroup *memcg);
+/* Whether legacy memory+swap accounting is active */
+static bool do_memsw_account(void)
+{
+	return !cgroup_subsys_on_dfl(memory_cgrp_subsys);
+}
 
-/* for encoding cft->private value on file */
-enum res_type {
-	_MEM,
-	_MEMSWAP,
-	_KMEM,
-	_TCP,
+/*
+ * Per memcg event counter is incremented at every pagein/pageout. With THP,
+ * it will be incremented by the number of pages. This counter is used
+ * to trigger some periodic events. This is straightforward and better
+ * than using jiffies etc. to handle periodic memcg event.
+ */
+enum mem_cgroup_events_target {
+	MEM_CGROUP_TARGET_THRESH,
+	MEM_CGROUP_TARGET_SOFTLIMIT,
+	MEM_CGROUP_NTARGETS,
 };
 
 bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
 				enum mem_cgroup_events_target target);
 unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap);
 
-bool memcg1_oom_prepare(struct mem_cgroup *memcg, bool *locked);
-void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked);
-void memcg1_oom_recover(struct mem_cgroup *memcg);
-
 void drain_all_stock(struct mem_cgroup *root_memcg);
 unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg,
 				      unsigned int lru_mask, bool tree);
@@ -100,6 +74,37 @@ unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
 unsigned long memcg_page_state_local_output(struct mem_cgroup *memcg, int item);
 int memory_stat_show(struct seq_file *m, void *v);
 
+/* Cgroup v1-specific declarations */
+
+void memcg1_remove_from_trees(struct mem_cgroup *memcg);
+
+static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg)
+{
+	WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
+}
+
+bool memcg1_wait_acct_move(struct mem_cgroup *memcg);
+
+struct cgroup_taskset;
+int memcg1_can_attach(struct cgroup_taskset *tset);
+void memcg1_cancel_attach(struct cgroup_taskset *tset);
+void memcg1_move_task(void);
+void memcg1_css_offline(struct mem_cgroup *memcg);
+
+/* for encoding cft->private value on file */
+enum res_type {
+	_MEM,
+	_MEMSWAP,
+	_KMEM,
+	_TCP,
+};
+
+bool memcg1_oom_prepare(struct mem_cgroup *memcg, bool *locked);
+void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked);
+void memcg1_oom_recover(struct mem_cgroup *memcg);
+
+void memcg1_check_events(struct mem_cgroup *memcg, int nid);
+
 void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s);
 
 extern struct cftype memsw_files[];
-- 
2.45.2



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH v2 13/14] mm: memcg: put cgroup v1-related members of task_struct under config option
  2024-06-25  0:58 [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option Roman Gushchin
                   ` (11 preceding siblings ...)
  2024-06-25  0:59 ` [PATCH v2 12/14] mm: memcg: group cgroup v1 memcg related declarations Roman Gushchin
@ 2024-06-25  0:59 ` Roman Gushchin
  2024-06-25  7:19   ` Michal Hocko
  2024-06-25  0:59 ` [PATCH v2 14/14] MAINTAINERS: add mm/memcontrol-v1.c/h to the list of maintained files Roman Gushchin
  2024-06-25 17:03 ` [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option Shakeel Butt
  14 siblings, 1 reply; 31+ messages in thread
From: Roman Gushchin @ 2024-06-25  0:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm, Roman Gushchin

Guard cgroup v1-related members of task_struct under the CONFIG_MEMCG_V1
config option, so that users who have adopted cgroup v2 don't have to
waste memory on fields which are never accessed.
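For illustration, the guarding pattern looks roughly like the sketch
below. This is a minimal standalone example rather than kernel code,
and the field name is a placeholder, not an actual task_struct member
touched by this patch:

	/*
	 * Sketch: a legacy-only member is compiled out when the option
	 * is off, so every task saves that memory. In the kernel,
	 * CONFIG_MEMCG_V1 comes from Kconfig; define it manually here
	 * to compare the two sizes.
	 */
	#include <stdio.h>

	struct task_struct_sketch {
		unsigned long pid;	/* always present */
	#ifdef CONFIG_MEMCG_V1
		void *memcg_v1_state;	/* legacy-only bookkeeping */
	#endif
	};

	int main(void)
	{
		printf("sizeof(struct task_struct_sketch) = %zu\n",
		       sizeof(struct task_struct_sketch));
		return 0;
	}

On the header side, memcontrol-v1.h pairs the guarded declarations with
empty inline stubs for the !CONFIG_MEMCG_V1 case (see the hunk below),
so call sites in memcontrol.c don't need #ifdefs of their own.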

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 include/linux/memcontrol.h |  6 +++---
 init/Kconfig               |  9 +++++++++
 mm/Makefile                |  3 ++-
 mm/memcontrol-v1.h         | 21 ++++++++++++++++++++-
 mm/memcontrol.c            | 10 +++++++---
 5 files changed, 41 insertions(+), 8 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a70d64ed04f5..796cfa842346 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1851,7 +1851,7 @@ static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
 
 /* Cgroup v1-related declarations */
 
-#ifdef CONFIG_MEMCG
+#ifdef CONFIG_MEMCG_V1
 unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order,
 					gfp_t gfp_mask,
 					unsigned long *total_scanned);
@@ -1883,7 +1883,7 @@ static inline void mem_cgroup_unlock_pages(void)
 	rcu_read_unlock();
 }
 
-#else /* CONFIG_MEMCG */
+#else /* CONFIG_MEMCG_V1 */
 static inline
 unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order,
 					gfp_t gfp_mask,
@@ -1922,6 +1922,6 @@ static inline bool mem_cgroup_oom_synchronize(bool wait)
 	return false;
 }
 
-#endif /* CONFIG_MEMCG */
+#endif /* CONFIG_MEMCG_V1 */
 
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/init/Kconfig b/init/Kconfig
index febdea2afc3b..5191b6435b4e 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -969,6 +969,15 @@ config MEMCG
 	help
 	  Provides control over the memory footprint of tasks in a cgroup.
 
+config MEMCG_V1
+	bool "Legacy memory controller"
+	depends on MEMCG
+	default n
+	help
+	  Legacy cgroup v1 memory controller.
+
+	  Say N if unsure.
+
 config MEMCG_KMEM
 	bool
 	depends on MEMCG
diff --git a/mm/Makefile b/mm/Makefile
index 124d4dea2035..d2915f8c9dc0 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -96,7 +96,8 @@ obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
-obj-$(CONFIG_MEMCG) += memcontrol.o memcontrol-v1.o vmpressure.o
+obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
+obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
 ifdef CONFIG_SWAP
 obj-$(CONFIG_MEMCG) += swap_cgroup.o
 endif
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index 89d420793048..64b053d7f131 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -75,7 +75,7 @@ unsigned long memcg_page_state_local_output(struct mem_cgroup *memcg, int item);
 int memory_stat_show(struct seq_file *m, void *v);
 
 /* Cgroup v1-specific declarations */
-
+#ifdef CONFIG_MEMCG_V1
 void memcg1_remove_from_trees(struct mem_cgroup *memcg);
 
 static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg)
@@ -110,4 +110,23 @@ void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s);
 extern struct cftype memsw_files[];
 extern struct cftype mem_cgroup_legacy_files[];
 
+#else	/* CONFIG_MEMCG_V1 */
+
+static inline void memcg1_remove_from_trees(struct mem_cgroup *memcg) {}
+static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg) {}
+static inline bool memcg1_wait_acct_move(struct mem_cgroup *memcg) { return false; }
+static inline void memcg1_css_offline(struct mem_cgroup *memcg) {}
+
+static inline bool memcg1_oom_prepare(struct mem_cgroup *memcg, bool *locked) { return true; }
+static inline void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked) {}
+static inline void memcg1_oom_recover(struct mem_cgroup *memcg) {}
+
+static inline void memcg1_check_events(struct mem_cgroup *memcg, int nid) {}
+
+static inline void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) {}
+
+extern struct cftype memsw_files[];
+extern struct cftype mem_cgroup_legacy_files[];
+#endif	/* CONFIG_MEMCG_V1 */
+
 #endif	/* __MM_MEMCONTROL_V1_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c7341e811945..d2e1f8baeae8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4471,18 +4471,20 @@ struct cgroup_subsys memory_cgrp_subsys = {
 	.css_free = mem_cgroup_css_free,
 	.css_reset = mem_cgroup_css_reset,
 	.css_rstat_flush = mem_cgroup_css_rstat_flush,
-	.can_attach = memcg1_can_attach,
 #if defined(CONFIG_LRU_GEN) || defined(CONFIG_MEMCG_KMEM)
 	.attach = mem_cgroup_attach,
 #endif
-	.cancel_attach = memcg1_cancel_attach,
-	.post_attach = memcg1_move_task,
 #ifdef CONFIG_MEMCG_KMEM
 	.fork = mem_cgroup_fork,
 	.exit = mem_cgroup_exit,
 #endif
 	.dfl_cftypes = memory_files,
+#ifdef CONFIG_MEMCG_V1
+	.can_attach = memcg1_can_attach,
+	.cancel_attach = memcg1_cancel_attach,
+	.post_attach = memcg1_move_task,
 	.legacy_cftypes = mem_cgroup_legacy_files,
+#endif
 	.early_init = 0,
 };
 
@@ -5653,7 +5655,9 @@ static int __init mem_cgroup_swap_init(void)
 		return 0;
 
 	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files));
+#ifdef CONFIG_MEMCG_V1
 	WARN_ON(cgroup_add_legacy_cftypes(&memory_cgrp_subsys, memsw_files));
+#endif
 #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
 	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, zswap_files));
 #endif
-- 
2.45.2


* [PATCH v2 14/14] MAINTAINERS: add mm/memcontrol-v1.c/h to the list of maintained files
  2024-06-25  0:58 [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option Roman Gushchin
                   ` (12 preceding siblings ...)
  2024-06-25  0:59 ` [PATCH v2 13/14] mm: memcg: put cgroup v1-related members of task_struct under config option Roman Gushchin
@ 2024-06-25  0:59 ` Roman Gushchin
  2024-06-25 17:03 ` [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option Shakeel Butt
  14 siblings, 0 replies; 31+ messages in thread
From: Roman Gushchin @ 2024-06-25  0:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm, Roman Gushchin

Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 7ad96cbb9f28..52a4089746b3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5582,6 +5582,8 @@ L:	linux-mm@kvack.org
 S:	Maintained
 F:	include/linux/memcontrol.h
 F:	mm/memcontrol.c
+F:	mm/memcontrol-v1.c
+F:	mm/memcontrol-v1.h
 F:	mm/swap_cgroup.c
 F:	samples/cgroup/*
 F:	tools/testing/selftests/cgroup/memcg_protection.m
-- 
2.45.2


* Re: [PATCH v2 01/14] mm: memcg: introduce memcontrol-v1.c
  2024-06-25  0:58 ` [PATCH v2 01/14] mm: memcg: introduce memcontrol-v1.c Roman Gushchin
@ 2024-06-25  7:05   ` Michal Hocko
  0 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2024-06-25  7:05 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm

On Mon 24-06-24 17:58:53, Roman Gushchin wrote:
> This patch introduces the mm/memcontrol-v1.c source file which will be used for
> all legacy (cgroup v1) memory cgroup code. It also introduces mm/memcontrol-v1.h
> to keep declarations shared between mm/memcontrol.c and mm/memcontrol-v1.c.
> 
> As of now, let's compile it if CONFIG_MEMCG is set, similar to mm/memcontrol.c.
> Later on it can be switched to use a separate config option, so that the legacy
> code won't be compiled if not required.

I do not feel strongly about this, but wouldn't having the new config
option here already make it easier to test?

> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>

Anyway
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/Makefile        | 3 ++-
>  mm/memcontrol-v1.c | 3 +++
>  mm/memcontrol-v1.h | 7 +++++++
>  3 files changed, 12 insertions(+), 1 deletion(-)
>  create mode 100644 mm/memcontrol-v1.c
>  create mode 100644 mm/memcontrol-v1.h
> 
> diff --git a/mm/Makefile b/mm/Makefile
> index 8fb85acda1b1..124d4dea2035 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -26,6 +26,7 @@ KCOV_INSTRUMENT_page_alloc.o := n
>  KCOV_INSTRUMENT_debug-pagealloc.o := n
>  KCOV_INSTRUMENT_kmemleak.o := n
>  KCOV_INSTRUMENT_memcontrol.o := n
> +KCOV_INSTRUMENT_memcontrol-v1.o := n
>  KCOV_INSTRUMENT_mmzone.o := n
>  KCOV_INSTRUMENT_vmstat.o := n
>  KCOV_INSTRUMENT_failslab.o := n
> @@ -95,7 +96,7 @@ obj-$(CONFIG_NUMA) += memory-tiers.o
>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> -obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
> +obj-$(CONFIG_MEMCG) += memcontrol.o memcontrol-v1.o vmpressure.o
>  ifdef CONFIG_SWAP
>  obj-$(CONFIG_MEMCG) += swap_cgroup.o
>  endif
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> new file mode 100644
> index 000000000000..a941446ba575
> --- /dev/null
> +++ b/mm/memcontrol-v1.c
> @@ -0,0 +1,3 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +#include "memcontrol-v1.h"
> diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
> new file mode 100644
> index 000000000000..7c5f094755ff
> --- /dev/null
> +++ b/mm/memcontrol-v1.h
> @@ -0,0 +1,7 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +
> +#ifndef __MM_MEMCONTROL_V1_H
> +#define __MM_MEMCONTROL_V1_H
> +
> +
> +#endif	/* __MM_MEMCONTROL_V1_H */
> -- 
> 2.45.2

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 02/14] mm: memcg: move soft limit reclaim code to memcontrol-v1.c
  2024-06-25  0:58 ` [PATCH v2 02/14] mm: memcg: move soft limit reclaim code to memcontrol-v1.c Roman Gushchin
@ 2024-06-25  7:06   ` Michal Hocko
  0 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2024-06-25  7:06 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm

On Mon 24-06-24 17:58:54, Roman Gushchin wrote:
> Soft limits are cgroup v1-specific and are not supported by cgroup v2,
> so let's move the corresponding code into memcontrol-v1.c.
> 
> Aside from simply moving the code, this commit introduces a trivial
> memcg1_soft_limit_reset() function to reset soft limits and also
> moves the global soft limit tree initialization code into a new
> memcg1_init() function.
> 
> It also moves corresponding declarations shared between memcontrol.c
> and memcontrol-v1.c into mm/memcontrol-v1.h.
> 
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>

I haven't done a line-for-line check here or in the other patches that
move a lot of code. I like that you have separated the renaming into
its own patch, because it makes review easier.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol-v1.c | 342 +++++++++++++++++++++++++++++++++++++++++++++
>  mm/memcontrol-v1.h |   7 +
>  mm/memcontrol.c    | 337 +-------------------------------------------
>  3 files changed, 353 insertions(+), 333 deletions(-)
> 
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index a941446ba575..2ccb8406fa84 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -1,3 +1,345 @@
>  // SPDX-License-Identifier: GPL-2.0-or-later
>  
> +#include <linux/memcontrol.h>
> +#include <linux/swap.h>
> +#include <linux/mm_inline.h>
> +
>  #include "memcontrol-v1.h"
> +
> +/*
> + * Cgroups above their limits are maintained in a RB-Tree, independent of
> + * their hierarchy representation
> + */
> +
> +struct mem_cgroup_tree_per_node {
> +	struct rb_root rb_root;
> +	struct rb_node *rb_rightmost;
> +	spinlock_t lock;
> +};
> +
> +struct mem_cgroup_tree {
> +	struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
> +};
> +
> +static struct mem_cgroup_tree soft_limit_tree __read_mostly;
> +
> +/*
> + * Maximum loops in mem_cgroup_soft_reclaim(), used for soft
> + * limit reclaim to prevent infinite loops, if they ever occur.
> + */
> +#define	MEM_CGROUP_MAX_RECLAIM_LOOPS		100
> +#define	MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS	2
> +
> +static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
> +					 struct mem_cgroup_tree_per_node *mctz,
> +					 unsigned long new_usage_in_excess)
> +{
> +	struct rb_node **p = &mctz->rb_root.rb_node;
> +	struct rb_node *parent = NULL;
> +	struct mem_cgroup_per_node *mz_node;
> +	bool rightmost = true;
> +
> +	if (mz->on_tree)
> +		return;
> +
> +	mz->usage_in_excess = new_usage_in_excess;
> +	if (!mz->usage_in_excess)
> +		return;
> +	while (*p) {
> +		parent = *p;
> +		mz_node = rb_entry(parent, struct mem_cgroup_per_node,
> +					tree_node);
> +		if (mz->usage_in_excess < mz_node->usage_in_excess) {
> +			p = &(*p)->rb_left;
> +			rightmost = false;
> +		} else {
> +			p = &(*p)->rb_right;
> +		}
> +	}
> +
> +	if (rightmost)
> +		mctz->rb_rightmost = &mz->tree_node;
> +
> +	rb_link_node(&mz->tree_node, parent, p);
> +	rb_insert_color(&mz->tree_node, &mctz->rb_root);
> +	mz->on_tree = true;
> +}
> +
> +static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
> +					 struct mem_cgroup_tree_per_node *mctz)
> +{
> +	if (!mz->on_tree)
> +		return;
> +
> +	if (&mz->tree_node == mctz->rb_rightmost)
> +		mctz->rb_rightmost = rb_prev(&mz->tree_node);
> +
> +	rb_erase(&mz->tree_node, &mctz->rb_root);
> +	mz->on_tree = false;
> +}
> +
> +static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
> +				       struct mem_cgroup_tree_per_node *mctz)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&mctz->lock, flags);
> +	__mem_cgroup_remove_exceeded(mz, mctz);
> +	spin_unlock_irqrestore(&mctz->lock, flags);
> +}
> +
> +static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
> +{
> +	unsigned long nr_pages = page_counter_read(&memcg->memory);
> +	unsigned long soft_limit = READ_ONCE(memcg->soft_limit);
> +	unsigned long excess = 0;
> +
> +	if (nr_pages > soft_limit)
> +		excess = nr_pages - soft_limit;
> +
> +	return excess;
> +}
> +
> +void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid)
> +{
> +	unsigned long excess;
> +	struct mem_cgroup_per_node *mz;
> +	struct mem_cgroup_tree_per_node *mctz;
> +
> +	if (lru_gen_enabled()) {
> +		if (soft_limit_excess(memcg))
> +			lru_gen_soft_reclaim(memcg, nid);
> +		return;
> +	}
> +
> +	mctz = soft_limit_tree.rb_tree_per_node[nid];
> +	if (!mctz)
> +		return;
> +	/*
> +	 * Necessary to update all ancestors when hierarchy is used.
> +	 * because their event counter is not touched.
> +	 */
> +	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> +		mz = memcg->nodeinfo[nid];
> +		excess = soft_limit_excess(memcg);
> +		/*
> +		 * We have to update the tree if mz is on RB-tree or
> +		 * mem is over its softlimit.
> +		 */
> +		if (excess || mz->on_tree) {
> +			unsigned long flags;
> +
> +			spin_lock_irqsave(&mctz->lock, flags);
> +			/* if on-tree, remove it */
> +			if (mz->on_tree)
> +				__mem_cgroup_remove_exceeded(mz, mctz);
> +			/*
> +			 * Insert again. mz->usage_in_excess will be updated.
> +			 * If excess is 0, no tree ops.
> +			 */
> +			__mem_cgroup_insert_exceeded(mz, mctz, excess);
> +			spin_unlock_irqrestore(&mctz->lock, flags);
> +		}
> +	}
> +}
> +
> +void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
> +{
> +	struct mem_cgroup_tree_per_node *mctz;
> +	struct mem_cgroup_per_node *mz;
> +	int nid;
> +
> +	for_each_node(nid) {
> +		mz = memcg->nodeinfo[nid];
> +		mctz = soft_limit_tree.rb_tree_per_node[nid];
> +		if (mctz)
> +			mem_cgroup_remove_exceeded(mz, mctz);
> +	}
> +}
> +
> +static struct mem_cgroup_per_node *
> +__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> +{
> +	struct mem_cgroup_per_node *mz;
> +
> +retry:
> +	mz = NULL;
> +	if (!mctz->rb_rightmost)
> +		goto done;		/* Nothing to reclaim from */
> +
> +	mz = rb_entry(mctz->rb_rightmost,
> +		      struct mem_cgroup_per_node, tree_node);
> +	/*
> +	 * Remove the node now but someone else can add it back,
> +	 * we will to add it back at the end of reclaim to its correct
> +	 * position in the tree.
> +	 */
> +	__mem_cgroup_remove_exceeded(mz, mctz);
> +	if (!soft_limit_excess(mz->memcg) ||
> +	    !css_tryget(&mz->memcg->css))
> +		goto retry;
> +done:
> +	return mz;
> +}
> +
> +static struct mem_cgroup_per_node *
> +mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> +{
> +	struct mem_cgroup_per_node *mz;
> +
> +	spin_lock_irq(&mctz->lock);
> +	mz = __mem_cgroup_largest_soft_limit_node(mctz);
> +	spin_unlock_irq(&mctz->lock);
> +	return mz;
> +}
> +
> +static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
> +				   pg_data_t *pgdat,
> +				   gfp_t gfp_mask,
> +				   unsigned long *total_scanned)
> +{
> +	struct mem_cgroup *victim = NULL;
> +	int total = 0;
> +	int loop = 0;
> +	unsigned long excess;
> +	unsigned long nr_scanned;
> +	struct mem_cgroup_reclaim_cookie reclaim = {
> +		.pgdat = pgdat,
> +	};
> +
> +	excess = soft_limit_excess(root_memcg);
> +
> +	while (1) {
> +		victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
> +		if (!victim) {
> +			loop++;
> +			if (loop >= 2) {
> +				/*
> +				 * If we have not been able to reclaim
> +				 * anything, it might because there are
> +				 * no reclaimable pages under this hierarchy
> +				 */
> +				if (!total)
> +					break;
> +				/*
> +				 * We want to do more targeted reclaim.
> +				 * excess >> 2 is not to excessive so as to
> +				 * reclaim too much, nor too less that we keep
> +				 * coming back to reclaim from this cgroup
> +				 */
> +				if (total >= (excess >> 2) ||
> +					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
> +					break;
> +			}
> +			continue;
> +		}
> +		total += mem_cgroup_shrink_node(victim, gfp_mask, false,
> +					pgdat, &nr_scanned);
> +		*total_scanned += nr_scanned;
> +		if (!soft_limit_excess(root_memcg))
> +			break;
> +	}
> +	mem_cgroup_iter_break(root_memcg, victim);
> +	return total;
> +}
> +
> +unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> +					    gfp_t gfp_mask,
> +					    unsigned long *total_scanned)
> +{
> +	unsigned long nr_reclaimed = 0;
> +	struct mem_cgroup_per_node *mz, *next_mz = NULL;
> +	unsigned long reclaimed;
> +	int loop = 0;
> +	struct mem_cgroup_tree_per_node *mctz;
> +	unsigned long excess;
> +
> +	if (lru_gen_enabled())
> +		return 0;
> +
> +	if (order > 0)
> +		return 0;
> +
> +	mctz = soft_limit_tree.rb_tree_per_node[pgdat->node_id];
> +
> +	/*
> +	 * Do not even bother to check the largest node if the root
> +	 * is empty. Do it lockless to prevent lock bouncing. Races
> +	 * are acceptable as soft limit is best effort anyway.
> +	 */
> +	if (!mctz || RB_EMPTY_ROOT(&mctz->rb_root))
> +		return 0;
> +
> +	/*
> +	 * This loop can run a while, specially if mem_cgroup's continuously
> +	 * keep exceeding their soft limit and putting the system under
> +	 * pressure
> +	 */
> +	do {
> +		if (next_mz)
> +			mz = next_mz;
> +		else
> +			mz = mem_cgroup_largest_soft_limit_node(mctz);
> +		if (!mz)
> +			break;
> +
> +		reclaimed = mem_cgroup_soft_reclaim(mz->memcg, pgdat,
> +						    gfp_mask, total_scanned);
> +		nr_reclaimed += reclaimed;
> +		spin_lock_irq(&mctz->lock);
> +
> +		/*
> +		 * If we failed to reclaim anything from this memory cgroup
> +		 * it is time to move on to the next cgroup
> +		 */
> +		next_mz = NULL;
> +		if (!reclaimed)
> +			next_mz = __mem_cgroup_largest_soft_limit_node(mctz);
> +
> +		excess = soft_limit_excess(mz->memcg);
> +		/*
> +		 * One school of thought says that we should not add
> +		 * back the node to the tree if reclaim returns 0.
> +		 * But our reclaim could return 0, simply because due
> +		 * to priority we are exposing a smaller subset of
> +		 * memory to reclaim from. Consider this as a longer
> +		 * term TODO.
> +		 */
> +		/* If excess == 0, no tree ops */
> +		__mem_cgroup_insert_exceeded(mz, mctz, excess);
> +		spin_unlock_irq(&mctz->lock);
> +		css_put(&mz->memcg->css);
> +		loop++;
> +		/*
> +		 * Could not reclaim anything and there are no more
> +		 * mem cgroups to try or we seem to be looping without
> +		 * reclaiming anything.
> +		 */
> +		if (!nr_reclaimed &&
> +			(next_mz == NULL ||
> +			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
> +			break;
> +	} while (!nr_reclaimed);
> +	if (next_mz)
> +		css_put(&next_mz->memcg->css);
> +	return nr_reclaimed;
> +}
> +
> +static int __init memcg1_init(void)
> +{
> +	int node;
> +
> +	for_each_node(node) {
> +		struct mem_cgroup_tree_per_node *rtpn;
> +
> +		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, node);
> +
> +		rtpn->rb_root = RB_ROOT;
> +		rtpn->rb_rightmost = NULL;
> +		spin_lock_init(&rtpn->lock);
> +		soft_limit_tree.rb_tree_per_node[node] = rtpn;
> +	}
> +
> +	return 0;
> +}
> +subsys_initcall(memcg1_init);
> diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
> index 7c5f094755ff..4da6fa561c6d 100644
> --- a/mm/memcontrol-v1.h
> +++ b/mm/memcontrol-v1.h
> @@ -3,5 +3,12 @@
>  #ifndef __MM_MEMCONTROL_V1_H
>  #define __MM_MEMCONTROL_V1_H
>  
> +void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid);
> +void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg);
> +
> +static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg)
> +{
> +	WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
> +}
>  
>  #endif	/* __MM_MEMCONTROL_V1_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 974bd160838c..003e944f34ea 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -72,6 +72,7 @@
>  #include <net/ip.h>
>  #include "slab.h"
>  #include "swap.h"
> +#include "memcontrol-v1.h"
>  
>  #include <linux/uaccess.h>
>  
> @@ -108,23 +109,6 @@ static bool do_memsw_account(void)
>  #define THRESHOLDS_EVENTS_TARGET 128
>  #define SOFTLIMIT_EVENTS_TARGET 1024
>  
> -/*
> - * Cgroups above their limits are maintained in a RB-Tree, independent of
> - * their hierarchy representation
> - */
> -
> -struct mem_cgroup_tree_per_node {
> -	struct rb_root rb_root;
> -	struct rb_node *rb_rightmost;
> -	spinlock_t lock;
> -};
> -
> -struct mem_cgroup_tree {
> -	struct mem_cgroup_tree_per_node *rb_tree_per_node[MAX_NUMNODES];
> -};
> -
> -static struct mem_cgroup_tree soft_limit_tree __read_mostly;
> -
>  /* for OOM */
>  struct mem_cgroup_eventfd_list {
>  	struct list_head list;
> @@ -199,13 +183,6 @@ static struct move_charge_struct {
>  	.waitq = __WAIT_QUEUE_HEAD_INITIALIZER(mc.waitq),
>  };
>  
> -/*
> - * Maximum loops in mem_cgroup_soft_reclaim(), used for soft
> - * limit reclaim to prevent infinite loops, if they ever occur.
> - */
> -#define	MEM_CGROUP_MAX_RECLAIM_LOOPS		100
> -#define	MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS	2
> -
>  /* for encoding cft->private value on file */
>  enum res_type {
>  	_MEM,
> @@ -413,169 +390,6 @@ ino_t page_cgroup_ino(struct page *page)
>  	return ino;
>  }
>  
> -static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
> -					 struct mem_cgroup_tree_per_node *mctz,
> -					 unsigned long new_usage_in_excess)
> -{
> -	struct rb_node **p = &mctz->rb_root.rb_node;
> -	struct rb_node *parent = NULL;
> -	struct mem_cgroup_per_node *mz_node;
> -	bool rightmost = true;
> -
> -	if (mz->on_tree)
> -		return;
> -
> -	mz->usage_in_excess = new_usage_in_excess;
> -	if (!mz->usage_in_excess)
> -		return;
> -	while (*p) {
> -		parent = *p;
> -		mz_node = rb_entry(parent, struct mem_cgroup_per_node,
> -					tree_node);
> -		if (mz->usage_in_excess < mz_node->usage_in_excess) {
> -			p = &(*p)->rb_left;
> -			rightmost = false;
> -		} else {
> -			p = &(*p)->rb_right;
> -		}
> -	}
> -
> -	if (rightmost)
> -		mctz->rb_rightmost = &mz->tree_node;
> -
> -	rb_link_node(&mz->tree_node, parent, p);
> -	rb_insert_color(&mz->tree_node, &mctz->rb_root);
> -	mz->on_tree = true;
> -}
> -
> -static void __mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
> -					 struct mem_cgroup_tree_per_node *mctz)
> -{
> -	if (!mz->on_tree)
> -		return;
> -
> -	if (&mz->tree_node == mctz->rb_rightmost)
> -		mctz->rb_rightmost = rb_prev(&mz->tree_node);
> -
> -	rb_erase(&mz->tree_node, &mctz->rb_root);
> -	mz->on_tree = false;
> -}
> -
> -static void mem_cgroup_remove_exceeded(struct mem_cgroup_per_node *mz,
> -				       struct mem_cgroup_tree_per_node *mctz)
> -{
> -	unsigned long flags;
> -
> -	spin_lock_irqsave(&mctz->lock, flags);
> -	__mem_cgroup_remove_exceeded(mz, mctz);
> -	spin_unlock_irqrestore(&mctz->lock, flags);
> -}
> -
> -static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
> -{
> -	unsigned long nr_pages = page_counter_read(&memcg->memory);
> -	unsigned long soft_limit = READ_ONCE(memcg->soft_limit);
> -	unsigned long excess = 0;
> -
> -	if (nr_pages > soft_limit)
> -		excess = nr_pages - soft_limit;
> -
> -	return excess;
> -}
> -
> -static void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid)
> -{
> -	unsigned long excess;
> -	struct mem_cgroup_per_node *mz;
> -	struct mem_cgroup_tree_per_node *mctz;
> -
> -	if (lru_gen_enabled()) {
> -		if (soft_limit_excess(memcg))
> -			lru_gen_soft_reclaim(memcg, nid);
> -		return;
> -	}
> -
> -	mctz = soft_limit_tree.rb_tree_per_node[nid];
> -	if (!mctz)
> -		return;
> -	/*
> -	 * Necessary to update all ancestors when hierarchy is used.
> -	 * because their event counter is not touched.
> -	 */
> -	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> -		mz = memcg->nodeinfo[nid];
> -		excess = soft_limit_excess(memcg);
> -		/*
> -		 * We have to update the tree if mz is on RB-tree or
> -		 * mem is over its softlimit.
> -		 */
> -		if (excess || mz->on_tree) {
> -			unsigned long flags;
> -
> -			spin_lock_irqsave(&mctz->lock, flags);
> -			/* if on-tree, remove it */
> -			if (mz->on_tree)
> -				__mem_cgroup_remove_exceeded(mz, mctz);
> -			/*
> -			 * Insert again. mz->usage_in_excess will be updated.
> -			 * If excess is 0, no tree ops.
> -			 */
> -			__mem_cgroup_insert_exceeded(mz, mctz, excess);
> -			spin_unlock_irqrestore(&mctz->lock, flags);
> -		}
> -	}
> -}
> -
> -static void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
> -{
> -	struct mem_cgroup_tree_per_node *mctz;
> -	struct mem_cgroup_per_node *mz;
> -	int nid;
> -
> -	for_each_node(nid) {
> -		mz = memcg->nodeinfo[nid];
> -		mctz = soft_limit_tree.rb_tree_per_node[nid];
> -		if (mctz)
> -			mem_cgroup_remove_exceeded(mz, mctz);
> -	}
> -}
> -
> -static struct mem_cgroup_per_node *
> -__mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> -{
> -	struct mem_cgroup_per_node *mz;
> -
> -retry:
> -	mz = NULL;
> -	if (!mctz->rb_rightmost)
> -		goto done;		/* Nothing to reclaim from */
> -
> -	mz = rb_entry(mctz->rb_rightmost,
> -		      struct mem_cgroup_per_node, tree_node);
> -	/*
> -	 * Remove the node now but someone else can add it back,
> -	 * we will to add it back at the end of reclaim to its correct
> -	 * position in the tree.
> -	 */
> -	__mem_cgroup_remove_exceeded(mz, mctz);
> -	if (!soft_limit_excess(mz->memcg) ||
> -	    !css_tryget(&mz->memcg->css))
> -		goto retry;
> -done:
> -	return mz;
> -}
> -
> -static struct mem_cgroup_per_node *
> -mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> -{
> -	struct mem_cgroup_per_node *mz;
> -
> -	spin_lock_irq(&mctz->lock);
> -	mz = __mem_cgroup_largest_soft_limit_node(mctz);
> -	spin_unlock_irq(&mctz->lock);
> -	return mz;
> -}
> -
>  /* Subset of node_stat_item for memcg stats */
>  static const unsigned int memcg_node_stat_items[] = {
>  	NR_INACTIVE_ANON,
> @@ -1980,56 +1794,6 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	return ret;
>  }
>  
> -static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
> -				   pg_data_t *pgdat,
> -				   gfp_t gfp_mask,
> -				   unsigned long *total_scanned)
> -{
> -	struct mem_cgroup *victim = NULL;
> -	int total = 0;
> -	int loop = 0;
> -	unsigned long excess;
> -	unsigned long nr_scanned;
> -	struct mem_cgroup_reclaim_cookie reclaim = {
> -		.pgdat = pgdat,
> -	};
> -
> -	excess = soft_limit_excess(root_memcg);
> -
> -	while (1) {
> -		victim = mem_cgroup_iter(root_memcg, victim, &reclaim);
> -		if (!victim) {
> -			loop++;
> -			if (loop >= 2) {
> -				/*
> -				 * If we have not been able to reclaim
> -				 * anything, it might because there are
> -				 * no reclaimable pages under this hierarchy
> -				 */
> -				if (!total)
> -					break;
> -				/*
> -				 * We want to do more targeted reclaim.
> -				 * excess >> 2 is not to excessive so as to
> -				 * reclaim too much, nor too less that we keep
> -				 * coming back to reclaim from this cgroup
> -				 */
> -				if (total >= (excess >> 2) ||
> -					(loop > MEM_CGROUP_MAX_RECLAIM_LOOPS))
> -					break;
> -			}
> -			continue;
> -		}
> -		total += mem_cgroup_shrink_node(victim, gfp_mask, false,
> -					pgdat, &nr_scanned);
> -		*total_scanned += nr_scanned;
> -		if (!soft_limit_excess(root_memcg))
> -			break;
> -	}
> -	mem_cgroup_iter_break(root_memcg, victim);
> -	return total;
> -}
> -
>  #ifdef CONFIG_LOCKDEP
>  static struct lockdep_map memcg_oom_lock_dep_map = {
>  	.name = "memcg_oom_lock",
> @@ -3925,88 +3689,6 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
>  	return ret;
>  }
>  
> -unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> -					    gfp_t gfp_mask,
> -					    unsigned long *total_scanned)
> -{
> -	unsigned long nr_reclaimed = 0;
> -	struct mem_cgroup_per_node *mz, *next_mz = NULL;
> -	unsigned long reclaimed;
> -	int loop = 0;
> -	struct mem_cgroup_tree_per_node *mctz;
> -	unsigned long excess;
> -
> -	if (lru_gen_enabled())
> -		return 0;
> -
> -	if (order > 0)
> -		return 0;
> -
> -	mctz = soft_limit_tree.rb_tree_per_node[pgdat->node_id];
> -
> -	/*
> -	 * Do not even bother to check the largest node if the root
> -	 * is empty. Do it lockless to prevent lock bouncing. Races
> -	 * are acceptable as soft limit is best effort anyway.
> -	 */
> -	if (!mctz || RB_EMPTY_ROOT(&mctz->rb_root))
> -		return 0;
> -
> -	/*
> -	 * This loop can run a while, specially if mem_cgroup's continuously
> -	 * keep exceeding their soft limit and putting the system under
> -	 * pressure
> -	 */
> -	do {
> -		if (next_mz)
> -			mz = next_mz;
> -		else
> -			mz = mem_cgroup_largest_soft_limit_node(mctz);
> -		if (!mz)
> -			break;
> -
> -		reclaimed = mem_cgroup_soft_reclaim(mz->memcg, pgdat,
> -						    gfp_mask, total_scanned);
> -		nr_reclaimed += reclaimed;
> -		spin_lock_irq(&mctz->lock);
> -
> -		/*
> -		 * If we failed to reclaim anything from this memory cgroup
> -		 * it is time to move on to the next cgroup
> -		 */
> -		next_mz = NULL;
> -		if (!reclaimed)
> -			next_mz = __mem_cgroup_largest_soft_limit_node(mctz);
> -
> -		excess = soft_limit_excess(mz->memcg);
> -		/*
> -		 * One school of thought says that we should not add
> -		 * back the node to the tree if reclaim returns 0.
> -		 * But our reclaim could return 0, simply because due
> -		 * to priority we are exposing a smaller subset of
> -		 * memory to reclaim from. Consider this as a longer
> -		 * term TODO.
> -		 */
> -		/* If excess == 0, no tree ops */
> -		__mem_cgroup_insert_exceeded(mz, mctz, excess);
> -		spin_unlock_irq(&mctz->lock);
> -		css_put(&mz->memcg->css);
> -		loop++;
> -		/*
> -		 * Could not reclaim anything and there are no more
> -		 * mem cgroups to try or we seem to be looping without
> -		 * reclaiming anything.
> -		 */
> -		if (!nr_reclaimed &&
> -			(next_mz == NULL ||
> -			loop > MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS))
> -			break;
> -	} while (!nr_reclaimed);
> -	if (next_mz)
> -		css_put(&next_mz->memcg->css);
> -	return nr_reclaimed;
> -}
> -
>  /*
>   * Reclaims as many pages from the given memcg as possible.
>   *
> @@ -5784,7 +5466,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>  		return ERR_CAST(memcg);
>  
>  	page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
> -	WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
> +	memcg1_soft_limit_reset(memcg);
>  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
>  	memcg->zswap_max = PAGE_COUNTER_MAX;
>  	WRITE_ONCE(memcg->zswap_writeback,
> @@ -5957,7 +5639,7 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
>  	page_counter_set_min(&memcg->memory, 0);
>  	page_counter_set_low(&memcg->memory, 0);
>  	page_counter_set_high(&memcg->memory, PAGE_COUNTER_MAX);
> -	WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
> +	memcg1_soft_limit_reset(memcg);
>  	page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
>  	memcg_wb_domain_size_changed(memcg);
>  }
> @@ -7984,7 +7666,7 @@ __setup("cgroup.memory=", cgroup_memory);
>   */
>  static int __init mem_cgroup_init(void)
>  {
> -	int cpu, node;
> +	int cpu;
>  
>  	/*
>  	 * Currently s32 type (can refer to struct batched_lruvec_stat) is
> @@ -8001,17 +7683,6 @@ static int __init mem_cgroup_init(void)
>  		INIT_WORK(&per_cpu_ptr(&memcg_stock, cpu)->work,
>  			  drain_local_stock);
>  
> -	for_each_node(node) {
> -		struct mem_cgroup_tree_per_node *rtpn;
> -
> -		rtpn = kzalloc_node(sizeof(*rtpn), GFP_KERNEL, node);
> -
> -		rtpn->rb_root = RB_ROOT;
> -		rtpn->rb_rightmost = NULL;
> -		spin_lock_init(&rtpn->lock);
> -		soft_limit_tree.rb_tree_per_node[node] = rtpn;
> -	}
> -
>  	return 0;
>  }
>  subsys_initcall(mem_cgroup_init);
> -- 
> 2.45.2

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 03/14] mm: memcg: rename soft limit reclaim-related functions
  2024-06-25  0:58 ` [PATCH v2 03/14] mm: memcg: rename soft limit reclaim-related functions Roman Gushchin
@ 2024-06-25  7:06   ` Michal Hocko
  0 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2024-06-25  7:06 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm

On Mon 24-06-24 17:58:55, Roman Gushchin wrote:
> Rename the exported functions related to soft limit reclaim
> to have the memcg1_ prefix.
> 
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/memcontrol.h | 12 ++++++------
>  mm/memcontrol-v1.c         |  6 +++---
>  mm/memcontrol-v1.h         |  4 ++--
>  mm/memcontrol.c            |  4 ++--
>  mm/vmscan.c                | 10 +++++-----
>  5 files changed, 18 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 7403dd5926eb..83c8327455d8 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1121,9 +1121,9 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
>  
>  void split_page_memcg(struct page *head, int old_order, int new_order);
>  
> -unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> -						gfp_t gfp_mask,
> -						unsigned long *total_scanned);
> +unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order,
> +					gfp_t gfp_mask,
> +					unsigned long *total_scanned);
>  
>  #else /* CONFIG_MEMCG */
>  
> @@ -1572,9 +1572,9 @@ static inline void split_page_memcg(struct page *head, int old_order, int new_or
>  }
>  
>  static inline
> -unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> -					    gfp_t gfp_mask,
> -					    unsigned long *total_scanned)
> +unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order,
> +					gfp_t gfp_mask,
> +					unsigned long *total_scanned)
>  {
>  	return 0;
>  }
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index 2ccb8406fa84..68e2f1a718d3 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -100,7 +100,7 @@ static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
>  	return excess;
>  }
>  
> -void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid)
> +void memcg1_update_tree(struct mem_cgroup *memcg, int nid)
>  {
>  	unsigned long excess;
>  	struct mem_cgroup_per_node *mz;
> @@ -143,7 +143,7 @@ void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid)
>  	}
>  }
>  
> -void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg)
> +void memcg1_remove_from_trees(struct mem_cgroup *memcg)
>  {
>  	struct mem_cgroup_tree_per_node *mctz;
>  	struct mem_cgroup_per_node *mz;
> @@ -243,7 +243,7 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
>  	return total;
>  }
>  
> -unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> +unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order,
>  					    gfp_t gfp_mask,
>  					    unsigned long *total_scanned)
>  {
> diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
> index 4da6fa561c6d..e37bc7e8d955 100644
> --- a/mm/memcontrol-v1.h
> +++ b/mm/memcontrol-v1.h
> @@ -3,8 +3,8 @@
>  #ifndef __MM_MEMCONTROL_V1_H
>  #define __MM_MEMCONTROL_V1_H
>  
> -void mem_cgroup_update_tree(struct mem_cgroup *memcg, int nid);
> -void mem_cgroup_remove_from_trees(struct mem_cgroup *memcg);
> +void memcg1_update_tree(struct mem_cgroup *memcg, int nid);
> +void memcg1_remove_from_trees(struct mem_cgroup *memcg);
>  
>  static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg)
>  {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 003e944f34ea..3479e1af12d5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1012,7 +1012,7 @@ static void memcg_check_events(struct mem_cgroup *memcg, int nid)
>  						MEM_CGROUP_TARGET_SOFTLIMIT);
>  		mem_cgroup_threshold(memcg);
>  		if (unlikely(do_softlimit))
> -			mem_cgroup_update_tree(memcg, nid);
> +			memcg1_update_tree(memcg, nid);
>  	}
>  }
>  
> @@ -5610,7 +5610,7 @@ static void mem_cgroup_css_free(struct cgroup_subsys_state *css)
>  
>  	vmpressure_cleanup(&memcg->vmpressure);
>  	cancel_work_sync(&memcg->high_work);
> -	mem_cgroup_remove_from_trees(memcg);
> +	memcg1_remove_from_trees(memcg);
>  	free_shrinker_info(memcg);
>  	mem_cgroup_free(memcg);
>  }
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 900bad16e506..3d4c681c6d40 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6186,9 +6186,9 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
>  			 * and balancing, not for a memcg's limit.
>  			 */
>  			nr_soft_scanned = 0;
> -			nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone->zone_pgdat,
> -						sc->order, sc->gfp_mask,
> -						&nr_soft_scanned);
> +			nr_soft_reclaimed = memcg1_soft_limit_reclaim(zone->zone_pgdat,
> +								      sc->order, sc->gfp_mask,
> +								      &nr_soft_scanned);
>  			sc->nr_reclaimed += nr_soft_reclaimed;
>  			sc->nr_scanned += nr_soft_scanned;
>  			/* need some check for avoid more shrink_zone() */
> @@ -6952,8 +6952,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
>  		/* Call soft limit reclaim before calling shrink_node. */
>  		sc.nr_scanned = 0;
>  		nr_soft_scanned = 0;
> -		nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(pgdat, sc.order,
> -						sc.gfp_mask, &nr_soft_scanned);
> +		nr_soft_reclaimed = memcg1_soft_limit_reclaim(pgdat, sc.order,
> +							      sc.gfp_mask, &nr_soft_scanned);
>  		sc.nr_reclaimed += nr_soft_reclaimed;
>  
>  		/*
> -- 
> 2.45.2

-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v2 04/14] mm: memcg: move charge migration code to memcontrol-v1.c
  2024-06-25  0:58 ` [PATCH v2 04/14] mm: memcg: move charge migration code to memcontrol-v1.c Roman Gushchin
@ 2024-06-25  7:07   ` Michal Hocko
  0 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2024-06-25  7:07 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm

On Mon 24-06-24 17:58:56, Roman Gushchin wrote:
> Unlike the legacy cgroup v1 memory controller, the cgroup v2 memory
> controller doesn't support moving charged pages between cgroups.
> 
> It's a fairly large and complicated piece of code which has created
> a number of problems in the past. Let's move this code into
> memcontrol-v1.c. It shaves off 1k lines from memcontrol.c. It's also
> another step towards making the legacy memory controller code
> optionally compiled.

Acked-by: Michal Hocko <mhocko@suse.com>

> 
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---
>  mm/memcontrol-v1.c |  981 +++++++++++++++++++++++++++++++++++++++++++
>  mm/memcontrol-v1.h |   30 ++
>  mm/memcontrol.c    | 1004 +-------------------------------------------
>  3 files changed, 1019 insertions(+), 996 deletions(-)
> 
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index 68e2f1a718d3..f4c8bec5ae1b 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -3,7 +3,12 @@
>  #include <linux/memcontrol.h>
>  #include <linux/swap.h>
>  #include <linux/mm_inline.h>
> +#include <linux/pagewalk.h>
> +#include <linux/backing-dev.h>
> +#include <linux/swap_cgroup.h>
>  
> +#include "internal.h"
> +#include "swap.h"
>  #include "memcontrol-v1.h"
>  
>  /*
> @@ -30,6 +35,31 @@ static struct mem_cgroup_tree soft_limit_tree __read_mostly;
>  #define	MEM_CGROUP_MAX_RECLAIM_LOOPS		100
>  #define	MEM_CGROUP_MAX_SOFT_LIMIT_RECLAIM_LOOPS	2
>  
> +/* Stuffs for move charges at task migration. */
> +/*
> + * Types of charges to be moved.
> + */
> +#define MOVE_ANON	0x1U
> +#define MOVE_FILE	0x2U
> +#define MOVE_MASK	(MOVE_ANON | MOVE_FILE)
> +
> +/* "mc" and its members are protected by cgroup_mutex */
> +static struct move_charge_struct {
> +	spinlock_t	  lock; /* for from, to */
> +	struct mm_struct  *mm;
> +	struct mem_cgroup *from;
> +	struct mem_cgroup *to;
> +	unsigned long flags;
> +	unsigned long precharge;
> +	unsigned long moved_charge;
> +	unsigned long moved_swap;
> +	struct task_struct *moving_task;	/* a task moving charges */
> +	wait_queue_head_t waitq;		/* a waitq for other context */
> +} mc = {
> +	.lock = __SPIN_LOCK_UNLOCKED(mc.lock),
> +	.waitq = __WAIT_QUEUE_HEAD_INITIALIZER(mc.waitq),
> +};
> +
>  static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
>  					 struct mem_cgroup_tree_per_node *mctz,
>  					 unsigned long new_usage_in_excess)
> @@ -325,6 +355,957 @@ unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order,
>  	return nr_reclaimed;
>  }
>  
> +/*
> + * A routine for checking "mem" is under move_account() or not.
> + *
> + * Checking a cgroup is mc.from or mc.to or under hierarchy of
> + * moving cgroups. This is for waiting at high-memory pressure
> + * caused by "move".
> + */
> +static bool mem_cgroup_under_move(struct mem_cgroup *memcg)
> +{
> +	struct mem_cgroup *from;
> +	struct mem_cgroup *to;
> +	bool ret = false;
> +	/*
> +	 * Unlike task_move routines, we access mc.to, mc.from not under
> +	 * mutual exclusion by cgroup_mutex. Here, we take spinlock instead.
> +	 */
> +	spin_lock(&mc.lock);
> +	from = mc.from;
> +	to = mc.to;
> +	if (!from)
> +		goto unlock;
> +
> +	ret = mem_cgroup_is_descendant(from, memcg) ||
> +		mem_cgroup_is_descendant(to, memcg);
> +unlock:
> +	spin_unlock(&mc.lock);
> +	return ret;
> +}
> +
> +bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
> +{
> +	if (mc.moving_task && current != mc.moving_task) {
> +		if (mem_cgroup_under_move(memcg)) {
> +			DEFINE_WAIT(wait);
> +			prepare_to_wait(&mc.waitq, &wait, TASK_INTERRUPTIBLE);
> +			/* moving charge context might have finished. */
> +			if (mc.moving_task)
> +				schedule();
> +			finish_wait(&mc.waitq, &wait);
> +			return true;
> +		}
> +	}
> +	return false;
> +}
> +
> +/**
> + * folio_memcg_lock - Bind a folio to its memcg.
> + * @folio: The folio.
> + *
> + * This function prevents unlocked LRU folios from being moved to
> + * another cgroup.
> + *
> + * It ensures lifetime of the bound memcg.  The caller is responsible
> + * for the lifetime of the folio.
> + */
> +void folio_memcg_lock(struct folio *folio)
> +{
> +	struct mem_cgroup *memcg;
> +	unsigned long flags;
> +
> +	/*
> +	 * The RCU lock is held throughout the transaction.  The fast
> +	 * path can get away without acquiring the memcg->move_lock
> +	 * because page moving starts with an RCU grace period.
> +         */
> +	rcu_read_lock();
> +
> +	if (mem_cgroup_disabled())
> +		return;
> +again:
> +	memcg = folio_memcg(folio);
> +	if (unlikely(!memcg))
> +		return;
> +
> +#ifdef CONFIG_PROVE_LOCKING
> +	local_irq_save(flags);
> +	might_lock(&memcg->move_lock);
> +	local_irq_restore(flags);
> +#endif
> +
> +	if (atomic_read(&memcg->moving_account) <= 0)
> +		return;
> +
> +	spin_lock_irqsave(&memcg->move_lock, flags);
> +	if (memcg != folio_memcg(folio)) {
> +		spin_unlock_irqrestore(&memcg->move_lock, flags);
> +		goto again;
> +	}
> +
> +	/*
> +	 * When charge migration first begins, we can have multiple
> +	 * critical sections holding the fast-path RCU lock and one
> +	 * holding the slowpath move_lock. Track the task who has the
> +	 * move_lock for folio_memcg_unlock().
> +	 */
> +	memcg->move_lock_task = current;
> +	memcg->move_lock_flags = flags;
> +}
> +
> +static void __folio_memcg_unlock(struct mem_cgroup *memcg)
> +{
> +	if (memcg && memcg->move_lock_task == current) {
> +		unsigned long flags = memcg->move_lock_flags;
> +
> +		memcg->move_lock_task = NULL;
> +		memcg->move_lock_flags = 0;
> +
> +		spin_unlock_irqrestore(&memcg->move_lock, flags);
> +	}
> +
> +	rcu_read_unlock();
> +}
> +
> +/**
> + * folio_memcg_unlock - Release the binding between a folio and its memcg.
> + * @folio: The folio.
> + *
> + * This releases the binding created by folio_memcg_lock().  This does
> + * not change the accounting of this folio to its memcg, but it does
> + * permit others to change it.
> + */
> +void folio_memcg_unlock(struct folio *folio)
> +{
> +	__folio_memcg_unlock(folio_memcg(folio));
> +}
> +
> +#ifdef CONFIG_SWAP
> +/**
> + * mem_cgroup_move_swap_account - move swap charge and swap_cgroup's record.
> + * @entry: swap entry to be moved
> + * @from:  mem_cgroup which the entry is moved from
> + * @to:  mem_cgroup which the entry is moved to
> + *
> + * It succeeds only when the swap_cgroup's record for this entry is the same
> + * as the mem_cgroup's id of @from.
> + *
> + * Returns 0 on success, -EINVAL on failure.
> + *
> + * The caller must have charged to @to, IOW, called page_counter_charge() about
> + * both res and memsw, and called css_get().
> + */
> +static int mem_cgroup_move_swap_account(swp_entry_t entry,
> +				struct mem_cgroup *from, struct mem_cgroup *to)
> +{
> +	unsigned short old_id, new_id;
> +
> +	old_id = mem_cgroup_id(from);
> +	new_id = mem_cgroup_id(to);
> +
> +	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
> +		mod_memcg_state(from, MEMCG_SWAP, -1);
> +		mod_memcg_state(to, MEMCG_SWAP, 1);
> +		return 0;
> +	}
> +	return -EINVAL;
> +}
> +#else
> +static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
> +				struct mem_cgroup *from, struct mem_cgroup *to)
> +{
> +	return -EINVAL;
> +}
> +#endif
> +
> +u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css,
> +				struct cftype *cft)
> +{
> +	return mem_cgroup_from_css(css)->move_charge_at_immigrate;
> +}
> +
> +#ifdef CONFIG_MMU
> +int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
> +				 struct cftype *cft, u64 val)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +
> +	pr_warn_once("Cgroup memory moving (move_charge_at_immigrate) is deprecated. "
> +		     "Please report your usecase to linux-mm@kvack.org if you "
> +		     "depend on this functionality.\n");
> +
> +	if (val & ~MOVE_MASK)
> +		return -EINVAL;
> +
> +	/*
> +	 * No kind of locking is needed in here, because ->can_attach() will
> +	 * check this value once in the beginning of the process, and then carry
> +	 * on with stale data. This means that changes to this value will only
> +	 * affect task migrations starting after the change.
> +	 */
> +	memcg->move_charge_at_immigrate = val;
> +	return 0;
> +}
> +#else
> +int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
> +				 struct cftype *cft, u64 val)
> +{
> +	return -ENOSYS;
> +}
> +#endif
> +
> +#ifdef CONFIG_MMU
> +/* Handlers for move charge at task migration. */
> +static int mem_cgroup_do_precharge(unsigned long count)
> +{
> +	int ret;
> +
> +	/* Try a single bulk charge without reclaim first, kswapd may wake */
> +	ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_DIRECT_RECLAIM, count);
> +	if (!ret) {
> +		mc.precharge += count;
> +		return ret;
> +	}
> +
> +	/* Try charges one by one with reclaim, but do not retry */
> +	while (count--) {
> +		ret = try_charge(mc.to, GFP_KERNEL | __GFP_NORETRY, 1);
> +		if (ret)
> +			return ret;
> +		mc.precharge++;
> +		cond_resched();
> +	}
> +	return 0;
> +}
> +
> +union mc_target {
> +	struct folio	*folio;
> +	swp_entry_t	ent;
> +};
> +
> +enum mc_target_type {
> +	MC_TARGET_NONE = 0,
> +	MC_TARGET_PAGE,
> +	MC_TARGET_SWAP,
> +	MC_TARGET_DEVICE,
> +};
> +
> +static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
> +						unsigned long addr, pte_t ptent)
> +{
> +	struct page *page = vm_normal_page(vma, addr, ptent);
> +
> +	if (!page)
> +		return NULL;
> +	if (PageAnon(page)) {
> +		if (!(mc.flags & MOVE_ANON))
> +			return NULL;
> +	} else {
> +		if (!(mc.flags & MOVE_FILE))
> +			return NULL;
> +	}
> +	get_page(page);
> +
> +	return page;
> +}
> +
> +#if defined(CONFIG_SWAP) || defined(CONFIG_DEVICE_PRIVATE)
> +static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
> +			pte_t ptent, swp_entry_t *entry)
> +{
> +	struct page *page = NULL;
> +	swp_entry_t ent = pte_to_swp_entry(ptent);
> +
> +	if (!(mc.flags & MOVE_ANON))
> +		return NULL;
> +
> +	/*
> +	 * Handle device private pages that are not accessible by the CPU, but
> +	 * stored as special swap entries in the page table.
> +	 */
> +	if (is_device_private_entry(ent)) {
> +		page = pfn_swap_entry_to_page(ent);
> +		if (!get_page_unless_zero(page))
> +			return NULL;
> +		return page;
> +	}
> +
> +	if (non_swap_entry(ent))
> +		return NULL;
> +
> +	/*
> +	 * Because swap_cache_get_folio() updates some statistics counter,
> +	 * we call find_get_page() with swapper_space directly.
> +	 */
> +	page = find_get_page(swap_address_space(ent), swap_cache_index(ent));
> +	entry->val = ent.val;
> +
> +	return page;
> +}
> +#else
> +static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
> +			pte_t ptent, swp_entry_t *entry)
> +{
> +	return NULL;
> +}
> +#endif
> +
> +static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
> +			unsigned long addr, pte_t ptent)
> +{
> +	unsigned long index;
> +	struct folio *folio;
> +
> +	if (!vma->vm_file) /* anonymous vma */
> +		return NULL;
> +	if (!(mc.flags & MOVE_FILE))
> +		return NULL;
> +
> +	/* folio is moved even if it's not RSS of this task(page-faulted). */
> +	/* shmem/tmpfs may report page out on swap: account for that too. */
> +	index = linear_page_index(vma, addr);
> +	folio = filemap_get_incore_folio(vma->vm_file->f_mapping, index);
> +	if (IS_ERR(folio))
> +		return NULL;
> +	return folio_file_page(folio, index);
> +}
> +
> +/**
> + * mem_cgroup_move_account - move account of the folio
> + * @folio: The folio.
> + * @compound: charge the page as compound or small page
> + * @from: mem_cgroup which the folio is moved from.
> + * @to:	mem_cgroup which the folio is moved to. @from != @to.
> + *
> + * The folio must be locked and not on the LRU.
> + *
> + * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
> + * from old cgroup.
> + */
> +static int mem_cgroup_move_account(struct folio *folio,
> +				   bool compound,
> +				   struct mem_cgroup *from,
> +				   struct mem_cgroup *to)
> +{
> +	struct lruvec *from_vec, *to_vec;
> +	struct pglist_data *pgdat;
> +	unsigned int nr_pages = compound ? folio_nr_pages(folio) : 1;
> +	int nid, ret;
> +
> +	VM_BUG_ON(from == to);
> +	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> +	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> +	VM_BUG_ON(compound && !folio_test_large(folio));
> +
> +	ret = -EINVAL;
> +	if (folio_memcg(folio) != from)
> +		goto out;
> +
> +	pgdat = folio_pgdat(folio);
> +	from_vec = mem_cgroup_lruvec(from, pgdat);
> +	to_vec = mem_cgroup_lruvec(to, pgdat);
> +
> +	folio_memcg_lock(folio);
> +
> +	if (folio_test_anon(folio)) {
> +		if (folio_mapped(folio)) {
> +			__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
> +			__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
> +			if (folio_test_pmd_mappable(folio)) {
> +				__mod_lruvec_state(from_vec, NR_ANON_THPS,
> +						   -nr_pages);
> +				__mod_lruvec_state(to_vec, NR_ANON_THPS,
> +						   nr_pages);
> +			}
> +		}
> +	} else {
> +		__mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages);
> +		__mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages);
> +
> +		if (folio_test_swapbacked(folio)) {
> +			__mod_lruvec_state(from_vec, NR_SHMEM, -nr_pages);
> +			__mod_lruvec_state(to_vec, NR_SHMEM, nr_pages);
> +		}
> +
> +		if (folio_mapped(folio)) {
> +			__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
> +			__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
> +		}
> +
> +		if (folio_test_dirty(folio)) {
> +			struct address_space *mapping = folio_mapping(folio);
> +
> +			if (mapping_can_writeback(mapping)) {
> +				__mod_lruvec_state(from_vec, NR_FILE_DIRTY,
> +						   -nr_pages);
> +				__mod_lruvec_state(to_vec, NR_FILE_DIRTY,
> +						   nr_pages);
> +			}
> +		}
> +	}
> +
> +#ifdef CONFIG_SWAP
> +	if (folio_test_swapcache(folio)) {
> +		__mod_lruvec_state(from_vec, NR_SWAPCACHE, -nr_pages);
> +		__mod_lruvec_state(to_vec, NR_SWAPCACHE, nr_pages);
> +	}
> +#endif
> +	if (folio_test_writeback(folio)) {
> +		__mod_lruvec_state(from_vec, NR_WRITEBACK, -nr_pages);
> +		__mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages);
> +	}
> +
> +	/*
> +	 * All state has been migrated, let's switch to the new memcg.
> +	 *
> +	 * It is safe to change page's memcg here because the page
> +	 * is referenced, charged, isolated, and locked: we can't race
> +	 * with (un)charging, migration, LRU putback, or anything else
> +	 * that would rely on a stable page's memory cgroup.
> +	 *
> +	 * Note that folio_memcg_lock is a memcg lock, not a page lock,
> +	 * to save space. As soon as we switch page's memory cgroup to a
> +	 * new memcg that isn't locked, the above state can change
> +	 * concurrently again. Make sure we're truly done with it.
> +	 */
> +	smp_mb();
> +
> +	css_get(&to->css);
> +	css_put(&from->css);
> +
> +	folio->memcg_data = (unsigned long)to;
> +
> +	__folio_memcg_unlock(from);
> +
> +	ret = 0;
> +	nid = folio_nid(folio);
> +
> +	local_irq_disable();
> +	mem_cgroup_charge_statistics(to, nr_pages);
> +	memcg_check_events(to, nid);
> +	mem_cgroup_charge_statistics(from, -nr_pages);
> +	memcg_check_events(from, nid);
> +	local_irq_enable();
> +out:
> +	return ret;
> +}
> +
> +/**
> + * get_mctgt_type - get target type of moving charge
> + * @vma: the vma the pte to be checked belongs
> + * @addr: the address corresponding to the pte to be checked
> + * @ptent: the pte to be checked
> + * @target: the pointer the target page or swap ent will be stored(can be NULL)
> + *
> + * Context: Called with pte lock held.
> + * Return:
> + * * MC_TARGET_NONE - If the pte is not a target for move charge.
> + * * MC_TARGET_PAGE - If the page corresponding to this pte is a target for
> + *   move charge. If @target is not NULL, the folio is stored in target->folio
> + *   with extra refcnt taken (Caller should release it).
> + * * MC_TARGET_SWAP - If the swap entry corresponding to this pte is a
> + *   target for charge migration.  If @target is not NULL, the entry is
> + *   stored in target->ent.
> + * * MC_TARGET_DEVICE - Like MC_TARGET_PAGE but page is device memory and
> + *   thus not on the lru.  For now such page is charged like a regular page
> + *   would be as it is just special memory taking the place of a regular page.
> + *   See Documentations/vm/hmm.txt and include/linux/hmm.h
> + */
> +static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
> +		unsigned long addr, pte_t ptent, union mc_target *target)
> +{
> +	struct page *page = NULL;
> +	struct folio *folio;
> +	enum mc_target_type ret = MC_TARGET_NONE;
> +	swp_entry_t ent = { .val = 0 };
> +
> +	if (pte_present(ptent))
> +		page = mc_handle_present_pte(vma, addr, ptent);
> +	else if (pte_none_mostly(ptent))
> +		/*
> +		 * PTE markers should be treated as a none pte here, separated
> +		 * from other swap handling below.
> +		 */
> +		page = mc_handle_file_pte(vma, addr, ptent);
> +	else if (is_swap_pte(ptent))
> +		page = mc_handle_swap_pte(vma, ptent, &ent);
> +
> +	if (page)
> +		folio = page_folio(page);
> +	if (target && page) {
> +		if (!folio_trylock(folio)) {
> +			folio_put(folio);
> +			return ret;
> +		}
> +		/*
> +		 * page_mapped() must be stable during the move. This
> +		 * pte is locked, so if it's present, the page cannot
> +		 * become unmapped. If it isn't, we have only partial
> +		 * control over the mapped state: the page lock will
> +		 * prevent new faults against pagecache and swapcache,
> +		 * so an unmapped page cannot become mapped. However,
> +		 * if the page is already mapped elsewhere, it can
> +		 * unmap, and there is nothing we can do about it.
> +		 * Alas, skip moving the page in this case.
> +		 */
> +		if (!pte_present(ptent) && page_mapped(page)) {
> +			folio_unlock(folio);
> +			folio_put(folio);
> +			return ret;
> +		}
> +	}
> +
> +	if (!page && !ent.val)
> +		return ret;
> +	if (page) {
> +		/*
> +		 * Do only a loose check w/o serialization.
> +		 * mem_cgroup_move_account() checks whether the page is
> +		 * valid under LRU exclusion.
> +		 */
> +		if (folio_memcg(folio) == mc.from) {
> +			ret = MC_TARGET_PAGE;
> +			if (folio_is_device_private(folio) ||
> +			    folio_is_device_coherent(folio))
> +				ret = MC_TARGET_DEVICE;
> +			if (target)
> +				target->folio = folio;
> +		}
> +		if (!ret || !target) {
> +			if (target)
> +				folio_unlock(folio);
> +			folio_put(folio);
> +		}
> +	}
> +	/*
> +	 * There is a swap entry and the page either doesn't exist or isn't
> +	 * charged.  But we cannot move a tail page of a THP.
> +	 */
> +	if (ent.val && !ret && (!page || !PageTransCompound(page)) &&
> +	    mem_cgroup_id(mc.from) == lookup_swap_cgroup_id(ent)) {
> +		ret = MC_TARGET_SWAP;
> +		if (target)
> +			target->ent = ent;
> +	}
> +	return ret;
> +}
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +/*
> + * We don't consider PMD mapped swapping or file mapped pages because THP does
> + * not support them for now.
> + * Caller should make sure that pmd_trans_huge(pmd) is true.
> + */
> +static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
> +		unsigned long addr, pmd_t pmd, union mc_target *target)
> +{
> +	struct page *page = NULL;
> +	struct folio *folio;
> +	enum mc_target_type ret = MC_TARGET_NONE;
> +
> +	if (unlikely(is_swap_pmd(pmd))) {
> +		VM_BUG_ON(thp_migration_supported() &&
> +				  !is_pmd_migration_entry(pmd));
> +		return ret;
> +	}
> +	page = pmd_page(pmd);
> +	VM_BUG_ON_PAGE(!page || !PageHead(page), page);
> +	folio = page_folio(page);
> +	if (!(mc.flags & MOVE_ANON))
> +		return ret;
> +	if (folio_memcg(folio) == mc.from) {
> +		ret = MC_TARGET_PAGE;
> +		if (target) {
> +			folio_get(folio);
> +			if (!folio_trylock(folio)) {
> +				folio_put(folio);
> +				return MC_TARGET_NONE;
> +			}
> +			target->folio = folio;
> +		}
> +	}
> +	return ret;
> +}
> +#else
> +static inline enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
> +		unsigned long addr, pmd_t pmd, union mc_target *target)
> +{
> +	return MC_TARGET_NONE;
> +}
> +#endif
> +
> +static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
> +					unsigned long addr, unsigned long end,
> +					struct mm_walk *walk)
> +{
> +	struct vm_area_struct *vma = walk->vma;
> +	pte_t *pte;
> +	spinlock_t *ptl;
> +
> +	ptl = pmd_trans_huge_lock(pmd, vma);
> +	if (ptl) {
> +		/*
> +		 * Note there can not be MC_TARGET_DEVICE for now as we do not
> +		 * support transparent huge page with MEMORY_DEVICE_PRIVATE but
> +		 * this might change.
> +		 */
> +		if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
> +			mc.precharge += HPAGE_PMD_NR;
> +		spin_unlock(ptl);
> +		return 0;
> +	}
> +
> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	if (!pte)
> +		return 0;
> +	for (; addr != end; pte++, addr += PAGE_SIZE)
> +		if (get_mctgt_type(vma, addr, ptep_get(pte), NULL))
> +			mc.precharge++;	/* increment precharge temporarily */
> +	pte_unmap_unlock(pte - 1, ptl);
> +	cond_resched();
> +
> +	return 0;
> +}
> +
> +static const struct mm_walk_ops precharge_walk_ops = {
> +	.pmd_entry	= mem_cgroup_count_precharge_pte_range,
> +	.walk_lock	= PGWALK_RDLOCK,
> +};
> +
> +static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
> +{
> +	unsigned long precharge;
> +
> +	mmap_read_lock(mm);
> +	walk_page_range(mm, 0, ULONG_MAX, &precharge_walk_ops, NULL);
> +	mmap_read_unlock(mm);
> +
> +	precharge = mc.precharge;
> +	mc.precharge = 0;
> +
> +	return precharge;
> +}
> +
> +static int mem_cgroup_precharge_mc(struct mm_struct *mm)
> +{
> +	unsigned long precharge = mem_cgroup_count_precharge(mm);
> +
> +	VM_BUG_ON(mc.moving_task);
> +	mc.moving_task = current;
> +	return mem_cgroup_do_precharge(precharge);
> +}
> +
> +/* cancels all extra charges on mc.from and mc.to, and wakes up all waiters. */
> +static void __mem_cgroup_clear_mc(void)
> +{
> +	struct mem_cgroup *from = mc.from;
> +	struct mem_cgroup *to = mc.to;
> +
> +	/* we must uncharge all the leftover precharges from mc.to */
> +	if (mc.precharge) {
> +		mem_cgroup_cancel_charge(mc.to, mc.precharge);
> +		mc.precharge = 0;
> +	}
> +	/*
> +	 * we didn't uncharge from mc.from at mem_cgroup_move_account(), so
> +	 * we must uncharge here.
> +	 */
> +	if (mc.moved_charge) {
> +		mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
> +		mc.moved_charge = 0;
> +	}
> +	/* we must fixup refcnts and charges */
> +	if (mc.moved_swap) {
> +		/* uncharge swap account from the old cgroup */
> +		if (!mem_cgroup_is_root(mc.from))
> +			page_counter_uncharge(&mc.from->memsw, mc.moved_swap);
> +
> +		mem_cgroup_id_put_many(mc.from, mc.moved_swap);
> +
> +		/*
> +		 * we charged both to->memory and to->memsw, so we
> +		 * should uncharge to->memory.
> +		 */
> +		if (!mem_cgroup_is_root(mc.to))
> +			page_counter_uncharge(&mc.to->memory, mc.moved_swap);
> +
> +		mc.moved_swap = 0;
> +	}
> +	memcg_oom_recover(from);
> +	memcg_oom_recover(to);
> +	wake_up_all(&mc.waitq);
> +}
> +
> +static void mem_cgroup_clear_mc(void)
> +{
> +	struct mm_struct *mm = mc.mm;
> +
> +	/*
> +	 * we must clear moving_task before waking up waiters at the end of
> +	 * task migration.
> +	 */
> +	mc.moving_task = NULL;
> +	__mem_cgroup_clear_mc();
> +	spin_lock(&mc.lock);
> +	mc.from = NULL;
> +	mc.to = NULL;
> +	mc.mm = NULL;
> +	spin_unlock(&mc.lock);
> +
> +	mmput(mm);
> +}
> +
> +int mem_cgroup_can_attach(struct cgroup_taskset *tset)
> +{
> +	struct cgroup_subsys_state *css;
> +	struct mem_cgroup *memcg = NULL; /* unneeded init to make gcc happy */
> +	struct mem_cgroup *from;
> +	struct task_struct *leader, *p;
> +	struct mm_struct *mm;
> +	unsigned long move_flags;
> +	int ret = 0;
> +
> +	/* charge immigration isn't supported on the default hierarchy */
> +	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
> +		return 0;
> +
> +	/*
> +	 * Multi-process migrations only happen on the default hierarchy
> +	 * where charge immigration is not used.  Perform charge
> +	 * immigration if @tset contains a leader and whine if there are
> +	 * multiple.
> +	 */
> +	p = NULL;
> +	cgroup_taskset_for_each_leader(leader, css, tset) {
> +		WARN_ON_ONCE(p);
> +		p = leader;
> +		memcg = mem_cgroup_from_css(css);
> +	}
> +	if (!p)
> +		return 0;
> +
> +	/*
> +	 * We are now committed to this value whatever it is. Changes in this
> +	 * tunable will only affect upcoming migrations, not the current one.
> +	 * So we need to save it, and keep it going.
> +	 */
> +	move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
> +	if (!move_flags)
> +		return 0;
> +
> +	from = mem_cgroup_from_task(p);
> +
> +	VM_BUG_ON(from == memcg);
> +
> +	mm = get_task_mm(p);
> +	if (!mm)
> +		return 0;
> +	/* We move charges only when we move an owner of the mm */
> +	if (mm->owner == p) {
> +		VM_BUG_ON(mc.from);
> +		VM_BUG_ON(mc.to);
> +		VM_BUG_ON(mc.precharge);
> +		VM_BUG_ON(mc.moved_charge);
> +		VM_BUG_ON(mc.moved_swap);
> +
> +		spin_lock(&mc.lock);
> +		mc.mm = mm;
> +		mc.from = from;
> +		mc.to = memcg;
> +		mc.flags = move_flags;
> +		spin_unlock(&mc.lock);
> +		/* We set mc.moving_task later */
> +
> +		ret = mem_cgroup_precharge_mc(mm);
> +		if (ret)
> +			mem_cgroup_clear_mc();
> +	} else {
> +		mmput(mm);
> +	}
> +	return ret;
> +}
> +
> +void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
> +{
> +	if (mc.to)
> +		mem_cgroup_clear_mc();
> +}
> +
> +static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
> +				unsigned long addr, unsigned long end,
> +				struct mm_walk *walk)
> +{
> +	int ret = 0;
> +	struct vm_area_struct *vma = walk->vma;
> +	pte_t *pte;
> +	spinlock_t *ptl;
> +	enum mc_target_type target_type;
> +	union mc_target target;
> +	struct folio *folio;
> +
> +	ptl = pmd_trans_huge_lock(pmd, vma);
> +	if (ptl) {
> +		if (mc.precharge < HPAGE_PMD_NR) {
> +			spin_unlock(ptl);
> +			return 0;
> +		}
> +		target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
> +		if (target_type == MC_TARGET_PAGE) {
> +			folio = target.folio;
> +			if (folio_isolate_lru(folio)) {
> +				if (!mem_cgroup_move_account(folio, true,
> +							     mc.from, mc.to)) {
> +					mc.precharge -= HPAGE_PMD_NR;
> +					mc.moved_charge += HPAGE_PMD_NR;
> +				}
> +				folio_putback_lru(folio);
> +			}
> +			folio_unlock(folio);
> +			folio_put(folio);
> +		} else if (target_type == MC_TARGET_DEVICE) {
> +			folio = target.folio;
> +			if (!mem_cgroup_move_account(folio, true,
> +						     mc.from, mc.to)) {
> +				mc.precharge -= HPAGE_PMD_NR;
> +				mc.moved_charge += HPAGE_PMD_NR;
> +			}
> +			folio_unlock(folio);
> +			folio_put(folio);
> +		}
> +		spin_unlock(ptl);
> +		return 0;
> +	}
> +
> +retry:
> +	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +	if (!pte)
> +		return 0;
> +	for (; addr != end; addr += PAGE_SIZE) {
> +		pte_t ptent = ptep_get(pte++);
> +		bool device = false;
> +		swp_entry_t ent;
> +
> +		if (!mc.precharge)
> +			break;
> +
> +		switch (get_mctgt_type(vma, addr, ptent, &target)) {
> +		case MC_TARGET_DEVICE:
> +			device = true;
> +			fallthrough;
> +		case MC_TARGET_PAGE:
> +			folio = target.folio;
> +			/*
> +			 * We can have a part of the split pmd here. Moving it
> +			 * can be done but it would be too convoluted so simply
> +			 * ignore such a partial THP and keep it in original
> +			 * memcg. There should be somebody mapping the head.
> +			 */
> +			if (folio_test_large(folio))
> +				goto put;
> +			if (!device && !folio_isolate_lru(folio))
> +				goto put;
> +			if (!mem_cgroup_move_account(folio, false,
> +						mc.from, mc.to)) {
> +				mc.precharge--;
> +				/* we uncharge from mc.from later. */
> +				mc.moved_charge++;
> +			}
> +			if (!device)
> +				folio_putback_lru(folio);
> +put:			/* get_mctgt_type() gets & locks the page */
> +			folio_unlock(folio);
> +			folio_put(folio);
> +			break;
> +		case MC_TARGET_SWAP:
> +			ent = target.ent;
> +			if (!mem_cgroup_move_swap_account(ent, mc.from, mc.to)) {
> +				mc.precharge--;
> +				mem_cgroup_id_get_many(mc.to, 1);
> +				/* we fixup other refcnts and charges later. */
> +				mc.moved_swap++;
> +			}
> +			break;
> +		default:
> +			break;
> +		}
> +	}
> +	pte_unmap_unlock(pte - 1, ptl);
> +	cond_resched();
> +
> +	if (addr != end) {
> +		/*
> +		 * We have consumed all precharges we got in can_attach().
> +		 * We try to charge one by one, but we don't do any additional
> +		 * charging of mc.to if we have already failed to charge once
> +		 * in the attach() phase.
> +		 */
> +		ret = mem_cgroup_do_precharge(1);
> +		if (!ret)
> +			goto retry;
> +	}
> +
> +	return ret;
> +}
> +
> +static const struct mm_walk_ops charge_walk_ops = {
> +	.pmd_entry	= mem_cgroup_move_charge_pte_range,
> +	.walk_lock	= PGWALK_RDLOCK,
> +};
> +
> +static void mem_cgroup_move_charge(void)
> +{
> +	lru_add_drain_all();
> +	/*
> +	 * Signal folio_memcg_lock() to take the memcg's move_lock
> +	 * while we're moving its pages to another memcg. Then wait
> +	 * for already started RCU-only updates to finish.
> +	 */
> +	atomic_inc(&mc.from->moving_account);
> +	synchronize_rcu();
> +retry:
> +	if (unlikely(!mmap_read_trylock(mc.mm))) {
> +		/*
> +		 * Someone who is holding the mmap_lock might be waiting in
> +		 * the waitq. So we cancel all extra charges, wake up all waiters,
> +		 * and retry. Because we cancel precharges, we might not be able
> +		 * to move enough charges, but moving charge is a best-effort
> +		 * feature anyway, so it wouldn't be a big problem.
> +		 */
> +		__mem_cgroup_clear_mc();
> +		cond_resched();
> +		goto retry;
> +	}
> +	/*
> +	 * When we have consumed all precharges and failed in doing
> +	 * additional charge, the page walk just aborts.
> +	 */
> +	walk_page_range(mc.mm, 0, ULONG_MAX, &charge_walk_ops, NULL);
> +	mmap_read_unlock(mc.mm);
> +	atomic_dec(&mc.from->moving_account);
> +}
> +
> +void mem_cgroup_move_task(void)
> +{
> +	if (mc.to) {
> +		mem_cgroup_move_charge();
> +		mem_cgroup_clear_mc();
> +	}
> +}
> +
> +#else	/* !CONFIG_MMU */
> +static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
> +{
> +	return 0;
> +}
> +static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
> +{
> +}
> +static void mem_cgroup_move_task(void)
> +{
> +}
> +#endif
> +
>  static int __init memcg1_init(void)
>  {
>  	int node;
> diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
> index e37bc7e8d955..55e7c4f90c39 100644
> --- a/mm/memcontrol-v1.h
> +++ b/mm/memcontrol-v1.h
> @@ -11,4 +11,34 @@ static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg)
>  	WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
>  }
>  
> +void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages);
> +void memcg_check_events(struct mem_cgroup *memcg, int nid);
> +void memcg_oom_recover(struct mem_cgroup *memcg);
> +int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> +		     unsigned int nr_pages);
> +
> +static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> +			     unsigned int nr_pages)
> +{
> +	if (mem_cgroup_is_root(memcg))
> +		return 0;
> +
> +	return try_charge_memcg(memcg, gfp_mask, nr_pages);
> +}
> +
> +void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n);
> +void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n);
> +
> +bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg);
> +struct cgroup_taskset;
> +int mem_cgroup_can_attach(struct cgroup_taskset *tset);
> +void mem_cgroup_cancel_attach(struct cgroup_taskset *tset);
> +void mem_cgroup_move_task(void);
> +
> +struct cftype;
> +u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css,
> +				struct cftype *cft);
> +int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
> +				 struct cftype *cft, u64 val);
> +
>  #endif	/* __MM_MEMCONTROL_V1_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3479e1af12d5..3332c89cae2e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -28,7 +28,6 @@
>  #include <linux/page_counter.h>
>  #include <linux/memcontrol.h>
>  #include <linux/cgroup.h>
> -#include <linux/pagewalk.h>
>  #include <linux/sched/mm.h>
>  #include <linux/shmem_fs.h>
>  #include <linux/hugetlb.h>
> @@ -45,7 +44,6 @@
>  #include <linux/mutex.h>
>  #include <linux/rbtree.h>
>  #include <linux/slab.h>
> -#include <linux/swap.h>
>  #include <linux/swapops.h>
>  #include <linux/spinlock.h>
>  #include <linux/eventfd.h>
> @@ -71,7 +69,6 @@
>  #include <net/sock.h>
>  #include <net/ip.h>
>  #include "slab.h"
> -#include "swap.h"
>  #include "memcontrol-v1.h"
>  
>  #include <linux/uaccess.h>
> @@ -158,31 +155,6 @@ struct mem_cgroup_event {
>  static void mem_cgroup_threshold(struct mem_cgroup *memcg);
>  static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
>  
> -/* Stuff for moving charges at task migration. */
> -/*
> - * Types of charges to be moved.
> - */
> -#define MOVE_ANON	0x1U
> -#define MOVE_FILE	0x2U
> -#define MOVE_MASK	(MOVE_ANON | MOVE_FILE)
> -
> -/* "mc" and its members are protected by cgroup_mutex */
> -static struct move_charge_struct {
> -	spinlock_t	  lock; /* for from, to */
> -	struct mm_struct  *mm;
> -	struct mem_cgroup *from;
> -	struct mem_cgroup *to;
> -	unsigned long flags;
> -	unsigned long precharge;
> -	unsigned long moved_charge;
> -	unsigned long moved_swap;
> -	struct task_struct *moving_task;	/* a task moving charges */
> -	wait_queue_head_t waitq;		/* a waitq for other context */
> -} mc = {
> -	.lock = __SPIN_LOCK_UNLOCKED(mc.lock),
> -	.waitq = __WAIT_QUEUE_HEAD_INITIALIZER(mc.waitq),
> -};
> -
>  /* for encoding cft->private value on file */
>  enum res_type {
>  	_MEM,
> @@ -955,8 +927,7 @@ static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
>  	return READ_ONCE(memcg->vmstats->events_local[i]);
>  }
>  
> -static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
> -					 int nr_pages)
> +void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages)
>  {
>  	/* pagein of a big page is an event. So, ignore page size */
>  	if (nr_pages > 0)
> @@ -998,7 +969,7 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
>   * Check events in order.
>   *
>   */
> -static void memcg_check_events(struct mem_cgroup *memcg, int nid)
> +void memcg_check_events(struct mem_cgroup *memcg, int nid)
>  {
>  	if (IS_ENABLED(CONFIG_PREEMPT_RT))
>  		return;
> @@ -1467,51 +1438,6 @@ static unsigned long mem_cgroup_margin(struct mem_cgroup *memcg)
>  	return margin;
>  }
>  
> -/*
> - * A routine for checking whether "mem" is under move_account() or not.
> - *
> - * Checking a cgroup is mc.from or mc.to or under hierarchy of
> - * moving cgroups. This is for waiting at high-memory pressure
> - * caused by "move".
> - */
> -static bool mem_cgroup_under_move(struct mem_cgroup *memcg)
> -{
> -	struct mem_cgroup *from;
> -	struct mem_cgroup *to;
> -	bool ret = false;
> -	/*
> -	 * Unlike task_move routines, we access mc.to, mc.from not under
> -	 * mutual exclusion by cgroup_mutex. Here, we take spinlock instead.
> -	 */
> -	spin_lock(&mc.lock);
> -	from = mc.from;
> -	to = mc.to;
> -	if (!from)
> -		goto unlock;
> -
> -	ret = mem_cgroup_is_descendant(from, memcg) ||
> -		mem_cgroup_is_descendant(to, memcg);
> -unlock:
> -	spin_unlock(&mc.lock);
> -	return ret;
> -}
> -
> -static bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
> -{
> -	if (mc.moving_task && current != mc.moving_task) {
> -		if (mem_cgroup_under_move(memcg)) {
> -			DEFINE_WAIT(wait);
> -			prepare_to_wait(&mc.waitq, &wait, TASK_INTERRUPTIBLE);
> -			/* moving charge context might have finished. */
> -			if (mc.moving_task)
> -				schedule();
> -			finish_wait(&mc.waitq, &wait);
> -			return true;
> -		}
> -	}
> -	return false;
> -}
> -
>  struct memory_stat {
>  	const char *name;
>  	unsigned int idx;
> @@ -1904,7 +1830,7 @@ static int memcg_oom_wake_function(wait_queue_entry_t *wait,
>  	return autoremove_wake_function(wait, mode, sync, arg);
>  }
>  
> -static void memcg_oom_recover(struct mem_cgroup *memcg)
> +void memcg_oom_recover(struct mem_cgroup *memcg)
>  {
>  	/*
>  	 * For the following lockless ->under_oom test, the only required
> @@ -2093,87 +2019,6 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
>  	pr_cont(" are going to be killed due to memory.oom.group set\n");
>  }
>  
> -/**
> - * folio_memcg_lock - Bind a folio to its memcg.
> - * @folio: The folio.
> - *
> - * This function prevents unlocked LRU folios from being moved to
> - * another cgroup.
> - *
> - * It ensures lifetime of the bound memcg.  The caller is responsible
> - * for the lifetime of the folio.
> - */
> -void folio_memcg_lock(struct folio *folio)
> -{
> -	struct mem_cgroup *memcg;
> -	unsigned long flags;
> -
> -	/*
> -	 * The RCU lock is held throughout the transaction.  The fast
> -	 * path can get away without acquiring the memcg->move_lock
> -	 * because page moving starts with an RCU grace period.
> -	 */
> -	rcu_read_lock();
> -
> -	if (mem_cgroup_disabled())
> -		return;
> -again:
> -	memcg = folio_memcg(folio);
> -	if (unlikely(!memcg))
> -		return;
> -
> -#ifdef CONFIG_PROVE_LOCKING
> -	local_irq_save(flags);
> -	might_lock(&memcg->move_lock);
> -	local_irq_restore(flags);
> -#endif
> -
> -	if (atomic_read(&memcg->moving_account) <= 0)
> -		return;
> -
> -	spin_lock_irqsave(&memcg->move_lock, flags);
> -	if (memcg != folio_memcg(folio)) {
> -		spin_unlock_irqrestore(&memcg->move_lock, flags);
> -		goto again;
> -	}
> -
> -	/*
> -	 * When charge migration first begins, we can have multiple
> -	 * critical sections holding the fast-path RCU lock and one
> -	 * holding the slowpath move_lock. Track the task who has the
> -	 * move_lock for folio_memcg_unlock().
> -	 */
> -	memcg->move_lock_task = current;
> -	memcg->move_lock_flags = flags;
> -}
> -
> -static void __folio_memcg_unlock(struct mem_cgroup *memcg)
> -{
> -	if (memcg && memcg->move_lock_task == current) {
> -		unsigned long flags = memcg->move_lock_flags;
> -
> -		memcg->move_lock_task = NULL;
> -		memcg->move_lock_flags = 0;
> -
> -		spin_unlock_irqrestore(&memcg->move_lock, flags);
> -	}
> -
> -	rcu_read_unlock();
> -}
> -
> -/**
> - * folio_memcg_unlock - Release the binding between a folio and its memcg.
> - * @folio: The folio.
> - *
> - * This releases the binding created by folio_memcg_lock().  This does
> - * not change the accounting of this folio to its memcg, but it does
> - * permit others to change it.
> - */
> -void folio_memcg_unlock(struct folio *folio)
> -{
> -	__folio_memcg_unlock(folio_memcg(folio));
> -}
> -
>  struct memcg_stock_pcp {
>  	local_lock_t stock_lock;
>  	struct mem_cgroup *cached; /* this never be root cgroup */
> @@ -2653,8 +2498,8 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
>  	css_put(&memcg->css);
>  }
>  
> -static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> -			unsigned int nr_pages)
> +int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
> +		     unsigned int nr_pages)
>  {
>  	unsigned int batch = max(MEMCG_CHARGE_BATCH, nr_pages);
>  	int nr_retries = MAX_RECLAIM_RETRIES;
> @@ -2849,15 +2694,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	return 0;
>  }
>  
> -static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> -			     unsigned int nr_pages)
> -{
> -	if (mem_cgroup_is_root(memcg))
> -		return 0;
> -
> -	return try_charge_memcg(memcg, gfp_mask, nr_pages);
> -}
> -
>  /**
>   * mem_cgroup_cancel_charge() - cancel an uncommitted try_charge() call.
>   * @memcg: memcg previously charged.
> @@ -3595,43 +3431,6 @@ void split_page_memcg(struct page *head, int old_order, int new_order)
>  		css_get_many(&memcg->css, old_nr / new_nr - 1);
>  }
>  
> -#ifdef CONFIG_SWAP
> -/**
> - * mem_cgroup_move_swap_account - move swap charge and swap_cgroup's record.
> - * @entry: swap entry to be moved
> - * @from:  mem_cgroup which the entry is moved from
> - * @to:  mem_cgroup which the entry is moved to
> - *
> - * It succeeds only when the swap_cgroup's record for this entry is the same
> - * as the mem_cgroup's id of @from.
> - *
> - * Returns 0 on success, -EINVAL on failure.
> - *
> - * The caller must have charged to @to, IOW, called page_counter_charge() about
> - * both res and memsw, and called css_get().
> - */
> -static int mem_cgroup_move_swap_account(swp_entry_t entry,
> -				struct mem_cgroup *from, struct mem_cgroup *to)
> -{
> -	unsigned short old_id, new_id;
> -
> -	old_id = mem_cgroup_id(from);
> -	new_id = mem_cgroup_id(to);
> -
> -	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
> -		mod_memcg_state(from, MEMCG_SWAP, -1);
> -		mod_memcg_state(to, MEMCG_SWAP, 1);
> -		return 0;
> -	}
> -	return -EINVAL;
> -}
> -#else
> -static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
> -				struct mem_cgroup *from, struct mem_cgroup *to)
> -{
> -	return -EINVAL;
> -}
> -#endif
>  
>  static DEFINE_MUTEX(memcg_max_mutex);
>  
> @@ -4015,42 +3814,6 @@ static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf,
>  	return nbytes;
>  }
>  
> -static u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css,
> -					struct cftype *cft)
> -{
> -	return mem_cgroup_from_css(css)->move_charge_at_immigrate;
> -}
> -
> -#ifdef CONFIG_MMU
> -static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
> -					struct cftype *cft, u64 val)
> -{
> -	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> -
> -	pr_warn_once("Cgroup memory moving (move_charge_at_immigrate) is deprecated. "
> -		     "Please report your usecase to linux-mm@kvack.org if you "
> -		     "depend on this functionality.\n");
> -
> -	if (val & ~MOVE_MASK)
> -		return -EINVAL;
> -
> -	/*
> -	 * No kind of locking is needed in here, because ->can_attach() will
> -	 * check this value once in the beginning of the process, and then carry
> -	 * on with stale data. This means that changes to this value will only
> -	 * affect task migrations starting after the change.
> -	 */
> -	memcg->move_charge_at_immigrate = val;
> -	return 0;
> -}
> -#else
> -static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
> -					struct cftype *cft, u64 val)
> -{
> -	return -ENOSYS;
> -}
> -#endif
> -
>  #ifdef CONFIG_NUMA
>  
>  #define LRU_ALL_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE))
> @@ -5261,13 +5024,13 @@ static void mem_cgroup_id_remove(struct mem_cgroup *memcg)
>  	}
>  }
>  
> -static void __maybe_unused mem_cgroup_id_get_many(struct mem_cgroup *memcg,
> -						  unsigned int n)
> +void __maybe_unused mem_cgroup_id_get_many(struct mem_cgroup *memcg,
> +					   unsigned int n)
>  {
>  	refcount_add(n, &memcg->id.ref);
>  }
>  
> -static void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n)
> +void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n)
>  {
>  	if (refcount_sub_and_test(n, &memcg->id.ref)) {
>  		mem_cgroup_id_remove(memcg);
> @@ -5747,757 +5510,6 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
>  		atomic64_set(&memcg->vmstats->stats_updates, 0);
>  }
>  
> -#ifdef CONFIG_MMU
> -/* Handlers for move charge at task migration. */
> -static int mem_cgroup_do_precharge(unsigned long count)
> -{
> -	int ret;
> -
> -	/* Try a single bulk charge without reclaim first, kswapd may wake */
> -	ret = try_charge(mc.to, GFP_KERNEL & ~__GFP_DIRECT_RECLAIM, count);
> -	if (!ret) {
> -		mc.precharge += count;
> -		return ret;
> -	}
> -
> -	/* Try charges one by one with reclaim, but do not retry */
> -	while (count--) {
> -		ret = try_charge(mc.to, GFP_KERNEL | __GFP_NORETRY, 1);
> -		if (ret)
> -			return ret;
> -		mc.precharge++;
> -		cond_resched();
> -	}
> -	return 0;
> -}
> -
> -union mc_target {
> -	struct folio	*folio;
> -	swp_entry_t	ent;
> -};
> -
> -enum mc_target_type {
> -	MC_TARGET_NONE = 0,
> -	MC_TARGET_PAGE,
> -	MC_TARGET_SWAP,
> -	MC_TARGET_DEVICE,
> -};
> -
> -static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
> -						unsigned long addr, pte_t ptent)
> -{
> -	struct page *page = vm_normal_page(vma, addr, ptent);
> -
> -	if (!page)
> -		return NULL;
> -	if (PageAnon(page)) {
> -		if (!(mc.flags & MOVE_ANON))
> -			return NULL;
> -	} else {
> -		if (!(mc.flags & MOVE_FILE))
> -			return NULL;
> -	}
> -	get_page(page);
> -
> -	return page;
> -}
> -
> -#if defined(CONFIG_SWAP) || defined(CONFIG_DEVICE_PRIVATE)
> -static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
> -			pte_t ptent, swp_entry_t *entry)
> -{
> -	struct page *page = NULL;
> -	swp_entry_t ent = pte_to_swp_entry(ptent);
> -
> -	if (!(mc.flags & MOVE_ANON))
> -		return NULL;
> -
> -	/*
> -	 * Handle device private pages that are not accessible by the CPU, but
> -	 * stored as special swap entries in the page table.
> -	 */
> -	if (is_device_private_entry(ent)) {
> -		page = pfn_swap_entry_to_page(ent);
> -		if (!get_page_unless_zero(page))
> -			return NULL;
> -		return page;
> -	}
> -
> -	if (non_swap_entry(ent))
> -		return NULL;
> -
> -	/*
> -	 * Because swap_cache_get_folio() updates some statistics counter,
> -	 * we call find_get_page() with swapper_space directly.
> -	 */
> -	page = find_get_page(swap_address_space(ent), swap_cache_index(ent));
> -	entry->val = ent.val;
> -
> -	return page;
> -}
> -#else
> -static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
> -			pte_t ptent, swp_entry_t *entry)
> -{
> -	return NULL;
> -}
> -#endif
> -
> -static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
> -			unsigned long addr, pte_t ptent)
> -{
> -	unsigned long index;
> -	struct folio *folio;
> -
> -	if (!vma->vm_file) /* anonymous vma */
> -		return NULL;
> -	if (!(mc.flags & MOVE_FILE))
> -		return NULL;
> -
> -	/* folio is moved even if it's not RSS of this task (page-faulted). */
> -	/* shmem/tmpfs may report page out on swap: account for that too. */
> -	index = linear_page_index(vma, addr);
> -	folio = filemap_get_incore_folio(vma->vm_file->f_mapping, index);
> -	if (IS_ERR(folio))
> -		return NULL;
> -	return folio_file_page(folio, index);
> -}
> -
> -/**
> - * mem_cgroup_move_account - move account of the folio
> - * @folio: The folio.
> - * @compound: charge the page as compound or small page
> - * @from: mem_cgroup which the folio is moved from.
> - * @to:	mem_cgroup which the folio is moved to. @from != @to.
> - *
> - * The folio must be locked and not on the LRU.
> - *
> - * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
> - * from old cgroup.
> - */
> -static int mem_cgroup_move_account(struct folio *folio,
> -				   bool compound,
> -				   struct mem_cgroup *from,
> -				   struct mem_cgroup *to)
> -{
> -	struct lruvec *from_vec, *to_vec;
> -	struct pglist_data *pgdat;
> -	unsigned int nr_pages = compound ? folio_nr_pages(folio) : 1;
> -	int nid, ret;
> -
> -	VM_BUG_ON(from == to);
> -	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> -	VM_BUG_ON_FOLIO(folio_test_lru(folio), folio);
> -	VM_BUG_ON(compound && !folio_test_large(folio));
> -
> -	ret = -EINVAL;
> -	if (folio_memcg(folio) != from)
> -		goto out;
> -
> -	pgdat = folio_pgdat(folio);
> -	from_vec = mem_cgroup_lruvec(from, pgdat);
> -	to_vec = mem_cgroup_lruvec(to, pgdat);
> -
> -	folio_memcg_lock(folio);
> -
> -	if (folio_test_anon(folio)) {
> -		if (folio_mapped(folio)) {
> -			__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
> -			__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
> -			if (folio_test_pmd_mappable(folio)) {
> -				__mod_lruvec_state(from_vec, NR_ANON_THPS,
> -						   -nr_pages);
> -				__mod_lruvec_state(to_vec, NR_ANON_THPS,
> -						   nr_pages);
> -			}
> -		}
> -	} else {
> -		__mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages);
> -		__mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages);
> -
> -		if (folio_test_swapbacked(folio)) {
> -			__mod_lruvec_state(from_vec, NR_SHMEM, -nr_pages);
> -			__mod_lruvec_state(to_vec, NR_SHMEM, nr_pages);
> -		}
> -
> -		if (folio_mapped(folio)) {
> -			__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
> -			__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
> -		}
> -
> -		if (folio_test_dirty(folio)) {
> -			struct address_space *mapping = folio_mapping(folio);
> -
> -			if (mapping_can_writeback(mapping)) {
> -				__mod_lruvec_state(from_vec, NR_FILE_DIRTY,
> -						   -nr_pages);
> -				__mod_lruvec_state(to_vec, NR_FILE_DIRTY,
> -						   nr_pages);
> -			}
> -		}
> -	}
> -
> -#ifdef CONFIG_SWAP
> -	if (folio_test_swapcache(folio)) {
> -		__mod_lruvec_state(from_vec, NR_SWAPCACHE, -nr_pages);
> -		__mod_lruvec_state(to_vec, NR_SWAPCACHE, nr_pages);
> -	}
> -#endif
> -	if (folio_test_writeback(folio)) {
> -		__mod_lruvec_state(from_vec, NR_WRITEBACK, -nr_pages);
> -		__mod_lruvec_state(to_vec, NR_WRITEBACK, nr_pages);
> -	}
> -
> -	/*
> -	 * All state has been migrated, let's switch to the new memcg.
> -	 *
> -	 * It is safe to change page's memcg here because the page
> -	 * is referenced, charged, isolated, and locked: we can't race
> -	 * with (un)charging, migration, LRU putback, or anything else
> -	 * that would rely on a stable page's memory cgroup.
> -	 *
> -	 * Note that folio_memcg_lock is a memcg lock, not a page lock,
> -	 * to save space. As soon as we switch page's memory cgroup to a
> -	 * new memcg that isn't locked, the above state can change
> -	 * concurrently again. Make sure we're truly done with it.
> -	 */
> -	smp_mb();
> -
> -	css_get(&to->css);
> -	css_put(&from->css);
> -
> -	folio->memcg_data = (unsigned long)to;
> -
> -	__folio_memcg_unlock(from);
> -
> -	ret = 0;
> -	nid = folio_nid(folio);
> -
> -	local_irq_disable();
> -	mem_cgroup_charge_statistics(to, nr_pages);
> -	memcg_check_events(to, nid);
> -	mem_cgroup_charge_statistics(from, -nr_pages);
> -	memcg_check_events(from, nid);
> -	local_irq_enable();
> -out:
> -	return ret;
> -}
> -
> -/**
> - * get_mctgt_type - get target type of moving charge
> - * @vma: the vma to which the pte to be checked belongs
> - * @addr: the address corresponding to the pte to be checked
> - * @ptent: the pte to be checked
> - * @target: the pointer where the target folio or swap entry will be stored (can be NULL)
> - *
> - * Context: Called with pte lock held.
> - * Return:
> - * * MC_TARGET_NONE - If the pte is not a target for move charge.
> - * * MC_TARGET_PAGE - If the page corresponding to this pte is a target for
> - *   move charge. If @target is not NULL, the folio is stored in target->folio
> - *   with extra refcnt taken (Caller should release it).
> - * * MC_TARGET_SWAP - If the swap entry corresponding to this pte is a
> - *   target for charge migration.  If @target is not NULL, the entry is
> - *   stored in target->ent.
> - * * MC_TARGET_DEVICE - Like MC_TARGET_PAGE but page is device memory and
> - *   thus not on the lru.  For now such a page is charged like a regular
> - *   page would be, as it is just special memory taking the place of a
> - *   regular page.  See Documentation/mm/hmm.rst and include/linux/hmm.h
> - */
> -static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
> -		unsigned long addr, pte_t ptent, union mc_target *target)
> -{
> -	struct page *page = NULL;
> -	struct folio *folio;
> -	enum mc_target_type ret = MC_TARGET_NONE;
> -	swp_entry_t ent = { .val = 0 };
> -
> -	if (pte_present(ptent))
> -		page = mc_handle_present_pte(vma, addr, ptent);
> -	else if (pte_none_mostly(ptent))
> -		/*
> -		 * PTE markers should be treated as a none pte here, separated
> -		 * from other swap handling below.
> -		 */
> -		page = mc_handle_file_pte(vma, addr, ptent);
> -	else if (is_swap_pte(ptent))
> -		page = mc_handle_swap_pte(vma, ptent, &ent);
> -
> -	if (page)
> -		folio = page_folio(page);
> -	if (target && page) {
> -		if (!folio_trylock(folio)) {
> -			folio_put(folio);
> -			return ret;
> -		}
> -		/*
> -		 * page_mapped() must be stable during the move. This
> -		 * pte is locked, so if it's present, the page cannot
> -		 * become unmapped. If it isn't, we have only partial
> -		 * control over the mapped state: the page lock will
> -		 * prevent new faults against pagecache and swapcache,
> -		 * so an unmapped page cannot become mapped. However,
> -		 * if the page is already mapped elsewhere, it can
> -		 * unmap, and there is nothing we can do about it.
> -		 * Alas, skip moving the page in this case.
> -		 */
> -		if (!pte_present(ptent) && page_mapped(page)) {
> -			folio_unlock(folio);
> -			folio_put(folio);
> -			return ret;
> -		}
> -	}
> -
> -	if (!page && !ent.val)
> -		return ret;
> -	if (page) {
> -		/*
> -		 * Do only a loose check w/o serialization.
> -		 * mem_cgroup_move_account() checks whether the page is
> -		 * valid under LRU exclusion.
> -		 */
> -		if (folio_memcg(folio) == mc.from) {
> -			ret = MC_TARGET_PAGE;
> -			if (folio_is_device_private(folio) ||
> -			    folio_is_device_coherent(folio))
> -				ret = MC_TARGET_DEVICE;
> -			if (target)
> -				target->folio = folio;
> -		}
> -		if (!ret || !target) {
> -			if (target)
> -				folio_unlock(folio);
> -			folio_put(folio);
> -		}
> -	}
> -	/*
> -	 * There is a swap entry and the page either doesn't exist or isn't
> -	 * charged.  But we cannot move a tail page of a THP.
> -	 */
> -	if (ent.val && !ret && (!page || !PageTransCompound(page)) &&
> -	    mem_cgroup_id(mc.from) == lookup_swap_cgroup_id(ent)) {
> -		ret = MC_TARGET_SWAP;
> -		if (target)
> -			target->ent = ent;
> -	}
> -	return ret;
> -}
> -
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -/*
> - * We don't consider PMD mapped swapping or file mapped pages because THP does
> - * not support them for now.
> - * Caller should make sure that pmd_trans_huge(pmd) is true.
> - */
> -static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
> -		unsigned long addr, pmd_t pmd, union mc_target *target)
> -{
> -	struct page *page = NULL;
> -	struct folio *folio;
> -	enum mc_target_type ret = MC_TARGET_NONE;
> -
> -	if (unlikely(is_swap_pmd(pmd))) {
> -		VM_BUG_ON(thp_migration_supported() &&
> -				  !is_pmd_migration_entry(pmd));
> -		return ret;
> -	}
> -	page = pmd_page(pmd);
> -	VM_BUG_ON_PAGE(!page || !PageHead(page), page);
> -	folio = page_folio(page);
> -	if (!(mc.flags & MOVE_ANON))
> -		return ret;
> -	if (folio_memcg(folio) == mc.from) {
> -		ret = MC_TARGET_PAGE;
> -		if (target) {
> -			folio_get(folio);
> -			if (!folio_trylock(folio)) {
> -				folio_put(folio);
> -				return MC_TARGET_NONE;
> -			}
> -			target->folio = folio;
> -		}
> -	}
> -	return ret;
> -}
> -#else
> -static inline enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
> -		unsigned long addr, pmd_t pmd, union mc_target *target)
> -{
> -	return MC_TARGET_NONE;
> -}
> -#endif
> -
> -static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
> -					unsigned long addr, unsigned long end,
> -					struct mm_walk *walk)
> -{
> -	struct vm_area_struct *vma = walk->vma;
> -	pte_t *pte;
> -	spinlock_t *ptl;
> -
> -	ptl = pmd_trans_huge_lock(pmd, vma);
> -	if (ptl) {
> -		/*
> -		 * Note there can not be MC_TARGET_DEVICE for now as we do not
> -		 * support transparent huge page with MEMORY_DEVICE_PRIVATE but
> -		 * this might change.
> -		 */
> -		if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
> -			mc.precharge += HPAGE_PMD_NR;
> -		spin_unlock(ptl);
> -		return 0;
> -	}
> -
> -	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> -	if (!pte)
> -		return 0;
> -	for (; addr != end; pte++, addr += PAGE_SIZE)
> -		if (get_mctgt_type(vma, addr, ptep_get(pte), NULL))
> -			mc.precharge++;	/* increment precharge temporarily */
> -	pte_unmap_unlock(pte - 1, ptl);
> -	cond_resched();
> -
> -	return 0;
> -}
> -
> -static const struct mm_walk_ops precharge_walk_ops = {
> -	.pmd_entry	= mem_cgroup_count_precharge_pte_range,
> -	.walk_lock	= PGWALK_RDLOCK,
> -};
> -
> -static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
> -{
> -	unsigned long precharge;
> -
> -	mmap_read_lock(mm);
> -	walk_page_range(mm, 0, ULONG_MAX, &precharge_walk_ops, NULL);
> -	mmap_read_unlock(mm);
> -
> -	precharge = mc.precharge;
> -	mc.precharge = 0;
> -
> -	return precharge;
> -}
> -
> -static int mem_cgroup_precharge_mc(struct mm_struct *mm)
> -{
> -	unsigned long precharge = mem_cgroup_count_precharge(mm);
> -
> -	VM_BUG_ON(mc.moving_task);
> -	mc.moving_task = current;
> -	return mem_cgroup_do_precharge(precharge);
> -}
> -
> -/* cancels all extra charges on mc.from and mc.to, and wakes up all waiters. */
> -static void __mem_cgroup_clear_mc(void)
> -{
> -	struct mem_cgroup *from = mc.from;
> -	struct mem_cgroup *to = mc.to;
> -
> -	/* we must uncharge all the leftover precharges from mc.to */
> -	if (mc.precharge) {
> -		mem_cgroup_cancel_charge(mc.to, mc.precharge);
> -		mc.precharge = 0;
> -	}
> -	/*
> -	 * we didn't uncharge from mc.from at mem_cgroup_move_account(), so
> -	 * we must uncharge here.
> -	 */
> -	if (mc.moved_charge) {
> -		mem_cgroup_cancel_charge(mc.from, mc.moved_charge);
> -		mc.moved_charge = 0;
> -	}
> -	/* we must fixup refcnts and charges */
> -	if (mc.moved_swap) {
> -		/* uncharge swap account from the old cgroup */
> -		if (!mem_cgroup_is_root(mc.from))
> -			page_counter_uncharge(&mc.from->memsw, mc.moved_swap);
> -
> -		mem_cgroup_id_put_many(mc.from, mc.moved_swap);
> -
> -		/*
> -		 * we charged both to->memory and to->memsw, so we
> -		 * should uncharge to->memory.
> -		 */
> -		if (!mem_cgroup_is_root(mc.to))
> -			page_counter_uncharge(&mc.to->memory, mc.moved_swap);
> -
> -		mc.moved_swap = 0;
> -	}
> -	memcg_oom_recover(from);
> -	memcg_oom_recover(to);
> -	wake_up_all(&mc.waitq);
> -}
> -
> -static void mem_cgroup_clear_mc(void)
> -{
> -	struct mm_struct *mm = mc.mm;
> -
> -	/*
> -	 * we must clear moving_task before waking up waiters at the end of
> -	 * task migration.
> -	 */
> -	mc.moving_task = NULL;
> -	__mem_cgroup_clear_mc();
> -	spin_lock(&mc.lock);
> -	mc.from = NULL;
> -	mc.to = NULL;
> -	mc.mm = NULL;
> -	spin_unlock(&mc.lock);
> -
> -	mmput(mm);
> -}
> -
> -static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
> -{
> -	struct cgroup_subsys_state *css;
> -	struct mem_cgroup *memcg = NULL; /* unneeded init to make gcc happy */
> -	struct mem_cgroup *from;
> -	struct task_struct *leader, *p;
> -	struct mm_struct *mm;
> -	unsigned long move_flags;
> -	int ret = 0;
> -
> -	/* charge immigration isn't supported on the default hierarchy */
> -	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
> -		return 0;
> -
> -	/*
> -	 * Multi-process migrations only happen on the default hierarchy
> -	 * where charge immigration is not used.  Perform charge
> -	 * immigration if @tset contains a leader and whine if there are
> -	 * multiple.
> -	 */
> -	p = NULL;
> -	cgroup_taskset_for_each_leader(leader, css, tset) {
> -		WARN_ON_ONCE(p);
> -		p = leader;
> -		memcg = mem_cgroup_from_css(css);
> -	}
> -	if (!p)
> -		return 0;
> -
> -	/*
> -	 * We are now committed to this value whatever it is. Changes in this
> -	 * tunable will only affect upcoming migrations, not the current one.
> -	 * So we need to save it, and keep it going.
> -	 */
> -	move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
> -	if (!move_flags)
> -		return 0;
> -
> -	from = mem_cgroup_from_task(p);
> -
> -	VM_BUG_ON(from == memcg);
> -
> -	mm = get_task_mm(p);
> -	if (!mm)
> -		return 0;
> -	/* We move charges only when we move an owner of the mm */
> -	if (mm->owner == p) {
> -		VM_BUG_ON(mc.from);
> -		VM_BUG_ON(mc.to);
> -		VM_BUG_ON(mc.precharge);
> -		VM_BUG_ON(mc.moved_charge);
> -		VM_BUG_ON(mc.moved_swap);
> -
> -		spin_lock(&mc.lock);
> -		mc.mm = mm;
> -		mc.from = from;
> -		mc.to = memcg;
> -		mc.flags = move_flags;
> -		spin_unlock(&mc.lock);
> -		/* We set mc.moving_task later */
> -
> -		ret = mem_cgroup_precharge_mc(mm);
> -		if (ret)
> -			mem_cgroup_clear_mc();
> -	} else {
> -		mmput(mm);
> -	}
> -	return ret;
> -}
> -
> -static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
> -{
> -	if (mc.to)
> -		mem_cgroup_clear_mc();
> -}
> -
> -static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
> -				unsigned long addr, unsigned long end,
> -				struct mm_walk *walk)
> -{
> -	int ret = 0;
> -	struct vm_area_struct *vma = walk->vma;
> -	pte_t *pte;
> -	spinlock_t *ptl;
> -	enum mc_target_type target_type;
> -	union mc_target target;
> -	struct folio *folio;
> -
> -	ptl = pmd_trans_huge_lock(pmd, vma);
> -	if (ptl) {
> -		if (mc.precharge < HPAGE_PMD_NR) {
> -			spin_unlock(ptl);
> -			return 0;
> -		}
> -		target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
> -		if (target_type == MC_TARGET_PAGE) {
> -			folio = target.folio;
> -			if (folio_isolate_lru(folio)) {
> -				if (!mem_cgroup_move_account(folio, true,
> -							     mc.from, mc.to)) {
> -					mc.precharge -= HPAGE_PMD_NR;
> -					mc.moved_charge += HPAGE_PMD_NR;
> -				}
> -				folio_putback_lru(folio);
> -			}
> -			folio_unlock(folio);
> -			folio_put(folio);
> -		} else if (target_type == MC_TARGET_DEVICE) {
> -			folio = target.folio;
> -			if (!mem_cgroup_move_account(folio, true,
> -						     mc.from, mc.to)) {
> -				mc.precharge -= HPAGE_PMD_NR;
> -				mc.moved_charge += HPAGE_PMD_NR;
> -			}
> -			folio_unlock(folio);
> -			folio_put(folio);
> -		}
> -		spin_unlock(ptl);
> -		return 0;
> -	}
> -
> -retry:
> -	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> -	if (!pte)
> -		return 0;
> -	for (; addr != end; addr += PAGE_SIZE) {
> -		pte_t ptent = ptep_get(pte++);
> -		bool device = false;
> -		swp_entry_t ent;
> -
> -		if (!mc.precharge)
> -			break;
> -
> -		switch (get_mctgt_type(vma, addr, ptent, &target)) {
> -		case MC_TARGET_DEVICE:
> -			device = true;
> -			fallthrough;
> -		case MC_TARGET_PAGE:
> -			folio = target.folio;
> -			/*
> -			 * We can have a part of the split pmd here. Moving it
> -			 * can be done but it would be too convoluted so simply
> -			 * ignore such a partial THP and keep it in original
> -			 * memcg. There should be somebody mapping the head.
> -			 */
> -			if (folio_test_large(folio))
> -				goto put;
> -			if (!device && !folio_isolate_lru(folio))
> -				goto put;
> -			if (!mem_cgroup_move_account(folio, false,
> -						mc.from, mc.to)) {
> -				mc.precharge--;
> -				/* we uncharge from mc.from later. */
> -				mc.moved_charge++;
> -			}
> -			if (!device)
> -				folio_putback_lru(folio);
> -put:			/* get_mctgt_type() gets & locks the page */
> -			folio_unlock(folio);
> -			folio_put(folio);
> -			break;
> -		case MC_TARGET_SWAP:
> -			ent = target.ent;
> -			if (!mem_cgroup_move_swap_account(ent, mc.from, mc.to)) {
> -				mc.precharge--;
> -				mem_cgroup_id_get_many(mc.to, 1);
> -				/* we fixup other refcnts and charges later. */
> -				mc.moved_swap++;
> -			}
> -			break;
> -		default:
> -			break;
> -		}
> -	}
> -	pte_unmap_unlock(pte - 1, ptl);
> -	cond_resched();
> -
> -	if (addr != end) {
> -		/*
> -		 * We have consumed all precharges we got in can_attach().
> -		 * We try to charge one by one, but we don't do any additional
> -		 * charging of mc.to if we have already failed to charge once
> -		 * in the attach() phase.
> -		 */
> -		ret = mem_cgroup_do_precharge(1);
> -		if (!ret)
> -			goto retry;
> -	}
> -
> -	return ret;
> -}
> -
> -static const struct mm_walk_ops charge_walk_ops = {
> -	.pmd_entry	= mem_cgroup_move_charge_pte_range,
> -	.walk_lock	= PGWALK_RDLOCK,
> -};
> -
> -static void mem_cgroup_move_charge(void)
> -{
> -	lru_add_drain_all();
> -	/*
> -	 * Signal folio_memcg_lock() to take the memcg's move_lock
> -	 * while we're moving its pages to another memcg. Then wait
> -	 * for already started RCU-only updates to finish.
> -	 */
> -	atomic_inc(&mc.from->moving_account);
> -	synchronize_rcu();
> -retry:
> -	if (unlikely(!mmap_read_trylock(mc.mm))) {
> -		/*
> -		 * Someone who is holding the mmap_lock might be waiting in
> -		 * the waitq. So we cancel all extra charges, wake up all waiters,
> -		 * and retry. Because we cancel precharges, we might not be able
> -		 * to move enough charges, but moving charge is a best-effort
> -		 * feature anyway, so it wouldn't be a big problem.
> -		 */
> -		__mem_cgroup_clear_mc();
> -		cond_resched();
> -		goto retry;
> -	}
> -	/*
> -	 * When we have consumed all precharges and failed in doing
> -	 * additional charge, the page walk just aborts.
> -	 */
> -	walk_page_range(mc.mm, 0, ULONG_MAX, &charge_walk_ops, NULL);
> -	mmap_read_unlock(mc.mm);
> -	atomic_dec(&mc.from->moving_account);
> -}
> -
> -static void mem_cgroup_move_task(void)
> -{
> -	if (mc.to) {
> -		mem_cgroup_move_charge();
> -		mem_cgroup_clear_mc();
> -	}
> -}
> -
> -#else	/* !CONFIG_MMU */
> -static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
> -{
> -	return 0;
> -}
> -static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
> -{
> -}
> -static void mem_cgroup_move_task(void)
> -{
> -}
> -#endif
> -
>  #ifdef CONFIG_MEMCG_KMEM
>  static void mem_cgroup_fork(struct task_struct *task)
>  {
> -- 
> 2.45.2

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread
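
For readers who haven't used this interface: the machinery moved by the
quoted patch is driven entirely from userspace through the cgroup v1
filesystem. A minimal sketch of how it gets exercised follows; the mount
point and the cgroup name "dst" are assumptions for illustration, not
part of the patch.

/*
 * Sketch: trigger v1 charge moving from userspace.  Assumes a v1
 * memory controller mounted at /sys/fs/cgroup/memory and an existing
 * cgroup "dst" (both hypothetical).
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -1;
	if (write(fd, val, strlen(val)) != (ssize_t)strlen(val)) {
		close(fd);
		return -1;
	}
	return close(fd);
}

int main(void)
{
	char pid[16];

	/* bit 0 = move anon charges (MOVE_ANON), bit 1 = file (MOVE_FILE) */
	if (write_str("/sys/fs/cgroup/memory/dst/memory.move_charge_at_immigrate",
		      "3"))
		return 1;

	/*
	 * Moving a task that owns its mm is what reaches
	 * mem_cgroup_can_attach() (the precharge page-table walk) and
	 * mem_cgroup_move_task() (the actual transfer) in the code above.
	 */
	snprintf(pid, sizeof(pid), "%d", getpid());
	return write_str("/sys/fs/cgroup/memory/dst/cgroup.procs", pid) ? 1 : 0;
}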

* Re: [PATCH v2 05/14] mm: memcg: rename charge move-related functions
  2024-06-25  0:58 ` [PATCH v2 05/14] mm: memcg: rename charge move-related functions Roman Gushchin
@ 2024-06-25  7:07   ` Michal Hocko
  0 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2024-06-25  7:07 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm

On Mon 24-06-24 17:58:57, Roman Gushchin wrote:
> Rename the exported functions related to charge moving to use
> the memcg1_ prefix.
> 
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol-v1.c | 14 +++++++-------
>  mm/memcontrol-v1.h |  8 ++++----
>  mm/memcontrol.c    |  8 ++++----
>  3 files changed, 15 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index f4c8bec5ae1b..c25e038ac874 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -384,7 +384,7 @@ static bool mem_cgroup_under_move(struct mem_cgroup *memcg)
>  	return ret;
>  }
>  
> -bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg)
> +bool memcg1_wait_acct_move(struct mem_cgroup *memcg)
>  {
>  	if (mc.moving_task && current != mc.moving_task) {
>  		if (mem_cgroup_under_move(memcg)) {
> @@ -1056,7 +1056,7 @@ static void mem_cgroup_clear_mc(void)
>  	mmput(mm);
>  }
>  
> -int mem_cgroup_can_attach(struct cgroup_taskset *tset)
> +int memcg1_can_attach(struct cgroup_taskset *tset)
>  {
>  	struct cgroup_subsys_state *css;
>  	struct mem_cgroup *memcg = NULL; /* unneeded init to make gcc happy */
> @@ -1126,7 +1126,7 @@ int mem_cgroup_can_attach(struct cgroup_taskset *tset)
>  	return ret;
>  }
>  
> -void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
> +void memcg1_cancel_attach(struct cgroup_taskset *tset)
>  {
>  	if (mc.to)
>  		mem_cgroup_clear_mc();
> @@ -1285,7 +1285,7 @@ static void mem_cgroup_move_charge(void)
>  	atomic_dec(&mc.from->moving_account);
>  }
>  
> -void mem_cgroup_move_task(void)
> +void memcg1_move_task(void)
>  {
>  	if (mc.to) {
>  		mem_cgroup_move_charge();
> @@ -1294,14 +1294,14 @@ void mem_cgroup_move_task(void)
>  }
>  
>  #else	/* !CONFIG_MMU */
> -static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
> +int memcg1_can_attach(struct cgroup_taskset *tset)
>  {
>  	return 0;
>  }
> -static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
> +void memcg1_cancel_attach(struct cgroup_taskset *tset)
>  {
>  }
> -static void mem_cgroup_move_task(void)
> +void memcg1_move_task(void)
>  {
>  }
>  #endif
> diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
> index 55e7c4f90c39..d377c0be9880 100644
> --- a/mm/memcontrol-v1.h
> +++ b/mm/memcontrol-v1.h
> @@ -29,11 +29,11 @@ static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n);
>  void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n);
>  
> -bool mem_cgroup_wait_acct_move(struct mem_cgroup *memcg);
> +bool memcg1_wait_acct_move(struct mem_cgroup *memcg);
>  struct cgroup_taskset;
> -int mem_cgroup_can_attach(struct cgroup_taskset *tset);
> -void mem_cgroup_cancel_attach(struct cgroup_taskset *tset);
> -void mem_cgroup_move_task(void);
> +int memcg1_can_attach(struct cgroup_taskset *tset);
> +void memcg1_cancel_attach(struct cgroup_taskset *tset);
> +void memcg1_move_task(void);
>  
>  struct cftype;
>  u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 3332c89cae2e..da2c0fa0de1b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2582,7 +2582,7 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	 * At task move, charge accounts can be doubly counted. So, it's
>  	 * better to wait until the end of task_move if something is going on.
>  	 */
> -	if (mem_cgroup_wait_acct_move(mem_over_limit))
> +	if (memcg1_wait_acct_move(mem_over_limit))
>  		goto retry;
>  
>  	if (nr_retries--)
> @@ -6030,12 +6030,12 @@ struct cgroup_subsys memory_cgrp_subsys = {
>  	.css_free = mem_cgroup_css_free,
>  	.css_reset = mem_cgroup_css_reset,
>  	.css_rstat_flush = mem_cgroup_css_rstat_flush,
> -	.can_attach = mem_cgroup_can_attach,
> +	.can_attach = memcg1_can_attach,
>  #if defined(CONFIG_LRU_GEN) || defined(CONFIG_MEMCG_KMEM)
>  	.attach = mem_cgroup_attach,
>  #endif
> -	.cancel_attach = mem_cgroup_cancel_attach,
> -	.post_attach = mem_cgroup_move_task,
> +	.cancel_attach = memcg1_cancel_attach,
> +	.post_attach = memcg1_move_task,
>  #ifdef CONFIG_MEMCG_KMEM
>  	.fork = mem_cgroup_fork,
>  	.exit = mem_cgroup_exit,
> -- 
> 2.45.2
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 06/14] mm: memcg: move legacy memcg event code into memcontrol-v1.c
  2024-06-25  0:58 ` [PATCH v2 06/14] mm: memcg: move legacy memcg event code into memcontrol-v1.c Roman Gushchin
@ 2024-06-25  7:07   ` Michal Hocko
  0 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2024-06-25  7:07 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm

On Mon 24-06-24 17:58:58, Roman Gushchin wrote:
> Cgroup v1's memory controller contains a pretty complicated
> event notification mechanism which is not used on cgroup v2.
> Let's move the corresponding code into memcontrol-v1.c.
> 
> Please note that mem_cgroup_event_ratelimit() remains in
> memcontrol.c; otherwise it would require exporting too many
> details of memcg stats outside of memcontrol.c.
> 
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>

Acked-by: Michal Hocko <mhocko@suse.com>
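
For those unfamiliar with the v1 interface being relocated: these are
the memory threshold events armed from userspace by writing
"<eventfd> <fd of memory.usage_in_bytes> <threshold>" to
cgroup.event_control. A minimal userspace sketch, assuming a v1 memory
cgroup at the hypothetical path /sys/fs/cgroup/memory/grp:

/*
 * Sketch: arm a v1 memory threshold event.  The cgroup path is
 * hypothetical; the interface is the one implemented by the
 * __mem_cgroup_usage_register_event() code moved in the diff below.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>

int main(void)
{
	int efd = eventfd(0, 0);
	int ufd = open("/sys/fs/cgroup/memory/grp/memory.usage_in_bytes",
		       O_RDONLY);
	int cfd = open("/sys/fs/cgroup/memory/grp/cgroup.event_control",
		       O_WRONLY);
	char cmd[64];
	uint64_t cnt;

	if (efd < 0 || ufd < 0 || cfd < 0)
		return 1;

	/* signal the eventfd when usage crosses 64M, in either direction */
	snprintf(cmd, sizeof(cmd), "%d %d %llu", efd, ufd, 64ULL << 20);
	if (write(cfd, cmd, strlen(cmd)) < 0)
		return 1;

	/* blocks until __mem_cgroup_threshold() does eventfd_signal() */
	if (read(efd, &cnt, sizeof(cnt)) != sizeof(cnt))
		return 1;
	printf("threshold crossed %llu time(s)\n", (unsigned long long)cnt);
	return 0;
}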

> ---
>  include/linux/memcontrol.h |  12 -
>  mm/memcontrol-v1.c         | 653 +++++++++++++++++++++++++++++++++++
>  mm/memcontrol-v1.h         |  51 +++
>  mm/memcontrol.c            | 687 +------------------------------------
>  4 files changed, 709 insertions(+), 694 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 83c8327455d8..588179d29849 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -69,18 +69,6 @@ struct mem_cgroup_id {
>  	refcount_t ref;
>  };
>  
> -/*
> - * Per memcg event counter is incremented at every pagein/pageout. With THP,
> - * it will be incremented by the number of pages. This counter is used
> - * to trigger some periodic events. This is straightforward and better
> - * than using jiffies etc. to handle periodic memcg event.
> - */
> -enum mem_cgroup_events_target {
> -	MEM_CGROUP_TARGET_THRESH,
> -	MEM_CGROUP_TARGET_SOFTLIMIT,
> -	MEM_CGROUP_NTARGETS,
> -};
> -
>  struct memcg_vmstats_percpu;
>  struct memcg_vmstats;
>  struct lruvec_stats_percpu;
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index c25e038ac874..4b2290ceace6 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -6,6 +6,10 @@
>  #include <linux/pagewalk.h>
>  #include <linux/backing-dev.h>
>  #include <linux/swap_cgroup.h>
> +#include <linux/eventfd.h>
> +#include <linux/poll.h>
> +#include <linux/sort.h>
> +#include <linux/file.h>
>  
>  #include "internal.h"
>  #include "swap.h"
> @@ -60,6 +64,54 @@ static struct move_charge_struct {
>  	.waitq = __WAIT_QUEUE_HEAD_INITIALIZER(mc.waitq),
>  };
>  
> +/* for OOM */
> +struct mem_cgroup_eventfd_list {
> +	struct list_head list;
> +	struct eventfd_ctx *eventfd;
> +};
> +
> +/*
> + * cgroup_event represents events which userspace wants to receive.
> + */
> +struct mem_cgroup_event {
> +	/*
> +	 * memcg which the event belongs to.
> +	 */
> +	struct mem_cgroup *memcg;
> +	/*
> +	 * eventfd to signal userspace about the event.
> +	 */
> +	struct eventfd_ctx *eventfd;
> +	/*
> +	 * Each of these stored in a list by the cgroup.
> +	 */
> +	struct list_head list;
> +	/*
> +	 * register_event() callback will be used to add new userspace
> +	 * waiter for changes related to this event.  Use eventfd_signal()
> +	 * on eventfd to send notification to userspace.
> +	 */
> +	int (*register_event)(struct mem_cgroup *memcg,
> +			      struct eventfd_ctx *eventfd, const char *args);
> +	/*
> +	 * unregister_event() callback will be called when userspace closes
> +	 * the eventfd or on cgroup removal.  This callback must be set
> +	 * if you want to provide notification functionality.
> +	 */
> +	void (*unregister_event)(struct mem_cgroup *memcg,
> +				 struct eventfd_ctx *eventfd);
> +	/*
> +	 * All fields below are needed to unregister the event when
> +	 * userspace closes the eventfd.
> +	 */
> +	poll_table pt;
> +	wait_queue_head_t *wqh;
> +	wait_queue_entry_t wait;
> +	struct work_struct remove;
> +};
> +
> +extern spinlock_t memcg_oom_lock;
> +
>  static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
>  					 struct mem_cgroup_tree_per_node *mctz,
>  					 unsigned long new_usage_in_excess)
> @@ -1306,6 +1358,607 @@ void memcg1_move_task(void)
>  }
>  #endif
>  
> +static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
> +{
> +	struct mem_cgroup_threshold_ary *t;
> +	unsigned long usage;
> +	int i;
> +
> +	rcu_read_lock();
> +	if (!swap)
> +		t = rcu_dereference(memcg->thresholds.primary);
> +	else
> +		t = rcu_dereference(memcg->memsw_thresholds.primary);
> +
> +	if (!t)
> +		goto unlock;
> +
> +	usage = mem_cgroup_usage(memcg, swap);
> +
> +	/*
> +	 * current_threshold points to the threshold just below or equal to
> +	 * usage. If that is no longer true, a threshold was crossed after the
> +	 * last call of __mem_cgroup_threshold().
> +	 */
> +	i = t->current_threshold;
> +
> +	/*
> +	 * Iterate backward over array of thresholds starting from
> +	 * current_threshold and check if a threshold is crossed.
> +	 * If none of thresholds below usage is crossed, we read
> +	 * only one element of the array here.
> +	 */
> +	for (; i >= 0 && unlikely(t->entries[i].threshold > usage); i--)
> +		eventfd_signal(t->entries[i].eventfd);
> +
> +	/* i = current_threshold + 1 */
> +	i++;
> +
> +	/*
> +	 * Iterate forward over array of thresholds starting from
> +	 * current_threshold+1 and check if a threshold is crossed.
> +	 * If none of thresholds above usage is crossed, we read
> +	 * only one element of the array here.
> +	 */
> +	for (; i < t->size && unlikely(t->entries[i].threshold <= usage); i++)
> +		eventfd_signal(t->entries[i].eventfd);
> +
> +	/* Update current_threshold */
> +	t->current_threshold = i - 1;
> +unlock:
> +	rcu_read_unlock();
> +}
> +
> +static void mem_cgroup_threshold(struct mem_cgroup *memcg)
> +{
> +	while (memcg) {
> +		__mem_cgroup_threshold(memcg, false);
> +		if (do_memsw_account())
> +			__mem_cgroup_threshold(memcg, true);
> +
> +		memcg = parent_mem_cgroup(memcg);
> +	}
> +}
> +
> +/*
> + * Check events in order: threshold events first, then a possible
> + * soft limit tree update.
> + */
> +void memcg_check_events(struct mem_cgroup *memcg, int nid)
> +{
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +		return;
> +
> +	/* threshold event is triggered in finer grain than soft limit */
> +	if (unlikely(mem_cgroup_event_ratelimit(memcg,
> +						MEM_CGROUP_TARGET_THRESH))) {
> +		bool do_softlimit;
> +
> +		do_softlimit = mem_cgroup_event_ratelimit(memcg,
> +						MEM_CGROUP_TARGET_SOFTLIMIT);
> +		mem_cgroup_threshold(memcg);
> +		if (unlikely(do_softlimit))
> +			memcg1_update_tree(memcg, nid);
> +	}
> +}
> +
> +static int compare_thresholds(const void *a, const void *b)
> +{
> +	const struct mem_cgroup_threshold *_a = a;
> +	const struct mem_cgroup_threshold *_b = b;
> +
> +	if (_a->threshold > _b->threshold)
> +		return 1;
> +
> +	if (_a->threshold < _b->threshold)
> +		return -1;
> +
> +	return 0;
> +}
> +
> +static int mem_cgroup_oom_notify_cb(struct mem_cgroup *memcg)
> +{
> +	struct mem_cgroup_eventfd_list *ev;
> +
> +	spin_lock(&memcg_oom_lock);
> +
> +	list_for_each_entry(ev, &memcg->oom_notify, list)
> +		eventfd_signal(ev->eventfd);
> +
> +	spin_unlock(&memcg_oom_lock);
> +	return 0;
> +}
> +
> +void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
> +{
> +	struct mem_cgroup *iter;
> +
> +	for_each_mem_cgroup_tree(iter, memcg)
> +		mem_cgroup_oom_notify_cb(iter);
> +}
> +
> +static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
> +	struct eventfd_ctx *eventfd, const char *args, enum res_type type)
> +{
> +	struct mem_cgroup_thresholds *thresholds;
> +	struct mem_cgroup_threshold_ary *new;
> +	unsigned long threshold;
> +	unsigned long usage;
> +	int i, size, ret;
> +
> +	ret = page_counter_memparse(args, "-1", &threshold);
> +	if (ret)
> +		return ret;
> +
> +	mutex_lock(&memcg->thresholds_lock);
> +
> +	if (type == _MEM) {
> +		thresholds = &memcg->thresholds;
> +		usage = mem_cgroup_usage(memcg, false);
> +	} else if (type == _MEMSWAP) {
> +		thresholds = &memcg->memsw_thresholds;
> +		usage = mem_cgroup_usage(memcg, true);
> +	} else
> +		BUG();
> +
> +	/* Check if a threshold was crossed before adding a new one */
> +	if (thresholds->primary)
> +		__mem_cgroup_threshold(memcg, type == _MEMSWAP);
> +
> +	size = thresholds->primary ? thresholds->primary->size + 1 : 1;
> +
> +	/* Allocate memory for new array of thresholds */
> +	new = kmalloc(struct_size(new, entries, size), GFP_KERNEL);
> +	if (!new) {
> +		ret = -ENOMEM;
> +		goto unlock;
> +	}
> +	new->size = size;
> +
> +	/* Copy thresholds (if any) to new array */
> +	if (thresholds->primary)
> +		memcpy(new->entries, thresholds->primary->entries,
> +		       flex_array_size(new, entries, size - 1));
> +
> +	/* Add new threshold */
> +	new->entries[size - 1].eventfd = eventfd;
> +	new->entries[size - 1].threshold = threshold;
> +
> +	/* Sort thresholds. Registering of new threshold isn't time-critical */
> +	sort(new->entries, size, sizeof(*new->entries),
> +			compare_thresholds, NULL);
> +
> +	/* Find current threshold */
> +	new->current_threshold = -1;
> +	for (i = 0; i < size; i++) {
> +		if (new->entries[i].threshold <= usage) {
> +			/*
> +			 * new->current_threshold will not be used until
> +			 * rcu_assign_pointer(), so it's safe to increment
> +			 * it here.
> +			 */
> +			++new->current_threshold;
> +		} else
> +			break;
> +	}
> +
> +	/* Free old spare buffer and save old primary buffer as spare */
> +	kfree(thresholds->spare);
> +	thresholds->spare = thresholds->primary;
> +
> +	rcu_assign_pointer(thresholds->primary, new);
> +
> +	/* To be sure that nobody uses thresholds */
> +	synchronize_rcu();
> +
> +unlock:
> +	mutex_unlock(&memcg->thresholds_lock);
> +
> +	return ret;
> +}
> +
> +static int mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
> +	struct eventfd_ctx *eventfd, const char *args)
> +{
> +	return __mem_cgroup_usage_register_event(memcg, eventfd, args, _MEM);
> +}
> +
> +static int memsw_cgroup_usage_register_event(struct mem_cgroup *memcg,
> +	struct eventfd_ctx *eventfd, const char *args)
> +{
> +	return __mem_cgroup_usage_register_event(memcg, eventfd, args, _MEMSWAP);
> +}
> +
> +static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
> +	struct eventfd_ctx *eventfd, enum res_type type)
> +{
> +	struct mem_cgroup_thresholds *thresholds;
> +	struct mem_cgroup_threshold_ary *new;
> +	unsigned long usage;
> +	int i, j, size, entries;
> +
> +	mutex_lock(&memcg->thresholds_lock);
> +
> +	if (type == _MEM) {
> +		thresholds = &memcg->thresholds;
> +		usage = mem_cgroup_usage(memcg, false);
> +	} else if (type == _MEMSWAP) {
> +		thresholds = &memcg->memsw_thresholds;
> +		usage = mem_cgroup_usage(memcg, true);
> +	} else
> +		BUG();
> +
> +	if (!thresholds->primary)
> +		goto unlock;
> +
> +	/* Check if a threshold was crossed before removing */
> +	__mem_cgroup_threshold(memcg, type == _MEMSWAP);
> +
> +	/* Calculate the new number of thresholds */
> +	size = entries = 0;
> +	for (i = 0; i < thresholds->primary->size; i++) {
> +		if (thresholds->primary->entries[i].eventfd != eventfd)
> +			size++;
> +		else
> +			entries++;
> +	}
> +
> +	new = thresholds->spare;
> +
> +	/* If no items related to eventfd have been cleared, nothing to do */
> +	if (!entries)
> +		goto unlock;
> +
> +	/* Set thresholds array to NULL if we don't have thresholds */
> +	if (!size) {
> +		kfree(new);
> +		new = NULL;
> +		goto swap_buffers;
> +	}
> +
> +	new->size = size;
> +
> +	/* Copy thresholds and find current threshold */
> +	new->current_threshold = -1;
> +	for (i = 0, j = 0; i < thresholds->primary->size; i++) {
> +		if (thresholds->primary->entries[i].eventfd == eventfd)
> +			continue;
> +
> +		new->entries[j] = thresholds->primary->entries[i];
> +		if (new->entries[j].threshold <= usage) {
> +			/*
> +			 * new->current_threshold will not be used
> +			 * until rcu_assign_pointer(), so it's safe to increment
> +			 * it here.
> +			 */
> +			++new->current_threshold;
> +		}
> +		j++;
> +	}
> +
> +swap_buffers:
> +	/* Swap primary and spare array */
> +	thresholds->spare = thresholds->primary;
> +
> +	rcu_assign_pointer(thresholds->primary, new);
> +
> +	/* To be sure that nobody uses thresholds */
> +	synchronize_rcu();
> +
> +	/* If all events are unregistered, free the spare array */
> +	if (!new) {
> +		kfree(thresholds->spare);
> +		thresholds->spare = NULL;
> +	}
> +unlock:
> +	mutex_unlock(&memcg->thresholds_lock);
> +}
> +
> +static void mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
> +	struct eventfd_ctx *eventfd)
> +{
> +	return __mem_cgroup_usage_unregister_event(memcg, eventfd, _MEM);
> +}
> +
> +static void memsw_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
> +	struct eventfd_ctx *eventfd)
> +{
> +	return __mem_cgroup_usage_unregister_event(memcg, eventfd, _MEMSWAP);
> +}
> +
> +static int mem_cgroup_oom_register_event(struct mem_cgroup *memcg,
> +	struct eventfd_ctx *eventfd, const char *args)
> +{
> +	struct mem_cgroup_eventfd_list *event;
> +
> +	event = kmalloc(sizeof(*event),	GFP_KERNEL);
> +	if (!event)
> +		return -ENOMEM;
> +
> +	spin_lock(&memcg_oom_lock);
> +
> +	event->eventfd = eventfd;
> +	list_add(&event->list, &memcg->oom_notify);
> +
> +	/* already in OOM? */
> +	if (memcg->under_oom)
> +		eventfd_signal(eventfd);
> +	spin_unlock(&memcg_oom_lock);
> +
> +	return 0;
> +}
> +
> +static void mem_cgroup_oom_unregister_event(struct mem_cgroup *memcg,
> +	struct eventfd_ctx *eventfd)
> +{
> +	struct mem_cgroup_eventfd_list *ev, *tmp;
> +
> +	spin_lock(&memcg_oom_lock);
> +
> +	list_for_each_entry_safe(ev, tmp, &memcg->oom_notify, list) {
> +		if (ev->eventfd == eventfd) {
> +			list_del(&ev->list);
> +			kfree(ev);
> +		}
> +	}
> +
> +	spin_unlock(&memcg_oom_lock);
> +}
> +
> +/*
> + * DO NOT USE IN NEW FILES.
> + *
> + * "cgroup.event_control" implementation.
> + *
> + * This is way over-engineered.  It tries to support fully configurable
> + * events for each user.  Such level of flexibility is completely
> + * unnecessary especially in the light of the planned unified hierarchy.
> + *
> + * Please deprecate this and replace with something simpler if at all
> + * possible.
> + */
> +
> +/*
> + * Unregister event and free resources.
> + *
> + * Gets called from workqueue.
> + */
> +static void memcg_event_remove(struct work_struct *work)
> +{
> +	struct mem_cgroup_event *event =
> +		container_of(work, struct mem_cgroup_event, remove);
> +	struct mem_cgroup *memcg = event->memcg;
> +
> +	remove_wait_queue(event->wqh, &event->wait);
> +
> +	event->unregister_event(memcg, event->eventfd);
> +
> +	/* Notify userspace the event is going away. */
> +	eventfd_signal(event->eventfd);
> +
> +	eventfd_ctx_put(event->eventfd);
> +	kfree(event);
> +	css_put(&memcg->css);
> +}
> +
> +/*
> + * Gets called on EPOLLHUP on eventfd when user closes it.
> + *
> + * Called with wqh->lock held and interrupts disabled.
> + */
> +static int memcg_event_wake(wait_queue_entry_t *wait, unsigned mode,
> +			    int sync, void *key)
> +{
> +	struct mem_cgroup_event *event =
> +		container_of(wait, struct mem_cgroup_event, wait);
> +	struct mem_cgroup *memcg = event->memcg;
> +	__poll_t flags = key_to_poll(key);
> +
> +	if (flags & EPOLLHUP) {
> +		/*
> +		 * If the event has been detached at cgroup removal, we
> +		 * can simply return knowing the other side will cleanup
> +		 * for us.
> +		 *
> +		 * We can't race against event freeing since the other
> +		 * side will require wqh->lock via remove_wait_queue(),
> +		 * which we hold.
> +		 */
> +		spin_lock(&memcg->event_list_lock);
> +		if (!list_empty(&event->list)) {
> +			list_del_init(&event->list);
> +			/*
> +			 * We are in atomic context, but cgroup_event_remove()
> +			 * may sleep, so we have to call it in workqueue.
> +			 */
> +			schedule_work(&event->remove);
> +		}
> +		spin_unlock(&memcg->event_list_lock);
> +	}
> +
> +	return 0;
> +}
> +
> +static void memcg_event_ptable_queue_proc(struct file *file,
> +		wait_queue_head_t *wqh, poll_table *pt)
> +{
> +	struct mem_cgroup_event *event =
> +		container_of(pt, struct mem_cgroup_event, pt);
> +
> +	event->wqh = wqh;
> +	add_wait_queue(wqh, &event->wait);
> +}
> +
> +/*
> + * DO NOT USE IN NEW FILES.
> + *
> + * Parse input and register new cgroup event handler.
> + *
> + * Input must be in format '<event_fd> <control_fd> <args>'.
> + * Interpretation of args is defined by control file implementation.
> + */
> +ssize_t memcg_write_event_control(struct kernfs_open_file *of,
> +				  char *buf, size_t nbytes, loff_t off)
> +{
> +	struct cgroup_subsys_state *css = of_css(of);
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +	struct mem_cgroup_event *event;
> +	struct cgroup_subsys_state *cfile_css;
> +	unsigned int efd, cfd;
> +	struct fd efile;
> +	struct fd cfile;
> +	struct dentry *cdentry;
> +	const char *name;
> +	char *endp;
> +	int ret;
> +
> +	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> +		return -EOPNOTSUPP;
> +
> +	buf = strstrip(buf);
> +
> +	efd = simple_strtoul(buf, &endp, 10);
> +	if (*endp != ' ')
> +		return -EINVAL;
> +	buf = endp + 1;
> +
> +	cfd = simple_strtoul(buf, &endp, 10);
> +	if ((*endp != ' ') && (*endp != '\0'))
> +		return -EINVAL;
> +	buf = endp + 1;
> +
> +	event = kzalloc(sizeof(*event), GFP_KERNEL);
> +	if (!event)
> +		return -ENOMEM;
> +
> +	event->memcg = memcg;
> +	INIT_LIST_HEAD(&event->list);
> +	init_poll_funcptr(&event->pt, memcg_event_ptable_queue_proc);
> +	init_waitqueue_func_entry(&event->wait, memcg_event_wake);
> +	INIT_WORK(&event->remove, memcg_event_remove);
> +
> +	efile = fdget(efd);
> +	if (!efile.file) {
> +		ret = -EBADF;
> +		goto out_kfree;
> +	}
> +
> +	event->eventfd = eventfd_ctx_fileget(efile.file);
> +	if (IS_ERR(event->eventfd)) {
> +		ret = PTR_ERR(event->eventfd);
> +		goto out_put_efile;
> +	}
> +
> +	cfile = fdget(cfd);
> +	if (!cfile.file) {
> +		ret = -EBADF;
> +		goto out_put_eventfd;
> +	}
> +
> +	/* the process needs read permission on the control file */
> +	/* AV: shouldn't we check that it's been opened for read instead? */
> +	ret = file_permission(cfile.file, MAY_READ);
> +	if (ret < 0)
> +		goto out_put_cfile;
> +
> +	/*
> +	 * The control file must be a regular cgroup1 file. As a regular cgroup
> +	 * file can't be renamed, it's safe to access its name afterwards.
> +	 */
> +	cdentry = cfile.file->f_path.dentry;
> +	if (cdentry->d_sb->s_type != &cgroup_fs_type || !d_is_reg(cdentry)) {
> +		ret = -EINVAL;
> +		goto out_put_cfile;
> +	}
> +
> +	/*
> +	 * Determine the event callbacks and set them in @event.  This used
> +	 * to be done via struct cftype but cgroup core no longer knows
> +	 * about these events.  The following is crude but the whole thing
> +	 * is for compatibility anyway.
> +	 *
> +	 * DO NOT ADD NEW FILES.
> +	 */
> +	name = cdentry->d_name.name;
> +
> +	if (!strcmp(name, "memory.usage_in_bytes")) {
> +		event->register_event = mem_cgroup_usage_register_event;
> +		event->unregister_event = mem_cgroup_usage_unregister_event;
> +	} else if (!strcmp(name, "memory.oom_control")) {
> +		event->register_event = mem_cgroup_oom_register_event;
> +		event->unregister_event = mem_cgroup_oom_unregister_event;
> +	} else if (!strcmp(name, "memory.pressure_level")) {
> +		event->register_event = vmpressure_register_event;
> +		event->unregister_event = vmpressure_unregister_event;
> +	} else if (!strcmp(name, "memory.memsw.usage_in_bytes")) {
> +		event->register_event = memsw_cgroup_usage_register_event;
> +		event->unregister_event = memsw_cgroup_usage_unregister_event;
> +	} else {
> +		ret = -EINVAL;
> +		goto out_put_cfile;
> +	}
> +
> +	/*
> +	 * Verify @cfile should belong to @css.  Also, remaining events are
> +	 * automatically removed on cgroup destruction but the removal is
> +	 * asynchronous, so take an extra ref on @css.
> +	 */
> +	cfile_css = css_tryget_online_from_dir(cdentry->d_parent,
> +					       &memory_cgrp_subsys);
> +	ret = -EINVAL;
> +	if (IS_ERR(cfile_css))
> +		goto out_put_cfile;
> +	if (cfile_css != css) {
> +		css_put(cfile_css);
> +		goto out_put_cfile;
> +	}
> +
> +	ret = event->register_event(memcg, event->eventfd, buf);
> +	if (ret)
> +		goto out_put_css;
> +
> +	vfs_poll(efile.file, &event->pt);
> +
> +	spin_lock_irq(&memcg->event_list_lock);
> +	list_add(&event->list, &memcg->event_list);
> +	spin_unlock_irq(&memcg->event_list_lock);
> +
> +	fdput(cfile);
> +	fdput(efile);
> +
> +	return nbytes;
> +
> +out_put_css:
> +	css_put(css);
> +out_put_cfile:
> +	fdput(cfile);
> +out_put_eventfd:
> +	eventfd_ctx_put(event->eventfd);
> +out_put_efile:
> +	fdput(efile);
> +out_kfree:
> +	kfree(event);
> +
> +	return ret;
> +}
> +
> +void memcg1_css_offline(struct mem_cgroup *memcg)
> +{
> +	struct mem_cgroup_event *event, *tmp;
> +
> +	/*
> +	 * Unregister events and notify userspace.
> +	 * Notify userspace about cgroup removal only after rmdir of the cgroup
> +	 * directory to avoid a race between userspace and the kernel.
> +	 */
> +	spin_lock_irq(&memcg->event_list_lock);
> +	list_for_each_entry_safe(event, tmp, &memcg->event_list, list) {
> +		list_del_init(&event->list);
> +		schedule_work(&event->remove);
> +	}
> +	spin_unlock_irq(&memcg->event_list_lock);
> +}
> +
>  static int __init memcg1_init(void)
>  {
>  	int node;
> diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
> index d377c0be9880..524a2c76ffc9 100644
> --- a/mm/memcontrol-v1.h
> +++ b/mm/memcontrol-v1.h
> @@ -41,4 +41,55 @@ u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css,
>  int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
>  				 struct cftype *cft, u64 val);
>  
> +/*
> + * Per memcg event counter is incremented at every pagein/pageout. With THP,
> + * it will be incremented by the number of pages. This counter is used
> + * to trigger some periodic events. This is straightforward and better
> + * than using jiffies etc. to handle periodic memcg event.
> + */
> +enum mem_cgroup_events_target {
> +	MEM_CGROUP_TARGET_THRESH,
> +	MEM_CGROUP_TARGET_SOFTLIMIT,
> +	MEM_CGROUP_NTARGETS,
> +};
> +
> +/* Whether legacy memory+swap accounting is active */
> +static bool do_memsw_account(void)
> +{
> +	return !cgroup_subsys_on_dfl(memory_cgrp_subsys);
> +}
> +
> +/*
> + * Iteration constructs for visiting all cgroups (under a tree).  If
> + * loops are exited prematurely (break), mem_cgroup_iter_break() must
> + * be used for reference counting.
> + */
> +#define for_each_mem_cgroup_tree(iter, root)		\
> +	for (iter = mem_cgroup_iter(root, NULL, NULL);	\
> +	     iter != NULL;				\
> +	     iter = mem_cgroup_iter(root, iter, NULL))
> +
> +#define for_each_mem_cgroup(iter)			\
> +	for (iter = mem_cgroup_iter(NULL, NULL, NULL);	\
> +	     iter != NULL;				\
> +	     iter = mem_cgroup_iter(NULL, iter, NULL))
> +
> +void memcg1_css_offline(struct mem_cgroup *memcg);
> +
> +/* for encoding cft->private value on file */
> +enum res_type {
> +	_MEM,
> +	_MEMSWAP,
> +	_KMEM,
> +	_TCP,
> +};
> +
> +bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
> +				enum mem_cgroup_events_target target);
> +unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap);
> +void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
> +ssize_t memcg_write_event_control(struct kernfs_open_file *of,
> +				  char *buf, size_t nbytes, loff_t off);
> +
> +
>  #endif	/* __MM_MEMCONTROL_V1_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index da2c0fa0de1b..bd4b26a73596 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -46,9 +46,6 @@
>  #include <linux/slab.h>
>  #include <linux/swapops.h>
>  #include <linux/spinlock.h>
> -#include <linux/eventfd.h>
> -#include <linux/poll.h>
> -#include <linux/sort.h>
>  #include <linux/fs.h>
>  #include <linux/seq_file.h>
>  #include <linux/parser.h>
> @@ -59,7 +56,6 @@
>  #include <linux/cpu.h>
>  #include <linux/oom.h>
>  #include <linux/lockdep.h>
> -#include <linux/file.h>
>  #include <linux/resume_user_mode.h>
>  #include <linux/psi.h>
>  #include <linux/seq_buf.h>
> @@ -97,91 +93,13 @@ static bool cgroup_memory_nobpf __ro_after_init;
>  static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
>  #endif
>  
> -/* Whether legacy memory+swap accounting is active */
> -static bool do_memsw_account(void)
> -{
> -	return !cgroup_subsys_on_dfl(memory_cgrp_subsys);
> -}
> -
>  #define THRESHOLDS_EVENTS_TARGET 128
>  #define SOFTLIMIT_EVENTS_TARGET 1024
>  
> -/* for OOM */
> -struct mem_cgroup_eventfd_list {
> -	struct list_head list;
> -	struct eventfd_ctx *eventfd;
> -};
> -
> -/*
> - * cgroup_event represents events which userspace want to receive.
> - */
> -struct mem_cgroup_event {
> -	/*
> -	 * memcg which the event belongs to.
> -	 */
> -	struct mem_cgroup *memcg;
> -	/*
> -	 * eventfd to signal userspace about the event.
> -	 */
> -	struct eventfd_ctx *eventfd;
> -	/*
> -	 * Each of these stored in a list by the cgroup.
> -	 */
> -	struct list_head list;
> -	/*
> -	 * register_event() callback will be used to add new userspace
> -	 * waiter for changes related to this event.  Use eventfd_signal()
> -	 * on eventfd to send notification to userspace.
> -	 */
> -	int (*register_event)(struct mem_cgroup *memcg,
> -			      struct eventfd_ctx *eventfd, const char *args);
> -	/*
> -	 * unregister_event() callback will be called when userspace closes
> -	 * the eventfd or on cgroup removing.  This callback must be set,
> -	 * if you want provide notification functionality.
> -	 */
> -	void (*unregister_event)(struct mem_cgroup *memcg,
> -				 struct eventfd_ctx *eventfd);
> -	/*
> -	 * All fields below needed to unregister event when
> -	 * userspace closes eventfd.
> -	 */
> -	poll_table pt;
> -	wait_queue_head_t *wqh;
> -	wait_queue_entry_t wait;
> -	struct work_struct remove;
> -};
> -
> -static void mem_cgroup_threshold(struct mem_cgroup *memcg);
> -static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
> -
> -/* for encoding cft->private value on file */
> -enum res_type {
> -	_MEM,
> -	_MEMSWAP,
> -	_KMEM,
> -	_TCP,
> -};
> -
>  #define MEMFILE_PRIVATE(x, val)	((x) << 16 | (val))
>  #define MEMFILE_TYPE(val)	((val) >> 16 & 0xffff)
>  #define MEMFILE_ATTR(val)	((val) & 0xffff)
>  
> -/*
> - * Iteration constructs for visiting all cgroups (under a tree).  If
> - * loops are exited prematurely (break), mem_cgroup_iter_break() must
> - * be used for reference counting.
> - */
> -#define for_each_mem_cgroup_tree(iter, root)		\
> -	for (iter = mem_cgroup_iter(root, NULL, NULL);	\
> -	     iter != NULL;				\
> -	     iter = mem_cgroup_iter(root, iter, NULL))
> -
> -#define for_each_mem_cgroup(iter)			\
> -	for (iter = mem_cgroup_iter(NULL, NULL, NULL);	\
> -	     iter != NULL;				\
> -	     iter = mem_cgroup_iter(NULL, iter, NULL))
> -
>  static inline bool task_is_dying(void)
>  {
>  	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
> @@ -940,8 +858,8 @@ void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages)
>  	__this_cpu_add(memcg->vmstats_percpu->nr_page_events, nr_pages);
>  }
>  
> -static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
> -				       enum mem_cgroup_events_target target)
> +bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
> +				enum mem_cgroup_events_target target)
>  {
>  	unsigned long val, next;
>  
> @@ -965,28 +883,6 @@ static bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
>  	return false;
>  }
>  
> -/*
> - * Check events in order.
> - *
> - */
> -void memcg_check_events(struct mem_cgroup *memcg, int nid)
> -{
> -	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> -		return;
> -
> -	/* threshold event is triggered in finer grain than soft limit */
> -	if (unlikely(mem_cgroup_event_ratelimit(memcg,
> -						MEM_CGROUP_TARGET_THRESH))) {
> -		bool do_softlimit;
> -
> -		do_softlimit = mem_cgroup_event_ratelimit(memcg,
> -						MEM_CGROUP_TARGET_SOFTLIMIT);
> -		mem_cgroup_threshold(memcg);
> -		if (unlikely(do_softlimit))
> -			memcg1_update_tree(memcg, nid);
> -	}
> -}
> -
>  struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
>  {
>  	/*
> @@ -1726,7 +1622,7 @@ static struct lockdep_map memcg_oom_lock_dep_map = {
>  };
>  #endif
>  
> -static DEFINE_SPINLOCK(memcg_oom_lock);
> +DEFINE_SPINLOCK(memcg_oom_lock);
>  
>  /*
>   * Check OOM-Killer is already running under our hierarchy.
> @@ -3545,7 +3441,7 @@ static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css,
>  	return -EINVAL;
>  }
>  
> -static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
> +unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
>  {
>  	unsigned long val;
>  
> @@ -4046,331 +3942,6 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
>  	return 0;
>  }
>  
> -static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
> -{
> -	struct mem_cgroup_threshold_ary *t;
> -	unsigned long usage;
> -	int i;
> -
> -	rcu_read_lock();
> -	if (!swap)
> -		t = rcu_dereference(memcg->thresholds.primary);
> -	else
> -		t = rcu_dereference(memcg->memsw_thresholds.primary);
> -
> -	if (!t)
> -		goto unlock;
> -
> -	usage = mem_cgroup_usage(memcg, swap);
> -
> -	/*
> -	 * current_threshold points to threshold just below or equal to usage.
> -	 * If it's not true, a threshold was crossed after last
> -	 * call of __mem_cgroup_threshold().
> -	 */
> -	i = t->current_threshold;
> -
> -	/*
> -	 * Iterate backward over array of thresholds starting from
> -	 * current_threshold and check if a threshold is crossed.
> -	 * If none of thresholds below usage is crossed, we read
> -	 * only one element of the array here.
> -	 */
> -	for (; i >= 0 && unlikely(t->entries[i].threshold > usage); i--)
> -		eventfd_signal(t->entries[i].eventfd);
> -
> -	/* i = current_threshold + 1 */
> -	i++;
> -
> -	/*
> -	 * Iterate forward over array of thresholds starting from
> -	 * current_threshold+1 and check if a threshold is crossed.
> -	 * If none of thresholds above usage is crossed, we read
> -	 * only one element of the array here.
> -	 */
> -	for (; i < t->size && unlikely(t->entries[i].threshold <= usage); i++)
> -		eventfd_signal(t->entries[i].eventfd);
> -
> -	/* Update current_threshold */
> -	t->current_threshold = i - 1;
> -unlock:
> -	rcu_read_unlock();
> -}
> -
> -static void mem_cgroup_threshold(struct mem_cgroup *memcg)
> -{
> -	while (memcg) {
> -		__mem_cgroup_threshold(memcg, false);
> -		if (do_memsw_account())
> -			__mem_cgroup_threshold(memcg, true);
> -
> -		memcg = parent_mem_cgroup(memcg);
> -	}
> -}
> -
> -static int compare_thresholds(const void *a, const void *b)
> -{
> -	const struct mem_cgroup_threshold *_a = a;
> -	const struct mem_cgroup_threshold *_b = b;
> -
> -	if (_a->threshold > _b->threshold)
> -		return 1;
> -
> -	if (_a->threshold < _b->threshold)
> -		return -1;
> -
> -	return 0;
> -}
> -
> -static int mem_cgroup_oom_notify_cb(struct mem_cgroup *memcg)
> -{
> -	struct mem_cgroup_eventfd_list *ev;
> -
> -	spin_lock(&memcg_oom_lock);
> -
> -	list_for_each_entry(ev, &memcg->oom_notify, list)
> -		eventfd_signal(ev->eventfd);
> -
> -	spin_unlock(&memcg_oom_lock);
> -	return 0;
> -}
> -
> -static void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
> -{
> -	struct mem_cgroup *iter;
> -
> -	for_each_mem_cgroup_tree(iter, memcg)
> -		mem_cgroup_oom_notify_cb(iter);
> -}
> -
> -static int __mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
> -	struct eventfd_ctx *eventfd, const char *args, enum res_type type)
> -{
> -	struct mem_cgroup_thresholds *thresholds;
> -	struct mem_cgroup_threshold_ary *new;
> -	unsigned long threshold;
> -	unsigned long usage;
> -	int i, size, ret;
> -
> -	ret = page_counter_memparse(args, "-1", &threshold);
> -	if (ret)
> -		return ret;
> -
> -	mutex_lock(&memcg->thresholds_lock);
> -
> -	if (type == _MEM) {
> -		thresholds = &memcg->thresholds;
> -		usage = mem_cgroup_usage(memcg, false);
> -	} else if (type == _MEMSWAP) {
> -		thresholds = &memcg->memsw_thresholds;
> -		usage = mem_cgroup_usage(memcg, true);
> -	} else
> -		BUG();
> -
> -	/* Check if a threshold crossed before adding a new one */
> -	if (thresholds->primary)
> -		__mem_cgroup_threshold(memcg, type == _MEMSWAP);
> -
> -	size = thresholds->primary ? thresholds->primary->size + 1 : 1;
> -
> -	/* Allocate memory for new array of thresholds */
> -	new = kmalloc(struct_size(new, entries, size), GFP_KERNEL);
> -	if (!new) {
> -		ret = -ENOMEM;
> -		goto unlock;
> -	}
> -	new->size = size;
> -
> -	/* Copy thresholds (if any) to new array */
> -	if (thresholds->primary)
> -		memcpy(new->entries, thresholds->primary->entries,
> -		       flex_array_size(new, entries, size - 1));
> -
> -	/* Add new threshold */
> -	new->entries[size - 1].eventfd = eventfd;
> -	new->entries[size - 1].threshold = threshold;
> -
> -	/* Sort thresholds. Registering of new threshold isn't time-critical */
> -	sort(new->entries, size, sizeof(*new->entries),
> -			compare_thresholds, NULL);
> -
> -	/* Find current threshold */
> -	new->current_threshold = -1;
> -	for (i = 0; i < size; i++) {
> -		if (new->entries[i].threshold <= usage) {
> -			/*
> -			 * new->current_threshold will not be used until
> -			 * rcu_assign_pointer(), so it's safe to increment
> -			 * it here.
> -			 */
> -			++new->current_threshold;
> -		} else
> -			break;
> -	}
> -
> -	/* Free old spare buffer and save old primary buffer as spare */
> -	kfree(thresholds->spare);
> -	thresholds->spare = thresholds->primary;
> -
> -	rcu_assign_pointer(thresholds->primary, new);
> -
> -	/* To be sure that nobody uses thresholds */
> -	synchronize_rcu();
> -
> -unlock:
> -	mutex_unlock(&memcg->thresholds_lock);
> -
> -	return ret;
> -}
> -
> -static int mem_cgroup_usage_register_event(struct mem_cgroup *memcg,
> -	struct eventfd_ctx *eventfd, const char *args)
> -{
> -	return __mem_cgroup_usage_register_event(memcg, eventfd, args, _MEM);
> -}
> -
> -static int memsw_cgroup_usage_register_event(struct mem_cgroup *memcg,
> -	struct eventfd_ctx *eventfd, const char *args)
> -{
> -	return __mem_cgroup_usage_register_event(memcg, eventfd, args, _MEMSWAP);
> -}
> -
> -static void __mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
> -	struct eventfd_ctx *eventfd, enum res_type type)
> -{
> -	struct mem_cgroup_thresholds *thresholds;
> -	struct mem_cgroup_threshold_ary *new;
> -	unsigned long usage;
> -	int i, j, size, entries;
> -
> -	mutex_lock(&memcg->thresholds_lock);
> -
> -	if (type == _MEM) {
> -		thresholds = &memcg->thresholds;
> -		usage = mem_cgroup_usage(memcg, false);
> -	} else if (type == _MEMSWAP) {
> -		thresholds = &memcg->memsw_thresholds;
> -		usage = mem_cgroup_usage(memcg, true);
> -	} else
> -		BUG();
> -
> -	if (!thresholds->primary)
> -		goto unlock;
> -
> -	/* Check if a threshold crossed before removing */
> -	__mem_cgroup_threshold(memcg, type == _MEMSWAP);
> -
> -	/* Calculate new number of threshold */
> -	size = entries = 0;
> -	for (i = 0; i < thresholds->primary->size; i++) {
> -		if (thresholds->primary->entries[i].eventfd != eventfd)
> -			size++;
> -		else
> -			entries++;
> -	}
> -
> -	new = thresholds->spare;
> -
> -	/* If no items related to eventfd have been cleared, nothing to do */
> -	if (!entries)
> -		goto unlock;
> -
> -	/* Set thresholds array to NULL if we don't have thresholds */
> -	if (!size) {
> -		kfree(new);
> -		new = NULL;
> -		goto swap_buffers;
> -	}
> -
> -	new->size = size;
> -
> -	/* Copy thresholds and find current threshold */
> -	new->current_threshold = -1;
> -	for (i = 0, j = 0; i < thresholds->primary->size; i++) {
> -		if (thresholds->primary->entries[i].eventfd == eventfd)
> -			continue;
> -
> -		new->entries[j] = thresholds->primary->entries[i];
> -		if (new->entries[j].threshold <= usage) {
> -			/*
> -			 * new->current_threshold will not be used
> -			 * until rcu_assign_pointer(), so it's safe to increment
> -			 * it here.
> -			 */
> -			++new->current_threshold;
> -		}
> -		j++;
> -	}
> -
> -swap_buffers:
> -	/* Swap primary and spare array */
> -	thresholds->spare = thresholds->primary;
> -
> -	rcu_assign_pointer(thresholds->primary, new);
> -
> -	/* To be sure that nobody uses thresholds */
> -	synchronize_rcu();
> -
> -	/* If all events are unregistered, free the spare array */
> -	if (!new) {
> -		kfree(thresholds->spare);
> -		thresholds->spare = NULL;
> -	}
> -unlock:
> -	mutex_unlock(&memcg->thresholds_lock);
> -}
> -
> -static void mem_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
> -	struct eventfd_ctx *eventfd)
> -{
> -	return __mem_cgroup_usage_unregister_event(memcg, eventfd, _MEM);
> -}
> -
> -static void memsw_cgroup_usage_unregister_event(struct mem_cgroup *memcg,
> -	struct eventfd_ctx *eventfd)
> -{
> -	return __mem_cgroup_usage_unregister_event(memcg, eventfd, _MEMSWAP);
> -}
> -
> -static int mem_cgroup_oom_register_event(struct mem_cgroup *memcg,
> -	struct eventfd_ctx *eventfd, const char *args)
> -{
> -	struct mem_cgroup_eventfd_list *event;
> -
> -	event = kmalloc(sizeof(*event),	GFP_KERNEL);
> -	if (!event)
> -		return -ENOMEM;
> -
> -	spin_lock(&memcg_oom_lock);
> -
> -	event->eventfd = eventfd;
> -	list_add(&event->list, &memcg->oom_notify);
> -
> -	/* already in OOM ? */
> -	if (memcg->under_oom)
> -		eventfd_signal(eventfd);
> -	spin_unlock(&memcg_oom_lock);
> -
> -	return 0;
> -}
> -
> -static void mem_cgroup_oom_unregister_event(struct mem_cgroup *memcg,
> -	struct eventfd_ctx *eventfd)
> -{
> -	struct mem_cgroup_eventfd_list *ev, *tmp;
> -
> -	spin_lock(&memcg_oom_lock);
> -
> -	list_for_each_entry_safe(ev, tmp, &memcg->oom_notify, list) {
> -		if (ev->eventfd == eventfd) {
> -			list_del(&ev->list);
> -			kfree(ev);
> -		}
> -	}
> -
> -	spin_unlock(&memcg_oom_lock);
> -}
> -
>  static int mem_cgroup_oom_control_read(struct seq_file *sf, void *v)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_seq(sf);
> @@ -4611,243 +4182,6 @@ static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg)
>  
>  #endif	/* CONFIG_CGROUP_WRITEBACK */
>  
> -/*
> - * DO NOT USE IN NEW FILES.
> - *
> - * "cgroup.event_control" implementation.
> - *
> - * This is way over-engineered.  It tries to support fully configurable
> - * events for each user.  Such level of flexibility is completely
> - * unnecessary especially in the light of the planned unified hierarchy.
> - *
> - * Please deprecate this and replace with something simpler if at all
> - * possible.
> - */
> -
> -/*
> - * Unregister event and free resources.
> - *
> - * Gets called from workqueue.
> - */
> -static void memcg_event_remove(struct work_struct *work)
> -{
> -	struct mem_cgroup_event *event =
> -		container_of(work, struct mem_cgroup_event, remove);
> -	struct mem_cgroup *memcg = event->memcg;
> -
> -	remove_wait_queue(event->wqh, &event->wait);
> -
> -	event->unregister_event(memcg, event->eventfd);
> -
> -	/* Notify userspace the event is going away. */
> -	eventfd_signal(event->eventfd);
> -
> -	eventfd_ctx_put(event->eventfd);
> -	kfree(event);
> -	css_put(&memcg->css);
> -}
> -
> -/*
> - * Gets called on EPOLLHUP on eventfd when user closes it.
> - *
> - * Called with wqh->lock held and interrupts disabled.
> - */
> -static int memcg_event_wake(wait_queue_entry_t *wait, unsigned mode,
> -			    int sync, void *key)
> -{
> -	struct mem_cgroup_event *event =
> -		container_of(wait, struct mem_cgroup_event, wait);
> -	struct mem_cgroup *memcg = event->memcg;
> -	__poll_t flags = key_to_poll(key);
> -
> -	if (flags & EPOLLHUP) {
> -		/*
> -		 * If the event has been detached at cgroup removal, we
> -		 * can simply return knowing the other side will cleanup
> -		 * for us.
> -		 *
> -		 * We can't race against event freeing since the other
> -		 * side will require wqh->lock via remove_wait_queue(),
> -		 * which we hold.
> -		 */
> -		spin_lock(&memcg->event_list_lock);
> -		if (!list_empty(&event->list)) {
> -			list_del_init(&event->list);
> -			/*
> -			 * We are in atomic context, but cgroup_event_remove()
> -			 * may sleep, so we have to call it in workqueue.
> -			 */
> -			schedule_work(&event->remove);
> -		}
> -		spin_unlock(&memcg->event_list_lock);
> -	}
> -
> -	return 0;
> -}
> -
> -static void memcg_event_ptable_queue_proc(struct file *file,
> -		wait_queue_head_t *wqh, poll_table *pt)
> -{
> -	struct mem_cgroup_event *event =
> -		container_of(pt, struct mem_cgroup_event, pt);
> -
> -	event->wqh = wqh;
> -	add_wait_queue(wqh, &event->wait);
> -}
> -
> -/*
> - * DO NOT USE IN NEW FILES.
> - *
> - * Parse input and register new cgroup event handler.
> - *
> - * Input must be in format '<event_fd> <control_fd> <args>'.
> - * Interpretation of args is defined by control file implementation.
> - */
> -static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
> -					 char *buf, size_t nbytes, loff_t off)
> -{
> -	struct cgroup_subsys_state *css = of_css(of);
> -	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> -	struct mem_cgroup_event *event;
> -	struct cgroup_subsys_state *cfile_css;
> -	unsigned int efd, cfd;
> -	struct fd efile;
> -	struct fd cfile;
> -	struct dentry *cdentry;
> -	const char *name;
> -	char *endp;
> -	int ret;
> -
> -	if (IS_ENABLED(CONFIG_PREEMPT_RT))
> -		return -EOPNOTSUPP;
> -
> -	buf = strstrip(buf);
> -
> -	efd = simple_strtoul(buf, &endp, 10);
> -	if (*endp != ' ')
> -		return -EINVAL;
> -	buf = endp + 1;
> -
> -	cfd = simple_strtoul(buf, &endp, 10);
> -	if ((*endp != ' ') && (*endp != '\0'))
> -		return -EINVAL;
> -	buf = endp + 1;
> -
> -	event = kzalloc(sizeof(*event), GFP_KERNEL);
> -	if (!event)
> -		return -ENOMEM;
> -
> -	event->memcg = memcg;
> -	INIT_LIST_HEAD(&event->list);
> -	init_poll_funcptr(&event->pt, memcg_event_ptable_queue_proc);
> -	init_waitqueue_func_entry(&event->wait, memcg_event_wake);
> -	INIT_WORK(&event->remove, memcg_event_remove);
> -
> -	efile = fdget(efd);
> -	if (!efile.file) {
> -		ret = -EBADF;
> -		goto out_kfree;
> -	}
> -
> -	event->eventfd = eventfd_ctx_fileget(efile.file);
> -	if (IS_ERR(event->eventfd)) {
> -		ret = PTR_ERR(event->eventfd);
> -		goto out_put_efile;
> -	}
> -
> -	cfile = fdget(cfd);
> -	if (!cfile.file) {
> -		ret = -EBADF;
> -		goto out_put_eventfd;
> -	}
> -
> -	/* the process need read permission on control file */
> -	/* AV: shouldn't we check that it's been opened for read instead? */
> -	ret = file_permission(cfile.file, MAY_READ);
> -	if (ret < 0)
> -		goto out_put_cfile;
> -
> -	/*
> -	 * The control file must be a regular cgroup1 file. As a regular cgroup
> -	 * file can't be renamed, it's safe to access its name afterwards.
> -	 */
> -	cdentry = cfile.file->f_path.dentry;
> -	if (cdentry->d_sb->s_type != &cgroup_fs_type || !d_is_reg(cdentry)) {
> -		ret = -EINVAL;
> -		goto out_put_cfile;
> -	}
> -
> -	/*
> -	 * Determine the event callbacks and set them in @event.  This used
> -	 * to be done via struct cftype but cgroup core no longer knows
> -	 * about these events.  The following is crude but the whole thing
> -	 * is for compatibility anyway.
> -	 *
> -	 * DO NOT ADD NEW FILES.
> -	 */
> -	name = cdentry->d_name.name;
> -
> -	if (!strcmp(name, "memory.usage_in_bytes")) {
> -		event->register_event = mem_cgroup_usage_register_event;
> -		event->unregister_event = mem_cgroup_usage_unregister_event;
> -	} else if (!strcmp(name, "memory.oom_control")) {
> -		event->register_event = mem_cgroup_oom_register_event;
> -		event->unregister_event = mem_cgroup_oom_unregister_event;
> -	} else if (!strcmp(name, "memory.pressure_level")) {
> -		event->register_event = vmpressure_register_event;
> -		event->unregister_event = vmpressure_unregister_event;
> -	} else if (!strcmp(name, "memory.memsw.usage_in_bytes")) {
> -		event->register_event = memsw_cgroup_usage_register_event;
> -		event->unregister_event = memsw_cgroup_usage_unregister_event;
> -	} else {
> -		ret = -EINVAL;
> -		goto out_put_cfile;
> -	}
> -
> -	/*
> -	 * Verify @cfile should belong to @css.  Also, remaining events are
> -	 * automatically removed on cgroup destruction but the removal is
> -	 * asynchronous, so take an extra ref on @css.
> -	 */
> -	cfile_css = css_tryget_online_from_dir(cdentry->d_parent,
> -					       &memory_cgrp_subsys);
> -	ret = -EINVAL;
> -	if (IS_ERR(cfile_css))
> -		goto out_put_cfile;
> -	if (cfile_css != css) {
> -		css_put(cfile_css);
> -		goto out_put_cfile;
> -	}
> -
> -	ret = event->register_event(memcg, event->eventfd, buf);
> -	if (ret)
> -		goto out_put_css;
> -
> -	vfs_poll(efile.file, &event->pt);
> -
> -	spin_lock_irq(&memcg->event_list_lock);
> -	list_add(&event->list, &memcg->event_list);
> -	spin_unlock_irq(&memcg->event_list_lock);
> -
> -	fdput(cfile);
> -	fdput(efile);
> -
> -	return nbytes;
> -
> -out_put_css:
> -	css_put(css);
> -out_put_cfile:
> -	fdput(cfile);
> -out_put_eventfd:
> -	eventfd_ctx_put(event->eventfd);
> -out_put_efile:
> -	fdput(efile);
> -out_kfree:
> -	kfree(event);
> -
> -	return ret;
> -}
> -
>  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_SLUB_DEBUG)
>  static int mem_cgroup_slab_show(struct seq_file *m, void *p)
>  {
> @@ -5314,19 +4648,8 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
>  static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> -	struct mem_cgroup_event *event, *tmp;
>  
> -	/*
> -	 * Unregister events and notify userspace.
> -	 * Notify userspace about cgroup removing only after rmdir of cgroup
> -	 * directory to avoid race between userspace and kernelspace.
> -	 */
> -	spin_lock_irq(&memcg->event_list_lock);
> -	list_for_each_entry_safe(event, tmp, &memcg->event_list, list) {
> -		list_del_init(&event->list);
> -		schedule_work(&event->remove);
> -	}
> -	spin_unlock_irq(&memcg->event_list_lock);
> +	memcg1_css_offline(memcg);
>  
>  	page_counter_set_min(&memcg->memory, 0);
>  	page_counter_set_low(&memcg->memory, 0);
> -- 
> 2.45.2
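
A side note on the moved __mem_cgroup_threshold(): because the entries are
kept sorted and current_threshold caches the index of the largest threshold
that was <= usage, each call only walks the entries actually crossed since
the previous call. A small userspace model of that scan, purely to illustrate
the invariant (the names are mine; the kernel calls eventfd_signal() where
this calls notify()):

struct thresh {
	unsigned long threshold;
};

static void notify(struct thresh *t)
{
	/* stands in for eventfd_signal() */
	(void)t;
}

/* entries[] is sorted ascending; cur is the previous current_threshold */
static int rescan(struct thresh *entries, int size, int cur,
		  unsigned long usage)
{
	int i;

	/* usage dropped: signal thresholds that are now above usage */
	for (i = cur; i >= 0 && entries[i].threshold > usage; i--)
		notify(&entries[i]);

	/* usage grew: signal thresholds that are now at or below usage */
	for (i++; i < size && entries[i].threshold <= usage; i++)
		notify(&entries[i]);

	/* the largest index with threshold <= usage, or -1 */
	return i - 1;
}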

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 07/14] mm: memcg: rename memcg_check_events()
  2024-06-25  0:58 ` [PATCH v2 07/14] mm: memcg: rename memcg_check_events() Roman Gushchin
@ 2024-06-25  7:08   ` Michal Hocko
  0 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2024-06-25  7:08 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm

On Mon 24-06-24 17:58:59, Roman Gushchin wrote:
> Rename memcg_check_events() to memcg1_check_events() for
> consistency with other cgroup v1-specific functions.
> 
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol-v1.c | 6 +++---
>  mm/memcontrol-v1.h | 2 +-
>  mm/memcontrol.c    | 8 ++++----
>  3 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index 4b2290ceace6..d7b5c4c14732 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -835,9 +835,9 @@ static int mem_cgroup_move_account(struct folio *folio,
>  
>  	local_irq_disable();
>  	mem_cgroup_charge_statistics(to, nr_pages);
> -	memcg_check_events(to, nid);
> +	memcg1_check_events(to, nid);
>  	mem_cgroup_charge_statistics(from, -nr_pages);
> -	memcg_check_events(from, nid);
> +	memcg1_check_events(from, nid);
>  	local_irq_enable();
>  out:
>  	return ret;
> @@ -1424,7 +1424,7 @@ static void mem_cgroup_threshold(struct mem_cgroup *memcg)
>   * Check events in order.
>   *
>   */
> -void memcg_check_events(struct mem_cgroup *memcg, int nid)
> +void memcg1_check_events(struct mem_cgroup *memcg, int nid)
>  {
>  	if (IS_ENABLED(CONFIG_PREEMPT_RT))
>  		return;
> diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
> index 524a2c76ffc9..ef1b7037cbdc 100644
> --- a/mm/memcontrol-v1.h
> +++ b/mm/memcontrol-v1.h
> @@ -12,7 +12,7 @@ static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg)
>  }
>  
>  void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages);
> -void memcg_check_events(struct mem_cgroup *memcg, int nid);
> +void memcg1_check_events(struct mem_cgroup *memcg, int nid);
>  void memcg_oom_recover(struct mem_cgroup *memcg);
>  int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		     unsigned int nr_pages);
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index bd4b26a73596..92fb72bbd494 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2632,7 +2632,7 @@ void mem_cgroup_commit_charge(struct folio *folio, struct mem_cgroup *memcg)
>  
>  	local_irq_disable();
>  	mem_cgroup_charge_statistics(memcg, folio_nr_pages(folio));
> -	memcg_check_events(memcg, folio_nid(folio));
> +	memcg1_check_events(memcg, folio_nid(folio));
>  	local_irq_enable();
>  }
>  
> @@ -5697,7 +5697,7 @@ static void uncharge_batch(const struct uncharge_gather *ug)
>  	local_irq_save(flags);
>  	__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
>  	__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
> -	memcg_check_events(ug->memcg, ug->nid);
> +	memcg1_check_events(ug->memcg, ug->nid);
>  	local_irq_restore(flags);
>  
>  	/* drop reference from uncharge_folio */
> @@ -5836,7 +5836,7 @@ void mem_cgroup_replace_folio(struct folio *old, struct folio *new)
>  
>  	local_irq_save(flags);
>  	mem_cgroup_charge_statistics(memcg, nr_pages);
> -	memcg_check_events(memcg, folio_nid(new));
> +	memcg1_check_events(memcg, folio_nid(new));
>  	local_irq_restore(flags);
>  }
>  
> @@ -6104,7 +6104,7 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry)
>  	memcg_stats_lock();
>  	mem_cgroup_charge_statistics(memcg, -nr_entries);
>  	memcg_stats_unlock();
> -	memcg_check_events(memcg, folio_nid(folio));
> +	memcg1_check_events(memcg, folio_nid(folio));
>  
>  	css_put(&memcg->css);
>  }
> -- 
> 2.45.2

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 08/14] mm: memcg: move cgroup v1 oom handling code into memcontrol-v1.c
  2024-06-25  0:59 ` [PATCH v2 08/14] mm: memcg: move cgroup v1 oom handling code into memcontrol-v1.c Roman Gushchin
@ 2024-06-25  7:08   ` Michal Hocko
  0 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2024-06-25  7:08 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm

On Mon 24-06-24 17:59:00, Roman Gushchin wrote:
> Cgroup v1 supports a complicated mechanism for handling OOM in userspace,
> which is not supported by cgroup v2. Let's move the corresponding code
> into memcontrol-v1.c.
> 
> Aside from the mechanical code movement, this patch introduces two new
> functions: memcg1_oom_prepare() and memcg1_oom_finish(). They implement
> the cgroup v1-specific parts of the common memcg OOM handling path.
> 
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>

Acked-by: Michal Hocko <mhocko@suse.com>
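
For context on what the new memcg1_oom_prepare()/memcg1_oom_finish() pair
backs: userspace pairs memory.oom_control with an eventfd. Writing 1 to
memory.oom_control disables the in-kernel killer, so tasks that hit the
limit from a page fault block in mem_cgroup_oom_synchronize() until
userspace resolves the situation; registering the file via
cgroup.event_control delivers the OOM notifications. A rough sketch, with
an assumed v1 mount and a made-up "test" group:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/eventfd.h>

#define GRP "/sys/fs/cgroup/memory/test/"

int main(void)
{
	int efd = eventfd(0, 0);
	int ofd = open(GRP "memory.oom_control", O_RDWR);
	int ctl = open(GRP "cgroup.event_control", O_WRONLY);
	char buf[32];
	uint64_t cnt;

	if (efd < 0 || ofd < 0 || ctl < 0)
		return 1;

	/* hand OOM handling to us; faulting tasks block instead of dying */
	if (write(ofd, "1", 1) < 0)
		return 1;

	/* "<event_fd> <control_fd>" -- no args for oom_control */
	snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
	if (write(ctl, buf, strlen(buf)) < 0)
		return 1;

	while (read(efd, &cnt, sizeof(cnt)) == sizeof(cnt)) {
		/* resolve it: raise the limit, kill or migrate a task, ... */
		printf("group is under OOM\n");
	}
	return 0;
}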

> ---
>  mm/memcontrol-v1.c | 229 ++++++++++++++++++++++++++++++++++++++++++++-
>  mm/memcontrol-v1.h |   3 +-
>  mm/memcontrol.c    | 216 +-----------------------------------------
>  3 files changed, 231 insertions(+), 217 deletions(-)
> 
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index d7b5c4c14732..253d49d5fb12 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -110,7 +110,13 @@ struct mem_cgroup_event {
>  	struct work_struct remove;
>  };
>  
> -extern spinlock_t memcg_oom_lock;
> +#ifdef CONFIG_LOCKDEP
> +static struct lockdep_map memcg_oom_lock_dep_map = {
> +	.name = "memcg_oom_lock",
> +};
> +#endif
> +
> +DEFINE_SPINLOCK(memcg_oom_lock);
>  
>  static void __mem_cgroup_insert_exceeded(struct mem_cgroup_per_node *mz,
>  					 struct mem_cgroup_tree_per_node *mctz,
> @@ -1469,7 +1475,7 @@ static int mem_cgroup_oom_notify_cb(struct mem_cgroup *memcg)
>  	return 0;
>  }
>  
> -void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
> +static void mem_cgroup_oom_notify(struct mem_cgroup *memcg)
>  {
>  	struct mem_cgroup *iter;
>  
> @@ -1959,6 +1965,225 @@ void memcg1_css_offline(struct mem_cgroup *memcg)
>  	spin_unlock_irq(&memcg->event_list_lock);
>  }
>  
> +/*
> + * Check whether the OOM killer is already running under our hierarchy.
> + * If someone is running it, return false.
> + */
> +static bool mem_cgroup_oom_trylock(struct mem_cgroup *memcg)
> +{
> +	struct mem_cgroup *iter, *failed = NULL;
> +
> +	spin_lock(&memcg_oom_lock);
> +
> +	for_each_mem_cgroup_tree(iter, memcg) {
> +		if (iter->oom_lock) {
> +			/*
> +			 * this subtree of our hierarchy is already locked
> +			 * so we cannot give a lock.
> +			 */
> +			failed = iter;
> +			mem_cgroup_iter_break(memcg, iter);
> +			break;
> +		} else
> +			iter->oom_lock = true;
> +	}
> +
> +	if (failed) {
> +		/*
> +		 * OK, we failed to lock the whole subtree so we have
> +		 * to clean up what we set up to the failing subtree
> +		 */
> +		for_each_mem_cgroup_tree(iter, memcg) {
> +			if (iter == failed) {
> +				mem_cgroup_iter_break(memcg, iter);
> +				break;
> +			}
> +			iter->oom_lock = false;
> +		}
> +	} else
> +		mutex_acquire(&memcg_oom_lock_dep_map, 0, 1, _RET_IP_);
> +
> +	spin_unlock(&memcg_oom_lock);
> +
> +	return !failed;
> +}
> +
> +static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
> +{
> +	struct mem_cgroup *iter;
> +
> +	spin_lock(&memcg_oom_lock);
> +	mutex_release(&memcg_oom_lock_dep_map, _RET_IP_);
> +	for_each_mem_cgroup_tree(iter, memcg)
> +		iter->oom_lock = false;
> +	spin_unlock(&memcg_oom_lock);
> +}
> +
> +static void mem_cgroup_mark_under_oom(struct mem_cgroup *memcg)
> +{
> +	struct mem_cgroup *iter;
> +
> +	spin_lock(&memcg_oom_lock);
> +	for_each_mem_cgroup_tree(iter, memcg)
> +		iter->under_oom++;
> +	spin_unlock(&memcg_oom_lock);
> +}
> +
> +static void mem_cgroup_unmark_under_oom(struct mem_cgroup *memcg)
> +{
> +	struct mem_cgroup *iter;
> +
> +	/*
> +	 * Be careful about under_oom underflows because a child memcg
> +	 * could have been added after mem_cgroup_mark_under_oom.
> +	 */
> +	spin_lock(&memcg_oom_lock);
> +	for_each_mem_cgroup_tree(iter, memcg)
> +		if (iter->under_oom > 0)
> +			iter->under_oom--;
> +	spin_unlock(&memcg_oom_lock);
> +}
> +
> +static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);
> +
> +struct oom_wait_info {
> +	struct mem_cgroup *memcg;
> +	wait_queue_entry_t	wait;
> +};
> +
> +static int memcg_oom_wake_function(wait_queue_entry_t *wait,
> +	unsigned mode, int sync, void *arg)
> +{
> +	struct mem_cgroup *wake_memcg = (struct mem_cgroup *)arg;
> +	struct mem_cgroup *oom_wait_memcg;
> +	struct oom_wait_info *oom_wait_info;
> +
> +	oom_wait_info = container_of(wait, struct oom_wait_info, wait);
> +	oom_wait_memcg = oom_wait_info->memcg;
> +
> +	if (!mem_cgroup_is_descendant(wake_memcg, oom_wait_memcg) &&
> +	    !mem_cgroup_is_descendant(oom_wait_memcg, wake_memcg))
> +		return 0;
> +	return autoremove_wake_function(wait, mode, sync, arg);
> +}
> +
> +void memcg_oom_recover(struct mem_cgroup *memcg)
> +{
> +	/*
> +	 * For the following lockless ->under_oom test, the only required
> +	 * guarantee is that it must see the state asserted by an OOM when
> +	 * this function is called as a result of userland actions
> +	 * triggered by the notification of the OOM.  This is trivially
> +	 * achieved by invoking mem_cgroup_mark_under_oom() before
> +	 * triggering notification.
> +	 */
> +	if (memcg && memcg->under_oom)
> +		__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
> +}
> +
> +/**
> + * mem_cgroup_oom_synchronize - complete memcg OOM handling
> + * @handle: actually kill/wait or just clean up the OOM state
> + *
> + * This has to be called at the end of a page fault if the memcg OOM
> + * handler was enabled.
> + *
> + * Memcg supports userspace OOM handling where failed allocations must
> + * sleep on a waitqueue until the userspace task resolves the
> + * situation.  Sleeping directly in the charge context with all kinds
> + * of locks held is not a good idea, instead we remember an OOM state
> + * in the task and mem_cgroup_oom_synchronize() has to be called at
> + * the end of the page fault to complete the OOM handling.
> + *
> + * Returns %true if an ongoing memcg OOM situation was detected and
> + * completed, %false otherwise.
> + */
> +bool mem_cgroup_oom_synchronize(bool handle)
> +{
> +	struct mem_cgroup *memcg = current->memcg_in_oom;
> +	struct oom_wait_info owait;
> +	bool locked;
> +
> +	/* OOM is global, do not handle */
> +	if (!memcg)
> +		return false;
> +
> +	if (!handle)
> +		goto cleanup;
> +
> +	owait.memcg = memcg;
> +	owait.wait.flags = 0;
> +	owait.wait.func = memcg_oom_wake_function;
> +	owait.wait.private = current;
> +	INIT_LIST_HEAD(&owait.wait.entry);
> +
> +	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
> +	mem_cgroup_mark_under_oom(memcg);
> +
> +	locked = mem_cgroup_oom_trylock(memcg);
> +
> +	if (locked)
> +		mem_cgroup_oom_notify(memcg);
> +
> +	schedule();
> +	mem_cgroup_unmark_under_oom(memcg);
> +	finish_wait(&memcg_oom_waitq, &owait.wait);
> +
> +	if (locked)
> +		mem_cgroup_oom_unlock(memcg);
> +cleanup:
> +	current->memcg_in_oom = NULL;
> +	css_put(&memcg->css);
> +	return true;
> +}
> +
> +
> +bool memcg1_oom_prepare(struct mem_cgroup *memcg, bool *locked)
> +{
> +	/*
> +	 * We are in the middle of the charge context here, so we
> +	 * don't want to block when potentially sitting on a callstack
> +	 * that holds all kinds of filesystem and mm locks.
> +	 *
> +	 * cgroup1 allows disabling the OOM killer and waiting for outside
> +	 * handling until the charge can succeed; remember the context and put
> +	 * the task to sleep at the end of the page fault when all locks are
> +	 * released.
> +	 *
> +	 * On the other hand, in-kernel OOM killer allows for an async victim
> +	 * memory reclaim (oom_reaper) and that means that we are not solely
> +	 * relying on the oom victim to make a forward progress and we can
> +	 * invoke the oom killer here.
> +	 *
> +	 * Please note that mem_cgroup_out_of_memory might fail to find a
> +	 * victim and then we have to bail out from the charge path.
> +	 */
> +	if (READ_ONCE(memcg->oom_kill_disable)) {
> +		if (current->in_user_fault) {
> +			css_get(&memcg->css);
> +			current->memcg_in_oom = memcg;
> +		}
> +		return false;
> +	}
> +
> +	mem_cgroup_mark_under_oom(memcg);
> +
> +	*locked = mem_cgroup_oom_trylock(memcg);
> +
> +	if (*locked)
> +		mem_cgroup_oom_notify(memcg);
> +
> +	mem_cgroup_unmark_under_oom(memcg);
> +
> +	return true;
> +}
> +
> +void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked)
> +{
> +	if (locked)
> +		mem_cgroup_oom_unlock(memcg);
> +}
> +
>  static int __init memcg1_init(void)
>  {
>  	int node;
> diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
> index ef1b7037cbdc..3de956b2422f 100644
> --- a/mm/memcontrol-v1.h
> +++ b/mm/memcontrol-v1.h
> @@ -87,9 +87,10 @@ enum res_type {
>  bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
>  				enum mem_cgroup_events_target target);
>  unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap);
> -void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
>  ssize_t memcg_write_event_control(struct kernfs_open_file *of,
>  				  char *buf, size_t nbytes, loff_t off);
>  
> +bool memcg1_oom_prepare(struct mem_cgroup *memcg, bool *locked);
> +void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked);
>  
>  #endif	/* __MM_MEMCONTROL_V1_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 92fb72bbd494..8abd364ac837 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1616,130 +1616,6 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  	return ret;
>  }
>  
> -#ifdef CONFIG_LOCKDEP
> -static struct lockdep_map memcg_oom_lock_dep_map = {
> -	.name = "memcg_oom_lock",
> -};
> -#endif
> -
> -DEFINE_SPINLOCK(memcg_oom_lock);
> -
> -/*
> - * Check OOM-Killer is already running under our hierarchy.
> - * If someone is running, return false.
> - */
> -static bool mem_cgroup_oom_trylock(struct mem_cgroup *memcg)
> -{
> -	struct mem_cgroup *iter, *failed = NULL;
> -
> -	spin_lock(&memcg_oom_lock);
> -
> -	for_each_mem_cgroup_tree(iter, memcg) {
> -		if (iter->oom_lock) {
> -			/*
> -			 * this subtree of our hierarchy is already locked
> -			 * so we cannot give a lock.
> -			 */
> -			failed = iter;
> -			mem_cgroup_iter_break(memcg, iter);
> -			break;
> -		} else
> -			iter->oom_lock = true;
> -	}
> -
> -	if (failed) {
> -		/*
> -		 * OK, we failed to lock the whole subtree so we have
> -		 * to clean up what we set up to the failing subtree
> -		 */
> -		for_each_mem_cgroup_tree(iter, memcg) {
> -			if (iter == failed) {
> -				mem_cgroup_iter_break(memcg, iter);
> -				break;
> -			}
> -			iter->oom_lock = false;
> -		}
> -	} else
> -		mutex_acquire(&memcg_oom_lock_dep_map, 0, 1, _RET_IP_);
> -
> -	spin_unlock(&memcg_oom_lock);
> -
> -	return !failed;
> -}
> -
> -static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
> -{
> -	struct mem_cgroup *iter;
> -
> -	spin_lock(&memcg_oom_lock);
> -	mutex_release(&memcg_oom_lock_dep_map, _RET_IP_);
> -	for_each_mem_cgroup_tree(iter, memcg)
> -		iter->oom_lock = false;
> -	spin_unlock(&memcg_oom_lock);
> -}
> -
> -static void mem_cgroup_mark_under_oom(struct mem_cgroup *memcg)
> -{
> -	struct mem_cgroup *iter;
> -
> -	spin_lock(&memcg_oom_lock);
> -	for_each_mem_cgroup_tree(iter, memcg)
> -		iter->under_oom++;
> -	spin_unlock(&memcg_oom_lock);
> -}
> -
> -static void mem_cgroup_unmark_under_oom(struct mem_cgroup *memcg)
> -{
> -	struct mem_cgroup *iter;
> -
> -	/*
> -	 * Be careful about under_oom underflows because a child memcg
> -	 * could have been added after mem_cgroup_mark_under_oom.
> -	 */
> -	spin_lock(&memcg_oom_lock);
> -	for_each_mem_cgroup_tree(iter, memcg)
> -		if (iter->under_oom > 0)
> -			iter->under_oom--;
> -	spin_unlock(&memcg_oom_lock);
> -}
> -
> -static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);
> -
> -struct oom_wait_info {
> -	struct mem_cgroup *memcg;
> -	wait_queue_entry_t	wait;
> -};
> -
> -static int memcg_oom_wake_function(wait_queue_entry_t *wait,
> -	unsigned mode, int sync, void *arg)
> -{
> -	struct mem_cgroup *wake_memcg = (struct mem_cgroup *)arg;
> -	struct mem_cgroup *oom_wait_memcg;
> -	struct oom_wait_info *oom_wait_info;
> -
> -	oom_wait_info = container_of(wait, struct oom_wait_info, wait);
> -	oom_wait_memcg = oom_wait_info->memcg;
> -
> -	if (!mem_cgroup_is_descendant(wake_memcg, oom_wait_memcg) &&
> -	    !mem_cgroup_is_descendant(oom_wait_memcg, wake_memcg))
> -		return 0;
> -	return autoremove_wake_function(wait, mode, sync, arg);
> -}
> -
> -void memcg_oom_recover(struct mem_cgroup *memcg)
> -{
> -	/*
> -	 * For the following lockless ->under_oom test, the only required
> -	 * guarantee is that it must see the state asserted by an OOM when
> -	 * this function is called as a result of userland actions
> -	 * triggered by the notification of the OOM.  This is trivially
> -	 * achieved by invoking mem_cgroup_mark_under_oom() before
> -	 * triggering notification.
> -	 */
> -	if (memcg && memcg->under_oom)
> -		__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
> -}
> -
>  /*
>   * Returns true if successfully killed one or more processes. Though in some
>   * corner cases it can return true even without killing any process.
> @@ -1753,104 +1629,16 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  
>  	memcg_memory_event(memcg, MEMCG_OOM);
>  
> -	/*
> -	 * We are in the middle of the charge context here, so we
> -	 * don't want to block when potentially sitting on a callstack
> -	 * that holds all kinds of filesystem and mm locks.
> -	 *
> -	 * cgroup1 allows disabling the OOM killer and waiting for outside
> -	 * handling until the charge can succeed; remember the context and put
> -	 * the task to sleep at the end of the page fault when all locks are
> -	 * released.
> -	 *
> -	 * On the other hand, in-kernel OOM killer allows for an async victim
> -	 * memory reclaim (oom_reaper) and that means that we are not solely
> -	 * relying on the oom victim to make forward progress and we can
> -	 * invoke the oom killer here.
> -	 *
> -	 * Please note that mem_cgroup_out_of_memory might fail to find a
> -	 * victim and then we have to bail out from the charge path.
> -	 */
> -	if (READ_ONCE(memcg->oom_kill_disable)) {
> -		if (current->in_user_fault) {
> -			css_get(&memcg->css);
> -			current->memcg_in_oom = memcg;
> -		}
> +	if (!memcg1_oom_prepare(memcg, &locked))
>  		return false;
> -	}
> -
> -	mem_cgroup_mark_under_oom(memcg);
>  
> -	locked = mem_cgroup_oom_trylock(memcg);
> -
> -	if (locked)
> -		mem_cgroup_oom_notify(memcg);
> -
> -	mem_cgroup_unmark_under_oom(memcg);
>  	ret = mem_cgroup_out_of_memory(memcg, mask, order);
>  
> -	if (locked)
> -		mem_cgroup_oom_unlock(memcg);
> +	memcg1_oom_finish(memcg, locked);
>  
>  	return ret;
>  }
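
For readability, this is the shape of mem_cgroup_oom() once the hunk
applies: all v1-only bookkeeping now sits behind the two memcg1_*
calls. (Reconstructed from the diff; the function's early-exit checks
live in elided context lines and are omitted here.)

static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
{
	bool locked;
	bool ret;

	memcg_memory_event(memcg, MEMCG_OOM);

	/* v1 may arm userspace OOM handling; bail from the charge path */
	if (!memcg1_oom_prepare(memcg, &locked))
		return false;

	ret = mem_cgroup_out_of_memory(memcg, mask, order);

	memcg1_oom_finish(memcg, locked);

	return ret;
}
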
>  
> -/**
> - * mem_cgroup_oom_synchronize - complete memcg OOM handling
> - * @handle: actually kill/wait or just clean up the OOM state
> - *
> - * This has to be called at the end of a page fault if the memcg OOM
> - * handler was enabled.
> - *
> - * Memcg supports userspace OOM handling where failed allocations must
> - * sleep on a waitqueue until the userspace task resolves the
> - * situation.  Sleeping directly in the charge context with all kinds
> - * of locks held is not a good idea, instead we remember an OOM state
> - * in the task and mem_cgroup_oom_synchronize() has to be called at
> - * the end of the page fault to complete the OOM handling.
> - *
> - * Returns %true if an ongoing memcg OOM situation was detected and
> - * completed, %false otherwise.
> - */
> -bool mem_cgroup_oom_synchronize(bool handle)
> -{
> -	struct mem_cgroup *memcg = current->memcg_in_oom;
> -	struct oom_wait_info owait;
> -	bool locked;
> -
> -	/* OOM is global, do not handle */
> -	if (!memcg)
> -		return false;
> -
> -	if (!handle)
> -		goto cleanup;
> -
> -	owait.memcg = memcg;
> -	owait.wait.flags = 0;
> -	owait.wait.func = memcg_oom_wake_function;
> -	owait.wait.private = current;
> -	INIT_LIST_HEAD(&owait.wait.entry);
> -
> -	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
> -	mem_cgroup_mark_under_oom(memcg);
> -
> -	locked = mem_cgroup_oom_trylock(memcg);
> -
> -	if (locked)
> -		mem_cgroup_oom_notify(memcg);
> -
> -	schedule();
> -	mem_cgroup_unmark_under_oom(memcg);
> -	finish_wait(&memcg_oom_waitq, &owait.wait);
> -
> -	if (locked)
> -		mem_cgroup_oom_unlock(memcg);
> -cleanup:
> -	current->memcg_in_oom = NULL;
> -	css_put(&memcg->css);
> -	return true;
> -}
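
Note the ordering in the removed mem_cgroup_oom_synchronize(): the
task registers on memcg_oom_waitq with prepare_to_wait() *before*
marking itself under OOM and sending notifications, so a wake-up that
arrives between the notification and schedule() cannot be lost. The
wake side, memcg_oom_wake_function(), filters hierarchically: a waiter
is only woken if its memcg and the waking memcg are in an
ancestor/descendant relationship in either direction.
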
> -
>  /**
>   * mem_cgroup_get_oom_group - get a memory cgroup to clean up after OOM
>   * @victim: task to be killed by the OOM killer
> -- 
> 2.45.2

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 09/14] mm: memcg: rename memcg_oom_recover()
  2024-06-25  0:59 ` [PATCH v2 09/14] mm: memcg: rename memcg_oom_recover() Roman Gushchin
@ 2024-06-25  7:08   ` Michal Hocko
  0 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2024-06-25  7:08 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm

On Mon 24-06-24 17:59:01, Roman Gushchin wrote:
> Rename memcg_oom_recover() to memcg1_oom_recover() for consistency
> with other memory cgroup v1-related functions.
> 
> Move its declaration in mm/memcontrol-v1.h next to the other
> memcg v1 oom handling functions.
> 
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol-v1.c | 6 +++---
>  mm/memcontrol-v1.h | 2 +-
>  mm/memcontrol.c    | 6 +++---
>  3 files changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index 253d49d5fb12..1d5608ee1606 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -1090,8 +1090,8 @@ static void __mem_cgroup_clear_mc(void)
>  
>  		mc.moved_swap = 0;
>  	}
> -	memcg_oom_recover(from);
> -	memcg_oom_recover(to);
> +	memcg1_oom_recover(from);
> +	memcg1_oom_recover(to);
>  	wake_up_all(&mc.waitq);
>  }
>  
> @@ -2067,7 +2067,7 @@ static int memcg_oom_wake_function(wait_queue_entry_t *wait,
>  	return autoremove_wake_function(wait, mode, sync, arg);
>  }
>  
> -void memcg_oom_recover(struct mem_cgroup *memcg)
> +void memcg1_oom_recover(struct mem_cgroup *memcg)
>  {
>  	/*
>  	 * For the following lockless ->under_oom test, the only required
> diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
> index 3de956b2422f..972c493a8ae3 100644
> --- a/mm/memcontrol-v1.h
> +++ b/mm/memcontrol-v1.h
> @@ -13,7 +13,6 @@ static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg)
>  
>  void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages);
>  void memcg1_check_events(struct mem_cgroup *memcg, int nid);
> -void memcg_oom_recover(struct mem_cgroup *memcg);
>  int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		     unsigned int nr_pages);
>  
> @@ -92,5 +91,6 @@ ssize_t memcg_write_event_control(struct kernfs_open_file *of,
>  
>  bool memcg1_oom_prepare(struct mem_cgroup *memcg, bool *locked);
>  void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked);
> +void memcg1_oom_recover(struct mem_cgroup *memcg);
>  
>  #endif	/* __MM_MEMCONTROL_V1_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8abd364ac837..37e0af5b26f3 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3167,7 +3167,7 @@ static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
>  	} while (true);
>  
>  	if (!ret && enlarge)
> -		memcg_oom_recover(memcg);
> +		memcg1_oom_recover(memcg);
>  
>  	return ret;
>  }
> @@ -3752,7 +3752,7 @@ static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
>  
>  	WRITE_ONCE(memcg->oom_kill_disable, val);
>  	if (!val)
> -		memcg_oom_recover(memcg);
> +		memcg1_oom_recover(memcg);
>  
>  	return 0;
>  }
> @@ -5479,7 +5479,7 @@ static void uncharge_batch(const struct uncharge_gather *ug)
>  			page_counter_uncharge(&ug->memcg->memsw, ug->nr_memory);
>  		if (ug->nr_kmem)
>  			memcg_account_kmem(ug->memcg, -ug->nr_kmem);
> -		memcg_oom_recover(ug->memcg);
> +		memcg1_oom_recover(ug->memcg);
>  	}
>  
>  	local_irq_save(flags);
> -- 
> 2.45.2
> 
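
The rename is mechanical, but it makes the v1-only nature of this path
visible at every call site: in mem_cgroup_resize_max(),
mem_cgroup_oom_control_write() and uncharge_batch() above, a reader can
now tell at a glance that the wake-up only matters for legacy
oom_control users.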

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 10/14] mm: memcg: move cgroup v1 interface files to memcontrol-v1.c
  2024-06-25  0:59 ` [PATCH v2 10/14] mm: memcg: move cgroup v1 interface files to memcontrol-v1.c Roman Gushchin
@ 2024-06-25  7:09   ` Michal Hocko
  0 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2024-06-25  7:09 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm

On Mon 24-06-24 17:59:02, Roman Gushchin wrote:
> Move legacy cgroup v1 memory controller interfaces and corresponding
> code into memcontrol-v1.c.
> 
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol-v1.c | 739 ++++++++++++++++++++++++++++++++++++++++++++-
>  mm/memcontrol-v1.h |  29 +-
>  mm/memcontrol.c    | 721 +------------------------------------------
>  3 files changed, 767 insertions(+), 722 deletions(-)
> 
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index 1d5608ee1606..1b7337d0170d 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -10,6 +10,7 @@
>  #include <linux/poll.h>
>  #include <linux/sort.h>
>  #include <linux/file.h>
> +#include <linux/seq_buf.h>
>  
>  #include "internal.h"
>  #include "swap.h"
> @@ -110,6 +111,18 @@ struct mem_cgroup_event {
>  	struct work_struct remove;
>  };
>  
> +#define MEMFILE_PRIVATE(x, val)	((x) << 16 | (val))
> +#define MEMFILE_TYPE(val)	((val) >> 16 & 0xffff)
> +#define MEMFILE_ATTR(val)	((val) & 0xffff)
> +
> +enum {
> +	RES_USAGE,
> +	RES_LIMIT,
> +	RES_MAX_USAGE,
> +	RES_FAILCNT,
> +	RES_SOFT_LIMIT,
> +};
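
The MEMFILE_* macros pack a resource type and an attribute into the
->private field of a cftype, so a handful of shared handlers can serve
all of the v1 files. A standalone round-trip check of the encoding (the
RES_* value matches the enum above; the res_type value 2 for _KMEM is
illustrative, the real one comes from enum res_type in
mm/memcontrol-v1.h):

#include <assert.h>

#define MEMFILE_PRIVATE(x, val)	((x) << 16 | (val))
#define MEMFILE_TYPE(val)	((val) >> 16 & 0xffff)
#define MEMFILE_ATTR(val)	((val) & 0xffff)

int main(void)
{
	int priv = MEMFILE_PRIVATE(2 /* e.g. _KMEM */, 1 /* RES_LIMIT */);

	assert(MEMFILE_TYPE(priv) == 2);	/* recovers the type */
	assert(MEMFILE_ATTR(priv) == 1);	/* recovers the attribute */

	return 0;
}
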
> +
>  #ifdef CONFIG_LOCKDEP
>  static struct lockdep_map memcg_oom_lock_dep_map = {
>  	.name = "memcg_oom_lock",
> @@ -577,14 +590,14 @@ static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
>  }
>  #endif
>  
> -u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css,
> +static u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css,
>  				struct cftype *cft)
>  {
>  	return mem_cgroup_from_css(css)->move_charge_at_immigrate;
>  }
>  
>  #ifdef CONFIG_MMU
> -int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
> +static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
>  				 struct cftype *cft, u64 val)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> @@ -606,7 +619,7 @@ int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
>  	return 0;
>  }
>  #else
> -int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
> +static int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
>  				 struct cftype *cft, u64 val)
>  {
>  	return -ENOSYS;
> @@ -1803,8 +1816,8 @@ static void memcg_event_ptable_queue_proc(struct file *file,
>   * Input must be in format '<event_fd> <control_fd> <args>'.
>   * Interpretation of args is defined by control file implementation.
>   */
> -ssize_t memcg_write_event_control(struct kernfs_open_file *of,
> -				  char *buf, size_t nbytes, loff_t off)
> +static ssize_t memcg_write_event_control(struct kernfs_open_file *of,
> +					 char *buf, size_t nbytes, loff_t off)
>  {
>  	struct cgroup_subsys_state *css = of_css(of);
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> @@ -2184,6 +2197,722 @@ void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked)
>  		mem_cgroup_oom_unlock(memcg);
>  }
>  
> +static DEFINE_MUTEX(memcg_max_mutex);
> +
> +static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
> +				 unsigned long max, bool memsw)
> +{
> +	bool enlarge = false;
> +	bool drained = false;
> +	int ret;
> +	bool limits_invariant;
> +	struct page_counter *counter = memsw ? &memcg->memsw : &memcg->memory;
> +
> +	do {
> +		if (signal_pending(current)) {
> +			ret = -EINTR;
> +			break;
> +		}
> +
> +		mutex_lock(&memcg_max_mutex);
> +		/*
> +		 * Make sure that the new limit (memsw or memory limit) doesn't
> +		 * break our basic invariant rule memory.max <= memsw.max.
> +		 */
> +		limits_invariant = memsw ? max >= READ_ONCE(memcg->memory.max) :
> +					   max <= memcg->memsw.max;
> +		if (!limits_invariant) {
> +			mutex_unlock(&memcg_max_mutex);
> +			ret = -EINVAL;
> +			break;
> +		}
> +		if (max > counter->max)
> +			enlarge = true;
> +		ret = page_counter_set_max(counter, max);
> +		mutex_unlock(&memcg_max_mutex);
> +
> +		if (!ret)
> +			break;
> +
> +		if (!drained) {
> +			drain_all_stock(memcg);
> +			drained = true;
> +			continue;
> +		}
> +
> +		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> +						  memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
> +			ret = -EBUSY;
> +			break;
> +		}
> +	} while (true);
> +
> +	if (!ret && enlarge)
> +		memcg1_oom_recover(memcg);
> +
> +	return ret;
> +}
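
A note on the retry protocol above: if page_counter_set_max() fails
because current usage exceeds the requested limit, the loop first
drains the per-CPU charge caches once and then falls back to direct
reclaim, giving up with -EBUSY only when try_to_free_mem_cgroup_pages()
stops making progress. The memory.max <= memsw.max invariant is
rechecked under memcg_max_mutex on every iteration, and a raised limit
wakes tasks sleeping in v1 OOM handling via memcg1_oom_recover().
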
> +
> +/*
> + * Reclaims as many pages from the given memcg as possible.
> + *
> + * Caller is responsible for holding css reference for memcg.
> + */
> +static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
> +{
> +	int nr_retries = MAX_RECLAIM_RETRIES;
> +
> +	/* we call try-to-free pages to make this cgroup empty */
> +	lru_add_drain_all();
> +
> +	drain_all_stock(memcg);
> +
> +	/* try to free all pages in this cgroup */
> +	while (nr_retries && page_counter_read(&memcg->memory)) {
> +		if (signal_pending(current))
> +			return -EINTR;
> +
> +		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> +						  MEMCG_RECLAIM_MAY_SWAP, NULL))
> +			nr_retries--;
> +	}
> +
> +	return 0;
> +}
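
mem_cgroup_force_empty() is bounded by MAX_RECLAIM_RETRIES rounds of
reclaim and stays interruptible (-EINTR on a pending signal), so it can
legitimately return 0 with a few hard-to-reclaim pages still charged.
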
> +
> +static ssize_t mem_cgroup_force_empty_write(struct kernfs_open_file *of,
> +					    char *buf, size_t nbytes,
> +					    loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +
> +	if (mem_cgroup_is_root(memcg))
> +		return -EINVAL;
> +	return mem_cgroup_force_empty(memcg) ?: nbytes;
> +}
> +
> +static u64 mem_cgroup_hierarchy_read(struct cgroup_subsys_state *css,
> +				     struct cftype *cft)
> +{
> +	return 1;
> +}
> +
> +static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css,
> +				      struct cftype *cft, u64 val)
> +{
> +	if (val == 1)
> +		return 0;
> +
> +	pr_warn_once("Non-hierarchical mode is deprecated. "
> +		     "Please report your usecase to linux-mm@kvack.org if you "
> +		     "depend on this functionality.\n");
> +
> +	return -EINVAL;
> +}
> +
> +static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
> +			       struct cftype *cft)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +	struct page_counter *counter;
> +
> +	switch (MEMFILE_TYPE(cft->private)) {
> +	case _MEM:
> +		counter = &memcg->memory;
> +		break;
> +	case _MEMSWAP:
> +		counter = &memcg->memsw;
> +		break;
> +	case _KMEM:
> +		counter = &memcg->kmem;
> +		break;
> +	case _TCP:
> +		counter = &memcg->tcpmem;
> +		break;
> +	default:
> +		BUG();
> +	}
> +
> +	switch (MEMFILE_ATTR(cft->private)) {
> +	case RES_USAGE:
> +		if (counter == &memcg->memory)
> +			return (u64)mem_cgroup_usage(memcg, false) * PAGE_SIZE;
> +		if (counter == &memcg->memsw)
> +			return (u64)mem_cgroup_usage(memcg, true) * PAGE_SIZE;
> +		return (u64)page_counter_read(counter) * PAGE_SIZE;
> +	case RES_LIMIT:
> +		return (u64)counter->max * PAGE_SIZE;
> +	case RES_MAX_USAGE:
> +		return (u64)counter->watermark * PAGE_SIZE;
> +	case RES_FAILCNT:
> +		return counter->failcnt;
> +	case RES_SOFT_LIMIT:
> +		return (u64)READ_ONCE(memcg->soft_limit) * PAGE_SIZE;
> +	default:
> +		BUG();
> +	}
> +}
> +
> +/*
> + * This function doesn't do anything useful. Its only job is to provide a read
> + * handler for a file so that cgroup_file_mode() will add read permissions.
> + */
> +static int mem_cgroup_dummy_seq_show(__always_unused struct seq_file *m,
> +				     __always_unused void *v)
> +{
> +	return -EINVAL;
> +}
> +
> +static int memcg_update_tcp_max(struct mem_cgroup *memcg, unsigned long max)
> +{
> +	int ret;
> +
> +	mutex_lock(&memcg_max_mutex);
> +
> +	ret = page_counter_set_max(&memcg->tcpmem, max);
> +	if (ret)
> +		goto out;
> +
> +	if (!memcg->tcpmem_active) {
> +		/*
> +		 * The active flag needs to be written after the static_key
> +		 * update. This is what guarantees that the socket activation
> +		 * function is the last one to run. See mem_cgroup_sk_alloc()
> +		 * for details, and note that we don't mark any socket as
> +		 * belonging to this memcg until that flag is up.
> +		 *
> +		 * We need to do this, because static_keys will span multiple
> +		 * sites, but we can't control their order. If we mark a socket
> +		 * as accounted, but the accounting functions are not patched in
> +		 * yet, we'll lose accounting.
> +		 *
> +		 * We never race with the readers in mem_cgroup_sk_alloc(),
> +		 * because when this value changes, the code to process it is not
> +		 * patched in yet.
> +		 */
> +		static_branch_inc(&memcg_sockets_enabled_key);
> +		memcg->tcpmem_active = true;
> +	}
> +out:
> +	mutex_unlock(&memcg_max_mutex);
> +	return ret;
> +}
> +
> +/*
> + * The only users of this function are the RES_LIMIT and RES_SOFT_LIMIT
> + * files.
> + */
> +static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
> +				char *buf, size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	unsigned long nr_pages;
> +	int ret;
> +
> +	buf = strstrip(buf);
> +	ret = page_counter_memparse(buf, "-1", &nr_pages);
> +	if (ret)
> +		return ret;
> +
> +	switch (MEMFILE_ATTR(of_cft(of)->private)) {
> +	case RES_LIMIT:
> +		if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */
> +			ret = -EINVAL;
> +			break;
> +		}
> +		switch (MEMFILE_TYPE(of_cft(of)->private)) {
> +		case _MEM:
> +			ret = mem_cgroup_resize_max(memcg, nr_pages, false);
> +			break;
> +		case _MEMSWAP:
> +			ret = mem_cgroup_resize_max(memcg, nr_pages, true);
> +			break;
> +		case _KMEM:
> +			pr_warn_once("kmem.limit_in_bytes is deprecated and will be removed. "
> +				     "Writing any value to this file has no effect. "
> +				     "Please report your usecase to linux-mm@kvack.org if you "
> +				     "depend on this functionality.\n");
> +			ret = 0;
> +			break;
> +		case _TCP:
> +			ret = memcg_update_tcp_max(memcg, nr_pages);
> +			break;
> +		}
> +		break;
> +	case RES_SOFT_LIMIT:
> +		if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> +			ret = -EOPNOTSUPP;
> +		} else {
> +			WRITE_ONCE(memcg->soft_limit, nr_pages);
> +			ret = 0;
> +		}
> +		break;
> +	}
> +	return ret ?: nbytes;
> +}
> +
> +static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf,
> +				size_t nbytes, loff_t off)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +	struct page_counter *counter;
> +
> +	switch (MEMFILE_TYPE(of_cft(of)->private)) {
> +	case _MEM:
> +		counter = &memcg->memory;
> +		break;
> +	case _MEMSWAP:
> +		counter = &memcg->memsw;
> +		break;
> +	case _KMEM:
> +		counter = &memcg->kmem;
> +		break;
> +	case _TCP:
> +		counter = &memcg->tcpmem;
> +		break;
> +	default:
> +		BUG();
> +	}
> +
> +	switch (MEMFILE_ATTR(of_cft(of)->private)) {
> +	case RES_MAX_USAGE:
> +		page_counter_reset_watermark(counter);
> +		break;
> +	case RES_FAILCNT:
> +		counter->failcnt = 0;
> +		break;
> +	default:
> +		BUG();
> +	}
> +
> +	return nbytes;
> +}
> +
> +#ifdef CONFIG_NUMA
> +
> +#define LRU_ALL_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE))
> +#define LRU_ALL_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON))
> +#define LRU_ALL	     ((1 << NR_LRU_LISTS) - 1)
> +
> +static int memcg_numa_stat_show(struct seq_file *m, void *v)
> +{
> +	struct numa_stat {
> +		const char *name;
> +		unsigned int lru_mask;
> +	};
> +
> +	static const struct numa_stat stats[] = {
> +		{ "total", LRU_ALL },
> +		{ "file", LRU_ALL_FILE },
> +		{ "anon", LRU_ALL_ANON },
> +		{ "unevictable", BIT(LRU_UNEVICTABLE) },
> +	};
> +	const struct numa_stat *stat;
> +	int nid;
> +	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> +
> +	mem_cgroup_flush_stats(memcg);
> +
> +	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
> +		seq_printf(m, "%s=%lu", stat->name,
> +			   mem_cgroup_nr_lru_pages(memcg, stat->lru_mask,
> +						   false));
> +		for_each_node_state(nid, N_MEMORY)
> +			seq_printf(m, " N%d=%lu", nid,
> +				   mem_cgroup_node_nr_lru_pages(memcg, nid,
> +							stat->lru_mask, false));
> +		seq_putc(m, '\n');
> +	}
> +
> +	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
> +
> +		seq_printf(m, "hierarchical_%s=%lu", stat->name,
> +			   mem_cgroup_nr_lru_pages(memcg, stat->lru_mask,
> +						   true));
> +		for_each_node_state(nid, N_MEMORY)
> +			seq_printf(m, " N%d=%lu", nid,
> +				   mem_cgroup_node_nr_lru_pages(memcg, nid,
> +							stat->lru_mask, true));
> +		seq_putc(m, '\n');
> +	}
> +
> +	return 0;
> +}
> +#endif /* CONFIG_NUMA */
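
For reference, memcg_numa_stat_show() emits one line per statistic in
the form "<name>=<total> N0=<pages> N1=<pages> ...", first with local
(non-hierarchical) counts and then again with a "hierarchical_" prefix
for subtree-wide counts; all values are page counts.
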
> +
> +static const unsigned int memcg1_stats[] = {
> +	NR_FILE_PAGES,
> +	NR_ANON_MAPPED,
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	NR_ANON_THPS,
> +#endif
> +	NR_SHMEM,
> +	NR_FILE_MAPPED,
> +	NR_FILE_DIRTY,
> +	NR_WRITEBACK,
> +	WORKINGSET_REFAULT_ANON,
> +	WORKINGSET_REFAULT_FILE,
> +#ifdef CONFIG_SWAP
> +	MEMCG_SWAP,
> +	NR_SWAPCACHE,
> +#endif
> +};
> +
> +static const char *const memcg1_stat_names[] = {
> +	"cache",
> +	"rss",
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	"rss_huge",
> +#endif
> +	"shmem",
> +	"mapped_file",
> +	"dirty",
> +	"writeback",
> +	"workingset_refault_anon",
> +	"workingset_refault_file",
> +#ifdef CONFIG_SWAP
> +	"swap",
> +	"swapcached",
> +#endif
> +};
> +
> +/* Universal VM events cgroup1 shows, original sort order */
> +static const unsigned int memcg1_events[] = {
> +	PGPGIN,
> +	PGPGOUT,
> +	PGFAULT,
> +	PGMAJFAULT,
> +};
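
memcg1_stats and memcg1_stat_names (including their #ifdef'ed entries)
must stay index-aligned: entry i of one array labels entry i of the
other. The BUILD_BUG_ON() at the top of memcg1_stat_format() below
enforces the length match at compile time.
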
> +
> +void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
> +{
> +	unsigned long memory, memsw;
> +	struct mem_cgroup *mi;
> +	unsigned int i;
> +
> +	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
> +
> +	mem_cgroup_flush_stats(memcg);
> +
> +	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
> +		unsigned long nr;
> +
> +		nr = memcg_page_state_local_output(memcg, memcg1_stats[i]);
> +		seq_buf_printf(s, "%s %lu\n", memcg1_stat_names[i], nr);
> +	}
> +
> +	for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
> +		seq_buf_printf(s, "%s %lu\n", vm_event_name(memcg1_events[i]),
> +			       memcg_events_local(memcg, memcg1_events[i]));
> +
> +	for (i = 0; i < NR_LRU_LISTS; i++)
> +		seq_buf_printf(s, "%s %lu\n", lru_list_name(i),
> +			       memcg_page_state_local(memcg, NR_LRU_BASE + i) *
> +			       PAGE_SIZE);
> +
> +	/* Hierarchical information */
> +	memory = memsw = PAGE_COUNTER_MAX;
> +	for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) {
> +		memory = min(memory, READ_ONCE(mi->memory.max));
> +		memsw = min(memsw, READ_ONCE(mi->memsw.max));
> +	}
> +	seq_buf_printf(s, "hierarchical_memory_limit %llu\n",
> +		       (u64)memory * PAGE_SIZE);
> +	seq_buf_printf(s, "hierarchical_memsw_limit %llu\n",
> +		       (u64)memsw * PAGE_SIZE);
> +
> +	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
> +		unsigned long nr;
> +
> +		nr = memcg_page_state_output(memcg, memcg1_stats[i]);
> +		seq_buf_printf(s, "total_%s %llu\n", memcg1_stat_names[i],
> +			       (u64)nr);
> +	}
> +
> +	for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
> +		seq_buf_printf(s, "total_%s %llu\n",
> +			       vm_event_name(memcg1_events[i]),
> +			       (u64)memcg_events(memcg, memcg1_events[i]));
> +
> +	for (i = 0; i < NR_LRU_LISTS; i++)
> +		seq_buf_printf(s, "total_%s %llu\n", lru_list_name(i),
> +			       (u64)memcg_page_state(memcg, NR_LRU_BASE + i) *
> +			       PAGE_SIZE);
> +
> +#ifdef CONFIG_DEBUG_VM
> +	{
> +		pg_data_t *pgdat;
> +		struct mem_cgroup_per_node *mz;
> +		unsigned long anon_cost = 0;
> +		unsigned long file_cost = 0;
> +
> +		for_each_online_pgdat(pgdat) {
> +			mz = memcg->nodeinfo[pgdat->node_id];
> +
> +			anon_cost += mz->lruvec.anon_cost;
> +			file_cost += mz->lruvec.file_cost;
> +		}
> +		seq_buf_printf(s, "anon_cost %lu\n", anon_cost);
> +		seq_buf_printf(s, "file_cost %lu\n", file_cost);
> +	}
> +#endif
> +}
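
The "hierarchical_memory_limit" printed above is the effective limit:
the minimum of memory.max over the cgroup and all of its ancestors
(likewise for memsw), scaled to bytes with PAGE_SIZE. A child can never
observe a hierarchical limit larger than any ancestor's.
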
> +
> +static u64 mem_cgroup_swappiness_read(struct cgroup_subsys_state *css,
> +				      struct cftype *cft)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +
> +	return mem_cgroup_swappiness(memcg);
> +}
> +
> +static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
> +				       struct cftype *cft, u64 val)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +
> +	if (val > MAX_SWAPPINESS)
> +		return -EINVAL;
> +
> +	if (!mem_cgroup_is_root(memcg))
> +		WRITE_ONCE(memcg->swappiness, val);
> +	else
> +		WRITE_ONCE(vm_swappiness, val);
> +
> +	return 0;
> +}
> +
> +static int mem_cgroup_oom_control_read(struct seq_file *sf, void *v)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_seq(sf);
> +
> +	seq_printf(sf, "oom_kill_disable %d\n", READ_ONCE(memcg->oom_kill_disable));
> +	seq_printf(sf, "under_oom %d\n", (bool)memcg->under_oom);
> +	seq_printf(sf, "oom_kill %lu\n",
> +		   atomic_long_read(&memcg->memory_events[MEMCG_OOM_KILL]));
> +	return 0;
> +}
> +
> +static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
> +	struct cftype *cft, u64 val)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +
> +	/* cannot set to root cgroup and only 0 and 1 are allowed */
> +	if (mem_cgroup_is_root(memcg) || !((val == 0) || (val == 1)))
> +		return -EINVAL;
> +
> +	WRITE_ONCE(memcg->oom_kill_disable, val);
> +	if (!val)
> +		memcg1_oom_recover(memcg);
> +
> +	return 0;
> +}
> +
> +#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_SLUB_DEBUG)
> +static int mem_cgroup_slab_show(struct seq_file *m, void *p)
> +{
> +	/*
> +	 * Deprecated.
> +	 * Please, take a look at tools/cgroup/memcg_slabinfo.py .
> +	 */
> +	return 0;
> +}
> +#endif
> +
> +struct cftype mem_cgroup_legacy_files[] = {
> +	{
> +		.name = "usage_in_bytes",
> +		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
> +		.read_u64 = mem_cgroup_read_u64,
> +	},
> +	{
> +		.name = "max_usage_in_bytes",
> +		.private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
> +		.write = mem_cgroup_reset,
> +		.read_u64 = mem_cgroup_read_u64,
> +	},
> +	{
> +		.name = "limit_in_bytes",
> +		.private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
> +		.write = mem_cgroup_write,
> +		.read_u64 = mem_cgroup_read_u64,
> +	},
> +	{
> +		.name = "soft_limit_in_bytes",
> +		.private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
> +		.write = mem_cgroup_write,
> +		.read_u64 = mem_cgroup_read_u64,
> +	},
> +	{
> +		.name = "failcnt",
> +		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
> +		.write = mem_cgroup_reset,
> +		.read_u64 = mem_cgroup_read_u64,
> +	},
> +	{
> +		.name = "stat",
> +		.seq_show = memory_stat_show,
> +	},
> +	{
> +		.name = "force_empty",
> +		.write = mem_cgroup_force_empty_write,
> +	},
> +	{
> +		.name = "use_hierarchy",
> +		.write_u64 = mem_cgroup_hierarchy_write,
> +		.read_u64 = mem_cgroup_hierarchy_read,
> +	},
> +	{
> +		.name = "cgroup.event_control",		/* XXX: for compat */
> +		.write = memcg_write_event_control,
> +		.flags = CFTYPE_NO_PREFIX | CFTYPE_WORLD_WRITABLE,
> +	},
> +	{
> +		.name = "swappiness",
> +		.read_u64 = mem_cgroup_swappiness_read,
> +		.write_u64 = mem_cgroup_swappiness_write,
> +	},
> +	{
> +		.name = "move_charge_at_immigrate",
> +		.read_u64 = mem_cgroup_move_charge_read,
> +		.write_u64 = mem_cgroup_move_charge_write,
> +	},
> +	{
> +		.name = "oom_control",
> +		.seq_show = mem_cgroup_oom_control_read,
> +		.write_u64 = mem_cgroup_oom_control_write,
> +	},
> +	{
> +		.name = "pressure_level",
> +		.seq_show = mem_cgroup_dummy_seq_show,
> +	},
> +#ifdef CONFIG_NUMA
> +	{
> +		.name = "numa_stat",
> +		.seq_show = memcg_numa_stat_show,
> +	},
> +#endif
> +	{
> +		.name = "kmem.limit_in_bytes",
> +		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
> +		.write = mem_cgroup_write,
> +		.read_u64 = mem_cgroup_read_u64,
> +	},
> +	{
> +		.name = "kmem.usage_in_bytes",
> +		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
> +		.read_u64 = mem_cgroup_read_u64,
> +	},
> +	{
> +		.name = "kmem.failcnt",
> +		.private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT),
> +		.write = mem_cgroup_reset,
> +		.read_u64 = mem_cgroup_read_u64,
> +	},
> +	{
> +		.name = "kmem.max_usage_in_bytes",
> +		.private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE),
> +		.write = mem_cgroup_reset,
> +		.read_u64 = mem_cgroup_read_u64,
> +	},
> +#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_SLUB_DEBUG)
> +	{
> +		.name = "kmem.slabinfo",
> +		.seq_show = mem_cgroup_slab_show,
> +	},
> +#endif
> +	{
> +		.name = "kmem.tcp.limit_in_bytes",
> +		.private = MEMFILE_PRIVATE(_TCP, RES_LIMIT),
> +		.write = mem_cgroup_write,
> +		.read_u64 = mem_cgroup_read_u64,
> +	},
> +	{
> +		.name = "kmem.tcp.usage_in_bytes",
> +		.private = MEMFILE_PRIVATE(_TCP, RES_USAGE),
> +		.read_u64 = mem_cgroup_read_u64,
> +	},
> +	{
> +		.name = "kmem.tcp.failcnt",
> +		.private = MEMFILE_PRIVATE(_TCP, RES_FAILCNT),
> +		.write = mem_cgroup_reset,
> +		.read_u64 = mem_cgroup_read_u64,
> +	},
> +	{
> +		.name = "kmem.tcp.max_usage_in_bytes",
> +		.private = MEMFILE_PRIVATE(_TCP, RES_MAX_USAGE),
> +		.write = mem_cgroup_reset,
> +		.read_u64 = mem_cgroup_read_u64,
> +	},
> +	{ },	/* terminate */
> +};
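
mem_cgroup_legacy_files[] is a sentinel-terminated cftype array: each
entry wires a v1 file name to its handlers, with MEMFILE_PRIVATE()
telling the shared mem_cgroup_read_u64()/mem_cgroup_write()/
mem_cgroup_reset() handlers which counter and attribute to act on.
Exporting the array through mm/memcontrol-v1.h (see the header hunk
below) lets the cgroup registration code in memcontrol.c stay put while
all the handlers move.
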
> +
> +struct cftype memsw_files[] = {
> +	{
> +		.name = "memsw.usage_in_bytes",
> +		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
> +		.read_u64 = mem_cgroup_read_u64,
> +	},
> +	{
> +		.name = "memsw.max_usage_in_bytes",
> +		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
> +		.write = mem_cgroup_reset,
> +		.read_u64 = mem_cgroup_read_u64,
> +	},
> +	{
> +		.name = "memsw.limit_in_bytes",
> +		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
> +		.write = mem_cgroup_write,
> +		.read_u64 = mem_cgroup_read_u64,
> +	},
> +	{
> +		.name = "memsw.failcnt",
> +		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
> +		.write = mem_cgroup_reset,
> +		.read_u64 = mem_cgroup_read_u64,
> +	},
> +	{ },	/* terminate */
> +};
> +
>  static int __init memcg1_init(void)
>  {
>  	int node;
> diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
> index 972c493a8ae3..7be4670d9abb 100644
> --- a/mm/memcontrol-v1.h
> +++ b/mm/memcontrol-v1.h
> @@ -3,6 +3,8 @@
>  #ifndef __MM_MEMCONTROL_V1_H
>  #define __MM_MEMCONTROL_V1_H
>  
> +#include <linux/cgroup-defs.h>
> +
>  void memcg1_update_tree(struct mem_cgroup *memcg, int nid);
>  void memcg1_remove_from_trees(struct mem_cgroup *memcg);
>  
> @@ -34,12 +36,6 @@ int memcg1_can_attach(struct cgroup_taskset *tset);
>  void memcg1_cancel_attach(struct cgroup_taskset *tset);
>  void memcg1_move_task(void);
>  
> -struct cftype;
> -u64 mem_cgroup_move_charge_read(struct cgroup_subsys_state *css,
> -				struct cftype *cft);
> -int mem_cgroup_move_charge_write(struct cgroup_subsys_state *css,
> -				 struct cftype *cft, u64 val);
> -
>  /*
>   * Per memcg event counter is incremented at every pagein/pageout. With THP,
>   * it will be incremented by the number of pages. This counter is used
> @@ -86,11 +82,28 @@ enum res_type {
>  bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
>  				enum mem_cgroup_events_target target);
>  unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap);
> -ssize_t memcg_write_event_control(struct kernfs_open_file *of,
> -				  char *buf, size_t nbytes, loff_t off);
>  
>  bool memcg1_oom_prepare(struct mem_cgroup *memcg, bool *locked);
>  void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked);
>  void memcg1_oom_recover(struct mem_cgroup *memcg);
>  
> +void drain_all_stock(struct mem_cgroup *root_memcg);
> +unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg,
> +				      unsigned int lru_mask, bool tree);
> +unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
> +					   int nid, unsigned int lru_mask,
> +					   bool tree);
> +
> +unsigned long memcg_events(struct mem_cgroup *memcg, int event);
> +unsigned long memcg_events_local(struct mem_cgroup *memcg, int event);
> +unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx);
> +unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
> +unsigned long memcg_page_state_local_output(struct mem_cgroup *memcg, int item);
> +int memory_stat_show(struct seq_file *m, void *v);
> +
> +void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s);
> +
> +extern struct cftype memsw_files[];
> +extern struct cftype mem_cgroup_legacy_files[];
> +
>  #endif	/* __MM_MEMCONTROL_V1_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 37e0af5b26f3..c7341e811945 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -96,10 +96,6 @@ static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
>  #define THRESHOLDS_EVENTS_TARGET 128
>  #define SOFTLIMIT_EVENTS_TARGET 1024
>  
> -#define MEMFILE_PRIVATE(x, val)	((x) << 16 | (val))
> -#define MEMFILE_TYPE(val)	((val) >> 16 & 0xffff)
> -#define MEMFILE_ATTR(val)	((val) & 0xffff)
> -
>  static inline bool task_is_dying(void)
>  {
>  	return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
> @@ -676,7 +672,7 @@ void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
>  }
>  
>  /* idx can be of type enum memcg_stat_item or node_stat_item. */
> -static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
> +unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
>  {
>  	long x;
>  	int i = memcg_stats_index(idx);
> @@ -825,7 +821,7 @@ void __count_memcg_events(struct mem_cgroup *memcg, enum vm_event_item idx,
>  	memcg_stats_unlock();
>  }
>  
> -static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
> +unsigned long memcg_events(struct mem_cgroup *memcg, int event)
>  {
>  	int i = memcg_events_index(event);
>  
> @@ -835,7 +831,7 @@ static unsigned long memcg_events(struct mem_cgroup *memcg, int event)
>  	return READ_ONCE(memcg->vmstats->events[i]);
>  }
>  
> -static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
> +unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)
>  {
>  	int i = memcg_events_index(event);
>  
> @@ -1420,15 +1416,13 @@ static int memcg_page_state_output_unit(int item)
>  	}
>  }
>  
> -static inline unsigned long memcg_page_state_output(struct mem_cgroup *memcg,
> -						    int item)
> +unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item)
>  {
>  	return memcg_page_state(memcg, item) *
>  		memcg_page_state_output_unit(item);
>  }
>  
> -static inline unsigned long memcg_page_state_local_output(
> -		struct mem_cgroup *memcg, int item)
> +unsigned long memcg_page_state_local_output(struct mem_cgroup *memcg, int item)
>  {
>  	return memcg_page_state_local(memcg, item) *
>  		memcg_page_state_output_unit(item);
> @@ -1487,8 +1481,6 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
>  	WARN_ON_ONCE(seq_buf_has_overflowed(s));
>  }
>  
> -static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s);
> -
>  static void memory_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
>  {
>  	if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
> @@ -1861,7 +1853,7 @@ static void refill_stock(struct mem_cgroup *memcg, unsigned int nr_pages)
>   * Drains all per-CPU charge caches for given root_memcg resp. subtree
>   * of the hierarchy under it.
>   */
> -static void drain_all_stock(struct mem_cgroup *root_memcg)
> +void drain_all_stock(struct mem_cgroup *root_memcg)
>  {
>  	int cpu, curcpu;
>  
> @@ -3115,120 +3107,6 @@ void split_page_memcg(struct page *head, int old_order, int new_order)
>  		css_get_many(&memcg->css, old_nr / new_nr - 1);
>  }
>  
> -
> -static DEFINE_MUTEX(memcg_max_mutex);
> -
> -static int mem_cgroup_resize_max(struct mem_cgroup *memcg,
> -				 unsigned long max, bool memsw)
> -{
> -	bool enlarge = false;
> -	bool drained = false;
> -	int ret;
> -	bool limits_invariant;
> -	struct page_counter *counter = memsw ? &memcg->memsw : &memcg->memory;
> -
> -	do {
> -		if (signal_pending(current)) {
> -			ret = -EINTR;
> -			break;
> -		}
> -
> -		mutex_lock(&memcg_max_mutex);
> -		/*
> -		 * Make sure that the new limit (memsw or memory limit) doesn't
> -		 * break our basic invariant rule memory.max <= memsw.max.
> -		 */
> -		limits_invariant = memsw ? max >= READ_ONCE(memcg->memory.max) :
> -					   max <= memcg->memsw.max;
> -		if (!limits_invariant) {
> -			mutex_unlock(&memcg_max_mutex);
> -			ret = -EINVAL;
> -			break;
> -		}
> -		if (max > counter->max)
> -			enlarge = true;
> -		ret = page_counter_set_max(counter, max);
> -		mutex_unlock(&memcg_max_mutex);
> -
> -		if (!ret)
> -			break;
> -
> -		if (!drained) {
> -			drain_all_stock(memcg);
> -			drained = true;
> -			continue;
> -		}
> -
> -		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> -					memsw ? 0 : MEMCG_RECLAIM_MAY_SWAP, NULL)) {
> -			ret = -EBUSY;
> -			break;
> -		}
> -	} while (true);
> -
> -	if (!ret && enlarge)
> -		memcg1_oom_recover(memcg);
> -
> -	return ret;
> -}
> -
> -/*
> - * Reclaims as many pages from the given memcg as possible.
> - *
> - * Caller is responsible for holding css reference for memcg.
> - */
> -static int mem_cgroup_force_empty(struct mem_cgroup *memcg)
> -{
> -	int nr_retries = MAX_RECLAIM_RETRIES;
> -
> -	/* we call try-to-free pages to make this cgroup empty */
> -	lru_add_drain_all();
> -
> -	drain_all_stock(memcg);
> -
> -	/* try to free all pages in this cgroup */
> -	while (nr_retries && page_counter_read(&memcg->memory)) {
> -		if (signal_pending(current))
> -			return -EINTR;
> -
> -		if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL,
> -						  MEMCG_RECLAIM_MAY_SWAP, NULL))
> -			nr_retries--;
> -	}
> -
> -	return 0;
> -}
> -
> -static ssize_t mem_cgroup_force_empty_write(struct kernfs_open_file *of,
> -					    char *buf, size_t nbytes,
> -					    loff_t off)
> -{
> -	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> -
> -	if (mem_cgroup_is_root(memcg))
> -		return -EINVAL;
> -	return mem_cgroup_force_empty(memcg) ?: nbytes;
> -}
> -
> -static u64 mem_cgroup_hierarchy_read(struct cgroup_subsys_state *css,
> -				     struct cftype *cft)
> -{
> -	return 1;
> -}
> -
> -static int mem_cgroup_hierarchy_write(struct cgroup_subsys_state *css,
> -				      struct cftype *cft, u64 val)
> -{
> -	if (val == 1)
> -		return 0;
> -
> -	pr_warn_once("Non-hierarchical mode is deprecated. "
> -		     "Please report your usecase to linux-mm@kvack.org if you "
> -		     "depend on this functionality.\n");
> -
> -	return -EINVAL;
> -}
> -
>  unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
>  {
>  	unsigned long val;
> @@ -3251,67 +3129,6 @@ unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
>  	return val;
>  }
>  
> -enum {
> -	RES_USAGE,
> -	RES_LIMIT,
> -	RES_MAX_USAGE,
> -	RES_FAILCNT,
> -	RES_SOFT_LIMIT,
> -};
> -
> -static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
> -			       struct cftype *cft)
> -{
> -	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> -	struct page_counter *counter;
> -
> -	switch (MEMFILE_TYPE(cft->private)) {
> -	case _MEM:
> -		counter = &memcg->memory;
> -		break;
> -	case _MEMSWAP:
> -		counter = &memcg->memsw;
> -		break;
> -	case _KMEM:
> -		counter = &memcg->kmem;
> -		break;
> -	case _TCP:
> -		counter = &memcg->tcpmem;
> -		break;
> -	default:
> -		BUG();
> -	}
> -
> -	switch (MEMFILE_ATTR(cft->private)) {
> -	case RES_USAGE:
> -		if (counter == &memcg->memory)
> -			return (u64)mem_cgroup_usage(memcg, false) * PAGE_SIZE;
> -		if (counter == &memcg->memsw)
> -			return (u64)mem_cgroup_usage(memcg, true) * PAGE_SIZE;
> -		return (u64)page_counter_read(counter) * PAGE_SIZE;
> -	case RES_LIMIT:
> -		return (u64)counter->max * PAGE_SIZE;
> -	case RES_MAX_USAGE:
> -		return (u64)counter->watermark * PAGE_SIZE;
> -	case RES_FAILCNT:
> -		return counter->failcnt;
> -	case RES_SOFT_LIMIT:
> -		return (u64)READ_ONCE(memcg->soft_limit) * PAGE_SIZE;
> -	default:
> -		BUG();
> -	}
> -}
> -
> -/*
> - * This function doesn't do anything useful. Its only job is to provide a read
> - * handler for a file so that cgroup_file_mode() will add read permissions.
> - */
> -static int mem_cgroup_dummy_seq_show(__always_unused struct seq_file *m,
> -				     __always_unused void *v)
> -{
> -	return -EINVAL;
> -}
> -
>  #ifdef CONFIG_MEMCG_KMEM
>  static int memcg_online_kmem(struct mem_cgroup *memcg)
>  {
> @@ -3373,139 +3190,9 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
>  }
>  #endif /* CONFIG_MEMCG_KMEM */
>  
> -static int memcg_update_tcp_max(struct mem_cgroup *memcg, unsigned long max)
> -{
> -	int ret;
> -
> -	mutex_lock(&memcg_max_mutex);
> -
> -	ret = page_counter_set_max(&memcg->tcpmem, max);
> -	if (ret)
> -		goto out;
> -
> -	if (!memcg->tcpmem_active) {
> -		/*
> -		 * The active flag needs to be written after the static_key
> -		 * update. This is what guarantees that the socket activation
> -		 * function is the last one to run. See mem_cgroup_sk_alloc()
> -		 * for details, and note that we don't mark any socket as
> -		 * belonging to this memcg until that flag is up.
> -		 *
> -		 * We need to do this, because static_keys will span multiple
> -		 * sites, but we can't control their order. If we mark a socket
> -		 * as accounted, but the accounting functions are not patched in
> -		 * yet, we'll lose accounting.
> -		 *
> -		 * We never race with the readers in mem_cgroup_sk_alloc(),
> -		 * because when this value changes, the code to process it is not
> -		 * patched in yet.
> -		 */
> -		static_branch_inc(&memcg_sockets_enabled_key);
> -		memcg->tcpmem_active = true;
> -	}
> -out:
> -	mutex_unlock(&memcg_max_mutex);
> -	return ret;
> -}
> -
> -/*
> - * The only users of this function are the RES_LIMIT and RES_SOFT_LIMIT
> - * files.
> - */
> -static ssize_t mem_cgroup_write(struct kernfs_open_file *of,
> -				char *buf, size_t nbytes, loff_t off)
> -{
> -	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> -	unsigned long nr_pages;
> -	int ret;
> -
> -	buf = strstrip(buf);
> -	ret = page_counter_memparse(buf, "-1", &nr_pages);
> -	if (ret)
> -		return ret;
> -
> -	switch (MEMFILE_ATTR(of_cft(of)->private)) {
> -	case RES_LIMIT:
> -		if (mem_cgroup_is_root(memcg)) { /* Can't set limit on root */
> -			ret = -EINVAL;
> -			break;
> -		}
> -		switch (MEMFILE_TYPE(of_cft(of)->private)) {
> -		case _MEM:
> -			ret = mem_cgroup_resize_max(memcg, nr_pages, false);
> -			break;
> -		case _MEMSWAP:
> -			ret = mem_cgroup_resize_max(memcg, nr_pages, true);
> -			break;
> -		case _KMEM:
> -			pr_warn_once("kmem.limit_in_bytes is deprecated and will be removed. "
> -				     "Writing any value to this file has no effect. "
> -				     "Please report your usecase to linux-mm@kvack.org if you "
> -				     "depend on this functionality.\n");
> -			ret = 0;
> -			break;
> -		case _TCP:
> -			ret = memcg_update_tcp_max(memcg, nr_pages);
> -			break;
> -		}
> -		break;
> -	case RES_SOFT_LIMIT:
> -		if (IS_ENABLED(CONFIG_PREEMPT_RT)) {
> -			ret = -EOPNOTSUPP;
> -		} else {
> -			WRITE_ONCE(memcg->soft_limit, nr_pages);
> -			ret = 0;
> -		}
> -		break;
> -	}
> -	return ret ?: nbytes;
> -}
> -
> -static ssize_t mem_cgroup_reset(struct kernfs_open_file *of, char *buf,
> -				size_t nbytes, loff_t off)
> -{
> -	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> -	struct page_counter *counter;
> -
> -	switch (MEMFILE_TYPE(of_cft(of)->private)) {
> -	case _MEM:
> -		counter = &memcg->memory;
> -		break;
> -	case _MEMSWAP:
> -		counter = &memcg->memsw;
> -		break;
> -	case _KMEM:
> -		counter = &memcg->kmem;
> -		break;
> -	case _TCP:
> -		counter = &memcg->tcpmem;
> -		break;
> -	default:
> -		BUG();
> -	}
> -
> -	switch (MEMFILE_ATTR(of_cft(of)->private)) {
> -	case RES_MAX_USAGE:
> -		page_counter_reset_watermark(counter);
> -		break;
> -	case RES_FAILCNT:
> -		counter->failcnt = 0;
> -		break;
> -	default:
> -		BUG();
> -	}
> -
> -	return nbytes;
> -}
> -
> -#ifdef CONFIG_NUMA
> -
> -#define LRU_ALL_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE))
> -#define LRU_ALL_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON))
> -#define LRU_ALL	     ((1 << NR_LRU_LISTS) - 1)
> -
> -static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
> -				int nid, unsigned int lru_mask, bool tree)
> +unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
> +					   int nid, unsigned int lru_mask,
> +					   bool tree)
>  {
>  	struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
>  	unsigned long nr = 0;
> @@ -3524,9 +3211,8 @@ static unsigned long mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
>  	return nr;
>  }
>  
> -static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg,
> -					     unsigned int lru_mask,
> -					     bool tree)
> +unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg,
> +				      unsigned int lru_mask, bool tree)
>  {
>  	unsigned long nr = 0;
>  	enum lru_list lru;
> @@ -3542,221 +3228,6 @@ static unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg,
>  	return nr;
>  }
>  
> -static int memcg_numa_stat_show(struct seq_file *m, void *v)
> -{
> -	struct numa_stat {
> -		const char *name;
> -		unsigned int lru_mask;
> -	};
> -
> -	static const struct numa_stat stats[] = {
> -		{ "total", LRU_ALL },
> -		{ "file", LRU_ALL_FILE },
> -		{ "anon", LRU_ALL_ANON },
> -		{ "unevictable", BIT(LRU_UNEVICTABLE) },
> -	};
> -	const struct numa_stat *stat;
> -	int nid;
> -	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> -
> -	mem_cgroup_flush_stats(memcg);
> -
> -	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
> -		seq_printf(m, "%s=%lu", stat->name,
> -			   mem_cgroup_nr_lru_pages(memcg, stat->lru_mask,
> -						   false));
> -		for_each_node_state(nid, N_MEMORY)
> -			seq_printf(m, " N%d=%lu", nid,
> -				   mem_cgroup_node_nr_lru_pages(memcg, nid,
> -							stat->lru_mask, false));
> -		seq_putc(m, '\n');
> -	}
> -
> -	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
> -
> -		seq_printf(m, "hierarchical_%s=%lu", stat->name,
> -			   mem_cgroup_nr_lru_pages(memcg, stat->lru_mask,
> -						   true));
> -		for_each_node_state(nid, N_MEMORY)
> -			seq_printf(m, " N%d=%lu", nid,
> -				   mem_cgroup_node_nr_lru_pages(memcg, nid,
> -							stat->lru_mask, true));
> -		seq_putc(m, '\n');
> -	}
> -
> -	return 0;
> -}
> -#endif /* CONFIG_NUMA */
> -
> -static const unsigned int memcg1_stats[] = {
> -	NR_FILE_PAGES,
> -	NR_ANON_MAPPED,
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	NR_ANON_THPS,
> -#endif
> -	NR_SHMEM,
> -	NR_FILE_MAPPED,
> -	NR_FILE_DIRTY,
> -	NR_WRITEBACK,
> -	WORKINGSET_REFAULT_ANON,
> -	WORKINGSET_REFAULT_FILE,
> -#ifdef CONFIG_SWAP
> -	MEMCG_SWAP,
> -	NR_SWAPCACHE,
> -#endif
> -};
> -
> -static const char *const memcg1_stat_names[] = {
> -	"cache",
> -	"rss",
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	"rss_huge",
> -#endif
> -	"shmem",
> -	"mapped_file",
> -	"dirty",
> -	"writeback",
> -	"workingset_refault_anon",
> -	"workingset_refault_file",
> -#ifdef CONFIG_SWAP
> -	"swap",
> -	"swapcached",
> -#endif
> -};
> -
> -/* Universal VM events cgroup1 shows, original sort order */
> -static const unsigned int memcg1_events[] = {
> -	PGPGIN,
> -	PGPGOUT,
> -	PGFAULT,
> -	PGMAJFAULT,
> -};
> -
> -static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
> -{
> -	unsigned long memory, memsw;
> -	struct mem_cgroup *mi;
> -	unsigned int i;
> -
> -	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
> -
> -	mem_cgroup_flush_stats(memcg);
> -
> -	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
> -		unsigned long nr;
> -
> -		nr = memcg_page_state_local_output(memcg, memcg1_stats[i]);
> -		seq_buf_printf(s, "%s %lu\n", memcg1_stat_names[i], nr);
> -	}
> -
> -	for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
> -		seq_buf_printf(s, "%s %lu\n", vm_event_name(memcg1_events[i]),
> -			       memcg_events_local(memcg, memcg1_events[i]));
> -
> -	for (i = 0; i < NR_LRU_LISTS; i++)
> -		seq_buf_printf(s, "%s %lu\n", lru_list_name(i),
> -			       memcg_page_state_local(memcg, NR_LRU_BASE + i) *
> -			       PAGE_SIZE);
> -
> -	/* Hierarchical information */
> -	memory = memsw = PAGE_COUNTER_MAX;
> -	for (mi = memcg; mi; mi = parent_mem_cgroup(mi)) {
> -		memory = min(memory, READ_ONCE(mi->memory.max));
> -		memsw = min(memsw, READ_ONCE(mi->memsw.max));
> -	}
> -	seq_buf_printf(s, "hierarchical_memory_limit %llu\n",
> -		       (u64)memory * PAGE_SIZE);
> -	seq_buf_printf(s, "hierarchical_memsw_limit %llu\n",
> -		       (u64)memsw * PAGE_SIZE);
> -
> -	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
> -		unsigned long nr;
> -
> -		nr = memcg_page_state_output(memcg, memcg1_stats[i]);
> -		seq_buf_printf(s, "total_%s %llu\n", memcg1_stat_names[i],
> -			       (u64)nr);
> -	}
> -
> -	for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
> -		seq_buf_printf(s, "total_%s %llu\n",
> -			       vm_event_name(memcg1_events[i]),
> -			       (u64)memcg_events(memcg, memcg1_events[i]));
> -
> -	for (i = 0; i < NR_LRU_LISTS; i++)
> -		seq_buf_printf(s, "total_%s %llu\n", lru_list_name(i),
> -			       (u64)memcg_page_state(memcg, NR_LRU_BASE + i) *
> -			       PAGE_SIZE);
> -
> -#ifdef CONFIG_DEBUG_VM
> -	{
> -		pg_data_t *pgdat;
> -		struct mem_cgroup_per_node *mz;
> -		unsigned long anon_cost = 0;
> -		unsigned long file_cost = 0;
> -
> -		for_each_online_pgdat(pgdat) {
> -			mz = memcg->nodeinfo[pgdat->node_id];
> -
> -			anon_cost += mz->lruvec.anon_cost;
> -			file_cost += mz->lruvec.file_cost;
> -		}
> -		seq_buf_printf(s, "anon_cost %lu\n", anon_cost);
> -		seq_buf_printf(s, "file_cost %lu\n", file_cost);
> -	}
> -#endif
> -}
> -
> -static u64 mem_cgroup_swappiness_read(struct cgroup_subsys_state *css,
> -				      struct cftype *cft)
> -{
> -	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> -
> -	return mem_cgroup_swappiness(memcg);
> -}
> -
> -static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
> -				       struct cftype *cft, u64 val)
> -{
> -	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> -
> -	if (val > MAX_SWAPPINESS)
> -		return -EINVAL;
> -
> -	if (!mem_cgroup_is_root(memcg))
> -		WRITE_ONCE(memcg->swappiness, val);
> -	else
> -		WRITE_ONCE(vm_swappiness, val);
> -
> -	return 0;
> -}
> -
> -static int mem_cgroup_oom_control_read(struct seq_file *sf, void *v)
> -{
> -	struct mem_cgroup *memcg = mem_cgroup_from_seq(sf);
> -
> -	seq_printf(sf, "oom_kill_disable %d\n", READ_ONCE(memcg->oom_kill_disable));
> -	seq_printf(sf, "under_oom %d\n", (bool)memcg->under_oom);
> -	seq_printf(sf, "oom_kill %lu\n",
> -		   atomic_long_read(&memcg->memory_events[MEMCG_OOM_KILL]));
> -	return 0;
> -}
> -
> -static int mem_cgroup_oom_control_write(struct cgroup_subsys_state *css,
> -	struct cftype *cft, u64 val)
> -{
> -	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> -
> -	/* cannot set to root cgroup and only 0 and 1 are allowed */
> -	if (mem_cgroup_is_root(memcg) || !((val == 0) || (val == 1)))
> -		return -EINVAL;
> -
> -	WRITE_ONCE(memcg->oom_kill_disable, val);
> -	if (!val)
> -		memcg1_oom_recover(memcg);
> -
> -	return 0;
> -}
> -
>  #ifdef CONFIG_CGROUP_WRITEBACK
>  
>  #include <trace/events/writeback.h>
> @@ -3970,147 +3441,6 @@ static void memcg_wb_domain_size_changed(struct mem_cgroup *memcg)
>  
>  #endif	/* CONFIG_CGROUP_WRITEBACK */
>  
> -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_SLUB_DEBUG)
> -static int mem_cgroup_slab_show(struct seq_file *m, void *p)
> -{
> -	/*
> -	 * Deprecated.
> -	 * Please, take a look at tools/cgroup/memcg_slabinfo.py .
> -	 */
> -	return 0;
> -}
> -#endif
> -
> -static int memory_stat_show(struct seq_file *m, void *v);
> -
> -static struct cftype mem_cgroup_legacy_files[] = {
> -	{
> -		.name = "usage_in_bytes",
> -		.private = MEMFILE_PRIVATE(_MEM, RES_USAGE),
> -		.read_u64 = mem_cgroup_read_u64,
> -	},
> -	{
> -		.name = "max_usage_in_bytes",
> -		.private = MEMFILE_PRIVATE(_MEM, RES_MAX_USAGE),
> -		.write = mem_cgroup_reset,
> -		.read_u64 = mem_cgroup_read_u64,
> -	},
> -	{
> -		.name = "limit_in_bytes",
> -		.private = MEMFILE_PRIVATE(_MEM, RES_LIMIT),
> -		.write = mem_cgroup_write,
> -		.read_u64 = mem_cgroup_read_u64,
> -	},
> -	{
> -		.name = "soft_limit_in_bytes",
> -		.private = MEMFILE_PRIVATE(_MEM, RES_SOFT_LIMIT),
> -		.write = mem_cgroup_write,
> -		.read_u64 = mem_cgroup_read_u64,
> -	},
> -	{
> -		.name = "failcnt",
> -		.private = MEMFILE_PRIVATE(_MEM, RES_FAILCNT),
> -		.write = mem_cgroup_reset,
> -		.read_u64 = mem_cgroup_read_u64,
> -	},
> -	{
> -		.name = "stat",
> -		.seq_show = memory_stat_show,
> -	},
> -	{
> -		.name = "force_empty",
> -		.write = mem_cgroup_force_empty_write,
> -	},
> -	{
> -		.name = "use_hierarchy",
> -		.write_u64 = mem_cgroup_hierarchy_write,
> -		.read_u64 = mem_cgroup_hierarchy_read,
> -	},
> -	{
> -		.name = "cgroup.event_control",		/* XXX: for compat */
> -		.write = memcg_write_event_control,
> -		.flags = CFTYPE_NO_PREFIX | CFTYPE_WORLD_WRITABLE,
> -	},
> -	{
> -		.name = "swappiness",
> -		.read_u64 = mem_cgroup_swappiness_read,
> -		.write_u64 = mem_cgroup_swappiness_write,
> -	},
> -	{
> -		.name = "move_charge_at_immigrate",
> -		.read_u64 = mem_cgroup_move_charge_read,
> -		.write_u64 = mem_cgroup_move_charge_write,
> -	},
> -	{
> -		.name = "oom_control",
> -		.seq_show = mem_cgroup_oom_control_read,
> -		.write_u64 = mem_cgroup_oom_control_write,
> -	},
> -	{
> -		.name = "pressure_level",
> -		.seq_show = mem_cgroup_dummy_seq_show,
> -	},
> -#ifdef CONFIG_NUMA
> -	{
> -		.name = "numa_stat",
> -		.seq_show = memcg_numa_stat_show,
> -	},
> -#endif
> -	{
> -		.name = "kmem.limit_in_bytes",
> -		.private = MEMFILE_PRIVATE(_KMEM, RES_LIMIT),
> -		.write = mem_cgroup_write,
> -		.read_u64 = mem_cgroup_read_u64,
> -	},
> -	{
> -		.name = "kmem.usage_in_bytes",
> -		.private = MEMFILE_PRIVATE(_KMEM, RES_USAGE),
> -		.read_u64 = mem_cgroup_read_u64,
> -	},
> -	{
> -		.name = "kmem.failcnt",
> -		.private = MEMFILE_PRIVATE(_KMEM, RES_FAILCNT),
> -		.write = mem_cgroup_reset,
> -		.read_u64 = mem_cgroup_read_u64,
> -	},
> -	{
> -		.name = "kmem.max_usage_in_bytes",
> -		.private = MEMFILE_PRIVATE(_KMEM, RES_MAX_USAGE),
> -		.write = mem_cgroup_reset,
> -		.read_u64 = mem_cgroup_read_u64,
> -	},
> -#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_SLUB_DEBUG)
> -	{
> -		.name = "kmem.slabinfo",
> -		.seq_show = mem_cgroup_slab_show,
> -	},
> -#endif
> -	{
> -		.name = "kmem.tcp.limit_in_bytes",
> -		.private = MEMFILE_PRIVATE(_TCP, RES_LIMIT),
> -		.write = mem_cgroup_write,
> -		.read_u64 = mem_cgroup_read_u64,
> -	},
> -	{
> -		.name = "kmem.tcp.usage_in_bytes",
> -		.private = MEMFILE_PRIVATE(_TCP, RES_USAGE),
> -		.read_u64 = mem_cgroup_read_u64,
> -	},
> -	{
> -		.name = "kmem.tcp.failcnt",
> -		.private = MEMFILE_PRIVATE(_TCP, RES_FAILCNT),
> -		.write = mem_cgroup_reset,
> -		.read_u64 = mem_cgroup_read_u64,
> -	},
> -	{
> -		.name = "kmem.tcp.max_usage_in_bytes",
> -		.private = MEMFILE_PRIVATE(_TCP, RES_MAX_USAGE),
> -		.write = mem_cgroup_reset,
> -		.read_u64 = mem_cgroup_read_u64,
> -	},
> -	{ },	/* terminate */
> -};
> -
>  /*
>   * Private memory cgroup IDR
>   *
> @@ -4902,7 +4232,7 @@ static int memory_events_local_show(struct seq_file *m, void *v)
>  	return 0;
>  }
>  
> -static int memory_stat_show(struct seq_file *m, void *v)
> +int memory_stat_show(struct seq_file *m, void *v)
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
>  	char *buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
> @@ -6133,33 +5463,6 @@ static struct cftype swap_files[] = {
>  	{ }	/* terminate */
>  };
>  
> -static struct cftype memsw_files[] = {
> -	{
> -		.name = "memsw.usage_in_bytes",
> -		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
> -		.read_u64 = mem_cgroup_read_u64,
> -	},
> -	{
> -		.name = "memsw.max_usage_in_bytes",
> -		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_MAX_USAGE),
> -		.write = mem_cgroup_reset,
> -		.read_u64 = mem_cgroup_read_u64,
> -	},
> -	{
> -		.name = "memsw.limit_in_bytes",
> -		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_LIMIT),
> -		.write = mem_cgroup_write,
> -		.read_u64 = mem_cgroup_read_u64,
> -	},
> -	{
> -		.name = "memsw.failcnt",
> -		.private = MEMFILE_PRIVATE(_MEMSWAP, RES_FAILCNT),
> -		.write = mem_cgroup_reset,
> -		.read_u64 = mem_cgroup_read_u64,
> -	},
> -	{ },	/* terminate */
> -};
> -
>  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
>  /**
>   * obj_cgroup_may_zswap - check if this cgroup can zswap
> -- 
> 2.45.2

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread
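
A note on the last hunk above: dropping "static" from memory_stat_show() is
what lets the legacy "stat" file, which now lives in mm/memcontrol-v1.c, keep
pointing at the shared cgroup v2 implementation. A condensed sketch of the
resulting split (table shortened; names as used in the series):

	/* mm/memcontrol-v1.h: declaration shared between the two files */
	int memory_stat_show(struct seq_file *m, void *v);

	/* mm/memcontrol-v1.c: the moved legacy table keeps the same callback */
	struct cftype mem_cgroup_legacy_files[] = {
		{
			.name = "stat",
			.seq_show = memory_stat_show,	/* same handler as cgroup v2 */
		},
		/* ... remaining v1 knobs elided ... */
		{ },	/* terminate */
	};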

* Re: [PATCH v2 11/14] mm: memcg: make memcg1_update_tree() static
  2024-06-25  0:59 ` [PATCH v2 11/14] mm: memcg: make memcg1_update_tree() static Roman Gushchin
@ 2024-06-25  7:09   ` Michal Hocko
  0 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2024-06-25  7:09 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm

On Mon 24-06-24 17:59:03, Roman Gushchin wrote:
> memcg1_update_tree() is not used outside of mm/memcontrol-v1.c
> anymore, so define it as static and remove the declaration from
> the header file.
> 
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol-v1.c | 2 +-
>  mm/memcontrol-v1.h | 1 -
>  2 files changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
> index 1b7337d0170d..f89de413004b 100644
> --- a/mm/memcontrol-v1.c
> +++ b/mm/memcontrol-v1.c
> @@ -201,7 +201,7 @@ static unsigned long soft_limit_excess(struct mem_cgroup *memcg)
>  	return excess;
>  }
>  
> -void memcg1_update_tree(struct mem_cgroup *memcg, int nid)
> +static void memcg1_update_tree(struct mem_cgroup *memcg, int nid)
>  {
>  	unsigned long excess;
>  	struct mem_cgroup_per_node *mz;
> diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
> index 7be4670d9abb..7d6ac4a4fb36 100644
> --- a/mm/memcontrol-v1.h
> +++ b/mm/memcontrol-v1.h
> @@ -5,7 +5,6 @@
>  
>  #include <linux/cgroup-defs.h>
>  
> -void memcg1_update_tree(struct mem_cgroup *memcg, int nid);
>  void memcg1_remove_from_trees(struct mem_cgroup *memcg);
>  
>  static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg)
> -- 
> 2.45.2
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread
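
For context on why the static qualifier is now possible: after patch 06 moved
the legacy event code, the only remaining caller sits in mm/memcontrol-v1.c
itself. A simplified sketch of that call path, based on the v1 event code
(unlikely() annotations omitted):

	void memcg1_check_events(struct mem_cgroup *memcg, int nid)
	{
		if (mem_cgroup_event_ratelimit(memcg, MEM_CGROUP_TARGET_THRESH)) {
			bool do_softlimit;

			do_softlimit = mem_cgroup_event_ratelimit(memcg,
							MEM_CGROUP_TARGET_SOFTLIMIT);
			mem_cgroup_threshold(memcg);	/* v1 usage thresholds */
			if (do_softlimit)
				memcg1_update_tree(memcg, nid);	/* file-local now */
		}
	}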

* Re: [PATCH v2 12/14] mm: memcg: group cgroup v1 memcg related declarations
  2024-06-25  0:59 ` [PATCH v2 12/14] mm: memcg: group cgroup v1 memcg related declarations Roman Gushchin
@ 2024-06-25  7:09   ` Michal Hocko
  0 siblings, 0 replies; 31+ messages in thread
From: Michal Hocko @ 2024-06-25  7:09 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm

On Mon 24-06-24 17:59:04, Roman Gushchin wrote:
> Group all cgroup v1-related declarations at the end of memcontrol.h
> and mm/memcontrol-v1.h with the intention to put them all together
> under a config option later on. It should make things easier to
> follow and maintain too.
> 
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/memcontrol.h | 144 +++++++++++++++++++------------------
>  mm/memcontrol-v1.h         |  89 ++++++++++++-----------
>  2 files changed, 123 insertions(+), 110 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 588179d29849..a70d64ed04f5 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -950,39 +950,13 @@ static inline void mem_cgroup_exit_user_fault(void)
>  	current->in_user_fault = 0;
>  }
>  
> -static inline bool task_in_memcg_oom(struct task_struct *p)
> -{
> -	return p->memcg_in_oom;
> -}
> -
> -bool mem_cgroup_oom_synchronize(bool wait);
>  struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
>  					    struct mem_cgroup *oom_domain);
>  void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
>  
> -void folio_memcg_lock(struct folio *folio);
> -void folio_memcg_unlock(struct folio *folio);
> -
>  void __mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
>  		       int val);
>  
> -/* try to stabilize folio_memcg() for all the pages in a memcg */
> -static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
> -{
> -	rcu_read_lock();
> -
> -	if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account))
> -		return true;
> -
> -	rcu_read_unlock();
> -	return false;
> -}
> -
> -static inline void mem_cgroup_unlock_pages(void)
> -{
> -	rcu_read_unlock();
> -}
> -
>  /* idx can be of type enum memcg_stat_item or node_stat_item */
>  static inline void mod_memcg_state(struct mem_cgroup *memcg,
>  				   enum memcg_stat_item idx, int val)
> @@ -1109,10 +1083,6 @@ static inline void memcg_memory_event_mm(struct mm_struct *mm,
>  
>  void split_page_memcg(struct page *head, int old_order, int new_order);
>  
> -unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order,
> -					gfp_t gfp_mask,
> -					unsigned long *total_scanned);
> -
>  #else /* CONFIG_MEMCG */
>  
>  #define MEM_CGROUP_ID_SHIFT	0
> @@ -1423,26 +1393,6 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
>  {
>  }
>  
> -static inline void folio_memcg_lock(struct folio *folio)
> -{
> -}
> -
> -static inline void folio_memcg_unlock(struct folio *folio)
> -{
> -}
> -
> -static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
> -{
> -	/* to match folio_memcg_rcu() */
> -	rcu_read_lock();
> -	return true;
> -}
> -
> -static inline void mem_cgroup_unlock_pages(void)
> -{
> -	rcu_read_unlock();
> -}
> -
>  static inline void mem_cgroup_handle_over_high(gfp_t gfp_mask)
>  {
>  }
> @@ -1455,16 +1405,6 @@ static inline void mem_cgroup_exit_user_fault(void)
>  {
>  }
>  
> -static inline bool task_in_memcg_oom(struct task_struct *p)
> -{
> -	return false;
> -}
> -
> -static inline bool mem_cgroup_oom_synchronize(bool wait)
> -{
> -	return false;
> -}
> -
>  static inline struct mem_cgroup *mem_cgroup_get_oom_group(
>  	struct task_struct *victim, struct mem_cgroup *oom_domain)
>  {
> @@ -1558,14 +1498,6 @@ void count_memcg_event_mm(struct mm_struct *mm, enum vm_event_item idx)
>  static inline void split_page_memcg(struct page *head, int old_order, int new_order)
>  {
>  }
> -
> -static inline
> -unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order,
> -					gfp_t gfp_mask,
> -					unsigned long *total_scanned)
> -{
> -	return 0;
> -}
>  #endif /* CONFIG_MEMCG */
>  
>  /*
> @@ -1916,4 +1848,80 @@ static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
>  }
>  #endif
>  
> +
> +/* Cgroup v1-related declarations */
> +
> +#ifdef CONFIG_MEMCG
> +unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order,
> +					gfp_t gfp_mask,
> +					unsigned long *total_scanned);
> +
> +bool mem_cgroup_oom_synchronize(bool wait);
> +
> +static inline bool task_in_memcg_oom(struct task_struct *p)
> +{
> +	return p->memcg_in_oom;
> +}
> +
> +void folio_memcg_lock(struct folio *folio);
> +void folio_memcg_unlock(struct folio *folio);
> +
> +/* try to stabilize folio_memcg() for all the pages in a memcg */
> +static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
> +{
> +	rcu_read_lock();
> +
> +	if (mem_cgroup_disabled() || !atomic_read(&memcg->moving_account))
> +		return true;
> +
> +	rcu_read_unlock();
> +	return false;
> +}
> +
> +static inline void mem_cgroup_unlock_pages(void)
> +{
> +	rcu_read_unlock();
> +}
> +
> +#else /* CONFIG_MEMCG */
> +static inline
> +unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order,
> +					gfp_t gfp_mask,
> +					unsigned long *total_scanned)
> +{
> +	return 0;
> +}
> +
> +static inline void folio_memcg_lock(struct folio *folio)
> +{
> +}
> +
> +static inline void folio_memcg_unlock(struct folio *folio)
> +{
> +}
> +
> +static inline bool mem_cgroup_trylock_pages(struct mem_cgroup *memcg)
> +{
> +	/* to match folio_memcg_rcu() */
> +	rcu_read_lock();
> +	return true;
> +}
> +
> +static inline void mem_cgroup_unlock_pages(void)
> +{
> +	rcu_read_unlock();
> +}
> +
> +static inline bool task_in_memcg_oom(struct task_struct *p)
> +{
> +	return false;
> +}
> +
> +static inline bool mem_cgroup_oom_synchronize(bool wait)
> +{
> +	return false;
> +}
> +
> +#endif /* CONFIG_MEMCG */
> +
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
> index 7d6ac4a4fb36..89d420793048 100644
> --- a/mm/memcontrol-v1.h
> +++ b/mm/memcontrol-v1.h
> @@ -5,15 +5,9 @@
>  
>  #include <linux/cgroup-defs.h>
>  
> -void memcg1_remove_from_trees(struct mem_cgroup *memcg);
> -
> -static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg)
> -{
> -	WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
> -}
> +/* Cgroup v1 and v2 common declarations */
>  
>  void mem_cgroup_charge_statistics(struct mem_cgroup *memcg, int nr_pages);
> -void memcg1_check_events(struct mem_cgroup *memcg, int nid);
>  int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  		     unsigned int nr_pages);
>  
> @@ -29,30 +23,6 @@ static inline int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n);
>  void mem_cgroup_id_put_many(struct mem_cgroup *memcg, unsigned int n);
>  
> -bool memcg1_wait_acct_move(struct mem_cgroup *memcg);
> -struct cgroup_taskset;
> -int memcg1_can_attach(struct cgroup_taskset *tset);
> -void memcg1_cancel_attach(struct cgroup_taskset *tset);
> -void memcg1_move_task(void);
> -
> -/*
> - * Per memcg event counter is incremented at every pagein/pageout. With THP,
> - * it will be incremented by the number of pages. This counter is used
> - * to trigger some periodic events. This is straightforward and better
> - * than using jiffies etc. to handle periodic memcg event.
> - */
> -enum mem_cgroup_events_target {
> -	MEM_CGROUP_TARGET_THRESH,
> -	MEM_CGROUP_TARGET_SOFTLIMIT,
> -	MEM_CGROUP_NTARGETS,
> -};
> -
> -/* Whether legacy memory+swap accounting is active */
> -static bool do_memsw_account(void)
> -{
> -	return !cgroup_subsys_on_dfl(memory_cgrp_subsys);
> -}
> -
>  /*
>   * Iteration constructs for visiting all cgroups (under a tree).  If
>   * loops are exited prematurely (break), mem_cgroup_iter_break() must
> @@ -68,24 +38,28 @@ static bool do_memsw_account(void)
>  	     iter != NULL;				\
>  	     iter = mem_cgroup_iter(NULL, iter, NULL))
>  
> -void memcg1_css_offline(struct mem_cgroup *memcg);
> +/* Whether legacy memory+swap accounting is active */
> +static bool do_memsw_account(void)
> +{
> +	return !cgroup_subsys_on_dfl(memory_cgrp_subsys);
> +}
>  
> -/* for encoding cft->private value on file */
> -enum res_type {
> -	_MEM,
> -	_MEMSWAP,
> -	_KMEM,
> -	_TCP,
> +/*
> + * Per memcg event counter is incremented at every pagein/pageout. With THP,
> + * it will be incremented by the number of pages. This counter is used
> + * to trigger some periodic events. This is straightforward and better
> + * than using jiffies etc. to handle periodic memcg event.
> + */
> +enum mem_cgroup_events_target {
> +	MEM_CGROUP_TARGET_THRESH,
> +	MEM_CGROUP_TARGET_SOFTLIMIT,
> +	MEM_CGROUP_NTARGETS,
>  };
>  
>  bool mem_cgroup_event_ratelimit(struct mem_cgroup *memcg,
>  				enum mem_cgroup_events_target target);
>  unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap);
>  
> -bool memcg1_oom_prepare(struct mem_cgroup *memcg, bool *locked);
> -void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked);
> -void memcg1_oom_recover(struct mem_cgroup *memcg);
> -
>  void drain_all_stock(struct mem_cgroup *root_memcg);
>  unsigned long mem_cgroup_nr_lru_pages(struct mem_cgroup *memcg,
>  				      unsigned int lru_mask, bool tree);
> @@ -100,6 +74,37 @@ unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
>  unsigned long memcg_page_state_local_output(struct mem_cgroup *memcg, int item);
>  int memory_stat_show(struct seq_file *m, void *v);
>  
> +/* Cgroup v1-specific declarations */
> +
> +void memcg1_remove_from_trees(struct mem_cgroup *memcg);
> +
> +static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg)
> +{
> +	WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
> +}
> +
> +bool memcg1_wait_acct_move(struct mem_cgroup *memcg);
> +
> +struct cgroup_taskset;
> +int memcg1_can_attach(struct cgroup_taskset *tset);
> +void memcg1_cancel_attach(struct cgroup_taskset *tset);
> +void memcg1_move_task(void);
> +void memcg1_css_offline(struct mem_cgroup *memcg);
> +
> +/* for encoding cft->private value on file */
> +enum res_type {
> +	_MEM,
> +	_MEMSWAP,
> +	_KMEM,
> +	_TCP,
> +};
> +
> +bool memcg1_oom_prepare(struct mem_cgroup *memcg, bool *locked);
> +void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked);
> +void memcg1_oom_recover(struct mem_cgroup *memcg);
> +
> +void memcg1_check_events(struct mem_cgroup *memcg, int nid);
> +
>  void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s);
>  
>  extern struct cftype memsw_files[];
> -- 
> 2.45.2

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread
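
The trylock pair regrouped in this patch is used to pin folio_memcg() results
against v1 charge moving. A hypothetical caller (not taken from the series)
would follow this pattern:

	/*
	 * Hypothetical walker: folio_memcg() is only stable while charge
	 * moving is excluded, which is what the trylock checks for.
	 */
	static void walk_memcg_folios(struct mem_cgroup *memcg)
	{
		if (!mem_cgroup_trylock_pages(memcg))
			return;		/* a v1 charge move is in flight */

		/* ... folio_memcg() may be relied upon here ... */

		mem_cgroup_unlock_pages();	/* also ends the RCU read section */
	}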

* Re: [PATCH v2 13/14] mm: memcg: put cgroup v1-related members of task_struct under config option
  2024-06-25  0:59 ` [PATCH v2 13/14] mm: memcg: put cgroup v1-related members of task_struct under config option Roman Gushchin
@ 2024-06-25  7:19   ` Michal Hocko
  2024-06-26 18:06     ` Roman Gushchin
  0 siblings, 1 reply; 31+ messages in thread
From: Michal Hocko @ 2024-06-25  7:19 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm

On Mon 24-06-24 17:59:05, Roman Gushchin wrote:
> Guard cgroup v1-related members of task_struct under the CONFIG_MEMCG_V1
> config option, so that users who adopted cgroup v2 don't have to waste
> memory on fields that are never accessed.

This patch does more than that, right? It is essentially making the
whole v1 code conditional. Please change the wording accordingly.

I also think we should make it more clear when to enable the option. I
would propose the following for the config option help text:

Legacy cgroup v1 memory controller which has been deprecated by cgroup
v2 implementation. The v1 is there for legacy applications which haven't
migrated to the new cgroup v2 interface yet. If you do not have any such
application then you are completely fine leaving this option disabled.

Please note that feature set of the legacy memory controller is likely
going to shrink due to deprecation process. New deployments with v1
controller are highly discouraged.
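
Folded into the entry added by the patch (keeping its prompt, dependency and
default), the whole option would then read roughly:

	config MEMCG_V1
		bool "Legacy memory controller"
		depends on MEMCG
		default n
		help
		  Legacy cgroup v1 memory controller which has been deprecated by
		  cgroup v2 implementation. The v1 is there for legacy applications
		  which haven't migrated to the new cgroup v2 interface yet. If you
		  do not have any such application then you are completely fine
		  leaving this option disabled.

		  Please note that feature set of the legacy memory controller is
		  likely going to shrink due to deprecation process. New deployments
		  with v1 controller are highly discouraged.

		  Say N if unsure.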

> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>

With that updated feel free to add
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/memcontrol.h |  6 +++---
>  init/Kconfig               |  9 +++++++++
>  mm/Makefile                |  3 ++-
>  mm/memcontrol-v1.h         | 21 ++++++++++++++++++++-
>  mm/memcontrol.c            | 10 +++++++---
>  5 files changed, 41 insertions(+), 8 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index a70d64ed04f5..796cfa842346 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1851,7 +1851,7 @@ static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
>  
>  /* Cgroup v1-related declarations */
>  
> -#ifdef CONFIG_MEMCG
> +#ifdef CONFIG_MEMCG_V1
>  unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order,
>  					gfp_t gfp_mask,
>  					unsigned long *total_scanned);
> @@ -1883,7 +1883,7 @@ static inline void mem_cgroup_unlock_pages(void)
>  	rcu_read_unlock();
>  }
>  
> -#else /* CONFIG_MEMCG */
> +#else /* CONFIG_MEMCG_V1 */
>  static inline
>  unsigned long memcg1_soft_limit_reclaim(pg_data_t *pgdat, int order,
>  					gfp_t gfp_mask,
> @@ -1922,6 +1922,6 @@ static inline bool mem_cgroup_oom_synchronize(bool wait)
>  	return false;
>  }
>  
> -#endif /* CONFIG_MEMCG */
> +#endif /* CONFIG_MEMCG_V1 */
>  
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index febdea2afc3b..5191b6435b4e 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -969,6 +969,15 @@ config MEMCG
>  	help
>  	  Provides control over the memory footprint of tasks in a cgroup.
>  
> +config MEMCG_V1
> +	bool "Legacy memory controller"
> +	depends on MEMCG
> +	default n
> +	help
> +	  Legacy cgroup v1 memory controller.
> +
> +	  Say N if unsure.
> +
>  config MEMCG_KMEM
>  	bool
>  	depends on MEMCG
> diff --git a/mm/Makefile b/mm/Makefile
> index 124d4dea2035..d2915f8c9dc0 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -96,7 +96,8 @@ obj-$(CONFIG_NUMA) += memory-tiers.o
>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> -obj-$(CONFIG_MEMCG) += memcontrol.o memcontrol-v1.o vmpressure.o
> +obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
> +obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
>  ifdef CONFIG_SWAP
>  obj-$(CONFIG_MEMCG) += swap_cgroup.o
>  endif
> diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
> index 89d420793048..64b053d7f131 100644
> --- a/mm/memcontrol-v1.h
> +++ b/mm/memcontrol-v1.h
> @@ -75,7 +75,7 @@ unsigned long memcg_page_state_local_output(struct mem_cgroup *memcg, int item);
>  int memory_stat_show(struct seq_file *m, void *v);
>  
>  /* Cgroup v1-specific declarations */
> -
> +#ifdef CONFIG_MEMCG_V1
>  void memcg1_remove_from_trees(struct mem_cgroup *memcg);
>  
>  static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg)
> @@ -110,4 +110,23 @@ void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s);
>  extern struct cftype memsw_files[];
>  extern struct cftype mem_cgroup_legacy_files[];
>  
> +#else	/* CONFIG_MEMCG_V1 */
> +
> +static inline void memcg1_remove_from_trees(struct mem_cgroup *memcg) {}
> +static inline void memcg1_soft_limit_reset(struct mem_cgroup *memcg) {}
> +static inline bool memcg1_wait_acct_move(struct mem_cgroup *memcg) { return false; }
> +static inline void memcg1_css_offline(struct mem_cgroup *memcg) {}
> +
> +static inline bool memcg1_oom_prepare(struct mem_cgroup *memcg, bool *locked) { return true; }
> +static inline void memcg1_oom_finish(struct mem_cgroup *memcg, bool locked) {}
> +static inline void memcg1_oom_recover(struct mem_cgroup *memcg) {}
> +
> +static inline void memcg1_check_events(struct mem_cgroup *memcg, int nid) {}
> +
> +static inline void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) {}
> +
> +extern struct cftype memsw_files[];
> +extern struct cftype mem_cgroup_legacy_files[];
> +#endif	/* CONFIG_MEMCG_V1 */
> +
>  #endif	/* __MM_MEMCONTROL_V1_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index c7341e811945..d2e1f8baeae8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4471,18 +4471,20 @@ struct cgroup_subsys memory_cgrp_subsys = {
>  	.css_free = mem_cgroup_css_free,
>  	.css_reset = mem_cgroup_css_reset,
>  	.css_rstat_flush = mem_cgroup_css_rstat_flush,
> -	.can_attach = memcg1_can_attach,
>  #if defined(CONFIG_LRU_GEN) || defined(CONFIG_MEMCG_KMEM)
>  	.attach = mem_cgroup_attach,
>  #endif
> -	.cancel_attach = memcg1_cancel_attach,
> -	.post_attach = memcg1_move_task,
>  #ifdef CONFIG_MEMCG_KMEM
>  	.fork = mem_cgroup_fork,
>  	.exit = mem_cgroup_exit,
>  #endif
>  	.dfl_cftypes = memory_files,
> +#ifdef CONFIG_MEMCG_V1
> +	.can_attach = memcg1_can_attach,
> +	.cancel_attach = memcg1_cancel_attach,
> +	.post_attach = memcg1_move_task,
>  	.legacy_cftypes = mem_cgroup_legacy_files,
> +#endif
>  	.early_init = 0,
>  };
>  
> @@ -5653,7 +5655,9 @@ static int __init mem_cgroup_swap_init(void)
>  		return 0;
>  
>  	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files));
> +#ifdef CONFIG_MEMCG_V1
>  	WARN_ON(cgroup_add_legacy_cftypes(&memory_cgrp_subsys, memsw_files));
> +#endif
>  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
>  	WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, zswap_files));
>  #endif
> -- 
> 2.45.2

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 31+ messages in thread
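
The mechanism behind the #ifdef churn above is the usual stub pattern: call
sites in memcontrol.c stay unconditional, and with CONFIG_MEMCG_V1=n each
empty static inline compiles away. Condensed from the mm/memcontrol-v1.h
hunks, using one representative function:

	#ifdef CONFIG_MEMCG_V1
	void memcg1_check_events(struct mem_cgroup *memcg, int nid);
	#else
	static inline void memcg1_check_events(struct mem_cgroup *memcg, int nid) {}
	#endif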

* Re: [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option
  2024-06-25  0:58 [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option Roman Gushchin
                   ` (13 preceding siblings ...)
  2024-06-25  0:59 ` [PATCH v2 14/14] MAINTAINERS: add mm/memcontrol-v1.c/h to the list of maintained files Roman Gushchin
@ 2024-06-25 17:03 ` Shakeel Butt
  2024-06-26 18:07   ` Roman Gushchin
  14 siblings, 1 reply; 31+ messages in thread
From: Shakeel Butt @ 2024-06-25 17:03 UTC (permalink / raw)
  To: Roman Gushchin
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Muchun Song,
	linux-kernel, cgroups, linux-mm, Matthew Wilcox

On Mon, Jun 24, 2024 at 05:58:52PM GMT, Roman Gushchin wrote:
> Cgroups v2 have been around for a while and many users have fully adopted them,
> so they never use cgroups v1 features and functionality. Yet they have to "pay"
> for the cgroup v1 support anyway:
> > 1) the kernel binary contains unused cgroup v1 code,
> 2) some code paths have additional checks which are not needed,
> 3) some common structures like task_struct and mem_cgroup contain unused
>    cgroup v1-specific members.
> 
> Cgroup v1's memory controller has a number of features that are not supported
> by cgroup v2 and their implementation is pretty much self contained.
> Most notably, these features are: soft limit reclaim, oom handling in userspace,
> complicated event notification system, charge migration. Cgroup v1-specific code
> > in memcontrol.c is close to 4k lines in size and it's intertwined with generic
> and cgroup v2-specific code. It's a burden on developers and maintainers.
> 
> This patchset aims to solve these problems by:
> 1) moving cgroup v1-specific memcg code to the new mm/memcontrol-v1.c file,
> 2) putting definitions shared by memcontrol.c and memcontrol-v1.c into the
>    mm/memcontrol-v1.h header,
> 3) introducing the CONFIG_MEMCG_V1 config option, turned off by default,
> > 4) making memcontrol-v1.c compile only if CONFIG_MEMCG_V1 is set.
> 
> If CONFIG_MEMCG_V1 is not set, cgroup v1 memory controller is still available
> for mounting, however no memory-specific control knobs are present.
> 
> This patchset is based against mm-unstable tree (b610f75d19a34),
> however a version based on mm-stable can be found here:
>   https://github.com/rgushchin/linux/tree/memcontrol_v1.1-stable .
> 
> v2:
>   - minor compilation fix
>   - #else/#endif comments fix (Lance Yang)
> 
> v1:
>   - switched to CONFIG_MEMCG_V1 being off by default based on LSFMMBPF
>     discussion [1]
>   - switched to memcg1_ prefix (Johannes)
>   - many minor fixes
>   - dropped patches which put struct memcg members under CONFIG_MEMCG_V1
>     (will post as a separate patchset)
> 
> rfc:
>   https://lwn.net/Articles/973082/
> 
> [1]: https://lwn.net/Articles/974575/
> 
> MAINTAINERS                |    2 +
> include/linux/memcontrol.h |  156 ++++---
> init/Kconfig               |    9 +
> mm/Makefile                |    2 +
> mm/memcontrol-v1.c         | 2933 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> mm/memcontrol-v1.h         |  132 ++++++
> mm/memcontrol.c            | 4169 +++++++++++++++++++++++++++---------------------------------------------------------------------------------------------------------------------------------------------------
> mm/vmscan.c                |   10 +-
> 8 files changed, 3794 insertions(+), 3619 deletions(-)
> 
> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>

For the series:

Acked-by: Shakeel Butt <shakeel.butt@linux.dev>


^ permalink raw reply	[flat|nested] 31+ messages in thread
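
The build-system side of the series, for reference, comes down to a single
conditional object rule in mm/Makefile (as shown in patch 13), so the v1
translation unit simply drops out of the image when the option is off:

	obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o
	obj-$(CONFIG_MEMCG)    += memcontrol.o vmpressure.o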

* Re: [PATCH v2 13/14] mm: memcg: put cgroup v1-related members of task_struct under config option
  2024-06-25  7:19   ` Michal Hocko
@ 2024-06-26 18:06     ` Roman Gushchin
  0 siblings, 0 replies; 31+ messages in thread
From: Roman Gushchin @ 2024-06-26 18:06 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Shakeel Butt, Muchun Song,
	linux-kernel, cgroups, linux-mm

On Tue, Jun 25, 2024 at 09:19:04AM +0200, Michal Hocko wrote:
> On Mon 24-06-24 17:59:05, Roman Gushchin wrote:
> > Guard cgroup v1-related members of task_struct under the CONFIG_MEMCG_V1
> > config option, so that users who adopted cgroup v2 don't have to waste
> > the memory for fields which are never accessed.
> 
> This patch does more than that, right? It is essentially making the
> whole v1 code conditional. Please change the wording accordingly.

In fact, it doesn't do this at all. This commit message was taken
from another patch in v1 of this series by mistake.
> 
> I also think we should make it more clear when to enable the option. I
> would propose the following for the config option help text:
> 
> Legacy cgroup v1 memory controller which has been deprecated by cgroup
> v2 implementation. The v1 is there for legacy applications which haven't
> migrated to the new cgroup v2 interface yet. If you do not have any such
> application then you are completely fine leaving this option disabled.
> 
> Please note that feature set of the legacy memory controller is likely
> going to shrink due to deprecation process. New deployments with v1
> controller are highly discouraged.
> 
> > Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> 
> With that updated feel free to add
> Acked-by: Michal Hocko <mhocko@suse.com>

An updated version with the correct commit subject and description, and with
your config option help text, has been sent to Andrew; you're cc'ed.

Thank you for suggesting the config option description and reviewing the series,
appreciate it!

Thanks,
Roman


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option
  2024-06-25 17:03 ` [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option Shakeel Butt
@ 2024-06-26 18:07   ` Roman Gushchin
  0 siblings, 0 replies; 31+ messages in thread
From: Roman Gushchin @ 2024-06-26 18:07 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Muchun Song,
	linux-kernel, cgroups, linux-mm, Matthew Wilcox

On Tue, Jun 25, 2024 at 10:03:53AM -0700, Shakeel Butt wrote:
> On Mon, Jun 24, 2024 at 05:58:52PM GMT, Roman Gushchin wrote:
> > Cgroups v2 have been around for a while and many users have fully adopted them,
> > so they never use cgroups v1 features and functionality. Yet they have to "pay"
> > for the cgroup v1 support anyway:
> > ...
> > Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
> 
> For the series:
> 
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>

Thank you!


^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2024-06-26 18:07 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-06-25  0:58 [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option Roman Gushchin
2024-06-25  0:58 ` [PATCH v2 01/14] mm: memcg: introduce memcontrol-v1.c Roman Gushchin
2024-06-25  7:05   ` Michal Hocko
2024-06-25  0:58 ` [PATCH v2 02/14] mm: memcg: move soft limit reclaim code to memcontrol-v1.c Roman Gushchin
2024-06-25  7:06   ` Michal Hocko
2024-06-25  0:58 ` [PATCH v2 03/14] mm: memcg: rename soft limit reclaim-related functions Roman Gushchin
2024-06-25  7:06   ` Michal Hocko
2024-06-25  0:58 ` [PATCH v2 04/14] mm: memcg: move charge migration code to memcontrol-v1.c Roman Gushchin
2024-06-25  7:07   ` Michal Hocko
2024-06-25  0:58 ` [PATCH v2 05/14] mm: memcg: rename charge move-related functions Roman Gushchin
2024-06-25  7:07   ` Michal Hocko
2024-06-25  0:58 ` [PATCH v2 06/14] mm: memcg: move legacy memcg event code into memcontrol-v1.c Roman Gushchin
2024-06-25  7:07   ` Michal Hocko
2024-06-25  0:58 ` [PATCH v2 07/14] mm: memcg: rename memcg_check_events() Roman Gushchin
2024-06-25  7:08   ` Michal Hocko
2024-06-25  0:59 ` [PATCH v2 08/14] mm: memcg: move cgroup v1 oom handling code into memcontrol-v1.c Roman Gushchin
2024-06-25  7:08   ` Michal Hocko
2024-06-25  0:59 ` [PATCH v2 09/14] mm: memcg: rename memcg_oom_recover() Roman Gushchin
2024-06-25  7:08   ` Michal Hocko
2024-06-25  0:59 ` [PATCH v2 10/14] mm: memcg: move cgroup v1 interface files to memcontrol-v1.c Roman Gushchin
2024-06-25  7:09   ` Michal Hocko
2024-06-25  0:59 ` [PATCH v2 11/14] mm: memcg: make memcg1_update_tree() static Roman Gushchin
2024-06-25  7:09   ` Michal Hocko
2024-06-25  0:59 ` [PATCH v2 12/14] mm: memcg: group cgroup v1 memcg related declarations Roman Gushchin
2024-06-25  7:09   ` Michal Hocko
2024-06-25  0:59 ` [PATCH v2 13/14] mm: memcg: put cgroup v1-related members of task_struct under config option Roman Gushchin
2024-06-25  7:19   ` Michal Hocko
2024-06-26 18:06     ` Roman Gushchin
2024-06-25  0:59 ` [PATCH v2 14/14] MAINTAINERS: add mm/memcontrol-v1.c/h to the list of maintained files Roman Gushchin
2024-06-25 17:03 ` [PATCH v2 00/14] mm: memcg: separate legacy cgroup v1 code and put under config option Shakeel Butt
2024-06-26 18:07   ` Roman Gushchin
