linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH mm-unstable v3 0/5] Improve mem_cgroup_iter()
@ 2024-08-27 23:07 Kinsey Ho
  2024-08-27 23:07 ` [PATCH mm-unstable v3 1/5] cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU Kinsey Ho
                   ` (4 more replies)
  0 siblings, 5 replies; 13+ messages in thread
From: Kinsey Ho @ 2024-08-27 23:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, cgroups, Yosry Ahmed, Roman Gushchin,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	Tejun Heo, Zefan Li, mkoutny, Kinsey Ho

Incremental cgroup iteration is being used again [1]. This patchset
improves the reliability of mem_cgroup_iter(). It also improves
simplicity and code readability.

[1] https://lore.kernel.org/20240514202641.2821494-1-hannes@cmpxchg.org/
---
v3: Removed __rcu tag from patch 2/5 which removes the need for
rcu_dereference(). This helps readability.
v2: https://lore.kernel.org/20240813204716.842811-1-kinseyho@google.com/
Add patch to clarify css sibling linkage is RCU protected. The
kernel build bot RCU sparse error from v1 has been ignored.
v1: https://lore.kernel.org/20240724190214.1108049-1-kinseyho@google.com/

Kinsey Ho (5):
  cgroup: clarify css sibling linkage is protected by cgroup_mutex or
    RCU
  mm: don't hold css->refcnt during traversal
  mm: increment gen # before restarting traversal
  mm: restart if multiple traversals raced
  mm: clean up mem_cgroup_iter()

 include/linux/cgroup-defs.h |  6 ++-
 include/linux/memcontrol.h  |  4 +-
 kernel/cgroup/cgroup.c      | 16 +++----
 mm/memcontrol.c             | 84 +++++++++++++++----------------------
 4 files changed, 50 insertions(+), 60 deletions(-)

-- 
2.46.0.295.g3b9ea8a38a-goog



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH mm-unstable v3 1/5] cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU
  2024-08-27 23:07 [PATCH mm-unstable v3 0/5] Improve mem_cgroup_iter() Kinsey Ho
@ 2024-08-27 23:07 ` Kinsey Ho
  2024-08-27 23:07 ` [PATCH mm-unstable v3 2/5] mm: don't hold css->refcnt during traversal Kinsey Ho
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 13+ messages in thread
From: Kinsey Ho @ 2024-08-27 23:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, cgroups, Yosry Ahmed, Roman Gushchin,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	Tejun Heo, Zefan Li, mkoutny, Kinsey Ho

Explicitly document that css sibling/descendant linkage is protected by
cgroup_mutex or RCU. Also, document in css_next_descendant_pre() and
similar functions that it isn't necessary to hold a ref on @pos.

The following changes in this patchset rely on this clarification
for simplification in memcg iteration code.

Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Kinsey Ho <kinseyho@google.com>
---
 include/linux/cgroup-defs.h |  6 +++++-
 kernel/cgroup/cgroup.c      | 16 +++++++++-------
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 7fc2d0195f56..ca7e912b8355 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -172,7 +172,11 @@ struct cgroup_subsys_state {
 	/* reference count - access via css_[try]get() and css_put() */
 	struct percpu_ref refcnt;
 
-	/* siblings list anchored at the parent's ->children */
+	/*
+	 * siblings list anchored at the parent's ->children
+	 *
+	 * linkage is protected by cgroup_mutex or RCU
+	 */
 	struct list_head sibling;
 	struct list_head children;
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 0a97cb2ef124..ece2316e2bca 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4602,8 +4602,9 @@ struct cgroup_subsys_state *css_next_child(struct cgroup_subsys_state *pos,
  *
  * While this function requires cgroup_mutex or RCU read locking, it
  * doesn't require the whole traversal to be contained in a single critical
- * section.  This function will return the correct next descendant as long
- * as both @pos and @root are accessible and @pos is a descendant of @root.
+ * section. Additionally, it isn't necessary to hold onto a reference to @pos.
+ * This function will return the correct next descendant as long as both @pos
+ * and @root are accessible and @pos is a descendant of @root.
  *
  * If a subsystem synchronizes ->css_online() and the start of iteration, a
  * css which finished ->css_online() is guaranteed to be visible in the
@@ -4651,8 +4652,9 @@ EXPORT_SYMBOL_GPL(css_next_descendant_pre);
  *
  * While this function requires cgroup_mutex or RCU read locking, it
  * doesn't require the whole traversal to be contained in a single critical
- * section.  This function will return the correct rightmost descendant as
- * long as @pos is accessible.
+ * section. Additionally, it isn't necessary to hold onto a reference to @pos.
+ * This function will return the correct rightmost descendant as long as @pos
+ * is accessible.
  */
 struct cgroup_subsys_state *
 css_rightmost_descendant(struct cgroup_subsys_state *pos)
@@ -4696,9 +4698,9 @@ css_leftmost_descendant(struct cgroup_subsys_state *pos)
  *
  * While this function requires cgroup_mutex or RCU read locking, it
  * doesn't require the whole traversal to be contained in a single critical
- * section.  This function will return the correct next descendant as long
- * as both @pos and @cgroup are accessible and @pos is a descendant of
- * @cgroup.
+ * section. Additionally, it isn't necessary to hold onto a reference to @pos.
+ * This function will return the correct next descendant as long as both @pos
+ * and @cgroup are accessible and @pos is a descendant of @cgroup.
  *
  * If a subsystem synchronizes ->css_online() and the start of iteration, a
  * css which finished ->css_online() is guaranteed to be visible in the
-- 
2.46.0.295.g3b9ea8a38a-goog



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH mm-unstable v3 2/5] mm: don't hold css->refcnt during traversal
  2024-08-27 23:07 [PATCH mm-unstable v3 0/5] Improve mem_cgroup_iter() Kinsey Ho
  2024-08-27 23:07 ` [PATCH mm-unstable v3 1/5] cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU Kinsey Ho
@ 2024-08-27 23:07 ` Kinsey Ho
  2024-08-28 17:58   ` T.J. Mercier
  2024-08-27 23:07 ` [PATCH mm-unstable v3 3/5] mm: increment gen # before restarting traversal Kinsey Ho
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 13+ messages in thread
From: Kinsey Ho @ 2024-08-27 23:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, cgroups, Yosry Ahmed, Roman Gushchin,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	Tejun Heo, Zefan Li, mkoutny, Kinsey Ho

To obtain the pointer to the next memcg position, mem_cgroup_iter()
currently holds css->refcnt during memcg traversal only to put
css->refcnt at the end of the routine. This isn't necessary as an
rcu_read_lock is already held throughout the function. The use of
the RCU read lock with css_next_descendant_pre() guarantees that
sibling linkage is safe without holding a ref on the passed-in @css.

Remove css->refcnt usage during traversal by leveraging RCU.

Signed-off-by: Kinsey Ho <kinseyho@google.com>
---
 mm/memcontrol.c | 18 +-----------------
 1 file changed, 1 insertion(+), 17 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 35431035e782..67b1994377b7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1013,20 +1013,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		else if (reclaim->generation != iter->generation)
 			goto out_unlock;
 
-		while (1) {
-			pos = READ_ONCE(iter->position);
-			if (!pos || css_tryget(&pos->css))
-				break;
-			/*
-			 * css reference reached zero, so iter->position will
-			 * be cleared by ->css_released. However, we should not
-			 * rely on this happening soon, because ->css_released
-			 * is called from a work queue, and by busy-waiting we
-			 * might block it. So we clear iter->position right
-			 * away.
-			 */
-			(void)cmpxchg(&iter->position, pos, NULL);
-		}
+		pos = READ_ONCE(iter->position);
 	} else if (prev) {
 		pos = prev;
 	}
@@ -1067,9 +1054,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		 */
 		(void)cmpxchg(&iter->position, pos, memcg);
 
-		if (pos)
-			css_put(&pos->css);
-
 		if (!memcg)
 			iter->generation++;
 	}
-- 
2.46.0.295.g3b9ea8a38a-goog



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH mm-unstable v3 3/5] mm: increment gen # before restarting traversal
  2024-08-27 23:07 [PATCH mm-unstable v3 0/5] Improve mem_cgroup_iter() Kinsey Ho
  2024-08-27 23:07 ` [PATCH mm-unstable v3 1/5] cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU Kinsey Ho
  2024-08-27 23:07 ` [PATCH mm-unstable v3 2/5] mm: don't hold css->refcnt during traversal Kinsey Ho
@ 2024-08-27 23:07 ` Kinsey Ho
  2024-08-28 17:49   ` T.J. Mercier
  2024-08-27 23:07 ` [PATCH mm-unstable v3 4/5] mm: restart if multiple traversals raced Kinsey Ho
  2024-08-27 23:07 ` [PATCH mm-unstable v3 5/5] mm: clean up mem_cgroup_iter() Kinsey Ho
  4 siblings, 1 reply; 13+ messages in thread
From: Kinsey Ho @ 2024-08-27 23:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, cgroups, Yosry Ahmed, Roman Gushchin,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	Tejun Heo, Zefan Li, mkoutny, Kinsey Ho

The generation number in struct mem_cgroup_reclaim_iter should be
incremented on every round-trip. Currently, it is possible for a
concurrent reclaimer to jump in at the end of the hierarchy, causing a
traversal restart (resetting the iteration position) without
incrementing the generation number.

By resetting the position without incrementing the generation, it's
possible for another ongoing mem_cgroup_iter() thread to walk the tree
twice.

Move the traversal restart such that the generation number is
incremented before the restart.

Signed-off-by: Kinsey Ho <kinseyho@google.com>
---
 mm/memcontrol.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 67b1994377b7..51b194a4c375 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -997,7 +997,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		root = root_mem_cgroup;
 
 	rcu_read_lock();
-
+restart:
 	if (reclaim) {
 		struct mem_cgroup_per_node *mz;
 
@@ -1024,14 +1024,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	for (;;) {
 		css = css_next_descendant_pre(css, &root->css);
 		if (!css) {
-			/*
-			 * Reclaimers share the hierarchy walk, and a
-			 * new one might jump in right at the end of
-			 * the hierarchy - make sure they see at least
-			 * one group and restart from the beginning.
-			 */
-			if (!prev)
-				continue;
 			break;
 		}
 
@@ -1054,8 +1046,18 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		 */
 		(void)cmpxchg(&iter->position, pos, memcg);
 
-		if (!memcg)
+		if (!memcg) {
 			iter->generation++;
+
+			/*
+			 * Reclaimers share the hierarchy walk, and a
+			 * new one might jump in right at the end of
+			 * the hierarchy - make sure they see at least
+			 * one group and restart from the beginning.
+			 */
+			if (!prev)
+				goto restart;
+		}
 	}
 
 out_unlock:
-- 
2.46.0.295.g3b9ea8a38a-goog



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH mm-unstable v3 4/5] mm: restart if multiple traversals raced
  2024-08-27 23:07 [PATCH mm-unstable v3 0/5] Improve mem_cgroup_iter() Kinsey Ho
                   ` (2 preceding siblings ...)
  2024-08-27 23:07 ` [PATCH mm-unstable v3 3/5] mm: increment gen # before restarting traversal Kinsey Ho
@ 2024-08-27 23:07 ` Kinsey Ho
  2024-08-28 17:49   ` T.J. Mercier
  2024-08-30 10:04   ` Hugh Dickins
  2024-08-27 23:07 ` [PATCH mm-unstable v3 5/5] mm: clean up mem_cgroup_iter() Kinsey Ho
  4 siblings, 2 replies; 13+ messages in thread
From: Kinsey Ho @ 2024-08-27 23:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, cgroups, Yosry Ahmed, Roman Gushchin,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	Tejun Heo, Zefan Li, mkoutny, Kinsey Ho

Currently, if multiple reclaimers raced on the same position, the
reclaimers which detect the race will still reclaim from the same memcg.
Instead, the reclaimers which detect the race should move on to the next
memcg in the hierarchy.

So, in the case where multiple traversals race, jump back to the start
of the mem_cgroup_iter() function to find the next memcg in the
hierarchy to reclaim from.

Signed-off-by: Kinsey Ho <kinseyho@google.com>
---
 include/linux/memcontrol.h |  4 ++--
 mm/memcontrol.c            | 22 ++++++++++++++--------
 2 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fe05fdb92779..2ef94c74847d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -57,7 +57,7 @@ enum memcg_memory_event {
 
 struct mem_cgroup_reclaim_cookie {
 	pg_data_t *pgdat;
-	unsigned int generation;
+	int generation;
 };
 
 #ifdef CONFIG_MEMCG
@@ -78,7 +78,7 @@ struct lruvec_stats;
 struct mem_cgroup_reclaim_iter {
 	struct mem_cgroup *position;
 	/* scan generation, increased every round-trip */
-	unsigned int generation;
+	atomic_t generation;
 };
 
 /*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 51b194a4c375..33bd379c738b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -986,7 +986,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 				   struct mem_cgroup_reclaim_cookie *reclaim)
 {
 	struct mem_cgroup_reclaim_iter *iter;
-	struct cgroup_subsys_state *css = NULL;
+	struct cgroup_subsys_state *css;
 	struct mem_cgroup *memcg = NULL;
 	struct mem_cgroup *pos = NULL;
 
@@ -999,18 +999,20 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	rcu_read_lock();
 restart:
 	if (reclaim) {
+		int gen;
 		struct mem_cgroup_per_node *mz;
 
 		mz = root->nodeinfo[reclaim->pgdat->node_id];
 		iter = &mz->iter;
+		gen = atomic_read(&iter->generation);
 
 		/*
 		 * On start, join the current reclaim iteration cycle.
 		 * Exit when a concurrent walker completes it.
 		 */
 		if (!prev)
-			reclaim->generation = iter->generation;
-		else if (reclaim->generation != iter->generation)
+			reclaim->generation = gen;
+		else if (reclaim->generation != gen)
 			goto out_unlock;
 
 		pos = READ_ONCE(iter->position);
@@ -1018,8 +1020,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		pos = prev;
 	}
 
-	if (pos)
-		css = &pos->css;
+	css = pos ? &pos->css : NULL;
 
 	for (;;) {
 		css = css_next_descendant_pre(css, &root->css);
@@ -1033,21 +1034,26 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		 * and kicking, and don't take an extra reference.
 		 */
 		if (css == &root->css || css_tryget(css)) {
-			memcg = mem_cgroup_from_css(css);
 			break;
 		}
 	}
 
+	memcg = mem_cgroup_from_css(css);
+
 	if (reclaim) {
 		/*
 		 * The position could have already been updated by a competing
 		 * thread, so check that the value hasn't changed since we read
 		 * it to avoid reclaiming from the same cgroup twice.
 		 */
-		(void)cmpxchg(&iter->position, pos, memcg);
+		if (cmpxchg(&iter->position, pos, memcg) != pos) {
+			if (css && css != &root->css)
+				css_put(css);
+			goto restart;
+		}
 
 		if (!memcg) {
-			iter->generation++;
+			atomic_inc(&iter->generation);
 
 			/*
 			 * Reclaimers share the hierarchy walk, and a
-- 
2.46.0.295.g3b9ea8a38a-goog



^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH mm-unstable v3 5/5] mm: clean up mem_cgroup_iter()
  2024-08-27 23:07 [PATCH mm-unstable v3 0/5] Improve mem_cgroup_iter() Kinsey Ho
                   ` (3 preceding siblings ...)
  2024-08-27 23:07 ` [PATCH mm-unstable v3 4/5] mm: restart if multiple traversals raced Kinsey Ho
@ 2024-08-27 23:07 ` Kinsey Ho
  2024-08-28 17:49   ` T.J. Mercier
  4 siblings, 1 reply; 13+ messages in thread
From: Kinsey Ho @ 2024-08-27 23:07 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, cgroups, Yosry Ahmed, Roman Gushchin,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	Tejun Heo, Zefan Li, mkoutny, Kinsey Ho

A clean up to make variable names more clear and to improve code
readability.

No functional change.

Signed-off-by: Kinsey Ho <kinseyho@google.com>
---
 mm/memcontrol.c | 30 +++++++++++-------------------
 1 file changed, 11 insertions(+), 19 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 33bd379c738b..2bdad7c29ac0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -987,8 +987,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 {
 	struct mem_cgroup_reclaim_iter *iter;
 	struct cgroup_subsys_state *css;
-	struct mem_cgroup *memcg = NULL;
-	struct mem_cgroup *pos = NULL;
+	struct mem_cgroup *pos;
+	struct mem_cgroup *next = NULL;
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -1000,10 +1000,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 restart:
 	if (reclaim) {
 		int gen;
-		struct mem_cgroup_per_node *mz;
+		int nid = reclaim->pgdat->node_id;
 
-		mz = root->nodeinfo[reclaim->pgdat->node_id];
-		iter = &mz->iter;
+		iter = &root->nodeinfo[nid]->iter;
 		gen = atomic_read(&iter->generation);
 
 		/*
@@ -1016,29 +1015,22 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			goto out_unlock;
 
 		pos = READ_ONCE(iter->position);
-	} else if (prev) {
+	} else
 		pos = prev;
-	}
 
 	css = pos ? &pos->css : NULL;
 
-	for (;;) {
-		css = css_next_descendant_pre(css, &root->css);
-		if (!css) {
-			break;
-		}
-
+	while ((css = css_next_descendant_pre(css, &root->css))) {
 		/*
 		 * Verify the css and acquire a reference.  The root
 		 * is provided by the caller, so we know it's alive
 		 * and kicking, and don't take an extra reference.
 		 */
-		if (css == &root->css || css_tryget(css)) {
+		if (css == &root->css || css_tryget(css))
 			break;
-		}
 	}
 
-	memcg = mem_cgroup_from_css(css);
+	next = mem_cgroup_from_css(css);
 
 	if (reclaim) {
 		/*
@@ -1046,13 +1038,13 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		 * thread, so check that the value hasn't changed since we read
 		 * it to avoid reclaiming from the same cgroup twice.
 		 */
-		if (cmpxchg(&iter->position, pos, memcg) != pos) {
+		if (cmpxchg(&iter->position, pos, next) != pos) {
 			if (css && css != &root->css)
 				css_put(css);
 			goto restart;
 		}
 
-		if (!memcg) {
+		if (!next) {
 			atomic_inc(&iter->generation);
 
 			/*
@@ -1071,7 +1063,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	if (prev && prev != root)
 		css_put(&prev->css);
 
-	return memcg;
+	return next;
 }
 
 /**
-- 
2.46.0.295.g3b9ea8a38a-goog



^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH mm-unstable v3 3/5] mm: increment gen # before restarting traversal
  2024-08-27 23:07 ` [PATCH mm-unstable v3 3/5] mm: increment gen # before restarting traversal Kinsey Ho
@ 2024-08-28 17:49   ` T.J. Mercier
  0 siblings, 0 replies; 13+ messages in thread
From: T.J. Mercier @ 2024-08-28 17:49 UTC (permalink / raw)
  To: Kinsey Ho
  Cc: Andrew Morton, linux-mm, linux-kernel, cgroups, Yosry Ahmed,
	Roman Gushchin, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Tejun Heo, Zefan Li, mkoutny

On Tue, Aug 27, 2024 at 4:11 PM Kinsey Ho <kinseyho@google.com> wrote:
>
> The generation number in struct mem_cgroup_reclaim_iter should be
> incremented on every round-trip. Currently, it is possible for a
> concurrent reclaimer to jump in at the end of the hierarchy, causing a
> traversal restart (resetting the iteration position) without
> incrementing the generation number.
>
> By resetting the position without incrementing the generation, it's
> possible for another ongoing mem_cgroup_iter() thread to walk the tree
> twice.
>
> Move the traversal restart such that the generation number is
> incremented before the restart.
>
> Signed-off-by: Kinsey Ho <kinseyho@google.com>

Reviewed-by: T.J. Mercier <tjmercier@google.com>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH mm-unstable v3 4/5] mm: restart if multiple traversals raced
  2024-08-27 23:07 ` [PATCH mm-unstable v3 4/5] mm: restart if multiple traversals raced Kinsey Ho
@ 2024-08-28 17:49   ` T.J. Mercier
  2024-08-30 10:04   ` Hugh Dickins
  1 sibling, 0 replies; 13+ messages in thread
From: T.J. Mercier @ 2024-08-28 17:49 UTC (permalink / raw)
  To: Kinsey Ho
  Cc: Andrew Morton, linux-mm, linux-kernel, cgroups, Yosry Ahmed,
	Roman Gushchin, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Tejun Heo, Zefan Li, mkoutny

On Tue, Aug 27, 2024 at 4:11 PM Kinsey Ho <kinseyho@google.com> wrote:
>
> Currently, if multiple reclaimers raced on the same position, the
> reclaimers which detect the race will still reclaim from the same memcg.
> Instead, the reclaimers which detect the race should move on to the next
> memcg in the hierarchy.
>
> So, in the case where multiple traversals race, jump back to the start
> of the mem_cgroup_iter() function to find the next memcg in the
> hierarchy to reclaim from.
>
> Signed-off-by: Kinsey Ho <kinseyho@google.com>

Reviewed-by: T.J. Mercier <tjmercier@google.com>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH mm-unstable v3 5/5] mm: clean up mem_cgroup_iter()
  2024-08-27 23:07 ` [PATCH mm-unstable v3 5/5] mm: clean up mem_cgroup_iter() Kinsey Ho
@ 2024-08-28 17:49   ` T.J. Mercier
  0 siblings, 0 replies; 13+ messages in thread
From: T.J. Mercier @ 2024-08-28 17:49 UTC (permalink / raw)
  To: Kinsey Ho
  Cc: Andrew Morton, linux-mm, linux-kernel, cgroups, Yosry Ahmed,
	Roman Gushchin, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Tejun Heo, Zefan Li, mkoutny

On Tue, Aug 27, 2024 at 4:11 PM Kinsey Ho <kinseyho@google.com> wrote:
>
> A clean up to make variable names more clear and to improve code
> readability.
>
> No functional change.
>
> Signed-off-by: Kinsey Ho <kinseyho@google.com>

Reviewed-by: T.J. Mercier <tjmercier@google.com>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH mm-unstable v3 2/5] mm: don't hold css->refcnt during traversal
  2024-08-27 23:07 ` [PATCH mm-unstable v3 2/5] mm: don't hold css->refcnt during traversal Kinsey Ho
@ 2024-08-28 17:58   ` T.J. Mercier
  0 siblings, 0 replies; 13+ messages in thread
From: T.J. Mercier @ 2024-08-28 17:58 UTC (permalink / raw)
  To: Kinsey Ho
  Cc: Andrew Morton, linux-mm, linux-kernel, cgroups, Yosry Ahmed,
	Roman Gushchin, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Tejun Heo, Zefan Li, mkoutny

On Tue, Aug 27, 2024 at 4:11 PM Kinsey Ho <kinseyho@google.com> wrote:
>
> To obtain the pointer to the next memcg position, mem_cgroup_iter()
> currently holds css->refcnt during memcg traversal only to put
> css->refcnt at the end of the routine. This isn't necessary as an
> rcu_read_lock is already held throughout the function. The use of
> the RCU read lock with css_next_descendant_pre() guarantees that
> sibling linkage is safe without holding a ref on the passed-in @css.
>
> Remove css->refcnt usage during traversal by leveraging RCU.
>
> Signed-off-by: Kinsey Ho <kinseyho@google.com>

Reviewed-by: T.J. Mercier <tjmercier@google.com>

I found a different place where a more trivial css get/put pair than
this could be removed, but I couldn't measure a perf difference. Like
Yosry, I appreciate the simplicity gains here though.

> ---
>  mm/memcontrol.c | 18 +-----------------
>  1 file changed, 1 insertion(+), 17 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 35431035e782..67b1994377b7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1013,20 +1013,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>                 else if (reclaim->generation != iter->generation)
>                         goto out_unlock;
>
> -               while (1) {
> -                       pos = READ_ONCE(iter->position);
> -                       if (!pos || css_tryget(&pos->css))
> -                               break;
> -                       /*
> -                        * css reference reached zero, so iter->position will
> -                        * be cleared by ->css_released. However, we should not
> -                        * rely on this happening soon, because ->css_released
> -                        * is called from a work queue, and by busy-waiting we
> -                        * might block it. So we clear iter->position right
> -                        * away.
> -                        */
> -                       (void)cmpxchg(&iter->position, pos, NULL);
> -               }
> +               pos = READ_ONCE(iter->position);
>         } else if (prev) {
>                 pos = prev;
>         }
> @@ -1067,9 +1054,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>                  */
>                 (void)cmpxchg(&iter->position, pos, memcg);
>
> -               if (pos)
> -                       css_put(&pos->css);
> -
>                 if (!memcg)
>                         iter->generation++;
>         }
> --
> 2.46.0.295.g3b9ea8a38a-goog
>
>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH mm-unstable v3 4/5] mm: restart if multiple traversals raced
  2024-08-27 23:07 ` [PATCH mm-unstable v3 4/5] mm: restart if multiple traversals raced Kinsey Ho
  2024-08-28 17:49   ` T.J. Mercier
@ 2024-08-30 10:04   ` Hugh Dickins
  2024-08-30 17:45     ` Kinsey Ho
  1 sibling, 1 reply; 13+ messages in thread
From: Hugh Dickins @ 2024-08-30 10:04 UTC (permalink / raw)
  To: Andrew Morton, Kinsey Ho
  Cc: linux-mm, linux-kernel, cgroups, Yosry Ahmed, Roman Gushchin,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	Tejun Heo, Zefan Li, mkoutny, baolin.wang, tjmercier, hughd

On Tue, 27 Aug 2024, Kinsey Ho wrote:

> Currently, if multiple reclaimers raced on the same position, the
> reclaimers which detect the race will still reclaim from the same memcg.
> Instead, the reclaimers which detect the race should move on to the next
> memcg in the hierarchy.
> 
> So, in the case where multiple traversals race, jump back to the start
> of the mem_cgroup_iter() function to find the next memcg in the
> hierarchy to reclaim from.
> 
> Signed-off-by: Kinsey Ho <kinseyho@google.com>

mm-unstable commit 954dd0848c61 needs the fix below to be merged in;
but the commit after it (the 5/5) then renames "memcg" to "next",
so that one has to be adjusted too.

[PATCH] mm: restart if multiple traversals raced: fix

mem_cgroup_iter() reset memcg to NULL before the goto restart, so that
goto out_unlock does not then return an ungotten memcg, causing oopses
on stale memcg in many places (often in memcg_rstat_updated()).

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/memcontrol.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6f66ac0ad4f0..dd82dd1e1f0a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1049,6 +1049,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		if (cmpxchg(&iter->position, pos, memcg) != pos) {
 			if (css && css != &root->css)
 				css_put(css);
+			memcg = NULL;
 			goto restart;
 		}
 
-- 
2.35.3


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH mm-unstable v3 4/5] mm: restart if multiple traversals raced
  2024-08-30 10:04   ` Hugh Dickins
@ 2024-08-30 17:45     ` Kinsey Ho
  2024-08-30 19:04       ` Yu Zhao
  0 siblings, 1 reply; 13+ messages in thread
From: Kinsey Ho @ 2024-08-30 17:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Hugh Dickins, linux-mm, linux-kernel, cgroups, Yosry Ahmed,
	Roman Gushchin, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Tejun Heo, Zefan Li, mkoutny, baolin.wang,
	tjmercier

On Fri, Aug 30, 2024 at 3:04 AM Hugh Dickins <hughd@google.com> wrote:
>
> mm-unstable commit 954dd0848c61 needs the fix below to be merged in;
> but the commit after it (the 5/5) then renames "memcg" to "next",
> so that one has to be adjusted too.
>
> [PATCH] mm: restart if multiple traversals raced: fix
>
> mem_cgroup_iter() reset memcg to NULL before the goto restart, so that
> goto out_unlock does not then return an ungotten memcg, causing oopses
> on stale memcg in many places (often in memcg_rstat_updated()).
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>  mm/memcontrol.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 6f66ac0ad4f0..dd82dd1e1f0a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1049,6 +1049,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>                 if (cmpxchg(&iter->position, pos, memcg) != pos) {
>                         if (css && css != &root->css)
>                                 css_put(css);
> +                       memcg = NULL;
>                         goto restart;
>                 }
>
> --
> 2.35.3

Hi Andrew,

Would you prefer that I resend the series with Hugh's fix inserted?

Acked-by: Kinsey Ho <kinseyho@google.com>


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH mm-unstable v3 4/5] mm: restart if multiple traversals raced
  2024-08-30 17:45     ` Kinsey Ho
@ 2024-08-30 19:04       ` Yu Zhao
  0 siblings, 0 replies; 13+ messages in thread
From: Yu Zhao @ 2024-08-30 19:04 UTC (permalink / raw)
  To: Kinsey Ho
  Cc: Andrew Morton, Hugh Dickins, linux-mm, linux-kernel, cgroups,
	Yosry Ahmed, Roman Gushchin, Johannes Weiner, Michal Hocko,
	Shakeel Butt, Muchun Song, Tejun Heo, Zefan Li, mkoutny,
	baolin.wang, tjmercier

On Fri, Aug 30, 2024 at 11:45 AM Kinsey Ho <kinseyho@google.com> wrote:
>
> On Fri, Aug 30, 2024 at 3:04 AM Hugh Dickins <hughd@google.com> wrote:
> >
> > mm-unstable commit 954dd0848c61 needs the fix below to be merged in;
> > but the commit after it (the 5/5) then renames "memcg" to "next",
> > so that one has to be adjusted too.
> >
> > [PATCH] mm: restart if multiple traversals raced: fix
> >
> > mem_cgroup_iter() reset memcg to NULL before the goto restart, so that
> > goto out_unlock does not then return an ungotten memcg, causing oopses
> > on stale memcg in many places (often in memcg_rstat_updated()).
> >
> > Signed-off-by: Hugh Dickins <hughd@google.com>
> > ---
> >  mm/memcontrol.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 6f66ac0ad4f0..dd82dd1e1f0a 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1049,6 +1049,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
> >                 if (cmpxchg(&iter->position, pos, memcg) != pos) {
> >                         if (css && css != &root->css)
> >                                 css_put(css);
> > +                       memcg = NULL;
> >                         goto restart;
> >                 }
> >
> > --
> > 2.35.3
>
> Hi Andrew,
>
> Would you prefer that I resend the series with Hugh's fix inserted?

Please send a new version to get this properly fixed, preferably move
the initialization of `memcg` from the declaration to right below
`restart`, and also add the following footers:

Reported-by: syzbot+e099d407346c45275ce9@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/000000000000817cf10620e20d33@google.com/


^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2024-08-30 19:04 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-27 23:07 [PATCH mm-unstable v3 0/5] Improve mem_cgroup_iter() Kinsey Ho
2024-08-27 23:07 ` [PATCH mm-unstable v3 1/5] cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU Kinsey Ho
2024-08-27 23:07 ` [PATCH mm-unstable v3 2/5] mm: don't hold css->refcnt during traversal Kinsey Ho
2024-08-28 17:58   ` T.J. Mercier
2024-08-27 23:07 ` [PATCH mm-unstable v3 3/5] mm: increment gen # before restarting traversal Kinsey Ho
2024-08-28 17:49   ` T.J. Mercier
2024-08-27 23:07 ` [PATCH mm-unstable v3 4/5] mm: restart if multiple traversals raced Kinsey Ho
2024-08-28 17:49   ` T.J. Mercier
2024-08-30 10:04   ` Hugh Dickins
2024-08-30 17:45     ` Kinsey Ho
2024-08-30 19:04       ` Yu Zhao
2024-08-27 23:07 ` [PATCH mm-unstable v3 5/5] mm: clean up mem_cgroup_iter() Kinsey Ho
2024-08-28 17:49   ` T.J. Mercier

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox