linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH mm-unstable v2 0/5] Improve mem_cgroup_iter()
@ 2024-08-13 20:47 Kinsey Ho
  2024-08-13 20:47 ` [PATCH mm-unstable v2 1/5] cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU Kinsey Ho
                   ` (4 more replies)
  0 siblings, 5 replies; 12+ messages in thread
From: Kinsey Ho @ 2024-08-13 20:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, cgroups, Yosry Ahmed, Roman Gushchin,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	Tejun Heo, Zefan Li, mkoutny, Kinsey Ho

Incremental cgroup iteration is being used again [1]. This patchset
improves the reliability of mem_cgroup_iter(). It also improves
simplicity and code readability.

[1] https://lore.kernel.org/20240514202641.2821494-1-hannes@cmpxchg.org/
--

v2: add patch to clarify css sibling linkage is RCU protected. The
kernel build bot RCU sparse error from v1 has been ignored.
v1: https://lore.kernel.org/20240724190214.1108049-1-kinseyho@google.com/

Kinsey Ho (5):
  cgroup: clarify css sibling linkage is protected by cgroup_mutex or
    RCU
  mm: don't hold css->refcnt during traversal
  mm: increment gen # before restarting traversal
  mm: restart if multiple traversals raced
  mm: clean up mem_cgroup_iter()

 include/linux/cgroup-defs.h |  6 ++-
 include/linux/memcontrol.h  |  6 +--
 kernel/cgroup/cgroup.c      | 16 +++----
 mm/memcontrol.c             | 84 +++++++++++++++----------------------
 4 files changed, 51 insertions(+), 61 deletions(-)

-- 
2.46.0.76.ge559c4bf1a-goog



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH mm-unstable v2 1/5] cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU
  2024-08-13 20:47 [PATCH mm-unstable v2 0/5] Improve mem_cgroup_iter() Kinsey Ho
@ 2024-08-13 20:47 ` Kinsey Ho
  2024-08-14  9:00   ` Michal Koutný
  2024-08-13 20:47 ` [PATCH mm-unstable v2 2/5] mm: don't hold css->refcnt during traversal Kinsey Ho
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Kinsey Ho @ 2024-08-13 20:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, cgroups, Yosry Ahmed, Roman Gushchin,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	Tejun Heo, Zefan Li, mkoutny, Kinsey Ho

Explicitly document that css sibling/descendant linkage is protected by
cgroup_mutex or RCU. Also, document in css_next_descendant_pre() and
similar functions that is isn't necessary to hold a ref on @pos.

The following changes in this patchset rely on this clarification
for simplification in memcg iteration code.

Suggested-By: Yosry Ahmed <yosryahmed@google.com>
Signed-Off-By: Kinsey Ho <kinseyho@google.com>

---
 include/linux/cgroup-defs.h |  6 +++++-
 kernel/cgroup/cgroup.c      | 16 +++++++++-------
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 7fc2d0195f56..6862243bd1c2 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -172,7 +172,11 @@ struct cgroup_subsys_state {
 	/* reference count - access via css_[try]get() and css_put() */
 	struct percpu_ref refcnt;
 
-	/* siblings list anchored at the parent's ->children */
+	/*
+	 * siblings list anchored at the parent's ->children
+	 *
+	 * linkage is protected by cgroup_mutex or RCU
+	 */
 	struct list_head sibling;
 	struct list_head children;
 
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 0a97cb2ef124..ece2316e2bca 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -4602,8 +4602,9 @@ struct cgroup_subsys_state *css_next_child(struct cgroup_subsys_state *pos,
  *
  * While this function requires cgroup_mutex or RCU read locking, it
  * doesn't require the whole traversal to be contained in a single critical
- * section.  This function will return the correct next descendant as long
- * as both @pos and @root are accessible and @pos is a descendant of @root.
+ * section. Additionally, it isn't necessary to hold onto a reference to @pos.
+ * This function will return the correct next descendant as long as both @pos
+ * and @root are accessible and @pos is a descendant of @root.
  *
  * If a subsystem synchronizes ->css_online() and the start of iteration, a
  * css which finished ->css_online() is guaranteed to be visible in the
@@ -4651,8 +4652,9 @@ EXPORT_SYMBOL_GPL(css_next_descendant_pre);
  *
  * While this function requires cgroup_mutex or RCU read locking, it
  * doesn't require the whole traversal to be contained in a single critical
- * section.  This function will return the correct rightmost descendant as
- * long as @pos is accessible.
+ * section. Additionally, it isn't necessary to hold onto a reference to @pos.
+ * This function will return the correct rightmost descendant as long as @pos
+ * is accessible.
  */
 struct cgroup_subsys_state *
 css_rightmost_descendant(struct cgroup_subsys_state *pos)
@@ -4696,9 +4698,9 @@ css_leftmost_descendant(struct cgroup_subsys_state *pos)
  *
  * While this function requires cgroup_mutex or RCU read locking, it
  * doesn't require the whole traversal to be contained in a single critical
- * section.  This function will return the correct next descendant as long
- * as both @pos and @cgroup are accessible and @pos is a descendant of
- * @cgroup.
+ * section. Additionally, it isn't necessary to hold onto a reference to @pos.
+ * This function will return the correct next descendant as long as both @pos
+ * and @cgroup are accessible and @pos is a descendant of @cgroup.
  *
  * If a subsystem synchronizes ->css_online() and the start of iteration, a
  * css which finished ->css_online() is guaranteed to be visible in the
-- 
2.46.0.76.ge559c4bf1a-goog



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH mm-unstable v2 2/5] mm: don't hold css->refcnt during traversal
  2024-08-13 20:47 [PATCH mm-unstable v2 0/5] Improve mem_cgroup_iter() Kinsey Ho
  2024-08-13 20:47 ` [PATCH mm-unstable v2 1/5] cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU Kinsey Ho
@ 2024-08-13 20:47 ` Kinsey Ho
  2024-08-14  9:00   ` Michal Koutný
  2024-08-13 20:47 ` [PATCH mm-unstable v2 3/5] mm: increment gen # before restarting traversal Kinsey Ho
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 12+ messages in thread
From: Kinsey Ho @ 2024-08-13 20:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, cgroups, Yosry Ahmed, Roman Gushchin,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	Tejun Heo, Zefan Li, mkoutny, Kinsey Ho

To obtain the pointer to the next memcg position, mem_cgroup_iter()
currently holds css->refcnt during memcg traversal only to put
css->refcnt at the end of the routine. This isn't necessary as an
rcu_read_lock is already held throughout the function. The use of
the RCU read lock with css_next_descendant_pre() guarantees that
sibling linkage is safe without holding a ref on the passed-in @css.

Remove css->refcnt usage during traversal by leveraging RCU.

Signed-off-by: Kinsey Ho <kinseyho@google.com>
---
 include/linux/memcontrol.h |  2 +-
 mm/memcontrol.c            | 18 +-----------------
 2 files changed, 2 insertions(+), 18 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 90ecd2dbca06..1aaed2f1f6ae 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -75,7 +75,7 @@ struct lruvec_stats_percpu;
 struct lruvec_stats;
 
 struct mem_cgroup_reclaim_iter {
-	struct mem_cgroup *position;
+	struct mem_cgroup __rcu *position;
 	/* scan generation, increased every round-trip */
 	unsigned int generation;
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dacf4fec4541..1688aae3b1b4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1052,20 +1052,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		else if (reclaim->generation != iter->generation)
 			goto out_unlock;
 
-		while (1) {
-			pos = READ_ONCE(iter->position);
-			if (!pos || css_tryget(&pos->css))
-				break;
-			/*
-			 * css reference reached zero, so iter->position will
-			 * be cleared by ->css_released. However, we should not
-			 * rely on this happening soon, because ->css_released
-			 * is called from a work queue, and by busy-waiting we
-			 * might block it. So we clear iter->position right
-			 * away.
-			 */
-			(void)cmpxchg(&iter->position, pos, NULL);
-		}
+		pos = rcu_dereference(iter->position);
 	} else if (prev) {
 		pos = prev;
 	}
@@ -1106,9 +1093,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		 */
 		(void)cmpxchg(&iter->position, pos, memcg);
 
-		if (pos)
-			css_put(&pos->css);
-
 		if (!memcg)
 			iter->generation++;
 	}
-- 
2.46.0.76.ge559c4bf1a-goog



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH mm-unstable v2 3/5] mm: increment gen # before restarting traversal
  2024-08-13 20:47 [PATCH mm-unstable v2 0/5] Improve mem_cgroup_iter() Kinsey Ho
  2024-08-13 20:47 ` [PATCH mm-unstable v2 1/5] cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU Kinsey Ho
  2024-08-13 20:47 ` [PATCH mm-unstable v2 2/5] mm: don't hold css->refcnt during traversal Kinsey Ho
@ 2024-08-13 20:47 ` Kinsey Ho
  2024-08-13 20:47 ` [PATCH mm-unstable v2 4/5] mm: restart if multiple traversals raced Kinsey Ho
  2024-08-13 20:47 ` [PATCH mm-unstable v2 5/5] mm: clean up mem_cgroup_iter() Kinsey Ho
  4 siblings, 0 replies; 12+ messages in thread
From: Kinsey Ho @ 2024-08-13 20:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, cgroups, Yosry Ahmed, Roman Gushchin,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	Tejun Heo, Zefan Li, mkoutny, Kinsey Ho

The generation number in struct mem_cgroup_reclaim_iter should be
incremented on every round-trip. Currently, it is possible for a
concurrent reclaimer to jump in at the end of the hierarchy, causing a
traversal restart (resetting the iteration position) without
incrementing the generation number.

By resetting the position without incrementing the generation, it's
possible for another ongoing mem_cgroup_iter() thread to walk the tree
twice.

Move the traversal restart such that the generation number is
incremented before the restart.

Signed-off-by: Kinsey Ho <kinseyho@google.com>
---
 mm/memcontrol.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1688aae3b1b4..937b7efc41ca 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1036,7 +1036,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		root = root_mem_cgroup;
 
 	rcu_read_lock();
-
+restart:
 	if (reclaim) {
 		struct mem_cgroup_per_node *mz;
 
@@ -1063,14 +1063,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	for (;;) {
 		css = css_next_descendant_pre(css, &root->css);
 		if (!css) {
-			/*
-			 * Reclaimers share the hierarchy walk, and a
-			 * new one might jump in right at the end of
-			 * the hierarchy - make sure they see at least
-			 * one group and restart from the beginning.
-			 */
-			if (!prev)
-				continue;
 			break;
 		}
 
@@ -1093,8 +1085,18 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		 */
 		(void)cmpxchg(&iter->position, pos, memcg);
 
-		if (!memcg)
+		if (!memcg) {
 			iter->generation++;
+
+			/*
+			 * Reclaimers share the hierarchy walk, and a
+			 * new one might jump in right at the end of
+			 * the hierarchy - make sure they see at least
+			 * one group and restart from the beginning.
+			 */
+			if (!prev)
+				goto restart;
+		}
 	}
 
 out_unlock:
-- 
2.46.0.76.ge559c4bf1a-goog



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH mm-unstable v2 4/5] mm: restart if multiple traversals raced
  2024-08-13 20:47 [PATCH mm-unstable v2 0/5] Improve mem_cgroup_iter() Kinsey Ho
                   ` (2 preceding siblings ...)
  2024-08-13 20:47 ` [PATCH mm-unstable v2 3/5] mm: increment gen # before restarting traversal Kinsey Ho
@ 2024-08-13 20:47 ` Kinsey Ho
  2024-08-14  9:00   ` Michal Koutný
  2024-08-13 20:47 ` [PATCH mm-unstable v2 5/5] mm: clean up mem_cgroup_iter() Kinsey Ho
  4 siblings, 1 reply; 12+ messages in thread
From: Kinsey Ho @ 2024-08-13 20:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, cgroups, Yosry Ahmed, Roman Gushchin,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	Tejun Heo, Zefan Li, mkoutny, Kinsey Ho

Currently, if multiple reclaimers raced on the same position, the
reclaimers which detect the race will still reclaim from the same memcg.
Instead, the reclaimers which detect the race should move on to the next
memcg in the hierarchy.

So, in the case where multiple traversals race, jump back to the start
of the mem_cgroup_iter() function to find the next memcg in the
hierarchy to reclaim from.

Signed-off-by: Kinsey Ho <kinseyho@google.com>
---
 include/linux/memcontrol.h |  4 ++--
 mm/memcontrol.c            | 22 ++++++++++++++--------
 2 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1aaed2f1f6ae..aada9ef3ca44 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -57,7 +57,7 @@ enum memcg_memory_event {
 
 struct mem_cgroup_reclaim_cookie {
 	pg_data_t *pgdat;
-	unsigned int generation;
+	int generation;
 };
 
 #ifdef CONFIG_MEMCG
@@ -77,7 +77,7 @@ struct lruvec_stats;
 struct mem_cgroup_reclaim_iter {
 	struct mem_cgroup __rcu *position;
 	/* scan generation, increased every round-trip */
-	unsigned int generation;
+	atomic_t generation;
 };
 
 /*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 937b7efc41ca..84de46ece9a9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1025,7 +1025,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 				   struct mem_cgroup_reclaim_cookie *reclaim)
 {
 	struct mem_cgroup_reclaim_iter *iter;
-	struct cgroup_subsys_state *css = NULL;
+	struct cgroup_subsys_state *css;
 	struct mem_cgroup *memcg = NULL;
 	struct mem_cgroup *pos = NULL;
 
@@ -1038,18 +1038,20 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	rcu_read_lock();
 restart:
 	if (reclaim) {
+		int gen;
 		struct mem_cgroup_per_node *mz;
 
 		mz = root->nodeinfo[reclaim->pgdat->node_id];
 		iter = &mz->iter;
+		gen = atomic_read(&iter->generation);
 
 		/*
 		 * On start, join the current reclaim iteration cycle.
 		 * Exit when a concurrent walker completes it.
 		 */
 		if (!prev)
-			reclaim->generation = iter->generation;
-		else if (reclaim->generation != iter->generation)
+			reclaim->generation = gen;
+		else if (reclaim->generation != gen)
 			goto out_unlock;
 
 		pos = rcu_dereference(iter->position);
@@ -1057,8 +1059,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		pos = prev;
 	}
 
-	if (pos)
-		css = &pos->css;
+	css = pos ? &pos->css : NULL;
 
 	for (;;) {
 		css = css_next_descendant_pre(css, &root->css);
@@ -1072,21 +1073,26 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		 * and kicking, and don't take an extra reference.
 		 */
 		if (css == &root->css || css_tryget(css)) {
-			memcg = mem_cgroup_from_css(css);
 			break;
 		}
 	}
 
+	memcg = mem_cgroup_from_css(css);
+
 	if (reclaim) {
 		/*
 		 * The position could have already been updated by a competing
 		 * thread, so check that the value hasn't changed since we read
 		 * it to avoid reclaiming from the same cgroup twice.
 		 */
-		(void)cmpxchg(&iter->position, pos, memcg);
+		if (cmpxchg(&iter->position, pos, memcg) != pos) {
+			if (css && css != &root->css)
+				css_put(css);
+			goto restart;
+		}
 
 		if (!memcg) {
-			iter->generation++;
+			atomic_inc(&iter->generation);
 
 			/*
 			 * Reclaimers share the hierarchy walk, and a
-- 
2.46.0.76.ge559c4bf1a-goog



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH mm-unstable v2 5/5] mm: clean up mem_cgroup_iter()
  2024-08-13 20:47 [PATCH mm-unstable v2 0/5] Improve mem_cgroup_iter() Kinsey Ho
                   ` (3 preceding siblings ...)
  2024-08-13 20:47 ` [PATCH mm-unstable v2 4/5] mm: restart if multiple traversals raced Kinsey Ho
@ 2024-08-13 20:47 ` Kinsey Ho
  4 siblings, 0 replies; 12+ messages in thread
From: Kinsey Ho @ 2024-08-13 20:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, cgroups, Yosry Ahmed, Roman Gushchin,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song,
	Tejun Heo, Zefan Li, mkoutny, Kinsey Ho

A clean up to make variable names more clear and to improve code readability.

No functional change.

Signed-off-by: Kinsey Ho <kinseyho@google.com>
---
 mm/memcontrol.c | 30 +++++++++++-------------------
 1 file changed, 11 insertions(+), 19 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 84de46ece9a9..87a0dc9d779a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1026,8 +1026,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 {
 	struct mem_cgroup_reclaim_iter *iter;
 	struct cgroup_subsys_state *css;
-	struct mem_cgroup *memcg = NULL;
-	struct mem_cgroup *pos = NULL;
+	struct mem_cgroup *pos;
+	struct mem_cgroup *next = NULL;
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -1039,10 +1039,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 restart:
 	if (reclaim) {
 		int gen;
-		struct mem_cgroup_per_node *mz;
+		int nid = reclaim->pgdat->node_id;
 
-		mz = root->nodeinfo[reclaim->pgdat->node_id];
-		iter = &mz->iter;
+		iter = &root->nodeinfo[nid]->iter;
 		gen = atomic_read(&iter->generation);
 
 		/*
@@ -1055,29 +1054,22 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			goto out_unlock;
 
 		pos = rcu_dereference(iter->position);
-	} else if (prev) {
+	} else
 		pos = prev;
-	}
 
 	css = pos ? &pos->css : NULL;
 
-	for (;;) {
-		css = css_next_descendant_pre(css, &root->css);
-		if (!css) {
-			break;
-		}
-
+	while ((css = css_next_descendant_pre(css, &root->css))) {
 		/*
 		 * Verify the css and acquire a reference.  The root
 		 * is provided by the caller, so we know it's alive
 		 * and kicking, and don't take an extra reference.
 		 */
-		if (css == &root->css || css_tryget(css)) {
+		if (css == &root->css || css_tryget(css))
 			break;
-		}
 	}
 
-	memcg = mem_cgroup_from_css(css);
+	next = mem_cgroup_from_css(css);
 
 	if (reclaim) {
 		/*
@@ -1085,13 +1077,13 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		 * thread, so check that the value hasn't changed since we read
 		 * it to avoid reclaiming from the same cgroup twice.
 		 */
-		if (cmpxchg(&iter->position, pos, memcg) != pos) {
+		if (cmpxchg(&iter->position, pos, next) != pos) {
 			if (css && css != &root->css)
 				css_put(css);
 			goto restart;
 		}
 
-		if (!memcg) {
+		if (!next) {
 			atomic_inc(&iter->generation);
 
 			/*
@@ -1110,7 +1102,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	if (prev && prev != root)
 		css_put(&prev->css);
 
-	return memcg;
+	return next;
 }
 
 /**
-- 
2.46.0.76.ge559c4bf1a-goog



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH mm-unstable v2 1/5] cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU
  2024-08-13 20:47 ` [PATCH mm-unstable v2 1/5] cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU Kinsey Ho
@ 2024-08-14  9:00   ` Michal Koutný
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Koutný @ 2024-08-14  9:00 UTC (permalink / raw)
  To: Kinsey Ho
  Cc: Andrew Morton, linux-mm, linux-kernel, cgroups, Yosry Ahmed,
	Roman Gushchin, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Tejun Heo, Zefan Li

[-- Attachment #1: Type: text/plain, Size: 3254 bytes --]

On Tue, Aug 13, 2024 at 08:47:11PM GMT, Kinsey Ho <kinseyho@google.com> wrote:
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -172,7 +172,11 @@ struct cgroup_subsys_state {
>  	/* reference count - access via css_[try]get() and css_put() */
>  	struct percpu_ref refcnt;
>  
> -	/* siblings list anchored at the parent's ->children */
> +	/*
> +	 * siblings list anchored at the parent's ->children
> +	 *
> +	 * linkage is protected by cgroup_mutex or RCU
> +	 */
>  	struct list_head sibling;
>  	struct list_head children;

Thanks, this is good.

>  
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 0a97cb2ef124..ece2316e2bca 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -4602,8 +4602,9 @@ struct cgroup_subsys_state *css_next_child(struct cgroup_subsys_state *pos,
>   *
>   * While this function requires cgroup_mutex or RCU read locking, it
>   * doesn't require the whole traversal to be contained in a single critical
> - * section.  This function will return the correct next descendant as long
> - * as both @pos and @root are accessible and @pos is a descendant of @root.
> + * section. Additionally, it isn't necessary to hold onto a reference to @pos.
> + * This function will return the correct next descendant as long as both @pos
> + * and @root are accessible and @pos is a descendant of @root.
>   *
>   * If a subsystem synchronizes ->css_online() and the start of iteration, a
>   * css which finished ->css_online() is guaranteed to be visible in the
> @@ -4651,8 +4652,9 @@ EXPORT_SYMBOL_GPL(css_next_descendant_pre);
>   *
>   * While this function requires cgroup_mutex or RCU read locking, it
>   * doesn't require the whole traversal to be contained in a single critical
> - * section.  This function will return the correct rightmost descendant as
> - * long as @pos is accessible.
> + * section. Additionally, it isn't necessary to hold onto a reference to @pos.
> + * This function will return the correct rightmost descendant as long as @pos
> + * is accessible.
>   */
>  struct cgroup_subsys_state *
>  css_rightmost_descendant(struct cgroup_subsys_state *pos)
> @@ -4696,9 +4698,9 @@ css_leftmost_descendant(struct cgroup_subsys_state *pos)
>   *
>   * While this function requires cgroup_mutex or RCU read locking, it
>   * doesn't require the whole traversal to be contained in a single critical
> - * section.  This function will return the correct next descendant as long
> - * as both @pos and @cgroup are accessible and @pos is a descendant of
> - * @cgroup.
> + * section. Additionally, it isn't necessary to hold onto a reference to @pos.
> + * This function will return the correct next descendant as long as both @pos
> + * and @cgroup are accessible and @pos is a descendant of @cgroup.
>   *
>   * If a subsystem synchronizes ->css_online() and the start of iteration, a
>   * css which finished ->css_online() is guaranteed to be visible in the

I'd say these comments are unnecessary given the functions have
cgroup_assert_mutex_or_rcu_locked() but if it helps overall
understanding in broader context, why not.

Reviewed-by: Michal Koutný <mkoutny@suse.com>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH mm-unstable v2 4/5] mm: restart if multiple traversals raced
  2024-08-13 20:47 ` [PATCH mm-unstable v2 4/5] mm: restart if multiple traversals raced Kinsey Ho
@ 2024-08-14  9:00   ` Michal Koutný
  2024-08-16 16:27     ` Kinsey Ho
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Koutný @ 2024-08-14  9:00 UTC (permalink / raw)
  To: Kinsey Ho
  Cc: Andrew Morton, linux-mm, linux-kernel, cgroups, Yosry Ahmed,
	Roman Gushchin, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Tejun Heo, Zefan Li

[-- Attachment #1: Type: text/plain, Size: 1052 bytes --]

On Tue, Aug 13, 2024 at 08:47:14PM GMT, Kinsey Ho <kinseyho@google.com> wrote:
> @@ -1072,21 +1073,26 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
>  		 * and kicking, and don't take an extra reference.
>  		 */
>  		if (css == &root->css || css_tryget(css)) {
> -			memcg = mem_cgroup_from_css(css);
>  			break;
>  		}
>  	}
>  
> +	memcg = mem_cgroup_from_css(css);
> +
>  	if (reclaim) {
>  		/*
>  		 * The position could have already been updated by a competing
>  		 * thread, so check that the value hasn't changed since we read
>  		 * it to avoid reclaiming from the same cgroup twice.
>  		 */
> -		(void)cmpxchg(&iter->position, pos, memcg);
> +		if (cmpxchg(&iter->position, pos, memcg) != pos) {
> +			if (css && css != &root->css)
> +				css_put(css);
> +			goto restart;
> +		}

I may be missing (literal) context but I'd suggest not moving the memcg
assignment and leverage
	if (memcg != NULL)
		css_put(memcg->css)
so that the is-root comparison needn't be repeated.

Thanks,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH mm-unstable v2 2/5] mm: don't hold css->refcnt during traversal
  2024-08-13 20:47 ` [PATCH mm-unstable v2 2/5] mm: don't hold css->refcnt during traversal Kinsey Ho
@ 2024-08-14  9:00   ` Michal Koutný
  2024-08-16 16:15     ` Kinsey Ho
  0 siblings, 1 reply; 12+ messages in thread
From: Michal Koutný @ 2024-08-14  9:00 UTC (permalink / raw)
  To: Kinsey Ho
  Cc: Andrew Morton, linux-mm, linux-kernel, cgroups, Yosry Ahmed,
	Roman Gushchin, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Tejun Heo, Zefan Li

[-- Attachment #1: Type: text/plain, Size: 1520 bytes --]

On Tue, Aug 13, 2024 at 08:47:12PM GMT, Kinsey Ho <kinseyho@google.com> wrote:
> To obtain the pointer to the next memcg position, mem_cgroup_iter()
> currently holds css->refcnt during memcg traversal only to put
> css->refcnt at the end of the routine. This isn't necessary as an
> rcu_read_lock is already held throughout the function. The use of
> the RCU read lock with css_next_descendant_pre() guarantees that
> sibling linkage is safe without holding a ref on the passed-in @css.
> 
> Remove css->refcnt usage during traversal by leveraging RCU.
> 
> Signed-off-by: Kinsey Ho <kinseyho@google.com>
> ---
>  include/linux/memcontrol.h |  2 +-
>  mm/memcontrol.c            | 18 +-----------------
>  2 files changed, 2 insertions(+), 18 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 90ecd2dbca06..1aaed2f1f6ae 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -75,7 +75,7 @@ struct lruvec_stats_percpu;
>  struct lruvec_stats;
>  
>  struct mem_cgroup_reclaim_iter {
> -	struct mem_cgroup *position;
> +	struct mem_cgroup __rcu *position;

I'm not sure about this annotation.
This pointer could be modified concurrently with RCU read sections with
the cmpxchg which would assume that's equivalent with
rcu_assign_pointer(). (Which it might be but it's not idiomatic, so it
causes some head wrapping.)
Isn't this situation covered with a regular pointer and
READ_ONCE()+cmpxchg?

Regards,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH mm-unstable v2 2/5] mm: don't hold css->refcnt during traversal
  2024-08-14  9:00   ` Michal Koutný
@ 2024-08-16 16:15     ` Kinsey Ho
  0 siblings, 0 replies; 12+ messages in thread
From: Kinsey Ho @ 2024-08-16 16:15 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Andrew Morton, linux-mm, linux-kernel, cgroups, Yosry Ahmed,
	Roman Gushchin, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Tejun Heo, Zefan Li

Hi Michal,

Thank you for reviewing this patchset!

On Wed, Aug 14, 2024 at 5:00 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Tue, Aug 13, 2024 at 08:47:12PM GMT, Kinsey Ho <kinseyho@google.com> wrote:
> > To obtain the pointer to the next memcg position, mem_cgroup_iter()
> > currently holds css->refcnt during memcg traversal only to put
> > css->refcnt at the end of the routine. This isn't necessary as an
> > rcu_read_lock is already held throughout the function. The use of
> > the RCU read lock with css_next_descendant_pre() guarantees that
> > sibling linkage is safe without holding a ref on the passed-in @css.
> >
> > Remove css->refcnt usage during traversal by leveraging RCU.
> >
> > Signed-off-by: Kinsey Ho <kinseyho@google.com>
> > ---
> >  include/linux/memcontrol.h |  2 +-
> >  mm/memcontrol.c            | 18 +-----------------
> >  2 files changed, 2 insertions(+), 18 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 90ecd2dbca06..1aaed2f1f6ae 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -75,7 +75,7 @@ struct lruvec_stats_percpu;
> >  struct lruvec_stats;
> >
> >  struct mem_cgroup_reclaim_iter {
> > -     struct mem_cgroup *position;
> > +     struct mem_cgroup __rcu *position;
>
> I'm not sure about this annotation.
> This pointer could be modified concurrently with RCU read sections with
> the cmpxchg which would assume that's equivalent with
> rcu_assign_pointer(). (Which it might be but it's not idiomatic, so it
> causes some head wrapping.)
> Isn't this situation covered with a regular pointer and
> READ_ONCE()+cmpxchg?

Yes, that's a good point – this situation is covered with a regular
pointer and READ_ONCE() + cmpxchg(). I'll make the change to remove
the __rcu tag and replace rcu_dereference() with READ_ONCE() and send
it out in v3. (This also rids of the sparse errors seen in v1)

Thanks for pointing this out.

Best,
Kinsey


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH mm-unstable v2 4/5] mm: restart if multiple traversals raced
  2024-08-14  9:00   ` Michal Koutný
@ 2024-08-16 16:27     ` Kinsey Ho
  2024-08-20 11:59       ` Michal Koutný
  0 siblings, 1 reply; 12+ messages in thread
From: Kinsey Ho @ 2024-08-16 16:27 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Andrew Morton, linux-mm, linux-kernel, cgroups, Yosry Ahmed,
	Roman Gushchin, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Tejun Heo, Zefan Li

Hi Michal,

> I may be missing (literal) context but I'd suggest not moving the memcg
> assignment and leverage
>         if (memcg != NULL)
>                 css_put(memcg->css)
> so that the is-root comparison needn't be repeated.

I might also be misunderstanding you with respect to the is-root
comparison – the reason the memcg assignment is moved is because it is
possible that on the restart added in this patch, css could be NULL.
In that case, memcg won't be assigned and could be left with a
previous, invalid value. By moving the assignment out, it ensures that
memcg is a valid value.

Best,
Kinsey


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH mm-unstable v2 4/5] mm: restart if multiple traversals raced
  2024-08-16 16:27     ` Kinsey Ho
@ 2024-08-20 11:59       ` Michal Koutný
  0 siblings, 0 replies; 12+ messages in thread
From: Michal Koutný @ 2024-08-20 11:59 UTC (permalink / raw)
  To: Kinsey Ho
  Cc: Andrew Morton, linux-mm, linux-kernel, cgroups, Yosry Ahmed,
	Roman Gushchin, Johannes Weiner, Michal Hocko, Shakeel Butt,
	Muchun Song, Tejun Heo, Zefan Li

[-- Attachment #1: Type: text/plain, Size: 718 bytes --]

On Fri, Aug 16, 2024 at 12:27:27PM GMT, Kinsey Ho <kinseyho@google.com> wrote:
> Hi Michal,
> 
> > I may be missing (literal) context but I'd suggest not moving the memcg
> > assignment and leverage
> >         if (memcg != NULL)
> >                 css_put(memcg->css)
> > so that the is-root comparison needn't be repeated.
> 
> I might also be misunderstanding you with respect to the is-root
> comparison – the reason the memcg assignment is moved is because it is
> possible that on the restart added in this patch, css could be NULL.

I've played with this applied up to 4/5 and I see more changes would be
needed to preserve the function. Please disregard my initial suggestion
':-)

Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-08-20 12:00 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-13 20:47 [PATCH mm-unstable v2 0/5] Improve mem_cgroup_iter() Kinsey Ho
2024-08-13 20:47 ` [PATCH mm-unstable v2 1/5] cgroup: clarify css sibling linkage is protected by cgroup_mutex or RCU Kinsey Ho
2024-08-14  9:00   ` Michal Koutný
2024-08-13 20:47 ` [PATCH mm-unstable v2 2/5] mm: don't hold css->refcnt during traversal Kinsey Ho
2024-08-14  9:00   ` Michal Koutný
2024-08-16 16:15     ` Kinsey Ho
2024-08-13 20:47 ` [PATCH mm-unstable v2 3/5] mm: increment gen # before restarting traversal Kinsey Ho
2024-08-13 20:47 ` [PATCH mm-unstable v2 4/5] mm: restart if multiple traversals raced Kinsey Ho
2024-08-14  9:00   ` Michal Koutný
2024-08-16 16:27     ` Kinsey Ho
2024-08-20 11:59       ` Michal Koutný
2024-08-13 20:47 ` [PATCH mm-unstable v2 5/5] mm: clean up mem_cgroup_iter() Kinsey Ho

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox