linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH mm-unstable v1 0/4] Improve mem_cgroup_iter()
@ 2024-07-24 19:02 Kinsey Ho
  2024-07-24 19:02 ` [PATCH mm-unstable v1 1/4] mm: don't hold css->refcnt during traversal Kinsey Ho
                   ` (4 more replies)
  0 siblings, 5 replies; 15+ messages in thread
From: Kinsey Ho @ 2024-07-24 19:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Yosry Ahmed, Roman Gushchin, Kinsey Ho

Incremental cgroup iteration is being used again [1], but incremental
cgroup iteration was introduced for cgroup v1. It hasn't been fully
maintained for many years. This patchset improves the reliability of
mem_cgroup_iter(), along with improving simplicity and code readability.

[1] https://lore.kernel.org/20240514202641.2821494-1-hannes@cmpxchg.org/

Kinsey Ho (4):
  mm: don't hold css->refcnt during traversal
  mm: increment gen # before restarting traversal
  mm: restart if multiple traversals raced
  mm: clean up mem_cgroup_iter()

 include/linux/memcontrol.h |  6 +--
 mm/memcontrol.c            | 84 +++++++++++++++-----------------------
 2 files changed, 37 insertions(+), 53 deletions(-)

-- 
2.45.2.1089.g2a221341d9-goog



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH mm-unstable v1 1/4] mm: don't hold css->refcnt during traversal
  2024-07-24 19:02 [PATCH mm-unstable v1 0/4] Improve mem_cgroup_iter() Kinsey Ho
@ 2024-07-24 19:02 ` Kinsey Ho
  2024-07-25 16:33   ` kernel test robot
  2024-07-25 20:43   ` Johannes Weiner
  2024-07-24 19:02 ` [PATCH mm-unstable v1 2/4] mm: increment gen # before restarting traversal Kinsey Ho
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 15+ messages in thread
From: Kinsey Ho @ 2024-07-24 19:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Yosry Ahmed, Roman Gushchin, Kinsey Ho

To obtain the pointer to the saved memcg position, mem_cgroup_iter()
currently holds css->refcnt during memcg traversal only to put
css->refcnt at the end of the routine. This isn't necessary as an
rcu_read_lock is already held throughout the function.

Remove css->refcnt usage during traversal by leveraging RCU.

Signed-off-by: Kinsey Ho <kinseyho@google.com>
---
 include/linux/memcontrol.h |  2 +-
 mm/memcontrol.c            | 18 +-----------------
 2 files changed, 2 insertions(+), 18 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7e2eb091049a..4cbab85e2e56 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -75,7 +75,7 @@ struct lruvec_stats_percpu;
 struct lruvec_stats;
 
 struct mem_cgroup_reclaim_iter {
-	struct mem_cgroup *position;
+	struct mem_cgroup __rcu *position;
 	/* scan generation, increased every round-trip */
 	unsigned int generation;
 };
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 960371788687..062bfeee799c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1019,20 +1019,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		else if (reclaim->generation != iter->generation)
 			goto out_unlock;
 
-		while (1) {
-			pos = READ_ONCE(iter->position);
-			if (!pos || css_tryget(&pos->css))
-				break;
-			/*
-			 * css reference reached zero, so iter->position will
-			 * be cleared by ->css_released. However, we should not
-			 * rely on this happening soon, because ->css_released
-			 * is called from a work queue, and by busy-waiting we
-			 * might block it. So we clear iter->position right
-			 * away.
-			 */
-			(void)cmpxchg(&iter->position, pos, NULL);
-		}
+		pos = rcu_dereference(iter->position);
 	} else if (prev) {
 		pos = prev;
 	}
@@ -1073,9 +1060,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		 */
 		(void)cmpxchg(&iter->position, pos, memcg);
 
-		if (pos)
-			css_put(&pos->css);
-
 		if (!memcg)
 			iter->generation++;
 	}
-- 
2.45.2.1089.g2a221341d9-goog



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH mm-unstable v1 2/4] mm: increment gen # before restarting traversal
  2024-07-24 19:02 [PATCH mm-unstable v1 0/4] Improve mem_cgroup_iter() Kinsey Ho
  2024-07-24 19:02 ` [PATCH mm-unstable v1 1/4] mm: don't hold css->refcnt during traversal Kinsey Ho
@ 2024-07-24 19:02 ` Kinsey Ho
  2024-07-25 20:53   ` Johannes Weiner
  2024-07-24 19:02 ` [PATCH mm-unstable v1 3/4] mm: restart if multiple traversals raced Kinsey Ho
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 15+ messages in thread
From: Kinsey Ho @ 2024-07-24 19:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Yosry Ahmed, Roman Gushchin, Kinsey Ho

The generation number in struct mem_cgroup_reclaim_iter should be
incremented on every round-trip. Currently, it is possible for a
concurrent reclaimer to jump in at the end of the hierarchy, causing a
traversal restart (resetting the iteration position) without
incrementing the generation number.

Move the traversal restart such that the generation number is
incremented before the restart.

Signed-off-by: Kinsey Ho <kinseyho@google.com>
---
 mm/memcontrol.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 062bfeee799c..f672bc47c6b5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1003,7 +1003,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		root = root_mem_cgroup;
 
 	rcu_read_lock();
-
+restart:
 	if (reclaim) {
 		struct mem_cgroup_per_node *mz;
 
@@ -1030,14 +1030,6 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	for (;;) {
 		css = css_next_descendant_pre(css, &root->css);
 		if (!css) {
-			/*
-			 * Reclaimers share the hierarchy walk, and a
-			 * new one might jump in right at the end of
-			 * the hierarchy - make sure they see at least
-			 * one group and restart from the beginning.
-			 */
-			if (!prev)
-				continue;
 			break;
 		}
 
@@ -1060,8 +1052,18 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		 */
 		(void)cmpxchg(&iter->position, pos, memcg);
 
-		if (!memcg)
+		if (!memcg) {
 			iter->generation++;
+
+			/*
+			 * Reclaimers share the hierarchy walk, and a
+			 * new one might jump in right at the end of
+			 * the hierarchy - make sure they see at least
+			 * one group and restart from the beginning.
+			 */
+			if (!prev)
+				goto restart;
+		}
 	}
 
 out_unlock:
-- 
2.45.2.1089.g2a221341d9-goog



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH mm-unstable v1 3/4] mm: restart if multiple traversals raced
  2024-07-24 19:02 [PATCH mm-unstable v1 0/4] Improve mem_cgroup_iter() Kinsey Ho
  2024-07-24 19:02 ` [PATCH mm-unstable v1 1/4] mm: don't hold css->refcnt during traversal Kinsey Ho
  2024-07-24 19:02 ` [PATCH mm-unstable v1 2/4] mm: increment gen # before restarting traversal Kinsey Ho
@ 2024-07-24 19:02 ` Kinsey Ho
  2024-07-24 19:02 ` [PATCH mm-unstable v1 4/4] mm: clean up mem_cgroup_iter() Kinsey Ho
  2024-07-25  0:25 ` [PATCH mm-unstable v1 0/4] Improve mem_cgroup_iter() Roman Gushchin
  4 siblings, 0 replies; 15+ messages in thread
From: Kinsey Ho @ 2024-07-24 19:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Yosry Ahmed, Roman Gushchin, Kinsey Ho

Currently, if multiple reclaimers raced on the same position, the
reclaimers which detect the race will still reclaim from the same memcg.
Instead, the reclaimers which detect the race should move on to the next
memcg in the hierarchy.

So, in the case where multiple traversals race, jump back to the start
of the mem_cgroup_iter() function to find the next memcg in the
hierarchy to reclaim from.

Signed-off-by: Kinsey Ho <kinseyho@google.com>
---
 include/linux/memcontrol.h |  4 ++--
 mm/memcontrol.c            | 14 ++++++++++----
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4cbab85e2e56..2b354abe6d48 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -57,7 +57,7 @@ enum memcg_memory_event {
 
 struct mem_cgroup_reclaim_cookie {
 	pg_data_t *pgdat;
-	unsigned int generation;
+	int generation;
 };
 
 #ifdef CONFIG_MEMCG
@@ -77,7 +77,7 @@ struct lruvec_stats;
 struct mem_cgroup_reclaim_iter {
 	struct mem_cgroup __rcu *position;
 	/* scan generation, increased every round-trip */
-	unsigned int generation;
+	atomic_t generation;
 };
 
 /*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f672bc47c6b5..4314a2b8848d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1005,18 +1005,20 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	rcu_read_lock();
 restart:
 	if (reclaim) {
+		int gen;
 		struct mem_cgroup_per_node *mz;
 
 		mz = root->nodeinfo[reclaim->pgdat->node_id];
 		iter = &mz->iter;
+		gen = atomic_read(&iter->generation);
 
 		/*
 		 * On start, join the current reclaim iteration cycle.
 		 * Exit when a concurrent walker completes it.
 		 */
 		if (!prev)
-			reclaim->generation = iter->generation;
-		else if (reclaim->generation != iter->generation)
+			reclaim->generation = gen;
+		else if (reclaim->generation != gen)
 			goto out_unlock;
 
 		pos = rcu_dereference(iter->position);
@@ -1050,10 +1052,14 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 		 * thread, so check that the value hasn't changed since we read
 		 * it to avoid reclaiming from the same cgroup twice.
 		 */
-		(void)cmpxchg(&iter->position, pos, memcg);
+		if (cmpxchg(&iter->position, pos, memcg) != pos) {
+			if (css && css != &root->css)
+				css_put(css);
+			goto restart;
+		}
 
 		if (!memcg) {
-			iter->generation++;
+			atomic_inc(&iter->generation);
 
 			/*
 			 * Reclaimers share the hierarchy walk, and a
-- 
2.45.2.1089.g2a221341d9-goog



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH mm-unstable v1 4/4] mm: clean up mem_cgroup_iter()
  2024-07-24 19:02 [PATCH mm-unstable v1 0/4] Improve mem_cgroup_iter() Kinsey Ho
                   ` (2 preceding siblings ...)
  2024-07-24 19:02 ` [PATCH mm-unstable v1 3/4] mm: restart if multiple traversals raced Kinsey Ho
@ 2024-07-24 19:02 ` Kinsey Ho
  2024-07-25 18:15   ` kernel test robot
  2024-07-25  0:25 ` [PATCH mm-unstable v1 0/4] Improve mem_cgroup_iter() Roman Gushchin
  4 siblings, 1 reply; 15+ messages in thread
From: Kinsey Ho @ 2024-07-24 19:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Yosry Ahmed, Roman Gushchin, Kinsey Ho

A clean up to make variable names more clear and to improve code
readability.

No functional change.

Signed-off-by: Kinsey Ho <kinseyho@google.com>
---
 mm/memcontrol.c | 36 ++++++++++++++----------------------
 1 file changed, 14 insertions(+), 22 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4314a2b8848d..7e3e95c62122 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -992,9 +992,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 				   struct mem_cgroup_reclaim_cookie *reclaim)
 {
 	struct mem_cgroup_reclaim_iter *iter;
-	struct cgroup_subsys_state *css = NULL;
-	struct mem_cgroup *memcg = NULL;
-	struct mem_cgroup *pos = NULL;
+	struct cgroup_subsys_state *css;
+	struct mem_cgroup *pos;
+	struct mem_cgroup *next = NULL;
 
 	if (mem_cgroup_disabled())
 		return NULL;
@@ -1006,10 +1006,9 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 restart:
 	if (reclaim) {
 		int gen;
-		struct mem_cgroup_per_node *mz;
+		int nid = reclaim->pgdat->node_id;
 
-		mz = root->nodeinfo[reclaim->pgdat->node_id];
-		iter = &mz->iter;
+		iter = &root->nodeinfo[nid]->iter;
 		gen = atomic_read(&iter->generation);
 
 		/*
@@ -1022,43 +1021,36 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 			goto out_unlock;
 
 		pos = rcu_dereference(iter->position);
-	} else if (prev) {
+	} else
 		pos = prev;
-	}
 
-	if (pos)
-		css = &pos->css;
-
-	for (;;) {
-		css = css_next_descendant_pre(css, &root->css);
-		if (!css) {
-			break;
-		}
+	css = pos ? &pos->css : NULL;
 
+	while ((css = css_next_descendant_pre(css, &root->css))) {
 		/*
 		 * Verify the css and acquire a reference.  The root
 		 * is provided by the caller, so we know it's alive
 		 * and kicking, and don't take an extra reference.
 		 */
-		if (css == &root->css || css_tryget(css)) {
-			memcg = mem_cgroup_from_css(css);
+		if (css == &root->css || css_tryget(css))
 			break;
-		}
 	}
 
+	next = mem_cgroup_from_css(css);
+
 	if (reclaim) {
 		/*
 		 * The position could have already been updated by a competing
 		 * thread, so check that the value hasn't changed since we read
 		 * it to avoid reclaiming from the same cgroup twice.
 		 */
-		if (cmpxchg(&iter->position, pos, memcg) != pos) {
+		if (cmpxchg(&iter->position, pos, next) != pos) {
 			if (css && css != &root->css)
 				css_put(css);
 			goto restart;
 		}
 
-		if (!memcg) {
+		if (!next) {
 			atomic_inc(&iter->generation);
 
 			/*
@@ -1077,7 +1069,7 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
 	if (prev && prev != root)
 		css_put(&prev->css);
 
-	return memcg;
+	return next;
 }
 
 /**
-- 
2.45.2.1089.g2a221341d9-goog



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH mm-unstable v1 0/4] Improve mem_cgroup_iter()
  2024-07-24 19:02 [PATCH mm-unstable v1 0/4] Improve mem_cgroup_iter() Kinsey Ho
                   ` (3 preceding siblings ...)
  2024-07-24 19:02 ` [PATCH mm-unstable v1 4/4] mm: clean up mem_cgroup_iter() Kinsey Ho
@ 2024-07-25  0:25 ` Roman Gushchin
  4 siblings, 0 replies; 15+ messages in thread
From: Roman Gushchin @ 2024-07-25  0:25 UTC (permalink / raw)
  To: Kinsey Ho
  Cc: Andrew Morton, linux-mm, linux-kernel, Yosry Ahmed,
	Johannes Weiner, Michal Hocko, Shakeel Butt, Muchun Song

cc memcg maintainers

On Wed, Jul 24, 2024 at 07:02:10PM +0000, Kinsey Ho wrote:
> Incremental cgroup iteration is being used again [1], but incremental
> cgroup iteration was introduced for cgroup v1. It hasn't been fully
> maintained for many years.

This is a bold statement :)

> This patchset improves the reliability of
> mem_cgroup_iter(), along with improving simplicity and code readability.
> 
> [1] https://lore.kernel.org/20240514202641.2821494-1-hannes@cmpxchg.org/
> 
> Kinsey Ho (4):
>   mm: don't hold css->refcnt during traversal
>   mm: increment gen # before restarting traversal
>   mm: restart if multiple traversals raced
>   mm: clean up mem_cgroup_iter()

But the series looks great to me!

Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
for the series.

Thanks!


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH mm-unstable v1 1/4] mm: don't hold css->refcnt during traversal
  2024-07-24 19:02 ` [PATCH mm-unstable v1 1/4] mm: don't hold css->refcnt during traversal Kinsey Ho
@ 2024-07-25 16:33   ` kernel test robot
  2024-07-25 20:43   ` Johannes Weiner
  1 sibling, 0 replies; 15+ messages in thread
From: kernel test robot @ 2024-07-25 16:33 UTC (permalink / raw)
  To: Kinsey Ho, Andrew Morton
  Cc: oe-kbuild-all, Linux Memory Management List, linux-kernel,
	Yosry Ahmed, Roman Gushchin, Kinsey Ho

Hi Kinsey,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v6.10 next-20240725]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kinsey-Ho/mm-don-t-hold-css-refcnt-during-traversal/20240725-030750
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20240724190214.1108049-2-kinseyho%40google.com
patch subject: [PATCH mm-unstable v1 1/4] mm: don't hold css->refcnt during traversal
config: x86_64-randconfig-121-20240725 (https://download.01.org/0day-ci/archive/20240726/202407260047.sIuRFLJE-lkp@intel.com/config)
compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240726/202407260047.sIuRFLJE-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407260047.sIuRFLJE-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> mm/memcontrol.c:1063:23: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct mem_cgroup [noderef] __rcu *__old @@     got struct mem_cgroup *[assigned] pos @@
   mm/memcontrol.c:1063:23: sparse:     expected struct mem_cgroup [noderef] __rcu *__old
   mm/memcontrol.c:1063:23: sparse:     got struct mem_cgroup *[assigned] pos
   mm/memcontrol.c:1063:23: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct mem_cgroup [noderef] __rcu *__new @@     got struct mem_cgroup *[assigned] memcg @@
   mm/memcontrol.c:1063:23: sparse:     expected struct mem_cgroup [noderef] __rcu *__new
   mm/memcontrol.c:1063:23: sparse:     got struct mem_cgroup *[assigned] memcg
>> mm/memcontrol.c:1101:17: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct mem_cgroup [noderef] __rcu *__old @@     got struct mem_cgroup *dead_memcg @@
   mm/memcontrol.c:1101:17: sparse:     expected struct mem_cgroup [noderef] __rcu *__old
   mm/memcontrol.c:1101:17: sparse:     got struct mem_cgroup *dead_memcg
   mm/memcontrol.c: note: in included file (through include/linux/cgroup-defs.h):
   include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
   mm/memcontrol.c: note: in included file:
   include/linux/memcontrol.h:747:9: sparse: sparse: context imbalance in 'folio_lruvec_lock' - wrong count at exit
   include/linux/memcontrol.h:747:9: sparse: sparse: context imbalance in 'folio_lruvec_lock_irq' - wrong count at exit
   include/linux/memcontrol.h:747:9: sparse: sparse: context imbalance in 'folio_lruvec_lock_irqsave' - wrong count at exit
   mm/memcontrol.c: note: in included file (through include/linux/cgroup-defs.h):
   include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true

vim +1063 mm/memcontrol.c

4b569387c0d566 Nhat Pham         2023-10-06   974  
5660048ccac873 Johannes Weiner   2012-01-12   975  /**
5660048ccac873 Johannes Weiner   2012-01-12   976   * mem_cgroup_iter - iterate over memory cgroup hierarchy
5660048ccac873 Johannes Weiner   2012-01-12   977   * @root: hierarchy root
5660048ccac873 Johannes Weiner   2012-01-12   978   * @prev: previously returned memcg, NULL on first invocation
5660048ccac873 Johannes Weiner   2012-01-12   979   * @reclaim: cookie for shared reclaim walks, NULL for full walks
5660048ccac873 Johannes Weiner   2012-01-12   980   *
5660048ccac873 Johannes Weiner   2012-01-12   981   * Returns references to children of the hierarchy below @root, or
5660048ccac873 Johannes Weiner   2012-01-12   982   * @root itself, or %NULL after a full round-trip.
5660048ccac873 Johannes Weiner   2012-01-12   983   *
5660048ccac873 Johannes Weiner   2012-01-12   984   * Caller must pass the return value in @prev on subsequent
5660048ccac873 Johannes Weiner   2012-01-12   985   * invocations for reference counting, or use mem_cgroup_iter_break()
5660048ccac873 Johannes Weiner   2012-01-12   986   * to cancel a hierarchy walk before the round-trip is complete.
5660048ccac873 Johannes Weiner   2012-01-12   987   *
05bdc520b3ad39 Miaohe Lin        2020-10-13   988   * Reclaimers can specify a node in @reclaim to divide up the memcgs
05bdc520b3ad39 Miaohe Lin        2020-10-13   989   * in the hierarchy among all concurrent reclaimers operating on the
05bdc520b3ad39 Miaohe Lin        2020-10-13   990   * same node.
5660048ccac873 Johannes Weiner   2012-01-12   991   */
694fbc0fe78518 Andrew Morton     2013-09-24   992  struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
9f3a0d0933de07 Johannes Weiner   2012-01-12   993  				   struct mem_cgroup *prev,
694fbc0fe78518 Andrew Morton     2013-09-24   994  				   struct mem_cgroup_reclaim_cookie *reclaim)
7d74b06f240f1b KAMEZAWA Hiroyuki 2010-10-27   995  {
3f649ab728cda8 Kees Cook         2020-06-03   996  	struct mem_cgroup_reclaim_iter *iter;
5ac8fb31ad2ebd Johannes Weiner   2014-12-10   997  	struct cgroup_subsys_state *css = NULL;
9f3a0d0933de07 Johannes Weiner   2012-01-12   998  	struct mem_cgroup *memcg = NULL;
5ac8fb31ad2ebd Johannes Weiner   2014-12-10   999  	struct mem_cgroup *pos = NULL;
711d3d2c9bc3fb KAMEZAWA Hiroyuki 2010-10-27  1000  
694fbc0fe78518 Andrew Morton     2013-09-24  1001  	if (mem_cgroup_disabled())
694fbc0fe78518 Andrew Morton     2013-09-24  1002  		return NULL;
5660048ccac873 Johannes Weiner   2012-01-12  1003  
9f3a0d0933de07 Johannes Weiner   2012-01-12  1004  	if (!root)
9f3a0d0933de07 Johannes Weiner   2012-01-12  1005  		root = root_mem_cgroup;
9f3a0d0933de07 Johannes Weiner   2012-01-12  1006  
542f85f9ae4acd Michal Hocko      2013-04-29  1007  	rcu_read_lock();
14067bb3e24b96 KAMEZAWA Hiroyuki 2009-04-02  1008  
527a5ec9a53471 Johannes Weiner   2012-01-12  1009  	if (reclaim) {
ef8f2327996b5c Mel Gorman        2016-07-28  1010  		struct mem_cgroup_per_node *mz;
527a5ec9a53471 Johannes Weiner   2012-01-12  1011  
a3747b53b1771a Johannes Weiner   2021-04-29  1012  		mz = root->nodeinfo[reclaim->pgdat->node_id];
9da83f3fc74b80 Yafang Shao       2019-11-30  1013  		iter = &mz->iter;
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1014  
a9320aae68a1cd Wei Yang          2022-04-28  1015  		/*
a9320aae68a1cd Wei Yang          2022-04-28  1016  		 * On start, join the current reclaim iteration cycle.
a9320aae68a1cd Wei Yang          2022-04-28  1017  		 * Exit when a concurrent walker completes it.
a9320aae68a1cd Wei Yang          2022-04-28  1018  		 */
a9320aae68a1cd Wei Yang          2022-04-28  1019  		if (!prev)
a9320aae68a1cd Wei Yang          2022-04-28  1020  			reclaim->generation = iter->generation;
a9320aae68a1cd Wei Yang          2022-04-28  1021  		else if (reclaim->generation != iter->generation)
542f85f9ae4acd Michal Hocko      2013-04-29  1022  			goto out_unlock;
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1023  
fe6faac115aa89 Kinsey Ho         2024-07-24  1024  		pos = rcu_dereference(iter->position);
89d8330ccf2ad4 Wei Yang          2022-04-28  1025  	} else if (prev) {
89d8330ccf2ad4 Wei Yang          2022-04-28  1026  		pos = prev;
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1027  	}
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1028  
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1029  	if (pos)
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1030  		css = &pos->css;
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1031  
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1032  	for (;;) {
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1033  		css = css_next_descendant_pre(css, &root->css);
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1034  		if (!css) {
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1035  			/*
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1036  			 * Reclaimers share the hierarchy walk, and a
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1037  			 * new one might jump in right at the end of
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1038  			 * the hierarchy - make sure they see at least
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1039  			 * one group and restart from the beginning.
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1040  			 */
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1041  			if (!prev)
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1042  				continue;
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1043  			break;
542f85f9ae4acd Michal Hocko      2013-04-29  1044  		}
5f578161971863 Michal Hocko      2013-04-29  1045  
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1046  		/*
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1047  		 * Verify the css and acquire a reference.  The root
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1048  		 * is provided by the caller, so we know it's alive
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1049  		 * and kicking, and don't take an extra reference.
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1050  		 */
41555dadbff8d2 Wei Yang          2022-04-28  1051  		if (css == &root->css || css_tryget(css)) {
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1052  			memcg = mem_cgroup_from_css(css);
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1053  			break;
41555dadbff8d2 Wei Yang          2022-04-28  1054  		}
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1055  	}
9f3a0d0933de07 Johannes Weiner   2012-01-12  1056  
527a5ec9a53471 Johannes Weiner   2012-01-12  1057  	if (reclaim) {
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1058  		/*
6df38689e0e9a0 Vladimir Davydov  2015-12-29  1059  		 * The position could have already been updated by a competing
6df38689e0e9a0 Vladimir Davydov  2015-12-29  1060  		 * thread, so check that the value hasn't changed since we read
6df38689e0e9a0 Vladimir Davydov  2015-12-29  1061  		 * it to avoid reclaiming from the same cgroup twice.
5ac8fb31ad2ebd Johannes Weiner   2014-12-10  1062  		 */
6df38689e0e9a0 Vladimir Davydov  2015-12-29 @1063  		(void)cmpxchg(&iter->position, pos, memcg);
6df38689e0e9a0 Vladimir Davydov  2015-12-29  1064  
19f39402864ea3 Michal Hocko      2013-04-29  1065  		if (!memcg)
527a5ec9a53471 Johannes Weiner   2012-01-12  1066  			iter->generation++;
527a5ec9a53471 Johannes Weiner   2012-01-12  1067  	}
14067bb3e24b96 KAMEZAWA Hiroyuki 2009-04-02  1068  
542f85f9ae4acd Michal Hocko      2013-04-29  1069  out_unlock:
542f85f9ae4acd Michal Hocko      2013-04-29  1070  	rcu_read_unlock();
c40046f3ad5e87 Michal Hocko      2013-04-29  1071  	if (prev && prev != root)
c40046f3ad5e87 Michal Hocko      2013-04-29  1072  		css_put(&prev->css);
c40046f3ad5e87 Michal Hocko      2013-04-29  1073  
9f3a0d0933de07 Johannes Weiner   2012-01-12  1074  	return memcg;
9f3a0d0933de07 Johannes Weiner   2012-01-12  1075  }
14067bb3e24b96 KAMEZAWA Hiroyuki 2009-04-02  1076  
5660048ccac873 Johannes Weiner   2012-01-12  1077  /**
5660048ccac873 Johannes Weiner   2012-01-12  1078   * mem_cgroup_iter_break - abort a hierarchy walk prematurely
5660048ccac873 Johannes Weiner   2012-01-12  1079   * @root: hierarchy root
5660048ccac873 Johannes Weiner   2012-01-12  1080   * @prev: last visited hierarchy member as returned by mem_cgroup_iter()
5660048ccac873 Johannes Weiner   2012-01-12  1081   */
5660048ccac873 Johannes Weiner   2012-01-12  1082  void mem_cgroup_iter_break(struct mem_cgroup *root,
9f3a0d0933de07 Johannes Weiner   2012-01-12  1083  			   struct mem_cgroup *prev)
9f3a0d0933de07 Johannes Weiner   2012-01-12  1084  {
711d3d2c9bc3fb KAMEZAWA Hiroyuki 2010-10-27  1085  	if (!root)
711d3d2c9bc3fb KAMEZAWA Hiroyuki 2010-10-27  1086  		root = root_mem_cgroup;
9f3a0d0933de07 Johannes Weiner   2012-01-12  1087  	if (prev && prev != root)
9f3a0d0933de07 Johannes Weiner   2012-01-12  1088  		css_put(&prev->css);
14067bb3e24b96 KAMEZAWA Hiroyuki 2009-04-02  1089  }
9f3a0d0933de07 Johannes Weiner   2012-01-12  1090  
54a83d6bcbf8f4 Miles Chen        2019-08-13  1091  static void __invalidate_reclaim_iterators(struct mem_cgroup *from,
54a83d6bcbf8f4 Miles Chen        2019-08-13  1092  					struct mem_cgroup *dead_memcg)
6df38689e0e9a0 Vladimir Davydov  2015-12-29  1093  {
6df38689e0e9a0 Vladimir Davydov  2015-12-29  1094  	struct mem_cgroup_reclaim_iter *iter;
ef8f2327996b5c Mel Gorman        2016-07-28  1095  	struct mem_cgroup_per_node *mz;
ef8f2327996b5c Mel Gorman        2016-07-28  1096  	int nid;
6df38689e0e9a0 Vladimir Davydov  2015-12-29  1097  
6df38689e0e9a0 Vladimir Davydov  2015-12-29  1098  	for_each_node(nid) {
a3747b53b1771a Johannes Weiner   2021-04-29  1099  		mz = from->nodeinfo[nid];
9da83f3fc74b80 Yafang Shao       2019-11-30  1100  		iter = &mz->iter;
9da83f3fc74b80 Yafang Shao       2019-11-30 @1101  		cmpxchg(&iter->position, dead_memcg, NULL);
6df38689e0e9a0 Vladimir Davydov  2015-12-29  1102  	}
6df38689e0e9a0 Vladimir Davydov  2015-12-29  1103  }
54a83d6bcbf8f4 Miles Chen        2019-08-13  1104  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH mm-unstable v1 4/4] mm: clean up mem_cgroup_iter()
  2024-07-24 19:02 ` [PATCH mm-unstable v1 4/4] mm: clean up mem_cgroup_iter() Kinsey Ho
@ 2024-07-25 18:15   ` kernel test robot
  0 siblings, 0 replies; 15+ messages in thread
From: kernel test robot @ 2024-07-25 18:15 UTC (permalink / raw)
  To: Kinsey Ho, Andrew Morton
  Cc: oe-kbuild-all, Linux Memory Management List, linux-kernel,
	Yosry Ahmed, Roman Gushchin, Kinsey Ho

Hi Kinsey,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on linus/master v6.10 next-20240725]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kinsey-Ho/mm-don-t-hold-css-refcnt-during-traversal/20240725-030750
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20240724190214.1108049-5-kinseyho%40google.com
patch subject: [PATCH mm-unstable v1 4/4] mm: clean up mem_cgroup_iter()
config: x86_64-randconfig-121-20240725 (https://download.01.org/0day-ci/archive/20240726/202407260248.CBU1JMb1-lkp@intel.com/config)
compiler: gcc-7 (Ubuntu 7.5.0-6ubuntu2) 7.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20240726/202407260248.CBU1JMb1-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202407260248.CBU1JMb1-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   mm/memcontrol.c:1049:21: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct mem_cgroup [noderef] __rcu *__old @@     got struct mem_cgroup *[assigned] pos @@
   mm/memcontrol.c:1049:21: sparse:     expected struct mem_cgroup [noderef] __rcu *__old
   mm/memcontrol.c:1049:21: sparse:     got struct mem_cgroup *[assigned] pos
>> mm/memcontrol.c:1049:21: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct mem_cgroup [noderef] __rcu *__new @@     got struct mem_cgroup *[assigned] next @@
   mm/memcontrol.c:1049:21: sparse:     expected struct mem_cgroup [noderef] __rcu *__new
   mm/memcontrol.c:1049:21: sparse:     got struct mem_cgroup *[assigned] next
   mm/memcontrol.c:1049:57: sparse: sparse: incompatible types in comparison expression (different address spaces):
   mm/memcontrol.c:1049:57: sparse:    struct mem_cgroup [noderef] __rcu *
   mm/memcontrol.c:1049:57: sparse:    struct mem_cgroup *
   mm/memcontrol.c:1101:17: sparse: sparse: incorrect type in initializer (different address spaces) @@     expected struct mem_cgroup [noderef] __rcu *__old @@     got struct mem_cgroup *dead_memcg @@
   mm/memcontrol.c:1101:17: sparse:     expected struct mem_cgroup [noderef] __rcu *__old
   mm/memcontrol.c:1101:17: sparse:     got struct mem_cgroup *dead_memcg
   mm/memcontrol.c: note: in included file (through include/linux/cgroup-defs.h):
   include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true
   mm/memcontrol.c: note: in included file:
   include/linux/memcontrol.h:747:9: sparse: sparse: context imbalance in 'folio_lruvec_lock' - wrong count at exit
   include/linux/memcontrol.h:747:9: sparse: sparse: context imbalance in 'folio_lruvec_lock_irq' - wrong count at exit
   include/linux/memcontrol.h:747:9: sparse: sparse: context imbalance in 'folio_lruvec_lock_irqsave' - wrong count at exit
   mm/memcontrol.c: note: in included file (through include/linux/cgroup-defs.h):
   include/linux/list.h:83:21: sparse: sparse: self-comparison always evaluates to true

vim +1049 mm/memcontrol.c

   974	
   975	/**
   976	 * mem_cgroup_iter - iterate over memory cgroup hierarchy
   977	 * @root: hierarchy root
   978	 * @prev: previously returned memcg, NULL on first invocation
   979	 * @reclaim: cookie for shared reclaim walks, NULL for full walks
   980	 *
   981	 * Returns references to children of the hierarchy below @root, or
   982	 * @root itself, or %NULL after a full round-trip.
   983	 *
   984	 * Caller must pass the return value in @prev on subsequent
   985	 * invocations for reference counting, or use mem_cgroup_iter_break()
   986	 * to cancel a hierarchy walk before the round-trip is complete.
   987	 *
   988	 * Reclaimers can specify a node in @reclaim to divide up the memcgs
   989	 * in the hierarchy among all concurrent reclaimers operating on the
   990	 * same node.
   991	 */
   992	struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *root,
   993					   struct mem_cgroup *prev,
   994					   struct mem_cgroup_reclaim_cookie *reclaim)
   995	{
   996		struct mem_cgroup_reclaim_iter *iter;
   997		struct cgroup_subsys_state *css;
   998		struct mem_cgroup *pos;
   999		struct mem_cgroup *next = NULL;
  1000	
  1001		if (mem_cgroup_disabled())
  1002			return NULL;
  1003	
  1004		if (!root)
  1005			root = root_mem_cgroup;
  1006	
  1007		rcu_read_lock();
  1008	restart:
  1009		if (reclaim) {
  1010			int gen;
  1011			int nid = reclaim->pgdat->node_id;
  1012	
  1013			iter = &root->nodeinfo[nid]->iter;
  1014			gen = atomic_read(&iter->generation);
  1015	
  1016			/*
  1017			 * On start, join the current reclaim iteration cycle.
  1018			 * Exit when a concurrent walker completes it.
  1019			 */
  1020			if (!prev)
  1021				reclaim->generation = gen;
  1022			else if (reclaim->generation != gen)
  1023				goto out_unlock;
  1024	
  1025			pos = rcu_dereference(iter->position);
  1026		} else
  1027			pos = prev;
  1028	
  1029		css = pos ? &pos->css : NULL;
  1030	
  1031		while ((css = css_next_descendant_pre(css, &root->css))) {
  1032			/*
  1033			 * Verify the css and acquire a reference.  The root
  1034			 * is provided by the caller, so we know it's alive
  1035			 * and kicking, and don't take an extra reference.
  1036			 */
  1037			if (css == &root->css || css_tryget(css))
  1038				break;
  1039		}
  1040	
  1041		next = mem_cgroup_from_css(css);
  1042	
  1043		if (reclaim) {
  1044			/*
  1045			 * The position could have already been updated by a competing
  1046			 * thread, so check that the value hasn't changed since we read
  1047			 * it to avoid reclaiming from the same cgroup twice.
  1048			 */
> 1049			if (cmpxchg(&iter->position, pos, next) != pos) {
  1050				if (css && css != &root->css)
  1051					css_put(css);
  1052				goto restart;
  1053			}
  1054	
  1055			if (!next) {
  1056				atomic_inc(&iter->generation);
  1057	
  1058				/*
  1059				 * Reclaimers share the hierarchy walk, and a
  1060				 * new one might jump in right at the end of
  1061				 * the hierarchy - make sure they see at least
  1062				 * one group and restart from the beginning.
  1063				 */
  1064				if (!prev)
  1065					goto restart;
  1066			}
  1067		}
  1068	
  1069	out_unlock:
  1070		rcu_read_unlock();
  1071		if (prev && prev != root)
  1072			css_put(&prev->css);
  1073	
  1074		return next;
  1075	}
  1076	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH mm-unstable v1 1/4] mm: don't hold css->refcnt during traversal
  2024-07-24 19:02 ` [PATCH mm-unstable v1 1/4] mm: don't hold css->refcnt during traversal Kinsey Ho
  2024-07-25 16:33   ` kernel test robot
@ 2024-07-25 20:43   ` Johannes Weiner
  2024-07-25 21:09     ` Roman Gushchin
  2024-07-25 22:33     ` Yosry Ahmed
  1 sibling, 2 replies; 15+ messages in thread
From: Johannes Weiner @ 2024-07-25 20:43 UTC (permalink / raw)
  To: Kinsey Ho
  Cc: Andrew Morton, linux-mm, linux-kernel, Yosry Ahmed, Roman Gushchin

On Wed, Jul 24, 2024 at 07:02:11PM +0000, Kinsey Ho wrote:
> To obtain the pointer to the saved memcg position, mem_cgroup_iter()
> currently holds css->refcnt during memcg traversal only to put
> css->refcnt at the end of the routine. This isn't necessary as an
> rcu_read_lock is already held throughout the function.
> 
> Remove css->refcnt usage during traversal by leveraging RCU.

Eh, I don't know about this.

RCU ensures that the css memory isn't freed.

The tryget ensures that the css is still alive and valid.

In this case, it just so happens that the sibling linkage is also rcu
protected. But accessing random css members when the refcount is 0 is
kind of sketchy. On the other hand, the refcount is guaranteed to be
valid, and rcu + tryget is a common pattern.

What does this buy us? The tryget is cheap.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH mm-unstable v1 2/4] mm: increment gen # before restarting traversal
  2024-07-24 19:02 ` [PATCH mm-unstable v1 2/4] mm: increment gen # before restarting traversal Kinsey Ho
@ 2024-07-25 20:53   ` Johannes Weiner
  0 siblings, 0 replies; 15+ messages in thread
From: Johannes Weiner @ 2024-07-25 20:53 UTC (permalink / raw)
  To: Kinsey Ho
  Cc: Andrew Morton, linux-mm, linux-kernel, Yosry Ahmed, Roman Gushchin

On Wed, Jul 24, 2024 at 07:02:12PM +0000, Kinsey Ho wrote:
> The generation number in struct mem_cgroup_reclaim_iter should be
> incremented on every round-trip. Currently, it is possible for a
> concurrent reclaimer to jump in at the end of the hierarchy, causing a
> traversal restart (resetting the iteration position) without
> incrementing the generation number.
> 
> Move the traversal restart such that the generation number is
> incremented before the restart.
> 
> Signed-off-by: Kinsey Ho <kinseyho@google.com>

The consequence of resetting the position without bumping the
generation would be that another ongoing mem_cgroup_iter() could walk
the tree twice, which is undesirable. It would be good to spell that
out in the changelog.

Otherwise, looks good to me.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH mm-unstable v1 1/4] mm: don't hold css->refcnt during traversal
  2024-07-25 20:43   ` Johannes Weiner
@ 2024-07-25 21:09     ` Roman Gushchin
  2024-07-25 22:33     ` Yosry Ahmed
  1 sibling, 0 replies; 15+ messages in thread
From: Roman Gushchin @ 2024-07-25 21:09 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Kinsey Ho, Andrew Morton, linux-mm, linux-kernel, Yosry Ahmed

On Thu, Jul 25, 2024 at 04:43:46PM -0400, Johannes Weiner wrote:
> On Wed, Jul 24, 2024 at 07:02:11PM +0000, Kinsey Ho wrote:
> > To obtain the pointer to the saved memcg position, mem_cgroup_iter()
> > currently holds css->refcnt during memcg traversal only to put
> > css->refcnt at the end of the routine. This isn't necessary as an
> > rcu_read_lock is already held throughout the function.
> > 
> > Remove css->refcnt usage during traversal by leveraging RCU.
> 
> Eh, I don't know about this.
> 
> RCU ensures that the css memory isn't freed.
> 
> The tryget ensures that the css is still alive and valid.
> 
> In this case, it just so happens that the sibling linkage is also rcu
> protected. But accessing random css members when the refcount is 0 is
> kind of sketchy. On the other hand, the refcount is guaranteed to be
> valid, and rcu + tryget is a common pattern.

I also spent quite some time thinking about potential bad consequences,
but _it seems_ to be safe (but I agree it feels dangerous).

> 
> What does this buy us? The tryget is cheap.

To be fair, tryget is not always cheap. Offline/dying cgroups have an atomic
operation there.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH mm-unstable v1 1/4] mm: don't hold css->refcnt during traversal
  2024-07-25 20:43   ` Johannes Weiner
  2024-07-25 21:09     ` Roman Gushchin
@ 2024-07-25 22:33     ` Yosry Ahmed
  2024-08-01 21:46       ` Kinsey Ho
  2024-08-01 22:32       ` Kinsey Ho
  1 sibling, 2 replies; 15+ messages in thread
From: Yosry Ahmed @ 2024-07-25 22:33 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Kinsey Ho, Andrew Morton, linux-mm, linux-kernel, Roman Gushchin

On Thu, Jul 25, 2024 at 1:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Wed, Jul 24, 2024 at 07:02:11PM +0000, Kinsey Ho wrote:
> > To obtain the pointer to the saved memcg position, mem_cgroup_iter()
> > currently holds css->refcnt during memcg traversal only to put
> > css->refcnt at the end of the routine. This isn't necessary as an
> > rcu_read_lock is already held throughout the function.
> >
> > Remove css->refcnt usage during traversal by leveraging RCU.
>
> Eh, I don't know about this.
>
> RCU ensures that the css memory isn't freed.
>
> The tryget ensures that the css is still alive and valid.
>
> In this case, it just so happens that the sibling linkage is also rcu
> protected. But accessing random css members when the refcount is 0 is
> kind of sketchy. On the other hand, the refcount is guaranteed to be
> valid, and rcu + tryget is a common pattern.

To be fair, the documentation of css_next_descendant_pre() mentions
that the requirements are:
- Either cgroup_mutex or RCU lock is held.
- Both @pos and @root are accessible.
- @pos is a descendant of @root.

This reads to me like it is intentional that RCU protection is enough
for @pos and @root, and that the sibling linkage is RCU protected by
design. Perhaps we could clarify this further (whether at
css_next_descendant_pre(), or above the definition of the linkage
members).

>
> What does this buy us? The tryget is cheap.

mem_cgroup_iter() is not an easy function to follow, so I personally
appreciate the simplicity gains tbh.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH mm-unstable v1 1/4] mm: don't hold css->refcnt during traversal
  2024-07-25 22:33     ` Yosry Ahmed
@ 2024-08-01 21:46       ` Kinsey Ho
  2024-08-01 22:32       ` Kinsey Ho
  1 sibling, 0 replies; 15+ messages in thread
From: Kinsey Ho @ 2024-08-01 21:46 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel, Roman Gushchin

[-- Attachment #1: Type: text/plain, Size: 1093 bytes --]

Thank you Johannes, Roman, and Yosry for reviewing this patch!

On Thu, Jul 25, 2024 at 3:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> On Thu, Jul 25, 2024 at 1:43 PM Johannes Weiner <hannes@cmpxchg.org>
wrote:
> > What does this buy us? The tryget is cheap.
>
> mem_cgroup_iter() is not an easy function to follow, so I personally
> appreciate the simplicity gains tbh.

Yes, the main intention here was to simplify the code’s readability.

> This reads to me like it is intentional that RCU protection is enough
> for @pos and @root, and that the sibling linkage is RCU protected by
> design. Perhaps we could clarify this further (whether at
> css_next_descendant_pre(), or above the definition of the linkage
> members).

Do we want to move forward with Yosry’s suggestion to clarify that the
sibling linkage is RCU-protected by design? Perhaps this clarification can
be made in the definition of the linkage members so that the safety of the
css in this function is more clear to users. If this is sufficient, I will
make the change in a v2 patchset.

[-- Attachment #2: Type: text/html, Size: 1349 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH mm-unstable v1 1/4] mm: don't hold css->refcnt during traversal
  2024-07-25 22:33     ` Yosry Ahmed
  2024-08-01 21:46       ` Kinsey Ho
@ 2024-08-01 22:32       ` Kinsey Ho
  2024-08-02  1:28         ` Johannes Weiner
  1 sibling, 1 reply; 15+ messages in thread
From: Kinsey Ho @ 2024-08-01 22:32 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Johannes Weiner, Andrew Morton, linux-mm, linux-kernel, Roman Gushchin

Sorry, I replied to this email earlier but it had some issues with plain
text. Please ignore the first reply of mine (the one with HTML). I'm resending
the email below.

Thank you Johannes, Roman, and Yosry for reviewing this patch!

On Thu, Jul 25, 2024 at 3:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> On Thu, Jul 25, 2024 at 1:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > What does this buy us? The tryget is cheap.
>
> mem_cgroup_iter() is not an easy function to follow, so I personally
> appreciate the simplicity gains tbh.

Yes, the main intention here was to simplify the code's readability.

> This reads to me like it is intentional that RCU protection is enough
> for @pos and @root, and that the sibling linkage is RCU protected by
> design. Perhaps we could clarify this further (whether at
> css_next_descendant_pre(), or above the definition of the linkage
> members).

Do we want to move forward with Yosry's suggestion to clarify that the
sibling linkage is RCU-protected by design? Perhaps this clarification
can be made in the definition of the linkage members so that the
safety of the css in this function is more clear to users. If this is
sufficient, I will make the change in a v2 patchset.


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH mm-unstable v1 1/4] mm: don't hold css->refcnt during traversal
  2024-08-01 22:32       ` Kinsey Ho
@ 2024-08-02  1:28         ` Johannes Weiner
  0 siblings, 0 replies; 15+ messages in thread
From: Johannes Weiner @ 2024-08-02  1:28 UTC (permalink / raw)
  To: Kinsey Ho
  Cc: Yosry Ahmed, Andrew Morton, linux-mm, linux-kernel, Roman Gushchin

On Thu, Aug 01, 2024 at 03:32:53PM -0700, Kinsey Ho wrote:
> Sorry, I replied to this email earlier but it had some issues with plain
> text. Please ignore the first reply of mine (the one with HTML). I'm resending
> the email below.
> 
> Thank you Johannes, Roman, and Yosry for reviewing this patch!
> 
> On Thu, Jul 25, 2024 at 3:34 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > On Thu, Jul 25, 2024 at 1:43 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
> > > What does this buy us? The tryget is cheap.
> >
> > mem_cgroup_iter() is not an easy function to follow, so I personally
> > appreciate the simplicity gains tbh.
> 
> Yes, the main intention here was to simplify the code's readability.
>
> > This reads to me like it is intentional that RCU protection is enough
> > for @pos and @root, and that the sibling linkage is RCU protected by
> > design. Perhaps we could clarify this further (whether at
> > css_next_descendant_pre(), or above the definition of the linkage
> > members).
> 
> Do we want to move forward with Yosry's suggestion to clarify that the
> sibling linkage is RCU-protected by design? Perhaps this clarification
> can be made in the definition of the linkage members so that the
> safety of the css in this function is more clear to users. If this is
> sufficient, I will make the change in a v2 patchset.

Yes, that sounds like a good way forward to me.



^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2024-08-02  1:28 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-24 19:02 [PATCH mm-unstable v1 0/4] Improve mem_cgroup_iter() Kinsey Ho
2024-07-24 19:02 ` [PATCH mm-unstable v1 1/4] mm: don't hold css->refcnt during traversal Kinsey Ho
2024-07-25 16:33   ` kernel test robot
2024-07-25 20:43   ` Johannes Weiner
2024-07-25 21:09     ` Roman Gushchin
2024-07-25 22:33     ` Yosry Ahmed
2024-08-01 21:46       ` Kinsey Ho
2024-08-01 22:32       ` Kinsey Ho
2024-08-02  1:28         ` Johannes Weiner
2024-07-24 19:02 ` [PATCH mm-unstable v1 2/4] mm: increment gen # before restarting traversal Kinsey Ho
2024-07-25 20:53   ` Johannes Weiner
2024-07-24 19:02 ` [PATCH mm-unstable v1 3/4] mm: restart if multiple traversals raced Kinsey Ho
2024-07-24 19:02 ` [PATCH mm-unstable v1 4/4] mm: clean up mem_cgroup_iter() Kinsey Ho
2024-07-25 18:15   ` kernel test robot
2024-07-25  0:25 ` [PATCH mm-unstable v1 0/4] Improve mem_cgroup_iter() Roman Gushchin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox