linux-mm.kvack.org archive mirror
* [PATCH mm-unstable RFC 0/5] cgroup: eliminate atomic rstat
@ 2023-04-03 22:03 Yosry Ahmed
  2023-04-03 22:03 ` [PATCH mm-unstable RFC 1/5] writeback: move wb_over_bg_thresh() call outside lock section Yosry Ahmed
                   ` (7 more replies)
  0 siblings, 8 replies; 25+ messages in thread
From: Yosry Ahmed @ 2023-04-03 22:03 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton
  Cc: linux-fsdevel, linux-kernel, cgroups, linux-mm, Yosry Ahmed

A previous patch series ([1] currently in mm-unstable) changed most
atomic rstat flushing contexts to become non-atomic. This was done to
prevent an expensive operation that scales with # cgroups and # cpus
from running with irqs disabled and scheduling not permitted. There were two
remaining atomic flushing contexts after that series. This series tries
to eliminate them as well, eliminating atomic rstat flushing completely.

The two remaining atomic flushing contexts are:
(a) wb_over_bg_thresh()->mem_cgroup_wb_stats()
(b) mem_cgroup_threshold()->mem_cgroup_usage()

For (a), flushing needs to be atomic as wb_writeback() calls
wb_over_bg_thresh() with a spinlock held. However, it seems like the
call to wb_over_bg_thresh() doesn't need to be protected by that
spinlock, so this series proposes a refactoring that moves the call
outside the lock's critical section and makes the stats flushing
in mem_cgroup_wb_stats() non-atomic.

For (b), flushing needs to be atomic as mem_cgroup_threshold() is called
with irqs disabled. We only flush the stats when calculating the root
usage, as it is approximated as the sum of some memcg stats (file, anon,
and optionally swap) instead of the conventional page counter. This
series proposes changing this calculation to use the global stats
instead, eliminating the need for a memcg stat flush.
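
Concretely, the root usage calculation would become something like the
following (a simplified sketch of the change proposed in patch 3; the real
code stays inside mem_cgroup_usage(), and root_usage_approx() is only an
illustrative name for this snippet):

/* Approximate the root memcg's usage from global VM counters, so no
 * rstat flush (and no irq-disabled flushing) is needed. */
static unsigned long root_usage_approx(bool swap)
{
	unsigned long val = global_node_page_state(NR_FILE_PAGES) +
			    global_node_page_state(NR_ANON_MAPPED);

	if (swap)	/* used swap slots as the swap usage estimate */
		val += total_swap_pages - get_nr_swap_pages();
	return val;
}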

After these 2 contexts are eliminated, we no longer need
mem_cgroup_flush_stats_atomic() or cgroup_rstat_flush_atomic(). We can
remove them and simplify the code.
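
For reference, the two flush entry points differ only in whether the
flusher may sleep and yield the flushing lock. Roughly (simplified from
kernel/cgroup/rstat.c as it stands before this series; the exact code being
removed is visible in the patch 5 diff):

/* Sleepable flush: may temporarily drop cgroup_rstat_lock and resched. */
void cgroup_rstat_flush(struct cgroup *cgrp)
{
	might_sleep();
	spin_lock_irq(&cgroup_rstat_lock);
	cgroup_rstat_flush_locked(cgrp, true);	/* may yield the lock */
	spin_unlock_irq(&cgroup_rstat_lock);
}

/* Atomic flush: irqs stay disabled for the entire flush. */
void cgroup_rstat_flush_atomic(struct cgroup *cgrp)
{
	unsigned long flags;

	spin_lock_irqsave(&cgroup_rstat_lock, flags);
	cgroup_rstat_flush_locked(cgrp, false);	/* no yielding */
	spin_unlock_irqrestore(&cgroup_rstat_lock, flags);
}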

Yosry Ahmed (5):
  writeback: move wb_over_bg_thresh() call outside lock section
  memcg: flush stats non-atomically in mem_cgroup_wb_stats()
  memcg: calculate root usage from global state
  memcg: remove mem_cgroup_flush_stats_atomic()
  cgroup: remove cgroup_rstat_flush_atomic()

 fs/fs-writeback.c          | 16 +++++++----
 include/linux/cgroup.h     |  1 -
 include/linux/memcontrol.h |  5 ----
 kernel/cgroup/rstat.c      | 26 ++++--------------
 mm/memcontrol.c            | 54 ++++++++------------------------------
 5 files changed, 27 insertions(+), 75 deletions(-)

-- 
2.40.0.348.gf938b09366-goog




* [PATCH mm-unstable RFC 1/5] writeback: move wb_over_bg_thresh() call outside lock section
  2023-04-03 22:03 [PATCH mm-unstable RFC 0/5] cgroup: eliminate atomic rstat Yosry Ahmed
@ 2023-04-03 22:03 ` Yosry Ahmed
  2023-04-19 11:38   ` Michal Koutný
                     ` (2 more replies)
  2023-04-03 22:03 ` [PATCH mm-unstable RFC 2/5] memcg: flush stats non-atomically in mem_cgroup_wb_stats() Yosry Ahmed
                   ` (6 subsequent siblings)
  7 siblings, 3 replies; 25+ messages in thread
From: Yosry Ahmed @ 2023-04-03 22:03 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton
  Cc: linux-fsdevel, linux-kernel, cgroups, linux-mm, Yosry Ahmed

wb_over_bg_thresh() calls mem_cgroup_wb_stats() which invokes an rstat
flush, which can be expensive on large systems. Currently,
wb_writeback() calls wb_over_bg_thresh() within a lock section, so we
have to make the rstat flush atomic. On systems with a lot of
cpus/cgroups, this can cause us to disable irqs for a long time,
potentially causing problems.

Move the call to wb_over_bg_thresh() outside the lock section in
preparation to make the rstat flush in mem_cgroup_wb_stats() non-atomic.
The list_empty(&wb->work_list) should be okay outside the lock section
of wb->list_lock as it is protected by a separate lock (wb->work_lock),
and wb_over_bg_thresh() doesn't seem like it is modifying any of the b_*
lists the wb->list_lock is protecting. Also, the loop seems to be
already releasing and reacquiring the lock, so this refactoring looks
safe.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 fs/fs-writeback.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 195dc23e0d831..012357bc8daa3 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -2021,7 +2021,6 @@ static long wb_writeback(struct bdi_writeback *wb,
 	struct blk_plug plug;
 
 	blk_start_plug(&plug);
-	spin_lock(&wb->list_lock);
 	for (;;) {
 		/*
 		 * Stop writeback when nr_pages has been consumed
@@ -2046,6 +2045,9 @@ static long wb_writeback(struct bdi_writeback *wb,
 		if (work->for_background && !wb_over_bg_thresh(wb))
 			break;
 
+
+		spin_lock(&wb->list_lock);
+
 		/*
 		 * Kupdate and background works are special and we want to
 		 * include all inodes that need writing. Livelock avoidance is
@@ -2075,13 +2077,19 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 * mean the overall work is done. So we keep looping as long
 		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (progress)
+		if (progress) {
+			spin_unlock(&wb->list_lock);
 			continue;
+		}
+
 		/*
 		 * No more inodes for IO, bail
 		 */
-		if (list_empty(&wb->b_more_io))
+		if (list_empty(&wb->b_more_io)) {
+			spin_unlock(&wb->list_lock);
 			break;
+		}
+
 		/*
 		 * Nothing written. Wait for some inode to
 		 * become available for writeback. Otherwise
@@ -2093,9 +2101,7 @@ static long wb_writeback(struct bdi_writeback *wb,
 		spin_unlock(&wb->list_lock);
 		/* This function drops i_lock... */
 		inode_sleep_on_writeback(inode);
-		spin_lock(&wb->list_lock);
 	}
-	spin_unlock(&wb->list_lock);
 	blk_finish_plug(&plug);
 
 	return nr_pages - work->nr_pages;
-- 
2.40.0.348.gf938b09366-goog




* [PATCH mm-unstable RFC 2/5] memcg: flush stats non-atomically in mem_cgroup_wb_stats()
  2023-04-03 22:03 [PATCH mm-unstable RFC 0/5] cgroup: eliminate atomic rstat Yosry Ahmed
  2023-04-03 22:03 ` [PATCH mm-unstable RFC 1/5] writeback: move wb_over_bg_thresh() call outside lock section Yosry Ahmed
@ 2023-04-03 22:03 ` Yosry Ahmed
  2023-04-19 11:44   ` Michal Koutný
  2023-04-20 18:55   ` Shakeel Butt
  2023-04-03 22:03 ` [PATCH mm-unstable RFC 3/5] memcg: calculate root usage from global state Yosry Ahmed
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 25+ messages in thread
From: Yosry Ahmed @ 2023-04-03 22:03 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton
  Cc: linux-fsdevel, linux-kernel, cgroups, linux-mm, Yosry Ahmed

The previous patch moved the wb_over_bg_thresh()->mem_cgroup_wb_stats()
code path in wb_writeback() outside the lock section. We no longer need
to flush the stats atomically. Flush the stats non-atomically.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/memcontrol.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3d040a5fa7a35..bdd52fe9e7e4b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4637,11 +4637,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	/*
-	 * wb_writeback() takes a spinlock and calls
-	 * wb_over_bg_thresh()->mem_cgroup_wb_stats(). Do not sleep.
-	 */
-	mem_cgroup_flush_stats_atomic();
+	mem_cgroup_flush_stats();
 
 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
-- 
2.40.0.348.gf938b09366-goog




* [PATCH mm-unstable RFC 3/5] memcg: calculate root usage from global state
  2023-04-03 22:03 [PATCH mm-unstable RFC 0/5] cgroup: eliminate atomic rstat Yosry Ahmed
  2023-04-03 22:03 ` [PATCH mm-unstable RFC 1/5] writeback: move wb_over_bg_thresh() call outside lock section Yosry Ahmed
  2023-04-03 22:03 ` [PATCH mm-unstable RFC 2/5] memcg: flush stats non-atomically in mem_cgroup_wb_stats() Yosry Ahmed
@ 2023-04-03 22:03 ` Yosry Ahmed
  2023-04-11 12:53   ` Michal Koutný
  2023-04-20 18:57   ` Shakeel Butt
  2023-04-03 22:03 ` [PATCH mm-unstable RFC 4/5] memcg: remove mem_cgroup_flush_stats_atomic() Yosry Ahmed
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 25+ messages in thread
From: Yosry Ahmed @ 2023-04-03 22:03 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton
  Cc: linux-fsdevel, linux-kernel, cgroups, linux-mm, Yosry Ahmed

Currently, we approximate the root usage by adding the memcg stats for
anon, file, and conditionally swap (for memsw). To read the memcg stats
we need to invoke an rstat flush. rstat flushes can be expensive, as
they scale with the number of cpus and cgroups on the system.

mem_cgroup_usage() is called by memcg_check_events()->mem_cgroup_threshold()
with irqs disabled, so performing such an expensive operation in that
context can cause problems.

Instead, approximate the root usage from global state. This is not 100%
accurate, but the root usage has always been ill-defined anyway.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/memcontrol.c | 24 +++++-------------------
 1 file changed, 5 insertions(+), 19 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index bdd52fe9e7e4b..e7fe18c0c0ef2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3698,27 +3698,13 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
 
 	if (mem_cgroup_is_root(memcg)) {
 		/*
-		 * We can reach here from irq context through:
-		 * uncharge_batch()
-		 * |--memcg_check_events()
-		 *    |--mem_cgroup_threshold()
-		 *       |--__mem_cgroup_threshold()
-		 *          |--mem_cgroup_usage
-		 *
-		 * rstat flushing is an expensive operation that should not be
-		 * done from irq context; use stale stats in this case.
-		 * Arguably, usage threshold events are not reliable on the root
-		 * memcg anyway since its usage is ill-defined.
-		 *
-		 * Additionally, other call paths through memcg_check_events()
-		 * disable irqs, so make sure we are flushing stats atomically.
+		 * Approximate root's usage from global state. This isn't
+		 * perfect, but the root usage was always an approximation.
 		 */
-		if (in_task())
-			mem_cgroup_flush_stats_atomic();
-		val = memcg_page_state(memcg, NR_FILE_PAGES) +
-			memcg_page_state(memcg, NR_ANON_MAPPED);
+		val = global_node_page_state(NR_FILE_PAGES) +
+			global_node_page_state(NR_ANON_MAPPED);
 		if (swap)
-			val += memcg_page_state(memcg, MEMCG_SWAP);
+			val += total_swap_pages - get_nr_swap_pages();
 	} else {
 		if (!swap)
 			val = page_counter_read(&memcg->memory);
-- 
2.40.0.348.gf938b09366-goog




* [PATCH mm-unstable RFC 4/5] memcg: remove mem_cgroup_flush_stats_atomic()
  2023-04-03 22:03 [PATCH mm-unstable RFC 0/5] cgroup: eliminate atomic rstat Yosry Ahmed
                   ` (2 preceding siblings ...)
  2023-04-03 22:03 ` [PATCH mm-unstable RFC 3/5] memcg: calculate root usage from global state Yosry Ahmed
@ 2023-04-03 22:03 ` Yosry Ahmed
  2023-04-20 19:38   ` Shakeel Butt
  2023-04-03 22:03 ` [PATCH mm-unstable RFC 5/5] cgroup: remove cgroup_rstat_flush_atomic() Yosry Ahmed
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Yosry Ahmed @ 2023-04-03 22:03 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton
  Cc: linux-fsdevel, linux-kernel, cgroups, linux-mm, Yosry Ahmed

Previous patches removed all callers of mem_cgroup_flush_stats_atomic().
Remove the function and simplify the code.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/memcontrol.h |  5 -----
 mm/memcontrol.c            | 24 +++++-------------------
 2 files changed, 5 insertions(+), 24 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 222d7370134c7..00a88cf947e14 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1038,7 +1038,6 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 }
 
 void mem_cgroup_flush_stats(void);
-void mem_cgroup_flush_stats_atomic(void);
 void mem_cgroup_flush_stats_ratelimited(void);
 
 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
@@ -1537,10 +1536,6 @@ static inline void mem_cgroup_flush_stats(void)
 {
 }
 
-static inline void mem_cgroup_flush_stats_atomic(void)
-{
-}
-
 static inline void mem_cgroup_flush_stats_ratelimited(void)
 {
 }
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e7fe18c0c0ef2..33339106f1d9b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -638,7 +638,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	}
 }
 
-static void do_flush_stats(bool atomic)
+static void do_flush_stats(void)
 {
 	/*
 	 * We always flush the entire tree, so concurrent flushers can just
@@ -651,30 +651,16 @@ static void do_flush_stats(bool atomic)
 
 	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
 
-	if (atomic)
-		cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
-	else
-		cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
+	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
 
 	atomic_set(&stats_flush_threshold, 0);
 	atomic_set(&stats_flush_ongoing, 0);
 }
 
-static bool should_flush_stats(void)
-{
-	return atomic_read(&stats_flush_threshold) > num_online_cpus();
-}
-
 void mem_cgroup_flush_stats(void)
 {
-	if (should_flush_stats())
-		do_flush_stats(false);
-}
-
-void mem_cgroup_flush_stats_atomic(void)
-{
-	if (should_flush_stats())
-		do_flush_stats(true);
+	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
+		do_flush_stats();
 }
 
 void mem_cgroup_flush_stats_ratelimited(void)
@@ -689,7 +675,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
 	 * Always flush here so that flushing in latency-sensitive paths is
 	 * as cheap as possible.
 	 */
-	do_flush_stats(false);
+	do_flush_stats();
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }
 
-- 
2.40.0.348.gf938b09366-goog




* [PATCH mm-unstable RFC 5/5] cgroup: remove cgroup_rstat_flush_atomic()
  2023-04-03 22:03 [PATCH mm-unstable RFC 0/5] cgroup: eliminate atomic rstat Yosry Ahmed
                   ` (3 preceding siblings ...)
  2023-04-03 22:03 ` [PATCH mm-unstable RFC 4/5] memcg: remove mem_cgroup_flush_stats_atomic() Yosry Ahmed
@ 2023-04-03 22:03 ` Yosry Ahmed
  2023-04-20 19:40   ` Shakeel Butt
  2023-04-03 22:04 ` [PATCH mm-unstable RFC 0/5] cgroup: eliminate atomic rstat Yosry Ahmed
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 25+ messages in thread
From: Yosry Ahmed @ 2023-04-03 22:03 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton
  Cc: linux-fsdevel, linux-kernel, cgroups, linux-mm, Yosry Ahmed

Previous patches removed the only caller of cgroup_rstat_flush_atomic().
Remove the function and simplify the code.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/cgroup.h |  1 -
 kernel/cgroup/rstat.c  | 26 +++++---------------------
 2 files changed, 5 insertions(+), 22 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 885f5395fcd04..567c547cf371f 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -692,7 +692,6 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
  */
 void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
 void cgroup_rstat_flush(struct cgroup *cgrp);
-void cgroup_rstat_flush_atomic(struct cgroup *cgrp);
 void cgroup_rstat_flush_hold(struct cgroup *cgrp);
 void cgroup_rstat_flush_release(void);
 
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index d3252b0416b69..f9ad33f117c82 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -171,7 +171,7 @@ __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
 __diag_pop();
 
 /* see cgroup_rstat_flush() */
-static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
+static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
 	__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
 {
 	int cpu;
@@ -207,9 +207,8 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
 		}
 		raw_spin_unlock_irqrestore(cpu_lock, flags);
 
-		/* if @may_sleep, play nice and yield if necessary */
-		if (may_sleep && (need_resched() ||
-				  spin_needbreak(&cgroup_rstat_lock))) {
+		/* play nice and yield if necessary */
+		if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
 			spin_unlock_irq(&cgroup_rstat_lock);
 			if (!cond_resched())
 				cpu_relax();
@@ -236,25 +235,10 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
 	might_sleep();
 
 	spin_lock_irq(&cgroup_rstat_lock);
-	cgroup_rstat_flush_locked(cgrp, true);
+	cgroup_rstat_flush_locked(cgrp);
 	spin_unlock_irq(&cgroup_rstat_lock);
 }
 
-/**
- * cgroup_rstat_flush_atomic- atomic version of cgroup_rstat_flush()
- * @cgrp: target cgroup
- *
- * This function can be called from any context.
- */
-void cgroup_rstat_flush_atomic(struct cgroup *cgrp)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&cgroup_rstat_lock, flags);
-	cgroup_rstat_flush_locked(cgrp, false);
-	spin_unlock_irqrestore(&cgroup_rstat_lock, flags);
-}
-
 /**
  * cgroup_rstat_flush_hold - flush stats in @cgrp's subtree and hold
  * @cgrp: target cgroup
@@ -269,7 +253,7 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
 {
 	might_sleep();
 	spin_lock_irq(&cgroup_rstat_lock);
-	cgroup_rstat_flush_locked(cgrp, true);
+	cgroup_rstat_flush_locked(cgrp);
 }
 
 /**
-- 
2.40.0.348.gf938b09366-goog




* Re: [PATCH mm-unstable RFC 0/5] cgroup: eliminate atomic rstat
  2023-04-03 22:03 [PATCH mm-unstable RFC 0/5] cgroup: eliminate atomic rstat Yosry Ahmed
                   ` (4 preceding siblings ...)
  2023-04-03 22:03 ` [PATCH mm-unstable RFC 5/5] cgroup: remove cgroup_rstat_flush_atomic() Yosry Ahmed
@ 2023-04-03 22:04 ` Yosry Ahmed
  2023-04-06 18:26   ` Tim Chen
  2023-04-06 18:23 ` Tim Chen
  2023-04-17 11:54 ` Yosry Ahmed
  7 siblings, 1 reply; 25+ messages in thread
From: Yosry Ahmed @ 2023-04-03 22:04 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton
  Cc: linux-fsdevel, linux-kernel, cgroups, linux-mm

On Mon, Apr 3, 2023 at 3:03 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> A previous patch series ([1] currently in mm-unstable) changed most

.. and I naturally forgot to link this:
[1] https://lore.kernel.org/linux-mm/20230330191801.1967435-1-yosryahmed@google.com/

> atomic rstat flushing contexts to become non-atomic. This was done to
> prevent an expensive operation that scales with # cgroups and # cpus
> from running with irqs disabled and scheduling not permitted. There were two
> remaining atomic flushing contexts after that series. This series tries
> to eliminate them as well, eliminating atomic rstat flushing completely.
>
> The two remaining atomic flushing contexts are:
> (a) wb_over_bg_thresh()->mem_cgroup_wb_stats()
> (b) mem_cgroup_threshold()->mem_cgroup_usage()
>
> For (a), flushing needs to be atomic as wb_writeback() calls
> wb_over_bg_thresh() with a spinlock held. However, it seems like the
> call to wb_over_bg_thresh() doesn't need to be protected by that
> spinlock, so this series proposes a refactoring that moves the call
> outside the lock's critical section and makes the stats flushing
> in mem_cgroup_wb_stats() non-atomic.
>
> For (b), flushing needs to be atomic as mem_cgroup_threshold() is called
> with irqs disabled. We only flush the stats when calculating the root
> usage, as it is approximated as the sum of some memcg stats (file, anon,
> and optionally swap) instead of the conventional page counter. This
> series proposes changing this calculation to use the global stats
> instead, eliminating the need for a memcg stat flush.
>
> After these 2 contexts are eliminated, we no longer need
> mem_cgroup_flush_stats_atomic() or cgroup_rstat_flush_atomic(). We can
> remove them and simplify the code.
>
> Yosry Ahmed (5):
>   writeback: move wb_over_bg_thresh() call outside lock section
>   memcg: flush stats non-atomically in mem_cgroup_wb_stats()
>   memcg: calculate root usage from global state
>   memcg: remove mem_cgroup_flush_stats_atomic()
>   cgroup: remove cgroup_rstat_flush_atomic()
>
>  fs/fs-writeback.c          | 16 +++++++----
>  include/linux/cgroup.h     |  1 -
>  include/linux/memcontrol.h |  5 ----
>  kernel/cgroup/rstat.c      | 26 ++++--------------
>  mm/memcontrol.c            | 54 ++++++++------------------------------
>  5 files changed, 27 insertions(+), 75 deletions(-)
>
> --
> 2.40.0.348.gf938b09366-goog
>



* Re: [PATCH mm-unstable RFC 0/5] cgroup: eliminate atomic rstat
  2023-04-03 22:03 [PATCH mm-unstable RFC 0/5] cgroup: eliminate atomic rstat Yosry Ahmed
                   ` (5 preceding siblings ...)
  2023-04-03 22:04 ` [PATCH mm-unstable RFC 0/5] cgroup: eliminate atomic rstat Yosry Ahmed
@ 2023-04-06 18:23 ` Tim Chen
  2023-04-17 11:54 ` Yosry Ahmed
  7 siblings, 0 replies; 25+ messages in thread
From: Tim Chen @ 2023-04-06 18:23 UTC (permalink / raw)
  To: Yosry Ahmed, Alexander Viro, Christian Brauner, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton
  Cc: linux-fsdevel, linux-kernel, cgroups, linux-mm

On Mon, 2023-04-03 at 22:03 +0000, Yosry Ahmed wrote:
> A previous patch series ([1] currently in mm-unstable) changed most

Can you include the link to [1]?

Thanks.

Tim

> atomic rstat flushing contexts to become non-atomic. This was done to
> prevent an expensive operation that scales with # cgroups and # cpus
> from running with irqs disabled and scheduling not permitted. There were two
> remaining atomic flushing contexts after that series. This series tries
> to eliminate them as well, eliminating atomic rstat flushing completely.
> 
> The two remaining atomic flushing contexts are:
> (a) wb_over_bg_thresh()->mem_cgroup_wb_stats()
> (b) mem_cgroup_threshold()->mem_cgroup_usage()
> 
> For (a), flushing needs to be atomic as wb_writeback() calls
> wb_over_bg_thresh() with a spinlock held. However, it seems like the
> call to wb_over_bg_thresh() doesn't need to be protected by that
> spinlock, so this series proposes a refactoring that moves the call
> outside the lock's critical section and makes the stats flushing
> in mem_cgroup_wb_stats() non-atomic.
> 
> For (b), flushing needs to be atomic as mem_cgroup_threshold() is called
> with irqs disabled. We only flush the stats when calculating the root
> usage, as it is approximated as the sum of some memcg stats (file, anon,
> and optionally swap) instead of the conventional page counter. This
> series proposes changing this calculation to use the global stats
> instead, eliminating the need for a memcg stat flush.
> 
> After these 2 contexts are eliminated, we no longer need
> mem_cgroup_flush_stats_atomic() or cgroup_rstat_flush_atomic(). We can
> remove them and simplify the code.
> 
> Yosry Ahmed (5):
>   writeback: move wb_over_bg_thresh() call outside lock section
>   memcg: flush stats non-atomically in mem_cgroup_wb_stats()
>   memcg: calculate root usage from global state
>   memcg: remove mem_cgroup_flush_stats_atomic()
>   cgroup: remove cgroup_rstat_flush_atomic()
> 
>  fs/fs-writeback.c          | 16 +++++++----
>  include/linux/cgroup.h     |  1 -
>  include/linux/memcontrol.h |  5 ----
>  kernel/cgroup/rstat.c      | 26 ++++--------------
>  mm/memcontrol.c            | 54 ++++++++------------------------------
>  5 files changed, 27 insertions(+), 75 deletions(-)
> 




* Re: [PATCH mm-unstable RFC 0/5] cgroup: eliminate atomic rstat
  2023-04-03 22:04 ` [PATCH mm-unstable RFC 0/5] cgroup: eliminate atomic rstat Yosry Ahmed
@ 2023-04-06 18:26   ` Tim Chen
  0 siblings, 0 replies; 25+ messages in thread
From: Tim Chen @ 2023-04-06 18:26 UTC (permalink / raw)
  To: Yosry Ahmed, Alexander Viro, Christian Brauner, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song,
	Andrew Morton
  Cc: linux-fsdevel, linux-kernel, cgroups, linux-mm

On Mon, 2023-04-03 at 15:04 -0700, Yosry Ahmed wrote:
> On Mon, Apr 3, 2023 at 3:03 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > 
> > A previous patch series ([1] currently in mm-unstable) changed most
> 
> .. and I naturally forgot to link this:
> [1] https://lore.kernel.org/linux-mm/20230330191801.1967435-1-yosryahmed@google.com/

Thanks. Saw this after I sent my request for the link.

Tim



* Re: [PATCH mm-unstable RFC 3/5] memcg: calculate root usage from global state
  2023-04-03 22:03 ` [PATCH mm-unstable RFC 3/5] memcg: calculate root usage from global state Yosry Ahmed
@ 2023-04-11 12:53   ` Michal Koutný
  2023-04-11 16:59     ` Yosry Ahmed
  2023-04-20 18:57   ` Shakeel Butt
  1 sibling, 1 reply; 25+ messages in thread
From: Michal Koutný @ 2023-04-11 12:53 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexander Viro, Christian Brauner, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	linux-fsdevel, linux-kernel, cgroups, linux-mm


On Mon, Apr 03, 2023 at 10:03:35PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> Instead, approximate the root usage from global state. This is not 100%
> accurate, but the root usage has always been ill-defined anyway.

Technically, this approximation should be closer to truth because global
counters aren't subject to flushing "delay".

> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>  mm/memcontrol.c | 24 +++++-------------------
>  1 file changed, 5 insertions(+), 19 deletions(-)

But feel free to add
Reviewed-by: Michal Koutný <mkoutny@suse.com>




* Re: [PATCH mm-unstable RFC 3/5] memcg: calculate root usage from global state
  2023-04-11 12:53   ` Michal Koutný
@ 2023-04-11 16:59     ` Yosry Ahmed
  0 siblings, 0 replies; 25+ messages in thread
From: Yosry Ahmed @ 2023-04-11 16:59 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Alexander Viro, Christian Brauner, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	linux-fsdevel, linux-kernel, cgroups, linux-mm

On Tue, Apr 11, 2023 at 5:53 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Mon, Apr 03, 2023 at 10:03:35PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> > Instead, approximate the root usage from global state. This is not 100%
> > accurate, but the root usage has always been ill-defined anyway.
>
> Technically, this approximation should be closer to truth because global
> counters aren't subject to flushing "delay".

It is a tiny bit different when some pages are in swap, probably
because of swap slot caching and other swap specifics. At least in
cgroup v1, the swap uncharging and freeing of the underlying swap
entry may happen at different times. I think it practically doesn't
really matter though.

>
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
> >  mm/memcontrol.c | 24 +++++-------------------
> >  1 file changed, 5 insertions(+), 19 deletions(-)
>
> But feel free to add
> Reviewed-by: Michal Koutný <mkoutny@suse.com>

Thanks!
>



* Re: [PATCH mm-unstable RFC 0/5] cgroup: eliminate atomic rstat
  2023-04-03 22:03 [PATCH mm-unstable RFC 0/5] cgroup: eliminate atomic rstat Yosry Ahmed
                   ` (6 preceding siblings ...)
  2023-04-06 18:23 ` Tim Chen
@ 2023-04-17 11:54 ` Yosry Ahmed
  7 siblings, 0 replies; 25+ messages in thread
From: Yosry Ahmed @ 2023-04-17 11:54 UTC (permalink / raw)
  To: Alexander Viro, Christian Brauner, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton
  Cc: linux-fsdevel, linux-kernel, cgroups, linux-mm

On Mon, Apr 3, 2023 at 3:03 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> A previous patch series ([1] currently in mm-unstable) changed most
> atomic rstat flushing contexts to become non-atomic. This was done to
> prevent an expensive operation that scales with # cgroups and # cpus
> from running with irqs disabled and scheduling not permitted. There were two
> remaining atomic flushing contexts after that series. This series tries
> to eliminate them as well, eliminating atomic rstat flushing completely.
>
> The two remaining atomic flushing contexts are:
> (a) wb_over_bg_thresh()->mem_cgroup_wb_stats()
> (b) mem_cgroup_threshold()->mem_cgroup_usage()
>
> For (a), flushing needs to be atomic as wb_writeback() calls
> wb_over_bg_thresh() with a spinlock held. However, it seems like the
> call to wb_over_bg_thresh() doesn't need to be protected by that
> spinlock, so this series proposes a refactoring that moves the call
> outside the lock's critical section and makes the stats flushing
> in mem_cgroup_wb_stats() non-atomic.
>
> For (b), flushing needs to be atomic as mem_cgroup_threshold() is called
> with irqs disabled. We only flush the stats when calculating the root
> usage, as it is approximated as the sum of some memcg stats (file, anon,
> and optionally swap) instead of the conventional page counter. This
> series proposes changing this calculation to use the global stats
> instead, eliminating the need for a memcg stat flush.
>
> After these 2 contexts are eliminated, we no longer need
> mem_cgroup_flush_stats_atomic() or cgroup_rstat_flush_atomic(). We can
> remove them and simplify the code.
>
> Yosry Ahmed (5):
>   writeback: move wb_over_bg_thresh() call outside lock section
>   memcg: flush stats non-atomically in mem_cgroup_wb_stats()
>   memcg: calculate root usage from global state
>   memcg: remove mem_cgroup_flush_stats_atomic()
>   cgroup: remove cgroup_rstat_flush_atomic()
>
>  fs/fs-writeback.c          | 16 +++++++----
>  include/linux/cgroup.h     |  1 -
>  include/linux/memcontrol.h |  5 ----
>  kernel/cgroup/rstat.c      | 26 ++++--------------
>  mm/memcontrol.c            | 54 ++++++++------------------------------
>  5 files changed, 27 insertions(+), 75 deletions(-)
>
> --
> 2.40.0.348.gf938b09366-goog
>

Any thoughts on this series, anyone? :)



* Re: [PATCH mm-unstable RFC 1/5] writeback: move wb_over_bg_thresh() call outside lock section
  2023-04-03 22:03 ` [PATCH mm-unstable RFC 1/5] writeback: move wb_over_bg_thresh() call outside lock section Yosry Ahmed
@ 2023-04-19 11:38   ` Michal Koutný
  2023-04-20 20:23     ` Yosry Ahmed
  2023-04-20 18:53   ` Shakeel Butt
  2023-04-21  8:53   ` Jan Kara
  2 siblings, 1 reply; 25+ messages in thread
From: Michal Koutný @ 2023-04-19 11:38 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexander Viro, Christian Brauner, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	linux-fsdevel, linux-kernel, cgroups, linux-mm


On Mon, Apr 03, 2023 at 10:03:33PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> wb_over_bg_thresh() calls mem_cgroup_wb_stats() which invokes an rstat
> flush, which can be expensive on large systems. Currently,
> wb_writeback() calls wb_over_bg_thresh() within a lock section, so we
> have to make the rstat flush atomically. On systems with a lot of
> cpus/cgroups, this can cause us to disable irqs for a long time,
> potentially causing problems.
> 
> Move the call to wb_over_bg_thresh() outside the lock section in
> preparation to make the rstat flush in mem_cgroup_wb_stats() non-atomic.
> The list_empty(&wb->work_list) should be okay outside the lock section
> of wb->list_lock as it is protected by a separate lock (wb->work_lock),
> and wb_over_bg_thresh() doesn't seem like it is modifying any of the b_*
> lists the wb->list_lock is protecting. Also, the loop seems to be
> already releasing and reacquiring the lock, so this refactoring looks
> safe.
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>  fs/fs-writeback.c | 16 +++++++++++-----
>  1 file changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 195dc23e0d831..012357bc8daa3 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -2021,7 +2021,6 @@ static long wb_writeback(struct bdi_writeback *wb,
>  	struct blk_plug plug;
>  
>  	blk_start_plug(&plug);
> -	spin_lock(&wb->list_lock);
>  	for (;;) {
>  		/*
>  		 * Stop writeback when nr_pages has been consumed
> @@ -2046,6 +2045,9 @@ static long wb_writeback(struct bdi_writeback *wb,
>  		if (work->for_background && !wb_over_bg_thresh(wb))
>  			break;
>  
> +
> +		spin_lock(&wb->list_lock);
> +
>  		/*
>  		 * Kupdate and background works are special and we want to
>  		 * include all inodes that need writing. Livelock avoidance is
> @@ -2075,13 +2077,19 @@ static long wb_writeback(struct bdi_writeback *wb,
>  		 * mean the overall work is done. So we keep looping as long
>  		 * as made some progress on cleaning pages or inodes.
>  		 */
> -		if (progress)
> +		if (progress) {
> +			spin_unlock(&wb->list_lock);
>  			continue;
> +		}
> +

This would release wb->list_lock temporarily when there is progress, but
the lock is already not held continuously due to writeback_sb_inodes().
The lock hold time could even be shortened further by taking the lock
later, after trace_writeback_start().

Altogether, the change looks OK,
Reviewed-by: Michal Koutný <mkoutny@suse.com>




* Re: [PATCH mm-unstable RFC 2/5] memcg: flush stats non-atomically in mem_cgroup_wb_stats()
  2023-04-03 22:03 ` [PATCH mm-unstable RFC 2/5] memcg: flush stats non-atomically in mem_cgroup_wb_stats() Yosry Ahmed
@ 2023-04-19 11:44   ` Michal Koutný
  2023-04-20 18:55   ` Shakeel Butt
  1 sibling, 0 replies; 25+ messages in thread
From: Michal Koutný @ 2023-04-19 11:44 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexander Viro, Christian Brauner, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	linux-fsdevel, linux-kernel, cgroups, linux-mm


On Mon, Apr 03, 2023 at 10:03:34PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
>  mm/memcontrol.c | 6 +-----
>  1 file changed, 1 insertion(+), 5 deletions(-)

Reviewed-by: Michal Koutný <mkoutny@suse.com>



* Re: [PATCH mm-unstable RFC 1/5] writeback: move wb_over_bg_thresh() call outside lock section
  2023-04-03 22:03 ` [PATCH mm-unstable RFC 1/5] writeback: move wb_over_bg_thresh() call outside lock section Yosry Ahmed
  2023-04-19 11:38   ` Michal Koutný
@ 2023-04-20 18:53   ` Shakeel Butt
  2023-04-20 20:22     ` Yosry Ahmed
  2023-04-21  8:53   ` Jan Kara
  2 siblings, 1 reply; 25+ messages in thread
From: Shakeel Butt @ 2023-04-20 18:53 UTC (permalink / raw)
  To: Yosry Ahmed, Jan Kara, Jens Axboe
  Cc: Alexander Viro, Christian Brauner, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Andrew Morton, linux-fsdevel,
	linux-kernel, cgroups, linux-mm

+Jens & Jan

The patch looks good but it would be nice to pass this patch through
the eyes of experts of this area.

On Mon, Apr 3, 2023 at 3:03 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> wb_over_bg_thresh() calls mem_cgroup_wb_stats() which invokes an rstat
> flush, which can be expensive on large systems. Currently,
> wb_writeback() calls wb_over_bg_thresh() within a lock section, so we
> have to make the rstat flush atomic. On systems with a lot of
> cpus/cgroups, this can cause us to disable irqs for a long time,
> potentially causing problems.
>
> Move the call to wb_over_bg_thresh() outside the lock section in
> preparation to make the rstat flush in mem_cgroup_wb_stats() non-atomic.
> The list_empty(&wb->work_list) should be okay outside the lock section
> of wb->list_lock as it is protected by a separate lock (wb->work_lock),
> and wb_over_bg_thresh() doesn't seem like it is modifying any of the b_*
> lists the wb->list_lock is protecting. Also, the loop seems to be
> already releasing and reacquiring the lock, so this refactoring looks
> safe.
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>  fs/fs-writeback.c | 16 +++++++++++-----
>  1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 195dc23e0d831..012357bc8daa3 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -2021,7 +2021,6 @@ static long wb_writeback(struct bdi_writeback *wb,
>         struct blk_plug plug;
>
>         blk_start_plug(&plug);
> -       spin_lock(&wb->list_lock);
>         for (;;) {
>                 /*
>                  * Stop writeback when nr_pages has been consumed
> @@ -2046,6 +2045,9 @@ static long wb_writeback(struct bdi_writeback *wb,
>                 if (work->for_background && !wb_over_bg_thresh(wb))
>                         break;
>
> +
> +               spin_lock(&wb->list_lock);
> +
>                 /*
>                  * Kupdate and background works are special and we want to
>                  * include all inodes that need writing. Livelock avoidance is
> @@ -2075,13 +2077,19 @@ static long wb_writeback(struct bdi_writeback *wb,
>                  * mean the overall work is done. So we keep looping as long
>                  * as made some progress on cleaning pages or inodes.
>                  */
> -               if (progress)
> +               if (progress) {
> +                       spin_unlock(&wb->list_lock);
>                         continue;
> +               }
> +
>                 /*
>                  * No more inodes for IO, bail
>                  */
> -               if (list_empty(&wb->b_more_io))
> +               if (list_empty(&wb->b_more_io)) {
> +                       spin_unlock(&wb->list_lock);
>                         break;
> +               }
> +
>                 /*
>                  * Nothing written. Wait for some inode to
>                  * become available for writeback. Otherwise
> @@ -2093,9 +2101,7 @@ static long wb_writeback(struct bdi_writeback *wb,
>                 spin_unlock(&wb->list_lock);
>                 /* This function drops i_lock... */
>                 inode_sleep_on_writeback(inode);
> -               spin_lock(&wb->list_lock);
>         }
> -       spin_unlock(&wb->list_lock);
>         blk_finish_plug(&plug);
>
>         return nr_pages - work->nr_pages;
> --
> 2.40.0.348.gf938b09366-goog
>



* Re: [PATCH mm-unstable RFC 2/5] memcg: flush stats non-atomically in mem_cgroup_wb_stats()
  2023-04-03 22:03 ` [PATCH mm-unstable RFC 2/5] memcg: flush stats non-atomically in mem_cgroup_wb_stats() Yosry Ahmed
  2023-04-19 11:44   ` Michal Koutný
@ 2023-04-20 18:55   ` Shakeel Butt
  1 sibling, 0 replies; 25+ messages in thread
From: Shakeel Butt @ 2023-04-20 18:55 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexander Viro, Christian Brauner, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Andrew Morton, linux-fsdevel,
	linux-kernel, cgroups, linux-mm

On Mon, Apr 3, 2023 at 3:03 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> The previous patch moved the wb_over_bg_thresh()->mem_cgroup_wb_stats()
> code path in wb_writeback() outside the lock section. We no longer need
> to flush the stats atomically. Flush the stats non-atomically.
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Acked-by: Shakeel Butt <shakeelb@google.com>



* Re: [PATCH mm-unstable RFC 3/5] memcg: calculate root usage from global state
  2023-04-03 22:03 ` [PATCH mm-unstable RFC 3/5] memcg: calculate root usage from global state Yosry Ahmed
  2023-04-11 12:53   ` Michal Koutný
@ 2023-04-20 18:57   ` Shakeel Butt
  1 sibling, 0 replies; 25+ messages in thread
From: Shakeel Butt @ 2023-04-20 18:57 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexander Viro, Christian Brauner, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Andrew Morton, linux-fsdevel,
	linux-kernel, cgroups, linux-mm

On Mon, Apr 3, 2023 at 3:03 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Currently, we approximate the root usage by adding the memcg stats for
> anon, file, and conditionally swap (for memsw). To read the memcg stats
> we need to invoke an rstat flush. rstat flushes can be expensive, as
> they scale with the number of cpus and cgroups on the system.
>
> mem_cgroup_usage() is called by memcg_check_events()->mem_cgroup_threshold()
> with irqs disabled, so performing such an expensive operation in that
> context can cause problems.
>
> Instead, approximate the root usage from global state. This is not 100%
> accurate, but the root usage has always been ill-defined anyway.
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Acked-by: Shakeel Butt <shakeelb@google.com>



* Re: [PATCH mm-unstable RFC 4/5] memcg: remove mem_cgroup_flush_stats_atomic()
  2023-04-03 22:03 ` [PATCH mm-unstable RFC 4/5] memcg: remove mem_cgroup_flush_stats_atomic() Yosry Ahmed
@ 2023-04-20 19:38   ` Shakeel Butt
  0 siblings, 0 replies; 25+ messages in thread
From: Shakeel Butt @ 2023-04-20 19:38 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexander Viro, Christian Brauner, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Andrew Morton, linux-fsdevel,
	linux-kernel, cgroups, linux-mm

On Mon, Apr 3, 2023 at 3:03 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Previous patches removed all callers of mem_cgroup_flush_stats_atomic().
> Remove the function and simplify the code.
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Acked-by: Shakeel Butt <shakeelb@google.com>



* Re: [PATCH mm-unstable RFC 5/5] cgroup: remove cgroup_rstat_flush_atomic()
  2023-04-03 22:03 ` [PATCH mm-unstable RFC 5/5] cgroup: remove cgroup_rstat_flush_atomic() Yosry Ahmed
@ 2023-04-20 19:40   ` Shakeel Butt
  2023-04-20 19:48     ` Tejun Heo
  0 siblings, 1 reply; 25+ messages in thread
From: Shakeel Butt @ 2023-04-20 19:40 UTC (permalink / raw)
  To: Yosry Ahmed, Tejun Heo
  Cc: Alexander Viro, Christian Brauner, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Andrew Morton, linux-fsdevel,
	linux-kernel, cgroups, linux-mm

+Tejun


On Mon, Apr 3, 2023 at 3:03 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> Previous patches removed the only caller of cgroup_rstat_flush_atomic().
> Remove the function and simplify the code.


I would say let cgroup maintainers decide this and this patch can be
decoupled from the series.

>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>  include/linux/cgroup.h |  1 -
>  kernel/cgroup/rstat.c  | 26 +++++---------------------
>  2 files changed, 5 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 885f5395fcd04..567c547cf371f 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -692,7 +692,6 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
>   */
>  void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
>  void cgroup_rstat_flush(struct cgroup *cgrp);
> -void cgroup_rstat_flush_atomic(struct cgroup *cgrp);
>  void cgroup_rstat_flush_hold(struct cgroup *cgrp);
>  void cgroup_rstat_flush_release(void);
>
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index d3252b0416b69..f9ad33f117c82 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -171,7 +171,7 @@ __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,
>  __diag_pop();
>
>  /* see cgroup_rstat_flush() */
> -static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
> +static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>         __releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
>  {
>         int cpu;
> @@ -207,9 +207,8 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
>                 }
>                 raw_spin_unlock_irqrestore(cpu_lock, flags);
>
> -               /* if @may_sleep, play nice and yield if necessary */
> -               if (may_sleep && (need_resched() ||
> -                                 spin_needbreak(&cgroup_rstat_lock))) {
> +               /* play nice and yield if necessary */
> +               if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
>                         spin_unlock_irq(&cgroup_rstat_lock);
>                         if (!cond_resched())
>                                 cpu_relax();
> @@ -236,25 +235,10 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
>         might_sleep();
>
>         spin_lock_irq(&cgroup_rstat_lock);
> -       cgroup_rstat_flush_locked(cgrp, true);
> +       cgroup_rstat_flush_locked(cgrp);
>         spin_unlock_irq(&cgroup_rstat_lock);
>  }
>
> -/**
> - * cgroup_rstat_flush_atomic- atomic version of cgroup_rstat_flush()
> - * @cgrp: target cgroup
> - *
> - * This function can be called from any context.
> - */
> -void cgroup_rstat_flush_atomic(struct cgroup *cgrp)
> -{
> -       unsigned long flags;
> -
> -       spin_lock_irqsave(&cgroup_rstat_lock, flags);
> -       cgroup_rstat_flush_locked(cgrp, false);
> -       spin_unlock_irqrestore(&cgroup_rstat_lock, flags);
> -}
> -
>  /**
>   * cgroup_rstat_flush_hold - flush stats in @cgrp's subtree and hold
>   * @cgrp: target cgroup
> @@ -269,7 +253,7 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
>  {
>         might_sleep();
>         spin_lock_irq(&cgroup_rstat_lock);
> -       cgroup_rstat_flush_locked(cgrp, true);
> +       cgroup_rstat_flush_locked(cgrp);
>  }
>
>  /**
> --
> 2.40.0.348.gf938b09366-goog
>



* Re: [PATCH mm-unstable RFC 5/5] cgroup: remove cgroup_rstat_flush_atomic()
  2023-04-20 19:40   ` Shakeel Butt
@ 2023-04-20 19:48     ` Tejun Heo
  2023-04-20 20:19       ` Yosry Ahmed
  0 siblings, 1 reply; 25+ messages in thread
From: Tejun Heo @ 2023-04-20 19:48 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Yosry Ahmed, Alexander Viro, Christian Brauner, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton,
	linux-fsdevel, linux-kernel, cgroups, linux-mm

On Thu, Apr 20, 2023 at 12:40:24PM -0700, Shakeel Butt wrote:
> +Tejun
> 
> 
> On Mon, Apr 3, 2023 at 3:03 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > Previous patches removed the only caller of cgroup_rstat_flush_atomic().
> > Remove the function and simplify the code.
> 
> 
> I would say let cgroup maintainers decide this and this patch can be
> decoupled from the series.

Looks fine to me but yeah please cc me on the whole series from the next
round.

Thanks.

-- 
tejun



* Re: [PATCH mm-unstable RFC 5/5] cgroup: remove cgroup_rstat_flush_atomic()
  2023-04-20 19:48     ` Tejun Heo
@ 2023-04-20 20:19       ` Yosry Ahmed
  0 siblings, 0 replies; 25+ messages in thread
From: Yosry Ahmed @ 2023-04-20 20:19 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Shakeel Butt, Alexander Viro, Christian Brauner, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton,
	linux-fsdevel, linux-kernel, cgroups, linux-mm

On Thu, Apr 20, 2023 at 12:48 PM Tejun Heo <tj@kernel.org> wrote:
>
> On Thu, Apr 20, 2023 at 12:40:24PM -0700, Shakeel Butt wrote:
> > +Tejun
> >
> >
> > On Mon, Apr 3, 2023 at 3:03 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> > >
> > > Previous patches removed the only caller of cgroup_rstat_flush_atomic().
> > > Remove the function and simplify the code.
> >
> >
> > I would say let cgroup maintainers decide this and this patch can be
> > decoupled from the series.
>
> Looks fine to me but yeah please cc me on the whole series from the next
> round.


Thanks for taking a look, I don't know how I missed CC'ing you on this
RFC. If I have to guess, my initial draft did not have this patch, so
I did not include you or linux-cgroups, then I added this patch. Sorry
for that :)

>
>
> Thanks.
>
> --
> tejun



* Re: [PATCH mm-unstable RFC 1/5] writeback: move wb_over_bg_thresh() call outside lock section
  2023-04-20 18:53   ` Shakeel Butt
@ 2023-04-20 20:22     ` Yosry Ahmed
  0 siblings, 0 replies; 25+ messages in thread
From: Yosry Ahmed @ 2023-04-20 20:22 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Jan Kara, Jens Axboe, Alexander Viro, Christian Brauner,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Andrew Morton, linux-fsdevel, linux-kernel, cgroups, linux-mm

On Thu, Apr 20, 2023 at 11:53 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> +Jens & Jan
>
> The patch looks good but it would be nice to pass this patch through
> the eyes of experts of this area.

Thanks for taking a look and CC'ing folks. I will make sure to include
them in the next rounds as well. FWIW, Jens & Jan did not show up when
I ran scripts/get_maintainer.pl, if I remember correctly.

>
> On Mon, Apr 3, 2023 at 3:03 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > wb_over_bg_thresh() calls mem_cgroup_wb_stats() which invokes an rstat
> > flush, which can be expensive on large systems. Currently,
> > wb_writeback() calls wb_over_bg_thresh() within a lock section, so we
> > have to make the rstat flush atomic. On systems with a lot of
> > cpus/cgroups, this can cause us to disable irqs for a long time,
> > potentially causing problems.
> >
> > Move the call to wb_over_bg_thresh() outside the lock section in
> > preparation to make the rstat flush in mem_cgroup_wb_stats() non-atomic.
> > The list_empty(&wb->work_list) should be okay outside the lock section
> > of wb->list_lock as it is protected by a separate lock (wb->work_lock),
> > and wb_over_bg_thresh() doesn't seem like it is modifying any of the b_*
> > lists the wb->list_lock is protecting. Also, the loop seems to be
> > already releasing and reacquiring the lock, so this refactoring looks
> > safe.
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
> >  fs/fs-writeback.c | 16 +++++++++++-----
> >  1 file changed, 11 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 195dc23e0d831..012357bc8daa3 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -2021,7 +2021,6 @@ static long wb_writeback(struct bdi_writeback *wb,
> >         struct blk_plug plug;
> >
> >         blk_start_plug(&plug);
> > -       spin_lock(&wb->list_lock);
> >         for (;;) {
> >                 /*
> >                  * Stop writeback when nr_pages has been consumed
> > @@ -2046,6 +2045,9 @@ static long wb_writeback(struct bdi_writeback *wb,
> >                 if (work->for_background && !wb_over_bg_thresh(wb))
> >                         break;
> >
> > +
> > +               spin_lock(&wb->list_lock);
> > +
> >                 /*
> >                  * Kupdate and background works are special and we want to
> >                  * include all inodes that need writing. Livelock avoidance is
> > @@ -2075,13 +2077,19 @@ static long wb_writeback(struct bdi_writeback *wb,
> >                  * mean the overall work is done. So we keep looping as long
> >                  * as made some progress on cleaning pages or inodes.
> >                  */
> > -               if (progress)
> > +               if (progress) {
> > +                       spin_unlock(&wb->list_lock);
> >                         continue;
> > +               }
> > +
> >                 /*
> >                  * No more inodes for IO, bail
> >                  */
> > -               if (list_empty(&wb->b_more_io))
> > +               if (list_empty(&wb->b_more_io)) {
> > +                       spin_unlock(&wb->list_lock);
> >                         break;
> > +               }
> > +
> >                 /*
> >                  * Nothing written. Wait for some inode to
> >                  * become available for writeback. Otherwise
> > @@ -2093,9 +2101,7 @@ static long wb_writeback(struct bdi_writeback *wb,
> >                 spin_unlock(&wb->list_lock);
> >                 /* This function drops i_lock... */
> >                 inode_sleep_on_writeback(inode);
> > -               spin_lock(&wb->list_lock);
> >         }
> > -       spin_unlock(&wb->list_lock);
> >         blk_finish_plug(&plug);
> >
> >         return nr_pages - work->nr_pages;
> > --
> > 2.40.0.348.gf938b09366-goog
> >



* Re: [PATCH mm-unstable RFC 1/5] writeback: move wb_over_bg_thresh() call outside lock section
  2023-04-19 11:38   ` Michal Koutný
@ 2023-04-20 20:23     ` Yosry Ahmed
  0 siblings, 0 replies; 25+ messages in thread
From: Yosry Ahmed @ 2023-04-20 20:23 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Alexander Viro, Christian Brauner, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	linux-fsdevel, linux-kernel, cgroups, linux-mm

On Wed, Apr 19, 2023 at 4:38 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Mon, Apr 03, 2023 at 10:03:33PM +0000, Yosry Ahmed <yosryahmed@google.com> wrote:
> > wb_over_bg_thresh() calls mem_cgroup_wb_stats() which invokes an rstat
> > flush, which can be expensive on large systems. Currently,
> > wb_writeback() calls wb_over_bg_thresh() within a lock section, so we
> > have to make the rstat flush atomically. On systems with a lot of
> > cpus/cgroups, this can cause us to disable irqs for a long time,
> > potentially causing problems.
> >
> > Move the call to wb_over_bg_thresh() outside the lock section in
> > preparation to make the rstat flush in mem_cgroup_wb_stats() non-atomic.
> > The list_empty(&wb->work_list) should be okay outside the lock section
> > of wb->list_lock as it is protected by a separate lock (wb->work_lock),
> > and wb_over_bg_thresh() doesn't seem like it is modifying any of the b_*
> > lists the wb->list_lock is protecting. Also, the loop seems to be
> > already releasing and reacquiring the lock, so this refactoring looks
> > safe.
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > ---
> >  fs/fs-writeback.c | 16 +++++++++++-----
> >  1 file changed, 11 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 195dc23e0d831..012357bc8daa3 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -2021,7 +2021,6 @@ static long wb_writeback(struct bdi_writeback *wb,
> >       struct blk_plug plug;
> >
> >       blk_start_plug(&plug);
> > -     spin_lock(&wb->list_lock);
> >       for (;;) {
> >               /*
> >                * Stop writeback when nr_pages has been consumed
> > @@ -2046,6 +2045,9 @@ static long wb_writeback(struct bdi_writeback *wb,
> >               if (work->for_background && !wb_over_bg_thresh(wb))
> >                       break;
> >
> > +
> > +             spin_lock(&wb->list_lock);
> > +
> >               /*
> >                * Kupdate and background works are special and we want to
> >                * include all inodes that need writing. Livelock avoidance is
> > @@ -2075,13 +2077,19 @@ static long wb_writeback(struct bdi_writeback *wb,
> >                * mean the overall work is done. So we keep looping as long
> >                * as made some progress on cleaning pages or inodes.
> >                */
> > -             if (progress)
> > +             if (progress) {
> > +                     spin_unlock(&wb->list_lock);
> >                       continue;
> > +             }
> > +
>
> This would release wb->list_lock temporarily when there is progress, but
> it is already not held continuously due to writeback_sb_inodes().
> The lock hold could even be shortened further by taking it later, after
> trace_writeback_start().
>
> Altogether, the change looks OK,
> Reviewed-by: Michal Koutný <mkoutny@suse.com>

Thanks for taking a look!

>


^ permalink raw reply	[flat|nested] 25+ messages in thread
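
To make Michal's suggestion above concrete: a rough sketch of taking
wb->list_lock even later in the iteration, only once the b_io lists are
about to be touched. This is an illustration of the review comment, not
part of the posted patch, and it assumes the surrounding code matches the
mainline wb_writeback() of that era; earlier exit checks and the unlock
paths are elided.

	for (;;) {
		/* ... nr_pages / work_list / wb_over_bg_thresh() exit
		 * checks as in the posted patch, all unlocked ... */

		if (work->for_kupdate) {
			dirtied_before = jiffies -
				msecs_to_jiffies(dirty_expire_interval * 10);
		} else if (work->for_background)
			dirtied_before = jiffies;

		trace_writeback_start(wb, work);

		/* Lock taken later than in the posted patch, per the
		 * suggestion: only the b_io handling needs it. */
		spin_lock(&wb->list_lock);
		if (list_empty(&wb->b_io))
			queue_io(wb, work, dirtied_before);
		if (work->sb)
			progress = writeback_sb_inodes(work->sb, wb, work);
		else
			progress = __writeback_inodes_wb(wb, work);
		trace_writeback_written(wb, work);

		/* ... unlock on the progress / b_more_io paths as in the
		 * posted patch ... */
	}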

* Re: [PATCH mm-unstable RFC 1/5] writeback: move wb_over_bg_thresh() call outside lock section
  2023-04-03 22:03 ` [PATCH mm-unstable RFC 1/5] writeback: move wb_over_bg_thresh() call outside lock section Yosry Ahmed
  2023-04-19 11:38   ` Michal Koutný
  2023-04-20 18:53   ` Shakeel Butt
@ 2023-04-21  8:53   ` Jan Kara
  2023-04-21 17:21     ` Yosry Ahmed
  2 siblings, 1 reply; 25+ messages in thread
From: Jan Kara @ 2023-04-21  8:53 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Alexander Viro, Christian Brauner, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	linux-fsdevel, linux-kernel, cgroups, linux-mm

On Mon 03-04-23 22:03:33, Yosry Ahmed wrote:
> wb_over_bg_thresh() calls mem_cgroup_wb_stats() which invokes an rstat
> flush, which can be expensive on large systems. Currently,
> wb_writeback() calls wb_over_bg_thresh() within a lock section, so we
> have to make the rstat flush atomic. On systems with a lot of
> cpus/cgroups, this can cause us to disable irqs for a long time,
> potentially causing problems.
> 
> Move the call to wb_over_bg_thresh() outside the lock section in
> preparation to make the rstat flush in mem_cgroup_wb_stats() non-atomic.
> The list_empty(&wb->work_list) check should be okay outside the lock section
> of wb->list_lock as it is protected by a separate lock (wb->work_lock),
> and wb_over_bg_thresh() doesn't seem like it is modifying any of the b_*
> lists the wb->list_lock is protecting. Also, the loop seems to be
> already releasing and reacquiring the lock, so this refactoring looks
> safe.
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

The patch looks good to me. Nice find. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/fs-writeback.c | 16 +++++++++++-----
>  1 file changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 195dc23e0d831..012357bc8daa3 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -2021,7 +2021,6 @@ static long wb_writeback(struct bdi_writeback *wb,
>  	struct blk_plug plug;
>  
>  	blk_start_plug(&plug);
> -	spin_lock(&wb->list_lock);
>  	for (;;) {
>  		/*
>  		 * Stop writeback when nr_pages has been consumed
> @@ -2046,6 +2045,9 @@ static long wb_writeback(struct bdi_writeback *wb,
>  		if (work->for_background && !wb_over_bg_thresh(wb))
>  			break;
>  
> +
> +		spin_lock(&wb->list_lock);
> +
>  		/*
>  		 * Kupdate and background works are special and we want to
>  		 * include all inodes that need writing. Livelock avoidance is
> @@ -2075,13 +2077,19 @@ static long wb_writeback(struct bdi_writeback *wb,
>  		 * mean the overall work is done. So we keep looping as long
>  		 * as made some progress on cleaning pages or inodes.
>  		 */
> -		if (progress)
> +		if (progress) {
> +			spin_unlock(&wb->list_lock);
>  			continue;
> +		}
> +
>  		/*
>  		 * No more inodes for IO, bail
>  		 */
> -		if (list_empty(&wb->b_more_io))
> +		if (list_empty(&wb->b_more_io)) {
> +			spin_unlock(&wb->list_lock);
>  			break;
> +		}
> +
>  		/*
>  		 * Nothing written. Wait for some inode to
>  		 * become available for writeback. Otherwise
> @@ -2093,9 +2101,7 @@ static long wb_writeback(struct bdi_writeback *wb,
>  		spin_unlock(&wb->list_lock);
>  		/* This function drops i_lock... */
>  		inode_sleep_on_writeback(inode);
> -		spin_lock(&wb->list_lock);
>  	}
> -	spin_unlock(&wb->list_lock);
>  	blk_finish_plug(&plug);
>  
>  	return nr_pages - work->nr_pages;
> -- 
> 2.40.0.348.gf938b09366-goog
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 25+ messages in thread
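
For readers skimming the thread, the shape of the loop after this patch,
condensed from the diff quoted above, is roughly the following; the elided
parts are assumed to match the mainline wb_writeback() the patch is based
on.

	blk_start_plug(&plug);
	for (;;) {
		/* ... nr_pages and work_list exit checks, unlocked ... */

		/* May trigger an rstat flush; now runs without
		 * wb->list_lock held. */
		if (work->for_background && !wb_over_bg_thresh(wb))
			break;

		spin_lock(&wb->list_lock);
		/* ... queue_io() plus the writeback_sb_inodes() /
		 * __writeback_inodes_wb() work, under the lock ... */

		if (progress) {
			spin_unlock(&wb->list_lock);
			continue;
		}
		if (list_empty(&wb->b_more_io)) {
			spin_unlock(&wb->list_lock);
			break;
		}
		/* ... pick an inode from b_more_io and take
		 * inode->i_lock ... */
		spin_unlock(&wb->list_lock);
		/* This function drops i_lock... */
		inode_sleep_on_writeback(inode);
	}
	blk_finish_plug(&plug);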

* Re: [PATCH mm-unstable RFC 1/5] writeback: move wb_over_bg_thresh() call outside lock section
  2023-04-21  8:53   ` Jan Kara
@ 2023-04-21 17:21     ` Yosry Ahmed
  0 siblings, 0 replies; 25+ messages in thread
From: Yosry Ahmed @ 2023-04-21 17:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: Alexander Viro, Christian Brauner, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	linux-fsdevel, linux-kernel, cgroups, linux-mm

On Fri, Apr 21, 2023 at 1:53 AM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 03-04-23 22:03:33, Yosry Ahmed wrote:
> > wb_over_bg_thresh() calls mem_cgroup_wb_stats() which invokes an rstat
> > flush, which can be expensive on large systems. Currently,
> > wb_writeback() calls wb_over_bg_thresh() within a lock section, so we
> > have to make the rstat flush atomic. On systems with a lot of
> > cpus/cgroups, this can cause us to disable irqs for a long time,
> > potentially causing problems.
> >
> > Move the call to wb_over_bg_thresh() outside the lock section in
> > preparation to make the rstat flush in mem_cgroup_wb_stats() non-atomic.
> > The list_empty(&wb->work_list) check should be okay outside the lock section
> > of wb->list_lock as it is protected by a separate lock (wb->work_lock),
> > and wb_over_bg_thresh() doesn't seem like it is modifying any of the b_*
> > lists the wb->list_lock is protecting. Also, the loop seems to be
> > already releasing and reacquiring the lock, so this refactoring looks
> > safe.
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> The patch looks good to me. Nice find. Feel free to add:
>
> Reviewed-by: Jan Kara <jack@suse.cz>

Thanks for taking a look!

>
>                                                                 Honza
>
> > ---
> >  fs/fs-writeback.c | 16 +++++++++++-----
> >  1 file changed, 11 insertions(+), 5 deletions(-)
> >
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 195dc23e0d831..012357bc8daa3 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -2021,7 +2021,6 @@ static long wb_writeback(struct bdi_writeback *wb,
> >       struct blk_plug plug;
> >
> >       blk_start_plug(&plug);
> > -     spin_lock(&wb->list_lock);
> >       for (;;) {
> >               /*
> >                * Stop writeback when nr_pages has been consumed
> > @@ -2046,6 +2045,9 @@ static long wb_writeback(struct bdi_writeback *wb,
> >               if (work->for_background && !wb_over_bg_thresh(wb))
> >                       break;
> >
> > +
> > +             spin_lock(&wb->list_lock);
> > +
> >               /*
> >                * Kupdate and background works are special and we want to
> >                * include all inodes that need writing. Livelock avoidance is
> > @@ -2075,13 +2077,19 @@ static long wb_writeback(struct bdi_writeback *wb,
> >                * mean the overall work is done. So we keep looping as long
> >                * as made some progress on cleaning pages or inodes.
> >                */
> > -             if (progress)
> > +             if (progress) {
> > +                     spin_unlock(&wb->list_lock);
> >                       continue;
> > +             }
> > +
> >               /*
> >                * No more inodes for IO, bail
> >                */
> > -             if (list_empty(&wb->b_more_io))
> > +             if (list_empty(&wb->b_more_io)) {
> > +                     spin_unlock(&wb->list_lock);
> >                       break;
> > +             }
> > +
> >               /*
> >                * Nothing written. Wait for some inode to
> >                * become available for writeback. Otherwise
> > @@ -2093,9 +2101,7 @@ static long wb_writeback(struct bdi_writeback *wb,
> >               spin_unlock(&wb->list_lock);
> >               /* This function drops i_lock... */
> >               inode_sleep_on_writeback(inode);
> > -             spin_lock(&wb->list_lock);
> >       }
> > -     spin_unlock(&wb->list_lock);
> >       blk_finish_plug(&plug);
> >
> >       return nr_pages - work->nr_pages;
> > --
> > 2.40.0.348.gf938b09366-goog
> >
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR


^ permalink raw reply	[flat|nested] 25+ messages in thread
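
The payoff this refactoring prepares for lands in mem_cgroup_wb_stats()
(patch 2/5, not quoted in this part of the thread). Purely as a
hypothetical sketch of the idea: the flush helper used below and the
elided statistics gathering are assumptions for illustration, not taken
from the posted series.

	/* Hypothetical sketch only: with no caller holding wb->list_lock
	 * around wb_over_bg_thresh() any more, the rstat flush here may be
	 * a sleepable (non-atomic) one. */
	void mem_cgroup_wb_stats(struct bdi_writeback *wb,
				 unsigned long *pfilepages,
				 unsigned long *pheadroom,
				 unsigned long *pdirty,
				 unsigned long *pwriteback)
	{
		struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);

		/* Sleepable flush; the exact helper name is an assumption. */
		mem_cgroup_flush_stats();

		/* ... fill *pdirty, *pwriteback, *pfilepages and *pheadroom
		 * from the now up-to-date memcg stats (elided) ... */
	}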

end of thread, other threads:[~2023-04-21 17:22 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-04-03 22:03 [PATCH mm-unstable RFC 0/5] cgroup: eliminate atomic rstat Yosry Ahmed
2023-04-03 22:03 ` [PATCH mm-unstable RFC 1/5] writeback: move wb_over_bg_thresh() call outside lock section Yosry Ahmed
2023-04-19 11:38   ` Michal Koutný
2023-04-20 20:23     ` Yosry Ahmed
2023-04-20 18:53   ` Shakeel Butt
2023-04-20 20:22     ` Yosry Ahmed
2023-04-21  8:53   ` Jan Kara
2023-04-21 17:21     ` Yosry Ahmed
2023-04-03 22:03 ` [PATCH mm-unstable RFC 2/5] memcg: flush stats non-atomically in mem_cgroup_wb_stats() Yosry Ahmed
2023-04-19 11:44   ` Michal Koutný
2023-04-20 18:55   ` Shakeel Butt
2023-04-03 22:03 ` [PATCH mm-unstable RFC 3/5] memcg: calculate root usage from global state Yosry Ahmed
2023-04-11 12:53   ` Michal Koutný
2023-04-11 16:59     ` Yosry Ahmed
2023-04-20 18:57   ` Shakeel Butt
2023-04-03 22:03 ` [PATCH mm-unstable RFC 4/5] memcg: remove mem_cgroup_flush_stats_atomic() Yosry Ahmed
2023-04-20 19:38   ` Shakeel Butt
2023-04-03 22:03 ` [PATCH mm-unstable RFC 5/5] cgroup: remove cgroup_rstat_flush_atomic() Yosry Ahmed
2023-04-20 19:40   ` Shakeel Butt
2023-04-20 19:48     ` Tejun Heo
2023-04-20 20:19       ` Yosry Ahmed
2023-04-03 22:04 ` [PATCH mm-unstable RFC 0/5] cgroup: eliminate atomic rstat Yosry Ahmed
2023-04-06 18:26   ` Tim Chen
2023-04-06 18:23 ` Tim Chen
2023-04-17 11:54 ` Yosry Ahmed

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox