[PATCH v4 0/4] memcg: non-unified flushing for userspace stats

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v4 0/4] memcg: non-unified flushing for userspace stats
@ 2023-08-31 16:56 Yosry Ahmed
  2023-08-31 16:56 ` [PATCH v4 1/4] mm: memcg: properly name and document unified stats flushing Yosry Ahmed
                   ` (4 more replies)
  0 siblings, 5 replies; 29+ messages in thread
From: Yosry Ahmed @ 2023-08-31 16:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel, Yosry Ahmed

Most memcg flushing contexts using "unified" flushing, where only one
flusher is allowed at a time (others skip), and all flushers need to
flush the entire tree. This works well with high concurrency, which
mostly comes from in-kernel flushers (e.g. reclaim, refault, ..).

For userspace reads, unified flushing leads to non-deterministic stats
staleness and reading cost. This series clarifies and documents the
differences between unified and non-unified flushing (patches 1 & 2),
then opts userspace reads out of unified flushing (patch 3).

This patch series is a follow up on the discussion in [1]. That was a
patch that proposed that userspace reads wait for ongoing unified
flushers to complete before returning. There were concerns about the
latency that this introduces to userspace reads, especially with ongoing
reports of expensive stat reads even with unified flushing. Hence, this
series follows a different approach, by opting userspace reads out of
unified flushing completely. The cost of userspace reads are now
determinstic, and depend on the size of the subtree being read. This
should fix both the *sometimes* expensive reads (due to flushing the
entire tree) and occasional staless (due to skipping flushing).

I attempted to remove unified flushing completely, but noticed that
in-kernel flushers with high concurrency (e.g. hundreds of concurrent
reclaimers). This sort of concurrency is not expected from userspace
reads. More details about testing and some numbers in the last patch's
changelog.

v4 -> v5:
- Fixed build error in the last patch with W=1 because of a missed
  'static'.

v4: https://lore.kernel.org/lkml/20230830175335.1536008-1-yosryahmed@google.com/

Yosry Ahmed (4):
  mm: memcg: properly name and document unified stats flushing
  mm: memcg: add a helper for non-unified stats flushing
  mm: memcg: let non-unified root stats flushes help unified flushes
  mm: memcg: use non-unified stats flushing for userspace reads

 include/linux/memcontrol.h |   8 +--
 mm/memcontrol.c            | 106 +++++++++++++++++++++++++++----------
 mm/vmscan.c                |   2 +-
 mm/workingset.c            |   4 +-
 4 files changed, 85 insertions(+), 35 deletions(-)

-- 
2.42.0.rc2.253.gd59a3bf2b4-goog

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v4 1/4] mm: memcg: properly name and document unified stats flushing
  2023-08-31 16:56 [PATCH v4 0/4] memcg: non-unified flushing for userspace stats Yosry Ahmed
@ 2023-08-31 16:56 ` Yosry Ahmed
  2023-09-04 14:44   ` Michal Hocko
  2023-08-31 16:56 ` [PATCH v4 2/4] mm: memcg: add a helper for non-unified " Yosry Ahmed
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2023-08-31 16:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel, Yosry Ahmed

Most contexts that flush memcg stats use "unified" flushing, where
basically all flushers attempt to flush the entire hierarchy, but only
one flusher is allowed at a time, others skip flushing.

This is needed because we need to flush the stats from paths such as
reclaim or refaults, which may have high concurrency, especially on
large systems. Serializing such performance-sensitive paths can
introduce regressions, hence, unified flushing offers a tradeoff between
stats staleness and the performance impact of flushing stats.

Document this properly and explicitly by renaming the common flushing
helper from do_flush_stats() to do_unified_stats_flush(), and adding
documentation to describe unified flushing. Additionally, rename
flushing APIs to add "try" in the name, which implies that flushing will
not always happen. Also add proper documentation.

No functional change intended.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 include/linux/memcontrol.h |  8 ++---
 mm/memcontrol.c            | 61 +++++++++++++++++++++++++-------------
 mm/vmscan.c                |  2 +-
 mm/workingset.c            |  4 +--
 4 files changed, 47 insertions(+), 28 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 11810a2cfd2d..d517b0cc5221 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1034,8 +1034,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return x;
 }
 
-void mem_cgroup_flush_stats(void);
-void mem_cgroup_flush_stats_ratelimited(void);
+void mem_cgroup_try_flush_stats(void);
+void mem_cgroup_try_flush_stats_ratelimited(void);
 
 void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
 			      int val);
@@ -1519,11 +1519,11 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
 	return node_page_state(lruvec_pgdat(lruvec), idx);
 }
 
-static inline void mem_cgroup_flush_stats(void)
+static inline void mem_cgroup_try_flush_stats(void)
 {
 }
 
-static inline void mem_cgroup_flush_stats_ratelimited(void)
+static inline void mem_cgroup_try_flush_stats_ratelimited(void)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cf57fe9318d5..2d0ec828a1c4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -588,7 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
 static DEFINE_PER_CPU(unsigned int, stats_updates);
-static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
+static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
 static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
 static u64 flush_next_time;
 
@@ -630,7 +630,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 		/*
 		 * If stats_flush_threshold exceeds the threshold
 		 * (>num_online_cpus()), cgroup stats update will be triggered
-		 * in __mem_cgroup_flush_stats(). Increasing this var further
+		 * in mem_cgroup_try_flush_stats(). Increasing this var further
 		 * is redundant and simply adds overhead in atomic update.
 		 */
 		if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
@@ -639,15 +639,19 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	}
 }
 
-static void do_flush_stats(void)
+/*
+ * do_unified_stats_flush - do a unified flush of memory cgroup statistics
+ *
+ * A unified flush tries to flush the entire hierarchy, but skips if there is
+ * another ongoing flush. This is meant for flushers that may have a lot of
+ * concurrency (e.g. reclaim, refault, etc), and should not be serialized to
+ * avoid slowing down performance-sensitive paths. A unified flush may skip, and
+ * hence may yield stale stats.
+ */
+static void do_unified_stats_flush(void)
 {
-	/*
-	 * We always flush the entire tree, so concurrent flushers can just
-	 * skip. This avoids a thundering herd problem on the rstat global lock
-	 * from memcg flushers (e.g. reclaim, refault, etc).
-	 */
-	if (atomic_read(&stats_flush_ongoing) ||
-	    atomic_xchg(&stats_flush_ongoing, 1))
+	if (atomic_read(&stats_unified_flush_ongoing) ||
+	    atomic_xchg(&stats_unified_flush_ongoing, 1))
 		return;
 
 	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
@@ -655,19 +659,34 @@ static void do_flush_stats(void)
 	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
 
 	atomic_set(&stats_flush_threshold, 0);
-	atomic_set(&stats_flush_ongoing, 0);
+	atomic_set(&stats_unified_flush_ongoing, 0);
 }
 
-void mem_cgroup_flush_stats(void)
+/*
+ * mem_cgroup_try_flush_stats - try to flush memory cgroup statistics
+ *
+ * Try to flush the stats of all memcgs that have stat updates since the last
+ * flush. We do not flush the stats if:
+ * - The magnitude of the pending updates is below a certain threshold.
+ * - There is another ongoing unified flush (see do_unified_stats_flush()).
+ *
+ * Hence, the stats may be stale, but ideally by less than FLUSH_TIME due to
+ * periodic flushing.
+ */
+void mem_cgroup_try_flush_stats(void)
 {
 	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
-		do_flush_stats();
+		do_unified_stats_flush();
 }
 
-void mem_cgroup_flush_stats_ratelimited(void)
+/*
+ * Like mem_cgroup_try_flush_stats(), but only flushes if the periodic flusher
+ * is late.
+ */
+void mem_cgroup_try_flush_stats_ratelimited(void)
 {
 	if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
-		mem_cgroup_flush_stats();
+		mem_cgroup_try_flush_stats();
 }
 
 static void flush_memcg_stats_dwork(struct work_struct *w)
@@ -676,7 +695,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
 	 * Always flush here so that flushing in latency-sensitive paths is
 	 * as cheap as possible.
 	 */
-	do_flush_stats();
+	do_unified_stats_flush();
 	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
 }
 
@@ -1576,7 +1595,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	 *
 	 * Current memory state:
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -4018,7 +4037,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
@@ -4093,7 +4112,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
@@ -4595,7 +4614,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
 	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
 	struct mem_cgroup *parent;
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
 	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
@@ -6610,7 +6629,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
 	int i;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		int nid;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c7c149cb8d66..457a18921fda 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2923,7 +2923,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
 	 * Flush the memory cgroup stats, so that we read accurate per-memcg
 	 * lruvec stats for heuristics.
 	 */
-	mem_cgroup_flush_stats();
+	mem_cgroup_try_flush_stats();
 
 	/*
 	 * Determine the scan balance between anon and file LRUs.
diff --git a/mm/workingset.c b/mm/workingset.c
index da58a26d0d4d..affb8699e58d 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -520,7 +520,7 @@ void workingset_refault(struct folio *folio, void *shadow)
 	}
 
 	/* Flush stats (and potentially sleep) before holding RCU read lock */
-	mem_cgroup_flush_stats_ratelimited();
+	mem_cgroup_try_flush_stats_ratelimited();
 
 	rcu_read_lock();
 
@@ -664,7 +664,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
 		struct lruvec *lruvec;
 		int i;
 
-		mem_cgroup_flush_stats();
+		mem_cgroup_try_flush_stats();
 		lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
 		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
 			pages += lruvec_page_state_local(lruvec,
-- 
2.42.0.rc2.253.gd59a3bf2b4-goog



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v4 2/4] mm: memcg: add a helper for non-unified stats flushing
  2023-08-31 16:56 [PATCH v4 0/4] memcg: non-unified flushing for userspace stats Yosry Ahmed
  2023-08-31 16:56 ` [PATCH v4 1/4] mm: memcg: properly name and document unified stats flushing Yosry Ahmed
@ 2023-08-31 16:56 ` Yosry Ahmed
  2023-09-04 14:45   ` Michal Hocko
  2023-08-31 16:56 ` [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes Yosry Ahmed
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2023-08-31 16:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel, Yosry Ahmed

Some contexts flush memcg stats outside of unified flushing, directly
using cgroup_rstat_flush(). Add a helper for non-unified flushing, a
counterpart for do_unified_stats_flush(), and use it in those contexts,
as well as in do_unified_stats_flush() itself.

This abstracts the rstat API and makes it easy to introduce
modifications to either unified or non-unified flushing functions
without changing callers.

No functional change intended.

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/memcontrol.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d0ec828a1c4..8c046feeaae7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -639,6 +639,17 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
 	}
 }
 
+/*
+ * do_stats_flush - do a flush of the memory cgroup statistics
+ * @memcg: memory cgroup to flush
+ *
+ * Only flushes the subtree of @memcg, does not skip under any conditions.
+ */
+static void do_stats_flush(struct mem_cgroup *memcg)
+{
+	cgroup_rstat_flush(memcg->css.cgroup);
+}
+
 /*
  * do_unified_stats_flush - do a unified flush of memory cgroup statistics
  *
@@ -656,7 +667,7 @@ static void do_unified_stats_flush(void)
 
 	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
 
-	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
+	do_stats_flush(root_mem_cgroup);
 
 	atomic_set(&stats_flush_threshold, 0);
 	atomic_set(&stats_unified_flush_ongoing, 0);
@@ -7790,7 +7801,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
 			break;
 		}
 
-		cgroup_rstat_flush(memcg->css.cgroup);
+		do_stats_flush(memcg);
 		pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
 		if (pages < max)
 			continue;
@@ -7855,8 +7866,10 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
 static u64 zswap_current_read(struct cgroup_subsys_state *css,
 			      struct cftype *cft)
 {
-	cgroup_rstat_flush(css->cgroup);
-	return memcg_page_state(mem_cgroup_from_css(css), MEMCG_ZSWAP_B);
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	do_stats_flush(memcg);
+	return memcg_page_state(memcg, MEMCG_ZSWAP_B);
 }
 
 static int zswap_max_show(struct seq_file *m, void *v)
-- 
2.42.0.rc2.253.gd59a3bf2b4-goog



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes
  2023-08-31 16:56 [PATCH v4 0/4] memcg: non-unified flushing for userspace stats Yosry Ahmed
  2023-08-31 16:56 ` [PATCH v4 1/4] mm: memcg: properly name and document unified stats flushing Yosry Ahmed
  2023-08-31 16:56 ` [PATCH v4 2/4] mm: memcg: add a helper for non-unified " Yosry Ahmed
@ 2023-08-31 16:56 ` Yosry Ahmed
  2023-09-04 14:50   ` Michal Hocko
  2023-08-31 16:56 ` [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads Yosry Ahmed
  2023-08-31 17:18 ` [PATCH v4 0/4] memcg: non-unified flushing for userspace stats Waiman Long
  4 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2023-08-31 16:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel, Yosry Ahmed

Unified flushing of memcg stats keeps track of the magnitude of pending
updates, and only allows a flush if that magnitude exceeds a threshold.
It also keeps track of the time at which ratelimited flushing should be
allowed as flush_next_time.

A non-unified flush on the root memcg has the same effect as a unified
flush, so let it help unified flushing by resetting pending updates and
kicking flush_next_time forward. Move the logic into the common
do_stats_flush() helper, and do it for all root flushes, unified or
not.

There is a subtle change here, we reset stats_flush_threshold before a
flush rather than after a flush. This probably okay because:

(a) For flushers: only unified flushers check stats_flush_threshold, and
those flushers skip anyway if there is another unified flush ongoing.
Having them also skip if there is an ongoing non-unified root flush is
actually more consistent.

(b) For updaters: Resetting stats_flush_threshold early may lead to more
atomic updates of stats_flush_threshold, as we start updating it
earlier. This should not be significant in practice because we stop
updating stats_flush_threshold when it reaches the threshold anyway. If
we start early and stop early, the number of atomic updates remain the
same. The only difference is the scenario where we reset
stats_flush_threshold early, start doing atomic updates early, and then
the periodic flusher kicks in before we reach the threshold. In this
case, we will have done more atomic updates. However, since the
threshold wasn't reached, then we did not do a lot of updates anyway.

Suggested-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/memcontrol.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8c046feeaae7..94d5a6751a9e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -647,6 +647,11 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
  */
 static void do_stats_flush(struct mem_cgroup *memcg)
 {
+	/* for unified flushing, root non-unified flushing can help as well */
+	if (mem_cgroup_is_root(memcg)) {
+		WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
+		atomic_set(&stats_flush_threshold, 0);
+	}
 	cgroup_rstat_flush(memcg->css.cgroup);
 }
 
@@ -665,11 +670,8 @@ static void do_unified_stats_flush(void)
 	    atomic_xchg(&stats_unified_flush_ongoing, 1))
 		return;
 
-	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
-
 	do_stats_flush(root_mem_cgroup);
 
-	atomic_set(&stats_flush_threshold, 0);
 	atomic_set(&stats_unified_flush_ongoing, 0);
 }
 
-- 
2.42.0.rc2.253.gd59a3bf2b4-goog



^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-31 16:56 [PATCH v4 0/4] memcg: non-unified flushing for userspace stats Yosry Ahmed
                   ` (2 preceding siblings ...)
  2023-08-31 16:56 ` [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes Yosry Ahmed
@ 2023-08-31 16:56 ` Yosry Ahmed
  2023-09-04 15:15   ` Michal Hocko
  2023-08-31 17:18 ` [PATCH v4 0/4] memcg: non-unified flushing for userspace stats Waiman Long
  4 siblings, 1 reply; 29+ messages in thread
From: Yosry Ahmed @ 2023-08-31 16:56 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel, Yosry Ahmed

Unified flushing allows for great concurrency for paths that attempt to
flush the stats, at the expense of potential staleness and a single
flusher paying the extra cost of flushing the full tree.

This tradeoff makes sense for in-kernel flushers that may observe high
concurrency (e.g. reclaim, refault). For userspace readers, stale stats
may be unexpected and problematic, especially when such stats are used
for critical paths such as userspace OOM handling. Additionally, a
userspace reader will occasionally pay the cost of flushing the entire
hierarchy, which also causes problems in some cases [1].

Opt userspace reads out of unified flushing. This makes the cost of
reading the stats more predictable (proportional to the size of the
subtree), as well as the freshness of the stats. Userspace readers are
not expected to have similar concurrency to in-kernel flushers,
serializing them among themselves and among in-kernel flushers should be
okay. Nonetheless, for extra safety, introduce a mutex when flushing for
userspace readers to make sure only a single userspace reader can compete
with in-kernel flushers at a time. This takes away userspace ability to
directly influence or hurt in-kernel lock contention.

An alternative is to remove flushing from the stats reading path
completely, and rely on the periodic flusher. This should be accompanied
by making the periodic flushing period tunable, and providing an
interface for userspace to force a flush, following a similar model to
/proc/vmstat. However, such a change will be hard to reverse if the
implementation needs to be changed because:
- The cost of reading stats will be very cheap and we won't be able to
  take that back easily.
- There are user-visible interfaces involved.

Hence, let's go with the change that's most reversible first and revisit
as needed.

This was tested on a machine with 256 cpus by running a synthetic test
script [2] that creates 50 top-level cgroups, each with 5 children (250
leaf cgroups). Each leaf cgroup has 10 processes running that allocate
memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
unified flusher). Concurrently, one thread is spawned per-cgroup to read
the stats every second (including root, top-level, and leaf cgroups --
so total 251 threads). No significant regressions were observed in the
total run time, which means that userspace readers are not significantly
affecting in-kernel flushers:

Base (mm-unstable):

real	0m22.500s
user	0m9.399s
sys	73m41.381s

real	0m22.749s
user	0m15.648s
sys	73m13.113s

real	0m22.466s
user	0m10.000s
sys	73m11.933s

With this patch:

real	0m23.092s
user	0m10.110s
sys	75m42.774s

real	0m22.277s
user	0m10.443s
sys	72m7.182s

real	0m24.127s
user	0m12.617s
sys	78m52.765s

[1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
[2]https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/

Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
---
 mm/memcontrol.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 94d5a6751a9e..46a7abf71c73 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -588,6 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
 static void flush_memcg_stats_dwork(struct work_struct *w);
 static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
 static DEFINE_PER_CPU(unsigned int, stats_updates);
+static DEFINE_MUTEX(stats_user_flush_mutex);
 static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
 static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
 static u64 flush_next_time;
@@ -655,6 +656,21 @@ static void do_stats_flush(struct mem_cgroup *memcg)
 	cgroup_rstat_flush(memcg->css.cgroup);
 }
 
+/*
+ * mem_cgroup_user_flush_stats - do a stats flush for a user read
+ * @memcg: memory cgroup to flush
+ *
+ * Flush the subtree of @memcg. A mutex is used for userspace readers to gate
+ * the global rstat spinlock. This protects in-kernel flushers from userspace
+ * readers hogging the lock.
+ */
+static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
+{
+	mutex_lock(&stats_user_flush_mutex);
+	do_stats_flush(memcg);
+	mutex_unlock(&stats_user_flush_mutex);
+}
+
 /*
  * do_unified_stats_flush - do a unified flush of memory cgroup statistics
  *
@@ -1608,7 +1624,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 	 *
 	 * Current memory state:
 	 */
-	mem_cgroup_try_flush_stats();
+	mem_cgroup_user_flush_stats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		u64 size;
@@ -4050,7 +4066,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
 	int nid;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_try_flush_stats();
+	mem_cgroup_user_flush_stats(memcg);
 
 	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
 		seq_printf(m, "%s=%lu", stat->name,
@@ -4125,7 +4141,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
 
 	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
 
-	mem_cgroup_try_flush_stats();
+	mem_cgroup_user_flush_stats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
 		unsigned long nr;
@@ -6642,7 +6658,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
 	int i;
 	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
 
-	mem_cgroup_try_flush_stats();
+	mem_cgroup_user_flush_stats(memcg);
 
 	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
 		int nid;
-- 
2.42.0.rc2.253.gd59a3bf2b4-goog



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 0/4] memcg: non-unified flushing for userspace stats
  2023-08-31 16:56 [PATCH v4 0/4] memcg: non-unified flushing for userspace stats Yosry Ahmed
                   ` (3 preceding siblings ...)
  2023-08-31 16:56 ` [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads Yosry Ahmed
@ 2023-08-31 17:18 ` Waiman Long
  4 siblings, 0 replies; 29+ messages in thread
From: Waiman Long @ 2023-08-31 17:18 UTC (permalink / raw)
  To: Yosry Ahmed, Andrew Morton
  Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
	linux-mm, cgroups, linux-kernel

On 8/31/23 12:56, Yosry Ahmed wrote:
> Most memcg flushing contexts using "unified" flushing, where only one
> flusher is allowed at a time (others skip), and all flushers need to
> flush the entire tree. This works well with high concurrency, which
> mostly comes from in-kernel flushers (e.g. reclaim, refault, ..).
>
> For userspace reads, unified flushing leads to non-deterministic stats
> staleness and reading cost. This series clarifies and documents the
> differences between unified and non-unified flushing (patches 1 & 2),
> then opts userspace reads out of unified flushing (patch 3).
>
> This patch series is a follow up on the discussion in [1]. That was a
> patch that proposed that userspace reads wait for ongoing unified
> flushers to complete before returning. There were concerns about the
> latency that this introduces to userspace reads, especially with ongoing
> reports of expensive stat reads even with unified flushing. Hence, this
> series follows a different approach, by opting userspace reads out of
> unified flushing completely. The cost of userspace reads are now
> determinstic, and depend on the size of the subtree being read. This
> should fix both the *sometimes* expensive reads (due to flushing the
> entire tree) and occasional staless (due to skipping flushing).
>
> I attempted to remove unified flushing completely, but noticed that
> in-kernel flushers with high concurrency (e.g. hundreds of concurrent
> reclaimers). This sort of concurrency is not expected from userspace
> reads. More details about testing and some numbers in the last patch's
> changelog.
>
> v4 -> v5:
> - Fixed build error in the last patch with W=1 because of a missed
>    'static'.
>
> v4: https://lore.kernel.org/lkml/20230830175335.1536008-1-yosryahmed@google.com/
>
> Yosry Ahmed (4):
>    mm: memcg: properly name and document unified stats flushing
>    mm: memcg: add a helper for non-unified stats flushing
>    mm: memcg: let non-unified root stats flushes help unified flushes
>    mm: memcg: use non-unified stats flushing for userspace reads
>
>   include/linux/memcontrol.h |   8 +--
>   mm/memcontrol.c            | 106 +++++++++++++++++++++++++++----------
>   mm/vmscan.c                |   2 +-
>   mm/workingset.c            |   4 +-
>   4 files changed, 85 insertions(+), 35 deletions(-)
>
LGTM

Acked-by: Waiman Long <longman@redhat.com>



^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 1/4] mm: memcg: properly name and document unified stats flushing
  2023-08-31 16:56 ` [PATCH v4 1/4] mm: memcg: properly name and document unified stats flushing Yosry Ahmed
@ 2023-09-04 14:44   ` Michal Hocko
  2023-09-05 15:55     ` Yosry Ahmed
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2023-09-04 14:44 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel

On Thu 31-08-23 16:56:08, Yosry Ahmed wrote:
> Most contexts that flush memcg stats use "unified" flushing, where
> basically all flushers attempt to flush the entire hierarchy, but only
> one flusher is allowed at a time, others skip flushing.
> 
> This is needed because we need to flush the stats from paths such as
> reclaim or refaults, which may have high concurrency, especially on
> large systems. Serializing such performance-sensitive paths can
> introduce regressions, hence, unified flushing offers a tradeoff between
> stats staleness and the performance impact of flushing stats.
> 
> Document this properly and explicitly by renaming the common flushing
> helper from do_flush_stats() to do_unified_stats_flush(), and adding
> documentation to describe unified flushing. Additionally, rename
> flushing APIs to add "try" in the name, which implies that flushing will
> not always happen. Also add proper documentation.
> 
> No functional change intended.
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

No objections to renaming but it would be really nice to simplify this.
It is just "funny" to see 4 different flushing methods (flush from
userspace, flush, try_flush_with_ratelimit_1 and try_flush_with_ratelimit_2).
This is all internal so I am not all that worried that this would get
confused but it surely is rather convoluted.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/memcontrol.h |  8 ++---
>  mm/memcontrol.c            | 61 +++++++++++++++++++++++++-------------
>  mm/vmscan.c                |  2 +-
>  mm/workingset.c            |  4 +--
>  4 files changed, 47 insertions(+), 28 deletions(-)
> 
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 11810a2cfd2d..d517b0cc5221 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1034,8 +1034,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>  	return x;
>  }
>  
> -void mem_cgroup_flush_stats(void);
> -void mem_cgroup_flush_stats_ratelimited(void);
> +void mem_cgroup_try_flush_stats(void);
> +void mem_cgroup_try_flush_stats_ratelimited(void);
>  
>  void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
>  			      int val);
> @@ -1519,11 +1519,11 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>  	return node_page_state(lruvec_pgdat(lruvec), idx);
>  }
>  
> -static inline void mem_cgroup_flush_stats(void)
> +static inline void mem_cgroup_try_flush_stats(void)
>  {
>  }
>  
> -static inline void mem_cgroup_flush_stats_ratelimited(void)
> +static inline void mem_cgroup_try_flush_stats_ratelimited(void)
>  {
>  }
>  
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cf57fe9318d5..2d0ec828a1c4 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -588,7 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
>  static void flush_memcg_stats_dwork(struct work_struct *w);
>  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
>  static DEFINE_PER_CPU(unsigned int, stats_updates);
> -static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
> +static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
>  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
>  static u64 flush_next_time;
>  
> @@ -630,7 +630,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>  		/*
>  		 * If stats_flush_threshold exceeds the threshold
>  		 * (>num_online_cpus()), cgroup stats update will be triggered
> -		 * in __mem_cgroup_flush_stats(). Increasing this var further
> +		 * in mem_cgroup_try_flush_stats(). Increasing this var further
>  		 * is redundant and simply adds overhead in atomic update.
>  		 */
>  		if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
> @@ -639,15 +639,19 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>  	}
>  }
>  
> -static void do_flush_stats(void)
> +/*
> + * do_unified_stats_flush - do a unified flush of memory cgroup statistics
> + *
> + * A unified flush tries to flush the entire hierarchy, but skips if there is
> + * another ongoing flush. This is meant for flushers that may have a lot of
> + * concurrency (e.g. reclaim, refault, etc), and should not be serialized to
> + * avoid slowing down performance-sensitive paths. A unified flush may skip, and
> + * hence may yield stale stats.
> + */
> +static void do_unified_stats_flush(void)
>  {
> -	/*
> -	 * We always flush the entire tree, so concurrent flushers can just
> -	 * skip. This avoids a thundering herd problem on the rstat global lock
> -	 * from memcg flushers (e.g. reclaim, refault, etc).
> -	 */
> -	if (atomic_read(&stats_flush_ongoing) ||
> -	    atomic_xchg(&stats_flush_ongoing, 1))
> +	if (atomic_read(&stats_unified_flush_ongoing) ||
> +	    atomic_xchg(&stats_unified_flush_ongoing, 1))
>  		return;
>  
>  	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
> @@ -655,19 +659,34 @@ static void do_flush_stats(void)
>  	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
>  
>  	atomic_set(&stats_flush_threshold, 0);
> -	atomic_set(&stats_flush_ongoing, 0);
> +	atomic_set(&stats_unified_flush_ongoing, 0);
>  }
>  
> -void mem_cgroup_flush_stats(void)
> +/*
> + * mem_cgroup_try_flush_stats - try to flush memory cgroup statistics
> + *
> + * Try to flush the stats of all memcgs that have stat updates since the last
> + * flush. We do not flush the stats if:
> + * - The magnitude of the pending updates is below a certain threshold.
> + * - There is another ongoing unified flush (see do_unified_stats_flush()).
> + *
> + * Hence, the stats may be stale, but ideally by less than FLUSH_TIME due to
> + * periodic flushing.
> + */
> +void mem_cgroup_try_flush_stats(void)
>  {
>  	if (atomic_read(&stats_flush_threshold) > num_online_cpus())
> -		do_flush_stats();
> +		do_unified_stats_flush();
>  }
>  
> -void mem_cgroup_flush_stats_ratelimited(void)
> +/*
> + * Like mem_cgroup_try_flush_stats(), but only flushes if the periodic flusher
> + * is late.
> + */
> +void mem_cgroup_try_flush_stats_ratelimited(void)
>  {
>  	if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
> -		mem_cgroup_flush_stats();
> +		mem_cgroup_try_flush_stats();
>  }
>  
>  static void flush_memcg_stats_dwork(struct work_struct *w)
> @@ -676,7 +695,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
>  	 * Always flush here so that flushing in latency-sensitive paths is
>  	 * as cheap as possible.
>  	 */
> -	do_flush_stats();
> +	do_unified_stats_flush();
>  	queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
>  }
>  
> @@ -1576,7 +1595,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
>  	 *
>  	 * Current memory state:
>  	 */
> -	mem_cgroup_flush_stats();
> +	mem_cgroup_try_flush_stats();
>  
>  	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
>  		u64 size;
> @@ -4018,7 +4037,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
>  	int nid;
>  	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
>  
> -	mem_cgroup_flush_stats();
> +	mem_cgroup_try_flush_stats();
>  
>  	for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
>  		seq_printf(m, "%s=%lu", stat->name,
> @@ -4093,7 +4112,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
>  
>  	BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
>  
> -	mem_cgroup_flush_stats();
> +	mem_cgroup_try_flush_stats();
>  
>  	for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
>  		unsigned long nr;
> @@ -4595,7 +4614,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
>  	struct mem_cgroup *parent;
>  
> -	mem_cgroup_flush_stats();
> +	mem_cgroup_try_flush_stats();
>  
>  	*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
>  	*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
> @@ -6610,7 +6629,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
>  	int i;
>  	struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
>  
> -	mem_cgroup_flush_stats();
> +	mem_cgroup_try_flush_stats();
>  
>  	for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
>  		int nid;
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index c7c149cb8d66..457a18921fda 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2923,7 +2923,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
>  	 * Flush the memory cgroup stats, so that we read accurate per-memcg
>  	 * lruvec stats for heuristics.
>  	 */
> -	mem_cgroup_flush_stats();
> +	mem_cgroup_try_flush_stats();
>  
>  	/*
>  	 * Determine the scan balance between anon and file LRUs.
> diff --git a/mm/workingset.c b/mm/workingset.c
> index da58a26d0d4d..affb8699e58d 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -520,7 +520,7 @@ void workingset_refault(struct folio *folio, void *shadow)
>  	}
>  
>  	/* Flush stats (and potentially sleep) before holding RCU read lock */
> -	mem_cgroup_flush_stats_ratelimited();
> +	mem_cgroup_try_flush_stats_ratelimited();
>  
>  	rcu_read_lock();
>  
> @@ -664,7 +664,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
>  		struct lruvec *lruvec;
>  		int i;
>  
> -		mem_cgroup_flush_stats();
> +		mem_cgroup_try_flush_stats();
>  		lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
>  		for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
>  			pages += lruvec_page_state_local(lruvec,
> -- 
> 2.42.0.rc2.253.gd59a3bf2b4-goog

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 2/4] mm: memcg: add a helper for non-unified stats flushing
  2023-08-31 16:56 ` [PATCH v4 2/4] mm: memcg: add a helper for non-unified " Yosry Ahmed
@ 2023-09-04 14:45   ` Michal Hocko
  0 siblings, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2023-09-04 14:45 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel

On Thu 31-08-23 16:56:09, Yosry Ahmed wrote:
> Some contexts flush memcg stats outside of unified flushing, directly
> using cgroup_rstat_flush(). Add a helper for non-unified flushing, a
> counterpart for do_unified_stats_flush(), and use it in those contexts,
> as well as in do_unified_stats_flush() itself.
> 
> This abstracts the rstat API and makes it easy to introduce
> modifications to either unified or non-unified flushing functions
> without changing callers.
> 
> No functional change intended.
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/memcontrol.c | 21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2d0ec828a1c4..8c046feeaae7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -639,6 +639,17 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>  	}
>  }
>  
> +/*
> + * do_stats_flush - do a flush of the memory cgroup statistics
> + * @memcg: memory cgroup to flush
> + *
> + * Only flushes the subtree of @memcg, does not skip under any conditions.
> + */
> +static void do_stats_flush(struct mem_cgroup *memcg)
> +{
> +	cgroup_rstat_flush(memcg->css.cgroup);
> +}
> +
>  /*
>   * do_unified_stats_flush - do a unified flush of memory cgroup statistics
>   *
> @@ -656,7 +667,7 @@ static void do_unified_stats_flush(void)
>  
>  	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
>  
> -	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
> +	do_stats_flush(root_mem_cgroup);
>  
>  	atomic_set(&stats_flush_threshold, 0);
>  	atomic_set(&stats_unified_flush_ongoing, 0);
> @@ -7790,7 +7801,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
>  			break;
>  		}
>  
> -		cgroup_rstat_flush(memcg->css.cgroup);
> +		do_stats_flush(memcg);
>  		pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
>  		if (pages < max)
>  			continue;
> @@ -7855,8 +7866,10 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
>  static u64 zswap_current_read(struct cgroup_subsys_state *css,
>  			      struct cftype *cft)
>  {
> -	cgroup_rstat_flush(css->cgroup);
> -	return memcg_page_state(mem_cgroup_from_css(css), MEMCG_ZSWAP_B);
> +	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +
> +	do_stats_flush(memcg);
> +	return memcg_page_state(memcg, MEMCG_ZSWAP_B);
>  }
>  
>  static int zswap_max_show(struct seq_file *m, void *v)
> -- 
> 2.42.0.rc2.253.gd59a3bf2b4-goog

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes
  2023-08-31 16:56 ` [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes Yosry Ahmed
@ 2023-09-04 14:50   ` Michal Hocko
  2023-09-04 15:29     ` Michal Koutný
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2023-09-04 14:50 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel

On Thu 31-08-23 16:56:10, Yosry Ahmed wrote:
> Unified flushing of memcg stats keeps track of the magnitude of pending
> updates, and only allows a flush if that magnitude exceeds a threshold.
> It also keeps track of the time at which ratelimited flushing should be
> allowed as flush_next_time.
> 
> A non-unified flush on the root memcg has the same effect as a unified
> flush, so let it help unified flushing by resetting pending updates and
> kicking flush_next_time forward. Move the logic into the common
> do_stats_flush() helper, and do it for all root flushes, unified or
> not.

I have hard time to follow why we really want/need this. Does this cause
any observable changes to the behavior?

> 
> There is a subtle change here, we reset stats_flush_threshold before a
> flush rather than after a flush. This probably okay because:
> 
> (a) For flushers: only unified flushers check stats_flush_threshold, and
> those flushers skip anyway if there is another unified flush ongoing.
> Having them also skip if there is an ongoing non-unified root flush is
> actually more consistent.
> 
> (b) For updaters: Resetting stats_flush_threshold early may lead to more
> atomic updates of stats_flush_threshold, as we start updating it
> earlier. This should not be significant in practice because we stop
> updating stats_flush_threshold when it reaches the threshold anyway. If
> we start early and stop early, the number of atomic updates remain the
> same. The only difference is the scenario where we reset
> stats_flush_threshold early, start doing atomic updates early, and then
> the periodic flusher kicks in before we reach the threshold. In this
> case, we will have done more atomic updates. However, since the
> threshold wasn't reached, then we did not do a lot of updates anyway.
> 
> Suggested-by: Michal Koutný <mkoutny@suse.com>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>  mm/memcontrol.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8c046feeaae7..94d5a6751a9e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -647,6 +647,11 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>   */
>  static void do_stats_flush(struct mem_cgroup *memcg)
>  {
> +	/* for unified flushing, root non-unified flushing can help as well */
> +	if (mem_cgroup_is_root(memcg)) {
> +		WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
> +		atomic_set(&stats_flush_threshold, 0);
> +	}
>  	cgroup_rstat_flush(memcg->css.cgroup);
>  }
>  
> @@ -665,11 +670,8 @@ static void do_unified_stats_flush(void)
>  	    atomic_xchg(&stats_unified_flush_ongoing, 1))
>  		return;
>  
> -	WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
> -
>  	do_stats_flush(root_mem_cgroup);
>  
> -	atomic_set(&stats_flush_threshold, 0);
>  	atomic_set(&stats_unified_flush_ongoing, 0);
>  }
>  
> -- 
> 2.42.0.rc2.253.gd59a3bf2b4-goog

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
  2023-08-31 16:56 ` [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads Yosry Ahmed
@ 2023-09-04 15:15   ` Michal Hocko
  2023-09-05 15:57     ` Yosry Ahmed
  2023-09-08  0:52     ` Wei Xu
  0 siblings, 2 replies; 29+ messages in thread
From: Michal Hocko @ 2023-09-04 15:15 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel

On Thu 31-08-23 16:56:11, Yosry Ahmed wrote:
> Unified flushing allows for great concurrency for paths that attempt to
> flush the stats, at the expense of potential staleness and a single
> flusher paying the extra cost of flushing the full tree.
> 
> This tradeoff makes sense for in-kernel flushers that may observe high
> concurrency (e.g. reclaim, refault). For userspace readers, stale stats
> may be unexpected and problematic, especially when such stats are used
> for critical paths such as userspace OOM handling. Additionally, a
> userspace reader will occasionally pay the cost of flushing the entire
> hierarchy, which also causes problems in some cases [1].
> 
> Opt userspace reads out of unified flushing. This makes the cost of
> reading the stats more predictable (proportional to the size of the
> subtree), as well as the freshness of the stats. Userspace readers are
> not expected to have similar concurrency to in-kernel flushers,
> serializing them among themselves and among in-kernel flushers should be
> okay. Nonetheless, for extra safety, introduce a mutex when flushing for
> userspace readers to make sure only a single userspace reader can compete
> with in-kernel flushers at a time. This takes away userspace ability to
> directly influence or hurt in-kernel lock contention.

I think it would be helpful to note that the primary reason this is a
concern is that the spinlock is dropped during flushing under
contention.

> An alternative is to remove flushing from the stats reading path
> completely, and rely on the periodic flusher. This should be accompanied
> by making the periodic flushing period tunable, and providing an
> interface for userspace to force a flush, following a similar model to
> /proc/vmstat. However, such a change will be hard to reverse if the
> implementation needs to be changed because:
> - The cost of reading stats will be very cheap and we won't be able to
>   take that back easily.
> - There are user-visible interfaces involved.
> 
> Hence, let's go with the change that's most reversible first and revisit
> as needed.
> 
> This was tested on a machine with 256 cpus by running a synthetic test
> script [2] that creates 50 top-level cgroups, each with 5 children (250
> leaf cgroups). Each leaf cgroup has 10 processes running that allocate
> memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> unified flusher). Concurrently, one thread is spawned per-cgroup to read
> the stats every second (including root, top-level, and leaf cgroups --
> so total 251 threads). No significant regressions were observed in the
> total run time, which means that userspace readers are not significantly
> affecting in-kernel flushers:
> 
> Base (mm-unstable):
> 
> real	0m22.500s
> user	0m9.399s
> sys	73m41.381s
> 
> real	0m22.749s
> user	0m15.648s
> sys	73m13.113s
> 
> real	0m22.466s
> user	0m10.000s
> sys	73m11.933s
> 
> With this patch:
> 
> real	0m23.092s
> user	0m10.110s
> sys	75m42.774s
> 
> real	0m22.277s
> user	0m10.443s
> sys	72m7.182s
> 
> real	0m24.127s
> user	0m12.617s
> sys	78m52.765s
> 
> [1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> [2]https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> 
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>

OK, I can live with that but I still believe that locking involved in
the user interface only begs for issues later on as there is no control
over that lock contention other than the number of processes involved.
As it seems that we cannot make a consensus on this concern now and this
should be already helping existing workloads then let's just buy some
more time ;)

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/memcontrol.c | 24 ++++++++++++++++++++----
>  1 file changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 94d5a6751a9e..46a7abf71c73 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -588,6 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
>  static void flush_memcg_stats_dwork(struct work_struct *w);
>  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
>  static DEFINE_PER_CPU(unsigned int, stats_updates);
> +static DEFINE_MUTEX(stats_user_flush_mutex);
>  static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
>  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
>  static u64 flush_next_time;
> @@ -655,6 +656,21 @@ static void do_stats_flush(struct mem_cgroup *memcg)
>  	cgroup_rstat_flush(memcg->css.cgroup);
>  }
>  
> +/*
> + * mem_cgroup_user_flush_stats - do a stats flush for a user read
> + * @memcg: memory cgroup to flush
> + *
> + * Flush the subtree of @memcg. A mutex is used for userspace readers to gate
> + * the global rstat spinlock. This protects in-kernel flushers from userspace
> + * readers hogging the lock.

readers hogging the lock as do_stats_flush drops the spinlock under
contention.

> + */
> +static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
> +{
> +	mutex_lock(&stats_user_flush_mutex);
> +	do_stats_flush(memcg);
> +	mutex_unlock(&stats_user_flush_mutex);
> +}
> +
>  /*
>   * do_unified_stats_flush - do a unified flush of memory cgroup statistics
>   *
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes
  2023-09-04 14:50   ` Michal Hocko
@ 2023-09-04 15:29     ` Michal Koutný
  2023-09-04 15:41       ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Koutný @ 2023-09-04 15:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yosry Ahmed, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, Waiman Long,
	linux-mm, cgroups, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 574 bytes --]

Hello.

On Mon, Sep 04, 2023 at 04:50:15PM +0200, Michal Hocko <mhocko@suse.com> wrote:
> I have hard time to follow why we really want/need this. Does this cause
> any observable changes to the behavior?

Behavior change depends on how much userspace triggers the root memcg
flush, from nothing to effectively offloading flushing to userspace tasks.
(Theory^^^)

It keeps stats_flush_threshold up to date representing global error
estimate so that error-tolerant readers may save their time and it keeps
the reasoning about the stats_flush_threshold effect simple.

Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes
  2023-09-04 15:29     ` Michal Koutný
@ 2023-09-04 15:41       ` Michal Hocko
  2023-09-05 14:10         ` Michal Koutný
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2023-09-04 15:41 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Yosry Ahmed, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, Waiman Long,
	linux-mm, cgroups, linux-kernel

On Mon 04-09-23 17:29:14, Michal Koutny wrote:
> Hello.
> 
> On Mon, Sep 04, 2023 at 04:50:15PM +0200, Michal Hocko <mhocko@suse.com> wrote:
> > I have hard time to follow why we really want/need this. Does this cause
> > any observable changes to the behavior?
> 
> Behavior change depends on how much userspace triggers the root memcg
> flush, from nothing to effectively offloading flushing to userspace tasks.
> (Theory^^^)
> 
> It keeps stats_flush_threshold up to date representing global error
> estimate so that error-tolerant readers may save their time and it keeps
> the reasoning about the stats_flush_threshold effect simple.

So it also creates an undocumented but userspace visible behavior.
Something that userspace might start depending on, right?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes
  2023-09-04 15:41       ` Michal Hocko
@ 2023-09-05 14:10         ` Michal Koutný
  2023-09-05 15:54           ` Yosry Ahmed
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Koutný @ 2023-09-05 14:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yosry Ahmed, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, Waiman Long,
	linux-mm, cgroups, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 630 bytes --]

On Mon, Sep 04, 2023 at 05:41:10PM +0200, Michal Hocko <mhocko@suse.com> wrote:
> So it also creates an undocumented but userspace visible behavior.
> Something that userspace might start depending on, right?

Yes but -
- depending on undocumented behavior is a mistake,
- breaking the dependency would manifest (in the case I imagine) as a
  performance regression (and if there are some users, the future can
  allow them configuring periodic kernel flush to compensate for that).

Or do you suggest these effects should be documented (that would require
deeper analysis of the actual effect)? 


Thanks,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes
  2023-09-05 14:10         ` Michal Koutný
@ 2023-09-05 15:54           ` Yosry Ahmed
  2023-09-05 16:07             ` Michal Koutný
  2023-09-12 11:03             ` Michal Hocko
  0 siblings, 2 replies; 29+ messages in thread
From: Yosry Ahmed @ 2023-09-05 15:54 UTC (permalink / raw)
  To: Michal Koutný
  Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, Waiman Long,
	linux-mm, cgroups, linux-kernel

On Tue, Sep 5, 2023 at 7:10 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> On Mon, Sep 04, 2023 at 05:41:10PM +0200, Michal Hocko <mhocko@suse.com> wrote:
> > So it also creates an undocumented but userspace visible behavior.
> > Something that userspace might start depending on, right?
>
> Yes but -
> - depending on undocumented behavior is a mistake,
> - breaking the dependency would manifest (in the case I imagine) as a
>   performance regression (and if there are some users, the future can
>   allow them configuring periodic kernel flush to compensate for that).

I think I am missing something. This change basically makes userspace
readers (for the root memcg) help out unified flushers, which are
in-kernel readers (e.g. reclaim) -- not the other way around.

How would that create a userspace visible behavior that a dependency
can be formed on? Users expecting reclaim to be faster right after
reading root stats? I would guess that would be too flaky to cause a
behavior that people can depend on tbh.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 1/4] mm: memcg: properly name and document unified stats flushing
  2023-09-04 14:44   ` Michal Hocko
@ 2023-09-05 15:55     ` Yosry Ahmed
  0 siblings, 0 replies; 29+ messages in thread
From: Yosry Ahmed @ 2023-09-05 15:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel

On Mon, Sep 4, 2023 at 7:44 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 31-08-23 16:56:08, Yosry Ahmed wrote:
> > Most contexts that flush memcg stats use "unified" flushing, where
> > basically all flushers attempt to flush the entire hierarchy, but only
> > one flusher is allowed at a time, others skip flushing.
> >
> > This is needed because we need to flush the stats from paths such as
> > reclaim or refaults, which may have high concurrency, especially on
> > large systems. Serializing such performance-sensitive paths can
> > introduce regressions, hence, unified flushing offers a tradeoff between
> > stats staleness and the performance impact of flushing stats.
> >
> > Document this properly and explicitly by renaming the common flushing
> > helper from do_flush_stats() to do_unified_stats_flush(), and adding
> > documentation to describe unified flushing. Additionally, rename
> > flushing APIs to add "try" in the name, which implies that flushing will
> > not always happen. Also add proper documentation.
> >
> > No functional change intended.
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> No objections to renaming but it would be really nice to simplify this.
> It is just "funny" to see 4 different flushing methods (flush from
> userspace, flush, try_flush_with_ratelimit_1 and try_flush_with_ratelimit_2).
> This is all internal so I am not all that worried that this would get
> confused but it surely is rather convoluted.
>
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

I tried to at least keep the naming consistent and add documentation,
but I agree we can do better :) It's less obvious to me tbh where we
can improve, but hopefully when someone new to this code comes
complaining we'll know better what needs to be changed.

>
> > ---
> >  include/linux/memcontrol.h |  8 ++---
> >  mm/memcontrol.c            | 61 +++++++++++++++++++++++++-------------
> >  mm/vmscan.c                |  2 +-
> >  mm/workingset.c            |  4 +--
> >  4 files changed, 47 insertions(+), 28 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 11810a2cfd2d..d517b0cc5221 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -1034,8 +1034,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> >       return x;
> >  }
> >
> > -void mem_cgroup_flush_stats(void);
> > -void mem_cgroup_flush_stats_ratelimited(void);
> > +void mem_cgroup_try_flush_stats(void);
> > +void mem_cgroup_try_flush_stats_ratelimited(void);
> >
> >  void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> >                             int val);
> > @@ -1519,11 +1519,11 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
> >       return node_page_state(lruvec_pgdat(lruvec), idx);
> >  }
> >
> > -static inline void mem_cgroup_flush_stats(void)
> > +static inline void mem_cgroup_try_flush_stats(void)
> >  {
> >  }
> >
> > -static inline void mem_cgroup_flush_stats_ratelimited(void)
> > +static inline void mem_cgroup_try_flush_stats_ratelimited(void)
> >  {
> >  }
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index cf57fe9318d5..2d0ec828a1c4 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -588,7 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> >  static void flush_memcg_stats_dwork(struct work_struct *w);
> >  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> >  static DEFINE_PER_CPU(unsigned int, stats_updates);
> > -static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
> > +static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
> >  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
> >  static u64 flush_next_time;
> >
> > @@ -630,7 +630,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> >               /*
> >                * If stats_flush_threshold exceeds the threshold
> >                * (>num_online_cpus()), cgroup stats update will be triggered
> > -              * in __mem_cgroup_flush_stats(). Increasing this var further
> > +              * in mem_cgroup_try_flush_stats(). Increasing this var further
> >                * is redundant and simply adds overhead in atomic update.
> >                */
> >               if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
> > @@ -639,15 +639,19 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> >       }
> >  }
> >
> > -static void do_flush_stats(void)
> > +/*
> > + * do_unified_stats_flush - do a unified flush of memory cgroup statistics
> > + *
> > + * A unified flush tries to flush the entire hierarchy, but skips if there is
> > + * another ongoing flush. This is meant for flushers that may have a lot of
> > + * concurrency (e.g. reclaim, refault, etc), and should not be serialized to
> > + * avoid slowing down performance-sensitive paths. A unified flush may skip, and
> > + * hence may yield stale stats.
> > + */
> > +static void do_unified_stats_flush(void)
> >  {
> > -     /*
> > -      * We always flush the entire tree, so concurrent flushers can just
> > -      * skip. This avoids a thundering herd problem on the rstat global lock
> > -      * from memcg flushers (e.g. reclaim, refault, etc).
> > -      */
> > -     if (atomic_read(&stats_flush_ongoing) ||
> > -         atomic_xchg(&stats_flush_ongoing, 1))
> > +     if (atomic_read(&stats_unified_flush_ongoing) ||
> > +         atomic_xchg(&stats_unified_flush_ongoing, 1))
> >               return;
> >
> >       WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
> > @@ -655,19 +659,34 @@ static void do_flush_stats(void)
> >       cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
> >
> >       atomic_set(&stats_flush_threshold, 0);
> > -     atomic_set(&stats_flush_ongoing, 0);
> > +     atomic_set(&stats_unified_flush_ongoing, 0);
> >  }
> >
> > -void mem_cgroup_flush_stats(void)
> > +/*
> > + * mem_cgroup_try_flush_stats - try to flush memory cgroup statistics
> > + *
> > + * Try to flush the stats of all memcgs that have stat updates since the last
> > + * flush. We do not flush the stats if:
> > + * - The magnitude of the pending updates is below a certain threshold.
> > + * - There is another ongoing unified flush (see do_unified_stats_flush()).
> > + *
> > + * Hence, the stats may be stale, but ideally by less than FLUSH_TIME due to
> > + * periodic flushing.
> > + */
> > +void mem_cgroup_try_flush_stats(void)
> >  {
> >       if (atomic_read(&stats_flush_threshold) > num_online_cpus())
> > -             do_flush_stats();
> > +             do_unified_stats_flush();
> >  }
> >
> > -void mem_cgroup_flush_stats_ratelimited(void)
> > +/*
> > + * Like mem_cgroup_try_flush_stats(), but only flushes if the periodic flusher
> > + * is late.
> > + */
> > +void mem_cgroup_try_flush_stats_ratelimited(void)
> >  {
> >       if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
> > -             mem_cgroup_flush_stats();
> > +             mem_cgroup_try_flush_stats();
> >  }
> >
> >  static void flush_memcg_stats_dwork(struct work_struct *w)
> > @@ -676,7 +695,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
> >        * Always flush here so that flushing in latency-sensitive paths is
> >        * as cheap as possible.
> >        */
> > -     do_flush_stats();
> > +     do_unified_stats_flush();
> >       queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
> >  }
> >
> > @@ -1576,7 +1595,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
> >        *
> >        * Current memory state:
> >        */
> > -     mem_cgroup_flush_stats();
> > +     mem_cgroup_try_flush_stats();
> >
> >       for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
> >               u64 size;
> > @@ -4018,7 +4037,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
> >       int nid;
> >       struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> >
> > -     mem_cgroup_flush_stats();
> > +     mem_cgroup_try_flush_stats();
> >
> >       for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
> >               seq_printf(m, "%s=%lu", stat->name,
> > @@ -4093,7 +4112,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
> >
> >       BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
> >
> > -     mem_cgroup_flush_stats();
> > +     mem_cgroup_try_flush_stats();
> >
> >       for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
> >               unsigned long nr;
> > @@ -4595,7 +4614,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
> >       struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
> >       struct mem_cgroup *parent;
> >
> > -     mem_cgroup_flush_stats();
> > +     mem_cgroup_try_flush_stats();
> >
> >       *pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
> >       *pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
> > @@ -6610,7 +6629,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
> >       int i;
> >       struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> >
> > -     mem_cgroup_flush_stats();
> > +     mem_cgroup_try_flush_stats();
> >
> >       for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
> >               int nid;
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index c7c149cb8d66..457a18921fda 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2923,7 +2923,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
> >        * Flush the memory cgroup stats, so that we read accurate per-memcg
> >        * lruvec stats for heuristics.
> >        */
> > -     mem_cgroup_flush_stats();
> > +     mem_cgroup_try_flush_stats();
> >
> >       /*
> >        * Determine the scan balance between anon and file LRUs.
> > diff --git a/mm/workingset.c b/mm/workingset.c
> > index da58a26d0d4d..affb8699e58d 100644
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -520,7 +520,7 @@ void workingset_refault(struct folio *folio, void *shadow)
> >       }
> >
> >       /* Flush stats (and potentially sleep) before holding RCU read lock */
> > -     mem_cgroup_flush_stats_ratelimited();
> > +     mem_cgroup_try_flush_stats_ratelimited();
> >
> >       rcu_read_lock();
> >
> > @@ -664,7 +664,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
> >               struct lruvec *lruvec;
> >               int i;
> >
> > -             mem_cgroup_flush_stats();
> > +             mem_cgroup_try_flush_stats();
> >               lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> >               for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
> >                       pages += lruvec_page_state_local(lruvec,
> > --
> > 2.42.0.rc2.253.gd59a3bf2b4-goog
>
> --
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
  2023-09-04 15:15   ` Michal Hocko
@ 2023-09-05 15:57     ` Yosry Ahmed
  2023-09-08  0:52     ` Wei Xu
  1 sibling, 0 replies; 29+ messages in thread
From: Yosry Ahmed @ 2023-09-05 15:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel

On Mon, Sep 4, 2023 at 8:15 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 31-08-23 16:56:11, Yosry Ahmed wrote:
> > Unified flushing allows for great concurrency for paths that attempt to
> > flush the stats, at the expense of potential staleness and a single
> > flusher paying the extra cost of flushing the full tree.
> >
> > This tradeoff makes sense for in-kernel flushers that may observe high
> > concurrency (e.g. reclaim, refault). For userspace readers, stale stats
> > may be unexpected and problematic, especially when such stats are used
> > for critical paths such as userspace OOM handling. Additionally, a
> > userspace reader will occasionally pay the cost of flushing the entire
> > hierarchy, which also causes problems in some cases [1].
> >
> > Opt userspace reads out of unified flushing. This makes the cost of
> > reading the stats more predictable (proportional to the size of the
> > subtree), as well as the freshness of the stats. Userspace readers are
> > not expected to have similar concurrency to in-kernel flushers,
> > serializing them among themselves and among in-kernel flushers should be
> > okay. Nonetheless, for extra safety, introduce a mutex when flushing for
> > userspace readers to make sure only a single userspace reader can compete
> > with in-kernel flushers at a time. This takes away userspace ability to
> > directly influence or hurt in-kernel lock contention.
>
> I think it would be helpful to note that the primary reason this is a
> concern is that the spinlock is dropped during flushing under
> contention.
>
> > An alternative is to remove flushing from the stats reading path
> > completely, and rely on the periodic flusher. This should be accompanied
> > by making the periodic flushing period tunable, and providing an
> > interface for userspace to force a flush, following a similar model to
> > /proc/vmstat. However, such a change will be hard to reverse if the
> > implementation needs to be changed because:
> > - The cost of reading stats will be very cheap and we won't be able to
> >   take that back easily.
> > - There are user-visible interfaces involved.
> >
> > Hence, let's go with the change that's most reversible first and revisit
> > as needed.
> >
> > This was tested on a machine with 256 cpus by running a synthetic test
> > script [2] that creates 50 top-level cgroups, each with 5 children (250
> > leaf cgroups). Each leaf cgroup has 10 processes running that allocate
> > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> > unified flusher). Concurrently, one thread is spawned per-cgroup to read
> > the stats every second (including root, top-level, and leaf cgroups --
> > so total 251 threads). No significant regressions were observed in the
> > total run time, which means that userspace readers are not significantly
> > affecting in-kernel flushers:
> >
> > Base (mm-unstable):
> >
> > real  0m22.500s
> > user  0m9.399s
> > sys   73m41.381s
> >
> > real  0m22.749s
> > user  0m15.648s
> > sys   73m13.113s
> >
> > real  0m22.466s
> > user  0m10.000s
> > sys   73m11.933s
> >
> > With this patch:
> >
> > real  0m23.092s
> > user  0m10.110s
> > sys   75m42.774s
> >
> > real  0m22.277s
> > user  0m10.443s
> > sys   72m7.182s
> >
> > real  0m24.127s
> > user  0m12.617s
> > sys   78m52.765s
> >
> > [1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> > [2]https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> OK, I can live with that but I still believe that locking involved in
> the user interface only begs for issues later on as there is no control
> over that lock contention other than the number of processes involved.
> As it seems that we cannot make a consensus on this concern now and this
> should be already helping existing workloads then let's just buy some
> more time ;)
>
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

I agree, let's fix problems if/when they arise, maybe it will be just fine :)

I will send a v5 collecting Ack's and augmenting the changelog and
comment below as you suggested (probably after we resolve patch 3).

>
> Thanks!
>
> > ---
> >  mm/memcontrol.c | 24 ++++++++++++++++++++----
> >  1 file changed, 20 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 94d5a6751a9e..46a7abf71c73 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -588,6 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> >  static void flush_memcg_stats_dwork(struct work_struct *w);
> >  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> >  static DEFINE_PER_CPU(unsigned int, stats_updates);
> > +static DEFINE_MUTEX(stats_user_flush_mutex);
> >  static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
> >  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
> >  static u64 flush_next_time;
> > @@ -655,6 +656,21 @@ static void do_stats_flush(struct mem_cgroup *memcg)
> >       cgroup_rstat_flush(memcg->css.cgroup);
> >  }
> >
> > +/*
> > + * mem_cgroup_user_flush_stats - do a stats flush for a user read
> > + * @memcg: memory cgroup to flush
> > + *
> > + * Flush the subtree of @memcg. A mutex is used for userspace readers to gate
> > + * the global rstat spinlock. This protects in-kernel flushers from userspace
> > + * readers hogging the lock.
>
> readers hogging the lock as do_stats_flush drops the spinlock under
> contention.
>
> > + */
> > +static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
> > +{
> > +     mutex_lock(&stats_user_flush_mutex);
> > +     do_stats_flush(memcg);
> > +     mutex_unlock(&stats_user_flush_mutex);
> > +}
> > +
> >  /*
> >   * do_unified_stats_flush - do a unified flush of memory cgroup statistics
> >   *
> --
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes
  2023-09-05 15:54           ` Yosry Ahmed
@ 2023-09-05 16:07             ` Michal Koutný
  2023-09-12 11:03             ` Michal Hocko
  1 sibling, 0 replies; 29+ messages in thread
From: Michal Koutný @ 2023-09-05 16:07 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Michal Hocko, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo, Waiman Long,
	linux-mm, cgroups, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 571 bytes --]

On Tue, Sep 05, 2023 at 08:54:46AM -0700, Yosry Ahmed <yosryahmed@google.com> wrote:
> How would that create a userspace visible behavior that a dependency
> can be formed on?

A userspace process reading out root memory.stat more frequently than
in-kernel periodic flusher.

> Users expecting reclaim to be faster right after reading root stats?

Yes, that is what I had in mind.

> I would guess that would be too flaky to cause a behavior that people
> can depend on tbh.

I agree it's a weird dependency. As I wrote, nothing that would be hard
to take away.


Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
  2023-09-04 15:15   ` Michal Hocko
  2023-09-05 15:57     ` Yosry Ahmed
@ 2023-09-08  0:52     ` Wei Xu
  2023-09-08  1:02       ` Ivan Babrou
  2023-09-11 13:11       ` Michal Hocko
  1 sibling, 2 replies; 29+ messages in thread
From: Wei Xu @ 2023-09-08  0:52 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yosry Ahmed, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo,
	Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel, Greg Thelen

On Mon, Sep 4, 2023 at 8:15 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 31-08-23 16:56:11, Yosry Ahmed wrote:
> > Unified flushing allows for great concurrency for paths that attempt to
> > flush the stats, at the expense of potential staleness and a single
> > flusher paying the extra cost of flushing the full tree.
> >
> > This tradeoff makes sense for in-kernel flushers that may observe high
> > concurrency (e.g. reclaim, refault). For userspace readers, stale stats
> > may be unexpected and problematic, especially when such stats are used
> > for critical paths such as userspace OOM handling. Additionally, a
> > userspace reader will occasionally pay the cost of flushing the entire
> > hierarchy, which also causes problems in some cases [1].
> >
> > Opt userspace reads out of unified flushing. This makes the cost of
> > reading the stats more predictable (proportional to the size of the
> > subtree), as well as the freshness of the stats. Userspace readers are
> > not expected to have similar concurrency to in-kernel flushers,
> > serializing them among themselves and among in-kernel flushers should be
> > okay. Nonetheless, for extra safety, introduce a mutex when flushing for
> > userspace readers to make sure only a single userspace reader can compete
> > with in-kernel flushers at a time. This takes away userspace ability to
> > directly influence or hurt in-kernel lock contention.
>
> I think it would be helpful to note that the primary reason this is a
> concern is that the spinlock is dropped during flushing under
> contention.
>
> > An alternative is to remove flushing from the stats reading path
> > completely, and rely on the periodic flusher. This should be accompanied
> > by making the periodic flushing period tunable, and providing an
> > interface for userspace to force a flush, following a similar model to
> > /proc/vmstat. However, such a change will be hard to reverse if the
> > implementation needs to be changed because:
> > - The cost of reading stats will be very cheap and we won't be able to
> >   take that back easily.
> > - There are user-visible interfaces involved.
> >
> > Hence, let's go with the change that's most reversible first and revisit
> > as needed.
> >
> > This was tested on a machine with 256 cpus by running a synthetic test
> > script [2] that creates 50 top-level cgroups, each with 5 children (250
> > leaf cgroups). Each leaf cgroup has 10 processes running that allocate
> > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> > unified flusher). Concurrently, one thread is spawned per-cgroup to read
> > the stats every second (including root, top-level, and leaf cgroups --
> > so total 251 threads). No significant regressions were observed in the
> > total run time, which means that userspace readers are not significantly
> > affecting in-kernel flushers:
> >
> > Base (mm-unstable):
> >
> > real  0m22.500s
> > user  0m9.399s
> > sys   73m41.381s
> >
> > real  0m22.749s
> > user  0m15.648s
> > sys   73m13.113s
> >
> > real  0m22.466s
> > user  0m10.000s
> > sys   73m11.933s
> >
> > With this patch:
> >
> > real  0m23.092s
> > user  0m10.110s
> > sys   75m42.774s
> >
> > real  0m22.277s
> > user  0m10.443s
> > sys   72m7.182s
> >
> > real  0m24.127s
> > user  0m12.617s
> > sys   78m52.765s
> >
> > [1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> > [2]https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> >
> > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
>
> OK, I can live with that but I still believe that locking involved in
> the user interface only begs for issues later on as there is no control
> over that lock contention other than the number of processes involved.
> As it seems that we cannot make a consensus on this concern now and this
> should be already helping existing workloads then let's just buy some
> more time ;)

Indeed, even though the new global mutex protects the kernel from the
userspace contention on the rstats spinlock, its current
implementation doesn't have any protection for the lock contention
among the userspace threads and can cause significant delays to memcg
stats reads.

I tested this patch on a machine with 384 CPUs using a microbenchmark
that spawns 10K threads, each reading its memory.stat every 100
milliseconds.  Most of memory.stat reads take 5ms-10ms in kernel, with
~5% reads even exceeding 1 second. This is a significant regression.
In comparison, without contention, each memory.stat read only takes
20us-50us in the kernel.  Almost all of the extra read time is spent
on waiting for the new mutex. The time to flush rstats only accounts
for 10us-50us (This test creates only 1K memory cgroups and doesn't
generate any loads other than these stat reader threads).

 Here are some ideas to control the lock contention on the mutex and
reduce both the median and tail latencies of concurrent memcg stat
reads:

- Bring back the stats_flush_threshold check in
mem_cgroup_try_flush_stats() to mem_cgroup_user_flush_stats().  This
check provides a reasonable bound on the stats staleness while being
able to filter out a large number of rstats flush requests, which
reduces the contention on the mutex.

- When contended, upgrade the per-memcg rstats flush in
mem_cgroup_user_flush_stats() to a root memcg flush and coalesce these
contended flushes together.  We can wait for the ongoing flush to
complete and eliminate repeated flush requests.

- Wait for the mutex and the ongoing flush with a timeout.  We should
not use busy-wait, though.  We can bail out to read the stats without
a flush after enough wait.  A long-stalled system call is much worse
than somewhat stale stats in the corner cases in my opinion.

Wei Xu

> Acked-by: Michal Hocko <mhocko@suse.com>
>
> Thanks!
>
> > ---
> >  mm/memcontrol.c | 24 ++++++++++++++++++++----
> >  1 file changed, 20 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 94d5a6751a9e..46a7abf71c73 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -588,6 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> >  static void flush_memcg_stats_dwork(struct work_struct *w);
> >  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> >  static DEFINE_PER_CPU(unsigned int, stats_updates);
> > +static DEFINE_MUTEX(stats_user_flush_mutex);
> >  static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
> >  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
> >  static u64 flush_next_time;
> > @@ -655,6 +656,21 @@ static void do_stats_flush(struct mem_cgroup *memcg)
> >       cgroup_rstat_flush(memcg->css.cgroup);
> >  }
> >
> > +/*
> > + * mem_cgroup_user_flush_stats - do a stats flush for a user read
> > + * @memcg: memory cgroup to flush
> > + *
> > + * Flush the subtree of @memcg. A mutex is used for userspace readers to gate
> > + * the global rstat spinlock. This protects in-kernel flushers from userspace
> > + * readers hogging the lock.
>
> readers hogging the lock as do_stats_flush drops the spinlock under
> contention.
>
> > + */
> > +static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
> > +{
> > +     mutex_lock(&stats_user_flush_mutex);
> > +     do_stats_flush(memcg);
> > +     mutex_unlock(&stats_user_flush_mutex);
> > +}
> > +
> >  /*
> >   * do_unified_stats_flush - do a unified flush of memory cgroup statistics
> >   *
> --
> Michal Hocko
> SUSE Labs
>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
  2023-09-08  0:52     ` Wei Xu
@ 2023-09-08  1:02       ` Ivan Babrou
  2023-09-08  1:11         ` Yosry Ahmed
  2023-09-11 13:11       ` Michal Hocko
  1 sibling, 1 reply; 29+ messages in thread
From: Ivan Babrou @ 2023-09-08  1:02 UTC (permalink / raw)
  To: Wei Xu
  Cc: Michal Hocko, Yosry Ahmed, Andrew Morton, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Tejun Heo,
	Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel, Greg Thelen

On Thu, Sep 7, 2023 at 5:52 PM Wei Xu <weixugc@google.com> wrote:
>
> On Mon, Sep 4, 2023 at 8:15 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 31-08-23 16:56:11, Yosry Ahmed wrote:
> > > Unified flushing allows for great concurrency for paths that attempt to
> > > flush the stats, at the expense of potential staleness and a single
> > > flusher paying the extra cost of flushing the full tree.
> > >
> > > This tradeoff makes sense for in-kernel flushers that may observe high
> > > concurrency (e.g. reclaim, refault). For userspace readers, stale stats
> > > may be unexpected and problematic, especially when such stats are used
> > > for critical paths such as userspace OOM handling. Additionally, a
> > > userspace reader will occasionally pay the cost of flushing the entire
> > > hierarchy, which also causes problems in some cases [1].
> > >
> > > Opt userspace reads out of unified flushing. This makes the cost of
> > > reading the stats more predictable (proportional to the size of the
> > > subtree), as well as the freshness of the stats. Userspace readers are
> > > not expected to have similar concurrency to in-kernel flushers,
> > > serializing them among themselves and among in-kernel flushers should be
> > > okay. Nonetheless, for extra safety, introduce a mutex when flushing for
> > > userspace readers to make sure only a single userspace reader can compete
> > > with in-kernel flushers at a time. This takes away userspace ability to
> > > directly influence or hurt in-kernel lock contention.
> >
> > I think it would be helpful to note that the primary reason this is a
> > concern is that the spinlock is dropped during flushing under
> > contention.
> >
> > > An alternative is to remove flushing from the stats reading path
> > > completely, and rely on the periodic flusher. This should be accompanied
> > > by making the periodic flushing period tunable, and providing an
> > > interface for userspace to force a flush, following a similar model to
> > > /proc/vmstat. However, such a change will be hard to reverse if the
> > > implementation needs to be changed because:
> > > - The cost of reading stats will be very cheap and we won't be able to
> > >   take that back easily.
> > > - There are user-visible interfaces involved.
> > >
> > > Hence, let's go with the change that's most reversible first and revisit
> > > as needed.
> > >
> > > This was tested on a machine with 256 cpus by running a synthetic test
> > > script [2] that creates 50 top-level cgroups, each with 5 children (250
> > > leaf cgroups). Each leaf cgroup has 10 processes running that allocate
> > > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> > > unified flusher). Concurrently, one thread is spawned per-cgroup to read
> > > the stats every second (including root, top-level, and leaf cgroups --
> > > so total 251 threads). No significant regressions were observed in the
> > > total run time, which means that userspace readers are not significantly
> > > affecting in-kernel flushers:
> > >
> > > Base (mm-unstable):
> > >
> > > real  0m22.500s
> > > user  0m9.399s
> > > sys   73m41.381s
> > >
> > > real  0m22.749s
> > > user  0m15.648s
> > > sys   73m13.113s
> > >
> > > real  0m22.466s
> > > user  0m10.000s
> > > sys   73m11.933s
> > >
> > > With this patch:
> > >
> > > real  0m23.092s
> > > user  0m10.110s
> > > sys   75m42.774s
> > >
> > > real  0m22.277s
> > > user  0m10.443s
> > > sys   72m7.182s
> > >
> > > real  0m24.127s
> > > user  0m12.617s
> > > sys   78m52.765s
> > >
> > > [1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> > > [2]https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> > >
> > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> >
> > OK, I can live with that but I still believe that locking involved in
> > the user interface only begs for issues later on as there is no control
> > over that lock contention other than the number of processes involved.
> > As it seems that we cannot make a consensus on this concern now and this
> > should be already helping existing workloads then let's just buy some
> > more time ;)
>
> Indeed, even though the new global mutex protects the kernel from the
> userspace contention on the rstats spinlock, its current
> implementation doesn't have any protection for the lock contention
> among the userspace threads and can cause significant delays to memcg
> stats reads.
>
> I tested this patch on a machine with 384 CPUs using a microbenchmark
> that spawns 10K threads, each reading its memory.stat every 100
> milliseconds.  Most of memory.stat reads take 5ms-10ms in kernel, with
> ~5% reads even exceeding 1 second. This is a significant regression.
> In comparison, without contention, each memory.stat read only takes
> 20us-50us in the kernel.  Almost all of the extra read time is spent
> on waiting for the new mutex. The time to flush rstats only accounts
> for 10us-50us (This test creates only 1K memory cgroups and doesn't
> generate any loads other than these stat reader threads).
>
>  Here are some ideas to control the lock contention on the mutex and
> reduce both the median and tail latencies of concurrent memcg stat
> reads:
>
> - Bring back the stats_flush_threshold check in
> mem_cgroup_try_flush_stats() to mem_cgroup_user_flush_stats().  This
> check provides a reasonable bound on the stats staleness while being
> able to filter out a large number of rstats flush requests, which
> reduces the contention on the mutex.
>
> - When contended, upgrade the per-memcg rstats flush in
> mem_cgroup_user_flush_stats() to a root memcg flush and coalesce these
> contended flushes together.  We can wait for the ongoing flush to
> complete and eliminate repeated flush requests.

Full root memcg flush being slow is one of the issues that prompted this patch:

* https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/

I don't want us to regress in this regard.

> - Wait for the mutex and the ongoing flush with a timeout.  We should
> not use busy-wait, though.  We can bail out to read the stats without
> a flush after enough wait.  A long-stalled system call is much worse
> than somewhat stale stats in the corner cases in my opinion.
>
> Wei Xu
>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> >
> > Thanks!
> >
> > > ---
> > >  mm/memcontrol.c | 24 ++++++++++++++++++++----
> > >  1 file changed, 20 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 94d5a6751a9e..46a7abf71c73 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -588,6 +588,7 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> > >  static void flush_memcg_stats_dwork(struct work_struct *w);
> > >  static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> > >  static DEFINE_PER_CPU(unsigned int, stats_updates);
> > > +static DEFINE_MUTEX(stats_user_flush_mutex);
> > >  static atomic_t stats_unified_flush_ongoing = ATOMIC_INIT(0);
> > >  static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
> > >  static u64 flush_next_time;
> > > @@ -655,6 +656,21 @@ static void do_stats_flush(struct mem_cgroup *memcg)
> > >       cgroup_rstat_flush(memcg->css.cgroup);
> > >  }
> > >
> > > +/*
> > > + * mem_cgroup_user_flush_stats - do a stats flush for a user read
> > > + * @memcg: memory cgroup to flush
> > > + *
> > > + * Flush the subtree of @memcg. A mutex is used for userspace readers to gate
> > > + * the global rstat spinlock. This protects in-kernel flushers from userspace
> > > + * readers hogging the lock.
> >
> > readers hogging the lock as do_stats_flush drops the spinlock under
> > contention.
> >
> > > + */
> > > +static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
> > > +{
> > > +     mutex_lock(&stats_user_flush_mutex);
> > > +     do_stats_flush(memcg);
> > > +     mutex_unlock(&stats_user_flush_mutex);
> > > +}
> > > +
> > >  /*
> > >   * do_unified_stats_flush - do a unified flush of memory cgroup statistics
> > >   *
> > --
> > Michal Hocko
> > SUSE Labs
> >


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
  2023-09-08  1:02       ` Ivan Babrou
@ 2023-09-08  1:11         ` Yosry Ahmed
  0 siblings, 0 replies; 29+ messages in thread
From: Yosry Ahmed @ 2023-09-08  1:11 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: Wei Xu, Michal Hocko, Andrew Morton, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Tejun Heo,
	Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel, Greg Thelen

On Thu, Sep 7, 2023 at 6:03 PM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> On Thu, Sep 7, 2023 at 5:52 PM Wei Xu <weixugc@google.com> wrote:
> >
> > On Mon, Sep 4, 2023 at 8:15 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 31-08-23 16:56:11, Yosry Ahmed wrote:
> > > > Unified flushing allows for great concurrency for paths that attempt to
> > > > flush the stats, at the expense of potential staleness and a single
> > > > flusher paying the extra cost of flushing the full tree.
> > > >
> > > > This tradeoff makes sense for in-kernel flushers that may observe high
> > > > concurrency (e.g. reclaim, refault). For userspace readers, stale stats
> > > > may be unexpected and problematic, especially when such stats are used
> > > > for critical paths such as userspace OOM handling. Additionally, a
> > > > userspace reader will occasionally pay the cost of flushing the entire
> > > > hierarchy, which also causes problems in some cases [1].
> > > >
> > > > Opt userspace reads out of unified flushing. This makes the cost of
> > > > reading the stats more predictable (proportional to the size of the
> > > > subtree), as well as the freshness of the stats. Userspace readers are
> > > > not expected to have similar concurrency to in-kernel flushers,
> > > > serializing them among themselves and among in-kernel flushers should be
> > > > okay. Nonetheless, for extra safety, introduce a mutex when flushing for
> > > > userspace readers to make sure only a single userspace reader can compete
> > > > with in-kernel flushers at a time. This takes away userspace ability to
> > > > directly influence or hurt in-kernel lock contention.
> > >
> > > I think it would be helpful to note that the primary reason this is a
> > > concern is that the spinlock is dropped during flushing under
> > > contention.
> > >
> > > > An alternative is to remove flushing from the stats reading path
> > > > completely, and rely on the periodic flusher. This should be accompanied
> > > > by making the periodic flushing period tunable, and providing an
> > > > interface for userspace to force a flush, following a similar model to
> > > > /proc/vmstat. However, such a change will be hard to reverse if the
> > > > implementation needs to be changed because:
> > > > - The cost of reading stats will be very cheap and we won't be able to
> > > >   take that back easily.
> > > > - There are user-visible interfaces involved.
> > > >
> > > > Hence, let's go with the change that's most reversible first and revisit
> > > > as needed.
> > > >
> > > > This was tested on a machine with 256 cpus by running a synthetic test
> > > > script [2] that creates 50 top-level cgroups, each with 5 children (250
> > > > leaf cgroups). Each leaf cgroup has 10 processes running that allocate
> > > > memory beyond the cgroup limit, invoking reclaim (which is an in-kernel
> > > > unified flusher). Concurrently, one thread is spawned per-cgroup to read
> > > > the stats every second (including root, top-level, and leaf cgroups --
> > > > so total 251 threads). No significant regressions were observed in the
> > > > total run time, which means that userspace readers are not significantly
> > > > affecting in-kernel flushers:
> > > >
> > > > Base (mm-unstable):
> > > >
> > > > real  0m22.500s
> > > > user  0m9.399s
> > > > sys   73m41.381s
> > > >
> > > > real  0m22.749s
> > > > user  0m15.648s
> > > > sys   73m13.113s
> > > >
> > > > real  0m22.466s
> > > > user  0m10.000s
> > > > sys   73m11.933s
> > > >
> > > > With this patch:
> > > >
> > > > real  0m23.092s
> > > > user  0m10.110s
> > > > sys   75m42.774s
> > > >
> > > > real  0m22.277s
> > > > user  0m10.443s
> > > > sys   72m7.182s
> > > >
> > > > real  0m24.127s
> > > > user  0m12.617s
> > > > sys   78m52.765s
> > > >
> > > > [1]https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> > > > [2]https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> > > >
> > > > Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> > >
> > > OK, I can live with that but I still believe that locking involved in
> > > the user interface only begs for issues later on as there is no control
> > > over that lock contention other than the number of processes involved.
> > > As it seems that we cannot make a consensus on this concern now and this
> > > should be already helping existing workloads then let's just buy some
> > > more time ;)
> >
> > Indeed, even though the new global mutex protects the kernel from the
> > userspace contention on the rstats spinlock, its current
> > implementation doesn't have any protection for the lock contention
> > among the userspace threads and can cause significant delays to memcg
> > stats reads.
> >
> > I tested this patch on a machine with 384 CPUs using a microbenchmark
> > that spawns 10K threads, each reading its memory.stat every 100
> > milliseconds.  Most of memory.stat reads take 5ms-10ms in kernel, with
> > ~5% reads even exceeding 1 second. This is a significant regression.
> > In comparison, without contention, each memory.stat read only takes
> > 20us-50us in the kernel.  Almost all of the extra read time is spent
> > on waiting for the new mutex. The time to flush rstats only accounts
> > for 10us-50us (This test creates only 1K memory cgroups and doesn't
> > generate any loads other than these stat reader threads).
> >
> >  Here are some ideas to control the lock contention on the mutex and
> > reduce both the median and tail latencies of concurrent memcg stat
> > reads:


Thanks for the analysis, Wei!

I will update the patch series based on your ideas to limit the
contention on the userspace read mutex.

>
> >
> > - Bring back the stats_flush_threshold check in
> > mem_cgroup_try_flush_stats() to mem_cgroup_user_flush_stats().  This
> > check provides a reasonable bound on the stats staleness while being
> > able to filter out a large number of rstats flush requests, which
> > reduces the contention on the mutex.
> >
> > - When contended, upgrade the per-memcg rstats flush in
> > mem_cgroup_user_flush_stats() to a root memcg flush and coalesce these
> > contended flushes together.  We can wait for the ongoing flush to
> > complete and eliminate repeated flush requests.
>
> Full root memcg flush being slow is one of the issues that prompted this patch:
>
> * https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
>
> I don't want us to regress in this regard.


It will only be a fallback if there is high concurrency among
userspace reads, which will cause high contention on the mutex. In
that case, the userspace reads will be slowed down by contention,
which can be even worse than flush slowness as it is theoretically
unbounded.

I am working on a v5 now to incorporate Wei's suggestions. Would you
be able to test that and verify that there are no regressions?

>
>
> > - Wait for the mutex and the ongoing flush with a timeout.  We should
> > not use busy-wait, though.  We can bail out to read the stats without
> > a flush after enough wait.  A long-stalled system call is much worse
> > than somewhat stale stats in the corner cases in my opinion.
> >
> > Wei Xu
> >


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
  2023-09-08  0:52     ` Wei Xu
  2023-09-08  1:02       ` Ivan Babrou
@ 2023-09-11 13:11       ` Michal Hocko
  2023-09-11 19:15         ` Wei Xu
  1 sibling, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2023-09-11 13:11 UTC (permalink / raw)
  To: Wei Xu
  Cc: Yosry Ahmed, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo,
	Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel, Greg Thelen

On Thu 07-09-23 17:52:12, Wei Xu wrote:
[...]
> I tested this patch on a machine with 384 CPUs using a microbenchmark
> that spawns 10K threads, each reading its memory.stat every 100
> milliseconds.

This is rather extreme case but I wouldn't call it utterly insane
though.

> Most of memory.stat reads take 5ms-10ms in kernel, with
> ~5% reads even exceeding 1 second.

Just curious, what would numbers look like if the mutex is removed and
those threads would be condending on the existing spinlock with lock
dropping in place and removed. Would you be willing to give it a shot?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
  2023-09-11 13:11       ` Michal Hocko
@ 2023-09-11 19:15         ` Wei Xu
  2023-09-11 19:34           ` Michal Hocko
  0 siblings, 1 reply; 29+ messages in thread
From: Wei Xu @ 2023-09-11 19:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yosry Ahmed, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo,
	Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel, Greg Thelen

On Mon, Sep 11, 2023 at 6:11 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Thu 07-09-23 17:52:12, Wei Xu wrote:
> [...]
> > I tested this patch on a machine with 384 CPUs using a microbenchmark
> > that spawns 10K threads, each reading its memory.stat every 100
> > milliseconds.
>
> This is rather extreme case but I wouldn't call it utterly insane
> though.
>
> > Most of memory.stat reads take 5ms-10ms in kernel, with
> > ~5% reads even exceeding 1 second.
>
> Just curious, what would numbers look like if the mutex is removed and
> those threads would be condending on the existing spinlock with lock
> dropping in place and removed. Would you be willing to give it a shot?

Without the mutex and with the spinlock only, the common read latency
of memory.stat is still 5ms-10ms in kernel. There are very few reads
(<0.003%) going above 10ms and none more than 1 second.

> --
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
  2023-09-11 19:15         ` Wei Xu
@ 2023-09-11 19:34           ` Michal Hocko
  2023-09-11 20:01             ` Wei Xu
  0 siblings, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2023-09-11 19:34 UTC (permalink / raw)
  To: Wei Xu
  Cc: Yosry Ahmed, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo,
	Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel, Greg Thelen

On Mon 11-09-23 12:15:24, Wei Xu wrote:
> On Mon, Sep 11, 2023 at 6:11 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Thu 07-09-23 17:52:12, Wei Xu wrote:
> > [...]
> > > I tested this patch on a machine with 384 CPUs using a microbenchmark
> > > that spawns 10K threads, each reading its memory.stat every 100
> > > milliseconds.
> >
> > This is rather extreme case but I wouldn't call it utterly insane
> > though.
> >
> > > Most of memory.stat reads take 5ms-10ms in kernel, with
> > > ~5% reads even exceeding 1 second.
> >
> > Just curious, what would numbers look like if the mutex is removed and
> > those threads would be condending on the existing spinlock with lock
> > dropping in place and removed. Would you be willing to give it a shot?
> 
> Without the mutex and with the spinlock only, the common read latency
> of memory.stat is still 5ms-10ms in kernel. There are very few reads
> (<0.003%) going above 10ms and none more than 1 second.

Is this with the existing spinlock dropping and same 10k potentially
contending readers?
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
  2023-09-11 19:34           ` Michal Hocko
@ 2023-09-11 20:01             ` Wei Xu
  2023-09-11 20:21               ` Tejun Heo
  0 siblings, 1 reply; 29+ messages in thread
From: Wei Xu @ 2023-09-11 20:01 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Yosry Ahmed, Andrew Morton, Johannes Weiner, Roman Gushchin,
	Shakeel Butt, Muchun Song, Ivan Babrou, Tejun Heo,
	Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel, Greg Thelen

On Mon, Sep 11, 2023 at 12:34 PM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 11-09-23 12:15:24, Wei Xu wrote:
> > On Mon, Sep 11, 2023 at 6:11 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Thu 07-09-23 17:52:12, Wei Xu wrote:
> > > [...]
> > > > I tested this patch on a machine with 384 CPUs using a microbenchmark
> > > > that spawns 10K threads, each reading its memory.stat every 100
> > > > milliseconds.
> > >
> > > This is rather extreme case but I wouldn't call it utterly insane
> > > though.
> > >
> > > > Most of memory.stat reads take 5ms-10ms in kernel, with
> > > > ~5% reads even exceeding 1 second.
> > >
> > > Just curious, what would numbers look like if the mutex is removed and
> > > those threads would be condending on the existing spinlock with lock
> > > dropping in place and removed. Would you be willing to give it a shot?
> >
> > Without the mutex and with the spinlock only, the common read latency
> > of memory.stat is still 5ms-10ms in kernel. There are very few reads
> > (<0.003%) going above 10ms and none more than 1 second.
>
> Is this with the existing spinlock dropping and same 10k potentially
> contending readers?

Yes, it is the same test (10K contending readers). The kernel change
is to remove stats_user_flush_mutex from mem_cgroup_user_flush_stats()
so that the concurrent mem_cgroup_user_flush_stats() requests directly
contend on cgroup_rstat_lock in cgroup_rstat_flush().

> --
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
  2023-09-11 20:01             ` Wei Xu
@ 2023-09-11 20:21               ` Tejun Heo
  2023-09-11 20:28                 ` Yosry Ahmed
  2023-09-12 11:03                 ` Michal Hocko
  0 siblings, 2 replies; 29+ messages in thread
From: Tejun Heo @ 2023-09-11 20:21 UTC (permalink / raw)
  To: Wei Xu
  Cc: Michal Hocko, Yosry Ahmed, Andrew Morton, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou,
	Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel, Greg Thelen

Hello,

On Mon, Sep 11, 2023 at 01:01:25PM -0700, Wei Xu wrote:
> Yes, it is the same test (10K contending readers). The kernel change
> is to remove stats_user_flush_mutex from mem_cgroup_user_flush_stats()
> so that the concurrent mem_cgroup_user_flush_stats() requests directly
> contend on cgroup_rstat_lock in cgroup_rstat_flush().

I don't think it'd be a good idea to twist rstat and other kernel internal
code to accommodate 10k parallel readers. If we want to support that, let's
explicitly support that by implementing better batching in the read path.
The only guarantee you need is that there has been at least one flush since
the read attempt started, so we can do sth like the following in the read
path:

1. Grab a waiter lock. Remember the current timestamp.

2. Try lock flush mutex. If obtained, drop the waiter lock, flush. Regrab
   the waiter lock, update the latest flush time to my start time, wake up
   waiters on the waitqueue (maybe do custom wakeups based on start time?).

3. Release the waiter lock and sleep on the waitqueue.

4. When woken up, regarb the waiter lock, compare whether the latest flush
   timestamp is later than my start time, if so, return the latest result.
   If not go back to #2.

Maybe the above isn't the best way to do it but you get the general idea.
When you have that many concurrent readers, most of them won't need to
actually flush.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
  2023-09-11 20:21               ` Tejun Heo
@ 2023-09-11 20:28                 ` Yosry Ahmed
  2023-09-12 11:03                 ` Michal Hocko
  1 sibling, 0 replies; 29+ messages in thread
From: Yosry Ahmed @ 2023-09-11 20:28 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Wei Xu, Michal Hocko, Andrew Morton, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou,
	Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel, Greg Thelen

On Mon, Sep 11, 2023 at 1:21 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Mon, Sep 11, 2023 at 01:01:25PM -0700, Wei Xu wrote:
> > Yes, it is the same test (10K contending readers). The kernel change
> > is to remove stats_user_flush_mutex from mem_cgroup_user_flush_stats()
> > so that the concurrent mem_cgroup_user_flush_stats() requests directly
> > contend on cgroup_rstat_lock in cgroup_rstat_flush().
>
> I don't think it'd be a good idea to twist rstat and other kernel internal
> code to accommodate 10k parallel readers. If we want to support that, let's
> explicitly support that by implementing better batching in the read path.
> The only guarantee you need is that there has been at least one flush since
> the read attempt started, so we can do sth like the following in the read
> path:
>
> 1. Grab a waiter lock. Remember the current timestamp.
>
> 2. Try lock flush mutex. If obtained, drop the waiter lock, flush. Regrab
>    the waiter lock, update the latest flush time to my start time, wake up
>    waiters on the waitqueue (maybe do custom wakeups based on start time?).
>
> 3. Release the waiter lock and sleep on the waitqueue.
>
> 4. When woken up, regarb the waiter lock, compare whether the latest flush
>    timestamp is later than my start time, if so, return the latest result.
>    If not go back to #2.
>
> Maybe the above isn't the best way to do it but you get the general idea.
> When you have that many concurrent readers, most of them won't need to
> actually flush.

I am testing something vaguely similar to this conceptually, but
doesn't depend on timestamps.

I replaced the mutex with a semaphore, and I added a fallback logic to
unified flushing with a timeout:

  static void mem_cgroup_user_flush_stats(struct mem_cgroup *memcg)
  {
          static DEFINE_SEMAPHORE(user_flush_sem, 1);

          if (atomic_read(&stats_flush_order) <= STATS_FLUSH_THRESHOLD)
                  return;

          if (!down_timeout(&user_flush_sem, msecs_to_jiffies(1))) {
                  do_stats_flush(memcg);
                  up(&user_flush_sem);
          } else {
                  do_unified_stats_flush(true);
          }
  }

In do_unified_stats_flush(), I added a wait argument. If set, the
caller will wait for any ongoing flushers before returning (but it
never attempts to flush, so no contention on the underlying rstat
lock). I implemented this using completions. I am running some tests
now, but this should make sure userspace read latency is bounded by
1ms + unified flush time. We basically attempt to flush our subtree
only, if we can't after 1ms, we fallback to unified flushing.

Another benefit I am seeing here is that I tried switching in-kernel
flushers to also use the completion in do_unified_stats_flush().
Basically instead of skipping entirely when someone else is flushing,
they just wait for them to finish (without being serialized or
contending the lock). I see no regressions in my parallel reclaim
benchmark. This should make sure no one ever skips a flush, while
still avoiding too much serialization/contention. I suspect this
should make reclaim heuristics (and other in-kernel flushers) slightly
better.

I will run Wei's benchmark next to see how userspace read latency is affected.

>
> Thanks.
>
> --
> tejun

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
  2023-09-11 20:21               ` Tejun Heo
  2023-09-11 20:28                 ` Yosry Ahmed
@ 2023-09-12 11:03                 ` Michal Hocko
  2023-09-12 11:09                   ` Yosry Ahmed
  1 sibling, 1 reply; 29+ messages in thread
From: Michal Hocko @ 2023-09-12 11:03 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Wei Xu, Yosry Ahmed, Andrew Morton, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou,
	Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel, Greg Thelen

On Mon 11-09-23 10:21:24, Tejun Heo wrote:
> Hello,
> 
> On Mon, Sep 11, 2023 at 01:01:25PM -0700, Wei Xu wrote:
> > Yes, it is the same test (10K contending readers). The kernel change
> > is to remove stats_user_flush_mutex from mem_cgroup_user_flush_stats()
> > so that the concurrent mem_cgroup_user_flush_stats() requests directly
> > contend on cgroup_rstat_lock in cgroup_rstat_flush().
> 
> I don't think it'd be a good idea to twist rstat and other kernel internal
> code to accommodate 10k parallel readers.

I didn't mean to suggest optimizing for this specific scenario. I was
mostly curious whether the pathological case of unbound high latency due
to lock dropping is easy to trigger by huge number of readers. It seems
it is not and the mutex might not be really needed as a prevention.

> If we want to support that, let's
> explicitly support that by implementing better batching in the read path.

Well, we need to be able to handle those situations because stat files
are generally readable and we do not want unrelated workloads to
influence each other heavily through this path.

[...]

> When you have that many concurrent readers, most of them won't need to
> actually flush.

Agreed! 
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes
  2023-09-05 15:54           ` Yosry Ahmed
  2023-09-05 16:07             ` Michal Koutný
@ 2023-09-12 11:03             ` Michal Hocko
  1 sibling, 0 replies; 29+ messages in thread
From: Michal Hocko @ 2023-09-12 11:03 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Michal Koutný,
	Andrew Morton, Johannes Weiner, Roman Gushchin, Shakeel Butt,
	Muchun Song, Ivan Babrou, Tejun Heo, Waiman Long, linux-mm,
	cgroups, linux-kernel

On Tue 05-09-23 08:54:46, Yosry Ahmed wrote:
> On Tue, Sep 5, 2023 at 7:10 AM Michal Koutný <mkoutny@suse.com> wrote:
> >
> > On Mon, Sep 04, 2023 at 05:41:10PM +0200, Michal Hocko <mhocko@suse.com> wrote:
> > > So it also creates an undocumented but userspace visible behavior.
> > > Something that userspace might start depending on, right?
> >
> > Yes but -
> > - depending on undocumented behavior is a mistake,
> > - breaking the dependency would manifest (in the case I imagine) as a
> >   performance regression (and if there are some users, the future can
> >   allow them configuring periodic kernel flush to compensate for that).
> 
> I think I am missing something. This change basically makes userspace
> readers (for the root memcg) help out unified flushers, which are
> in-kernel readers (e.g. reclaim) -- not the other way around.
> 
> How would that create a userspace visible behavior that a dependency
> can be formed on? Users expecting reclaim to be faster right after
> reading root stats? I would guess that would be too flaky to cause a
> behavior that people can depend on tbh.

Flaky or not, it might cause behavior difference and a subtle one. I can
imagine nohz and similar workloads wanting to (ab)use this to reduce
kernel footprint. If we really need this then make it obvious in the
changelog at least.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads
  2023-09-12 11:03                 ` Michal Hocko
@ 2023-09-12 11:09                   ` Yosry Ahmed
  0 siblings, 0 replies; 29+ messages in thread
From: Yosry Ahmed @ 2023-09-12 11:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Tejun Heo, Wei Xu, Andrew Morton, Johannes Weiner,
	Roman Gushchin, Shakeel Butt, Muchun Song, Ivan Babrou,
	Michal Koutný,
	Waiman Long, linux-mm, cgroups, linux-kernel, Greg Thelen

On Tue, Sep 12, 2023 at 4:03 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 11-09-23 10:21:24, Tejun Heo wrote:
> > Hello,
> >
> > On Mon, Sep 11, 2023 at 01:01:25PM -0700, Wei Xu wrote:
> > > Yes, it is the same test (10K contending readers). The kernel change
> > > is to remove stats_user_flush_mutex from mem_cgroup_user_flush_stats()
> > > so that the concurrent mem_cgroup_user_flush_stats() requests directly
> > > contend on cgroup_rstat_lock in cgroup_rstat_flush().
> >
> > I don't think it'd be a good idea to twist rstat and other kernel internal
> > code to accommodate 10k parallel readers.
>
> I didn't mean to suggest optimizing for this specific scenario. I was
> mostly curious whether the pathological case of unbound high latency due
> to lock dropping is easy to trigger by huge number of readers. It seems
> it is not and the mutex might not be really needed as a prevention.
>
> > If we want to support that, let's
> > explicitly support that by implementing better batching in the read path.
>
> Well, we need to be able to handle those situations because stat files
> are generally readable and we do not want unrelated workloads to
> influence each other heavily through this path.

I am working on a complete rework of this series based on the feedback
I got from Wei and the discussions here. I think I have something
simpler and more generic, and doesn't proliferate the number of
flushing variants we have. I am running some tests right now and will
share it as soon as I can.

It should address the high concurrency use case without adding a lot
of complexity. It basically involves a fast path where we only flush
the needed subtree if there's no contention, and a slow path where we
coalesce all flushing requests, and everyone just waits for a single
flush to complete (without spinning or contending on any locks). I am
trying to use this generic mechanism for both userspace reads and
in-kernel flushers. I am making sure in-kernel flushers do not
regress.

>
> [...]
>
> > When you have that many concurrent readers, most of them won't need to
> > actually flush.
>
> Agreed!
> --
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2023-09-12 11:10 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-31 16:56 [PATCH v4 0/4] memcg: non-unified flushing for userspace stats Yosry Ahmed
2023-08-31 16:56 ` [PATCH v4 1/4] mm: memcg: properly name and document unified stats flushing Yosry Ahmed
2023-09-04 14:44   ` Michal Hocko
2023-09-05 15:55     ` Yosry Ahmed
2023-08-31 16:56 ` [PATCH v4 2/4] mm: memcg: add a helper for non-unified " Yosry Ahmed
2023-09-04 14:45   ` Michal Hocko
2023-08-31 16:56 ` [PATCH v4 3/4] mm: memcg: let non-unified root stats flushes help unified flushes Yosry Ahmed
2023-09-04 14:50   ` Michal Hocko
2023-09-04 15:29     ` Michal Koutný
2023-09-04 15:41       ` Michal Hocko
2023-09-05 14:10         ` Michal Koutný
2023-09-05 15:54           ` Yosry Ahmed
2023-09-05 16:07             ` Michal Koutný
2023-09-12 11:03             ` Michal Hocko
2023-08-31 16:56 ` [PATCH v4 4/4] mm: memcg: use non-unified stats flushing for userspace reads Yosry Ahmed
2023-09-04 15:15   ` Michal Hocko
2023-09-05 15:57     ` Yosry Ahmed
2023-09-08  0:52     ` Wei Xu
2023-09-08  1:02       ` Ivan Babrou
2023-09-08  1:11         ` Yosry Ahmed
2023-09-11 13:11       ` Michal Hocko
2023-09-11 19:15         ` Wei Xu
2023-09-11 19:34           ` Michal Hocko
2023-09-11 20:01             ` Wei Xu
2023-09-11 20:21               ` Tejun Heo
2023-09-11 20:28                 ` Yosry Ahmed
2023-09-12 11:03                 ` Michal Hocko
2023-09-12 11:09                   ` Yosry Ahmed
2023-08-31 17:18 ` [PATCH v4 0/4] memcg: non-unified flushing for userspace stats Waiman Long

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox