From: Waiman Long <longman@redhat.com>
To: Yosry Ahmed <yosryahmed@google.com>,
	Andrew Morton <akpm@linux-foundation.org>
Cc: "Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Hocko" <mhocko@kernel.org>,
	"Roman Gushchin" <roman.gushchin@linux.dev>,
	"Shakeel Butt" <shakeelb@google.com>,
	"Muchun Song" <muchun.song@linux.dev>,
	"Ivan Babrou" <ivan@cloudflare.com>, "Tejun Heo" <tj@kernel.org>,
	"Michal Koutný" <mkoutny@suse.com>,
	kernel-team@cloudflare.com, "Wei Xu" <weixugc@google.com>,
	"Greg Thelen" <gthelen@google.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 3/3] mm: memcg: optimize stats flushing for latency and accuracy
Date: Thu, 14 Sep 2023 13:19:52 -0400
Message-ID: <cbad0762-892f-229e-280e-1faafbcb36b8@redhat.com>
In-Reply-To: <20230913073846.1528938-4-yosryahmed@google.com>


On 9/13/23 03:38, Yosry Ahmed wrote:
> Stats flushing for memcg currently follows the following rules:
> - Always flush the entire memcg hierarchy (i.e. flush the root).
> - Only one flusher is allowed at a time. If someone else tries to flush
>    concurrently, they skip and return immediately.
> - A periodic flusher flushes all the stats every 2 seconds.
>
> The reason this approach is followed is that all flushes are
> serialized by a global rstat spinlock. On the memcg side, flushing is
> invoked from userspace reads as well as in-kernel flushers (e.g.
> reclaim, refault, etc). This approach aims to avoid serializing all
> flushers on the global lock, which can cause a significant performance
> hit under high concurrency.
>
> This approach has the following problems:
> - Occasionally a userspace read of the stats of a non-root cgroup will
>    be too expensive as it has to flush the entire hierarchy [1].
> - Sometimes the stats accuracy is compromised if there is an ongoing
>    flush, and we skip and return before the subtree of interest is
>    actually flushed. This is more visible when reading stats from
>    userspace, but can also affect in-kernel flushers.
>
> This patch aims to solve both problems by reworking how flushing
> currently works as follows:
> - Without contention, there is no need to flush the entire tree. In this
>    case, only flush the subtree of interest. This avoids the latency of a
>    full root flush if unnecessary.
> - With contention, fall back to a coalesced (aka unified) flush of the
>    entire hierarchy, a root flush. In this case, instead of returning
>    immediately if a root flush is ongoing, wait for it to finish
>    *without* attempting to acquire the lock or flush. This is done using
>    a completion. Compared to competing directly on the underlying lock,
>    this approach makes concurrent flushing a synchronization point
>    instead of a serialization point. Once a root flush finishes, *all*
>    waiters can wake up and continue at once.
> - Finally, with very high contention, bound the number of waiters to the
>    number of online cpus. This keeps the flush latency bounded at the tail
>    (very high concurrency). We fall back to sacrificing stats freshness only
>    in such cases in favor of performance.
>
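
For anyone skimming, the contended path described above boils down to
roughly the pattern below. This is just a simplified sketch of the
scheme as described, not the patch's actual code (setting and clearing
root_flush_ongoing, the "wait" flag and re-arming the completion are
all omitted):

	static DECLARE_COMPLETION(root_flush_done);
	static atomic_t waiters = ATOMIC_INIT(0);

	if (READ_ONCE(root_flush_ongoing)) {
		/* bound waiters; past nr_cpus, skip and accept stale stats */
		if (!atomic_add_unless(&waiters, 1, num_online_cpus()))
			return;
		wait_for_completion(&root_flush_done);
		atomic_dec(&waiters);
		return;
	}

	/* sole root flusher: flush once, then wake all waiters together */
	cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
	complete_all(&root_flush_done);
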
> This was tested in two ways on a machine with 384 cpus:
> - A synthetic test with 5000 concurrent workers doing allocations and
>    reclaim, as well as 1000 readers for memory.stat (variation of [2]).
>    No significant regressions were noticed in the total runtime.
>    Note that if concurrent flushers compete directly on the spinlock
>    instead of waiting for a completion, this test shows 2x-3x slowdowns.
>    Even though subsequent flushers would have nothing to flush, just the
>    serialization and lock contention is a major problem. Using a
>    completion for synchronization instead seems to overcome this problem.
>
> - A synthetic stress test for concurrently reading memcg stats provided
>    by Wei Xu.
>    With 10k threads reading the stats every 100ms:
>    - 98.8% of reads take <100us
>    - 1.09% of reads take 100us to 1ms.
>    - 0.11% of reads take 1ms to 10ms.
>    - Almost no reads take more than 10ms.
>    With 10k threads reading the stats every 10ms:
>    - 82.3% of reads take <100us.
>    - 4.2% of reads take 100us to 1ms.
>    - 4.7% of reads take 1ms to 10ms.
>    - 8.8% of reads take 10ms to 100ms.
>    - Almost no reads take more than 100ms.
>
> [1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
> [2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
> [3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/
>
> [weixugc@google.com: suggested the fallback logic and bounding the
> number of waiters]
>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> ---
>   include/linux/memcontrol.h |   4 +-
>   mm/memcontrol.c            | 100 ++++++++++++++++++++++++++++---------
>   mm/vmscan.c                |   2 +-
>   mm/workingset.c            |   8 ++-
>   4 files changed, 85 insertions(+), 29 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 11810a2cfd2d..4453cd3fc4b8 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1034,7 +1034,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>   	return x;
>   }
>   
> -void mem_cgroup_flush_stats(void);
> +void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
>   void mem_cgroup_flush_stats_ratelimited(void);
>   
>   void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
> @@ -1519,7 +1519,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
>   	return node_page_state(lruvec_pgdat(lruvec), idx);
>   }
>   
> -static inline void mem_cgroup_flush_stats(void)
> +static inline void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
>   {
>   }
>   
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d729870505f1..edff41e4b4e7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -588,7 +588,6 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
>   static void flush_memcg_stats_dwork(struct work_struct *w);
>   static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
>   static DEFINE_PER_CPU(unsigned int, stats_updates);
> -static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
>   /* stats_updates_order is in multiples of MEMCG_CHARGE_BATCH */
>   static atomic_t stats_updates_order = ATOMIC_INIT(0);
>   static u64 flush_last_time;
> @@ -639,36 +638,87 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>   	}
>   }
>   
> -static void do_flush_stats(void)
> +/*
> + * do_flush_stats - flush the statistics of a memory cgroup and its tree
> + * @memcg: the memory cgroup to flush
> + * @wait: wait for an ongoing root flush to complete before returning
> + *
> + * All flushes are serialized by the underlying rstat global lock. If there is
> + * no contention, we try to only flush the subtree of the passed @memcg to
> + * minimize the work. Otherwise, we coalesce multiple flushing requests into a
> + * single flush of the root memcg. When there is an ongoing root flush, we wait
> + * for its completion (unless otherwise requested), to get fresh stats. If the
> + * number of waiters exceeds the number of cpus just skip the flush to bound the
> + * flush latency at the tail with very high concurrency.
> + *
> + * This is a trade-off between stats accuracy and flush latency.
> + */
> +static void do_flush_stats(struct mem_cgroup *memcg, bool wait)
>   {
> +	static DECLARE_COMPLETION(root_flush_done);
> +	static DEFINE_SPINLOCK(root_flusher_lock);
> +	static DEFINE_MUTEX(subtree_flush_mutex);
> +	static atomic_t waiters = ATOMIC_INIT(0);
> +	static bool root_flush_ongoing;
> +	bool root_flusher = false;
> +
> +	/* Ongoing root flush, just wait for it (unless otherwise requested) */
> +	if (READ_ONCE(root_flush_ongoing))
> +		goto root_flush_or_wait;
> +
>   	/*
> -	 * We always flush the entire tree, so concurrent flushers can just
> -	 * skip. This avoids a thundering herd problem on the rstat global lock
> -	 * from memcg flushers (e.g. reclaim, refault, etc).
> +	 * Opportunistically try to only flush the requested subtree. Otherwise
> +	 * fallback to a coalesced flush below.
>   	 */
> -	if (atomic_read(&stats_flush_ongoing) ||
> -	    atomic_xchg(&stats_flush_ongoing, 1))
> +	if (!mem_cgroup_is_root(memcg) && mutex_trylock(&subtree_flush_mutex)) {
> +		cgroup_rstat_flush(memcg->css.cgroup);
> +		mutex_unlock(&subtree_flush_mutex);
>   		return;
> +	}

If mutex_trylock() is the only way to acquire subtree_flush_mutex, you
don't really need a mutex. Just a simple integer flag with an xchg()
call should be enough.
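
Something along these lines (untested sketch; the "subtree_flush_busy"
name is made up):

	static atomic_t subtree_flush_busy = ATOMIC_INIT(0);

	/*
	 * Opportunistic subtree flush. If another subtree flush is
	 * already in flight, fall through to the coalesced root flush
	 * instead of waiting.
	 */
	if (!mem_cgroup_is_root(memcg) &&
	    !atomic_xchg(&subtree_flush_busy, 1)) {
		cgroup_rstat_flush(memcg->css.cgroup);
		atomic_set(&subtree_flush_busy, 0);
		return;
	}

atomic_xchg() is fully ordered, so the bare flag should be all that is
needed here.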

Cheers,
Longman



Thread overview (19+ messages):
2023-09-13  7:38 [PATCH 0/3] memcg: more sophisticated stats flushing Yosry Ahmed
2023-09-13  7:38 ` [PATCH 1/3] mm: memcg: change flush_next_time to flush_last_time Yosry Ahmed
2023-09-13  7:38 ` [PATCH 2/3] mm: memcg: rename stats_flush_threshold to stats_updates_order Yosry Ahmed
2023-09-13  7:38 ` [PATCH 3/3] mm: memcg: optimize stats flushing for latency and accuracy Yosry Ahmed
2023-09-13 15:37   ` Johannes Weiner
2023-09-13 16:26     ` Yosry Ahmed
2023-09-14 16:06       ` Johannes Weiner
2023-09-14 17:22         ` Yosry Ahmed
2023-09-14 17:26           ` Yosry Ahmed
2023-09-19  5:46           ` Yosry Ahmed
2023-09-14 17:19   ` Waiman Long [this message]
2023-09-14 17:23     ` Yosry Ahmed
2023-09-14 17:36       ` Waiman Long
2023-09-14 17:36   ` Shakeel Butt
2023-09-14 17:56     ` Yosry Ahmed
2023-09-14 22:58       ` Shakeel Butt
2023-09-14 23:30         ` Yosry Ahmed
2023-09-15  1:01           ` Shakeel Butt
2023-09-19  5:29             ` Yosry Ahmed
