From: Shakeel Butt <shakeel.butt@linux.dev>
To: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	 Michal Hocko <mhocko@suse.com>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	 Yu Zhao <yuzhao@google.com>,
	Muchun Song <songmuchun@bytedance.com>,
	 Facebook Kernel Team <kernel-team@meta.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	 kernel-team <kernel-team@cloudflare.com>
Subject: Re: [PATCH] memcg: use ratelimited stats flush in the reclaim
Date: Mon, 17 Jun 2024 11:01:36 -0700
Message-ID: <rhvafiag6fjkj66ohex3eamoqpsw62bxmwbvd7shsa72rqcile@fvo4nsggjpwg>
In-Reply-To: <0ec3c33c-d9ff-41a5-be94-0142f103b815@kernel.org>

On Mon, Jun 17, 2024 at 05:31:21PM GMT, Jesper Dangaard Brouer wrote:
> 
> 
> On 16/06/2024 02.28, Yosry Ahmed wrote:
> > On Sat, Jun 15, 2024 at 1:13 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > > 
> > > The Meta prod is seeing a large number of stalls in memcg stats flush
> > > from the memcg reclaim code path. At the moment, this specific callsite
> > > is doing a synchronous memcg stats flush. The rstat flush is an
> > > expensive and time-consuming operation, so concurrent reclaimers will
> > > busy-wait on the lock, potentially for a long time. Actually this issue is
> > > not unique to Meta and has been observed by Cloudflare [1] as well. For
> > > the Cloudflare case, the stalls were due to contention between kswapd
> > > threads running on their 8-NUMA-node machines, which does not make sense
> > > as rstat flush is global and flush from one kswapd thread should be
> > > sufficient for all. Simply replace the synchronous flush with the
> > > ratelimited one.
> > > 
> 
> Like Yosry, I don't agree that simply using ratelimited flush here is
> the right solution; at least other options need to be investigated first.

I added more detail in my reply to Yosry on why using ratelimited flush
for this specific case is fine.
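
To make the trade-off concrete, here is a minimal userspace sketch of what
"ratelimited" means compared with a synchronous flush. It is a model only,
not the kernel code: struct flush_state, do_flush() and flush_ratelimited()
are hypothetical names, and the period is a parameter (the thread mentions
2 seconds).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct flush_state {
	uint64_t last_flush_ns;   /* time of the last flush that actually ran */
	uint64_t period_ns;       /* minimum spacing between flushes */
	unsigned long nr_flushes; /* number of real flushes performed */
};

/* Stand-in for the expensive flush; a synchronous caller always does this. */
void do_flush(struct flush_state *st, uint64_t now_ns)
{
	st->nr_flushes++;
	st->last_flush_ns = now_ns;
}

/*
 * Ratelimited variant: flush only if the last flush is at least one
 * period old. Returns true if a flush ran, false if the caller has to
 * live with stats that may be up to one period stale.
 */
bool flush_ratelimited(struct flush_state *st, uint64_t now_ns)
{
	assert(st->period_ns > 0);
	if (now_ns - st->last_flush_ns < st->period_ns)
		return false;
	do_flush(st, now_ns);
	return true;
}
```

In this model a caller that gets false back pays nothing, which is the
point of the change; the cost is that it proceeds with stats up to
period_ns old, which is exactly the staleness being debated here.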

[...]
> > 
> > I think you already know my opinion about this one :) I don't like it
> > at all, and I will explain why below. I know it may be a necessary
> > evil, but I would like us to make sure there is no other option before
> > going forward with this.
> > 
> I'm signing up to solving this somehow, as this is a real prod issue.
> 
> An easy way to solve the kswapd issue would be to reintroduce the
> "stats_flush_ongoing" concept, that was reverted in 7d7ef0a4686a ("mm:
> memcg: restore subtree stats flushing") (Author: Yosry Ahmed), and
> introduced in 3cd9992b9302 ("memcg: replace stats_flush_lock with an
> atomic") (Author: Yosry Ahmed).
> 

Skipping the flush when "stats_flush_ongoing" is set was there from the start.

> The concept is: If there is an ongoing rstat flush, this time limited to
> the root cgroup, then don't perform the flush.  We can only do this for
> the root cgroup tree, as flushing can be done for subtrees, but kswapd
> is always for root tree, so it is good enough to solve the kswapd
> thundering herd problem.  We might want to generalize this beyond memcg.
> 

No objection from me to this idea of skipping the root memcg flush.
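
As I read the proposal, it could be sketched roughly as below. This is a
userspace model using C11 atomics, not a patch: struct cgroup_model,
root_flush_ongoing, flush_begin() and flush_end() are all made-up names
for illustration, and a real implementation would have to fit into the
existing rstat locking.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a cgroup with a NULL parent is the root. */
struct cgroup_model {
	struct cgroup_model *parent;
};

/* Set while some thread is flushing the whole (root) tree. */
static atomic_bool root_flush_ongoing;

/*
 * Returns true if the caller should perform the flush. Only root
 * flushes can be skipped: a flush of the root already covers every
 * subtree, while a subtree flush says nothing about the rest.
 */
bool flush_begin(const struct cgroup_model *cg)
{
	bool expected = false;

	assert(cg != NULL);
	if (cg->parent != NULL)
		return true; /* subtree flush, never skipped */

	/* Exactly one concurrent root flusher wins the exchange. */
	return atomic_compare_exchange_strong(&root_flush_ongoing,
					      &expected, true);
}

/* Must be called by a caller for which flush_begin() returned true. */
void flush_end(const struct cgroup_model *cg)
{
	if (cg->parent == NULL)
		atomic_store(&root_flush_ongoing, false);
}
```

Note that in this sketch a skipping thread does not wait for the ongoing
flush to finish, so it continues with whatever stats are visible at that
moment. With 8 kswapd threads all flushing the root, one would flush and
the other seven would return immediately.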

> 
[...]
> 
> > - With the added thresholding code, a flush is only done if there is a
> > significant number of pending updates in the relevant subtree.
> > Choosing the ratelimited approach is intentionally ignoring a
> > significant change in stats (although arguably it could be irrelevant
> > stats).
> > 
> 
> My production observations are that the thresholding code isn't limiting
> the flushing in practice.
> 

Here we need more production data. I remember you mentioned MEMCG_KMEM
being responsible for most of the updates. Is it possible to get the top 5
(or 10) most updated stats for your production environment?
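
For readers following along, the thresholding under discussion can be
pictured with the simplified model below: every stat update bumps a
pending counter on the cgroup and all of its ancestors, and a flush of a
subtree is only worthwhile once that counter exceeds a threshold. The
names (struct memcg_model, note_update(), needs_flush()) and the explicit
threshold argument are assumptions for illustration, not the kernel
implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of per-memcg pending-update accounting. */
struct memcg_model {
	struct memcg_model *parent; /* NULL for the root */
	long pending;               /* magnitude of unflushed updates in subtree */
};

/* Record an update of size delta against cg and every ancestor. */
void note_update(struct memcg_model *cg, long delta)
{
	long mag = delta < 0 ? -delta : delta;

	assert(cg != NULL);
	for (; cg != NULL; cg = cg->parent)
		cg->pending += mag;
}

/* A flush of cg's subtree is only needed past the threshold. */
bool needs_flush(const struct memcg_model *cg, long threshold)
{
	return cg->pending > threshold;
}
```

If a single frequently updated stat keeps the root's pending counter
above the threshold all the time, this check never filters anything for
root flushes, which is why the per-stat update data would be useful.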

> 
> > - Reclaim code is an iterative process, so not updating the stats on
> > every retry is very counterintuitive. We are retrying reclaim using
> > the same stats and heuristics used by a previous iteration,
> > essentially dismissing the effects of those previous iterations.
> > 
> > - Nondeterministic behavior like this one is very difficult to debug if
> > it causes problems. The missing updates in the last 2s (or whatever
> > period) could be of any magnitude. We may be ignoring GBs of
> > free/allocated memory. What's worse is, if it causes any problems,
> > tracing it back to this flush will be extremely difficult.
> > 
> 
> The 2 sec seems like a long period to me.
> 
> > What can we do?
> > 
> > - Try to make more fundamental improvements to the flushing code (for
> > memcgs or cgroups in general). The per-memcg flushing thresholding is
> > an example of this. For example, if flushing is taking too long
> > because we are flushing all subsystems, it may make sense to have
> > separate rstat trees for separate subsystems.
> > 
> > One other thing we can try is adding a mutex in the memcg flushing path.
> > I initially had this in my subtree flushing series [1], but I
> > dropped it as we thought it was not very useful.
> 
> I'm running an experimental kernel with rstat lock converted to mutex on
> a number of production servers, and we have not observed any regressions.
> The kswapd thundering herd problem also happens on these machines, but as
> these are sleepable background threads, it is fine to sleep on the mutex.
> 

Sorry, but a global mutex which can be taken by userspace applications
and is needed by the node controller (to read stats) is a no from me. On
multi-tenant systems, global locks causing priority inversion are a real
issue.

> 
[...]
> 
> My pipe dream is that the kernel can avoid the cost of maintaining the
> cgroup threshold stats for flushing, and instead rely on a dynamic
> time-based threshold (in the ms range) that has no fast-path overhead :-P
> 

Please do expand on what you mean by dynamic time based threshold.



Thread overview: 14+ messages
2024-06-15  8:12 Shakeel Butt
2024-06-16  0:28 ` Yosry Ahmed
2024-06-17 15:31   ` Jesper Dangaard Brouer
2024-06-17 18:01     ` Shakeel Butt [this message]
2024-06-18 15:53       ` Jesper Dangaard Brouer
2024-06-18 18:07         ` Shakeel Butt
2024-06-17 17:20   ` Shakeel Butt
2024-06-24 12:57     ` Yosry Ahmed
2024-06-24 17:02       ` Shakeel Butt
2024-06-24 17:15         ` Yosry Ahmed
2024-06-24 18:59           ` Shakeel Butt
2024-06-24 19:06             ` Yosry Ahmed
2024-06-24 20:01               ` Shakeel Butt
2024-06-24 21:41                 ` Yosry Ahmed
