From: Shakeel Butt <shakeel.butt@linux.dev>
To: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>,
Andrew Morton <akpm@linux-foundation.org>,
Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@suse.com>,
Roman Gushchin <roman.gushchin@linux.dev>,
Yu Zhao <yuzhao@google.com>,
Muchun Song <songmuchun@bytedance.com>,
Facebook Kernel Team <kernel-team@meta.com>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
kernel-team <kernel-team@cloudflare.com>
Subject: Re: [PATCH] memcg: use ratelimited stats flush in the reclaim
Date: Mon, 17 Jun 2024 11:01:36 -0700
Message-ID: <rhvafiag6fjkj66ohex3eamoqpsw62bxmwbvd7shsa72rqcile@fvo4nsggjpwg>
In-Reply-To: <0ec3c33c-d9ff-41a5-be94-0142f103b815@kernel.org>

On Mon, Jun 17, 2024 at 05:31:21PM GMT, Jesper Dangaard Brouer wrote:
>
>
> On 16/06/2024 02.28, Yosry Ahmed wrote:
> > On Sat, Jun 15, 2024 at 1:13 AM Shakeel Butt <shakeel.butt@linux.dev> wrote:
> > >
> > > The Meta prod is seeing large amount of stalls in memcg stats flush
> > > from the memcg reclaim code path. At the moment, this specific callsite
> > > is doing a synchronous memcg stats flush. The rstat flush is an
> > > expensive and time consuming operation, so concurrent reclaimers will
> > > busywait on the lock potentially for a long time. Actually this issue is
> > > not unique to Meta and has been observed by Cloudflare [1] as well. For
> > > the Cloudflare case, the stalls were due to contention between kswapd
> > > threads running on their 8 numa node machines which does not make sense
> > > as rstat flush is global and flush from one kswapd thread should be
> > > sufficient for all. Simply replace the synchronous flush with the
> > > ratelimited one.
> > >
>
> Like Yosry, I don't agree that simply using ratelimited flush here is
> the right solution, at least other options need to be investigated first.
I added more detail in my reply to Yosry on why using ratelimited flush
for this specific case is fine.
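For readers following along, here is a rough userspace sketch of what
"ratelimited" means in this discussion: the caller only flushes when the
last flush is older than some staleness window. All names are
hypothetical, and the 2000 ms window is an assumption borrowed from the
"2s" figure mentioned elsewhere in this thread. It is not the actual
memcg code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed staleness window, taken from the "2s" mentioned in the thread. */
#define STALE_AFTER_MS 2000u

/* Hypothetical bookkeeping: time of the last flush and a flush counter. */
static uint64_t last_flush_ms;
static int ratelimited_flushes_done;

/* Stand-in for the expensive synchronous stats flush. */
static void flush_stats(uint64_t now_ms)
{
	ratelimited_flushes_done++;
	last_flush_ms = now_ms;
}

/*
 * Flush only if the stats are older than the staleness window.
 * Returns true if a flush was performed, false if it was skipped.
 */
static bool flush_stats_ratelimited(uint64_t now_ms)
{
	if (now_ms <= last_flush_ms + STALE_AFTER_MS)
		return false;

	flush_stats(now_ms);
	return true;
}
```

With this shape, callers that arrive inside the window return immediately
and work with stats that may be up to STALE_AFTER_MS old.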
[...]
> >
> > I think you already know my opinion about this one :) I don't like it
> > at all, and I will explain why below. I know it may be a necessary
> > evil, but I would like us to make sure there is no other option before
> > going forward with this.
> >
> I'm signing up to solving this somehow, as this is a real prod issue.
>
> An easy way to solve the kswapd issue, would be to reintroduce
> "stats_flush_ongoing" concept, that was reverted in 7d7ef0a4686a ("mm:
> memcg: restore subtree stats flushing") (Author: Yosry Ahmed), and
> introduced in 3cd9992b9302 ("memcg: replace stats_flush_lock with an
> atomic") (Author: Yosry Ahmed).
>
Skipping the flush when "stats_flush_ongoing" is set was there from the start.
> The concept is: If there is an ongoing rstat flush, this time limited to
> the root cgroup, then don't perform the flush. We can only do this for
> the root cgroup tree, as flushing can be done for subtrees, but kswapd
> is always for root tree, so it is good enough to solve the kswapd
> thundering herd problem. We might want to generalize this beyond memcg.
>
No objection from me to this idea of skipping the root memcg flush.
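
Just to make sure we are talking about the same thing, a minimal
userspace sketch of the idea as I read it: if a root flush is already
in flight, other root flushers skip instead of waiting. The flag and
function names below are made up for illustration and the sketch uses
C11 atomics, not kernel primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag: true while some thread is flushing the root tree. */
static atomic_bool root_flush_ongoing;

/* Counts real flushes, only so the sketch can be exercised. */
static int root_flushes_done;

/* Stand-in for the expensive rstat flush of the root tree. */
static void do_expensive_root_flush(void)
{
	root_flushes_done++;
}

/*
 * Returns true if this caller performed the flush, false if it skipped
 * because another root flush was already in flight.
 */
static bool flush_root_unless_ongoing(void)
{
	bool expected = false;

	/* Only one caller can move the flag from false to true. */
	if (!atomic_compare_exchange_strong(&root_flush_ongoing, &expected, true))
		return false;

	do_expensive_root_flush();
	atomic_store(&root_flush_ongoing, false);
	return true;
}
```

Note that a caller that skips returns before the in-flight flush has
finished, so it may still read slightly stale stats.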
>
[...]
>
> > - With the added thresholding code, a flush is only done if there is a
> > significant number of pending updates in the relevant subtree.
> > Choosing the ratelimited approach is intentionally ignoring a
> > significant change in stats (although arguably it could be irrelevant
> > stats).
> >
>
> My production observations are that the thresholding code isn't limiting
> the flushing in practice.
>
Here we need more production data. I remember you mentioned MEMCG_KMEM
being used for most of the updates. Is it possible to get the top 5 (or 10)
most updated stats for your production environment?
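
For context on why the per-stat update counts matter, here is a
simplified sketch of threshold-gated flushing: updates accumulate by
magnitude and a flush only happens once the pending amount crosses a
threshold. The struct, the function names and the threshold value of 64
are all hypothetical and chosen for illustration only.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical threshold; the value is arbitrary for this sketch. */
#define UPDATE_THRESHOLD 64

/* Hypothetical per-subtree bookkeeping. */
struct stats_node {
	long pending_updates;	/* magnitude of updates since the last flush */
	int flushes;		/* number of flushes performed */
};

/* Record an update; positive and negative deltas both count as pending work. */
static void note_update(struct stats_node *n, long delta)
{
	n->pending_updates += labs(delta);
}

/*
 * Flush only when enough updates have accumulated.
 * Returns true if a flush was performed.
 */
static bool flush_if_needed(struct stats_node *n)
{
	if (n->pending_updates <= UPDATE_THRESHOLD)
		return false;

	n->flushes++;
	n->pending_updates = 0;
	return true;
}
```

If a handful of frequently updated stats push the pending count over
the threshold almost immediately, the gate rarely skips anything, which
would match the observation that thresholding does not limit flushing
in practice.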
>
> > - Reclaim code is an iterative process, so not updating the stats on
> > every retry is very counterintuitive. We are retrying reclaim using
> > the same stats and heuristics used by a previous iteration,
> > essentially dismissing the effects of those previous iterations.
> >
> > - Indeterministic behavior like this one is very difficult to debug if
> > it causes problems. The missing updates in the last 2s (or whatever
> > period) could be of any magnitude. We may be ignoring GBs of
> > free/allocated memory. What's worse is, if it causes any problems,
> > tracing it back to this flush will be extremely difficult.
> >
>
> The 2 sec seems like a long period for me.
>
> > What can we do?
> >
> > - Try to make more fundamental improvements to the flushing code (for
> > memcgs or cgroups in general). The per-memcg flushing thresholding is
> > an example of this. For example, if flushing is taking too long
> > because we are flushing all subsystems, it may make sense to have
> > separate rstat trees for separate subsystems.
> >
> > One other thing we can try is add a mutex in the memcg flushing path.
> > I had initially had this in my subtree flushing series [1], but I
> > dropped it as we thought it's not very useful.
>
> I'm running an experimental kernel with rstat lock converted to mutex on
> a number of production servers, and we have not observed any regressions.
> The kswapd thundering herd problem also happens on these machines, but as
> these are sleep-able background threads, it is fine to sleep on the mutex.
>
Sorry, but a global mutex that can be taken by userspace applications
and is needed by the node controller (to read stats) is a no from me. On
multi-tenant systems, global locks causing priority inversion are a real
issue.
>
[...]
>
> My pipe dream is that the kernel can avoid the cost of maintaining the
> cgroup threshold stats for flushing, and instead rely on a dynamic time
> based threshold (in ms area) that has no fast-path overhead :-P
>
Please do expand on what you mean by dynamic time based threshold.