From: Yosry Ahmed <yosryahmed@google.com>
To: Tejun Heo <tj@kernel.org>
Cc: "Zefan Li" <lizefan.x@bytedance.com>,
	"Johannes Weiner" <hannes@cmpxchg.org>,
	"Michal Hocko" <mhocko@kernel.org>,
	"Shakeel Butt" <shakeelb@google.com>,
	"Roman Gushchin" <roman.gushchin@linux.dev>,
	"Michal Koutný" <mkoutny@suse.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	Linux-MM <linux-mm@kvack.org>, Cgroups <cgroups@vger.kernel.org>,
	"Greg Thelen" <gthelen@google.com>
Subject: Re: [RFC] memcg rstat flushing optimization
Date: Wed, 5 Oct 2022 11:02:23 -0700
Message-ID: <CAJD7tkY8gNNaPneAVFDYcWN9irUvE4ZFW=Hv=5898cWFG1p7rg@mail.gmail.com>
In-Reply-To: <Yz3CH7caP7H/C3gL@slm.duckdns.org>

On Wed, Oct 5, 2022 at 10:43 AM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Wed, Oct 05, 2022 at 10:20:54AM -0700, Yosry Ahmed wrote:
> > > How long were the stalls? Given that rstats are usually flushed by its
> >
> > I think 10 seconds while interrupts are disabled is what we need for a
> > hard lockup, right?
>
> Oh man, that's a long while. I'd really like to learn more about the
> numbers. How many cgroups are being flushed across how many CPUs?

The total number of cgroups is ~11k. Unfortunately, I don't know how
many of them are on the rstat updated tree. The number of CPUs is 256.

In all honesty, a chunk of those cgroups were dying, which is a
different problem, but there is nothing really preventing our users
from creating that many live cgroups. Also, we naturally don't want
the kernel to hit a 10s hard lockup and panic, even if we have a
zombie cgroup problem.

Interestingly, we are on cgroup v1, which means we are only flushing
memcg stats. When we move to cgroup v2 we will also flush blkcg stats
in the same irq-disabled call.

>
> > IIUC you mean that the caller of cgroup_rstat_flush() can call a
> > different variant that only flushes a part of the rstat tree then
> > returns, and the caller makes several calls interleaved by re-enabling
> > irq, right? Because the flushing code seems to already do this
> > internally if the non irqsafe version is used.
>
> I was thinking more that being done inside the flush function.

I think the flush function already does that in some sense if
might_sleep is true, right? The problem here is that we are using
cgroup_rstat_flush_irqsafe(), which can't sleep. Even if we modify
mem_cgroup_flush_stats() so that it doesn't always call the irqsafe
version, we still have code paths that need it, AFAICT. It would help
to limit the callers of the irqsafe version, but it doesn't fix the
problem.
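
For reference, the yielding in the sleepable path looks roughly like
this (paraphrasing kernel/cgroup/rstat.c from memory, so details may
be off; the irqsafe variant passes may_sleep=false and never yields):

  static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
  {
          int cpu;

          for_each_possible_cpu(cpu) {
                  /* ... pop and flush @cpu's updated subtree ... */

                  /* if @may_sleep, play nice and yield if necessary */
                  if (may_sleep && (need_resched() ||
                                    spin_needbreak(&cgroup_rstat_lock))) {
                          spin_unlock_irq(&cgroup_rstat_lock);
                          if (!cond_resched())
                                  cpu_relax();
                          spin_lock_irq(&cgroup_rstat_lock);
                  }
          }
  }

So the mechanism is already there; it's just unusable for the
irq-disabled callers.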

>
> > I think this might be tricky. In this case the path that caused the
> > lockup was memcg_check_events()->mem_cgroup_threshold()->__mem_cgroup_threshold()->mem_cgroup_usage()->mem_cgroup_flush_stats().
> > Interrupts are disabled by callers of memcg_check_events(), but the
> > rstat flush call is made much deeper in the call stack. Whoever is
> > disabling interrupts doesn't have access to pause/resume flushing.
>
> Hmm.... yeah I guess it's worthwhile to experiment with selective flushing
> for specific paths. That said, we'd still need to address the whole flush
> taking long too.

Agreed. The irqsafe paths are the more severe problem, but ideally we
want to optimize flushing in general (which is why I dumped a lot of
ideas in the original email, to see what makes sense to other folks).

>
> > There are also other code paths that used to use
> > cgroup_rstat_flush_irqsafe() directly before mem_cgroup_flush_stats()
> > was introduced like mem_cgroup_wb_stats() [1].
> >
> > This is why I suggested a selective flushing variant of
> > cgroup_rstat_flush_irqsafe(), so that flushers that need irq disabled
> > have the ability to only flush a subset of the stats to avoid long
> > stalls if possible.
>
> I have nothing against selective flushing but it's not a free thing to do
> both in terms of complexity and runtime overhead, so let's get some numbers
> on how much time is spent where.

The problem with acquiring numbers is that rstat flushing depends very
heavily on the workload. The stats represent basically everything that
memcg does. There might be workloads that mostly update only a couple
of stats, and workloads that exercise most of them. There might also
be workloads that are limited to a few CPUs, and others that can run
on all CPUs. The number of memcgs is also a huge factor. It feels like
any artificial benchmark would be significantly non-representative.

I took kdumps from a couple of crashed machines and ran a script to
traverse the updated memcgs and check how many CPUs have updates and
how many updates there are on each CPU. I found that on average only a
couple of stats are updated per-CPU per-cgroup, and on less than 25%
of CPUs (but this is on a large machine; I expect the percentage to be
higher on smaller machines). This is why I suggested a bitmask. I
understand, though, that this depends on whatever workloads were
running on those machines, and that in the case where most stats are
updated the bitmask will actually make things slightly worse.
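
To make the bitmask idea concrete, here is a rough sketch (every name
below is made up for illustration except MEMCG_NR_STAT; locking and
the percpu accessors are elided):

  /* one "dirty" bit per stat index, next to the per-CPU counters */
  struct memcg_stats_percpu {
          long            state[MEMCG_NR_STAT];
          unsigned long   updated[BITS_TO_LONGS(MEMCG_NR_STAT)];
  };

  static void memcg_stat_update(struct memcg_stats_percpu *pcp,
                                int idx, long delta)
  {
          pcp->state[idx] += delta;
          set_bit(idx, pcp->updated);     /* mark for selective flushing */
  }

  static void memcg_stat_flush_one_cpu(struct memcg_stats_percpu *pcp,
                                       long *aggregated)
  {
          int idx;

          /* fold only the stats that actually changed on this CPU */
          for_each_set_bit(idx, pcp->updated, MEMCG_NR_STAT) {
                  aggregated[idx] += pcp->state[idx];
                  pcp->state[idx] = 0;
          }
          bitmap_zero(pcp->updated, MEMCG_NR_STAT);
  }

With ~2 updated stats per CPU per cgroup, the flush walk touches a
handful of entries instead of all MEMCG_NR_STAT of them; the cost is
the extra set_bit() in the update path, which is what makes the
dense-update case slightly worse.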

>
> Thanks.
>
> --
> tejun

