Re: [PATCH V7 1/2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: Yosry Ahmed <yosryahmed@google.com>
To: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: tj@kernel.org, cgroups@vger.kernel.org, shakeel.butt@linux.dev,
	 hannes@cmpxchg.org, lizefan.x@bytedance.com, longman@redhat.com,
	 kernel-team@cloudflare.com, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH V7 1/2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes
Date: Wed, 17 Jul 2024 11:43:17 -0700	[thread overview]
Message-ID: <CAJD7tkaxQeGbTckq-9Oepp=n56i-871SmCBjgjz2VBMOU1L7DA@mail.gmail.com> (raw)
In-Reply-To: <de05ebf2-2baa-493e-a6a8-acf43702824b@kernel.org>

[..]
> > What I think we should be doing is either supporting multiple
> > ongoing_flushers (by replacing the global variable with a per-cgroup
> > flag, perhaps), or biting the bullet and start using a mutex to drop
> > lock yielding. If there aren't any concerns beyond priority inversion
> > (raised by Shakeel), maybe adding a mutex_lock_timeout() variant as I
> > suggested could fix this issue (and also put a bound on the flushing
> > latency in general)?
> >
>
> The mutex_lock_timeout is an interesting idea, but not a primitive we
> have available today, right?

We don't, but Waiman said it shouldn't be difficult to add:
https://lore.kernel.org/lkml/623e62c5-3045-4dca-9f2c-ed15b8d3bad8@redhat.com/

>
> p.s. I have 5 prod machines running with mutex change, and I've not
> gotten any SRE complaints.

That's nice!

>
>
> > (I suppose the latter is preferrable)
> >
> >>
> >> IMHO we should go back to only doing ongoing_flusher for root-cgroup.
> >> There is really low chance of sub-trees flushes being concurrent enough
> >> to benefit from this, and it cause issues and needs (ugly) race handling.
> >>
> >> Further more sub-tree flushing doesn't take as long time as level 0
> >> flushing, which further lower the chance of concurrent flushes.
> >
> > I agree that the race handling is not pretty, and we can try to
> > improve the implementation or handle the race in other ways. However,
> > I think that skipping flushes for subtrees is valuable. I am not sure
> > about the cgroup arrangement in your use case, but we do have cgroups
> > with a lot of tasks in them (and/or child cgroups). If there is memory
> > pressure (or hit the cgroup limit), they all may go into direct
> > reclaim concurrently, so skipping flushes could really be beneficial.
> >
> > Of course, if the difference in complexity is not justified, we can go
> > back to only supporting root cgroups for ongoing_flusher for now. But
> > as I mentioned in the v4 email thread, some of the complexity may
> > still remain anyway as we have multiple root cgroups in v1.
> >
>
> Having an incremental step with "only supporting root cgroups for
> ongoing_flusher for now" is a good step forward IMHO.
> As you could see in grafana plot, this would be a significant production
> improvement on its own, as it avoids wasting CPU resources spinning on
> the lock.

I am not opposed to this at all. All I am saying is, if we need to
handle most complexity anyway due to multiple root cgroups in v1, then
might as well support subtrees too.

>
> Being able to have multiple root cgroups, due to in v1, does pose an
> implementation problem.  Only having a single root, would allow to have
> a race-free cmpxchg scheme.
> Would it be reasonable to only support v2?

The rstat code has so far been v1/v2 agnostic AFAICT, and Google is
still using cgroup v1, so I naturally prefer we keep supporting both
v1 and v2 going forward.

> If so, how can I match on this?

cgroup_on_dfl() is basically testing if a cgroup is on v2, but I
really want to keep v1 included if possible :/

>
> >>
> >> Let's get some quick data on flush times from production, to support my
> >> claim:
> >
> > Thanks for the data. I agree that in general root flushes will be a
> > lot more expensive than subtree flushes, but keep in mind that the
> > data may look really different depends on the cgroup setup. As I
> > mentioned, I think we should still support subtree flushes unless the
> > difference in complexity is not justified.
> >
>
> It would be really valuable if you could provide production data on the
> lock-hold times, just like I did with below bpftrace script...
> Is that possible, please?

Unfortunately, we don't have the infrastructure to do this :/

But again, I am not objecting to only supporting root cgroups as
ongoing flushers for now if there is a justifiable complexity
difference. So this shouldn't be a blocker.

next prev parent reply	other threads:[~2024-07-17 18:44 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-11 13:28 Jesper Dangaard Brouer
2024-07-11 13:29 ` [PATCH V7 2/2 RFC] cgroup/rstat: add tracepoint for ongoing flusher waits Jesper Dangaard Brouer
2024-07-16  8:42 ` [PATCH V7 1/2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes Jesper Dangaard Brouer
2024-07-17  0:35   ` Yosry Ahmed
2024-07-17  3:00     ` Waiman Long
2024-07-17 16:05       ` Yosry Ahmed
2024-07-17 16:36     ` Jesper Dangaard Brouer
2024-07-17 16:49       ` Yosry Ahmed
2024-07-18  8:12         ` Jesper Dangaard Brouer
2024-07-18 15:55           ` Yosry Ahmed
2024-07-19  0:40       ` Shakeel Butt
2024-07-19  3:11         ` Yosry Ahmed
2024-07-19 23:01           ` Shakeel Butt
2024-07-19  7:54         ` Jesper Dangaard Brouer
2024-07-19 22:47           ` Shakeel Butt
2024-07-20  4:52             ` Yosry Ahmed
     [not found]               ` <CAJD7tkaypFa3Nk0jh_ZYJX8YB0i7h9VY2YFXMg7GKzSS+f8H5g@mail.gmail.com>
2024-07-20 15:05                 ` Jesper Dangaard Brouer
2024-07-22 20:02               ` Shakeel Butt
2024-07-22 20:12                 ` Yosry Ahmed
2024-07-22 21:32                   ` Shakeel Butt
2024-07-22 22:58                     ` Shakeel Butt
2024-07-23  6:24                       ` Yosry Ahmed
2024-07-17  0:30 ` Yosry Ahmed
2024-07-17  7:32   ` Jesper Dangaard Brouer
2024-07-17 16:31     ` Yosry Ahmed
2024-07-17 18:17       ` Jesper Dangaard Brouer
2024-07-17 18:43         ` Yosry Ahmed [this message]
2024-07-19 15:07   ` Jesper Dangaard Brouer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAJD7tkaxQeGbTckq-9Oepp=n56i-871SmCBjgjz2VBMOU1L7DA@mail.gmail.com' \
    --to=yosryahmed@google.com \
    --cc=cgroups@vger.kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=hawk@kernel.org \
    --cc=kernel-team@cloudflare.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lizefan.x@bytedance.com \
    --cc=longman@redhat.com \
    --cc=shakeel.butt@linux.dev \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox