From: Yosry Ahmed <yosryahmed@google.com>
To: Tejun Heo <tj@kernel.org>
Cc: "Zefan Li" <lizefan.x@bytedance.com>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Michal Hocko" <mhocko@kernel.org>,
"Shakeel Butt" <shakeelb@google.com>,
"Roman Gushchin" <roman.gushchin@linux.dev>,
"Michal Koutný" <mkoutny@suse.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
Linux-MM <linux-mm@kvack.org>, Cgroups <cgroups@vger.kernel.org>,
"Greg Thelen" <gthelen@google.com>
Subject: Re: [RFC] memcg rstat flushing optimization
Date: Mon, 10 Oct 2022 17:15:33 -0700
Message-ID: <CAJD7tkZZuDwGHDjAsOde0VjDm9YcKWnWUGHg43q79hcffZH5Xw@mail.gmail.com>
In-Reply-To: <CAJD7tkZOw9hrc0jKYqYW1ysGZNjSVDgjhCyownBRmpS+UUCP3A@mail.gmail.com>
On Wed, Oct 5, 2022 at 11:38 AM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Wed, Oct 5, 2022 at 11:22 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > Hello,
> >
> > On Wed, Oct 05, 2022 at 11:02:23AM -0700, Yosry Ahmed wrote:
> > > > I was thinking more that being done inside the flush function.
> > >
> > > I think the flush function already does that in some sense if
> > > might_sleep is true, right? The problem here is that we are using
> >
> > Oh I forgot about that. Right.
> >
> > ...
> > > I took kdumps from a couple of crashed machines and ran a script
> > > to traverse updated memcgs and check how many cpus have updates
> > > and how many updates there are on each cpu. I found that on
> > > average only a couple of stats are updated per-cpu per-cgroup, and
> > > on less than 25% of cpus (but this is on a large machine, I expect
> > > the number to go higher on smaller machines), which is why I
> > > suggested a bitmask. I understand though that this depends on
> > > whatever workloads were running on those machines, and that in the
> > > case where most stats are updated the bitmask will actually make
> > > things slightly worse.
> >
> > One worry I have about selective flushing is that it's only gonna improve
> > things by some multiples while we can reasonably increase the problem size
> > by orders of magnitude.
>
> I think we would usually want to flush only a few stats (< 5?) in
> irqsafe contexts out of over 100, so I would say the improvement would
> be good, but yeah, the problem size can reasonably increase by more
> than that. It also depends on which stats we selectively flush. If
> they are not in the same cache line we might end up bringing a lot of
> stats into the cpu cache anyway.
>
> >
> > The only real ways out I can think of are:
> >
> > * Implement a periodic flusher which keeps the stats needed in the
> >   irqsafe path acceptably up to date, to avoid flushing with irq
> >   disabled. We can make this adaptive too - no reason to do all this
> >   if the number to flush isn't huge.
>
> We do have a periodic flusher today for memcg stats (see
> flush_memcg_stats_dwork). It calls __mem_cgroup_flush_stats(), which
> only flushes if the total number of updates is over a certain
> threshold.
> mem_cgroup_flush_stats_delayed(), which is called in the page fault
> path, only does a flush if the last flush was a certain while ago. We
> don't use the delayed version in all irqsafe contexts though, and I am
> not the right person to tell if we can.
>
> But I think this is not what you meant. I think you meant only
> flushing the specific stats needed in irqsafe contexts more frequently
> and not invoking a flush at all in irqsafe contexts (or using
> mem_cgroup_flush_stats_delayed()..?). Right?
>
> I am not the right person to judge what is acceptably up-to-date to be
> honest, so I would wait for other memcgs folks to chime in on this.
>
> >
> > * Shift some work to the updaters. e.g. in many cases, propagating
> >   per-cpu updates a couple levels up from the update path will
> >   significantly reduce the fanouts and thus the number of entries
> >   which need to be flushed later. It does add on-going overhead, so
> >   it should probably be adaptive or configurable, hopefully the
> >   former.
>
> If we are adding overhead to the updaters, would it be better to
> maintain a bitmask of updated stats, or do you think it would be more
> effective to propagate updates a couple of levels up? I think that to
> propagate updates up in the updater's context we would need percpu
> versions of the "pending" stats, which would also add memory
> consumption.
>
Any thoughts here, Tejun or anyone?
> >
> > Thanks.
> >
> > --
> > tejun
Thread overview: 12+ messages
2022-10-05 1:17 Yosry Ahmed
2022-10-05 16:30 ` Tejun Heo
2022-10-05 17:20 ` Yosry Ahmed
2022-10-05 17:42 ` Tejun Heo
2022-10-05 18:02 ` Yosry Ahmed
2022-10-05 18:22 ` Tejun Heo
2022-10-05 18:38 ` Yosry Ahmed
2022-10-06 2:13 ` Yosry Ahmed
2022-10-11 0:15 ` Yosry Ahmed [this message]
2022-10-11 0:19 ` Tejun Heo
2022-10-17 18:52 ` Michal Koutný
2022-10-17 21:30 ` Yosry Ahmed