From: "Michal Koutný" <mkoutny@suse.com>
To: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
Michal Hocko <mhocko@kernel.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Ivan Babrou <ivan@cloudflare.com>,
Frank Hofmann <fhofmann@cloudflare.com>,
Andrew Morton <akpm@linux-foundation.org>,
cgroups@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, Daniel Dao <dqminh@cloudflare.com>,
stable@vger.kernel.org
Subject: Re: [PATCH] memcg: sync flush only if periodic flush is delayed
Date: Fri, 11 Mar 2022 17:00:51 +0100
Message-ID: <20220311160051.GA24796@blackbody.suse.cz>
In-Reply-To: <20220304184040.1304781-1-shakeelb@google.com>
Hello.
TL;DR: rstats are slow but accurate on the reader side. To tackle the
performance regression, no sync flushing at all (periodic flush only)
seems simpler than this patch.
So, I've made an overview for myself of what the relevant changes were
when rstat was introduced.
The amount of work is:
- before
    R: O(1)
    W: O(tree_depth)
- after
    R: O(nr_cpus * nr_cgroups(subtree) * nr_counters)
    W: O(tree_depth)
That doesn't look like a positive change, especially on the reader side.
(In theory, the reader's work would be amortized, but as the original
report here shows, not all workloads dilute the flushes
sufficiently. [1])
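To make the reader-side term concrete, here is a minimal userspace
sketch (the names are mine, this is not the kernel code) of what a
flush conceptually has to do:

	#include <stddef.h>

	#define NR_CPUS		64
	#define NR_COUNTERS	40	/* tens of memcg stats/events */

	struct toy_cgroup {
		long sum[NR_COUNTERS];			/* aggregated totals */
		long pcpu[NR_CPUS][NR_COUNTERS];	/* pending per-CPU deltas */
	};

	/* The reader pays O(nr_cgroups(subtree) * nr_cpus * nr_counters). */
	static void toy_flush(struct toy_cgroup *grp, size_t nr_cgroups)
	{
		for (size_t g = 0; g < nr_cgroups; g++)
			for (size_t cpu = 0; cpu < NR_CPUS; cpu++)
				for (size_t c = 0; c < NR_COUNTERS; c++) {
					grp[g].sum[c] += grp[g].pcpu[cpu][c];
					grp[g].pcpu[cpu][c] = 0;
				}
	}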
The benefit this was traded for is greater accuracy; the possible
error is:
- before
    O(nr_cpus * nr_cgroups(subtree) * MEMCG_CHARGE_BATCH)	(1)
- after
    O(nr_cpus * MEMCG_CHARGE_BATCH)	// sync. flush
    or
    O(flush_period * max_cr)		// periodic flush only (2)
					// max_cr is the per-counter max change rate
So we could argue that if pre-rstat kernels did just fine with error
(1), they would be no worse off with a periodic-only flush, to the
extent that (1) and (2) are comparable.
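To put rough (purely illustrative) numbers on it: with nr_cpus = 64,
MEMCG_CHARGE_BATCH = 64 and a subtree of 100 cgroups, (1) is at worst

	64 * 100 * 64 = 409600 pending events per counter,

while (2) is flush_period * max_cr, e.g. 2 seconds worth of updates;
that is bounded by elapsed time rather than by topology, but it grows
with the workload's change rate.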
On Fri, Mar 04, 2022 at 06:40:40PM +0000, Shakeel Butt <shakeelb@google.com> wrote:
> This patch fixes this regression by making the rstat flushing
> conditional in the performance critical codepaths. More specifically,
> the kernel relies on the async periodic rstat flusher to flush the stats
> and only if the periodic flusher is delayed by more than twice the
> amount of its normal time window then the kernel allows rstat flushing
> from the performance critical codepaths.
I'm not sure whether your patch attempts to solve the problem of
(a) the periodic flush getting stuck or (b) limiting the error on the
refault path.
If it's (a), it should be tackled more systematically (a dedicated wq?).
If it's (b), why not just rely on the periodic flush (self answer: the
comparison of (1) and (2) is workload dependent).
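For reference, my reading of the proposed condition as a plain-C sketch
(the names and the 2s window are my approximations of the patch, not
quotations from it):

	#include <stdbool.h>
	#include <stdint.h>

	#define FLUSH_PERIOD_NS	(2ULL * 1000000000)	/* periodic flush window */

	/* The periodic flusher advances this to now + 2 * FLUSH_PERIOD_NS. */
	static uint64_t flush_next_time_ns;

	/* Hot paths flush synchronously only when the periodic flusher is
	 * overdue by more than one extra window; otherwise they skip. */
	static bool should_flush_sync(uint64_t now_ns)
	{
		return now_ns > flush_next_time_ns;
	}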
> Now the question: what are the side-effects of this change? The worst
> that can happen is the refault codepath will see 4sec old lruvec stats
> and may cause false (or missed) activations of the refaulted page which
> may under-or-overestimate the workingset size. Though that is not very
> concerning as the kernel can already miss or do false activations.
We can't reason about the effect of periodic-only flushing, so this
newly introduced factor would inherit that uncertainty too. I find it
superfluous.
Michal
[1] This is worth looking at in more detail.
From the flush condition we have
cr * Δt = nr_cpus * MEMCG_CHARGE_BATCH
where Δt is time between flushes and cr is global change rate.
cr is composed of all updates together (it corresponds to stats_updates
in memcg_rstat_updated(); max_cr is the change rate per counter)
cr = Σ cr_i <= nr_counters * max_cr
By combining these two we get the shortest time between flushes:
cr * Δt <= nr_counters * max_cr * Δt
nr_cpus * MEMCG_CHARGE_BATCH <= nr_counters * max_cr * Δt
Δt >= (nr_cpus * MEMCG_CHARGE_BATCH) / (nr_counters * max_cr)
We are interested in
R_amort = flush_work / Δt
which is
R_amort <= flush_work * nr_counters * max_cr / (nr_cpus * MEMCG_CHARGE_BATCH)
R_amort: O( nr_cpus * nr_cgroups(subtree) * nr_counters * (nr_counters * max_cr) / (nr_cpus * MEMCG_CHARGE_BATCH) )
R_amort: O( nr_cgroups(subtree) * nr_counters^2 * max_cr / MEMCG_CHARGE_BATCH )
The square looks interesting given there are already tens of counters.
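A toy evaluation of that bound (assumed, illustrative values; max_cr in
particular is workload dependent) shows how quickly the square term
takes over:

	#include <stdio.h>

	int main(void)
	{
		double nr_cgroups = 100, max_cr = 1e4;	/* assumed updates/s */
		double batch = 64;			/* MEMCG_CHARGE_BATCH */

		/* R_amort bound ~ nr_cgroups * nr_counters^2 * max_cr / batch */
		for (int nc = 10; nc <= 40; nc += 10)
			printf("nr_counters=%2d -> bound ~ %.3g ops/s\n",
			       nc, nr_cgroups * nc * nc * max_cr / batch);
		return 0;
	}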
(As the data from Ivan have shown, we can hardly restore the pre-rstat
performance on the read side even with a mere mod_delayed_work().)
This is what you partially solved with the introduction of
NR_MEMCG_EVENTS, but stats_updates was still the sum of all events, so
the flush might still have triggered too frequently.
Maybe a better long-term approach would be to split the counters into
accurate and approximate ones and to reflect that in the error
estimator stats_updates (a rough sketch of the idea follows below).
Or some other optimization of mem_cgroup_css_rstat_flush().
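A rough sketch of that split (hypothetical code, nothing like this
exists in the tree): only the accurate class feeds the error estimator,
so approximate counters can no longer trigger early flushes on their
own:

	enum counter_class { COUNTER_ACCURATE, COUNTER_APPROXIMATE };

	static long stats_updates;	/* error estimator that drives flushing */

	static void counter_updated(enum counter_class class, long delta)
	{
		/* Only accurate counters count towards the flush threshold;
		 * approximate ones are folded in by the periodic flush only. */
		if (class == COUNTER_ACCURATE)
			stats_updates += delta < 0 ? -delta : delta;
	}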