From: Shakeel Butt <shakeelb@google.com>
To: Yosry Ahmed <yosryahmed@google.com>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Michal Hocko" <mhocko@kernel.org>,
"Roman Gushchin" <roman.gushchin@linux.dev>,
"Muchun Song" <muchun.song@linux.dev>,
"Ivan Babrou" <ivan@cloudflare.com>, "Tejun Heo" <tj@kernel.org>,
"Michal Koutný" <mkoutny@suse.com>,
"Waiman Long" <longman@redhat.com>,
kernel-team@cloudflare.com, "Wei Xu" <weixugc@google.com>,
"Greg Thelen" <gthelen@google.com>,
"Domenico Cerasuolo" <cerasuolodomenico@gmail.com>,
linux-mm@kvack.org, cgroups@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [mm-unstable v4 3/5] mm: memcg: make stats flushing threshold per-memcg
Date: Sat, 2 Dec 2023 07:48:50 +0000
Message-ID: <20231202074850.aisqdvyc5u2kth6r@google.com>
In-Reply-To: <20231129032154.3710765-4-yosryahmed@google.com>
On Wed, Nov 29, 2023 at 03:21:51AM +0000, Yosry Ahmed wrote:
> A global counter for the magnitude of memcg stats update is maintained
> on the memcg side to avoid invoking rstat flushes when the pending
> updates are not significant. This avoids unnecessary flushes, which are
> not very cheap even if there isn't a lot of stats to flush. It also
> avoids unnecessary lock contention on the underlying global rstat lock.
>
> Make this threshold per-memcg. The scheme is followed where percpu (now
> also per-memcg) counters are incremented in the update path, and only
> propagated to per-memcg atomics when they exceed a certain threshold.
>
> This provides two benefits:
> (a) On large machines with a lot of memcgs, the global threshold can be
> reached relatively fast, so guarding the underlying lock becomes less
> effective. Making the threshold per-memcg avoids this.
>
> (b) Having a global threshold makes it hard to do subtree flushes, as we
> cannot reset the global counter except for a full flush. Per-memcg
> counters removes this as a blocker from doing subtree flushes, which
> helps avoid unnecessary work when the stats of a small subtree are
> needed.
>
> Nothing is free, of course. This comes at a cost:
> (a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
> bytes. The extra memory usage is insignificant.
>
> (b) More work on the update side, although in the common case it will
> only be percpu counter updates. The amount of work scales with the
> number of ancestors (i.e. tree depth). This is not a new concept:
> adding a cgroup to the rstat tree involves a similar parent loop, as
> does charging.
> Testing results below show no significant regressions.
>
> (c) The error margin in the stats for the system as a whole increases
> from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
> NR_MEMCGS. This is probably fine because we have a similar per-memcg
> error in charges coming from percpu stocks, and we have a periodic
> flusher that makes sure we always flush all the stats every 2s anyway.
>
> This patch was tested to make sure no significant regressions are
> introduced on the update path as follows. The following benchmarks were
> run in a cgroup that is 2 levels deep (/sys/fs/cgroup/a/b/):
>
> (1) Running 22 instances of netperf on a 44 cpu machine with
> hyperthreading disabled. All instances, as well as netserver, are run
> in the level-2 cgroup:
> # netserver -6
> # netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Averaging 20 runs, the numbers are as follows:
> Base: 40198.0 mbps
> Patched: 38629.7 mbps (-3.9%)
>
> The regression is minimal, especially for 22 instances in the same
> cgroup sharing all ancestors (so updating the same atomics).
>
> (2) will-it-scale page_fault tests. These tests (specifically
> per_process_ops in the page_fault3 test) previously detected a 25.9%
> regression for a change in the stats update path [1]. These are the
> numbers from 10 runs (+ is good) on a machine with 256 cpus:
>
> LABEL | MEAN | MEDIAN | STDDEV |
> ------------------------------+-------------+-------------+-------------
> page_fault1_per_process_ops | | | |
> (A) base | 270249.164 | 265437.000 | 13451.836 |
> (B) patched | 261368.709 | 255725.000 | 13394.767 |
> | -3.29% | -3.66% | |
> page_fault1_per_thread_ops | | | |
> (A) base | 242111.345 | 239737.000 | 10026.031 |
> (B) patched | 237057.109 | 235305.000 | 9769.687 |
> | -2.09% | -1.85% | |
> page_fault1_scalability       |             |             |            |
> (A) base                      | 0.034387    | 0.035168    | 0.0018283  |
> (B) patched                   | 0.033988    | 0.034573    | 0.0018056  |
>                               | -1.16%      | -1.69%      |            |
> page_fault2_per_process_ops   |             |             |            |
> (A) base                      | 203561.836  | 203301.000  | 2550.764   |
> (B) patched                   | 197195.945  | 197746.000  | 2264.263   |
>                               | -3.13%      | -2.73%      |            |
> page_fault2_per_thread_ops    |             |             |            |
> (A) base                      | 171046.473  | 170776.000  | 1509.679   |
> (B) patched                   | 166626.327  | 166406.000  | 768.753    |
>                               | -2.58%      | -2.56%      |            |
> page_fault2_scalability       |             |             |            |
> (A) base                      | 0.054026    | 0.053821    | 0.00062121 |
> (B) patched                   | 0.053329    | 0.05306     | 0.00048394 |
>                               | -1.29%      | -1.41%      |            |
> page_fault3_per_process_ops   |             |             |            |
> (A) base                      | 1295807.782 | 1297550.000 | 5907.585   |
> (B) patched                   | 1275579.873 | 1273359.000 | 8759.160   |
>                               | -1.56%      | -1.86%      |            |
> page_fault3_per_thread_ops    |             |             |            |
> (A) base                      | 391234.164  | 390860.000  | 1760.720   |
> (B) patched                   | 377231.273  | 376369.000  | 1874.971   |
>                               | -3.58%      | -3.71%      |            |
> page_fault3_scalability       |             |             |            |
> (A) base                      | 0.60369     | 0.60072     | 0.0083029  |
> (B) patched                   | 0.61733     | 0.61544     | 0.009855   |
>                               | +2.26%      | +2.45%      |            |
>
> All regressions seem to be minimal, and within the normal variance for
> the benchmark. The fix for [1] assumed that 3% is noise (and there
> were no further practical complaints), so hopefully this means that such
> variations in these microbenchmarks do not reflect on practical
> workloads.
>
> (3) I also ran stress-ng in a nested cgroup and did not observe any
> obvious regressions.
>
> [1] https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/
>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
> Tested-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com>
Acked-by: Shakeel Butt <shakeelb@google.com>