From: Yosry Ahmed <yosryahmed@google.com>
To: Shakeel Butt <shakeelb@google.com>
Cc: "Andrew Morton" <akpm@linux-foundation.org>,
"Johannes Weiner" <hannes@cmpxchg.org>,
"Michal Hocko" <mhocko@kernel.org>,
"Roman Gushchin" <roman.gushchin@linux.dev>,
"Muchun Song" <muchun.song@linux.dev>,
"Ivan Babrou" <ivan@cloudflare.com>, "Tejun Heo" <tj@kernel.org>,
"Michal Koutný" <mkoutny@suse.com>,
"Waiman Long" <longman@redhat.com>,
kernel-team@cloudflare.com, "Wei Xu" <weixugc@google.com>,
"Greg Thelen" <gthelen@google.com>,
linux-mm@kvack.org, cgroups@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
Date: Tue, 10 Oct 2023 14:02:18 -0700 [thread overview]
Message-ID: <CAJD7tka=kjd42oFpTm8FzMpNedxpJCUj-Wn6L=zrFODC610A-A@mail.gmail.com> (raw)
In-Reply-To: <CALvZod5nQrf=Y24u_hzGOTXYBfnt-+bo+cYbRMRpmauTMXJn3Q@mail.gmail.com>
On Tue, Oct 10, 2023 at 1:45 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Mon, Oct 9, 2023 at 8:21 PM Yosry Ahmed <yosryahmed@google.com> wrote:
> >
> > A global counter for the magnitude of memcg stats updates is maintained
> > on the memcg side to avoid invoking rstat flushes when the pending
> > updates are not significant. This avoids unnecessary flushes, which are
> > not very cheap even if there aren't a lot of stats to flush. It also
> > avoids unnecessary lock contention on the underlying global rstat lock.
> >
> > Make this threshold per-memcg. The same scheme is used: percpu (now
> > also per-memcg) counters are incremented in the update path, and only
> > propagated to per-memcg atomics when they exceed a certain threshold.
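
For context, here is a minimal sketch of that scheme (a rough illustration
only; the struct layout and the field and helper names are placeholders, not
necessarily what the patch itself uses):

struct memcg_vmstats_percpu {
        /* ... existing per-cpu stats ... */
        unsigned int stats_updates;     /* pending updates on this CPU */
};

struct memcg_vmstats {
        /* ... existing aggregated stats ... */
        atomic64_t stats_updates;       /* pending updates in this subtree */
};

static void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
{
        unsigned int pending;

        if (!val)
                return;

        cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());

        /* Walk the ancestors, as charging and rstat insertion already do. */
        for (; memcg; memcg = parent_mem_cgroup(memcg)) {
                pending = __this_cpu_add_return(
                                memcg->vmstats_percpu->stats_updates,
                                abs(val));
                if (pending < MEMCG_CHARGE_BATCH)
                        continue;
                /* Fold the per-cpu count into the per-memcg atomic. */
                atomic64_add(pending, &memcg->vmstats->stats_updates);
                __this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
        }
}

static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
{
        /* Only take the global rstat lock if enough updates accumulated. */
        return atomic64_read(&vmstats->stats_updates) >
                MEMCG_CHARGE_BATCH * num_online_cpus();
}

The flush path would then consult memcg_vmstats_needs_flush() for the memcg
being flushed (and reset the pending counts after flushing), instead of
checking a single global counter.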
> >
> > This provides two benefits:
> > (a) On large machines with a lot of memcgs, the global threshold can be
> > reached relatively fast, so guarding the underlying lock becomes less
> > effective. Making the threshold per-memcg avoids this.
> >
> > (b) Having a global threshold makes it hard to do subtree flushes, as we
> > cannot reset the global counter except for a full flush. Per-memcg
> > counters remove this blocker and allow subtree flushes, which helps
> > avoid unnecessary work when the stats of a small subtree are
> > needed.
> >
> > Nothing is free, of course. This comes at a cost:
> > (a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
> > bytes. The extra memory usage is insignificant.
> >
> > (b) More work on the update side, although in the common case it will
> > only be percpu counter updates. The amount of work scales with the
> > number of ancestors (i.e. tree depth). This is not a new concept; adding
> > a cgroup to the rstat tree involves a parent loop, and so does charging.
> > Testing results below show no significant regressions.
> >
> > (c) The error margin in the stats for the system as a whole increases
> > from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
> > NR_MEMCGS. This is probably fine because we have a similar per-memcg
> > error in charges coming from percpu stocks, and we have a periodic
> > flusher that makes sure we always flush all the stats every 2s anyway.
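
To put rough numbers on the memory cost in (a) and the error margin in (c)
(purely illustrative, and assuming MEMCG_CHARGE_BATCH is 64): on a 256-CPU
machine with 1000 memcgs, the new per-cpu counters cost about
256 * 1000 * 4 bytes ~= 1 MB system-wide, while the worst-case unflushed
error for the whole system grows from 256 * 64 = ~16K pending updates to
256 * 64 * 1000 = ~16M, which the periodic 2s flush bounds in practice.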
> >
> > This patch was tested to make sure no significant regressions are
> > introduced on the update path as follows. The following benchmarks were
> > run in a cgroup that is 4 levels deep (/sys/fs/cgroup/a/b/c/d), which is
> > deeper than a usual setup:
> >
> > (a) neper [1] with 1000 flows and 100 threads (single machine). The
> > values in the table are the average of server and client throughputs in
> > mbps after 30 iterations, each running for 30s:
> >
> > tcp_rr tcp_stream
> > Base 9504218.56 357366.84
> > Patched 9656205.68 356978.39
> > Delta +1.6% -0.1%
> > Standard Deviation 0.95% 1.03%
> >
> > An increase in the performance of tcp_rr doesn't really make sense, but
> > it's probably in the noise. The same tests were run with 1 flow and 1
> > thread, but the throughput was too noisy to draw any conclusions
> > (nonetheless, the averages did not show regressions).
> >
> > Looking at perf for one iteration of the above test, __mod_memcg_state()
> > (which is where memcg_rstat_updated() is called) does not show up at all
> > without this patch, but it shows up with this patch as 1.06% for tcp_rr
> > and 0.36% for tcp_stream.
> >
> > (b) "stress-ng --vm 0 -t 1m --times --perf". I don't understand
> > stress-ng very well, so I am not sure that's the best way to test this,
> > but it spawns 384 workers and spits out a lot of metrics, which looks
> > nice :) I picked a few that seem relevant to the stats update path. I
> > also included cache misses, as this patch introduces more atomics that
> > may bounce between CPU caches:
> >
> > Metric Base Patched Delta
> > Cache Misses 3.394 B/sec 3.433 B/sec +1.14%
> > Cache L1D Read 0.148 T/sec 0.154 T/sec +4.05%
> > Cache L1D Read Miss 20.430 B/sec 21.820 B/sec +6.8%
> > Page Faults Total 4.304 M/sec 4.535 M/sec +5.4%
> > Page Faults Minor 4.304 M/sec 4.535 M/sec +5.4%
> > Page Faults Major 18.794 /sec 0.000 /sec
> > Kmalloc 0.153 M/sec 0.152 M/sec -0.65%
> > Kfree 0.152 M/sec 0.153 M/sec +0.65%
> > MM Page Alloc 4.640 M/sec 4.898 M/sec +5.56%
> > MM Page Free 4.639 M/sec 4.897 M/sec +5.56%
> > Lock Contention Begin 0.362 M/sec 0.479 M/sec +32.32%
> > Lock Contention End 0.362 M/sec 0.479 M/sec +32.32%
> > page-cache add 238.057 /sec 0.000 /sec
> > page-cache del 6.265 /sec 6.267 /sec -0.03%
> >
> > This is based on only a single run in each case. I am not sure what to
> > make of most of these numbers, but they mostly seem to be in the noise
> > (some better, some worse). The lock contention numbers are interesting.
> > I am not sure if higher is better or worse here. No new locks or lock
> > sections are introduced by this patch either way.
> >
> > Looking at perf, __mod_memcg_state() shows up as 0.00% with and without
> > this patch. This is suspicious, but I verified, while stress-ng was
> > running, that all the threads were in the right cgroup.
> >
> > (c) will-it-scale page_fault tests. These tests (specifically
> > per_process_ops in the page_fault3 test) previously detected a 25.9%
> > regression for a change in the stats update path [2]. These are the
> > numbers from 30 runs (+ is good):
> >
> > LABEL | MEAN | MEDIAN | STDDEV |
> > ------------------------------+-------------+-------------+-------------
> > page_fault1_per_process_ops | | | |
> > (A) base | 265207.738 | 262941.000 | 12112.379 |
> > (B) patched | 249249.191 | 248781.000 | 8767.457 |
> > | -6.02% | -5.39% | |
> > page_fault1_per_thread_ops | | | |
> > (A) base | 241618.484 | 240209.000 | 10162.207 |
> > (B) patched | 229820.671 | 229108.000 | 7506.582 |
> > | -4.88% | -4.62% | |
> > page_fault1_scalability | | | |
> > (A) base | 0.03545 | 0.035705 | 0.0015837 |
> > (B) patched | 0.029952 | 0.029957 | 0.0013551 |
> > | -9.29% | -9.35% | |
>
> This much regression is not acceptable.
>
> In addition, I ran netperf with the same 4 level hierarchy as you have
> run and I am seeing ~11% regression.

Interesting, I thought neper and netperf should be similar. Let me try
to reproduce this.

Thanks for testing!
>
> More specifically, on a machine with 44 CPUs (an HT-disabled ixion machine):
>
> # for server
> $ netserver -6
>
> # 22 instances of netperf clients
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> (averaged over 4 runs)
>
> base (next-20231009): 33081 MBPS
> patched: 29267 MBPS
>
> So, this series is not acceptable unless this regression is resolved.