From: Yosry Ahmed <yosryahmed@google.com>
Date: Tue, 10 Oct 2023 14:02:18 -0700
Subject: Re: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
To: Shakeel Butt
Cc: Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin,
 Muchun Song, Ivan Babrou, Tejun Heo, Michal Koutný, Waiman Long,
 kernel-team@cloudflare.com, Wei Xu, Greg Thelen, linux-mm@kvack.org,
 cgroups@vger.kernel.org, linux-kernel@vger.kernel.org

On Tue, Oct 10, 2023 at 1:45 PM Shakeel Butt wrote:
>
> On Mon, Oct 9, 2023 at 8:21 PM Yosry Ahmed wrote:
> >
> > A global counter for the magnitude of memcg stats updates is maintained
> > on the memcg side to avoid invoking rstat flushes when the pending
> > updates are not significant. This avoids unnecessary flushes, which are
> > not very cheap even if there isn't a lot of stats to flush. It also
> > avoids unnecessary lock contention on the underlying global rstat lock.
> >
> > Make this threshold per-memcg.
> > The same scheme is followed: percpu (now
> > also per-memcg) counters are incremented in the update path, and only
> > propagated to per-memcg atomics when they exceed a certain threshold.
> >
> > This provides two benefits:
> > (a) On large machines with a lot of memcgs, the global threshold can be
> > reached relatively fast, so guarding the underlying lock becomes less
> > effective. Making the threshold per-memcg avoids this.
> >
> > (b) Having a global threshold makes it hard to do subtree flushes, as we
> > cannot reset the global counter except for a full flush. Per-memcg
> > counters remove this as a blocker from doing subtree flushes, which
> > helps avoid unnecessary work when the stats of a small subtree are
> > needed.
> >
> > Nothing is free, of course. This comes at a cost:
> > (a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
> > bytes. The extra memory usage is insignificant.
> >
> > (b) More work on the update side, although in the common case it will
> > only be percpu counter updates. The amount of work scales with the
> > number of ancestors (i.e. tree depth). This is not a new concept: adding
> > a cgroup to the rstat tree involves a parent loop, and so does charging.
> > Testing results below show no significant regressions.
> >
> > (c) The error margin in the stats for the system as a whole increases
> > from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
> > NR_MEMCGS. This is probably fine because we have a similar per-memcg
> > error in charges coming from percpu stocks, and we have a periodic
> > flusher that makes sure we always flush all the stats every 2s anyway.
> >
> > This patch was tested to make sure no significant regressions are
> > introduced on the update path as follows. The following benchmarks were
> > run in a cgroup that is 4 levels deep (/sys/fs/cgroup/a/b/c/d), which is
> > deeper than a usual setup:
> >
> > (a) neper [1] with 1000 flows and 100 threads (single machine).
> > The
> > values in the table are the average of server and client throughputs in
> > mbps after 30 iterations, each running for 30s:
> >
> >                        tcp_rr       tcp_stream
> > Base                 9504218.56    357366.84
> > Patched              9656205.68    356978.39
> > Delta                  +1.6%         -0.1%
> > Standard Deviation      0.95%         1.03%
> >
> > An increase in the performance of tcp_rr doesn't really make sense, but
> > it's probably in the noise. The same tests were run with 1 flow and 1
> > thread but the throughput was too noisy to make any conclusions (the
> > averages did not show regressions nonetheless).
> >
> > Looking at perf for one iteration of the above test, __mod_memcg_state()
> > (which is where memcg_rstat_updated() is called) does not show up at all
> > without this patch, but it shows up with this patch as 1.06% for tcp_rr
> > and 0.36% for tcp_stream.
> >
> > (b) "stress-ng --vm 0 -t 1m --times --perf". I don't understand
> > stress-ng very well, so I am not sure that's the best way to test this,
> > but it spawns 384 workers and spits out a lot of metrics, which looks
> > nice :) I picked a few that seem to be relevant to the stats update path.
> > I also included cache misses, as this patch introduces more atomics that
> > may bounce between cpu caches:
> >
> > Metric                  Base          Patched       Delta
> > Cache Misses            3.394 B/sec   3.433 B/sec   +1.14%
> > Cache L1D Read          0.148 T/sec   0.154 T/sec   +4.05%
> > Cache L1D Read Miss     20.430 B/sec  21.820 B/sec  +6.8%
> > Page Faults Total       4.304 M/sec   4.535 M/sec   +5.4%
> > Page Faults Minor       4.304 M/sec   4.535 M/sec   +5.4%
> > Page Faults Major       18.794 /sec   0.000 /sec
> > Kmalloc                 0.153 M/sec   0.152 M/sec   -0.65%
> > Kfree                   0.152 M/sec   0.153 M/sec   +0.65%
> > MM Page Alloc           4.640 M/sec   4.898 M/sec   +5.56%
> > MM Page Free            4.639 M/sec   4.897 M/sec   +5.56%
> > Lock Contention Begin   0.362 M/sec   0.479 M/sec   +32.32%
> > Lock Contention End     0.362 M/sec   0.479 M/sec   +32.32%
> > page-cache add          238.057 /sec  0.000 /sec
> > page-cache del          6.265 /sec    6.267 /sec    -0.03%
> >
> > This is only using a single run in each case. I am not sure what to
> > make of most of these numbers, but they mostly seem in the noise
> > (some better, some worse). The lock contention numbers are interesting.
> > I am not sure if higher is better or worse here. No new locks or lock
> > sections are introduced by this patch either way.
> >
> > Looking at perf, __mod_memcg_state() shows up as 0.00% with and without
> > this patch. This is suspicious, but I verified while stress-ng is
> > running that all the threads are in the right cgroup.
> >
> > (c) will-it-scale page_fault tests. These tests (specifically
> > per_process_ops in the page_fault3 test) detected a 25.9% regression
> > before for a change in the stats update path [2].
These are the > > numbers from 30 runs (+ is good): > > > > LABEL | MEAN | MEDIAN | STDDEV = | > > ------------------------------+-------------+-------------+------------= - > > page_fault1_per_process_ops | | | = | > > (A) base | 265207.738 | 262941.000 | 12112.379 = | > > (B) patched | 249249.191 | 248781.000 | 8767.457 = | > > | -6.02% | -5.39% | = | > > page_fault1_per_thread_ops | | | = | > > (A) base | 241618.484 | 240209.000 | 10162.207 = | > > (B) patched | 229820.671 | 229108.000 | 7506.582 = | > > | -4.88% | -4.62% | = | > > page_fault1_scalability | | | > > (A) base | 0.03545 | 0.035705 | 0.0015837 = | > > (B) patched | 0.029952 | 0.029957 | 0.0013551 = | > > | -9.29% | -9.35% | = | > > This much regression is not acceptable. > > In addition, I ran netperf with the same 4 level hierarchy as you have > run and I am seeing ~11% regression. Interesting, I thought neper and netperf should be similar. Let me try to reproduce this. Thanks for testing! > > More specifically on a machine with 44 CPUs (HT disabled ixion machine): > > # for server > $ netserver -6 > > # 22 instances of netperf clients > $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K > > (averaged over 4 runs) > > base (next-20231009): 33081 MBPS > patched: 29267 MBPS > > So, this series is not acceptable unless this regression is resolved.