From: Ivan Babrou <ivan@cloudflare.com>
Date: Fri, 11 Aug 2023 15:35:25 -0700
Subject: Re: Expensive memory.stat + cpu.stat reads
To: Waiman Long
Cc: Shakeel Butt, cgroups@vger.kernel.org, Linux MM, kernel-team,
 Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
 Andrew Morton, linux-kernel

On Fri, Aug 11, 2023 at 3:27 PM Waiman Long wrote:
>
> On 8/11/23 18:03, Ivan Babrou wrote:
> > On Fri, Jul 14, 2023 at 5:30 PM Ivan Babrou wrote:
> >> On Thu, Jul 13, 2023 at 4:25 PM Ivan Babrou wrote:
> >>>> My understanding of mem-stat and cpu-stat is that they are independent
> >>>> of each other. In theory, reading one shouldn't affect the performance
> >>>> of reading the others.
> >>>> Since you are doing mem-stat and cpu-stat reading
> >>>> repetitively in a loop, it is likely that all the data are in the cache
> >>>> most of the time, resulting in very fast processing time. If it happens
> >>>> that the specific memory locations of mem-stat and cpu-stat data are such
> >>>> that reading one will cause the other data to be flushed out of the
> >>>> cache and have to be re-read from memory again, you could see
> >>>> significant performance regression.
> >>>>
> >>>> It is one of the possible causes, but I may be wrong.
> >>> Do you think it's somewhat similar to how iterating a matrix in rows
> >>> is faster than in columns due to sequential vs random memory reads?
> >>>
> >>> * https://stackoverflow.com/q/9936132
> >>> * https://en.wikipedia.org/wiki/Row-_and_column-major_order
> >>> * https://en.wikipedia.org/wiki/Loop_interchange
> >>>
> >>> I've had a similar suspicion and it would be good to confirm whether
> >>> it's that or something else. I can probably collect perf counters for
> >>> different runs, but I'm not sure which ones I'll need.
> >>>
> >>> In a similar vein, if we could come up with a tracepoint that would
> >>> tell us the amount of work done (or any other relevant metric that
> >>> would help) during rstat flushing, I can certainly collect that
> >>> information as well for every reading combination.
> >> Since cgroup_rstat_flush_locked appears in flamegraphs for both fast
> >> (discrete) and slow (combined) cases, I grabbed some stats for it:
> >>
> >> * Slow:
> >>
> >> completed: 19.43s [manual / mem-stat + cpu-stat]
> >>
> >> $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
> >> Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
> >> ^C
> >> 00:12:55
> >>      usecs               : count     distribution
> >>          0 -> 1          : 0        |                                        |
> >>          2 -> 3          : 0        |                                        |
> >>          4 -> 7          : 0        |                                        |
> >>          8 -> 15         : 0        |                                        |
> >>         16 -> 31         : 0        |                                        |
> >>         32 -> 63         : 0        |                                        |
> >>         64 -> 127        : 1        |                                        |
> >>        128 -> 255        : 191      |************                            |
> >>        256 -> 511        : 590      |****************************************|
> >>        512 -> 1023       : 186      |************                            |
> >>       1024 -> 2047       : 2        |                                        |
> >>       2048 -> 4095       : 0        |                                        |
> >>       4096 -> 8191       : 0        |                                        |
> >>       8192 -> 16383      : 504      |**********************************      |
> >>      16384 -> 32767      : 514      |**********************************      |
> >>      32768 -> 65535      : 3        |                                        |
> >>      65536 -> 131071     : 1        |                                        |
> >>
> >> avg = 8852 usecs, total: 17633268 usecs, count: 1992
> >>
> >> * Fast:
> >>
> >> completed: 0.95s [manual / mem-stat]
> >> completed: 0.05s [manual / cpu-stat]
> >>
> >> $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
> >> Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
> >> ^C
> >> 00:13:27
> >>      usecs               : count     distribution
> >>          0 -> 1          : 0        |                                        |
> >>          2 -> 3          : 0        |                                        |
> >>          4 -> 7          : 499      |****************************************|
> >>          8 -> 15         : 253      |********************                    |
> >>         16 -> 31         : 191      |***************                         |
> >>         32 -> 63         : 41       |***                                     |
> >>         64 -> 127        : 12       |                                        |
> >>        128 -> 255        : 2        |                                        |
> >>        256 -> 511        : 2        |                                        |
> >>        512 -> 1023       : 0        |                                        |
> >>       1024 -> 2047       : 0        |                                        |
> >>       2048 -> 4095       : 0        |                                        |
> >>       4096 -> 8191       : 0        |                                        |
> >>       8192 -> 16383      : 34       |**                                      |
> >>      16384 -> 32767      : 21       |*                                       |
> >>
> >> avg = 857 usecs, total: 904762 usecs, count: 1055
> >>
> >> There's a different number of calls into cgroup_rstat_flush_locked, and
> >> they are much slower in the slow case. There are also two bands in the
> >> slow case, with the 8ms..32ms range containing half of the calls.
> >>
> >> For mem_cgroup_css_rstat_flush:
> >>
> >> * Slow:
> >>
> >> completed: 32.77s [manual / mem-stat + cpu-stat]
> >>
> >> $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
> >> Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
> >> ^C
> >> 00:21:25
> >>      usecs               : count     distribution
> >>          0 -> 1          : 93078    |*                                       |
> >>          2 -> 3          : 3397714  |****************************************|
> >>          4 -> 7          : 1009440  |***********                             |
> >>          8 -> 15         : 168013   |*                                       |
> >>         16 -> 31         : 93       |                                        |
> >>
> >> avg = 3 usecs, total: 17189289 usecs, count: 4668338
> >>
> >> * Fast:
> >>
> >> completed: 0.16s [manual / mem-stat]
> >> completed: 0.04s [manual / cpu-stat]
> >>
> >> $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
> >> Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
> >> ^C
> >> 00:21:57
> >>      usecs               : count     distribution
> >>          0 -> 1          : 1441     |***                                     |
> >>          2 -> 3          : 18780    |****************************************|
> >>          4 -> 7          : 4826     |**********                              |
> >>          8 -> 15         : 732      |*                                       |
> >>         16 -> 31         : 1        |                                        |
> >>
> >> avg = 3 usecs, total: 89174 usecs, count: 25780
> >>
> >> There's a 181x difference in the number of calls into
> >> mem_cgroup_css_rstat_flush.
> >>
> >> Does this provide a clue? Perhaps cgroup_rstat_cpu_pop_updated is
> >> yielding a ton more iterations for some reason here?
> >>
> >> * https://elixir.bootlin.com/linux/v6.1/source/kernel/cgroup/rstat.c#L196
> >>
> >> It's inlined, but I can place a probe into the loop:
> >>
> >>  7         for_each_possible_cpu(cpu) {
> >>  8                 raw_spinlock_t *cpu_lock =
> >>                            per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
> >> 10                 struct cgroup *pos = NULL;
> >>                    unsigned long flags;
> >>
> >>                    /*
> >>                     * The _irqsave() is needed because cgroup_rstat_lock is
> >>                     * spinlock_t which is a sleeping lock on PREEMPT_RT.
> >>                     * Acquiring this lock with the _irq() suffix only
> >>                     * disables interrupts on a non-PREEMPT_RT kernel.
> >>                     * The raw_spinlock_t below disables interrupts on
> >>                     * both configurations. The _irqsave() ensures that
> >>                     * interrupts are always disabled and later restored.
> >>                     */
> >>                    raw_spin_lock_irqsave(cpu_lock, flags);
> >>                    while ((pos = cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
> >>                            struct cgroup_subsys_state *css;
> >>
> >>                            cgroup_base_stat_flush(pos, cpu);
> >> 26                         bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
> >>
> >> 28                         rcu_read_lock();
> >> 29                         list_for_each_entry_rcu(css, &pos->rstat_css_list,
> >>                                                    rstat_css_node)
> >> 31                                 css->ss->css_rstat_flush(css, cpu);
> >> 32                         rcu_read_unlock();
> >>                    }
> >> 34                 raw_spin_unlock_irqrestore(cpu_lock, flags);
> >>
> >> I added probes on both line 26 and line 31 to catch the middle and inner loops.
> >>
> >> * Slow:
> >>
> >> completed: 32.97s [manual / mem-stat + cpu-stat]
> >>
> >> Performance counter stats for '/tmp/derp':
> >>
> >>          4,702,570      probe:cgroup_rstat_flush_locked_L26
> >>          9,301,436      probe:cgroup_rstat_flush_locked_L31
> >>
> >> * Fast:
> >>
> >> completed: 0.17s [manual / mem-stat]
> >> completed: 0.34s [manual / cpu-stat]
> >>
> >> Performance counter stats for '/tmp/derp':
> >>
> >>             31,769      probe:cgroup_rstat_flush_locked_L26
> >>             62,849      probe:cgroup_rstat_flush_locked_L31
> >>
> >> It definitely looks like cgroup_rstat_cpu_pop_updated is yielding a
> >> lot more positions.
> >>
> >> I'm going to sign off for the week, but let me know if I should place
> >> any more probes to nail this down.
> >
> > I spent some time looking into this and I think I landed on a fix:
> >
> > * https://github.com/bobrik/linux/commit/50b627811d54
> >
> > I'm not 100% sure if it's the right fix for the issue, but it reduces
> > the runtime significantly.
>
> It looks like the overhead of mem_cgroup_flush_stats() may be the cause
> of the performance regression. So what version of the Linux kernel are
> you using?
> From the patch listed in the URL above, it doesn't seem like
> you are using the latest kernel.
>
> The latest upstream kernel already has patches that reduce rstat
> flushing overhead, like commit 11192d9c124 ("memcg: flush stats only if
> updated"). Have you tried to reproduce it with the latest kernel?

We're on v6.1, which is the latest LTS (I mentioned it in the first
message). It has this patch included, and my commit mentions it in the
very first line of the description. I should've also mentioned commit
fd25a9e0e23b ("memcg: unify memcg stat flushing") from the same series,
which my patch is partially reverting.