Expensive memory.stat + cpu.stat reads

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Expensive memory.stat + cpu.stat reads
@ 2023-06-30 23:22 Ivan Babrou
  2023-07-06  6:20 ` Shakeel Butt
  2023-07-10 14:44 ` Michal Koutný
  0 siblings, 2 replies; 21+ messages in thread
From: Ivan Babrou @ 2023-06-30 23:22 UTC (permalink / raw)
  To: cgroups
  Cc: Linux MM, kernel-team, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	linux-kernel

Hello,

We're seeing CPU load issues with cgroup stats retrieval. I made a
public gist with all the details, including the repro code (which
unfortunately requires heavily loaded hardware) and some flamegraphs:

* https://gist.github.com/bobrik/5ba58fb75a48620a1965026ad30a0a13

I'll repeat the gist of that gist here. Our repro has the following
output after a warm-up run:

completed:  5.17s [manual / mem-stat + cpu-stat]
completed:  5.59s [manual / cpu-stat + mem-stat]
completed:  0.52s [manual / mem-stat]
completed:  0.04s [manual / cpu-stat]

The first two lines do effectively the following:

for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
/sys/fs/cgroup/system.slice/cpu.stat > /dev/null

The latter two are the same thing, but via two loops:

for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/cpu.stat >
/dev/null; done
for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
> /dev/null; done

As you might've noticed from the output, splitting the loop into two
makes the code run 10x faster. This isn't great, because most
monitoring software likes to get all stats for one service before
reading the stats for the next one, which maps to the slow and
expensive way of doing this.

We're running Linux v6.1 (the output is from v6.1.25) with no patches
that touch the cgroup or mm subsystems, so you can assume vanilla
kernel.

From the flamegraph it just looks like rstat flushing takes longer. I
used the following flags on an AMD EPYC 7642 system (our usual pick
cpu-clock was blaming spinlock irqrestore, which was questionable):

perf -e cycles -g --call-graph fp -F 999 -- /tmp/repro

Naturally, there are two questions that arise:

* Is this expected (I guess not, but good to be sure)?
* What can we do to make this better?

I am happy to try out patches or to do some tracing to help understand
this better.

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Expensive memory.stat + cpu.stat reads
  2023-06-30 23:22 Expensive memory.stat + cpu.stat reads Ivan Babrou
@ 2023-07-06  6:20 ` Shakeel Butt
  2023-07-10 23:21   ` Ivan Babrou
  2023-07-10 14:44 ` Michal Koutný
  1 sibling, 1 reply; 21+ messages in thread
From: Shakeel Butt @ 2023-07-06  6:20 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: cgroups, Linux MM, kernel-team, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Andrew Morton, linux-kernel

On Fri, Jun 30, 2023 at 04:22:28PM -0700, Ivan Babrou wrote:
> Hello,
> 
> We're seeing CPU load issues with cgroup stats retrieval. I made a
> public gist with all the details, including the repro code (which
> unfortunately requires heavily loaded hardware) and some flamegraphs:
> 
> * https://gist.github.com/bobrik/5ba58fb75a48620a1965026ad30a0a13
> 
> I'll repeat the gist of that gist here. Our repro has the following
> output after a warm-up run:
> 
> completed:  5.17s [manual / mem-stat + cpu-stat]
> completed:  5.59s [manual / cpu-stat + mem-stat]
> completed:  0.52s [manual / mem-stat]
> completed:  0.04s [manual / cpu-stat]
> 
> The first two lines do effectively the following:
> 
> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
> /sys/fs/cgroup/system.slice/cpu.stat > /dev/null
> 
> The latter two are the same thing, but via two loops:
> 
> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/cpu.stat >
> /dev/null; done
> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
> > /dev/null; done
> 
> As you might've noticed from the output, splitting the loop into two
> makes the code run 10x faster. This isn't great, because most
> monitoring software likes to get all stats for one service before
> reading the stats for the next one, which maps to the slow and
> expensive way of doing this.
> 
> We're running Linux v6.1 (the output is from v6.1.25) with no patches
> that touch the cgroup or mm subsystems, so you can assume vanilla
> kernel.
> 
> From the flamegraph it just looks like rstat flushing takes longer. I
> used the following flags on an AMD EPYC 7642 system (our usual pick
> cpu-clock was blaming spinlock irqrestore, which was questionable):
> 
> perf -e cycles -g --call-graph fp -F 999 -- /tmp/repro
> 
> Naturally, there are two questions that arise:
> 
> * Is this expected (I guess not, but good to be sure)?
> * What can we do to make this better?
> 
> I am happy to try out patches or to do some tracing to help understand
> this better.

Hi Ivan,

Thanks a lot, as always, for reporting this. This is not expected and
should be fixed. Is the issue easy to repro or some specific workload or
high load/traffic is required? Can you repro this with the latest linus
tree? Also do you see any difference of root's cgroup.stat where this
issue happens vs good state?

BTW I am away for next month with very limited connectivity, so expect
slow response.

thanks,
Shakeel


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Expensive memory.stat + cpu.stat reads
  2023-07-06  6:20 ` Shakeel Butt
@ 2023-07-10 23:21   ` Ivan Babrou
  2023-07-11  0:44     ` Waiman Long
  2023-07-15  0:14     ` Ivan Babrou
  0 siblings, 2 replies; 21+ messages in thread
From: Ivan Babrou @ 2023-07-10 23:21 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: cgroups, Linux MM, kernel-team, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Andrew Morton, linux-kernel

On Wed, Jul 5, 2023 at 11:20 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Fri, Jun 30, 2023 at 04:22:28PM -0700, Ivan Babrou wrote:
> > Hello,
> >
> > We're seeing CPU load issues with cgroup stats retrieval. I made a
> > public gist with all the details, including the repro code (which
> > unfortunately requires heavily loaded hardware) and some flamegraphs:
> >
> > * https://gist.github.com/bobrik/5ba58fb75a48620a1965026ad30a0a13
> >
> > I'll repeat the gist of that gist here. Our repro has the following
> > output after a warm-up run:
> >
> > completed:  5.17s [manual / mem-stat + cpu-stat]
> > completed:  5.59s [manual / cpu-stat + mem-stat]
> > completed:  0.52s [manual / mem-stat]
> > completed:  0.04s [manual / cpu-stat]
> >
> > The first two lines do effectively the following:
> >
> > for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
> > /sys/fs/cgroup/system.slice/cpu.stat > /dev/null
> >
> > The latter two are the same thing, but via two loops:
> >
> > for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/cpu.stat >
> > /dev/null; done
> > for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
> > > /dev/null; done
> >
> > As you might've noticed from the output, splitting the loop into two
> > makes the code run 10x faster. This isn't great, because most
> > monitoring software likes to get all stats for one service before
> > reading the stats for the next one, which maps to the slow and
> > expensive way of doing this.
> >
> > We're running Linux v6.1 (the output is from v6.1.25) with no patches
> > that touch the cgroup or mm subsystems, so you can assume vanilla
> > kernel.
> >
> > From the flamegraph it just looks like rstat flushing takes longer. I
> > used the following flags on an AMD EPYC 7642 system (our usual pick
> > cpu-clock was blaming spinlock irqrestore, which was questionable):
> >
> > perf -e cycles -g --call-graph fp -F 999 -- /tmp/repro
> >
> > Naturally, there are two questions that arise:
> >
> > * Is this expected (I guess not, but good to be sure)?
> > * What can we do to make this better?
> >
> > I am happy to try out patches or to do some tracing to help understand
> > this better.
>
> Hi Ivan,
>
> Thanks a lot, as always, for reporting this. This is not expected and
> should be fixed. Is the issue easy to repro or some specific workload or
> high load/traffic is required? Can you repro this with the latest linus
> tree? Also do you see any difference of root's cgroup.stat where this
> issue happens vs good state?

I'm afraid there's no easy way to reproduce. We see it from time to
time in different locations. The one that I was looking at for the
initial email does not reproduce it anymore:

completed:  0.75s [manual / cpu-stat + mem-stat]
completed:  0.72s [manual / mem-stat]
completed:  0.05s [manual / cpu-stat]

I took the top 20 servers by metrics scrape duration for cadvisor and
it it does happen on the slowest ones:

completed: 22.32s [manual / cpu-stat + mem-stat]
completed:  0.34s [manual / mem-stat]
completed:  1.68s [manual / cpu-stat]

Is cadvisor or a similar metrics agent used by Google? Any chance you
could see if your own fleet exhibits this behavior?

Given that this behavior requires full production setup with customer
traffic and long qualification times we have for kernels, I'm not sure
if I can try the linus tree here.

I uploaded the contents of /sys/fs/cgroup/memory.stat here:

* https://gist.github.com/bobrik/9255b5e8ed0a83afb73ebf06b79f07c4

The fast one is v6.1.37 and the slow one is v6.1.25. I'm not sure if
the kernel version makes a difference or if it's a matter of uptime /
traffic profile. The data is from two different locations. The fast
location has gone through an expansion, which meant a full reboot with
a kernel upgrade, so maybe that affected things:

* https://i.imgur.com/x8uyMaF.png

Let me try to reboot the slow location and see if there's any lasting
improvement.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Expensive memory.stat + cpu.stat reads
  2023-07-10 23:21   ` Ivan Babrou
@ 2023-07-11  0:44     ` Waiman Long
  2023-07-13 23:25       ` Ivan Babrou
  2023-07-15  0:14     ` Ivan Babrou
  1 sibling, 1 reply; 21+ messages in thread
From: Waiman Long @ 2023-07-11  0:44 UTC (permalink / raw)
  To: Ivan Babrou, Shakeel Butt
  Cc: cgroups, Linux MM, kernel-team, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Andrew Morton, linux-kernel

On 7/10/23 19:21, Ivan Babrou wrote:
> On Wed, Jul 5, 2023 at 11:20 PM Shakeel Butt <shakeelb@google.com> wrote:
>> On Fri, Jun 30, 2023 at 04:22:28PM -0700, Ivan Babrou wrote:
>>> Hello,
>>>
>>> We're seeing CPU load issues with cgroup stats retrieval. I made a
>>> public gist with all the details, including the repro code (which
>>> unfortunately requires heavily loaded hardware) and some flamegraphs:
>>>
>>> * https://gist.github.com/bobrik/5ba58fb75a48620a1965026ad30a0a13
>>>
>>> I'll repeat the gist of that gist here. Our repro has the following
>>> output after a warm-up run:
>>>
>>> completed:  5.17s [manual / mem-stat + cpu-stat]
>>> completed:  5.59s [manual / cpu-stat + mem-stat]
>>> completed:  0.52s [manual / mem-stat]
>>> completed:  0.04s [manual / cpu-stat]
>>>
>>> The first two lines do effectively the following:
>>>
>>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
>>> /sys/fs/cgroup/system.slice/cpu.stat > /dev/null
>>>
>>> The latter two are the same thing, but via two loops:
>>>
>>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/cpu.stat >
>>> /dev/null; done
>>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
>>>> /dev/null; done
>>> As you might've noticed from the output, splitting the loop into two
>>> makes the code run 10x faster. This isn't great, because most
>>> monitoring software likes to get all stats for one service before
>>> reading the stats for the next one, which maps to the slow and
>>> expensive way of doing this.
>>>
>>> We're running Linux v6.1 (the output is from v6.1.25) with no patches
>>> that touch the cgroup or mm subsystems, so you can assume vanilla
>>> kernel.
>>>
>>>  From the flamegraph it just looks like rstat flushing takes longer. I
>>> used the following flags on an AMD EPYC 7642 system (our usual pick
>>> cpu-clock was blaming spinlock irqrestore, which was questionable):
>>>
>>> perf -e cycles -g --call-graph fp -F 999 -- /tmp/repro
>>>
>>> Naturally, there are two questions that arise:
>>>
>>> * Is this expected (I guess not, but good to be sure)?
>>> * What can we do to make this better?
>>>
>>> I am happy to try out patches or to do some tracing to help understand
>>> this better.
>> Hi Ivan,
>>
>> Thanks a lot, as always, for reporting this. This is not expected and
>> should be fixed. Is the issue easy to repro or some specific workload or
>> high load/traffic is required? Can you repro this with the latest linus
>> tree? Also do you see any difference of root's cgroup.stat where this
>> issue happens vs good state?
> I'm afraid there's no easy way to reproduce. We see it from time to
> time in different locations. The one that I was looking at for the
> initial email does not reproduce it anymore:

My understanding of mem-stat and cpu-stat is that they are independent 
of each other. In theory, reading one shouldn't affect the performance 
of reading the others. Since you are doing mem-stat and cpu-stat reading 
repetitively in a loop, it is likely that all the data are in the cache 
most of the time resulting in very fast processing time. If it happens 
that the specific memory location of mem-stat and cpu-stat data are such 
that reading one will cause the other data to be flushed out of the 
cache and have to be re-read from memory again, you could see 
significant performance regression.

It is one of the possible causes, but I may be wrong.

Cheers,
Longman




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Expensive memory.stat + cpu.stat reads
  2023-07-11  0:44     ` Waiman Long
@ 2023-07-13 23:25       ` Ivan Babrou
  2023-07-14 17:23         ` Waiman Long
  2023-07-15  0:30         ` Ivan Babrou
  0 siblings, 2 replies; 21+ messages in thread
From: Ivan Babrou @ 2023-07-13 23:25 UTC (permalink / raw)
  To: Waiman Long
  Cc: Shakeel Butt, cgroups, Linux MM, kernel-team, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton,
	linux-kernel

On Mon, Jul 10, 2023 at 5:44 PM Waiman Long <longman@redhat.com> wrote:
>
> On 7/10/23 19:21, Ivan Babrou wrote:
> > On Wed, Jul 5, 2023 at 11:20 PM Shakeel Butt <shakeelb@google.com> wrote:
> >> On Fri, Jun 30, 2023 at 04:22:28PM -0700, Ivan Babrou wrote:
> >>> Hello,
> >>>
> >>> We're seeing CPU load issues with cgroup stats retrieval. I made a
> >>> public gist with all the details, including the repro code (which
> >>> unfortunately requires heavily loaded hardware) and some flamegraphs:
> >>>
> >>> * https://gist.github.com/bobrik/5ba58fb75a48620a1965026ad30a0a13
> >>>
> >>> I'll repeat the gist of that gist here. Our repro has the following
> >>> output after a warm-up run:
> >>>
> >>> completed:  5.17s [manual / mem-stat + cpu-stat]
> >>> completed:  5.59s [manual / cpu-stat + mem-stat]
> >>> completed:  0.52s [manual / mem-stat]
> >>> completed:  0.04s [manual / cpu-stat]
> >>>
> >>> The first two lines do effectively the following:
> >>>
> >>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
> >>> /sys/fs/cgroup/system.slice/cpu.stat > /dev/null
> >>>
> >>> The latter two are the same thing, but via two loops:
> >>>
> >>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/cpu.stat >
> >>> /dev/null; done
> >>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
> >>>> /dev/null; done
> >>> As you might've noticed from the output, splitting the loop into two
> >>> makes the code run 10x faster. This isn't great, because most
> >>> monitoring software likes to get all stats for one service before
> >>> reading the stats for the next one, which maps to the slow and
> >>> expensive way of doing this.
> >>>
> >>> We're running Linux v6.1 (the output is from v6.1.25) with no patches
> >>> that touch the cgroup or mm subsystems, so you can assume vanilla
> >>> kernel.
> >>>
> >>>  From the flamegraph it just looks like rstat flushing takes longer. I
> >>> used the following flags on an AMD EPYC 7642 system (our usual pick
> >>> cpu-clock was blaming spinlock irqrestore, which was questionable):
> >>>
> >>> perf -e cycles -g --call-graph fp -F 999 -- /tmp/repro
> >>>
> >>> Naturally, there are two questions that arise:
> >>>
> >>> * Is this expected (I guess not, but good to be sure)?
> >>> * What can we do to make this better?
> >>>
> >>> I am happy to try out patches or to do some tracing to help understand
> >>> this better.
> >> Hi Ivan,
> >>
> >> Thanks a lot, as always, for reporting this. This is not expected and
> >> should be fixed. Is the issue easy to repro or some specific workload or
> >> high load/traffic is required? Can you repro this with the latest linus
> >> tree? Also do you see any difference of root's cgroup.stat where this
> >> issue happens vs good state?
> > I'm afraid there's no easy way to reproduce. We see it from time to
> > time in different locations. The one that I was looking at for the
> > initial email does not reproduce it anymore:
>
> My understanding of mem-stat and cpu-stat is that they are independent
> of each other. In theory, reading one shouldn't affect the performance
> of reading the others. Since you are doing mem-stat and cpu-stat reading
> repetitively in a loop, it is likely that all the data are in the cache
> most of the time resulting in very fast processing time. If it happens
> that the specific memory location of mem-stat and cpu-stat data are such
> that reading one will cause the other data to be flushed out of the
> cache and have to be re-read from memory again, you could see
> significant performance regression.
>
> It is one of the possible causes, but I may be wrong.

Do you think it's somewhat similar to how iterating a matrix in rows
is faster than in columns due to sequential vs random memory reads?

* https://stackoverflow.com/q/9936132
* https://en.wikipedia.org/wiki/Row-_and_column-major_order
* https://en.wikipedia.org/wiki/Loop_interchange

I've had a similar suspicion and it would be good to confirm whether
it's that or something else. I can probably collect perf counters for
different runs, but I'm not sure which ones I'll need.

In a similar vein, if we could come up with a tracepoint that would
tell us the amount of work done (or any other relevant metric that
would help) during rstat flushing, I can certainly collect that
information as well for every reading combination.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Expensive memory.stat + cpu.stat reads
  2023-07-13 23:25       ` Ivan Babrou
@ 2023-07-14 17:23         ` Waiman Long
  2023-07-15  0:00           ` Ivan Babrou
  2023-07-15  0:30         ` Ivan Babrou
  1 sibling, 1 reply; 21+ messages in thread
From: Waiman Long @ 2023-07-14 17:23 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: Shakeel Butt, cgroups, Linux MM, kernel-team, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton,
	linux-kernel

On 7/13/23 19:25, Ivan Babrou wrote:
> On Mon, Jul 10, 2023 at 5:44 PM Waiman Long <longman@redhat.com> wrote:
>> On 7/10/23 19:21, Ivan Babrou wrote:
>>> On Wed, Jul 5, 2023 at 11:20 PM Shakeel Butt <shakeelb@google.com> wrote:
>>>> On Fri, Jun 30, 2023 at 04:22:28PM -0700, Ivan Babrou wrote:
>>>>> Hello,
>>>>>
>>>>> We're seeing CPU load issues with cgroup stats retrieval. I made a
>>>>> public gist with all the details, including the repro code (which
>>>>> unfortunately requires heavily loaded hardware) and some flamegraphs:
>>>>>
>>>>> * https://gist.github.com/bobrik/5ba58fb75a48620a1965026ad30a0a13
>>>>>
>>>>> I'll repeat the gist of that gist here. Our repro has the following
>>>>> output after a warm-up run:
>>>>>
>>>>> completed:  5.17s [manual / mem-stat + cpu-stat]
>>>>> completed:  5.59s [manual / cpu-stat + mem-stat]
>>>>> completed:  0.52s [manual / mem-stat]
>>>>> completed:  0.04s [manual / cpu-stat]
>>>>>
>>>>> The first two lines do effectively the following:
>>>>>
>>>>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
>>>>> /sys/fs/cgroup/system.slice/cpu.stat > /dev/null
>>>>>
>>>>> The latter two are the same thing, but via two loops:
>>>>>
>>>>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/cpu.stat >
>>>>> /dev/null; done
>>>>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
>>>>>> /dev/null; done
>>>>> As you might've noticed from the output, splitting the loop into two
>>>>> makes the code run 10x faster. This isn't great, because most
>>>>> monitoring software likes to get all stats for one service before
>>>>> reading the stats for the next one, which maps to the slow and
>>>>> expensive way of doing this.
>>>>>
>>>>> We're running Linux v6.1 (the output is from v6.1.25) with no patches
>>>>> that touch the cgroup or mm subsystems, so you can assume vanilla
>>>>> kernel.
>>>>>
>>>>>   From the flamegraph it just looks like rstat flushing takes longer. I
>>>>> used the following flags on an AMD EPYC 7642 system (our usual pick
>>>>> cpu-clock was blaming spinlock irqrestore, which was questionable):
>>>>>
>>>>> perf -e cycles -g --call-graph fp -F 999 -- /tmp/repro
>>>>>
>>>>> Naturally, there are two questions that arise:
>>>>>
>>>>> * Is this expected (I guess not, but good to be sure)?
>>>>> * What can we do to make this better?
>>>>>
>>>>> I am happy to try out patches or to do some tracing to help understand
>>>>> this better.
>>>> Hi Ivan,
>>>>
>>>> Thanks a lot, as always, for reporting this. This is not expected and
>>>> should be fixed. Is the issue easy to repro or some specific workload or
>>>> high load/traffic is required? Can you repro this with the latest linus
>>>> tree? Also do you see any difference of root's cgroup.stat where this
>>>> issue happens vs good state?
>>> I'm afraid there's no easy way to reproduce. We see it from time to
>>> time in different locations. The one that I was looking at for the
>>> initial email does not reproduce it anymore:
>> My understanding of mem-stat and cpu-stat is that they are independent
>> of each other. In theory, reading one shouldn't affect the performance
>> of reading the others. Since you are doing mem-stat and cpu-stat reading
>> repetitively in a loop, it is likely that all the data are in the cache
>> most of the time resulting in very fast processing time. If it happens
>> that the specific memory location of mem-stat and cpu-stat data are such
>> that reading one will cause the other data to be flushed out of the
>> cache and have to be re-read from memory again, you could see
>> significant performance regression.
>>
>> It is one of the possible causes, but I may be wrong.
> Do you think it's somewhat similar to how iterating a matrix in rows
> is faster than in columns due to sequential vs random memory reads?
>
> * https://stackoverflow.com/q/9936132
> * https://en.wikipedia.org/wiki/Row-_and_column-major_order
> * https://en.wikipedia.org/wiki/Loop_interchange

Yes, it is similar to what is being described in those articles.


>
> I've had a similar suspicion and it would be good to confirm whether
> it's that or something else. I can probably collect perf counters for
> different runs, but I'm not sure which ones I'll need.
>
> In a similar vein, if we could come up with a tracepoint that would
> tell us the amount of work done (or any other relevant metric that
> would help) during rstat flushing, I can certainly collect that
> information as well for every reading combination.

The perf-c2c tool may be able to help. The data to look for is how often 
the data is from caches vs direct memory load/store.

Cheers,
Longman



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Expensive memory.stat + cpu.stat reads
  2023-07-14 17:23         ` Waiman Long
@ 2023-07-15  0:00           ` Ivan Babrou
  0 siblings, 0 replies; 21+ messages in thread
From: Ivan Babrou @ 2023-07-15  0:00 UTC (permalink / raw)
  To: Waiman Long
  Cc: Shakeel Butt, cgroups, Linux MM, kernel-team, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton,
	linux-kernel

On Fri, Jul 14, 2023 at 10:23 AM Waiman Long <longman@redhat.com> wrote:
>
> On 7/13/23 19:25, Ivan Babrou wrote:
> > On Mon, Jul 10, 2023 at 5:44 PM Waiman Long <longman@redhat.com> wrote:
> >> On 7/10/23 19:21, Ivan Babrou wrote:
> >>> On Wed, Jul 5, 2023 at 11:20 PM Shakeel Butt <shakeelb@google.com> wrote:
> >>>> On Fri, Jun 30, 2023 at 04:22:28PM -0700, Ivan Babrou wrote:
> >>>>> Hello,
> >>>>>
> >>>>> We're seeing CPU load issues with cgroup stats retrieval. I made a
> >>>>> public gist with all the details, including the repro code (which
> >>>>> unfortunately requires heavily loaded hardware) and some flamegraphs:
> >>>>>
> >>>>> * https://gist.github.com/bobrik/5ba58fb75a48620a1965026ad30a0a13
> >>>>>
> >>>>> I'll repeat the gist of that gist here. Our repro has the following
> >>>>> output after a warm-up run:
> >>>>>
> >>>>> completed:  5.17s [manual / mem-stat + cpu-stat]
> >>>>> completed:  5.59s [manual / cpu-stat + mem-stat]
> >>>>> completed:  0.52s [manual / mem-stat]
> >>>>> completed:  0.04s [manual / cpu-stat]
> >>>>>
> >>>>> The first two lines do effectively the following:
> >>>>>
> >>>>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
> >>>>> /sys/fs/cgroup/system.slice/cpu.stat > /dev/null
> >>>>>
> >>>>> The latter two are the same thing, but via two loops:
> >>>>>
> >>>>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/cpu.stat >
> >>>>> /dev/null; done
> >>>>> for _ in $(seq 1 1000); do cat /sys/fs/cgroup/system.slice/memory.stat
> >>>>>> /dev/null; done
> >>>>> As you might've noticed from the output, splitting the loop into two
> >>>>> makes the code run 10x faster. This isn't great, because most
> >>>>> monitoring software likes to get all stats for one service before
> >>>>> reading the stats for the next one, which maps to the slow and
> >>>>> expensive way of doing this.
> >>>>>
> >>>>> We're running Linux v6.1 (the output is from v6.1.25) with no patches
> >>>>> that touch the cgroup or mm subsystems, so you can assume vanilla
> >>>>> kernel.
> >>>>>
> >>>>>   From the flamegraph it just looks like rstat flushing takes longer. I
> >>>>> used the following flags on an AMD EPYC 7642 system (our usual pick
> >>>>> cpu-clock was blaming spinlock irqrestore, which was questionable):
> >>>>>
> >>>>> perf -e cycles -g --call-graph fp -F 999 -- /tmp/repro
> >>>>>
> >>>>> Naturally, there are two questions that arise:
> >>>>>
> >>>>> * Is this expected (I guess not, but good to be sure)?
> >>>>> * What can we do to make this better?
> >>>>>
> >>>>> I am happy to try out patches or to do some tracing to help understand
> >>>>> this better.
> >>>> Hi Ivan,
> >>>>
> >>>> Thanks a lot, as always, for reporting this. This is not expected and
> >>>> should be fixed. Is the issue easy to repro or some specific workload or
> >>>> high load/traffic is required? Can you repro this with the latest linus
> >>>> tree? Also do you see any difference of root's cgroup.stat where this
> >>>> issue happens vs good state?
> >>> I'm afraid there's no easy way to reproduce. We see it from time to
> >>> time in different locations. The one that I was looking at for the
> >>> initial email does not reproduce it anymore:
> >> My understanding of mem-stat and cpu-stat is that they are independent
> >> of each other. In theory, reading one shouldn't affect the performance
> >> of reading the others. Since you are doing mem-stat and cpu-stat reading
> >> repetitively in a loop, it is likely that all the data are in the cache
> >> most of the time resulting in very fast processing time. If it happens
> >> that the specific memory location of mem-stat and cpu-stat data are such
> >> that reading one will cause the other data to be flushed out of the
> >> cache and have to be re-read from memory again, you could see
> >> significant performance regression.
> >>
> >> It is one of the possible causes, but I may be wrong.
> > Do you think it's somewhat similar to how iterating a matrix in rows
> > is faster than in columns due to sequential vs random memory reads?
> >
> > * https://stackoverflow.com/q/9936132
> > * https://en.wikipedia.org/wiki/Row-_and_column-major_order
> > * https://en.wikipedia.org/wiki/Loop_interchange
>
> Yes, it is similar to what is being described in those articles.
>
>
> >
> > I've had a similar suspicion and it would be good to confirm whether
> > it's that or something else. I can probably collect perf counters for
> > different runs, but I'm not sure which ones I'll need.
> >
> > In a similar vein, if we could come up with a tracepoint that would
> > tell us the amount of work done (or any other relevant metric that
> > would help) during rstat flushing, I can certainly collect that
> > information as well for every reading combination.
>
> The perf-c2c tool may be able to help. The data to look for is how often
> the data is from caches vs direct memory load/store.

It looks like c2c only works for the whole system, not individual
treads. There's a lot of noise from the rest of the system.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Expensive memory.stat + cpu.stat reads
  2023-07-13 23:25       ` Ivan Babrou
  2023-07-14 17:23         ` Waiman Long
@ 2023-07-15  0:30         ` Ivan Babrou
  2023-08-11 22:03           ` Ivan Babrou
  1 sibling, 1 reply; 21+ messages in thread
From: Ivan Babrou @ 2023-07-15  0:30 UTC (permalink / raw)
  To: Waiman Long
  Cc: Shakeel Butt, cgroups, Linux MM, kernel-team, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton,
	linux-kernel

On Thu, Jul 13, 2023 at 4:25 PM Ivan Babrou <ivan@cloudflare.com> wrote:
> > My understanding of mem-stat and cpu-stat is that they are independent
> > of each other. In theory, reading one shouldn't affect the performance
> > of reading the others. Since you are doing mem-stat and cpu-stat reading
> > repetitively in a loop, it is likely that all the data are in the cache
> > most of the time resulting in very fast processing time. If it happens
> > that the specific memory location of mem-stat and cpu-stat data are such
> > that reading one will cause the other data to be flushed out of the
> > cache and have to be re-read from memory again, you could see
> > significant performance regression.
> >
> > It is one of the possible causes, but I may be wrong.
>
> Do you think it's somewhat similar to how iterating a matrix in rows
> is faster than in columns due to sequential vs random memory reads?
>
> * https://stackoverflow.com/q/9936132
> * https://en.wikipedia.org/wiki/Row-_and_column-major_order
> * https://en.wikipedia.org/wiki/Loop_interchange
>
> I've had a similar suspicion and it would be good to confirm whether
> it's that or something else. I can probably collect perf counters for
> different runs, but I'm not sure which ones I'll need.
>
> In a similar vein, if we could come up with a tracepoint that would
> tell us the amount of work done (or any other relevant metric that
> would help) during rstat flushing, I can certainly collect that
> information as well for every reading combination.

Since cgroup_rstat_flush_locked appears in flamegraphs for both fast
(discrete) and slow (combined) cases, I grabbed some stats for it:

* Slow:

completed: 19.43s [manual / mem-stat + cpu-stat]

$ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
^C
00:12:55
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 1        |                                        |
       128 -> 255        : 191      |************                            |
       256 -> 511        : 590      |****************************************|
       512 -> 1023       : 186      |************                            |
      1024 -> 2047       : 2        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 504      |**********************************      |
     16384 -> 32767      : 514      |**********************************      |
     32768 -> 65535      : 3        |                                        |
     65536 -> 131071     : 1        |                                        |

avg = 8852 usecs, total: 17633268 usecs, count: 1992

* Fast:

completed:  0.95s [manual / mem-stat]
completed:  0.05s [manual / cpu-stat]

$ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
^C
00:13:27
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 499      |****************************************|
         8 -> 15         : 253      |********************                    |
        16 -> 31         : 191      |***************                         |
        32 -> 63         : 41       |***                                     |
        64 -> 127        : 12       |                                        |
       128 -> 255        : 2        |                                        |
       256 -> 511        : 2        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 34       |**                                      |
     16384 -> 32767      : 21       |*                                       |

avg = 857 usecs, total: 904762 usecs, count: 1055

There's a different number of calls into cgroup_rstat_flush_locked and
they are much slower in the slow case. There are also two bands in the
slow case, with 8ms..32ms having the half of the calls.

For mem_cgroup_css_rstat_flush:

* Slow:

completed: 32.77s [manual / mem-stat + cpu-stat]

$ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
^C
00:21:25
     usecs               : count     distribution
         0 -> 1          : 93078    |*                                       |
         2 -> 3          : 3397714  |****************************************|
         4 -> 7          : 1009440  |***********                             |
         8 -> 15         : 168013   |*                                       |
        16 -> 31         : 93       |                                        |

avg = 3 usecs, total: 17189289 usecs, count: 4668338

* Fast:

completed:  0.16s [manual / mem-stat]
completed:  0.04s [manual / cpu-stat]

$ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
^C
00:21:57
     usecs               : count     distribution
         0 -> 1          : 1441     |***                                     |
         2 -> 3          : 18780    |****************************************|
         4 -> 7          : 4826     |**********                              |
         8 -> 15         : 732      |*                                       |
        16 -> 31         : 1        |                                        |

avg = 3 usecs, total: 89174 usecs, count: 25780

There's an 181x difference in the number of calls into
mem_cgroup_css_rstat_flush.

Does this provide a clue? Perhaps cgroup_rstat_cpu_pop_updated is
yielding a ton more iterations for some reason here?

* https://elixir.bootlin.com/linux/v6.1/source/kernel/cgroup/rstat.c#L196

It's inlined, but I can place a probe into the loop:

      7         for_each_possible_cpu(cpu) {
      8                 raw_spinlock_t *cpu_lock =
per_cpu_ptr(&cgroup_rstat_cpu_lock,
                                                               cpu);
     10                 struct cgroup *pos = NULL;
                        unsigned long flags;

                        /*
                         * The _irqsave() is needed because cgroup_rstat_lock is
                         * spinlock_t which is a sleeping lock on
PREEMPT_RT. Acquiring
                         * this lock with the _irq() suffix only
disables interrupts on
                         * a non-PREEMPT_RT kernel. The raw_spinlock_t
below disables
                         * interrupts on both configurations. The
_irqsave() ensures
                         * that interrupts are always disabled and
later restored.
                         */
                        raw_spin_lock_irqsave(cpu_lock, flags);
                        while ((pos =
cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
                                struct cgroup_subsys_state *css;

                                cgroup_base_stat_flush(pos, cpu);
     26                         bpf_rstat_flush(pos, cgroup_parent(pos), cpu);

     28                         rcu_read_lock();
     29                         list_for_each_entry_rcu(css,
&pos->rstat_css_list,
                                                        rstat_css_node)
     31                                 css->ss->css_rstat_flush(css, cpu);
     32                         rcu_read_unlock();
                        }
     34                 raw_spin_unlock_irqrestore(cpu_lock, flags);

I added probes on both line 26 and line 31 to catch the middle and inner loops.

* Slow:

completed: 32.97s [manual / mem-stat + cpu-stat]

 Performance counter stats for '/tmp/derp':

         4,702,570      probe:cgroup_rstat_flush_locked_L26
         9,301,436      probe:cgroup_rstat_flush_locked_L31

* Fast:

completed:  0.17s [manual / mem-stat]
completed:  0.34s [manual / cpu-stat]

 Performance counter stats for '/tmp/derp':

            31,769      probe:cgroup_rstat_flush_locked_L26
            62,849      probe:cgroup_rstat_flush_locked_L31

It definitely looks like cgroup_rstat_cpu_pop_updated is yielding a
lot more positions.

I'm going to sign off for the week, but let me know if I should place
any more probes to nail this down.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Expensive memory.stat + cpu.stat reads
  2023-07-15  0:30         ` Ivan Babrou
@ 2023-08-11 22:03           ` Ivan Babrou
  2023-08-11 22:27             ` Waiman Long
  2023-08-11 23:43             ` Yosry Ahmed
  0 siblings, 2 replies; 21+ messages in thread
From: Ivan Babrou @ 2023-08-11 22:03 UTC (permalink / raw)
  To: Waiman Long
  Cc: Shakeel Butt, cgroups, Linux MM, kernel-team, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton,
	linux-kernel

On Fri, Jul 14, 2023 at 5:30 PM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> On Thu, Jul 13, 2023 at 4:25 PM Ivan Babrou <ivan@cloudflare.com> wrote:
> > > My understanding of mem-stat and cpu-stat is that they are independent
> > > of each other. In theory, reading one shouldn't affect the performance
> > > of reading the others. Since you are doing mem-stat and cpu-stat reading
> > > repetitively in a loop, it is likely that all the data are in the cache
> > > most of the time resulting in very fast processing time. If it happens
> > > that the specific memory location of mem-stat and cpu-stat data are such
> > > that reading one will cause the other data to be flushed out of the
> > > cache and have to be re-read from memory again, you could see
> > > significant performance regression.
> > >
> > > It is one of the possible causes, but I may be wrong.
> >
> > Do you think it's somewhat similar to how iterating a matrix in rows
> > is faster than in columns due to sequential vs random memory reads?
> >
> > * https://stackoverflow.com/q/9936132
> > * https://en.wikipedia.org/wiki/Row-_and_column-major_order
> > * https://en.wikipedia.org/wiki/Loop_interchange
> >
> > I've had a similar suspicion and it would be good to confirm whether
> > it's that or something else. I can probably collect perf counters for
> > different runs, but I'm not sure which ones I'll need.
> >
> > In a similar vein, if we could come up with a tracepoint that would
> > tell us the amount of work done (or any other relevant metric that
> > would help) during rstat flushing, I can certainly collect that
> > information as well for every reading combination.
>
> Since cgroup_rstat_flush_locked appears in flamegraphs for both fast
> (discrete) and slow (combined) cases, I grabbed some stats for it:
>
> * Slow:
>
> completed: 19.43s [manual / mem-stat + cpu-stat]
>
> $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
> Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
> ^C
> 00:12:55
>      usecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 0        |                                        |
>          8 -> 15         : 0        |                                        |
>         16 -> 31         : 0        |                                        |
>         32 -> 63         : 0        |                                        |
>         64 -> 127        : 1        |                                        |
>        128 -> 255        : 191      |************                            |
>        256 -> 511        : 590      |****************************************|
>        512 -> 1023       : 186      |************                            |
>       1024 -> 2047       : 2        |                                        |
>       2048 -> 4095       : 0        |                                        |
>       4096 -> 8191       : 0        |                                        |
>       8192 -> 16383      : 504      |**********************************      |
>      16384 -> 32767      : 514      |**********************************      |
>      32768 -> 65535      : 3        |                                        |
>      65536 -> 131071     : 1        |                                        |
>
> avg = 8852 usecs, total: 17633268 usecs, count: 1992
>
> * Fast:
>
> completed:  0.95s [manual / mem-stat]
> completed:  0.05s [manual / cpu-stat]
>
> $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
> Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
> ^C
> 00:13:27
>      usecs               : count     distribution
>          0 -> 1          : 0        |                                        |
>          2 -> 3          : 0        |                                        |
>          4 -> 7          : 499      |****************************************|
>          8 -> 15         : 253      |********************                    |
>         16 -> 31         : 191      |***************                         |
>         32 -> 63         : 41       |***                                     |
>         64 -> 127        : 12       |                                        |
>        128 -> 255        : 2        |                                        |
>        256 -> 511        : 2        |                                        |
>        512 -> 1023       : 0        |                                        |
>       1024 -> 2047       : 0        |                                        |
>       2048 -> 4095       : 0        |                                        |
>       4096 -> 8191       : 0        |                                        |
>       8192 -> 16383      : 34       |**                                      |
>      16384 -> 32767      : 21       |*                                       |
>
> avg = 857 usecs, total: 904762 usecs, count: 1055
>
> There's a different number of calls into cgroup_rstat_flush_locked and
> they are much slower in the slow case. There are also two bands in the
> slow case, with 8ms..32ms having the half of the calls.
>
> For mem_cgroup_css_rstat_flush:
>
> * Slow:
>
> completed: 32.77s [manual / mem-stat + cpu-stat]
>
> $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
> Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
> ^C
> 00:21:25
>      usecs               : count     distribution
>          0 -> 1          : 93078    |*                                       |
>          2 -> 3          : 3397714  |****************************************|
>          4 -> 7          : 1009440  |***********                             |
>          8 -> 15         : 168013   |*                                       |
>         16 -> 31         : 93       |                                        |
>
> avg = 3 usecs, total: 17189289 usecs, count: 4668338
>
> * Fast:
>
> completed:  0.16s [manual / mem-stat]
> completed:  0.04s [manual / cpu-stat]
>
> $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
> Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
> ^C
> 00:21:57
>      usecs               : count     distribution
>          0 -> 1          : 1441     |***                                     |
>          2 -> 3          : 18780    |****************************************|
>          4 -> 7          : 4826     |**********                              |
>          8 -> 15         : 732      |*                                       |
>         16 -> 31         : 1        |                                        |
>
> avg = 3 usecs, total: 89174 usecs, count: 25780
>
> There's an 181x difference in the number of calls into
> mem_cgroup_css_rstat_flush.
>
> Does this provide a clue? Perhaps cgroup_rstat_cpu_pop_updated is
> yielding a ton more iterations for some reason here?
>
> * https://elixir.bootlin.com/linux/v6.1/source/kernel/cgroup/rstat.c#L196
>
> It's inlined, but I can place a probe into the loop:
>
>       7         for_each_possible_cpu(cpu) {
>       8                 raw_spinlock_t *cpu_lock =
> per_cpu_ptr(&cgroup_rstat_cpu_lock,
>                                                                cpu);
>      10                 struct cgroup *pos = NULL;
>                         unsigned long flags;
>
>                         /*
>                          * The _irqsave() is needed because cgroup_rstat_lock is
>                          * spinlock_t which is a sleeping lock on
> PREEMPT_RT. Acquiring
>                          * this lock with the _irq() suffix only
> disables interrupts on
>                          * a non-PREEMPT_RT kernel. The raw_spinlock_t
> below disables
>                          * interrupts on both configurations. The
> _irqsave() ensures
>                          * that interrupts are always disabled and
> later restored.
>                          */
>                         raw_spin_lock_irqsave(cpu_lock, flags);
>                         while ((pos =
> cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
>                                 struct cgroup_subsys_state *css;
>
>                                 cgroup_base_stat_flush(pos, cpu);
>      26                         bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
>
>      28                         rcu_read_lock();
>      29                         list_for_each_entry_rcu(css,
> &pos->rstat_css_list,
>                                                         rstat_css_node)
>      31                                 css->ss->css_rstat_flush(css, cpu);
>      32                         rcu_read_unlock();
>                         }
>      34                 raw_spin_unlock_irqrestore(cpu_lock, flags);
>
> I added probes on both line 26 and line 31 to catch the middle and inner loops.
>
> * Slow:
>
> completed: 32.97s [manual / mem-stat + cpu-stat]
>
>  Performance counter stats for '/tmp/derp':
>
>          4,702,570      probe:cgroup_rstat_flush_locked_L26
>          9,301,436      probe:cgroup_rstat_flush_locked_L31
>
> * Fast:
>
> completed:  0.17s [manual / mem-stat]
> completed:  0.34s [manual / cpu-stat]
>
>  Performance counter stats for '/tmp/derp':
>
>             31,769      probe:cgroup_rstat_flush_locked_L26
>             62,849      probe:cgroup_rstat_flush_locked_L31
>
> It definitely looks like cgroup_rstat_cpu_pop_updated is yielding a
> lot more positions.
>
> I'm going to sign off for the week, but let me know if I should place
> any more probes to nail this down.

I spent some time looking into this and I think I landed on a fix:

* https://github.com/bobrik/linux/commit/50b627811d54

I'm not 100% sure if it's the right fix for the issue, but it reduces
the runtime significantly.

We see a 50x decrease for memory.stat + cpu.stat loop duration with
the patch applied. TL;DR is that memory.stat flushes all cgroups,
while cpu.stat flushes just the subtree you're asking for. Both do it
for cpu and memory at the same time, since both are rstat based. See
the description for the commit linked above for more details.

There are two more graphs to add. I rebooted 4 servers, 2 of which
received my patch and 2 were the control. The reboot happened at
around 05:00Z

First is the latency of scraping cadvisor, where you can clearly see
that the control group is rising with the daily load, while the test
group is staying mostly flat:

* https://i.imgur.com/GMtzHIu.png

Second is the CPU time spent by cadvisor, where you can see a similar picture:

* https://i.imgur.com/LWHt14P.png

Let me know what you think.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Expensive memory.stat + cpu.stat reads
  2023-08-11 22:03           ` Ivan Babrou
@ 2023-08-11 22:27             ` Waiman Long
  2023-08-11 22:35               ` Ivan Babrou
  2023-08-11 23:43             ` Yosry Ahmed
  1 sibling, 1 reply; 21+ messages in thread
From: Waiman Long @ 2023-08-11 22:27 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: Shakeel Butt, cgroups, Linux MM, kernel-team, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton,
	linux-kernel

On 8/11/23 18:03, Ivan Babrou wrote:
> On Fri, Jul 14, 2023 at 5:30 PM Ivan Babrou <ivan@cloudflare.com> wrote:
>> On Thu, Jul 13, 2023 at 4:25 PM Ivan Babrou <ivan@cloudflare.com> wrote:
>>>> My understanding of mem-stat and cpu-stat is that they are independent
>>>> of each other. In theory, reading one shouldn't affect the performance
>>>> of reading the others. Since you are doing mem-stat and cpu-stat reading
>>>> repetitively in a loop, it is likely that all the data are in the cache
>>>> most of the time resulting in very fast processing time. If it happens
>>>> that the specific memory location of mem-stat and cpu-stat data are such
>>>> that reading one will cause the other data to be flushed out of the
>>>> cache and have to be re-read from memory again, you could see
>>>> significant performance regression.
>>>>
>>>> It is one of the possible causes, but I may be wrong.
>>> Do you think it's somewhat similar to how iterating a matrix in rows
>>> is faster than in columns due to sequential vs random memory reads?
>>>
>>> * https://stackoverflow.com/q/9936132
>>> * https://en.wikipedia.org/wiki/Row-_and_column-major_order
>>> * https://en.wikipedia.org/wiki/Loop_interchange
>>>
>>> I've had a similar suspicion and it would be good to confirm whether
>>> it's that or something else. I can probably collect perf counters for
>>> different runs, but I'm not sure which ones I'll need.
>>>
>>> In a similar vein, if we could come up with a tracepoint that would
>>> tell us the amount of work done (or any other relevant metric that
>>> would help) during rstat flushing, I can certainly collect that
>>> information as well for every reading combination.
>> Since cgroup_rstat_flush_locked appears in flamegraphs for both fast
>> (discrete) and slow (combined) cases, I grabbed some stats for it:
>>
>> * Slow:
>>
>> completed: 19.43s [manual / mem-stat + cpu-stat]
>>
>> $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
>> Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
>> ^C
>> 00:12:55
>>       usecs               : count     distribution
>>           0 -> 1          : 0        |                                        |
>>           2 -> 3          : 0        |                                        |
>>           4 -> 7          : 0        |                                        |
>>           8 -> 15         : 0        |                                        |
>>          16 -> 31         : 0        |                                        |
>>          32 -> 63         : 0        |                                        |
>>          64 -> 127        : 1        |                                        |
>>         128 -> 255        : 191      |************                            |
>>         256 -> 511        : 590      |****************************************|
>>         512 -> 1023       : 186      |************                            |
>>        1024 -> 2047       : 2        |                                        |
>>        2048 -> 4095       : 0        |                                        |
>>        4096 -> 8191       : 0        |                                        |
>>        8192 -> 16383      : 504      |**********************************      |
>>       16384 -> 32767      : 514      |**********************************      |
>>       32768 -> 65535      : 3        |                                        |
>>       65536 -> 131071     : 1        |                                        |
>>
>> avg = 8852 usecs, total: 17633268 usecs, count: 1992
>>
>> * Fast:
>>
>> completed:  0.95s [manual / mem-stat]
>> completed:  0.05s [manual / cpu-stat]
>>
>> $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
>> Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
>> ^C
>> 00:13:27
>>       usecs               : count     distribution
>>           0 -> 1          : 0        |                                        |
>>           2 -> 3          : 0        |                                        |
>>           4 -> 7          : 499      |****************************************|
>>           8 -> 15         : 253      |********************                    |
>>          16 -> 31         : 191      |***************                         |
>>          32 -> 63         : 41       |***                                     |
>>          64 -> 127        : 12       |                                        |
>>         128 -> 255        : 2        |                                        |
>>         256 -> 511        : 2        |                                        |
>>         512 -> 1023       : 0        |                                        |
>>        1024 -> 2047       : 0        |                                        |
>>        2048 -> 4095       : 0        |                                        |
>>        4096 -> 8191       : 0        |                                        |
>>        8192 -> 16383      : 34       |**                                      |
>>       16384 -> 32767      : 21       |*                                       |
>>
>> avg = 857 usecs, total: 904762 usecs, count: 1055
>>
>> There's a different number of calls into cgroup_rstat_flush_locked and
>> they are much slower in the slow case. There are also two bands in the
>> slow case, with 8ms..32ms having the half of the calls.
>>
>> For mem_cgroup_css_rstat_flush:
>>
>> * Slow:
>>
>> completed: 32.77s [manual / mem-stat + cpu-stat]
>>
>> $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
>> Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
>> ^C
>> 00:21:25
>>       usecs               : count     distribution
>>           0 -> 1          : 93078    |*                                       |
>>           2 -> 3          : 3397714  |****************************************|
>>           4 -> 7          : 1009440  |***********                             |
>>           8 -> 15         : 168013   |*                                       |
>>          16 -> 31         : 93       |                                        |
>>
>> avg = 3 usecs, total: 17189289 usecs, count: 4668338
>>
>> * Fast:
>>
>> completed:  0.16s [manual / mem-stat]
>> completed:  0.04s [manual / cpu-stat]
>>
>> $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
>> Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
>> ^C
>> 00:21:57
>>       usecs               : count     distribution
>>           0 -> 1          : 1441     |***                                     |
>>           2 -> 3          : 18780    |****************************************|
>>           4 -> 7          : 4826     |**********                              |
>>           8 -> 15         : 732      |*                                       |
>>          16 -> 31         : 1        |                                        |
>>
>> avg = 3 usecs, total: 89174 usecs, count: 25780
>>
>> There's an 181x difference in the number of calls into
>> mem_cgroup_css_rstat_flush.
>>
>> Does this provide a clue? Perhaps cgroup_rstat_cpu_pop_updated is
>> yielding a ton more iterations for some reason here?
>>
>> * https://elixir.bootlin.com/linux/v6.1/source/kernel/cgroup/rstat.c#L196
>>
>> It's inlined, but I can place a probe into the loop:
>>
>>        7         for_each_possible_cpu(cpu) {
>>        8                 raw_spinlock_t *cpu_lock =
>> per_cpu_ptr(&cgroup_rstat_cpu_lock,
>>                                                                 cpu);
>>       10                 struct cgroup *pos = NULL;
>>                          unsigned long flags;
>>
>>                          /*
>>                           * The _irqsave() is needed because cgroup_rstat_lock is
>>                           * spinlock_t which is a sleeping lock on
>> PREEMPT_RT. Acquiring
>>                           * this lock with the _irq() suffix only
>> disables interrupts on
>>                           * a non-PREEMPT_RT kernel. The raw_spinlock_t
>> below disables
>>                           * interrupts on both configurations. The
>> _irqsave() ensures
>>                           * that interrupts are always disabled and
>> later restored.
>>                           */
>>                          raw_spin_lock_irqsave(cpu_lock, flags);
>>                          while ((pos =
>> cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
>>                                  struct cgroup_subsys_state *css;
>>
>>                                  cgroup_base_stat_flush(pos, cpu);
>>       26                         bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
>>
>>       28                         rcu_read_lock();
>>       29                         list_for_each_entry_rcu(css,
>> &pos->rstat_css_list,
>>                                                          rstat_css_node)
>>       31                                 css->ss->css_rstat_flush(css, cpu);
>>       32                         rcu_read_unlock();
>>                          }
>>       34                 raw_spin_unlock_irqrestore(cpu_lock, flags);
>>
>> I added probes on both line 26 and line 31 to catch the middle and inner loops.
>>
>> * Slow:
>>
>> completed: 32.97s [manual / mem-stat + cpu-stat]
>>
>>   Performance counter stats for '/tmp/derp':
>>
>>           4,702,570      probe:cgroup_rstat_flush_locked_L26
>>           9,301,436      probe:cgroup_rstat_flush_locked_L31
>>
>> * Fast:
>>
>> completed:  0.17s [manual / mem-stat]
>> completed:  0.34s [manual / cpu-stat]
>>
>>   Performance counter stats for '/tmp/derp':
>>
>>              31,769      probe:cgroup_rstat_flush_locked_L26
>>              62,849      probe:cgroup_rstat_flush_locked_L31
>>
>> It definitely looks like cgroup_rstat_cpu_pop_updated is yielding a
>> lot more positions.
>>
>> I'm going to sign off for the week, but let me know if I should place
>> any more probes to nail this down.
> I spent some time looking into this and I think I landed on a fix:
>
> * https://github.com/bobrik/linux/commit/50b627811d54
>
> I'm not 100% sure if it's the right fix for the issue, but it reduces
> the runtime significantly.

It looks like the overhead of mem_cgroup_flush_stats() may be the cause 
of the performance regression. So what version of the Linux kernel are 
you using? From the patch listed in the URL above, it doesn't seem like 
you are using the latest kernel.

The latest upstream kernel already have patches that reduce rstat 
flushing overhead like commit 11192d9c124 ("memcg: flush stats only if 
updated"). Have you try to reproduce it with the latest kernel?

Cheers,
Longman



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Expensive memory.stat + cpu.stat reads
  2023-08-11 22:27             ` Waiman Long
@ 2023-08-11 22:35               ` Ivan Babrou
  2023-08-12  2:33                 ` Shakeel Butt
  0 siblings, 1 reply; 21+ messages in thread
From: Ivan Babrou @ 2023-08-11 22:35 UTC (permalink / raw)
  To: Waiman Long
  Cc: Shakeel Butt, cgroups, Linux MM, kernel-team, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton,
	linux-kernel

On Fri, Aug 11, 2023 at 3:27 PM Waiman Long <longman@redhat.com> wrote:
>
> On 8/11/23 18:03, Ivan Babrou wrote:
> > On Fri, Jul 14, 2023 at 5:30 PM Ivan Babrou <ivan@cloudflare.com> wrote:
> >> On Thu, Jul 13, 2023 at 4:25 PM Ivan Babrou <ivan@cloudflare.com> wrote:
> >>>> My understanding of mem-stat and cpu-stat is that they are independent
> >>>> of each other. In theory, reading one shouldn't affect the performance
> >>>> of reading the others. Since you are doing mem-stat and cpu-stat reading
> >>>> repetitively in a loop, it is likely that all the data are in the cache
> >>>> most of the time resulting in very fast processing time. If it happens
> >>>> that the specific memory location of mem-stat and cpu-stat data are such
> >>>> that reading one will cause the other data to be flushed out of the
> >>>> cache and have to be re-read from memory again, you could see
> >>>> significant performance regression.
> >>>>
> >>>> It is one of the possible causes, but I may be wrong.
> >>> Do you think it's somewhat similar to how iterating a matrix in rows
> >>> is faster than in columns due to sequential vs random memory reads?
> >>>
> >>> * https://stackoverflow.com/q/9936132
> >>> * https://en.wikipedia.org/wiki/Row-_and_column-major_order
> >>> * https://en.wikipedia.org/wiki/Loop_interchange
> >>>
> >>> I've had a similar suspicion and it would be good to confirm whether
> >>> it's that or something else. I can probably collect perf counters for
> >>> different runs, but I'm not sure which ones I'll need.
> >>>
> >>> In a similar vein, if we could come up with a tracepoint that would
> >>> tell us the amount of work done (or any other relevant metric that
> >>> would help) during rstat flushing, I can certainly collect that
> >>> information as well for every reading combination.
> >> Since cgroup_rstat_flush_locked appears in flamegraphs for both fast
> >> (discrete) and slow (combined) cases, I grabbed some stats for it:
> >>
> >> * Slow:
> >>
> >> completed: 19.43s [manual / mem-stat + cpu-stat]
> >>
> >> $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
> >> Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
> >> ^C
> >> 00:12:55
> >>       usecs               : count     distribution
> >>           0 -> 1          : 0        |                                        |
> >>           2 -> 3          : 0        |                                        |
> >>           4 -> 7          : 0        |                                        |
> >>           8 -> 15         : 0        |                                        |
> >>          16 -> 31         : 0        |                                        |
> >>          32 -> 63         : 0        |                                        |
> >>          64 -> 127        : 1        |                                        |
> >>         128 -> 255        : 191      |************                            |
> >>         256 -> 511        : 590      |****************************************|
> >>         512 -> 1023       : 186      |************                            |
> >>        1024 -> 2047       : 2        |                                        |
> >>        2048 -> 4095       : 0        |                                        |
> >>        4096 -> 8191       : 0        |                                        |
> >>        8192 -> 16383      : 504      |**********************************      |
> >>       16384 -> 32767      : 514      |**********************************      |
> >>       32768 -> 65535      : 3        |                                        |
> >>       65536 -> 131071     : 1        |                                        |
> >>
> >> avg = 8852 usecs, total: 17633268 usecs, count: 1992
> >>
> >> * Fast:
> >>
> >> completed:  0.95s [manual / mem-stat]
> >> completed:  0.05s [manual / cpu-stat]
> >>
> >> $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
> >> Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
> >> ^C
> >> 00:13:27
> >>       usecs               : count     distribution
> >>           0 -> 1          : 0        |                                        |
> >>           2 -> 3          : 0        |                                        |
> >>           4 -> 7          : 499      |****************************************|
> >>           8 -> 15         : 253      |********************                    |
> >>          16 -> 31         : 191      |***************                         |
> >>          32 -> 63         : 41       |***                                     |
> >>          64 -> 127        : 12       |                                        |
> >>         128 -> 255        : 2        |                                        |
> >>         256 -> 511        : 2        |                                        |
> >>         512 -> 1023       : 0        |                                        |
> >>        1024 -> 2047       : 0        |                                        |
> >>        2048 -> 4095       : 0        |                                        |
> >>        4096 -> 8191       : 0        |                                        |
> >>        8192 -> 16383      : 34       |**                                      |
> >>       16384 -> 32767      : 21       |*                                       |
> >>
> >> avg = 857 usecs, total: 904762 usecs, count: 1055
> >>
> >> There's a different number of calls into cgroup_rstat_flush_locked and
> >> they are much slower in the slow case. There are also two bands in the
> >> slow case, with 8ms..32ms having the half of the calls.
> >>
> >> For mem_cgroup_css_rstat_flush:
> >>
> >> * Slow:
> >>
> >> completed: 32.77s [manual / mem-stat + cpu-stat]
> >>
> >> $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
> >> Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
> >> ^C
> >> 00:21:25
> >>       usecs               : count     distribution
> >>           0 -> 1          : 93078    |*                                       |
> >>           2 -> 3          : 3397714  |****************************************|
> >>           4 -> 7          : 1009440  |***********                             |
> >>           8 -> 15         : 168013   |*                                       |
> >>          16 -> 31         : 93       |                                        |
> >>
> >> avg = 3 usecs, total: 17189289 usecs, count: 4668338
> >>
> >> * Fast:
> >>
> >> completed:  0.16s [manual / mem-stat]
> >> completed:  0.04s [manual / cpu-stat]
> >>
> >> $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
> >> Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
> >> ^C
> >> 00:21:57
> >>       usecs               : count     distribution
> >>           0 -> 1          : 1441     |***                                     |
> >>           2 -> 3          : 18780    |****************************************|
> >>           4 -> 7          : 4826     |**********                              |
> >>           8 -> 15         : 732      |*                                       |
> >>          16 -> 31         : 1        |                                        |
> >>
> >> avg = 3 usecs, total: 89174 usecs, count: 25780
> >>
> >> There's an 181x difference in the number of calls into
> >> mem_cgroup_css_rstat_flush.
> >>
> >> Does this provide a clue? Perhaps cgroup_rstat_cpu_pop_updated is
> >> yielding a ton more iterations for some reason here?
> >>
> >> * https://elixir.bootlin.com/linux/v6.1/source/kernel/cgroup/rstat.c#L196
> >>
> >> It's inlined, but I can place a probe into the loop:
> >>
> >>        7         for_each_possible_cpu(cpu) {
> >>        8                 raw_spinlock_t *cpu_lock =
> >> per_cpu_ptr(&cgroup_rstat_cpu_lock,
> >>                                                                 cpu);
> >>       10                 struct cgroup *pos = NULL;
> >>                          unsigned long flags;
> >>
> >>                          /*
> >>                           * The _irqsave() is needed because cgroup_rstat_lock is
> >>                           * spinlock_t which is a sleeping lock on
> >> PREEMPT_RT. Acquiring
> >>                           * this lock with the _irq() suffix only
> >> disables interrupts on
> >>                           * a non-PREEMPT_RT kernel. The raw_spinlock_t
> >> below disables
> >>                           * interrupts on both configurations. The
> >> _irqsave() ensures
> >>                           * that interrupts are always disabled and
> >> later restored.
> >>                           */
> >>                          raw_spin_lock_irqsave(cpu_lock, flags);
> >>                          while ((pos =
> >> cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
> >>                                  struct cgroup_subsys_state *css;
> >>
> >>                                  cgroup_base_stat_flush(pos, cpu);
> >>       26                         bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
> >>
> >>       28                         rcu_read_lock();
> >>       29                         list_for_each_entry_rcu(css,
> >> &pos->rstat_css_list,
> >>                                                          rstat_css_node)
> >>       31                                 css->ss->css_rstat_flush(css, cpu);
> >>       32                         rcu_read_unlock();
> >>                          }
> >>       34                 raw_spin_unlock_irqrestore(cpu_lock, flags);
> >>
> >> I added probes on both line 26 and line 31 to catch the middle and inner loops.
> >>
> >> * Slow:
> >>
> >> completed: 32.97s [manual / mem-stat + cpu-stat]
> >>
> >>   Performance counter stats for '/tmp/derp':
> >>
> >>           4,702,570      probe:cgroup_rstat_flush_locked_L26
> >>           9,301,436      probe:cgroup_rstat_flush_locked_L31
> >>
> >> * Fast:
> >>
> >> completed:  0.17s [manual / mem-stat]
> >> completed:  0.34s [manual / cpu-stat]
> >>
> >>   Performance counter stats for '/tmp/derp':
> >>
> >>              31,769      probe:cgroup_rstat_flush_locked_L26
> >>              62,849      probe:cgroup_rstat_flush_locked_L31
> >>
> >> It definitely looks like cgroup_rstat_cpu_pop_updated is yielding a
> >> lot more positions.
> >>
> >> I'm going to sign off for the week, but let me know if I should place
> >> any more probes to nail this down.
> > I spent some time looking into this and I think I landed on a fix:
> >
> > * https://github.com/bobrik/linux/commit/50b627811d54
> >
> > I'm not 100% sure if it's the right fix for the issue, but it reduces
> > the runtime significantly.
>
> It looks like the overhead of mem_cgroup_flush_stats() may be the cause
> of the performance regression. So what version of the Linux kernel are
> you using? From the patch listed in the URL above, it doesn't seem like
> you are using the latest kernel.
>
> The latest upstream kernel already have patches that reduce rstat
> flushing overhead like commit 11192d9c124 ("memcg: flush stats only if
> updated"). Have you try to reproduce it with the latest kernel?

We're on v6.1, which is the latest LTS (I mentioned it in the first
message). It has this patch included and my commit mentions it in the
very first line of the description.

I should've also mentioned commit fd25a9e0e23b ("memcg: unify memcg
stat flushing") from the same series, which my patch is partially
reverting.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Expensive memory.stat + cpu.stat reads
  2023-08-11 22:35               ` Ivan Babrou
@ 2023-08-12  2:33                 ` Shakeel Butt
  2023-08-14 17:56                   ` Ivan Babrou
  0 siblings, 1 reply; 21+ messages in thread
From: Shakeel Butt @ 2023-08-12  2:33 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: Waiman Long, cgroups, Linux MM, kernel-team, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton,
	linux-kernel

Hi Ivan,

(sorry for late response as I was away)

On Fri, Aug 11, 2023 at 3:35 PM Ivan Babrou <ivan@cloudflare.com> wrote:
[...]
> > > I spent some time looking into this and I think I landed on a fix:
> > >
> > > * https://github.com/bobrik/linux/commit/50b627811d54
> > >
> > > I'm not 100% sure if it's the right fix for the issue, but it reduces
> > > the runtime significantly.

In your patch, can you try to replace mem_cgroup_flush_stats() with
mem_cgroup_flush_stats_ratelimited() instead of cgroup_rstat_flush().
I wanted to see if you observe any stale stats issues.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Expensive memory.stat + cpu.stat reads
  2023-08-12  2:33                 ` Shakeel Butt
@ 2023-08-14 17:56                   ` Ivan Babrou
  0 siblings, 0 replies; 21+ messages in thread
From: Ivan Babrou @ 2023-08-14 17:56 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Waiman Long, cgroups, Linux MM, kernel-team, Johannes Weiner,
	Michal Hocko, Roman Gushchin, Muchun Song, Andrew Morton,
	linux-kernel

On Fri, Aug 11, 2023 at 7:34 PM Shakeel Butt <shakeelb@google.com> wrote:
>
> Hi Ivan,
>
> (sorry for late response as I was away)
>
> On Fri, Aug 11, 2023 at 3:35 PM Ivan Babrou <ivan@cloudflare.com> wrote:
> [...]
> > > > I spent some time looking into this and I think I landed on a fix:
> > > >
> > > > * https://github.com/bobrik/linux/commit/50b627811d54
> > > >
> > > > I'm not 100% sure if it's the right fix for the issue, but it reduces
> > > > the runtime significantly.
>
> In your patch, can you try to replace mem_cgroup_flush_stats() with
> mem_cgroup_flush_stats_ratelimited() instead of cgroup_rstat_flush().
> I wanted to see if you observe any stale stats issues.

We scrape cgroup metrics every 53s and at that scale I doubt we'd
notice any staleness. With 2s FLUSH_TIME it would be in the noise. We
can even remove any sort of flushing from memory.stat as long as
cpu.stat is read right before it as it does flushing for both.

I agree with Yosry that there's probably no reason to flush both cpu
and memory stats at the same time.

It doesn't seem right to use delayed flush for memory and direct flush
for cpu. Perhaps it doesn't matter as much to warrant a bigger rework
of how it's done, but maybe then my patch to use the same mechanism
between cpu and memory flushing makes sense?

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Expensive memory.stat + cpu.stat reads
  2023-08-11 22:03           ` Ivan Babrou
  2023-08-11 22:27             ` Waiman Long
@ 2023-08-11 23:43             ` Yosry Ahmed
  2023-08-12  0:01               ` Yosry Ahmed
  1 sibling, 1 reply; 21+ messages in thread
From: Yosry Ahmed @ 2023-08-11 23:43 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: Waiman Long, Shakeel Butt, cgroups, Linux MM, kernel-team,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Andrew Morton, linux-kernel

On Fri, Aug 11, 2023 at 3:03 PM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> On Fri, Jul 14, 2023 at 5:30 PM Ivan Babrou <ivan@cloudflare.com> wrote:
> >
> > On Thu, Jul 13, 2023 at 4:25 PM Ivan Babrou <ivan@cloudflare.com> wrote:
> > > > My understanding of mem-stat and cpu-stat is that they are independent
> > > > of each other. In theory, reading one shouldn't affect the performance
> > > > of reading the others. Since you are doing mem-stat and cpu-stat reading
> > > > repetitively in a loop, it is likely that all the data are in the cache
> > > > most of the time resulting in very fast processing time. If it happens
> > > > that the specific memory location of mem-stat and cpu-stat data are such
> > > > that reading one will cause the other data to be flushed out of the
> > > > cache and have to be re-read from memory again, you could see
> > > > significant performance regression.
> > > >
> > > > It is one of the possible causes, but I may be wrong.
> > >
> > > Do you think it's somewhat similar to how iterating a matrix in rows
> > > is faster than in columns due to sequential vs random memory reads?
> > >
> > > * https://stackoverflow.com/q/9936132
> > > * https://en.wikipedia.org/wiki/Row-_and_column-major_order
> > > * https://en.wikipedia.org/wiki/Loop_interchange
> > >
> > > I've had a similar suspicion and it would be good to confirm whether
> > > it's that or something else. I can probably collect perf counters for
> > > different runs, but I'm not sure which ones I'll need.
> > >
> > > In a similar vein, if we could come up with a tracepoint that would
> > > tell us the amount of work done (or any other relevant metric that
> > > would help) during rstat flushing, I can certainly collect that
> > > information as well for every reading combination.
> >
> > Since cgroup_rstat_flush_locked appears in flamegraphs for both fast
> > (discrete) and slow (combined) cases, I grabbed some stats for it:
> >
> > * Slow:
> >
> > completed: 19.43s [manual / mem-stat + cpu-stat]
> >
> > $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
> > Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
> > ^C
> > 00:12:55
> >      usecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 0        |                                        |
> >          8 -> 15         : 0        |                                        |
> >         16 -> 31         : 0        |                                        |
> >         32 -> 63         : 0        |                                        |
> >         64 -> 127        : 1        |                                        |
> >        128 -> 255        : 191      |************                            |
> >        256 -> 511        : 590      |****************************************|
> >        512 -> 1023       : 186      |************                            |
> >       1024 -> 2047       : 2        |                                        |
> >       2048 -> 4095       : 0        |                                        |
> >       4096 -> 8191       : 0        |                                        |
> >       8192 -> 16383      : 504      |**********************************      |
> >      16384 -> 32767      : 514      |**********************************      |
> >      32768 -> 65535      : 3        |                                        |
> >      65536 -> 131071     : 1        |                                        |
> >
> > avg = 8852 usecs, total: 17633268 usecs, count: 1992
> >
> > * Fast:
> >
> > completed:  0.95s [manual / mem-stat]
> > completed:  0.05s [manual / cpu-stat]
> >
> > $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
> > Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
> > ^C
> > 00:13:27
> >      usecs               : count     distribution
> >          0 -> 1          : 0        |                                        |
> >          2 -> 3          : 0        |                                        |
> >          4 -> 7          : 499      |****************************************|
> >          8 -> 15         : 253      |********************                    |
> >         16 -> 31         : 191      |***************                         |
> >         32 -> 63         : 41       |***                                     |
> >         64 -> 127        : 12       |                                        |
> >        128 -> 255        : 2        |                                        |
> >        256 -> 511        : 2        |                                        |
> >        512 -> 1023       : 0        |                                        |
> >       1024 -> 2047       : 0        |                                        |
> >       2048 -> 4095       : 0        |                                        |
> >       4096 -> 8191       : 0        |                                        |
> >       8192 -> 16383      : 34       |**                                      |
> >      16384 -> 32767      : 21       |*                                       |
> >
> > avg = 857 usecs, total: 904762 usecs, count: 1055
> >
> > There's a different number of calls into cgroup_rstat_flush_locked and
> > they are much slower in the slow case. There are also two bands in the
> > slow case, with 8ms..32ms having the half of the calls.
> >
> > For mem_cgroup_css_rstat_flush:
> >
> > * Slow:
> >
> > completed: 32.77s [manual / mem-stat + cpu-stat]
> >
> > $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
> > Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
> > ^C
> > 00:21:25
> >      usecs               : count     distribution
> >          0 -> 1          : 93078    |*                                       |
> >          2 -> 3          : 3397714  |****************************************|
> >          4 -> 7          : 1009440  |***********                             |
> >          8 -> 15         : 168013   |*                                       |
> >         16 -> 31         : 93       |                                        |
> >
> > avg = 3 usecs, total: 17189289 usecs, count: 4668338
> >
> > * Fast:
> >
> > completed:  0.16s [manual / mem-stat]
> > completed:  0.04s [manual / cpu-stat]
> >
> > $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
> > Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
> > ^C
> > 00:21:57
> >      usecs               : count     distribution
> >          0 -> 1          : 1441     |***                                     |
> >          2 -> 3          : 18780    |****************************************|
> >          4 -> 7          : 4826     |**********                              |
> >          8 -> 15         : 732      |*                                       |
> >         16 -> 31         : 1        |                                        |
> >
> > avg = 3 usecs, total: 89174 usecs, count: 25780
> >
> > There's an 181x difference in the number of calls into
> > mem_cgroup_css_rstat_flush.
> >
> > Does this provide a clue? Perhaps cgroup_rstat_cpu_pop_updated is
> > yielding a ton more iterations for some reason here?
> >
> > * https://elixir.bootlin.com/linux/v6.1/source/kernel/cgroup/rstat.c#L196
> >
> > It's inlined, but I can place a probe into the loop:
> >
> >       7         for_each_possible_cpu(cpu) {
> >       8                 raw_spinlock_t *cpu_lock =
> > per_cpu_ptr(&cgroup_rstat_cpu_lock,
> >                                                                cpu);
> >      10                 struct cgroup *pos = NULL;
> >                         unsigned long flags;
> >
> >                         /*
> >                          * The _irqsave() is needed because cgroup_rstat_lock is
> >                          * spinlock_t which is a sleeping lock on
> > PREEMPT_RT. Acquiring
> >                          * this lock with the _irq() suffix only
> > disables interrupts on
> >                          * a non-PREEMPT_RT kernel. The raw_spinlock_t
> > below disables
> >                          * interrupts on both configurations. The
> > _irqsave() ensures
> >                          * that interrupts are always disabled and
> > later restored.
> >                          */
> >                         raw_spin_lock_irqsave(cpu_lock, flags);
> >                         while ((pos =
> > cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
> >                                 struct cgroup_subsys_state *css;
> >
> >                                 cgroup_base_stat_flush(pos, cpu);
> >      26                         bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
> >
> >      28                         rcu_read_lock();
> >      29                         list_for_each_entry_rcu(css,
> > &pos->rstat_css_list,
> >                                                         rstat_css_node)
> >      31                                 css->ss->css_rstat_flush(css, cpu);
> >      32                         rcu_read_unlock();
> >                         }
> >      34                 raw_spin_unlock_irqrestore(cpu_lock, flags);
> >
> > I added probes on both line 26 and line 31 to catch the middle and inner loops.
> >
> > * Slow:
> >
> > completed: 32.97s [manual / mem-stat + cpu-stat]
> >
> >  Performance counter stats for '/tmp/derp':
> >
> >          4,702,570      probe:cgroup_rstat_flush_locked_L26
> >          9,301,436      probe:cgroup_rstat_flush_locked_L31
> >
> > * Fast:
> >
> > completed:  0.17s [manual / mem-stat]
> > completed:  0.34s [manual / cpu-stat]
> >
> >  Performance counter stats for '/tmp/derp':
> >
> >             31,769      probe:cgroup_rstat_flush_locked_L26
> >             62,849      probe:cgroup_rstat_flush_locked_L31
> >
> > It definitely looks like cgroup_rstat_cpu_pop_updated is yielding a
> > lot more positions.
> >
> > I'm going to sign off for the week, but let me know if I should place
> > any more probes to nail this down.
>
> I spent some time looking into this and I think I landed on a fix:
>
> * https://github.com/bobrik/linux/commit/50b627811d54
>
> I'm not 100% sure if it's the right fix for the issue, but it reduces
> the runtime significantly.

Flushing the entire hierarchy in mem_cgroup_flush_stats() was added
such that concurrent flushers can just skip and let one flusher do the
work for everyone. This was added because we flush the stats in some
paths (like reclaim, refault, dirty throttling) where sometimes there
is a lot of concurrency and we have a thundering herd problem on the
cgroup rstat global lock.

Maybe we can separate userspace reads from other flushers, such that
userspace reads flush the cgroup in question only, while in-kernel
flushers skip if someone else is flushing.

There is also some inconsistency today as not all paths use
mem_cgroup_flush_stats() (see zswap charging function in
mm/memcontrol.c).

Separating userspace reads from in-kernel flushers would also help
because skipping a flush if someone else is flushing for userspace
reads can lead to inaccuracy (see [1]).

I would wait for Shakeel to weigh in here, since he introduced the
unified flushing.

[1]https://lore.kernel.org/lkml/20230809045810.1659356-1-yosryahmed@google.com/

>
> We see a 50x decrease for memory.stat + cpu.stat loop duration with
> the patch applied. TL;DR is that memory.stat flushes all cgroups,
> while cpu.stat flushes just the subtree you're asking for. Both do it
> for cpu and memory at the same time, since both are rstat based. See
> the description for the commit linked above for more details.
>
> There are two more graphs to add. I rebooted 4 servers, 2 of which
> received my patch and 2 were the control. The reboot happened at
> around 05:00Z
>
> First is the latency of scraping cadvisor, where you can clearly see
> that the control group is rising with the daily load, while the test
> group is staying mostly flat:
>
> * https://i.imgur.com/GMtzHIu.png
>
> Second is the CPU time spent by cadvisor, where you can see a similar picture:
>
> * https://i.imgur.com/LWHt14P.png
>
> Let me know what you think.
>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Expensive memory.stat + cpu.stat reads
  2023-08-11 23:43             ` Yosry Ahmed
@ 2023-08-12  0:01               ` Yosry Ahmed
  2023-08-15  0:18                 ` Tejun Heo
  0 siblings, 1 reply; 21+ messages in thread
From: Yosry Ahmed @ 2023-08-12  0:01 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: Waiman Long, Shakeel Butt, cgroups, Linux MM, kernel-team,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
	Andrew Morton, linux-kernel, Tejun Heo

On Fri, Aug 11, 2023 at 4:43 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Fri, Aug 11, 2023 at 3:03 PM Ivan Babrou <ivan@cloudflare.com> wrote:
> >
> > On Fri, Jul 14, 2023 at 5:30 PM Ivan Babrou <ivan@cloudflare.com> wrote:
> > >
> > > On Thu, Jul 13, 2023 at 4:25 PM Ivan Babrou <ivan@cloudflare.com> wrote:
> > > > > My understanding of mem-stat and cpu-stat is that they are independent
> > > > > of each other. In theory, reading one shouldn't affect the performance
> > > > > of reading the others. Since you are doing mem-stat and cpu-stat reading
> > > > > repetitively in a loop, it is likely that all the data are in the cache
> > > > > most of the time resulting in very fast processing time. If it happens
> > > > > that the specific memory location of mem-stat and cpu-stat data are such
> > > > > that reading one will cause the other data to be flushed out of the
> > > > > cache and have to be re-read from memory again, you could see
> > > > > significant performance regression.
> > > > >
> > > > > It is one of the possible causes, but I may be wrong.
> > > >
> > > > Do you think it's somewhat similar to how iterating a matrix in rows
> > > > is faster than in columns due to sequential vs random memory reads?
> > > >
> > > > * https://stackoverflow.com/q/9936132
> > > > * https://en.wikipedia.org/wiki/Row-_and_column-major_order
> > > > * https://en.wikipedia.org/wiki/Loop_interchange
> > > >
> > > > I've had a similar suspicion and it would be good to confirm whether
> > > > it's that or something else. I can probably collect perf counters for
> > > > different runs, but I'm not sure which ones I'll need.
> > > >
> > > > In a similar vein, if we could come up with a tracepoint that would
> > > > tell us the amount of work done (or any other relevant metric that
> > > > would help) during rstat flushing, I can certainly collect that
> > > > information as well for every reading combination.
> > >
> > > Since cgroup_rstat_flush_locked appears in flamegraphs for both fast
> > > (discrete) and slow (combined) cases, I grabbed some stats for it:
> > >
> > > * Slow:
> > >
> > > completed: 19.43s [manual / mem-stat + cpu-stat]
> > >
> > > $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
> > > Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
> > > ^C
> > > 00:12:55
> > >      usecs               : count     distribution
> > >          0 -> 1          : 0        |                                        |
> > >          2 -> 3          : 0        |                                        |
> > >          4 -> 7          : 0        |                                        |
> > >          8 -> 15         : 0        |                                        |
> > >         16 -> 31         : 0        |                                        |
> > >         32 -> 63         : 0        |                                        |
> > >         64 -> 127        : 1        |                                        |
> > >        128 -> 255        : 191      |************                            |
> > >        256 -> 511        : 590      |****************************************|
> > >        512 -> 1023       : 186      |************                            |
> > >       1024 -> 2047       : 2        |                                        |
> > >       2048 -> 4095       : 0        |                                        |
> > >       4096 -> 8191       : 0        |                                        |
> > >       8192 -> 16383      : 504      |**********************************      |
> > >      16384 -> 32767      : 514      |**********************************      |
> > >      32768 -> 65535      : 3        |                                        |
> > >      65536 -> 131071     : 1        |                                        |
> > >
> > > avg = 8852 usecs, total: 17633268 usecs, count: 1992
> > >
> > > * Fast:
> > >
> > > completed:  0.95s [manual / mem-stat]
> > > completed:  0.05s [manual / cpu-stat]
> > >
> > > $ sudo /usr/share/bcc/tools/funclatency -uT cgroup_rstat_flush_locked
> > > Tracing 1 functions for "cgroup_rstat_flush_locked"... Hit Ctrl-C to end.
> > > ^C
> > > 00:13:27
> > >      usecs               : count     distribution
> > >          0 -> 1          : 0        |                                        |
> > >          2 -> 3          : 0        |                                        |
> > >          4 -> 7          : 499      |****************************************|
> > >          8 -> 15         : 253      |********************                    |
> > >         16 -> 31         : 191      |***************                         |
> > >         32 -> 63         : 41       |***                                     |
> > >         64 -> 127        : 12       |                                        |
> > >        128 -> 255        : 2        |                                        |
> > >        256 -> 511        : 2        |                                        |
> > >        512 -> 1023       : 0        |                                        |
> > >       1024 -> 2047       : 0        |                                        |
> > >       2048 -> 4095       : 0        |                                        |
> > >       4096 -> 8191       : 0        |                                        |
> > >       8192 -> 16383      : 34       |**                                      |
> > >      16384 -> 32767      : 21       |*                                       |
> > >
> > > avg = 857 usecs, total: 904762 usecs, count: 1055
> > >
> > > There's a different number of calls into cgroup_rstat_flush_locked and
> > > they are much slower in the slow case. There are also two bands in the
> > > slow case, with 8ms..32ms having the half of the calls.
> > >
> > > For mem_cgroup_css_rstat_flush:
> > >
> > > * Slow:
> > >
> > > completed: 32.77s [manual / mem-stat + cpu-stat]
> > >
> > > $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
> > > Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
> > > ^C
> > > 00:21:25
> > >      usecs               : count     distribution
> > >          0 -> 1          : 93078    |*                                       |
> > >          2 -> 3          : 3397714  |****************************************|
> > >          4 -> 7          : 1009440  |***********                             |
> > >          8 -> 15         : 168013   |*                                       |
> > >         16 -> 31         : 93       |                                        |
> > >
> > > avg = 3 usecs, total: 17189289 usecs, count: 4668338
> > >
> > > * Fast:
> > >
> > > completed:  0.16s [manual / mem-stat]
> > > completed:  0.04s [manual / cpu-stat]
> > >
> > > $ sudo /usr/share/bcc/tools/funclatency -uT mem_cgroup_css_rstat_flush
> > > Tracing 1 functions for "mem_cgroup_css_rstat_flush"... Hit Ctrl-C to end.
> > > ^C
> > > 00:21:57
> > >      usecs               : count     distribution
> > >          0 -> 1          : 1441     |***                                     |
> > >          2 -> 3          : 18780    |****************************************|
> > >          4 -> 7          : 4826     |**********                              |
> > >          8 -> 15         : 732      |*                                       |
> > >         16 -> 31         : 1        |                                        |
> > >
> > > avg = 3 usecs, total: 89174 usecs, count: 25780
> > >
> > > There's an 181x difference in the number of calls into
> > > mem_cgroup_css_rstat_flush.
> > >
> > > Does this provide a clue? Perhaps cgroup_rstat_cpu_pop_updated is
> > > yielding a ton more iterations for some reason here?
> > >
> > > * https://elixir.bootlin.com/linux/v6.1/source/kernel/cgroup/rstat.c#L196
> > >
> > > It's inlined, but I can place a probe into the loop:
> > >
> > >       7         for_each_possible_cpu(cpu) {
> > >       8                 raw_spinlock_t *cpu_lock =
> > > per_cpu_ptr(&cgroup_rstat_cpu_lock,
> > >                                                                cpu);
> > >      10                 struct cgroup *pos = NULL;
> > >                         unsigned long flags;
> > >
> > >                         /*
> > >                          * The _irqsave() is needed because cgroup_rstat_lock is
> > >                          * spinlock_t which is a sleeping lock on
> > > PREEMPT_RT. Acquiring
> > >                          * this lock with the _irq() suffix only
> > > disables interrupts on
> > >                          * a non-PREEMPT_RT kernel. The raw_spinlock_t
> > > below disables
> > >                          * interrupts on both configurations. The
> > > _irqsave() ensures
> > >                          * that interrupts are always disabled and
> > > later restored.
> > >                          */
> > >                         raw_spin_lock_irqsave(cpu_lock, flags);
> > >                         while ((pos =
> > > cgroup_rstat_cpu_pop_updated(pos, cgrp, cpu))) {
> > >                                 struct cgroup_subsys_state *css;
> > >
> > >                                 cgroup_base_stat_flush(pos, cpu);
> > >      26                         bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
> > >
> > >      28                         rcu_read_lock();
> > >      29                         list_for_each_entry_rcu(css,
> > > &pos->rstat_css_list,
> > >                                                         rstat_css_node)
> > >      31                                 css->ss->css_rstat_flush(css, cpu);
> > >      32                         rcu_read_unlock();
> > >                         }
> > >      34                 raw_spin_unlock_irqrestore(cpu_lock, flags);
> > >
> > > I added probes on both line 26 and line 31 to catch the middle and inner loops.
> > >
> > > * Slow:
> > >
> > > completed: 32.97s [manual / mem-stat + cpu-stat]
> > >
> > >  Performance counter stats for '/tmp/derp':
> > >
> > >          4,702,570      probe:cgroup_rstat_flush_locked_L26
> > >          9,301,436      probe:cgroup_rstat_flush_locked_L31
> > >
> > > * Fast:
> > >
> > > completed:  0.17s [manual / mem-stat]
> > > completed:  0.34s [manual / cpu-stat]
> > >
> > >  Performance counter stats for '/tmp/derp':
> > >
> > >             31,769      probe:cgroup_rstat_flush_locked_L26
> > >             62,849      probe:cgroup_rstat_flush_locked_L31
> > >
> > > It definitely looks like cgroup_rstat_cpu_pop_updated is yielding a
> > > lot more positions.
> > >
> > > I'm going to sign off for the week, but let me know if I should place
> > > any more probes to nail this down.
> >
> > I spent some time looking into this and I think I landed on a fix:
> >
> > * https://github.com/bobrik/linux/commit/50b627811d54
> >
> > I'm not 100% sure if it's the right fix for the issue, but it reduces
> > the runtime significantly.
>
> Flushing the entire hierarchy in mem_cgroup_flush_stats() was added
> such that concurrent flushers can just skip and let one flusher do the
> work for everyone. This was added because we flush the stats in some
> paths (like reclaim, refault, dirty throttling) where sometimes there
> is a lot of concurrency and we have a thundering herd problem on the
> cgroup rstat global lock.
>
> Maybe we can separate userspace reads from other flushers, such that
> userspace reads flush the cgroup in question only, while in-kernel
> flushers skip if someone else is flushing.
>
> There is also some inconsistency today as not all paths use
> mem_cgroup_flush_stats() (see zswap charging function in
> mm/memcontrol.c).
>
> Separating userspace reads from in-kernel flushers would also help
> because skipping a flush if someone else is flushing for userspace
> reads can lead to inaccuracy (see [1]).
>
> I would wait for Shakeel to weigh in here, since he introduced the
> unified flushing.
>
> [1]https://lore.kernel.org/lkml/20230809045810.1659356-1-yosryahmed@google.com/
>

+Tejun Heo

There have been a lot of problems coming from this global rstat lock:
hard lockups (when we used to flush atomically), unified flushing
being expensive, skipping flushing being inaccurate, etc.

I wonder if it's time to rethink this lock and break it down into
granular locks. Perhaps a per-cgroup lock, and develop a locking
scheme where you always lock a parent then a child, then flush the
child and unlock it and move to the next child, etc. This will allow
concurrent flushing of non-root cgroups. Even when flushing the root,
if we flush all its children first without locking the root, then only
lock the root when flushing the top-level children, then some level of
concurrency can be achieved.

Maybe this is too complicated, I never tried to implement it, but I
have been bouncing around this idea in my head for a while now.

We can also split the update tree per controller. As far as I can tell
there is no reason to flush cpu stats for example when someone wants
to read memory stats.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Expensive memory.stat + cpu.stat reads
  2023-08-12  0:01               ` Yosry Ahmed
@ 2023-08-15  0:18                 ` Tejun Heo
  2023-08-15  0:30                   ` Ivan Babrou
  0 siblings, 1 reply; 21+ messages in thread
From: Tejun Heo @ 2023-08-15  0:18 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Ivan Babrou, Waiman Long, Shakeel Butt, cgroups, Linux MM,
	kernel-team, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Andrew Morton, linux-kernel

Hello,

On Fri, Aug 11, 2023 at 05:01:08PM -0700, Yosry Ahmed wrote:
> There have been a lot of problems coming from this global rstat lock:
> hard lockups (when we used to flush atomically), unified flushing
> being expensive, skipping flushing being inaccurate, etc.
> 
> I wonder if it's time to rethink this lock and break it down into
> granular locks. Perhaps a per-cgroup lock, and develop a locking
> scheme where you always lock a parent then a child, then flush the
> child and unlock it and move to the next child, etc. This will allow
> concurrent flushing of non-root cgroups. Even when flushing the root,
> if we flush all its children first without locking the root, then only
> lock the root when flushing the top-level children, then some level of
> concurrency can be achieved.
> 
> Maybe this is too complicated, I never tried to implement it, but I
> have been bouncing around this idea in my head for a while now.
> 
> We can also split the update tree per controller. As far as I can tell
> there is no reason to flush cpu stats for example when someone wants
> to read memory stats.

There's another thread. Let's continue there but I'm a bit skeptical whether
splitting the lock is a good solution here. Regardless of locking, we don't
want to run in an atomic context for that long anwyay.

Thanks.

-- 
tejun


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Expensive memory.stat + cpu.stat reads
  2023-08-15  0:18                 ` Tejun Heo
@ 2023-08-15  0:30                   ` Ivan Babrou
  2023-08-15  0:31                     ` Yosry Ahmed
  0 siblings, 1 reply; 21+ messages in thread
From: Ivan Babrou @ 2023-08-15  0:30 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Yosry Ahmed, Waiman Long, Shakeel Butt, cgroups, Linux MM,
	kernel-team, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Andrew Morton, linux-kernel

On Mon, Aug 14, 2023 at 5:18 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello,
>
> On Fri, Aug 11, 2023 at 05:01:08PM -0700, Yosry Ahmed wrote:
> > There have been a lot of problems coming from this global rstat lock:
> > hard lockups (when we used to flush atomically), unified flushing
> > being expensive, skipping flushing being inaccurate, etc.
> >
> > I wonder if it's time to rethink this lock and break it down into
> > granular locks. Perhaps a per-cgroup lock, and develop a locking
> > scheme where you always lock a parent then a child, then flush the
> > child and unlock it and move to the next child, etc. This will allow
> > concurrent flushing of non-root cgroups. Even when flushing the root,
> > if we flush all its children first without locking the root, then only
> > lock the root when flushing the top-level children, then some level of
> > concurrency can be achieved.
> >
> > Maybe this is too complicated, I never tried to implement it, but I
> > have been bouncing around this idea in my head for a while now.
> >
> > We can also split the update tree per controller. As far as I can tell
> > there is no reason to flush cpu stats for example when someone wants
> > to read memory stats.
>
> There's another thread. Let's continue there but I'm a bit skeptical whether
> splitting the lock is a good solution here. Regardless of locking, we don't
> want to run in an atomic context for that long anwyay.

Could you link to the other thread?


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Expensive memory.stat + cpu.stat reads
  2023-08-15  0:30                   ` Ivan Babrou
@ 2023-08-15  0:31                     ` Yosry Ahmed
  0 siblings, 0 replies; 21+ messages in thread
From: Yosry Ahmed @ 2023-08-15  0:31 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: Tejun Heo, Waiman Long, Shakeel Butt, cgroups, Linux MM,
	kernel-team, Johannes Weiner, Michal Hocko, Roman Gushchin,
	Muchun Song, Andrew Morton, linux-kernel

On Mon, Aug 14, 2023 at 5:30 PM Ivan Babrou <ivan@cloudflare.com> wrote:
>
> On Mon, Aug 14, 2023 at 5:18 PM Tejun Heo <tj@kernel.org> wrote:
> >
> > Hello,
> >
> > On Fri, Aug 11, 2023 at 05:01:08PM -0700, Yosry Ahmed wrote:
> > > There have been a lot of problems coming from this global rstat lock:
> > > hard lockups (when we used to flush atomically), unified flushing
> > > being expensive, skipping flushing being inaccurate, etc.
> > >
> > > I wonder if it's time to rethink this lock and break it down into
> > > granular locks. Perhaps a per-cgroup lock, and develop a locking
> > > scheme where you always lock a parent then a child, then flush the
> > > child and unlock it and move to the next child, etc. This will allow
> > > concurrent flushing of non-root cgroups. Even when flushing the root,
> > > if we flush all its children first without locking the root, then only
> > > lock the root when flushing the top-level children, then some level of
> > > concurrency can be achieved.
> > >
> > > Maybe this is too complicated, I never tried to implement it, but I
> > > have been bouncing around this idea in my head for a while now.
> > >
> > > We can also split the update tree per controller. As far as I can tell
> > > there is no reason to flush cpu stats for example when someone wants
> > > to read memory stats.
> >
> > There's another thread. Let's continue there but I'm a bit skeptical whether
> > splitting the lock is a good solution here. Regardless of locking, we don't
> > want to run in an atomic context for that long anwyay.
>
> Could you link to the other thread?

I supposedly CC'd you there, but I realized it didn't work for some reason:
https://lore.kernel.org/lkml/CAJD7tkYBFz-gZ2QsHxUMT=t0KNXs66S-zzMPebadHx9zaG0Q3w@mail.gmail.com/


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Expensive memory.stat + cpu.stat reads
  2023-07-10 23:21   ` Ivan Babrou
  2023-07-11  0:44     ` Waiman Long
@ 2023-07-15  0:14     ` Ivan Babrou
  1 sibling, 0 replies; 21+ messages in thread
From: Ivan Babrou @ 2023-07-15  0:14 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: cgroups, Linux MM, kernel-team, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Muchun Song, Andrew Morton, linux-kernel

On Mon, Jul 10, 2023 at 4:21 PM Ivan Babrou <ivan@cloudflare.com> wrote:
> The fast one is v6.1.37 and the slow one is v6.1.25. I'm not sure if
> the kernel version makes a difference or if it's a matter of uptime /
> traffic profile. The data is from two different locations. The fast
> location has gone through an expansion, which meant a full reboot with
> a kernel upgrade, so maybe that affected things:
>
> * https://i.imgur.com/x8uyMaF.png
>
> Let me try to reboot the slow location and see if there's any lasting
> improvement.

There is no correlation with the kernel version, v6.1.38 is slow too:

completed: 23.09s [manual / cpu-stat + mem-stat]
completed:  0.30s [manual / mem-stat]
completed:  0.64s [manual / cpu-stat]


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Expensive memory.stat + cpu.stat reads
  2023-06-30 23:22 Expensive memory.stat + cpu.stat reads Ivan Babrou
  2023-07-06  6:20 ` Shakeel Butt
@ 2023-07-10 14:44 ` Michal Koutný
  2023-07-10 23:23   ` Ivan Babrou
  1 sibling, 1 reply; 21+ messages in thread
From: Michal Koutný @ 2023-07-10 14:44 UTC (permalink / raw)
  To: Ivan Babrou
  Cc: cgroups, Linux MM, kernel-team, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	linux-kernel

[-- Attachment #1: Type: text/plain, Size: 795 bytes --]

Hello.

On Fri, Jun 30, 2023 at 04:22:28PM -0700, Ivan Babrou <ivan@cloudflare.com> wrote:
> As you might've noticed from the output, splitting the loop into two
> makes the code run 10x faster.

That is curious.

> We're running Linux v6.1 (the output is from v6.1.25) with no patches
> that touch the cgroup or mm subsystems, so you can assume vanilla
> kernel.

Have you watched for this on older kernels too?

> I am happy to try out patches or to do some tracing to help understand
> this better.

I see in your reproducer you tried swapping order of controllers
flushed. 
Have you also tried flushing same controller twice (in the inner loop)?
(Despite the expectation is that it shouldn't be different from half the
scenario where ran two loops.)

Thanks,
Michal

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: Expensive memory.stat + cpu.stat reads
  2023-07-10 14:44 ` Michal Koutný
@ 2023-07-10 23:23   ` Ivan Babrou
  0 siblings, 0 replies; 21+ messages in thread
From: Ivan Babrou @ 2023-07-10 23:23 UTC (permalink / raw)
  To: Michal Koutný
  Cc: cgroups, Linux MM, kernel-team, Johannes Weiner, Michal Hocko,
	Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton,
	linux-kernel

On Mon, Jul 10, 2023 at 7:44 AM Michal Koutný <mkoutny@suse.com> wrote:
>
> Hello.
>
> On Fri, Jun 30, 2023 at 04:22:28PM -0700, Ivan Babrou <ivan@cloudflare.com> wrote:
> > As you might've noticed from the output, splitting the loop into two
> > makes the code run 10x faster.
>
> That is curious.
>
> > We're running Linux v6.1 (the output is from v6.1.25) with no patches
> > that touch the cgroup or mm subsystems, so you can assume vanilla
> > kernel.
>
> Have you watched for this on older kernels too?

We've been on v6.1 for quite a while now, but it's possible that we
weren't paying enough attention before to notice.

> > I am happy to try out patches or to do some tracing to help understand
> > this better.
>
> I see in your reproducer you tried swapping order of controllers
> flushed.
> Have you also tried flushing same controller twice (in the inner loop)?
> (Despite the expectation is that it shouldn't be different from half the
> scenario where ran two loops.)

Same controller twice is fast (whether it's mem + mem or cpu + cpu):

warm-up
completed: 17.24s [manual / cpu-stat + mem-stat]
completed:  1.02s [manual / mem-stat+mem-stat]
completed:  0.59s [manual / cpu-stat+cpu-stat]
completed:  0.44s [manual / mem-stat]
completed:  0.16s [manual / cpu-stat]
running
completed: 14.32s [manual / cpu-stat + mem-stat]
completed:  1.25s [manual / mem-stat+mem-stat]
completed:  0.42s [manual / cpu-stat+cpu-stat]
completed:  0.12s [manual / mem-stat]
completed:  0.50s [manual / cpu-stat]


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2023-08-15  0:31 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-30 23:22 Expensive memory.stat + cpu.stat reads Ivan Babrou
2023-07-06  6:20 ` Shakeel Butt
2023-07-10 23:21   ` Ivan Babrou
2023-07-11  0:44     ` Waiman Long
2023-07-13 23:25       ` Ivan Babrou
2023-07-14 17:23         ` Waiman Long
2023-07-15  0:00           ` Ivan Babrou
2023-07-15  0:30         ` Ivan Babrou
2023-08-11 22:03           ` Ivan Babrou
2023-08-11 22:27             ` Waiman Long
2023-08-11 22:35               ` Ivan Babrou
2023-08-12  2:33                 ` Shakeel Butt
2023-08-14 17:56                   ` Ivan Babrou
2023-08-11 23:43             ` Yosry Ahmed
2023-08-12  0:01               ` Yosry Ahmed
2023-08-15  0:18                 ` Tejun Heo
2023-08-15  0:30                   ` Ivan Babrou
2023-08-15  0:31                     ` Yosry Ahmed
2023-07-15  0:14     ` Ivan Babrou
2023-07-10 14:44 ` Michal Koutný
2023-07-10 23:23   ` Ivan Babrou

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox