linux-mm.kvack.org archive mirror
* Tools for explaining memory mappings/usage/pressure
@ 2024-07-06 20:55 David Rientjes
  2024-07-07 15:44 ` SeongJae Park
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: David Rientjes @ 2024-07-06 20:55 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton, Mel Gorman, Balbir Singh, Peter Zijlstra
  Cc: linux-mm

Hi all,

I'm trying to crowdsource information on open source tools that can be 
used directly by customers to explain memory mappings, usage, pressure, 
etc.

We encounter both internal and external users who are looking for this 
insight, and it often requires significant engineering time to collect the 
data needed to draw any conclusions.

A recent example is an external customer that recently upgraded their 
userspace and started to run into memcg constrained memory pressure that 
wasn't previously observed.  After handing off a hacky script to run in 
the background, it was immediately obvious that the source of the direct 
reclaim was all of the MADV_FREE memory that was sitting around.  
Converting to MADV_DONTNEED solved their issue.

A month ago, a different external customer was concerned about increased 
memory access latency in their guest on some instances although there 
were no issues observable on the host.  After handing off a hacky script 
to run in the background, it was immediately obvious that memory 
fragmentation was resulting in a large disparity in the number of 
hugepages that were available on some instances.
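The hugepage-availability check in that script boils down to reading /proc/buddyinfo; a minimal sketch, assuming the usual buddyinfo field layout:

```python
# Minimal sketch of the fragmentation check described above: parse
# /proc/buddyinfo into per-zone free-block counts by order. A node with few
# high-order blocks free (order >= 9 on x86-64) has few 2MB hugepages
# available, which is the disparity the script surfaced.
def parse_buddyinfo(text: str) -> dict:
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: "Node 0, zone   Normal  c0 c1 ... c10"
        if len(parts) < 5 or parts[0] != "Node":
            continue
        node = parts[1].rstrip(",")
        zone = parts[3]
        zones[(node, zone)] = [int(c) for c in parts[4:]]
    return zones

def high_order_free(zones: dict, min_order: int = 9) -> int:
    # Total free blocks at or above min_order across all zones.
    return sum(sum(counts[min_order:]) for counts in zones.values())
```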

Rather than hacky scripts that collect things like vmstat, memory.stat, 
buddyinfo, etc, at regular intervals, it would be preferable to hand off 
something more complete.  Idea is an open source tool that can be run in 
the background to collect metrics for the system, NUMA nodes, and memcg 
hierarchies, as well as potentially from subsystems in the kernel like 
delay accounting.  IOW, I want to be able to say "install ${tool} and send 
over the log file."
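For concreteness, the kind of background collector being described can be sketched in a few lines; the memory.stat path below assumes a cgroup v2 mount at /sys/fs/cgroup, so adjust for the hierarchy of interest:

```python
import json
import pathlib
import time

# Sketch of a background collector: sample a few procfs/cgroupfs files at
# a regular interval and append each sample, timestamped, to a log file
# that can be handed over for offline analysis.
SOURCES = [
    "/proc/vmstat",
    "/proc/buddyinfo",
    "/sys/fs/cgroup/memory.stat",  # assumes cgroup v2 mounted here
]

def sample(paths=SOURCES) -> dict:
    snap = {"ts": time.time()}
    for p in paths:
        f = pathlib.Path(p)
        if f.exists():               # skip files this kernel/config lacks
            snap[p] = f.read_text()
    return snap

def record(logfile: str, interval_s: float = 10, iterations: int = 6) -> None:
    with open(logfile, "a") as out:
        for _ in range(iterations):
            out.write(json.dumps(sample()) + "\n")
            out.flush()
            time.sleep(interval_s)
```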

Are there any open source tools that do a good job of this today that I can 
latch onto?  If not, it sounds like I'll be writing one from scratch.  Let me 
know if there's interest in this as well.

Thanks!


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Tools for explaining memory mappings/usage/pressure
  2024-07-06 20:55 Tools for explaining memory mappings/usage/pressure David Rientjes
@ 2024-07-07 15:44 ` SeongJae Park
  2024-07-08 13:50 ` Dan Schatzberg
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: SeongJae Park @ 2024-07-07 15:44 UTC (permalink / raw)
  To: David Rientjes
  Cc: SeongJae Park, Michal Hocko, Andrew Morton, Mel Gorman,
	Balbir Singh, Peter Zijlstra, linux-mm, damon

Hi David,

On Sat, 6 Jul 2024 13:55:11 -0700 (PDT) David Rientjes <rientjes@google.com> wrote:

[...]
> Rather than hacky scripts that collect things like vmstat, memory.stat, 
> buddyinfo, etc, at regular intervals, it would be preferable to hand off 
> something more complete.  Idea is an open source tool that can be run in 
> the background to collect metrics for the system, NUMA nodes, and memcg 
> hierarchies, as well as potentially from subsystems in the kernel like 
> delay accounting.  IOW, I want to be able to say "install ${tool} and send 
> over the log file."
> 
> Are there any open source tools that do a good job of this today that I can 
> latch onto?

DAMON user-space tool, damo[1], provides background recording and reporting of
memory access information including size of memory showing specific access
pattern (e.g., working set size).  Nowadays we're extending the tool to capture
and provide more information for holistic and intuitive system investigations.
Currently basic memory footprints and CPU usage of functions are provided.

The current state of the tool is likely far from what you're looking for,
though.  I'm also not sure whether the tool's current roadmap would
perfectly meet your requirements.  That said, we're open to any
contributions to damo.

Hopefully others may know better tools for this.  I'm looking forward to a
chance to learn from those.

> If not, sounds like I'll be writing one from scratch.  Let me 
> know if there's interest in this as well.

We're open not only to receiving contributions to damo, but also to
providing contributions to other projects (and using them).  So, yes, I'm
interested in this :)

[1] https://github.com/awslabs/damo

Thanks,
SJ

> 
> Thanks!



* Re: Tools for explaining memory mappings/usage/pressure
  2024-07-06 20:55 Tools for explaining memory mappings/usage/pressure David Rientjes
  2024-07-07 15:44 ` SeongJae Park
@ 2024-07-08 13:50 ` Dan Schatzberg
  2024-07-21 23:05   ` David Rientjes
  2024-07-22  9:05 ` Vlastimil Babka (SUSE)
  2024-12-30  8:15 ` Raghavendra K T
  3 siblings, 1 reply; 8+ messages in thread
From: Dan Schatzberg @ 2024-07-08 13:50 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Andrew Morton, Mel Gorman, Balbir Singh,
	Peter Zijlstra, linux-mm

On Sat, Jul 06, 2024 at 01:55:11PM -0700, David Rientjes wrote:
> Rather than hacky scripts that collect things like vmstat, memory.stat, 
> buddyinfo, etc, at regular intervals, it would be preferable to hand off 
> something more complete.  Idea is an open source tool that can be run in 
> the background to collect metrics for the system, NUMA nodes, and memcg 
> hierarchies, as well as potentially from subsystems in the kernel like 
> delay accounting.  IOW, I want to be able to say "install ${tool} and send 
> over the log file."
> 
> Are there any open source tools that do a good job of this today that I can 
> latch onto?  If not, it sounds like I'll be writing one from scratch.  Let me 
> know if there's interest in this as well.
> 
> Thanks!
> 

Hi David,

At Meta we have built and deployed Below[1] for this purpose. It's a
tool similar to `top` or others, but can record system state
periodically and allow for replaying. We run this on our production
fleet, periodically recording system state to the local disk. When we
need to debug a machine at a point in the past, we can log in and
replay the state. This uses a TUI (see the link for a demo) to make
navigating the data more natural.

I'm aware of a few other organizations that have also deployed Below,
but they tend to run it more in the manner you suggest - have it record
data but then use the snapshot command to export the state (e.g., as if
it were a log file) so that it can then be viewed off-host. Some
organizations eschew the TUI altogether and export the data to
Prometheus/Grafana.

I'll caution, though, that having the data is one thing; being able to
interpret it is another thing entirely. While we try to put the most
useful and easily understood metrics front and center in the TUI,
debugging an issue like you describe would probably still require some
domain expertise.

[1] https://github.com/facebookincubator/below



* Re: Tools for explaining memory mappings/usage/pressure
  2024-07-08 13:50 ` Dan Schatzberg
@ 2024-07-21 23:05   ` David Rientjes
  2024-07-22 22:15     ` Dan Schatzberg
  0 siblings, 1 reply; 8+ messages in thread
From: David Rientjes @ 2024-07-21 23:05 UTC (permalink / raw)
  To: Dan Schatzberg
  Cc: Michal Hocko, Andrew Morton, Mel Gorman, Balbir Singh,
	Peter Zijlstra, linux-mm

On Mon, 8 Jul 2024, Dan Schatzberg wrote:

> On Sat, Jul 06, 2024 at 01:55:11PM -0700, David Rientjes wrote:
> > Rather than hacky scripts that collect things like vmstat, memory.stat, 
> > buddyinfo, etc, at regular intervals, it would be preferable to hand off 
> > something more complete.  Idea is an open source tool that can be run in 
> > the background to collect metrics for the system, NUMA nodes, and memcg 
> > hierarchies, as well as potentially from subsystems in the kernel like 
> > delay accounting.  IOW, I want to be able to say "install ${tool} and send 
> > over the log file."
> > 
> > Are there any open source tools that do a good job of this today that I can 
> > latch onto?  If not, it sounds like I'll be writing one from scratch.  Let me 
> > know if there's interest in this as well.
> > 
> > Thanks!
> > 
> 
> Hi David,
> 
> At Meta we have built and deployed Below[1] for this purpose. It's a
> tool similar to `top` or others, but can record system state
> periodically and allow for replaying. We run this on our production
> fleet, periodically recording system state to the local disk. When we
> need to debug a machine at a point in the past, we can log in and
> replay the state. This uses a TUI (see the link for a demo) to make
> navigating the data more natural.
> 
> I'm aware of a few other organizations that have also deployed Below,
> but they tend to run it more in the manner you suggest - have it record
> data but then use the snapshot command to export the state (e.g., as if
> it were a log file) so that it can then be viewed off-host. Some
> organizations eschew the TUI altogether and export the data to
> Prometheus/Grafana.
> 
> I'll caution, though, that having the data is one thing; being able to
> interpret it is another thing entirely. While we try to put the most
> useful and easily understood metrics front and center in the TUI,
> debugging an issue like you describe would probably still require some
> domain expertise.
> 
> [1] https://github.com/facebookincubator/below
> 

Thanks Dan, this is fantastic!  I've been playing with it locally.

This does indeed appear to meet the exact needs of what I was referring to 
above; I'm excited that this already exists.

A few questions for you:

 - Do you know of anybody who has deployed this in their guest when 
   running on a public cloud?

 - Is there a motivation to add this to well-known distros so it is "just
   there" and can run out of the box?  There's some configuration and 
   setup that it requires.

 - How receptive are the maintainers to adding new data points, things 
   like additional fields from vmstat, adding in /proc/pagetypeinfo, etc?

 - Any plans to support cgroup v1? :)  Would that be nacked outright?  
   Some customers still run it in their guests.

 - For the "/usr/bin/below record --retain-for-s 604800 --compress" 
   support, is there an appetite for separating this out into its own
   non-systemd managed process?  IOW, the ability to tell the customer
   "go run 'mini-below' and send over the data" that *just* does the
   record operation and doesn't require installing/configuring anything?

This could be potentially very exciting.

Happy to take this discussion off-list as well: if anybody else from this 
thread (or not yet on this thread) is interested, please let me know so I 
include you.

Thanks!



* Re: Tools for explaining memory mappings/usage/pressure
  2024-07-06 20:55 Tools for explaining memory mappings/usage/pressure David Rientjes
  2024-07-07 15:44 ` SeongJae Park
  2024-07-08 13:50 ` Dan Schatzberg
@ 2024-07-22  9:05 ` Vlastimil Babka (SUSE)
  2024-07-22 21:57   ` David Rientjes
  2024-12-30  8:15 ` Raghavendra K T
  3 siblings, 1 reply; 8+ messages in thread
From: Vlastimil Babka (SUSE) @ 2024-07-22  9:05 UTC (permalink / raw)
  To: David Rientjes, Michal Hocko, Andrew Morton, Mel Gorman,
	Balbir Singh, Peter Zijlstra
  Cc: linux-mm

On 7/6/24 10:55 PM, David Rientjes wrote:
> Hi all,
> 
> I'm trying to crowdsource information on open source tools that can be 
> used directly by customers to explain memory mappings, usage, pressure, 
> etc.
> 
> We encounter both internal and external users who are looking for this 
> insight, and it often requires significant engineering time to collect the 
> data needed to draw any conclusions.
> 
> A recent example is an external customer that recently upgraded their 
> userspace and started to run into memcg constrained memory pressure that 
> wasn't previously observed.  After handing off a hacky script to run in 
> the background, it was immediately obvious that the source of the direct 
> reclaim was all of the MADV_FREE memory that was sitting around.  
> Converting to MADV_DONTNEED solved their issue.

BTW, was this reported/fixed upstream? It sounds like a bug to me that
would be better fixed than suggesting the MADV_DONTNEED workaround to
everyone from now on.

> A month ago, a different external customer was concerned about increased 
> memory access latency in their guest on some instances although there 
> were no issues observable on the host.  After handing off a hacky script 
> to run in the background, it was immediately obvious that memory 
> fragmentation was resulting in a large disparity in the number of 
> hugepages that were available on some instances.
> 
> Rather than hacky scripts that collect things like vmstat, memory.stat, 
> buddyinfo, etc, at regular intervals, it would be preferable to hand off 
> something more complete.  Idea is an open source tool that can be run in 
> the background to collect metrics for the system, NUMA nodes, and memcg 
> hierarchies, as well as potentially from subsystems in the kernel like 
> delay accounting.  IOW, I want to be able to say "install ${tool} and send 
> over the log file."
> 
> Are there any open source tools that do a good job of this today that I can 
> latch onto?  If not, it sounds like I'll be writing one from scratch.  Let me 
> know if there's interest in this as well.
> 
> Thanks!
> 




* Re: Tools for explaining memory mappings/usage/pressure
  2024-07-22  9:05 ` Vlastimil Babka (SUSE)
@ 2024-07-22 21:57   ` David Rientjes
  0 siblings, 0 replies; 8+ messages in thread
From: David Rientjes @ 2024-07-22 21:57 UTC (permalink / raw)
  To: Vlastimil Babka (SUSE)
  Cc: Michal Hocko, Andrew Morton, Mel Gorman, Balbir Singh,
	Peter Zijlstra, linux-mm

On Mon, 22 Jul 2024, Vlastimil Babka (SUSE) wrote:

> On 7/6/24 10:55 PM, David Rientjes wrote:
> > Hi all,
> > 
> > I'm trying to crowdsource information on open source tools that can be 
> > used directly by customers to explain memory mappings, usage, pressure, 
> > etc.
> > 
> > We encounter both internal and external users who are looking for this 
> > insight, and it often requires significant engineering time to collect the 
> > data needed to draw any conclusions.
> > 
> > A recent example is an external customer that recently upgraded their 
> > userspace and started to run into memcg constrained memory pressure that 
> > wasn't previously observed.  After handing off a hacky script to run in 
> > the background, it was immediately obvious that the source of the direct 
> > reclaim was all of the MADV_FREE memory that was sitting around.  
> > Converting to MADV_DONTNEED solved their issue.
> 
> BTW, was this reported/fixed upstream? Sounds like a bug to me that would
> better be fixed than suggesting the MADV_DONTNEED workaround to everyone
> from now on.
> 

From the kernel perspective, it was working as intended: nearly half of 
the customer's memory (32GB VM instance) was lazily freeable under memory 
pressure and they were not expecting any reclaim stalls for future page 
faults.  A recent upgrade of their userspace had switched heap freeing in 
a library from MADV_DONTNEED -> MADV_FREE implicitly without the user's 
direct knowledge :/



* Re: Tools for explaining memory mappings/usage/pressure
  2024-07-21 23:05   ` David Rientjes
@ 2024-07-22 22:15     ` Dan Schatzberg
  0 siblings, 0 replies; 8+ messages in thread
From: Dan Schatzberg @ 2024-07-22 22:15 UTC (permalink / raw)
  To: David Rientjes
  Cc: Michal Hocko, Andrew Morton, Mel Gorman, Balbir Singh,
	Peter Zijlstra, linux-mm

On Sun, Jul 21, 2024 at 04:05:57PM -0700, David Rientjes wrote:
> Thanks Dan, this is fantastic!  I've been playing with it locally.
> 
> This does indeed appear to meet the exact needs of what I was referring to 
> above, I'm excited that this already exists.
> 
> Few questions for you:

Just a brief preface to my answers: Below is maintained by just a
couple engineers and our primary focus is internal debugging
use-cases. We welcome contributions as expanding Below's user base
leads to benefits for our internal use-cases. I'll try and speak to
what we would and wouldn't welcome, but before embarking on some more
specific work it may be worth circling back with us to avoid
misalignment.

> 
>  - Do you know of anybody who has deployed this in their guest when 
>    running on a public cloud?

I believe so; engineers from Aviatrix have been contributing to Below
recently, as they have their customers use Below to collect data for
off-host debugging. I've heard anecdotally that Netflix has been using
Below as well, but I'm not entirely confident that's still true.

> 
>  - Is there a motivation to add this to well known distros so it is "just
>    there" and can run out of the box?  There's some configuration and 
>    setup that it requires

https://github.com/facebookincubator/below?tab=readme-ov-file#installing

It's already packaged for Alpine Linux and Gentoo. We'd welcome any
additional contributions to package Below for other distros, so long as
the maintenance burden is not too high.

> 
>  - How receptive are the maintainers to adding new data points, things 
>    like additional fields from vmstat, adding in /proc/pagetypeinfo, etc?

In general, we welcome contributions adding additional data
collection, so long as it is sufficiently performant (e.g. collecting
data for each thread in the system may require more rigour to ensure
it doesn't blow up storage costs or the CPU overhead of running Below) or
at least made optional. Of course we expect this to be done in a
fashion that doesn't overly burden the maintenance of the codebase as
well.

We're a bit more discerning about adding data to the TUI (more
specifically, we scrutinize where the data gets added), just because
adding everyone's personal favorite metric in the most prominent spot
leads to UI clutter and devalues the tool as a visual guide for
debugging.

> 
>  - Any plans to support cgroup v1? :)  Would that be nacked outright?  
>    Some customers still run this in their guest

No plans, but we're not opposed to contributions. I don't think it
would be too challenging; we'd just need to make sure there's some
(GitHub) testing setup for it, since we are not running cgroup v1 in
our internal CI.

> 
>  - For the "/usr/bin/below record --retain-for-s 604800 --compress" 
>    support, is there an appetite for separating this out into its own
>    non-systemd managed process?  IOW, the ability to tell the customer
>    "go run 'mini-below' and send over the data" that *just* does the
>    record operation and doesn't require installing/configuring anything?

I think I follow what you're suggesting here - basically something
fully self-contained (relying on no external configuration) that runs
below record followed by below snapshot, or some way to record directly
to a snapshot, so data can be analyzed off-host. That seems perfectly
reasonable. I believe Aviatrix would benefit from making this easier
for their customers as well.
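A rough sketch of that self-contained flow, as a wrapper script. Only `record --retain-for-s ... --compress` is quoted earlier in this thread; the snapshot flag names below are assumptions, so check `below snapshot --help` for the real interface:

```python
import subprocess

# Build the bounded `below record` invocation quoted earlier in the thread:
# retain one week of data, compressed.
def record_cmd(retain_s: int = 604800) -> list:
    return ["below", "record", "--retain-for-s", str(retain_s), "--compress"]

# Hypothetical flag names for exporting a time range to a file that can be
# sent off-host; verify against the actual `below snapshot` interface.
def snapshot_cmd(begin: str, end: str, output: str) -> list:
    return ["below", "snapshot", "-b", begin, "-e", end, "-o", output]

def run(argv: list) -> None:
    # Raises CalledProcessError if below exits nonzero.
    subprocess.run(argv, check=True)
```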



* Re: Tools for explaining memory mappings/usage/pressure
  2024-07-06 20:55 Tools for explaining memory mappings/usage/pressure David Rientjes
                   ` (2 preceding siblings ...)
  2024-07-22  9:05 ` Vlastimil Babka (SUSE)
@ 2024-12-30  8:15 ` Raghavendra K T
  3 siblings, 0 replies; 8+ messages in thread
From: Raghavendra K T @ 2024-12-30  8:15 UTC (permalink / raw)
  To: David Rientjes, Michal Hocko, Andrew Morton, Mel Gorman,
	Balbir Singh, Peter Zijlstra
  Cc: linux-mm, Rao, Bharata Bhasker, Shivank Garg, ayush.jain3,
	Shukla, Santosh, Grimm, Jon

On 7/7/2024 2:25 AM, David Rientjes wrote:
> Hi all,
> 
> I'm trying to crowdsource information on open source tools that can be
> used directly by customers to explain memory mappings, usage, pressure,
> etc.
> 
> We encounter both internal and external users who are looking for this
> insight, and it often requires significant engineering time to collect the
> data needed to draw any conclusions.

Hello David,

Link: https://github.com/AMDESE/workload-insight-tool

Sorry for replying late (it took us a few months to make the tool
open source).

I'm not sure whether it exactly fits the requirement, but the tool has
been very helpful for us (developers) in visualizing system behavior
(as exported via the procfs/sysfs interfaces) and in initial analysis.

Deployment is easy since it is provided as a Python package.

Typical usage:
1. Collect the behavior of a workload using
$ syswit collect
e.g.,
$ syswit collect -c "<PWD>/syswit/collector_configs/numa_cxl.yaml" -C -T -m 1 -w "<workload>"

2. The collected information is stored in JSON format, and it can be
further compared/visualized with
$ syswit analyze (behavior of a single workload)
OR
$ syswit compare (compare results across multiple workload runs,
e.g., before/after a patch).

Still a long way to go, but I hope it is useful.

[The idea for the tool was seeded by Bharata; Bharata and I helped with
some of the design and optimization; Ayush is the sole developer; and
many inside AMD helped in reviewing/improving it (Shivank, ....).]

USPs:
1. We can tune which information to collect.
2. It can run for a long time and store data that can be analyzed/
visualized offline for anomalies.

(I'm wondering whether I should also post this as a new thread on
linux-mm for greater awareness.)

Thanks and Regards
- Raghu



end of thread, other threads:[~2024-12-30  8:15 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-07-06 20:55 Tools for explaining memory mappings/usage/pressure David Rientjes
2024-07-07 15:44 ` SeongJae Park
2024-07-08 13:50 ` Dan Schatzberg
2024-07-21 23:05   ` David Rientjes
2024-07-22 22:15     ` Dan Schatzberg
2024-07-22  9:05 ` Vlastimil Babka (SUSE)
2024-07-22 21:57   ` David Rientjes
2024-12-30  8:15 ` Raghavendra K T
