[LSF/MM/BPF TOPIC] Physical LRU scanning feasibility

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [LSF/MM/BPF TOPIC] Physical LRU scanning feasibility
@ 2025-01-08 21:46 Lorenzo Stoakes
  2025-01-08 22:33 ` Yosry Ahmed
  2025-02-22 19:12 ` Lorenzo Stoakes
  0 siblings, 2 replies; 4+ messages in thread
From: Lorenzo Stoakes @ 2025-01-08 21:46 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-mm, linux-kernel, Matthew Wilcox, Axel Rasmussen,
	Guru Anbalagane, Wei Xu, Yuanchu Xie

Hi all,

Not too long ago I took some time to investigate the possibility of
scanning physical memory directly by traversing the memory map directly
rather than the LRU linked list.

This was inspired by a post from Matthew [0] wherein he demonstrated just
how significant the difference is between traversing arrays of contiguous
data on a modern system vs. the almost worst-case scenario of traversing a
linked-list.

I tested how this might look by implementing code which simply traverses
and filters the memory map for LRU pages, simplifying as much as possible.

However no matter which machine (ranging from 16 GB - 192 GB) or whether
virtualised or real hardware, I found unfortunately disappointing results -
the act of having to scan such a large range of memory resulted in
performance significantly less than a typical LRU scan at low memory
utilisation and performance at best matching LRU scanning at high memory
utilisation (simulating higher memory pressure).

There are a number of factors at play here, and perhaps the shrinkage of
struct page (allowing for denser placement in cache lines), or an improved
algorithm might lead to more promising results.

Having discussed this with Matthew, he suggested I put forward a proposal
to discuss this area in order that we can learn from this should it appear
this approach is unworkable or perhaps determine whether there might be
something to this that we might still salvage.

I intend to do some more research and generate some more specific numbers
(feel free to give feedback here) before LSF so we can have something more
specific to talk about.

I always envisioned this approach being somehow integrated with MGLRU and I
wonder if some hybrid means of integrating this approach with the MGLRU one
might make sense, which could also be another area of discussion.

Thanks!

[0]:https://lore.kernel.org/linux-mm/ZTc7SHQ4RbPkD3eZ@casper.infradead.org/

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Physical LRU scanning feasibility
  2025-01-08 21:46 [LSF/MM/BPF TOPIC] Physical LRU scanning feasibility Lorenzo Stoakes
@ 2025-01-08 22:33 ` Yosry Ahmed
  2025-01-09 12:38   ` Lorenzo Stoakes
  2025-02-22 19:12 ` Lorenzo Stoakes
  1 sibling, 1 reply; 4+ messages in thread
From: Yosry Ahmed @ 2025-01-08 22:33 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: lsf-pc, linux-mm, linux-kernel, Matthew Wilcox, Axel Rasmussen,
	Guru Anbalagane, Wei Xu, Yuanchu Xie

On Wed, Jan 8, 2025 at 1:46 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> Hi all,
>
> Not too long ago I took some time to investigate the possibility of
> scanning physical memory directly by traversing the memory map directly
> rather than the LRU linked list.
>
> This was inspired by a post from Matthew [0] wherein he demonstrated just
> how significant the difference is between traversing arrays of contiguous
> data on a modern system vs. the almost worst-case scenario of traversing a
> linked-list.
>
> I tested how this might look by implementing code which simply traverses
> and filters the memory map for LRU pages, simplifying as much as possible.
>
> However no matter which machine (ranging from 16 GB - 192 GB) or whether
> virtualised or real hardware, I found unfortunately disappointing results -
> the act of having to scan such a large range of memory resulted in
> performance significantly less than a typical LRU scan at low memory
> utilisation and performance at best matching LRU scanning at high memory
> utilisation (simulating higher memory pressure).
>
> There are a number of factors at play here, and perhaps the shrinkage of
> struct page (allowing for denser placement in cache lines), or an improved
> algorithm might lead to more promising results.
>
> Having discussed this with Matthew, he suggested I put forward a proposal
> to discuss this area in order that we can learn from this should it appear
> this approach is unworkable or perhaps determine whether there might be
> something to this that we might still salvage.
>
> I intend to do some more research and generate some more specific numbers
> (feel free to give feedback here) before LSF so we can have something more
> specific to talk about.
>
> I always envisioned this approach being somehow integrated with MGLRU and I
> wonder if some hybrid means of integrating this approach with the MGLRU one
> might make sense, which could also be another area of discussion.

When I read this proposal the first thing that came to mind was memcg
reclaim. While it seems to me that it is already inefficient to scan
all physical memory looking for a possible reclaim candidate, it seems
even more inefficient to try to find a possible reclaim candidate
within the needed memcg. We'll also probably do a lot of repeated
scanning as we iterate memcgs. The per-memcg per-node LRUs save us
from this.

Do you have an idea about handling memcg reclaim efficiently?


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Physical LRU scanning feasibility
  2025-01-08 22:33 ` Yosry Ahmed
@ 2025-01-09 12:38   ` Lorenzo Stoakes
  0 siblings, 0 replies; 4+ messages in thread
From: Lorenzo Stoakes @ 2025-01-09 12:38 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: lsf-pc, linux-mm, linux-kernel, Matthew Wilcox, Axel Rasmussen,
	Guru Anbalagane, Wei Xu, Yuanchu Xie

On Wed, Jan 08, 2025 at 02:33:01PM -0800, Yosry Ahmed wrote:
> On Wed, Jan 8, 2025 at 1:46 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > Hi all,
> >
> > Not too long ago I took some time to investigate the possibility of
> > scanning physical memory directly by traversing the memory map directly
> > rather than the LRU linked list.
> >
> > This was inspired by a post from Matthew [0] wherein he demonstrated just
> > how significant the difference is between traversing arrays of contiguous
> > data on a modern system vs. the almost worst-case scenario of traversing a
> > linked-list.
> >
> > I tested how this might look by implementing code which simply traverses
> > and filters the memory map for LRU pages, simplifying as much as possible.
> >
> > However no matter which machine (ranging from 16 GB - 192 GB) or whether
> > virtualised or real hardware, I found unfortunately disappointing results -
> > the act of having to scan such a large range of memory resulted in
> > performance significantly less than a typical LRU scan at low memory
> > utilisation and performance at best matching LRU scanning at high memory
> > utilisation (simulating higher memory pressure).
> >
> > There are a number of factors at play here, and perhaps the shrinkage of
> > struct page (allowing for denser placement in cache lines), or an improved
> > algorithm might lead to more promising results.
> >
> > Having discussed this with Matthew, he suggested I put forward a proposal
> > to discuss this area in order that we can learn from this should it appear
> > this approach is unworkable or perhaps determine whether there might be
> > something to this that we might still salvage.
> >
> > I intend to do some more research and generate some more specific numbers
> > (feel free to give feedback here) before LSF so we can have something more
> > specific to talk about.
> >
> > I always envisioned this approach being somehow integrated with MGLRU and I
> > wonder if some hybrid means of integrating this approach with the MGLRU one
> > might make sense, which could also be another area of discussion.
>
> When I read this proposal the first thing that came to mind was memcg
> reclaim. While it seems to me that it is already inefficient to scan
> all physical memory looking for a possible reclaim candidate, it seems
> even more inefficient to try to find a possible reclaim candidate
> within the needed memcg. We'll also probably do a lot of repeated
> scanning as we iterate memcgs. The per-memcg per-node LRUs save us
> from this.

Indeed the memcg and cgroup side of this is a big bug bear of this
approach, I was initially investigating from the global reclaim pespective,
but one would _absolutely_ have to have a good answer as to how to proceed
in this instances.

I suspect any workable version of this approach will involve some kind of
hybrid approach.

If we consider the issues arising from the current use of linked lists they
are:

1. LRU lock contention (probably by far the prime concern)
2. The horrible cache performance of linked list traversal.

So perhaps use of an appropriate data structure might improve things (maple
tree seems a candidate, though whether appropriate I am not sure), though
in considering this one has also to be very very concerned about potential
'ironic' memory overhead in using such a data structure, ironic because
reclaim is our last gasp attempt to free memory, so needing to allocate it
becomes problematic under memory pressure...

I wonder whether one might consider some sort of physical stratification of
memory based on memcg, but that seems likely unworkable especially in light
of zone constraints.

>
> Do you have an idea about handling memcg reclaim efficiently?

So in short, no :>) but 100% acked that this is a fundamental issue here.

I will do some more research on this though in preparation for further
discussion.


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [LSF/MM/BPF TOPIC] Physical LRU scanning feasibility
  2025-01-08 21:46 [LSF/MM/BPF TOPIC] Physical LRU scanning feasibility Lorenzo Stoakes
  2025-01-08 22:33 ` Yosry Ahmed
@ 2025-02-22 19:12 ` Lorenzo Stoakes
  1 sibling, 0 replies; 4+ messages in thread
From: Lorenzo Stoakes @ 2025-02-22 19:12 UTC (permalink / raw)
  To: lsf-pc
  Cc: linux-mm, linux-kernel, Matthew Wilcox, Axel Rasmussen,
	Guru Anbalagane, Wei Xu, Yuanchu Xie

Hi all,

Apologies I feel I've bitten off more than I can chew especially given
current workload. I'd like to focus on my anon_vma topic, so I'm going to
have to drop this one on physical LRU scanning.

If anybody else would like to lead feel free, but I think this one is
rather too dependent on my providing my analysis (which unfortunately I
don't have time for any longer) so I don't think this is really feasible :(

I may try to pick this up in 2026, and perhaps by then it won't be a
post-mortem of a failed attempt but rather a discussion of a succesful one
:) let's see.

Best, Lorenzo

On Wed, Jan 08, 2025 at 09:46:31PM +0000, Lorenzo Stoakes wrote:
> Hi all,
>
> Not too long ago I took some time to investigate the possibility of
> scanning physical memory directly by traversing the memory map directly
> rather than the LRU linked list.
>
> This was inspired by a post from Matthew [0] wherein he demonstrated just
> how significant the difference is between traversing arrays of contiguous
> data on a modern system vs. the almost worst-case scenario of traversing a
> linked-list.
>
> I tested how this might look by implementing code which simply traverses
> and filters the memory map for LRU pages, simplifying as much as possible.
>
> However no matter which machine (ranging from 16 GB - 192 GB) or whether
> virtualised or real hardware, I found unfortunately disappointing results -
> the act of having to scan such a large range of memory resulted in
> performance significantly less than a typical LRU scan at low memory
> utilisation and performance at best matching LRU scanning at high memory
> utilisation (simulating higher memory pressure).
>
> There are a number of factors at play here, and perhaps the shrinkage of
> struct page (allowing for denser placement in cache lines), or an improved
> algorithm might lead to more promising results.
>
> Having discussed this with Matthew, he suggested I put forward a proposal
> to discuss this area in order that we can learn from this should it appear
> this approach is unworkable or perhaps determine whether there might be
> something to this that we might still salvage.
>
> I intend to do some more research and generate some more specific numbers
> (feel free to give feedback here) before LSF so we can have something more
> specific to talk about.
>
> I always envisioned this approach being somehow integrated with MGLRU and I
> wonder if some hybrid means of integrating this approach with the MGLRU one
> might make sense, which could also be another area of discussion.
>
> Thanks!
>
> [0]:https://lore.kernel.org/linux-mm/ZTc7SHQ4RbPkD3eZ@casper.infradead.org/


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-02-22 19:12 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-08 21:46 [LSF/MM/BPF TOPIC] Physical LRU scanning feasibility Lorenzo Stoakes
2025-01-08 22:33 ` Yosry Ahmed
2025-01-09 12:38   ` Lorenzo Stoakes
2025-02-22 19:12 ` Lorenzo Stoakes

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox