* MM global locks as core counts quadruple
From: David Rientjes @ 2024-06-21 0:35 UTC (permalink / raw)
To: Andrew Morton, Liam R. Howlett, Suren Baghdasaryan,
    Matthew Wilcox, Christoph Lameter, Paul E. McKenney, Tejun Heo,
    Johannes Weiner, Davidlohr Bueso
Cc: Namhyung Kim, linux-mm

Hi all,

As core counts rapidly expand over the next four years, Namhyung and I
have been looking at global locks that we're already seeing high
contention on even today.

Some of these are not MM specific:
 - cgroup_mutex
 - cgroup_threadgroup_rwsem
 - tasklist_lock
 - kernfs_mutex (although this should now be substantially better with the
   kernfs_locks array)

Others *are* MM specific:
 - list_lrus_mutex
 - pcpu_drain_mutex
 - shrinker_mutex (formerly shrinker_rwsem)
 - vmap_purge_lock
 - slab_mutex

This is only looking at fleet data for global static locks, not locks like
zone->lock that get dynamically allocated.

(mmap_lock was substantially improved by per-vma locking, although it does
still show up for very large vmas.)

Couple of questions:

(1) How are people quantifying these pain points, if at all, in synthetic
    testing? Any workloads or benchmarks that are really good at doing
    this in the lab beyond the traditional will-it-scale? (The above is
    from production data; a minimal sketch of the kind of stressor meant
    here follows below.)

(2) Is anybody working on any of the above global locks? Trying to
    surface gaps for locks that will likely become even more painful in
    the coming years.
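
For illustration, here is a minimal sketch of the kind of synthetic
stressor meant in (1): N threads creating and tearing down cgroup v2
directories, which serializes on cgroup_mutex. This is only an
illustrative sketch, not an established benchmark; the mount point,
thread count, and iteration count are assumptions, and it needs root:

/*
 * Hypothetical stressor: each thread repeatedly creates and removes its
 * own cgroup v2 directory, which takes cgroup_mutex in the kernel on
 * both the mkdir and the rmdir path.  Assumes cgroup2 is mounted at
 * /sys/fs/cgroup and that we may create directories there (run as root).
 * Build with something like: cc -O2 -pthread stress-cgroup.c
 */
#include <pthread.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define NTHREADS   256		/* assumption: match the core count under test */
#define ITERATIONS 10000

static void *worker(void *arg)
{
	long id = (long)arg;
	char path[128];

	snprintf(path, sizeof(path), "/sys/fs/cgroup/lockstress-%ld", id);
	for (int i = 0; i < ITERATIONS; i++) {
		if (mkdir(path, 0755))		/* cgroup creation */
			perror("mkdir");
		if (rmdir(path))		/* cgroup destruction */
			perror("rmdir");
	}
	return NULL;
}

int main(void)
{
	pthread_t threads[NTHREADS];

	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, worker, (void *)i);
	for (long i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);
	return 0;
}

How badly this stops scaling as the thread count grows, together with lock
profiling of cgroup_mutex while it runs, is the kind of signal meant here.
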
Thanks!

* Re: MM global locks as core counts quadruple
From: Matthew Wilcox @ 2024-06-21 2:01 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Liam R. Howlett, Suren Baghdasaryan, Christoph Lameter,
    Paul E. McKenney, Tejun Heo, Johannes Weiner, Davidlohr Bueso,
    Namhyung Kim, linux-mm

On Thu, Jun 20, 2024 at 05:35:45PM -0700, David Rientjes wrote:
> - slab_mutex

This is weird. You must have very strange workloads running ;-) It
protects the list of slabs. So it's taken for

 - Slab creation / destruction
 - CPU online/offline
 - Memory hotplug
 - Reading /proc/slabinfo

All of these should be rare events, but don't seem to be for you?

* Re: MM global locks as core counts quadruple
From: Yafang Shao @ 2024-06-21 2:46 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Liam R. Howlett, Suren Baghdasaryan, Matthew Wilcox,
    Christoph Lameter, Paul E. McKenney, Tejun Heo, Johannes Weiner,
    Davidlohr Bueso, Namhyung Kim, linux-mm

On Fri, Jun 21, 2024 at 8:50 AM David Rientjes <rientjes@google.com> wrote:
> Some of these are not MM specific:
> - cgroup_mutex

We previously discussed replacing cgroup_mutex with RCU in some critical
paths [0], but I haven't had the time to implement it yet.

[0]. https://lore.kernel.org/bpf/CALOAHbAgWoFGSc=uF5gFWXmsALECUaGGScQuXpRcwjgzv+TPGQ@mail.gmail.com/

> This is only looking at fleet data for global static locks, not locks like
> zone->lock that get dynamically allocated.

We have encountered latency spikes caused by the zone->lock. Do we have
any plans to eliminate this lock, such as implementing a lockless buddy
list? I believe this could be a viable solution.

--
Regards
Yafang
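
The mutex-to-RCU conversion referred to above generally takes the following
shape for a read-mostly lookup path. This is a generic sketch with made-up
names (struct foo, foo_idr), not the actual cgroup_mutex call sites:

#include <linux/idr.h>
#include <linux/mutex.h>
#include <linux/printk.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {				/* made-up object */
	int id;
	struct rcu_head rcu;
};

static DEFINE_MUTEX(foo_mutex);		/* the global lock being relieved */
static DEFINE_IDR(foo_idr);

/* Before: readers and writers all serialize on foo_mutex. */
struct foo *foo_lookup_locked(int id)
{
	struct foo *f;

	mutex_lock(&foo_mutex);
	f = idr_find(&foo_idr, id);
	mutex_unlock(&foo_mutex);
	return f;
}

/* After: readers run lock-free under RCU; only writers take the mutex. */
void foo_use(int id)
{
	struct foo *f;

	rcu_read_lock();
	f = idr_find(&foo_idr, id);
	if (f)
		pr_info("found foo %d\n", f->id);	/* must stay inside the
							   RCU section, or take a
							   reference (elided) */
	rcu_read_unlock();
}

void foo_remove(struct foo *f)
{
	mutex_lock(&foo_mutex);		/* writers still serialize */
	idr_remove(&foo_idr, f->id);
	mutex_unlock(&foo_mutex);
	kfree_rcu(f, rcu);		/* freed only after readers drain */
}

Whether this is workable for a given cgroup_mutex path of course depends on
that path being read-mostly and on the object lifetime rules allowing the
free to be deferred behind a grace period.
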

* Re: MM global locks as core counts quadruple
From: Matthew Wilcox @ 2024-06-21 2:54 UTC (permalink / raw)
To: Yafang Shao
Cc: David Rientjes, Andrew Morton, Liam R. Howlett, Suren Baghdasaryan,
    Christoph Lameter, Paul E. McKenney, Tejun Heo, Johannes Weiner,
    Davidlohr Bueso, Namhyung Kim, linux-mm

On Fri, Jun 21, 2024 at 10:46:21AM +0800, Yafang Shao wrote:
> We have encountered latency spikes caused by the zone->lock. Do we have
> any plans to eliminate this lock, such as implementing a lockless buddy
> list? I believe this could be a viable solution.

Lockless how? I see three operations being performed on the buddy list:

 - Add to end (freeing)
 - Remove from end (allocation)
 - Remove from middle (buddy was freed, needs to be coalesced)

I don't see how we can handle this locklessly.

I think a more productive solution to contention on the LRU lock is to
increase the number of zones. I don't think it's helpful to have a 1TB
zone of memory. Maybe we should limit each zone to 16GB or so. That means
we'd need to increase the number of zones we support, but I think that's
doable.
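
Spelled out, all three of those are plain doubly linked list updates done
under zone->lock today, and it is the third one that has to reach into the
middle of a list, which is the part that is hard to do locklessly. A
simplified illustration follows (illustrative only, not the kernel's real
free-list code):

#include <stddef.h>

struct page_stub {			/* stand-in for a free page */
	struct page_stub *prev, *next;
};

struct free_list {			/* circular list with a sentinel head */
	struct page_stub head;
};

static void free_list_init(struct free_list *fl)
{
	fl->head.prev = fl->head.next = &fl->head;
}

/* 1. Add to end: freeing a page appends it to the tail. */
static void free_page_to_list(struct free_list *fl, struct page_stub *p)
{
	p->prev = fl->head.prev;
	p->next = &fl->head;
	fl->head.prev->next = p;
	fl->head.prev = p;
}

/* 2. Remove from end: allocation pops the tail. */
static struct page_stub *alloc_page_from_list(struct free_list *fl)
{
	struct page_stub *p = fl->head.prev;

	if (p == &fl->head)
		return NULL;		/* list is empty */
	p->prev->next = &fl->head;
	fl->head.prev = p->prev;
	return p;
}

/* 3. Remove from middle: this page's buddy was freed, so it must be
 *    unlinked from wherever it sits and merged into a larger block. */
static void unlink_for_coalesce(struct page_stub *p)
{
	p->prev->next = p->next;
	p->next->prev = p->prev;
}

Operations 1 and 2 only ever touch the ends of the list; operation 3 can
touch any node, which is why the whole structure is kept consistent with a
single lock today.
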

* Re: MM global locks as core counts quadruple
From: Karim Manaouil @ 2024-06-26 19:38 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Yafang Shao, David Rientjes, Andrew Morton, Liam R. Howlett,
    Suren Baghdasaryan, Christoph Lameter, Paul E. McKenney, Tejun Heo,
    Johannes Weiner, Davidlohr Bueso, Namhyung Kim, linux-mm

On Fri, Jun 21, 2024 at 03:54:31AM +0100, Matthew Wilcox wrote:
> I think a more productive solution to contention on the LRU lock is to
> increase the number of zones. I don't think it's helpful to have a 1TB
> zone of memory. Maybe we should limit each zone to 16GB or so. That means
> we'd need to increase the number of zones we support, but I think that's
> doable.

What do you mean by zones? The usual ZONE_{DMA|DMA32|NORMAL|HIGHMEM}?
But those have historically existed to deal with physical addressing
limitations (e.g. DMA for devices that can't deal with addresses larger
than 16MiB, or HIGHMEM on 32-bit to access physical memory beyond what
could be directly mapped by the kernel).

Maybe you mean turning ZONE_NORMAL into an array, with each entry pointing
to a smaller ZONE_NORMAL region of, say, 64GiB. Or it could be divided by
the number of CPUs within the NUMA node, and each CPU would be given one
ZONE_NORMAL segment with a fallback list to other CPUs' segments in case
it runs out of memory. Does that make sense?

Cheers
Karim
PhD Student
Edinburgh University

* Re: MM global locks as core counts quadruple
From: Christoph Lameter (Ampere) @ 2024-06-27 5:36 UTC (permalink / raw)
To: Karim Manaouil
Cc: Matthew Wilcox, Yafang Shao, David Rientjes, Andrew Morton,
    Liam R. Howlett, Suren Baghdasaryan, Paul E. McKenney, Tejun Heo,
    Johannes Weiner, Davidlohr Bueso, Namhyung Kim, linux-mm

On Wed, 26 Jun 2024, Karim Manaouil wrote:
> Maybe you mean turning ZONE_NORMAL into an array, with each entry pointing
> to a smaller ZONE_NORMAL region of, say, 64GiB. Or it could be divided by
> the number of CPUs within the NUMA node, and each CPU would be given one
> ZONE_NORMAL segment with a fallback list to other CPUs' segments in case
> it runs out of memory. Does that make sense?

More zones means longer zonelists for the page allocator to walk during
memory allocation. VM statistics are also kept per zone, so splitting
zones would spread those counters out and decrease cacheline contention
on them. That would be good for scaling VM things in general.

* Re: MM global locks as core counts quadruple
From: Tejun Heo @ 2024-06-21 19:10 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Liam R. Howlett, Suren Baghdasaryan, Matthew Wilcox,
    Christoph Lameter, Paul E. McKenney, Johannes Weiner, Davidlohr Bueso,
    Namhyung Kim, linux-mm

Hello,

On Thu, Jun 20, 2024 at 05:35:45PM -0700, David Rientjes wrote:
> Some of these are not MM specific:
> - cgroup_mutex

Our machines are getting bigger but we aren't creating and destroying
cgroups frequently enough for this to matter. But yeah, I can see how this
can be a problem.

> - cgroup_threadgroup_rwsem

This one shouldn't matter at all in setups where new cgroups are populated
with CLONE_INTO_CGROUP and not migrated further. The lock isn't grabbed in
such usage pattern, which should be the vast majority already, I think. Are
you guys migrating tasks a lot or not using CLONE_INTO_CGROUP?

Thanks.

--
tejun
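
For reference, placing a child directly into a target cgroup at fork time
looks roughly like the sketch below. Hedged notes: the cgroup path is an
assumption, error handling is minimal, and CLONE_INTO_CGROUP needs a
cgroup2 hierarchy plus a v5.7+ kernel and matching UAPI headers:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/sched.h>	/* struct clone_args, CLONE_INTO_CGROUP */
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* glibc has no clone3() wrapper, so invoke the syscall directly. */
static pid_t clone_into_cgroup(int cgroup_fd)
{
	struct clone_args args;

	memset(&args, 0, sizeof(args));
	args.flags = CLONE_INTO_CGROUP;
	args.exit_signal = SIGCHLD;
	args.cgroup = cgroup_fd;	/* fd of the target cgroup2 directory */

	return syscall(__NR_clone3, &args, sizeof(args));
}

int main(void)
{
	/* Assumed path; any cgroup2 directory the caller may attach to works. */
	int cgfd = open("/sys/fs/cgroup/workers", O_RDONLY | O_DIRECTORY);
	pid_t pid;

	if (cgfd < 0)
		return 1;

	pid = clone_into_cgroup(cgfd);
	if (pid < 0)
		return 1;
	if (pid == 0) {
		/* Child: already a member of the target cgroup. */
		execlp("sleep", "sleep", "10", (char *)NULL);
		_exit(1);
	}
	waitpid(pid, NULL, 0);
	return 0;
}

Because the child starts out in the destination cgroup, it never has to be
migrated afterwards, which is the usage pattern described above where
cgroup_threadgroup_rwsem is not taken.
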

* Re: MM global locks as core counts quadruple
From: Namhyung Kim @ 2024-06-21 21:37 UTC (permalink / raw)
To: Tejun Heo
Cc: David Rientjes, Andrew Morton, Liam R. Howlett, Suren Baghdasaryan,
    Matthew Wilcox, Christoph Lameter, Paul E. McKenney, Johannes Weiner,
    Davidlohr Bueso, linux-mm

Hello,

On Fri, Jun 21, 2024 at 12:10 PM Tejun Heo <tj@kernel.org> wrote:
> > - cgroup_threadgroup_rwsem
>
> This one shouldn't matter at all in setups where new cgroups are populated
> with CLONE_INTO_CGROUP and not migrated further. The lock isn't grabbed in
> such usage pattern, which should be the vast majority already, I think. Are
> you guys migrating tasks a lot or not using CLONE_INTO_CGROUP?

I'm afraid there are still some use cases in Google that migrate processes
and/or threads between cgroups. :(

Thanks,
Namhyung

* Re: MM global locks as core counts quadruple
From: Tejun Heo @ 2024-06-23 17:59 UTC (permalink / raw)
To: Namhyung Kim
Cc: David Rientjes, Andrew Morton, Liam R. Howlett, Suren Baghdasaryan,
    Matthew Wilcox, Christoph Lameter, Paul E. McKenney, Johannes Weiner,
    Davidlohr Bueso, linux-mm

Hello,

On Fri, Jun 21, 2024 at 02:37:43PM -0700, Namhyung Kim wrote:
> I'm afraid there are still some use cases in Google that migrate processes
> and/or threads between cgroups. :(

I see. I wonder whether we can turn this into a per-cgroup lock. It's not
straightforward, though. It's protecting migration against forking and
exiting, and the only way to turn it into a per-cgroup lock would be tying
it to the source cgroup, as that's the only thing identifiable from the
fork and exit paths. The problem is that a single atomic migration
operation can pull tasks from multiple cgroups into one destination
cgroup, even on cgroup2, due to threaded cgroups. This would be pretty
rare on cgroup2 but still needs to be handled, which means grabbing
multiple locks from the migration path. Not the end of the world, but a
bit nasty.

But, as long as it's well encapsulated and out of line, I don't see
problems with such an approach.

As for cgroup_mutex, it's more complicated as the usage is more spread
out, but yeah, the only solution there too would be going for finer
grained locking, whether that's hierarchical or hashed.

Thanks.

--
tejun
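
The hashed variant mentioned above is essentially the pattern kernfs now
uses for its open-file mutexes (the kernfs_locks array from the first
mail): a fixed array of locks indexed by a hash of the object pointer. A
generic sketch with made-up names, not a proposed patch:

#include <linux/hash.h>
#include <linux/log2.h>
#include <linux/mutex.h>

struct cgroup;				/* opaque here; sketch only */

#define NR_CGRP_LOCKS	64		/* assumption: sized to taste */

/* Each element needs mutex_init() at boot (not shown). */
static struct mutex cgrp_hashed_locks[NR_CGRP_LOCKS];

static struct mutex *cgrp_lock_ptr(struct cgroup *cgrp)
{
	/* hash_ptr() spreads object pointers evenly over the slots */
	return &cgrp_hashed_locks[hash_ptr(cgrp, ilog2(NR_CGRP_LOCKS))];
}

static void cgrp_lock(struct cgroup *cgrp)
{
	mutex_lock(cgrp_lock_ptr(cgrp));
}

static void cgrp_unlock(struct cgroup *cgrp)
{
	mutex_unlock(cgrp_lock_ptr(cgrp));
}

Unrelated cgroups usually hash to different slots, so they stop contending
with each other. An operation that spans several cgroups, such as one
migration pulling tasks out of multiple source cgroups, would have to take
all the relevant slots in a fixed order to avoid deadlock, which is the
"a bit nasty" part noted above.
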

* Re: MM global locks as core counts quadruple
From: David Rientjes @ 2024-06-24 21:44 UTC (permalink / raw)
To: Tejun Heo
Cc: Namhyung Kim, Andrew Morton, Liam R. Howlett, Suren Baghdasaryan,
    Matthew Wilcox, Christoph Lameter, Paul E. McKenney, Johannes Weiner,
    Davidlohr Bueso, linux-mm

On Sun, 23 Jun 2024, Tejun Heo wrote:
> As for cgroup_mutex, it's more complicated as the usage is more spread
> out, but yeah, the only solution there too would be going for finer
> grained locking, whether that's hierarchical or hashed.

Thanks all for the great discussion in the thread so far!

Beyond the discussion of the cgroup locks above, we also discussed
increasing the number of zones within a NUMA node. I'm thinking that this
would actually be an implementation detail, i.e. we wouldn't need to
change any user visible interfaces like /proc/zoneinfo. IOW, we could have
64 16GB ZONE_NORMALs spanning 1TB of memory, and we could sum up the
memory resident across all of those when describing the memory to
userspace.

Anybody else working on any of the following, or have thoughts/ideas for
how they could be improved as core counts increase?

 - list_lrus_mutex
 - pcpu_drain_mutex
 - shrinker_mutex (formerly shrinker_rwsem)
 - vmap_purge_lock

Also, any favorite benchmarks that people use with high core counts to
measure the improvement when generic MM locks become more sharded? I can
imagine running will-it-scale on platforms with >= 256 cores per socket,
but if there are specific stress tests that can help quantify the impact,
that would be great to know about.

Thread overview: 10+ messages

2024-06-21  0:35 MM global locks as core counts quadruple  David Rientjes
2024-06-21  2:01 ` Matthew Wilcox
2024-06-21  2:46 ` Yafang Shao
2024-06-21  2:54   ` Matthew Wilcox
2024-06-26 19:38     ` Karim Manaouil
2024-06-27  5:36       ` Christoph Lameter (Ampere)
2024-06-21 19:10 ` Tejun Heo
2024-06-21 21:37   ` Namhyung Kim
2024-06-23 17:59     ` Tejun Heo
2024-06-24 21:44       ` David Rientjes