* MM global locks as core counts quadruple
From: David Rientjes @ 2024-06-21 0:35 UTC (permalink / raw)
To: Andrew Morton, Liam R. Howlett, Suren Baghdasaryan,
    Matthew Wilcox, Christoph Lameter, Paul E. McKenney, Tejun Heo,
    Johannes Weiner, Davidlohr Bueso
Cc: Namhyung Kim, linux-mm

Hi all,

As core counts rapidly expand over the next four years, Namhyung and I
have been looking at global locks that we're already seeing high
contention on even today.

Some of these are not MM specific:
 - cgroup_mutex
 - cgroup_threadgroup_rwsem
 - tasklist_lock
 - kernfs_mutex (although this should now be substantially better with the
   kernfs_locks array)

Others *are* MM specific:
 - list_lrus_mutex
 - pcpu_drain_mutex
 - shrinker_mutex (formerly shrinker_rwsem)
 - vmap_purge_lock
 - slab_mutex

This is only looking at fleet data for global static locks, not locks like
zone->lock that get dynamically allocated.

(mmap_lock was substantially improved by per-vma locking, although it does
still show up for very large vmas.)

Couple of questions:

(1) How are people quantifying these pain points, if at all, in synthetic
    testing? Any workloads or benchmarks that are really good at doing
    this in the lab beyond the traditional will-it-scale? (The above is
    from production data; a minimal sketch of the kind of stressor meant
    here follows below.)

(2) Is anybody working on any of the above global locks? Trying to
    surface gaps for locks that will likely become even more painful in
    the coming years.
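
For illustration, here is a minimal sketch of the kind of synthetic
stressor meant in (1): N threads creating and tearing down cgroup v2
directories, which serializes on cgroup_mutex. This is only an
illustrative sketch, not an established benchmark; the mount point,
thread count, and iteration count are assumptions, and it needs root:

/*
 * Hypothetical stressor: each thread repeatedly creates and removes its
 * own cgroup v2 directory, which takes cgroup_mutex in the kernel on
 * both the mkdir and the rmdir path.  Assumes cgroup2 is mounted at
 * /sys/fs/cgroup and that we may create directories there (run as root).
 * Build with something like: cc -O2 -pthread stress-cgroup.c
 */
#include <pthread.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define NTHREADS   256		/* assumption: match the core count under test */
#define ITERATIONS 10000

static void *worker(void *arg)
{
	long id = (long)arg;
	char path[128];

	snprintf(path, sizeof(path), "/sys/fs/cgroup/lockstress-%ld", id);
	for (int i = 0; i < ITERATIONS; i++) {
		if (mkdir(path, 0755))		/* cgroup creation */
			perror("mkdir");
		if (rmdir(path))		/* cgroup destruction */
			perror("rmdir");
	}
	return NULL;
}

int main(void)
{
	pthread_t threads[NTHREADS];

	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, worker, (void *)i);
	for (long i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);
	return 0;
}

How badly this stops scaling as the thread count grows, together with lock
profiling of cgroup_mutex while it runs, is the kind of signal meant here.
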
Thanks!

* Re: MM global locks as core counts quadruple
From: Matthew Wilcox @ 2024-06-21 2:01 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Liam R. Howlett, Suren Baghdasaryan, Christoph Lameter,
    Paul E. McKenney, Tejun Heo, Johannes Weiner, Davidlohr Bueso,
    Namhyung Kim, linux-mm

On Thu, Jun 20, 2024 at 05:35:45PM -0700, David Rientjes wrote:
> - slab_mutex

This is weird. You must have very strange workloads running ;-) It
protects the list of slabs. So it's taken for

 - Slab creation / destruction
 - CPU online/offline
 - Memory hotplug
 - Reading /proc/slabinfo

All of these should be rare events, but don't seem to be for you?

* Re: MM global locks as core counts quadruple
From: Yafang Shao @ 2024-06-21 2:46 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Liam R. Howlett, Suren Baghdasaryan, Matthew Wilcox,
    Christoph Lameter, Paul E. McKenney, Tejun Heo, Johannes Weiner,
    Davidlohr Bueso, Namhyung Kim, linux-mm

On Fri, Jun 21, 2024 at 8:50 AM David Rientjes <rientjes@google.com> wrote:
> Some of these are not MM specific:
> - cgroup_mutex

We previously discussed replacing cgroup_mutex with RCU in some critical
paths [0], but I haven't had the time to implement it yet.

[0]. https://lore.kernel.org/bpf/CALOAHbAgWoFGSc=uF5gFWXmsALECUaGGScQuXpRcwjgzv+TPGQ@mail.gmail.com/

> This is only looking at fleet data for global static locks, not locks like
> zone->lock that get dynamically allocated.

We have encountered latency spikes caused by the zone->lock. Do we have
any plans to eliminate this lock, such as implementing a lockless buddy
list? I believe this could be a viable solution.

--
Regards
Yafang
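
The mutex-to-RCU conversion referred to above generally takes the following
shape for a read-mostly lookup path. This is a generic sketch with made-up
names (struct foo, foo_idr), not the actual cgroup_mutex call sites:

#include <linux/idr.h>
#include <linux/mutex.h>
#include <linux/printk.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {				/* made-up object */
	int id;
	struct rcu_head rcu;
};

static DEFINE_MUTEX(foo_mutex);		/* the global lock being relieved */
static DEFINE_IDR(foo_idr);

/* Before: readers and writers all serialize on foo_mutex. */
struct foo *foo_lookup_locked(int id)
{
	struct foo *f;

	mutex_lock(&foo_mutex);
	f = idr_find(&foo_idr, id);
	mutex_unlock(&foo_mutex);
	return f;
}

/* After: readers run lock-free under RCU; only writers take the mutex. */
void foo_use(int id)
{
	struct foo *f;

	rcu_read_lock();
	f = idr_find(&foo_idr, id);
	if (f)
		pr_info("found foo %d\n", f->id);	/* must stay inside the
							   RCU section, or take a
							   reference (elided) */
	rcu_read_unlock();
}

void foo_remove(struct foo *f)
{
	mutex_lock(&foo_mutex);		/* writers still serialize */
	idr_remove(&foo_idr, f->id);
	mutex_unlock(&foo_mutex);
	kfree_rcu(f, rcu);		/* freed only after readers drain */
}

Whether this is workable for a given cgroup_mutex path of course depends on
that path being read-mostly and on the object lifetime rules allowing the
free to be deferred behind a grace period.
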

* Re: MM global locks as core counts quadruple
From: Matthew Wilcox @ 2024-06-21 2:54 UTC (permalink / raw)
To: Yafang Shao
Cc: David Rientjes, Andrew Morton, Liam R. Howlett, Suren Baghdasaryan,
    Christoph Lameter, Paul E. McKenney, Tejun Heo, Johannes Weiner,
    Davidlohr Bueso, Namhyung Kim, linux-mm

On Fri, Jun 21, 2024 at 10:46:21AM +0800, Yafang Shao wrote:
> We have encountered latency spikes caused by the zone->lock. Do we have
> any plans to eliminate this lock, such as implementing a lockless buddy
> list? I believe this could be a viable solution.

Lockless how? I see three operations being performed on the buddy list:

 - Add to end (freeing)
 - Remove from end (allocation)
 - Remove from middle (buddy was freed, needs to be coalesced)

I don't see how we can handle this locklessly.

I think a more productive solution to contention on the LRU lock is to
increase the number of zones. I don't think it's helpful to have a 1TB
zone of memory. Maybe we should limit each zone to 16GB or so. That means
we'd need to increase the number of zones we support, but I think that's
doable.
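
Spelled out, all three of those are plain doubly linked list updates done
under zone->lock today, and it is the third one that has to reach into the
middle of a list, which is the part that is hard to do locklessly. A
simplified illustration follows (illustrative only, not the kernel's real
free-list code):

#include <stddef.h>

struct page_stub {			/* stand-in for a free page */
	struct page_stub *prev, *next;
};

struct free_list {			/* circular list with a sentinel head */
	struct page_stub head;
};

static void free_list_init(struct free_list *fl)
{
	fl->head.prev = fl->head.next = &fl->head;
}

/* 1. Add to end: freeing a page appends it to the tail. */
static void free_page_to_list(struct free_list *fl, struct page_stub *p)
{
	p->prev = fl->head.prev;
	p->next = &fl->head;
	fl->head.prev->next = p;
	fl->head.prev = p;
}

/* 2. Remove from end: allocation pops the tail. */
static struct page_stub *alloc_page_from_list(struct free_list *fl)
{
	struct page_stub *p = fl->head.prev;

	if (p == &fl->head)
		return NULL;		/* list is empty */
	p->prev->next = &fl->head;
	fl->head.prev = p->prev;
	return p;
}

/* 3. Remove from middle: this page's buddy was freed, so it must be
 *    unlinked from wherever it sits and merged into a larger block. */
static void unlink_for_coalesce(struct page_stub *p)
{
	p->prev->next = p->next;
	p->next->prev = p->prev;
}

Operations 1 and 2 only ever touch the ends of the list; operation 3 can
touch any node, which is why the whole structure is kept consistent with a
single lock today.
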

* Re: MM global locks as core counts quadruple
From: Karim Manaouil @ 2024-06-26 19:38 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Yafang Shao, David Rientjes, Andrew Morton, Liam R. Howlett,
    Suren Baghdasaryan, Christoph Lameter, Paul E. McKenney, Tejun Heo,
    Johannes Weiner, Davidlohr Bueso, Namhyung Kim, linux-mm

On Fri, Jun 21, 2024 at 03:54:31AM +0100, Matthew Wilcox wrote:
> I think a more productive solution to contention on the LRU lock is to
> increase the number of zones. I don't think it's helpful to have a 1TB
> zone of memory. Maybe we should limit each zone to 16GB or so. That means
> we'd need to increase the number of zones we support, but I think that's
> doable.

What do you mean by zones? The usual ZONE_{DMA|DMA32|NORMAL|HIGHMEM}?
But those have historically existed to deal with physical addressing
limitations (e.g. DMA for devices that can't deal with addresses larger
than 16MiB, or HIGHMEM on 32-bit to access physical memory beyond what
could be directly mapped by the kernel).

Maybe you mean turning ZONE_NORMAL into an array, with each entry pointing
to a smaller ZONE_NORMAL region of, say, 64GiB. Or it could be divided by
the number of CPUs within the NUMA node, and each CPU would be given one
ZONE_NORMAL segment with a fallback list to other CPUs' segments in case
it runs out of memory. Does that make sense?

Cheers
Karim
PhD Student
Edinburgh University

* Re: MM global locks as core counts quadruple
From: Christoph Lameter (Ampere) @ 2024-06-27 5:36 UTC (permalink / raw)
To: Karim Manaouil
Cc: Matthew Wilcox, Yafang Shao, David Rientjes, Andrew Morton,
    Liam R. Howlett, Suren Baghdasaryan, Paul E. McKenney, Tejun Heo,
    Johannes Weiner, Davidlohr Bueso, Namhyung Kim, linux-mm

On Wed, 26 Jun 2024, Karim Manaouil wrote:
> Maybe you mean turning ZONE_NORMAL into an array, with each entry pointing
> to a smaller ZONE_NORMAL region of, say, 64GiB. Or it could be divided by
> the number of CPUs within the NUMA node, and each CPU would be given one
> ZONE_NORMAL segment with a fallback list to other CPUs' segments in case
> it runs out of memory. Does that make sense?

More zones means longer zonelists for the page allocator to walk during
memory allocation. VM statistics are also kept per zone, so splitting
zones would spread those counters out and decrease cacheline contention
on them. That would be good for scaling VM things in general.

* Re: MM global locks as core counts quadruple
From: Tejun Heo @ 2024-06-21 19:10 UTC (permalink / raw)
To: David Rientjes
Cc: Andrew Morton, Liam R. Howlett, Suren Baghdasaryan, Matthew Wilcox,
    Christoph Lameter, Paul E. McKenney, Johannes Weiner, Davidlohr Bueso,
    Namhyung Kim, linux-mm

Hello,

On Thu, Jun 20, 2024 at 05:35:45PM -0700, David Rientjes wrote:
> Some of these are not MM specific:
> - cgroup_mutex

Our machines are getting bigger but we aren't creating and destroying
cgroups frequently enough for this to matter. But yeah, I can see how this
can be a problem.

> - cgroup_threadgroup_rwsem

This one shouldn't matter at all in setups where new cgroups are populated
with CLONE_INTO_CGROUP and not migrated further. The lock isn't grabbed in
such usage pattern, which should be the vast majority already, I think. Are
you guys migrating tasks a lot or not using CLONE_INTO_CGROUP?

Thanks.

--
tejun
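
For reference, placing a child directly into a target cgroup at fork time
looks roughly like the sketch below. Hedged notes: the cgroup path is an
assumption, error handling is minimal, and CLONE_INTO_CGROUP needs a
cgroup2 hierarchy plus a v5.7+ kernel and matching UAPI headers:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/sched.h>	/* struct clone_args, CLONE_INTO_CGROUP */
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* glibc has no clone3() wrapper, so invoke the syscall directly. */
static pid_t clone_into_cgroup(int cgroup_fd)
{
	struct clone_args args;

	memset(&args, 0, sizeof(args));
	args.flags = CLONE_INTO_CGROUP;
	args.exit_signal = SIGCHLD;
	args.cgroup = cgroup_fd;	/* fd of the target cgroup2 directory */

	return syscall(__NR_clone3, &args, sizeof(args));
}

int main(void)
{
	/* Assumed path; any cgroup2 directory the caller may attach to works. */
	int cgfd = open("/sys/fs/cgroup/workers", O_RDONLY | O_DIRECTORY);
	pid_t pid;

	if (cgfd < 0)
		return 1;

	pid = clone_into_cgroup(cgfd);
	if (pid < 0)
		return 1;
	if (pid == 0) {
		/* Child: already a member of the target cgroup. */
		execlp("sleep", "sleep", "10", (char *)NULL);
		_exit(1);
	}
	waitpid(pid, NULL, 0);
	return 0;
}

Because the child starts out in the destination cgroup, it never has to be
migrated afterwards, which is the usage pattern described above where
cgroup_threadgroup_rwsem is not taken.
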

* Re: MM global locks as core counts quadruple
From: Namhyung Kim @ 2024-06-21 21:37 UTC (permalink / raw)
To: Tejun Heo
Cc: David Rientjes, Andrew Morton, Liam R. Howlett, Suren Baghdasaryan,
    Matthew Wilcox, Christoph Lameter, Paul E. McKenney, Johannes Weiner,
    Davidlohr Bueso, linux-mm

Hello,

On Fri, Jun 21, 2024 at 12:10 PM Tejun Heo <tj@kernel.org> wrote:
> > - cgroup_threadgroup_rwsem
>
> This one shouldn't matter at all in setups where new cgroups are populated
> with CLONE_INTO_CGROUP and not migrated further. The lock isn't grabbed in
> such usage pattern, which should be the vast majority already, I think. Are
> you guys migrating tasks a lot or not using CLONE_INTO_CGROUP?

I'm afraid there are still some use cases in Google that migrate processes
and/or threads between cgroups. :(

Thanks,
Namhyung

* Re: MM global locks as core counts quadruple
From: Tejun Heo @ 2024-06-23 17:59 UTC (permalink / raw)
To: Namhyung Kim
Cc: David Rientjes, Andrew Morton, Liam R. Howlett, Suren Baghdasaryan,
    Matthew Wilcox, Christoph Lameter, Paul E. McKenney, Johannes Weiner,
    Davidlohr Bueso, linux-mm

Hello,

On Fri, Jun 21, 2024 at 02:37:43PM -0700, Namhyung Kim wrote:
> I'm afraid there are still some use cases in Google that migrate processes
> and/or threads between cgroups. :(

I see. I wonder whether we can turn this into a per-cgroup lock. It's not
straightforward, though. It's protecting migration against forking and
exiting, and the only way to turn it into a per-cgroup lock would be tying
it to the source cgroup, as that's the only thing identifiable from the
fork and exit paths. The problem is that a single atomic migration
operation can pull tasks from multiple cgroups into one destination
cgroup, even on cgroup2, due to threaded cgroups. This would be pretty
rare on cgroup2 but still needs to be handled, which means grabbing
multiple locks from the migration path. Not the end of the world, but a
bit nasty.

But, as long as it's well encapsulated and out of line, I don't see
problems with such an approach.

As for cgroup_mutex, it's more complicated as the usage is more spread
out, but yeah, the only solution there too would be going for finer
grained locking, whether that's hierarchical or hashed.

Thanks.

--
tejun
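
The hashed variant mentioned above is essentially the pattern kernfs now
uses for its open-file mutexes (the kernfs_locks array from the first
mail): a fixed array of locks indexed by a hash of the object pointer. A
generic sketch with made-up names, not a proposed patch:

#include <linux/hash.h>
#include <linux/log2.h>
#include <linux/mutex.h>

struct cgroup;				/* opaque here; sketch only */

#define NR_CGRP_LOCKS	64		/* assumption: sized to taste */

/* Each element needs mutex_init() at boot (not shown). */
static struct mutex cgrp_hashed_locks[NR_CGRP_LOCKS];

static struct mutex *cgrp_lock_ptr(struct cgroup *cgrp)
{
	/* hash_ptr() spreads object pointers evenly over the slots */
	return &cgrp_hashed_locks[hash_ptr(cgrp, ilog2(NR_CGRP_LOCKS))];
}

static void cgrp_lock(struct cgroup *cgrp)
{
	mutex_lock(cgrp_lock_ptr(cgrp));
}

static void cgrp_unlock(struct cgroup *cgrp)
{
	mutex_unlock(cgrp_lock_ptr(cgrp));
}

Unrelated cgroups usually hash to different slots, so they stop contending
with each other. An operation that spans several cgroups, such as one
migration pulling tasks out of multiple source cgroups, would have to take
all the relevant slots in a fixed order to avoid deadlock, which is the
"a bit nasty" part noted above.
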

* Re: MM global locks as core counts quadruple
From: David Rientjes @ 2024-06-24 21:44 UTC (permalink / raw)
To: Tejun Heo
Cc: Namhyung Kim, Andrew Morton, Liam R. Howlett, Suren Baghdasaryan,
    Matthew Wilcox, Christoph Lameter, Paul E. McKenney, Johannes Weiner,
    Davidlohr Bueso, linux-mm

On Sun, 23 Jun 2024, Tejun Heo wrote:
> As for cgroup_mutex, it's more complicated as the usage is more spread
> out, but yeah, the only solution there too would be going for finer
> grained locking, whether that's hierarchical or hashed.

Thanks all for the great discussion in the thread so far!

Beyond the discussion of the cgroup locks above, we also discussed
increasing the number of zones within a NUMA node. I'm thinking that this
would actually be an implementation detail, i.e. we wouldn't need to
change any user visible interfaces like /proc/zoneinfo. IOW, we could have
64 16GB ZONE_NORMALs spanning 1TB of memory, and we could sum up the
memory resident across all of those when describing the memory to
userspace.

Anybody else working on any of the following, or have thoughts/ideas for
how they could be improved as core counts increase?

 - list_lrus_mutex
 - pcpu_drain_mutex
 - shrinker_mutex (formerly shrinker_rwsem)
 - vmap_purge_lock

Also, any favorite benchmarks that people use with high core counts to
measure the improvement when generic MM locks become more sharded? I can
imagine running will-it-scale on platforms with >= 256 cores per socket,
but if there are specific stress tests that can help quantify the impact,
that would be great to know about.

Thread overview: 10+ messages

2024-06-21  0:35 MM global locks as core counts quadruple  David Rientjes
2024-06-21  2:01 ` Matthew Wilcox
2024-06-21  2:46 ` Yafang Shao
2024-06-21  2:54   ` Matthew Wilcox
2024-06-26 19:38     ` Karim Manaouil
2024-06-27  5:36       ` Christoph Lameter (Ampere)
2024-06-21 19:10 ` Tejun Heo
2024-06-21 21:37   ` Namhyung Kim
2024-06-23 17:59     ` Tejun Heo
2024-06-24 21:44       ` David Rientjes