* [LSF/MM/BPF Topic] Performance improvement for Memory Cgroups
@ 2025-03-19 6:19 Shakeel Butt
2025-03-19 8:49 ` [Lsf-pc] " Christian Brauner
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Shakeel Butt @ 2025-03-19 6:19 UTC (permalink / raw)
To: linux-mm, lsf-pc
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
Vlastimil Babka, Yosry Ahmed, Meta kernel team
A bit late, but let me still propose a session on topics related to memory
cgroups. Last year at LSFMM 2024, we discussed [1] the potential deprecation
of memcg v1. Since then we have made very good progress in that regard: we
have moved the v1-only code into a separate file that is not compiled by
default, added warnings to many v1-only interfaces, and removed a lot of
v1-only code. This year, I want to focus on the performance of memory
cgroups, particularly improving the cost of charging and stats.
At a high level we can partition memory charging into three cases: first is
user memory (anon & file), second is kernel memory (mostly SLUB), and third
is network memory. For network memory, [2] has described some of the
challenges. Similarly, for kernel memory we had to revert patches where
memcg charging was too expensive [3,4].
I want to discuss and brainstorm different ways to further optimize memcg
charging for all these types of memory. At the moment I am prototyping
multi-memcg support for per-cpu memcg stocks and would like to see what
else we can do.
One additional interesting observation from our fleet is that the cost of
memory charging increases for the users of memory.low and memory.min. Basically
propagate_protected_usage() becomes very prominently visible in the perf
traces.
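Roughly, the cost comes from the fact that, once memory.min/low is set, every
charge or uncharge in the protected subtree ends up doing atomic updates on
counters shared with all of its siblings. A simplified userspace model of
that shape (illustrative only -- names and details do not match the actual
mm/page_counter.c code):

/*
 * Simplified model: "min" protection only, and only the parts that matter
 * for the cost argument.
 */
#include <stdatomic.h>

struct counter {
        _Atomic long usage;              /* pages charged to this group */
        long min;                        /* memory.min configured by the admin */
        _Atomic long protected_usage;    /* cached min(usage, min) */
        _Atomic long children_protected; /* sum over children, read by reclaim */
        struct counter *parent;
};

static void propagate_protected(struct counter *c, long usage)
{
        long prot = usage < c->min ? usage : c->min;
        long old = atomic_exchange(&c->protected_usage, prot);

        /*
         * This add hits a cache line shared by every sibling on every CPU,
         * on every charge/uncharge, which is what makes
         * propagate_protected_usage() show up in profiles.
         */
        if (prot != old && c->parent)
                atomic_fetch_add(&c->parent->children_protected, prot - old);
}

static void charge(struct counter *c, long nr_pages)
{
        for (; c; c = c->parent) {
                long new_usage = atomic_fetch_add(&c->usage, nr_pages) + nr_pages;

                /* only hierarchies using memory.min/low pay the extra cost */
                if (c->min)
                        propagate_protected(c, new_usage);
        }
}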
Other than charging, the memcg stats infra is also very expensive and a lot
of CPU time in our fleet is spent maintaining these stats. Memcg stats use
the rstat infrastructure, which is designed for fast updates and slow
readers. The updaters put the cgroup on a per-cpu update tree while the
stats readers flush the update trees of all CPUs. For memcg, the flushes
have become very expensive and over the years we have added ratelimiting to
limit the cost. I want to discuss what else we can do to further improve
the memcg stats.
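To make the update/flush asymmetry concrete, here is a rough model of the
shape of the problem (again illustrative only; the real code keeps a per-cpu
tree of updated cgroups and flushes the hierarchy, but the cost structure is
the same):

/*
 * Rough model: updaters touch only per-CPU state; a reader pays
 * O(nr_cpus * nr_stats) per cgroup at flush time.
 */
#include <stdatomic.h>

#define MODEL_NR_CPUS   256
#define MODEL_NR_STATS  64

struct group_stats {
        /* written locklessly, each CPU only touches its own row */
        long percpu_delta[MODEL_NR_CPUS][MODEL_NR_STATS];
        /* "is this group queued on that CPU's updated tree?" */
        _Atomic int queued[MODEL_NR_CPUS];
        /* coherent totals, only up to date after a flush */
        long value[MODEL_NR_STATS];
};

/* Hot path: bump a per-CPU delta and queue the group at most once per CPU. */
static void stat_update(struct group_stats *g, int cpu, int idx, long delta)
{
        g->percpu_delta[cpu][idx] += delta;
        if (!atomic_load(&g->queued[cpu]))
                atomic_store(&g->queued[cpu], 1);
}

/*
 * Slow path: the reader folds every queued CPU's deltas into the totals.
 * This is the part that became too expensive to do on every stats read.
 */
static void stat_flush(struct group_stats *g)
{
        for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
                if (!atomic_exchange(&g->queued[cpu], 0))
                        continue;
                for (int i = 0; i < MODEL_NR_STATS; i++) {
                        g->value[i] += g->percpu_delta[cpu][i];
                        g->percpu_delta[cpu][i] = 0;
                }
        }
}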
Other than the performance of charging and memcg stats, time permitting, we
can discuss other memcg topics like new features or something still lacking.
[1] https://lwn.net/Articles/974575/
[2] https://lore.kernel.org/all/20250307055936.3988572-1-shakeel.butt@linux.dev/
[3] 3754707bcc3e ("Revert "memcg: enable accounting for file lock caches"")
[4] 0bcfe68b8767 ("Revert "memcg: enable accounting for pollfd and select bits arrays"")
* Re: [Lsf-pc] [LSF/MM/BPF Topic] Performance improvement for Memory Cgroups
2025-03-19 6:19 [LSF/MM/BPF Topic] Performance improvement for Memory Cgroups Shakeel Butt
@ 2025-03-19 8:49 ` Christian Brauner
2025-03-20 5:02 ` Balbir Singh
2025-03-20 6:22 ` Harry Yoo
2 siblings, 0 replies; 6+ messages in thread
From: Christian Brauner @ 2025-03-19 8:49 UTC (permalink / raw)
To: Shakeel Butt
Cc: linux-mm, lsf-pc, Johannes Weiner, Michal Hocko, Roman Gushchin,
Muchun Song, Vlastimil Babka, Yosry Ahmed, Meta kernel team
On Tue, Mar 18, 2025 at 11:19:42PM -0700, Shakeel Butt wrote:
> A bit late, but let me still propose a session on topics related to memory
> cgroups. Last year at LSFMM 2024, we discussed [1] the potential deprecation
> of memcg v1. Since then we have made very good progress in that regard: we
> have moved the v1-only code into a separate file that is not compiled by
> default, added warnings to many v1-only interfaces, and removed a lot of
> v1-only code. This year, I want to focus on the performance of memory
> cgroups, particularly improving the cost of charging and stats.
>
> At a high level we can partition memory charging into three cases: first is
> user memory (anon & file), second is kernel memory (mostly SLUB), and third
> is network memory. For network memory, [2] has described some of the
> challenges. Similarly, for kernel memory we had to revert patches where
> memcg charging was too expensive [3,4].
>
> I want to discuss and brainstorm different ways to further optimize memcg
> charging for all these types of memory. At the moment I am prototyping
> multi-memcg support for per-cpu memcg stocks and would like to see what
> else we can do.
>
> One additional interesting observation from our fleet is that the cost of
> memory charging increases for the users of memory.low and memory.min. Basically
> propagate_protected_usage() becomes very prominently visible in the perf
> traces.
>
> Other than charging, the memcg stats infra is also very expensive and a lot
IIRC, it also slows down opening files significantly, as we discussed
last year. So I'm very interested in improvements in this area as well.
> of CPU time in our fleet is spent maintaining these stats. Memcg stats use
> the rstat infrastructure, which is designed for fast updates and slow
> readers. The updaters put the cgroup on a per-cpu update tree while the
> stats readers flush the update trees of all CPUs. For memcg, the flushes
> have become very expensive and over the years we have added ratelimiting
> to limit the cost. I want to discuss what else we can do to further
> improve the memcg stats.
>
> Other than the performance of charging and memcg stats, time permitting, we
> can discuss other memcg topics like new features or something still lacking.
>
> [1] https://lwn.net/Articles/974575/
> [2] https://lore.kernel.org/all/20250307055936.3988572-1-shakeel.butt@linux.dev/
> [3] 3754707bcc3e ("Revert "memcg: enable accounting for file lock caches"")
> [4] 0bcfe68b8767 ("Revert "memcg: enable accounting for pollfd and select bits arrays"")
> _______________________________________________
> Lsf-pc mailing list
> Lsf-pc@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/lsf-pc
* Re: [LSF/MM/BPF Topic] Performance improvement for Memory Cgroups
2025-03-19 6:19 [LSF/MM/BPF Topic] Performance improvement for Memory Cgroups Shakeel Butt
2025-03-19 8:49 ` [Lsf-pc] " Christian Brauner
@ 2025-03-20 5:02 ` Balbir Singh
2025-03-21 17:57 ` Shakeel Butt
2025-03-20 6:22 ` Harry Yoo
2 siblings, 1 reply; 6+ messages in thread
From: Balbir Singh @ 2025-03-20 5:02 UTC (permalink / raw)
To: Shakeel Butt, linux-mm, lsf-pc
Cc: Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song,
Vlastimil Babka, Yosry Ahmed, Meta kernel team
On 3/19/25 17:19, Shakeel Butt wrote:
> A bit late, but let me still propose a session on topics related to memory
> cgroups. Last year at LSFMM 2024, we discussed [1] the potential deprecation
> of memcg v1. Since then we have made very good progress in that regard: we
> have moved the v1-only code into a separate file that is not compiled by
> default, added warnings to many v1-only interfaces, and removed a lot of
> v1-only code. This year, I want to focus on the performance of memory
> cgroups, particularly improving the cost of charging and stats.
I'd be very interested in the discussion; I am not there in person, FYI.
>
> At a high level we can partition memory charging into three cases: first is
> user memory (anon & file), second is kernel memory (mostly SLUB), and third
> is network memory. For network memory, [2] has described some of the
> challenges. Similarly, for kernel memory we had to revert patches where
> memcg charging was too expensive [3,4].
>
> I want to discuss and brainstorm different ways to further optimize memcg
> charging for all these types of memory. At the moment I am prototyping
> multi-memcg support for per-cpu memcg stocks and would like to see what
> else we can do.
>
What do you mean by multi-memcg support? Does it mean creating those buckets
per CPU?
> One additional interesting observation from our fleet is that the cost of
> memory charging increases for the users of memory.low and memory.min. Basically
> propagate_protected_usage() becomes very prominently visible in the perf
> traces.
>
> Other than charging, the memcg stats infra is also very expensive and a lot
> of CPU time in our fleet is spent maintaining these stats. Memcg stats use
> the rstat infrastructure, which is designed for fast updates and slow
> readers. The updaters put the cgroup on a per-cpu update tree while the
> stats readers flush the update trees of all CPUs. For memcg, the flushes
> have become very expensive and over the years we have added ratelimiting
> to limit the cost. I want to discuss what else we can do to further
> improve the memcg stats.
>
Generally anything per-CPU scales well for writes, but summing up the stats
is very expensive. I personally think we might need to consider cases where
the limits we enforce allow a certain amount of delta; the watermarks in v2
are a good step in that direction. The one API I've struggled with in v2 is
mem_cgroup_handle_over_high(). Ideally, I expected it to act as a soft limit
that, when overrun and the max is hit, would cause an OOM if needed.
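To make the "certain amount of delta" point concrete, what I have in mind is
something in the spirit of a batched per-cpu counter, where each CPU may
drift by a bounded amount before touching the shared total. A minimal sketch
(not the existing percpu_counter API, just the trade-off):

/*
 * Sketch: each CPU accumulates up to BATCH locally before touching the
 * shared atomic, so writers scale and a cheap read is within
 * nr_cpus * BATCH of the truth.
 */
#include <stdatomic.h>

#define MODEL_NR_CPUS   256
#define BATCH           64

struct fuzzy_counter {
        _Atomic long total;             /* approximate shared value */
        long pending[MODEL_NR_CPUS];    /* per-CPU slot, owner-written */
};

static void fuzzy_add(struct fuzzy_counter *c, int cpu, long delta)
{
        long pending = c->pending[cpu] + delta;

        if (pending >= BATCH || pending <= -BATCH) {
                atomic_fetch_add(&c->total, pending);   /* rare shared update */
                pending = 0;
        }
        c->pending[cpu] = pending;
}

/*
 * Cheap read with bounded error; an exact read would have to sum every
 * CPU's pending slot, which is the expensive part we want to avoid.
 */
static long fuzzy_read(struct fuzzy_counter *c)
{
        return atomic_load(&c->total);
}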
Balbir
* Re: [LSF/MM/BPF Topic] Performance improvement for Memory Cgroups
2025-03-19 6:19 [LSF/MM/BPF Topic] Performance improvement for Memory Cgroups Shakeel Butt
2025-03-19 8:49 ` [Lsf-pc] " Christian Brauner
2025-03-20 5:02 ` Balbir Singh
@ 2025-03-20 6:22 ` Harry Yoo
2025-03-31 18:02 ` Vlastimil Babka
2 siblings, 1 reply; 6+ messages in thread
From: Harry Yoo @ 2025-03-20 6:22 UTC (permalink / raw)
To: Shakeel Butt
Cc: linux-mm, lsf-pc, Johannes Weiner, Michal Hocko, Roman Gushchin,
Muchun Song, Vlastimil Babka, Yosry Ahmed, Meta kernel team
On Tue, Mar 18, 2025 at 11:19:42PM -0700, Shakeel Butt wrote:
> A bit late, but let me still propose a session on topics related to memory
> cgroups. Last year at LSFMM 2024, we discussed [1] the potential deprecation
> of memcg v1. Since then we have made very good progress in that regard: we
> have moved the v1-only code into a separate file that is not compiled by
> default, added warnings to many v1-only interfaces, and removed a lot of
> v1-only code. This year, I want to focus on the performance of memory
> cgroups, particularly improving the cost of charging and stats.
>
> At a high level we can partition memory charging into three cases: first is
> user memory (anon & file), second is kernel memory (mostly SLUB), and third
> is network memory. For network memory, [2] has described some of the
> challenges. Similarly, for kernel memory we had to revert patches where
> memcg charging was too expensive [3,4].
>
> I want to discuss and brainstorm different ways to further optimize memcg
> charging for all these types of memory. At the moment I am prototyping
> multi-memcg support for per-cpu memcg stocks and would like to see what
> else we can do.
For slab memory, I have an idea:
Deferring the uncharging of slab objects on free until the CPU slab and
per-CPU partial slabs are moved to the per-node partial slab list
might be beneficial.
Something like:
0. The SLUB allocator defers uncharging of freed objects if the slab they
belong to is the CPU slab or is on the per-CPU partial slab list.
1. memcg_slab_post_alloc_hook() does:
1.1 Skips charging, if the object is already charged to the same
memcg and has not been uncharged yet.
1.2 Uncharges the object if it is charged to a different memcg
and then charges it to current memcg.
1.3 Charges the object if it is not currently charged to any memcg.
2. deactivate_slab() and __put_partials() uncharge free objects
that have not been uncharged yet before moving them to the per-node
partial slab list.
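In rough C, the post-alloc hook side (cases 1.1-1.3) could look something
like the sketch below, where obj_slot() is a stand-in for wherever the
deferred objcg pointer would live (e.g. in the slab object extensions);
none of this exists today:

/*
 * Sketch only: objcg refcounting, error handling and the gfp flags of the
 * original allocation are all glossed over. obj_slot() is hypothetical.
 */
static void post_alloc_hook_sketch(void *obj, struct obj_cgroup *cur_objcg,
                                   size_t size)
{
        struct obj_cgroup **slot = obj_slot(obj);       /* hypothetical helper */

        /* 1.1: still charged to the current memcg, nothing to do */
        if (*slot == cur_objcg)
                return;

        /* 1.2: charged to a different memcg, uncharge that one first */
        if (*slot)
                obj_cgroup_uncharge(*slot, size);

        /* 1.2/1.3: charge the current memcg and remember it in the slot */
        if (!obj_cgroup_charge(cur_objcg, GFP_KERNEL, size))
                *slot = cur_objcg;
        else
                *slot = NULL;
}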
Unless 1) we have tasks belonging to many different memcgs on each CPU
(I'm not an expert on the scheduler's interaction with cgroups, though),
or 2) load balancing migrates tasks between CPUs too frequently,
many allocations should hit case 1.1 (Oh, it's already charged to the same
memcg so skip charging) in the hot path, right?
Some experiments are needed to determine whether this idea is actually
beneficial.
Or has a similar approach been tried before?
--
Cheers,
Harry
> One additional interesting observation from our fleet is that the cost of
> memory charging increases for the users of memory.low and memory.min. Basically
> propagate_protected_usage() becomes very prominently visible in the perf
> traces.
>
> Other than charging, the memcg stats infra is also very expensive and a lot
> of CPU time in our fleet is spent maintaining these stats. Memcg stats use
> the rstat infrastructure, which is designed for fast updates and slow
> readers. The updaters put the cgroup on a per-cpu update tree while the
> stats readers flush the update trees of all CPUs. For memcg, the flushes
> have become very expensive and over the years we have added ratelimiting
> to limit the cost. I want to discuss what else we can do to further
> improve the memcg stats.
>
> Other than the performance of charging and memcg stats, time permitting, we
> can discuss other memcg topics like new features or something still lacking.
>
> [1] https://lwn.net/Articles/974575/
> [2] https://lore.kernel.org/all/20250307055936.3988572-1-shakeel.butt@linux.dev/
> [3] 3754707bcc3e ("Revert "memcg: enable accounting for file lock caches"")
> [4] 0bcfe68b8767 ("Revert "memcg: enable accounting for pollfd and select bits arrays"")
>
* Re: [LSF/MM/BPF Topic] Performance improvement for Memory Cgroups
2025-03-20 5:02 ` Balbir Singh
@ 2025-03-21 17:57 ` Shakeel Butt
0 siblings, 0 replies; 6+ messages in thread
From: Shakeel Butt @ 2025-03-21 17:57 UTC (permalink / raw)
To: Balbir Singh
Cc: linux-mm, lsf-pc, Johannes Weiner, Michal Hocko, Roman Gushchin,
Muchun Song, Vlastimil Babka, Yosry Ahmed, Meta kernel team
On Thu, Mar 20, 2025 at 04:02:27PM +1100, Balbir Singh wrote:
> On 3/19/25 17:19, Shakeel Butt wrote:
> > A bit late, but let me still propose a session on topics related to memory
> > cgroups. Last year at LSFMM 2024, we discussed [1] the potential deprecation
> > of memcg v1. Since then we have made very good progress in that regard: we
> > have moved the v1-only code into a separate file that is not compiled by
> > default, added warnings to many v1-only interfaces, and removed a lot of
> > v1-only code. This year, I want to focus on the performance of memory
> > cgroups, particularly improving the cost of charging and stats.
>
> I'd be very interested in the discussion, I am not there in person, FYI
>
> >
> > At a high level we can partition memory charging into three cases: first is
> > user memory (anon & file), second is kernel memory (mostly SLUB), and third
> > is network memory. For network memory, [2] has described some of the
> > challenges. Similarly, for kernel memory we had to revert patches where
> > memcg charging was too expensive [3,4].
> >
> > I want to discuss and brainstorm different ways to further optimize memcg
> > charging for all these types of memory. At the moment I am prototyping
> > multi-memcg support for per-cpu memcg stocks and would like to see what
> > else we can do.
> >
>
> What do you mean by multi-memcg support? Does it means creating those buckets
> per cpu?
>
Multiple cached memcgs in struct memcg_stock_pcp. In [2] I prototyped a
network-specific per-cpu multi-memcg stock. However, I think we need
general support rather than something just for networking.
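Roughly, the direction is to turn the single cached memcg into a small array
of slots, something like the following (field and function names are made up
for illustration, not the actual struct layout):

/* Illustrative only -- not the actual struct memcg_stock_pcp layout. */
#define NR_STOCK_SLOTS  4

struct stock_slot {
        struct mem_cgroup *memcg;       /* which memcg this slot caches */
        unsigned int nr_pages;          /* pre-charged pages available locally */
};

struct multi_memcg_stock {
        local_lock_t lock;
        struct stock_slot slots[NR_STOCK_SLOTS];
};

/*
 * Fast path: if any slot already caches this memcg, consume from it and
 * skip the page_counter atomics entirely. On a miss we would fall back to
 * the normal charge path and possibly refill/evict a slot.
 */
static bool consume_stock_sketch(struct multi_memcg_stock *stock,
                                 struct mem_cgroup *memcg,
                                 unsigned int nr_pages)
{
        for (int i = 0; i < NR_STOCK_SLOTS; i++) {
                struct stock_slot *s = &stock->slots[i];

                if (s->memcg == memcg && s->nr_pages >= nr_pages) {
                        s->nr_pages -= nr_pages;
                        return true;
                }
        }
        return false;
}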
* Re: [LSF/MM/BPF Topic] Performance improvement for Memory Cgroups
2025-03-20 6:22 ` Harry Yoo
@ 2025-03-31 18:02 ` Vlastimil Babka
0 siblings, 0 replies; 6+ messages in thread
From: Vlastimil Babka @ 2025-03-31 18:02 UTC (permalink / raw)
To: Harry Yoo, Shakeel Butt
Cc: linux-mm, lsf-pc, Johannes Weiner, Michal Hocko, Roman Gushchin,
Muchun Song, Yosry Ahmed, Meta kernel team
On 3/20/25 07:22, Harry Yoo wrote:
> On Tue, Mar 18, 2025 at 11:19:42PM -0700, Shakeel Butt wrote:
>
> For slab memory, I have an idea:
>
> Deferring the uncharging of slab objects on free until the CPU slab and
> per-CPU partial slabs are moved to the per-node partial slab list
> might be beneficial.
>
> Something like:
>
> 0. The SLUB allocator defers uncharging of freed objects if the slab they
> belong to is the CPU slab or is on the per-CPU partial slab list.
>
> 1. memcg_slab_post_alloc_hook() does:
> 1.1 Skips charging, if the object is already charged to the same
> memcg and has not been uncharged yet.
> 1.2 Uncharges the object if it is charged to a different memcg
> and then charges it to current memcg.
> 1.3 Charges the object if it is not currently charged to any memcg.
>
> 2. deactivate_slab() and __put_partials() uncharge free objects
> that have not been uncharged yet before moving them to the per-node
> partial slab list.
>
> Unless 1) we have tasks belonging to many different memcgs on each CPU
> (I'm not an expert on the scheduler's interaction with cgroups, though),
> or 2) load balancing migrates tasks between CPUs too frequently,
>
> many allocations should hit case 1.1 (Oh, it's already charged to the same
> memcg so skip charging) in the hot path, right?
>
> Some experiments are needed to determine whether this idea is actually
> beneficial.
>
> Or has a similar approach been tried before?
I don't think so; it would have to be tried and measured. As I hinted in my
sheaves slot, I doubt step 0 above happens often enough.
In step 2 you would have to walk the slabs' freelists to check whether
anything is charged, right?