From: Harry Yoo <harry.yoo@oracle.com>
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: linux-mm@kvack.org, lsf-pc@lists.linux-foundation.org,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Muchun Song <muchun.song@linux.dev>,
	Vlastimil Babka <vbabka@suse.cz>,
	Yosry Ahmed <yosry.ahmed@linux.dev>,
	Meta kernel team <kernel-team@meta.com>
Subject: Re: [LSF/MM/BPF Topic] Performance improvement for Memory Cgroups
Date: Thu, 20 Mar 2025 15:22:44 +0900	[thread overview]
Message-ID: <Z9u0NGdfrHqn_G8j@harry> (raw)
In-Reply-To: <n6ucvbqqrbms3lhd562bshkxbn3gv43hjndjugjpy7jq2uej3o@3a4gmvhr2o5q>

On Tue, Mar 18, 2025 at 11:19:42PM -0700, Shakeel Butt wrote:
> A bit late, but let me still propose a session on topics related to memory
> cgroups. Last year at LSFMM 2024, we discussed [1] the potential
> deprecation of memcg v1. Since then we have made very good progress in that
> regard. We have moved the v1-only code into a separate file and made it not
> compile by default, added warnings to many v1-only interfaces, and removed
> a lot of v1-only code. This year, I want to focus on the performance of
> memory cgroups, particularly improving the cost of charging and stats.
> 
> At a high level, we can partition memory charging into three cases: first,
> user memory (anon & file); second, kernel memory (mostly slab); and third,
> network memory. For network memory, [1] has described some of the
> challenges. Similarly for kernel memory, we had to revert patches where memcg
> charging was too expensive [3,4].
> 
> I want to discuss and brainstorm different ways to further optimize
> memcg charging for all these types of memory. At the moment I am prototyping
> multi-memcg support for per-cpu memcg stocks and would like to see what else
> we can do.
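
(For context: the per-cpu stock currently caches pre-charged pages for a
single memcg at a time, so a multi-memcg variant would presumably keep a
small set of per-memcg slots on each CPU. A rough illustration follows;
the struct, field and helper names here are made up, not the actual
mm/memcontrol.c definitions.)

/*
 * Hypothetical sketch of a multi-memcg per-CPU stock: instead of caching
 * pre-charged pages for a single memcg, keep a few (memcg, nr_pages)
 * slots so that tasks from several memcgs on the same CPU can hit the
 * fast path.  None of these names are the real mm/memcontrol.c ones.
 */
#define NR_MEMCG_STOCK_SLOTS	4

struct memcg_stock_slot {
	struct mem_cgroup *cached;	/* memcg this slot holds a charge for */
	unsigned int nr_pages;		/* pre-charged pages left in the slot */
};

struct memcg_stock_pcp_multi {
	struct memcg_stock_slot slots[NR_MEMCG_STOCK_SLOTS];
};

/* Fast path: consume from a matching slot without touching page counters. */
static bool consume_stock_multi(struct memcg_stock_pcp_multi *stock,
				struct mem_cgroup *memcg,
				unsigned int nr_pages)
{
	int i;

	for (i = 0; i < NR_MEMCG_STOCK_SLOTS; i++) {
		struct memcg_stock_slot *slot = &stock->slots[i];

		if (slot->cached == memcg && slot->nr_pages >= nr_pages) {
			slot->nr_pages -= nr_pages;
			return true;
		}
	}
	return false;	/* fall back to the slow path: charge the page counters */
}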

For slab memory, I have an idea:

It might be beneficial to defer uncharging slab objects on free until the
CPU slab and per-CPU partial slabs are moved to the per-node partial slab
list.

Something like:

    0. The SLUB allocator defers uncharging freed objects if the slab they
       belong to is the CPU slab or is on the per-CPU partial slab list.

    1. memcg_slab_post_alloc_hook() does:
       1.1 Skips charging if the object is already charged to the same
           memcg and has not been uncharged yet.
       1.2 Uncharges the object if it is charged to a different memcg,
           then charges it to the current memcg.
       1.3 Charges the object if it is not currently charged to any memcg.

    2. deactivate_slab() and __put_partials() uncharge free objects
       that have not been uncharged yet before moving the slabs to the
       per-node partial slab list.

Unless 1) tasks belonging to many different memcgs run on each CPU
(I'm not an expert on the scheduler's interaction with cgroups, though),
or 2) load balancing migrates tasks between CPUs too frequently,
many allocations should hit case 1.1 ("it's already charged to the same
memcg, so skip charging") in the hot path, right?
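
To make this a bit more concrete, here is a rough pseudo-C sketch of the
three steps; every function and helper name below is hypothetical and only
stands in for the real obj_cgroup bookkeeping:

/*
 * Pseudo-C sketch of the scheme above.  All helpers here are hypothetical
 * stand-ins for the real obj_cgroup bookkeeping in SLUB.
 */

/* Step 0: on free, keep the object's objcg if its slab stays "hot". */
static void slab_free_defer_uncharge(struct kmem_cache *s, struct slab *slab,
				     void *object)
{
	if (slab_is_cpu_slab(s, slab) || slab_on_percpu_partial(s, slab))
		return;				/* defer the uncharge */

	uncharge_object(s, slab, object);
}

/* Step 1: at allocation, reuse a leftover charge when possible. */
static bool post_alloc_charge(struct kmem_cache *s, struct slab *slab,
			      void *object)
{
	struct obj_cgroup *cur = current_objcg();		/* current memcg */
	struct obj_cgroup *old = object_objcg(slab, object);	/* leftover */

	if (old == cur)
		return true;			/* 1.1: already charged, skip */

	if (old)
		uncharge_object(s, slab, object);	/* 1.2: drop old charge */

	return charge_object(s, slab, object, cur);	/* 1.2 / 1.3 */
}

/* Step 2: deactivate_slab() / __put_partials() flush deferred uncharges
 * before the slab goes to the per-node partial list. */
static void flush_deferred_uncharges(struct kmem_cache *s, struct slab *slab)
{
	void *object;

	for_each_free_object(object, s, slab)	/* hypothetical iterator */
		if (object_objcg(slab, object))
			uncharge_object(s, slab, object);
}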

Some experiments are needed to determine whether this idea is actually
beneficial.

Or has a similar approach been tried before?

-- 
Cheers,
Harry

> One additional interesting observation from our fleet is that the cost of
> memory charging increases for users of memory.low and memory.min. Basically,
> propagate_protected_usage() becomes very prominent in the perf traces.
> 
> Other than charging, the memcg stats infra is also very expensive, and a lot
> of CPU time in our fleet is spent maintaining these stats. Memcg stats use the
> rstat infrastructure, which is designed for fast updates and slow readers.
> Updaters put the cgroup on a per-cpu update tree, while stats readers
> flush the update trees of all the CPUs. For memcg, the flushes have become
> very expensive, and over the years we have added ratelimiting to limit the
> cost. I want to discuss what else we can do to further improve the memcg stats.
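
(As a rough illustration of the update/flush asymmetry described above; a
simplified sketch, not the actual rstat code, and the helpers are
hypothetical:)

/*
 * Simplified sketch of the rstat update/flush split.  Updates are cheap
 * per-CPU operations; readers pay for walking every CPU's update tree.
 */
static void memcg_stat_update(struct cgroup *cgrp, int idx, long delta)
{
	/* Record the delta per-CPU and link cgrp into this CPU's
	 * "updated" tree if it isn't there already.  Cheap on the
	 * common path. */
	percpu_add_delta(cgrp, idx, delta);		/* hypothetical */
	mark_cgroup_updated(cgrp, smp_processor_id());	/* hypothetical */
}

static void memcg_stat_flush(struct cgroup *cgrp)
{
	int cpu;

	/* The reader walks the update tree of every CPU and propagates
	 * the deltas up the hierarchy.  This is the expensive part that
	 * ratelimiting tries to amortize. */
	for_each_possible_cpu(cpu)
		propagate_cpu_deltas(cgrp, cpu);	/* hypothetical */
}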
> 
> Other than the performance of charging and memcg stats, time permitting, we
> can discuss other memcg topics, like new features or anything still lacking.
> 
> [1] https://lwn.net/Articles/974575/
> [2] https://lore.kernel.org/all/20250307055936.3988572-1-shakeel.butt@linux.dev/
> [3] 3754707bcc3e ("Revert "memcg: enable accounting for file lock caches"")
> [4] 0bcfe68b8767 ("Revert "memcg: enable accounting for pollfd and select bits arrays"")
> 

