linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Shakeel Butt <shakeel.butt@linux.dev>
To: JP Kobryn <inwardvessel@gmail.com>
Cc: mkoutny@suse.com, yosryahmed@google.com, hannes@cmpxchg.org,
	 tj@kernel.org, akpm@linux-foundation.org,
	linux-kernel@vger.kernel.org,  cgroups@vger.kernel.org,
	kernel-team@meta.com, linux-mm@kvack.org, bpf@vger.kernel.org
Subject: Re: [RFC PATCH] memcg: introduce kfuncs for fetching memcg stats
Date: Fri, 19 Sep 2025 22:17:49 -0700	[thread overview]
Message-ID: <ky2yjg6qrqf6hqych7v3usphpcgpcemsmfrb5ephc7bdzxo57b@6cxnzxap3bsc> (raw)
In-Reply-To: <20250920015526.246554-1-inwardvessel@gmail.com>

+linux-mm, bpf

Hi JP,

On Fri, Sep 19, 2025 at 06:55:26PM -0700, JP Kobryn wrote:
> The kernel has to perform a significant amount of the work when a user mode
> program reads the memory.stat file of a cgroup. Aside from flushing stats,
> there is overhead in the string formatting that is done for each stat. Some
> perf data is shown below from a program that reads memory.stat 1M times:
> 
> 26.75%  a.out [kernel.kallsyms] [k] vsnprintf
> 19.88%  a.out [kernel.kallsyms] [k] format_decode
> 12.11%  a.out [kernel.kallsyms] [k] number
> 11.72%  a.out [kernel.kallsyms] [k] string
>  8.46%  a.out [kernel.kallsyms] [k] strlen
>  4.22%  a.out [kernel.kallsyms] [k] seq_buf_printf
>  2.79%  a.out [kernel.kallsyms] [k] memory_stat_format
>  1.49%  a.out [kernel.kallsyms] [k] put_dec_trunc8
>  1.45%  a.out [kernel.kallsyms] [k] widen_string
>  1.01%  a.out [kernel.kallsyms] [k] memcpy_orig
> 
> As an alternative to reading memory.stat, introduce new kfuncs to allow
> fetching specific memcg stats from within bpf iter/cgroup-based programs.
> Reading stats in this manner avoids the overhead of the string formatting
> shown above.
> 
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>

Thanks for this but I feel like you are drastically under-selling the
potential of this work. This will not just reduce the cost of reading
stats but will also provide a lot of flexibility.

Large infra owners which use cgroup, spent a lot of compute on reading
stats (I know about Google & Meta) and even small optimizations becomes
significant at the fleet level.

Your perf profile is focusing only on kernel but I can see similar
operation in the userspace (i.e. from string to binary format) would be
happening in the real world workloads. I imagine with bpf we can
directly pass binary data to userspace or we can do custom serialization
(like protobuf or thrift) in the bpf program directly.

Beside string formatting, I think you should have seen open()/close() as
well in your perf profile. In your microbenchmark, did you read
memory.stat 1M times with the same fd and use lseek(0) between the reads
or did you open(), read() & close(). If you had done later one, then
open/close would be visible in the perf data as well. I know Google
implemented fd caching in their userspacecontainer library to reduce
their open/close cost. I imagine with this approach, we can avoid this
cost as well.

In terms of flexibility, I can see userspace can get the stats which it
needs rather than getting all the stats. In addition, userspace can
avoid flushing stats based on the fact that system is flushing the stats
every 2 seconds.

In your next version, please also include the sample bpf which uses
these kfuncs and also include the performance comparison between this
approach and the traditional reading memory.stat approach.

thanks,
Shakeel


       reply	other threads:[~2025-09-20  5:18 UTC|newest]

Thread overview: 2+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20250920015526.246554-1-inwardvessel@gmail.com>
2025-09-20  5:17 ` Shakeel Butt [this message]
2025-09-23 18:02   ` JP Kobryn

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ky2yjg6qrqf6hqych7v3usphpcgpcemsmfrb5ephc7bdzxo57b@6cxnzxap3bsc \
    --to=shakeel.butt@linux.dev \
    --cc=akpm@linux-foundation.org \
    --cc=bpf@vger.kernel.org \
    --cc=cgroups@vger.kernel.org \
    --cc=hannes@cmpxchg.org \
    --cc=inwardvessel@gmail.com \
    --cc=kernel-team@meta.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mkoutny@suse.com \
    --cc=tj@kernel.org \
    --cc=yosryahmed@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox