* Re: [RFC PATCH] memcg: introduce kfuncs for fetching memcg stats
[not found] <20250920015526.246554-1-inwardvessel@gmail.com>
@ 2025-09-20 5:17 ` Shakeel Butt
2025-09-23 18:02 ` JP Kobryn
0 siblings, 1 reply; 2+ messages in thread
From: Shakeel Butt @ 2025-09-20 5:17 UTC (permalink / raw)
To: JP Kobryn
Cc: mkoutny, yosryahmed, hannes, tj, akpm, linux-kernel, cgroups,
kernel-team, linux-mm, bpf
+linux-mm, bpf
Hi JP,
On Fri, Sep 19, 2025 at 06:55:26PM -0700, JP Kobryn wrote:
> The kernel has to perform a significant amount of work when a user mode
> program reads the memory.stat file of a cgroup. Aside from flushing stats,
> there is overhead in the string formatting that is done for each stat. Some
> perf data is shown below from a program that reads memory.stat 1M times:
>
> 26.75% a.out [kernel.kallsyms] [k] vsnprintf
> 19.88% a.out [kernel.kallsyms] [k] format_decode
> 12.11% a.out [kernel.kallsyms] [k] number
> 11.72% a.out [kernel.kallsyms] [k] string
> 8.46% a.out [kernel.kallsyms] [k] strlen
> 4.22% a.out [kernel.kallsyms] [k] seq_buf_printf
> 2.79% a.out [kernel.kallsyms] [k] memory_stat_format
> 1.49% a.out [kernel.kallsyms] [k] put_dec_trunc8
> 1.45% a.out [kernel.kallsyms] [k] widen_string
> 1.01% a.out [kernel.kallsyms] [k] memcpy_orig
>
> As an alternative to reading memory.stat, introduce new kfuncs to allow
> fetching specific memcg stats from within bpf iter/cgroup-based programs.
> Reading stats in this manner avoids the overhead of the string formatting
> shown above.
>
> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Thanks for this but I feel like you are drastically under-selling the
potential of this work. This will not just reduce the cost of reading
stats but will also provide a lot of flexibility.
Large infra owners which use cgroups spend a lot of compute on reading
stats (I know about Google & Meta), and even small optimizations become
significant at the fleet level.
Your perf profile focuses only on the kernel, but I expect a similar
operation (i.e. converting from string to binary format) happens in
userspace in real-world workloads. I imagine with bpf we can pass binary
data directly to userspace, or do custom serialization (like protobuf or
thrift) in the bpf program itself.
Besides string formatting, I think you should have seen open()/close() as
well in your perf profile. In your microbenchmark, did you read
memory.stat 1M times with the same fd, using lseek(0) between the reads,
or did you open(), read() & close() on each iteration? If you did the
latter, then open/close would be visible in the perf data as well. I know
Google implemented fd caching in their userspace container library to
reduce their open/close cost. I imagine with this approach, we can avoid
this cost as well.
In terms of flexibility, userspace can fetch just the stats it needs
rather than all of them. In addition, userspace can skip flushing stats
explicitly, relying on the fact that the system flushes them every 2
seconds.
In your next version, please also include a sample bpf program which
uses these kfuncs, along with a performance comparison between this
approach and the traditional approach of reading memory.stat.
thanks,
Shakeel
* Re: [RFC PATCH] memcg: introduce kfuncs for fetching memcg stats
2025-09-20 5:17 ` [RFC PATCH] memcg: introduce kfuncs for fetching memcg stats Shakeel Butt
@ 2025-09-23 18:02 ` JP Kobryn
0 siblings, 0 replies; 2+ messages in thread
From: JP Kobryn @ 2025-09-23 18:02 UTC (permalink / raw)
To: Shakeel Butt
Cc: mkoutny, yosryahmed, hannes, tj, akpm, linux-kernel, cgroups,
kernel-team, linux-mm, bpf
On 9/19/25 10:17 PM, Shakeel Butt wrote:
> +linux-mm, bpf
>
> Hi JP,
>
> On Fri, Sep 19, 2025 at 06:55:26PM -0700, JP Kobryn wrote:
>> The kernel has to perform a significant amount of work when a user mode
>> program reads the memory.stat file of a cgroup. Aside from flushing stats,
>> there is overhead in the string formatting that is done for each stat. Some
>> perf data is shown below from a program that reads memory.stat 1M times:
>>
>> 26.75% a.out [kernel.kallsyms] [k] vsnprintf
>> 19.88% a.out [kernel.kallsyms] [k] format_decode
>> 12.11% a.out [kernel.kallsyms] [k] number
>> 11.72% a.out [kernel.kallsyms] [k] string
>> 8.46% a.out [kernel.kallsyms] [k] strlen
>> 4.22% a.out [kernel.kallsyms] [k] seq_buf_printf
>> 2.79% a.out [kernel.kallsyms] [k] memory_stat_format
>> 1.49% a.out [kernel.kallsyms] [k] put_dec_trunc8
>> 1.45% a.out [kernel.kallsyms] [k] widen_string
>> 1.01% a.out [kernel.kallsyms] [k] memcpy_orig
>>
>> As an alternative to reading memory.stat, introduce new kfuncs to allow
>> fetching specific memcg stats from within bpf iter/cgroup-based programs.
>> Reading stats in this manner avoids the overhead of the string formatting
>> shown above.
>>
>> Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
>
> Thanks for this but I feel like you are drastically under-selling the
> potential of this work. This will not just reduce the cost of reading
> stats but will also provide a lot of flexibility.
>
> Large infra owners which use cgroups spend a lot of compute on reading
> stats (I know about Google & Meta), and even small optimizations become
> significant at the fleet level.
>
> Your perf profile focuses only on the kernel, but I expect a similar
> operation (i.e. converting from string to binary format) happens in
> userspace in real-world workloads. I imagine with bpf we can pass binary
> data directly to userspace, or do custom serialization (like protobuf or
> thrift) in the bpf program itself.
>
> Besides string formatting, I think you should have seen open()/close() as
> well in your perf profile. In your microbenchmark, did you read
> memory.stat 1M times with the same fd, using lseek(0) between the reads,
> or did you open(), read() & close() on each iteration? If you did the
> latter, then open/close would be visible in the perf data as well. I know
> Google implemented fd caching in their userspace container library to
> reduce their open/close cost. I imagine with this approach, we can avoid
> this cost as well.
In the test program, I opened once and used lseek() at the end of each
iteration. It's a good point though that user programs typically open
and close each time. I'll adjust the test program to match that
behavior.
>
> In terms of flexibility, userspace can fetch just the stats it needs
> rather than all of them. In addition, userspace can skip flushing stats
> explicitly, relying on the fact that the system flushes them every 2
> seconds.
That's true. The kfunc for flushing is made available but not required.
>
> In your next version, please also include a sample bpf program which
> uses these kfuncs, along with a performance comparison between this
> approach and the traditional approach of reading memory.stat.
Thanks for the good input. Will do.