From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7D235C433F5 for ; Tue, 1 Feb 2022 22:01:45 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id C2BB46B016C; Tue, 1 Feb 2022 17:01:44 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id BB38C6B016D; Tue, 1 Feb 2022 17:01:44 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A2D036B016E; Tue, 1 Feb 2022 17:01:44 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0050.hostedemail.com [216.40.44.50]) by kanga.kvack.org (Postfix) with ESMTP id 8DE896B016C for ; Tue, 1 Feb 2022 17:01:44 -0500 (EST) Received: from smtpin24.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay03.hostedemail.com (Postfix) with ESMTP id 4D9038124235 for ; Tue, 1 Feb 2022 22:01:44 +0000 (UTC) X-FDA: 79095583728.24.E52A92D Received: from mail-pf1-f182.google.com (mail-pf1-f182.google.com [209.85.210.182]) by imf31.hostedemail.com (Postfix) with ESMTP id D6CF720008 for ; Tue, 1 Feb 2022 22:01:43 +0000 (UTC) Received: by mail-pf1-f182.google.com with SMTP id a8so17083388pfa.6 for ; Tue, 01 Feb 2022 14:01:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=OxpeGnCMRb2rQW5rZ03moAHh1z0pOcq+D66lBOm/2zQ=; b=tROAmVCmaZa0UkybqESYBB5T4arVm0uYT9j/yZPUAcd19AI9za/S5XaF2NiUP1QyVe 5N4dfAsh8lan854LcZzzGtMTbjmmPwkVB9C+aJ1BQ/xF9WWXT++pk5OmdWjQIjkGE2TH QEJPyNkmv3F3guccY6dH/N5sTuJxrU4X1TtikzIJexs3lboiwMghQThY1Rm7S4s9qcA6 yvODBSpNSvRMpL/lRcpcVd6ZWK1ZEDScBS4A63N5cNrq+PRppKUbG0LT06hh6ba5uL9d CdITFB3zBOUaJxP1st8Sjnj2+wPSR0Ng/LAbB9LFZhxdSkYr4st13J5QObkjuMtzuyFL 9V5A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=OxpeGnCMRb2rQW5rZ03moAHh1z0pOcq+D66lBOm/2zQ=; b=JP4ZRKLup0sOLcz6or8zOE9F+sDVDlthP0Kjz1k4GDcYJWiVEFF1602WNNrni3zRut HfyT17xn1geOFOo5KYw8Yf3cA7wF8wDmK/SbnT1VtbZueJ6cgPKPwLnUaRNarcPzUYTP Vg7y7qRlBDFAD1np/nmveeySoPC+kek9o0QmTOlz4vlIN0GQPrXcKzn0cX5ByBpYFpWe nNSBij94oL6WeESSQpahP66Eri8pFLNtBk3AbqzDgYTPpSid/Oibb3aw1FWwoMdxR5Bf O9QjQcUOrdbt6+gxP2SD148gRUL0CPYFLtbOHO7+iX4q+8wMY4RUUxm7ZhnIysnnq9vB WTPA== X-Gm-Message-State: AOAM530HMW7di4Lq4lW1iaZ6sSG11CxNpobNP+wCrL30oRuimlUzDAYb BTLE+wMnwI7UAPCgZ6He/GbJWY1jo2YJiHYBwqWAnQ== X-Google-Smtp-Source: ABdhPJzEH29CbK4Ne5XS3eujdcYIOcQiWkQsXaVnpG8WXrFogIMnO38MjIH9aQMRiqo5HNo8E3HjxyoKY90rqQV5Tjs= X-Received: by 2002:a63:2bcc:: with SMTP id r195mr21983591pgr.561.1643752902448; Tue, 01 Feb 2022 14:01:42 -0800 (PST) MIME-Version: 1.0 References: <20220201200823.3283171-1-yosryahmed@google.com> In-Reply-To: From: Yosry Ahmed Date: Tue, 1 Feb 2022 14:01:06 -0800 Message-ID: Subject: Re: [PATCH] memcg: add per-memcg total kernel memory stat To: Johannes Weiner Cc: Andrew Morton , Michal Hocko , Muchun Song , Shakeel Butt , linux-kernel@vger.kernel.org, linux-mm@kvack.org Content-Type: multipart/alternative; boundary="000000000000da439e05d6fc0b98" X-Rspam-User: nil X-Rspamd-Server: rspam10 X-Rspamd-Queue-Id: D6CF720008 X-Stat-Signature: xcctdguukrx6so3ryqo3zsnnyp6pww9a Authentication-Results: imf31.hostedemail.com; dkim=pass header.d=google.com header.s=20210112 header.b=tROAmVCm; spf=pass (imf31.hostedemail.com: domain of yosryahmed@google.com designates 209.85.210.182 as permitted sender) smtp.mailfrom=yosryahmed@google.com; dmarc=pass (policy=reject) header.from=google.com X-HE-Tag: 1643752903-558364 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: --000000000000da439e05d6fc0b98 Content-Type: text/plain; charset="UTF-8" Hello Johannes, Thanks so much for your review. On Tue, Feb 1, 2022 at 1:00 PM Johannes Weiner wrote: > > Hello Yosry, > > On Tue, Feb 01, 2022 at 08:08:23PM +0000, Yosry Ahmed wrote: > > Currently memcg stats show several types of kernel memory: > > kernel stack, page tables, sock, vmalloc, and slab. > > However, there are other allocations with __GFP_ACCOUNT > > (or supersets such as GFP_KERNEL_ACCOUNT) that are not accounted > > in any of those stats, a few examples are: > > - various kvm allocations (e.g. allocated pages to create vcpus) > > - io_uring > > - tmp_page in pipes during pipe_write() > > - bpf ringbuffers > > - unix sockets > > > > Keeping track of the total kernel memory is essential for the ease of > > migration from cgroup v1 to v2 as there are large discrepancies between > > v1's kmem.usage_in_bytes and the sum of the available kernel memory stats > > in v2. Adding separate memcg stats for all __GFP_ACCOUNT kernel > > allocations is an impractical maintenance burden as there a lot of those > > all over the kernel code, with more use cases likely to show up in the > > future. > > No objection, I'm just curious how it makes migration to v2 easier in > particular. Or is it just that you've used the v1 stat to track > application regressions and would like to continue doing that in v2? We are using "memory.kmem.usage_in_bytes" in v1 to get the total kernel memory accounted to a memcg to maintain historical data of jobs memory (anon, file, kernel), and for debugging purposes. In v2 there is no equivalent. We found that the total of other existing v2 kernel memory stats (vmalloc, slab, stack, ..) is very different for some workloads compared to v1's "memory.kmem.usage_in_bytes". We need a v2 indicator of the total kernel memory accounted to a memcg." > > > Therefore, add a "kernel" memcg stat that is analogous to kmem > > page counter, with added benefits such as using rstat infrastructure > > which aggregates stats more efficiently. Additionally, this provides a > > lighter alternative in case the legacy kmem is deprecated in the future > > > > Signed-off-by: Yosry Ahmed > > --- > > Documentation/admin-guide/cgroup-v2.rst | 5 +++++ > > include/linux/memcontrol.h | 1 + > > mm/memcontrol.c | 24 ++++++++++++++++++------ > > 3 files changed, 24 insertions(+), 6 deletions(-) > > > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > > index 5aa368d165da..a0027d570a7f 100644 > > --- a/Documentation/admin-guide/cgroup-v2.rst > > +++ b/Documentation/admin-guide/cgroup-v2.rst > > @@ -1317,6 +1317,11 @@ PAGE_SIZE multiple when read back. > > vmalloc (npn) > > Amount of memory used for vmap backed memory. > > > > + kernel (npn) > > + Amount of total kernel memory, including > > + (kernel_stack, pagetables, percpu, vmalloc, slab) in > > + addition to other kernel memory use cases. > > + > > shmem > > Amount of cached filesystem data that is swap-backed, > > such as tmpfs, shm segments, shared anonymous mmap()s > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > > index b72d75141e12..fa51986365a4 100644 > > --- a/include/linux/memcontrol.h > > +++ b/include/linux/memcontrol.h > > @@ -34,6 +34,7 @@ enum memcg_stat_item { > > MEMCG_SOCK, > > MEMCG_PERCPU_B, > > MEMCG_VMALLOC, > > + MEMCG_KMEM, > > MEMCG_NR_STAT, > > }; > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > index 09d342c7cbd0..c55d7056ac98 100644 > > --- a/mm/memcontrol.c > > +++ b/mm/memcontrol.c > > @@ -1376,6 +1376,7 @@ static const struct memory_stat memory_stats[] = { > > { "percpu", MEMCG_PERCPU_B }, > > { "sock", MEMCG_SOCK }, > > { "vmalloc", MEMCG_VMALLOC }, > > + { "kernel", MEMCG_KMEM }, > > It's a superset of percpu, sock, vmalloc etc., so please move it ahead > of them. > > anon > file > kernel > kernel_stack > pagetables > ... > > and in the doc as well. > Done in v2. > > { "shmem", NR_SHMEM }, > > { "file_mapped", NR_FILE_MAPPED }, > > { "file_dirty", NR_FILE_DIRTY }, > > @@ -2979,6 +2980,19 @@ static void memcg_free_cache_id(int id) > > ida_simple_remove(&memcg_cache_ida, id); > > } > > > > +static void mem_cgroup_kmem_record(struct mem_cgroup *memcg, > > + int nr_pages) > > No real need for the namespace prefix since it's a static > function. How about account_kmem()? Avoids the line wrap, too. Does memcg_account_kmem() sound good? It avoids the line wrap too and is more consistent with most static functions in the file that have either a "memcg_" or "mem_cgroup_" prefix. Thanks! --000000000000da439e05d6fc0b98 Content-Type: text/html; charset="UTF-8" Content-Transfer-Encoding: quoted-printable
Hello Johannes,

Thanks so much for your review.
=
On Tue, Feb 1, 2022 at 1:00 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> Hello Y= osry,
>
> On Tue, Feb 01, 2022 at 08:08:23PM +0000, Yosry Ahmed= wrote:
> > Currently memcg stats show several types of kernel mem= ory:
> > kernel stack, page tables, sock, vmalloc, and slab.
&g= t; > However, there are other allocations with __GFP_ACCOUNT
> >= ; (or supersets such as GFP_KERNEL_ACCOUNT) that are not accounted
> = > in any of those stats, a few examples are:
> > - various kvm = allocations (e.g. allocated pages to create vcpus)
> > - io_uring<= br>> > - tmp_page in pipes during pipe_write()
> > - bpf rin= gbuffers
> > - unix sockets
> >
> > Keeping trac= k of the total kernel memory is essential for the ease of
> > migr= ation from cgroup v1 to v2 as there are large discrepancies between
>= > v1's kmem.usage_in_bytes and the sum of the available kernel memo= ry stats
> > in v2. Adding separate memcg stats for all __GFP_ACCO= UNT kernel
> > allocations is an impractical maintenance burden as= there a lot of those
> > all over the kernel code, with more use = cases likely to show up in the
> > future.
>
> No obje= ction, I'm just curious how it makes migration to v2 easier in
> = particular. Or is it just that you've used the v1 stat to track
>= application regressions and would like to continue doing that in v2?
We are using "memory.kmem.usage_in_bytes" in v1 to get the tota= l kernel memory accounted to a memcg to maintain historical data of jobs me= mory (anon, file, kernel), and for debugging purposes. In v2 there is no eq= uivalent.

We found that the total of other existing v2 kernel memor= y stats (vmalloc, slab, stack, ..) is very different for some workloads com= pared to v1's "memory.kmem.usage_in_bytes". We need a v2 indi= cator of the total kernel memory accounted to a memcg."

>> > Therefore, add a "kernel" memcg stat that is analogous= to kmem
> > page counter, with added benefits such as using rstat= infrastructure
> > which aggregates stats more efficiently. Addit= ionally, this provides a
> > lighter alternative in case the legac= y kmem is deprecated in the future
> >
> > Signed-off-by:= Yosry Ahmed <yosryahmed@google= .com>
> > ---
> > =C2=A0Documentation/admin-guide/= cgroup-v2.rst | =C2=A05 +++++
> > =C2=A0include/linux/memcontrol.h= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0| =C2=A01 +
> > = =C2=A0mm/memcontrol.c =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 | 24 ++++++++++++++++++------
> > = =C2=A03 files changed, 24 insertions(+), 6 deletions(-)
> >
>= ; > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation= /admin-guide/cgroup-v2.rst
> > index 5aa368d165da..a0027d570a7f 10= 0644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> >= ; +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1317,6 +13= 17,11 @@ PAGE_SIZE multiple when read back.
> > =C2=A0 =C2=A0 =C2= =A0 =C2=A0 vmalloc (npn)
> > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 Amount of memory used for vmap backed memory.
> >> > + =C2=A0 =C2=A0 =C2=A0 kernel (npn)
> > + =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 Amount of total kernel memory, including
&g= t; > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 (kernel_stack, pagetabl= es, percpu, vmalloc, slab) in
> > + =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 addition to other kernel memory use cases.
> > +
= > > =C2=A0 =C2=A0 =C2=A0 =C2=A0 shmem
> > =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 Amount of cached filesystem data that is sw= ap-backed,
> > =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 su= ch as tmpfs, shm segments, shared anonymous mmap()s
> > diff --git= a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > ind= ex b72d75141e12..fa51986365a4 100644
> > --- a/include/linux/memco= ntrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -34,6= +34,7 @@ enum memcg_stat_item {
> > =C2=A0 =C2=A0 =C2=A0 MEMCG_SO= CK,
> > =C2=A0 =C2=A0 =C2=A0 MEMCG_PERCPU_B,
> > =C2=A0 = =C2=A0 =C2=A0 MEMCG_VMALLOC,
> > + =C2=A0 =C2=A0 MEMCG_KMEM,
&g= t; > =C2=A0 =C2=A0 =C2=A0 MEMCG_NR_STAT,
> > =C2=A0};
> &= gt;
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> >= ; index 09d342c7cbd0..c55d7056ac98 100644
> > --- a/mm/memcontrol.= c
> > +++ b/mm/memcontrol.c
> > @@ -1376,6 +1376,7 @@ sta= tic const struct memory_stat memory_stats[] =3D {
> > =C2=A0 =C2= =A0 =C2=A0 { "percpu", =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 MEMCG_PERCPU_B =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0},
> > =C2=A0 =C2=A0 =C2=A0 { "= ;sock", =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 =C2=A0 MEMCG_SOCK =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0},
> > =C2=A0 =C2=A0 =C2=A0 { "vma= lloc", =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0MEMCG_VMALLOC =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0= =C2=A0 },
> > + =C2=A0 =C2=A0 { "kernel", =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 MEMCG_KMEM =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0},
= >
> It's a superset of percpu, sock, vmalloc etc., so please m= ove it ahead
> of them.
>
> =C2=A0 =C2=A0 =C2=A0 =C2=A0 a= non
> =C2=A0 =C2=A0 =C2=A0 =C2=A0 file
> =C2=A0 =C2=A0 =C2=A0 = =C2=A0 kernel
> =C2=A0 =C2=A0 =C2=A0 =C2=A0 kernel_stack
> =C2= =A0 =C2=A0 =C2=A0 =C2=A0 pagetables
> =C2=A0 =C2=A0 =C2=A0 =C2=A0 ...=
>
> and in the doc as well.
>

Done in v2.

= > > =C2=A0 =C2=A0 =C2=A0 { "shmem", =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0NR_SHMEM =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0},=
> > =C2=A0 =C2=A0 =C2=A0 { "file_mapped", =C2=A0 =C2=A0= =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0NR_FILE_MAPPED =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0},
> > =C2=A0 =C2=A0 = =C2=A0 { "file_dirty", =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 = =C2=A0 =C2=A0 NR_FILE_DIRTY =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 },
> > @@ -2979,6 +2980,19 @@ static void memcg_= free_cache_id(int id)
> > =C2=A0 =C2=A0 =C2=A0 ida_simple_remove(&= amp;memcg_cache_ida, id);
> > =C2=A0}
> >
> > +s= tatic void mem_cgroup_kmem_record(struct mem_cgroup *memcg,
> > + = =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2= =A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0 =C2=A0int nr_pages)
>
> No real= need for the namespace prefix since it's a static
> function. Ho= w about account_kmem()? Avoids the line wrap, too.

Does memcg_accoun= t_kmem() sound good? It avoids the line wrap too and is more consistent wit= h most static functions in the file that have either a "memcg_" o= r "mem_cgroup_" prefix.

Thanks!
--000000000000da439e05d6fc0b98--