From: Mina Almasry <almasrymina@google.com>
To: Johannes Weiner <hannes@cmpxchg.org>,
Yosry Ahmed <yosryahmed@google.com>
Cc: Tejun Heo <tj@kernel.org>, Yafang Shao <laoar.shao@gmail.com>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>, Martin Lau <kafai@fb.com>,
Song Liu <songliubraving@fb.com>, Yonghong Song <yhs@fb.com>,
john fastabend <john.fastabend@gmail.com>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@google.com>, Hao Luo <haoluo@google.com>,
jolsa@kernel.org, Michal Hocko <mhocko@kernel.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Shakeel Butt <shakeelb@google.com>,
Muchun Song <songmuchun@bytedance.com>,
Andrew Morton <akpm@linux-foundation.org>,
Zefan Li <lizefan.x@bytedance.com>,
Cgroups <cgroups@vger.kernel.org>,
netdev <netdev@vger.kernel.org>, bpf <bpf@vger.kernel.org>,
Linux MM <linux-mm@kvack.org>,
Dan Schatzberg <schatzberg.dan@gmail.com>,
Lennart Poettering <lennart@poettering.net>
Subject: Re: [RFD RESEND] cgroup: Persistent memory usage tracking
Date: Mon, 22 Aug 2022 14:52:48 -0700
Message-ID: <CAHS8izON5xo6GNmNAo_0121Hb=ikF7wjoh+44wU3M9Q2KOFdBg@mail.gmail.com>
In-Reply-To: <YwPy9hervVxfuuYE@cmpxchg.org>
On Mon, Aug 22, 2022 at 2:19 PM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Mon, Aug 22, 2022 at 12:02:48PM -0700, Mina Almasry wrote:
> > On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo <tj@kernel.org> wrote:
> > > b. Let userspace specify which cgroup to charge for some constructs like
> > > tmpfs and bpf maps. The key problems with this approach are
> > >
> > > 1. How to grant/deny what can be charged where. We must ensure that a
> > > descendant can't move charges up or across the tree without the
> > > ancestors allowing it.
> > >
> > > 2. How to specify the cgroup to charge. While specifying the target
> > > cgroup directly might seem like an obvious solution, it has a couple
> > > rather serious problems. First, if the descendant is inside a cgroup
> > > namespace, it might not be able to see the target cgroup at all. Second,
> > > it's an interface which is likely to cause misunderstandings on how it
> > > can be used. It's too broad an interface.
> > >
> >
> > This is pretty much the solution I sent out for review about a year
> > ago and yes, it suffers from the issues you've brought up:
> > https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/
> >
> >
> > > One solution that I can think of is leveraging the resource domain
> > > concept which is currently only used for threaded cgroups. All memory
> > > usages of threaded cgroups are charged to their resource domain cgroup
> > > which hosts the processes for those threads. The persistent usages have a
> > > similar pattern, so maybe the service level cgroup can declare that it's
> > > the encompassing resource domain and the instance cgroup can say whether
> > > it's gonna charge e.g. the tmpfs instance to its own or the encompassing
> > > resource domain.
> > >
> >
> > I think this sounds excellent and addresses our use cases. Basically
> > the tmpfs/bpf memory would get charged to the encompassing resource
> > domain cgroup rather than the instance cgroup, making the memory usage
> > of the first and second+ instances consistent and predictable.
> >
> > Would love to hear from other memcg folks what they would think of
> > such an approach. I would also love to hear what kind of interface you
> > have in mind. Perhaps a cgroup tunable that says whether it's going to
> > charge the tmpfs/bpf instance to itself or to the encompassing
> > resource domain?
>
> I like this too. It makes shared charging predictable, with a coherent
> resource hierarchy (congruent OOM, CPU, IO domains), and without the
> need for cgroup paths in tmpfs mounts or similar.
>
> As far as who is declaring what goes, though: if the instance groups
> can declare arbitrary files/objects persistent or shared, they'd be
> able to abuse this and sneak private memory past local limits and
> burden the wider persistent/shared domain with it.
>
> I'm thinking it might make more sense for the service level to declare
> which objects are persistent and shared across instances.
>
> If that's the case, we may not need a two-component interface. Just
> the ability for an intermediate cgroup to say: "This object's future
> memory is to be charged to me, not the instantiating cgroup."
>
> Can we require a process in the intermediate cgroup to set up the file
> or object, and use madvise/fadvise to say "charge me", before any
> instances are launched?

I think doing this at file granularity makes it logistically hard to
use, no? The service needs to create a file in the shared domain, and
all its instances need to reuse that exact file.

Our Kubernetes use case from [1] shares a mount between subtasks
rather than specific files. This allows subtasks to create files at
will in the mount, with the memory charged to the shared domain. I
imagine this is more convenient than a shared file.
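
To make the difference concrete, here is a toy Python model, not kernel
code. The Cgroup and TmpfsMount classes, the charge_to parameter, and
the file names and sizes are all made up for illustration; charge_to
stands in for whatever per-mount interface we might end up with. It
compares first-touch charging with charging everything in the mount to
the service-level cgroup.

```python
class Cgroup:
    """Toy cgroup: tracks only memory charged directly to it, in MiB."""

    def __init__(self, name):
        self.name = name
        self.charged = 0


class TmpfsMount:
    """Toy tmpfs mount.

    charge_to is a hypothetical per-mount setting: when set, every new file
    in the mount is charged to that cgroup instead of to the writer.
    """

    def __init__(self, charge_to=None):
        self.charge_to = charge_to
        self.files = {}  # path -> size in MiB, for files already resident

    def write(self, writer, path, size):
        if path in self.files:
            return  # already in memory: nobody is charged again
        target = self.charge_to if self.charge_to is not None else writer
        target.charged += size
        self.files[path] = size


def run(per_mount):
    """Replay the same three writes and return the per-cgroup charges."""
    service = Cgroup("service")
    inst1 = Cgroup("instance1")
    inst2 = Cgroup("instance2")
    mount = TmpfsMount(charge_to=service if per_mount else None)

    mount.write(inst1, "model.bin", 100)  # first instance populates the file
    mount.write(inst2, "model.bin", 100)  # second instance reuses it
    mount.write(inst2, "cache.idx", 50)   # second instance adds a new file

    return {cg.name: cg.charged for cg in (service, inst1, inst2)}


print(run(per_mount=False))  # {'service': 0, 'instance1': 100, 'instance2': 50}
print(run(per_mount=True))   # {'service': 150, 'instance1': 0, 'instance2': 0}
```

In this model, first-touch charging makes the first instance pay for the
shared file while the second does not. With the per-mount setting both
instances are charged nothing and the service-level cgroup carries all
150 MiB, including the file instance2 created on its own.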

Our other use case, which I hope to address here as well, is a
service-client relationship from [1] where the service would like to
charge per-client memory back to the client itself. In this case the
service or client can create a mount from the shared domain and pass
it to the service, at which point the service is free to create/remove
files in this mount as it sees fit.

Would you be open to a per-mount interface rather than a per-file
fadvise interface?

Yosry, would a proposal like this be extensible to address the bpf
charging issues?

[1] https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/