From: Mina Almasry <almasrymina@google.com>
To: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>, Michal Hocko <mhocko@suse.com>,
Yosry Ahmed <yosryahmed@google.com>,
Muchun Song <songmuchun@bytedance.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Yafang Shao <laoar.shao@gmail.com>,
Alexei Starovoitov <alexei.starovoitov@gmail.com>,
Matthew Wilcox <willy@infradead.org>,
Christoph Hellwig <hch@infradead.org>,
"David S. Miller" <davem@davemloft.net>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
Martin KaFai Lau <kafai@fb.com>, bpf <bpf@vger.kernel.org>,
Kernel Team <kernel-team@fb.com>, linux-mm <linux-mm@kvack.org>,
Christoph Lameter <cl@linux.com>,
Pekka Enberg <penberg@kernel.org>,
David Rientjes <rientjes@google.com>,
Joonsoo Kim <iamjoonsoo.kim@lge.com>,
Andrew Morton <akpm@linux-foundation.org>,
Vlastimil Babka <vbabka@suse.cz>
Subject: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
Date: Tue, 12 Jul 2022 12:11:48 -0700 [thread overview]
Message-ID: <CAHS8izPHjhTOXYTG5O4kpYUou51MDrUBEYb2SgFEP5vKZaOWtg@mail.gmail.com> (raw)
In-Reply-To: <CALvZod6Y3p1NZwSQe6+UWpY88iaOBrZXS5c5+uzMb+9sY1ziwg@mail.gmail.com>
On Tue, Jul 12, 2022 at 11:11 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> Ccing Mina who actually worked on upstreaming this. See [1] for
> previous discussion and more use-cases.
>
> [1] https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/
>
> On Tue, Jul 12, 2022 at 10:36 AM Tejun Heo <tj@kernel.org> wrote:
> >
> > Hello,
> >
> > On Tue, Jul 12, 2022 at 10:26:22AM -0700, Shakeel Butt wrote:
> > > One use-case we have is a build & test service which runs independent
> > > builds and tests but all the build utilities (compiler, linker,
> > > libraries) are shared between those builds and tests.
> > >
> > > In terms of topology, the service has a top level cgroup (P) and all
> > > independent builds and tests run in their own cgroup under P. These
> > > builds/tests continuously come and go.
> > >
> > > This service continuously monitors all the builds/tests running and
> > > may kill some based on some criteria which includes memory usage.
> > > However the memory usage is nondeterministic and killing a specific
> > > build/test may not really free memory if most of the memory charged to
> > > it is from shared build utilities.
> >
> > That doesn't sound too unusual. So, one saving grace here is that the memory
> > pressure in the stressed cgroup should trigger reclaim of the shared memory
> > which will be likely picked up by someone else, hopefully, under less memory
> > pressure. Can you give more concrete details? i.e. describe a failing
> > scenario with actual ballpark memory numbers?
>
> Mina, can you please provide details requested by Tejun?
>
As far as I am aware, the builds/tests service Shakeel mentioned is a
theoretical use case we're considering; the use cases we're actually
running are the three I listed in the cover letter of my original
proposal:
https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/
Still, the use case Shakeel is talking about is almost identical to
use case #2 in that proposal:
"Our infrastructure has large meta jobs such as kubernetes which spawn
multiple subtasks which share a tmpfs mount. These jobs and its
subtasks use that tmpfs mount for various purposes such as data
sharing or persistent data between the subtask restarts. In kubernetes
terminology, the meta job is similar to pods and subtasks are
containers under pods. We want the shared memory to be
deterministically charged to the kubernetes's pod and independent to
the lifetime of containers under the pod."
To run such a job, we do the following:
- We set up a hierarchy like so:

                 pod_container
                /      |      \
     container_a  container_b  container_c

- We set up a tmpfs mount with memcg=pod_container. This instructs the
kernel to charge all of this tmpfs's user data to pod_container,
instead of to the memcg of the task that faults in the shared memory.
- We set pod_container.max to the maximum amount of memory allowed for
the _entire_ job.
- We set container_a.max, container_b.max, and container_c.max to the
limits of sub-tasks a, b, and c respectively, not including the
shared memory, which is allocated via the tmpfs mount and charged
directly to pod_container.
For some rough numbers, you can imagine a scenario:

  tmpfs memcg=pod_container,size=100MB

                 pod_container.max=130MB
                /           |            \
  container_a.max=10MB container_b.max=20MB container_c.max=30MB

Thanks to memcg=pod_container, none of tasks a, b, or c is charged for
the shared memory, so they can stay within their 10MB, 20MB, and 30MB
limits respectively. This gives us fine-grained control: we can
deterministically charge the shared memory and apply limits both to
the memory usage of the individual sub-tasks and to the overall amount
of memory the entire pod may consume.
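
To make this concrete, here is a minimal sketch of how such a job
could be driven from userspace. It assumes cgroup v2 mounted at
/sys/fs/cgroup and the memcg= tmpfs mount option from the proposal
linked above; the paths, sizes, and the use of a small Python script
(rather than our actual tooling) are purely illustrative.

  #!/usr/bin/env python3
  # Illustrative sketch only: assumes cgroup v2 at /sys/fs/cgroup, the
  # memory controller already enabled in cgroup.subtree_control, and the
  # proposed tmpfs "memcg=" mount option; paths and sizes are made up.
  import os
  import subprocess

  CGROUP_ROOT = "/sys/fs/cgroup"
  POD = os.path.join(CGROUP_ROOT, "pod_container")

  def set_memory_max(cgroup_path, limit):
      # Write the cgroup v2 memory.max interface file.
      with open(os.path.join(cgroup_path, "memory.max"), "w") as f:
          f.write(limit)

  # Create the pod cgroup and one child cgroup per sub-task.
  os.makedirs(POD, exist_ok=True)
  set_memory_max(POD, "130M")          # limit for the _entire_ job
  for name, limit in (("container_a", "10M"),
                      ("container_b", "20M"),
                      ("container_c", "30M")):
      child = os.path.join(POD, name)
      os.makedirs(child, exist_ok=True)
      set_memory_max(child, limit)     # per-sub-task limit, shared tmpfs excluded

  # Mount the shared tmpfs so that all of its pages are charged to
  # pod_container instead of to whichever child faults them in.
  subprocess.run(["mount", "-t", "tmpfs",
                  "-o", f"memcg={POD},size=100M",
                  "tmpfs", "/mnt/shared"],
                 check=True)

The sub-tasks then run in their respective child cgroups and share
data through /mnt/shared; the 100MB of tmpfs pages is charged only to
pod_container, regardless of which sub-task touches it first.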
For transparency's sake, here are Johannes's comments on the API:
https://lore.kernel.org/linux-mm/YZvppKvUPTIytM%2Fc@cmpxchg.org/
As Tejun puts it:
"it may make sense to have a way to escape certain resources to an ancestor for
shared resources provided that we can come up with a sane interface"
The interface Johannes has opted for is to reparent memory to the
common ancestor _when it is accessed by a task in another memcg_.
This doesn't work for us for a few reasons, one being that, in the
example above, container_a may get charged for all 100MB of the
shared memory if it's the unlucky one that faults in all of it.
> >
> > FWIW, at least from a generic resource control standpoint, I think it may
> > make sense to have a way to escape certain resources to an ancestor for
> > shared resources provided that we can come up with a sane interface.
> >
> > Thanks.
> >
> > --
> > tejun