From: Mina Almasry <almasrymina@google.com>
Date: Tue, 12 Jul 2022 12:11:48 -0700
Subject: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
To: Shakeel Butt
Cc: Tejun Heo, Michal Hocko, Yosry Ahmed, Muchun Song, Johannes Weiner,
    Yafang Shao, Alexei Starovoitov, Matthew Wilcox, Christoph Hellwig,
    "David S. Miller", Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
    bpf, Kernel Team, linux-mm, Christoph Lameter, Pekka Enberg,
    David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Tue, Jul 12, 2022 at 11:11 AM Shakeel Butt wrote:
>
> Ccing Mina who actually worked on upstreaming this. See [1] for
> previous discussion and more use-cases.
>
> [1] https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/
>
> On Tue, Jul 12, 2022 at 10:36 AM Tejun Heo wrote:
> >
> > Hello,
> >
> > On Tue, Jul 12, 2022 at 10:26:22AM -0700, Shakeel Butt wrote:
> > > One use-case we have is a build & test service which runs independent
> > > builds and tests, but all the build utilities (compiler, linker,
> > > libraries) are shared between those builds and tests.
> > >
> > > In terms of topology, the service has a top-level cgroup (P) and all
> > > independent builds and tests run in their own cgroup under P. These
> > > builds/tests continuously come and go.
> > >
> > > This service continuously monitors all the builds/tests running and
> > > may kill some based on criteria which include memory usage.
> > > However the memory usage is nondeterministic and killing a specific
> > > build/test may not really free memory if most of the memory charged
> > > to it is from shared build utilities.
> >
> > That doesn't sound too unusual. So, one saving grace here is that the
> > memory pressure in the stressed cgroup should trigger reclaim of the
> > shared memory, which will likely be picked up by someone else,
> > hopefully under less memory pressure. Can you give more concrete
> > details? I.e., describe a failing scenario with actual ballpark
> > memory numbers?
>
> Mina, can you please provide details requested by Tejun?
>

As far as I am aware, the builds/tests service Shakeel mentioned is a
theoretical use case we're considering, but the actual use cases we're
running are the three I listed in the cover letter of my original
proposal:

https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/

Still, the use case Shakeel is talking about is almost identical to use
case #2 in that proposal:

"Our infrastructure has large meta jobs such as kubernetes which spawn
multiple subtasks which share a tmpfs mount. These jobs and its subtasks
use that tmpfs mount for various purposes such as data sharing or
persistent data between the subtask restarts. In kubernetes terminology,
the meta job is similar to pods and subtasks are containers under pods.
We want the shared memory to be deterministically charged to the
kubernetes's pod and independent to the lifetime of containers under the
pod."

To run such a job we do the following (a rough command-line sketch of
this setup is appended at the end of this mail):

- We set up a hierarchy like so:

                  pod_container
                 /      |      \
      container_a  container_b  container_c

- We set up a tmpfs mount with memcg=pod_container. This instructs the
  kernel to charge all of this tmpfs's user data to pod_container,
  instead of to the memcg of the task which faults in the shared memory.

- We set pod_container.max to the maximum amount of memory allowed to
  the _entire_ job.

- We set container_a.max, container_b.max, and container_c.max to the
  limits of sub-tasks a, b, and c respectively, not including the shared
  memory, which is allocated via the tmpfs mount and charged directly to
  pod_container.

For some rough numbers, you can imagine a scenario like:

  tmpfs memcg=pod_container,size=100MB

                  pod_container.max=130MB
                 /          |           \
  container_a.max=10MB  container_b.max=20MB  container_c.max=30MB

Thanks to memcg=pod_container, none of tasks a, b, and c is charged for
the shared memory, so they can stay within their 10MB, 20MB, and 30MB
limits respectively. This gives us fine-grained control: we can
deterministically charge the shared memory while applying limits both to
the memory usage of the individual sub-tasks and to the overall amount
of memory the entire pod may consume.

For transparency's sake, these are Johannes's comments on the API:
https://lore.kernel.org/linux-mm/YZvppKvUPTIytM%2Fc@cmpxchg.org/

As Tejun puts it: "it may make sense to have a way to escape certain
resources to an ancestor for shared resources provided that we can come
up with a sane interface"

The interface Johannes has opted for is to reparent memory to the common
ancestor _when it is accessed by a task in another memcg_. This doesn't
work for us for a few reasons, one of which is that, in the example
above, container_a may get charged for all 100MB of the shared memory if
it is the unlucky task that faults in all of the shared memory.

> >
> > FWIW, at least from generic resource control standpoint, I think it may
> > make sense to have a way to escape certain resources to an ancestor for
> > shared resources provided that we can come up with a sane interface.
> >
> > Thanks.
> >
> > --
> > tejun
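
As referenced above, here is a rough command-line sketch of the setup,
assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup. The paths and
sizes are illustrative, and the memcg= tmpfs mount option is the one
from my proposal linked above (it is not in mainline), so its exact
syntax should be read as a sketch rather than a merged interface:

  # Create the pod-level cgroup and cap the whole job at 130MB.
  mkdir /sys/fs/cgroup/pod_container
  echo 130M > /sys/fs/cgroup/pod_container/memory.max

  # Per-container cgroups with their own limits, which do not need to
  # account for the shared tmpfs data.
  mkdir /sys/fs/cgroup/pod_container/container_a
  mkdir /sys/fs/cgroup/pod_container/container_b
  mkdir /sys/fs/cgroup/pod_container/container_c
  echo 10M > /sys/fs/cgroup/pod_container/container_a/memory.max
  echo 20M > /sys/fs/cgroup/pod_container/container_b/memory.max
  echo 30M > /sys/fs/cgroup/pod_container/container_c/memory.max

  # Shared tmpfs whose pages are charged to pod_container rather than to
  # whichever container happens to fault them in (proposed memcg= option).
  mount -t tmpfs -o size=100M,memcg=/sys/fs/cgroup/pod_container \
        tmpfs /mnt/shared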