From: Mina Almasry <almasrymina@google.com>
Date: Tue, 12 Jul 2022 12:11:48 -0700
Subject: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
To: Shakeel Butt
Cc: Tejun Heo, Michal Hocko, Yosry Ahmed, Muchun Song, Johannes Weiner,
    Yafang Shao, Alexei Starovoitov, Matthew Wilcox, Christoph Hellwig,
    "David S. Miller", Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
    bpf, Kernel Team, linux-mm, Christoph Lameter, Pekka Enberg,
    David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka

On Tue, Jul 12, 2022 at 11:11 AM Shakeel Butt wrote:
>
> Ccing Mina who actually worked on upstreaming this. See [1] for
> previous discussion and more use-cases.
>
> [1] https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/
>
> On Tue, Jul 12, 2022 at 10:36 AM Tejun Heo wrote:
> >
> > Hello,
> >
> > On Tue, Jul 12, 2022 at 10:26:22AM -0700, Shakeel Butt wrote:
> > > One use-case we have is a build & test service which runs independent
> > > builds and tests, but all the build utilities (compiler, linker,
> > > libraries) are shared between those builds and tests.
> > >
> > > In terms of topology, the service has a top-level cgroup (P) and all
> > > independent builds and tests run in their own cgroup under P. These
> > > builds/tests continuously come and go.
> > >
> > > This service continuously monitors all the builds/tests running and
> > > may kill some based on criteria which include memory usage.
> > > However the memory usage is nondeterministic and killing a specific
> > > build/test may not really free memory if most of the memory charged
> > > to it is from shared build utilities.
> >
> > That doesn't sound too unusual. So, one saving grace here is that the
> > memory pressure in the stressed cgroup should trigger reclaim of the
> > shared memory, which will likely be picked up by someone else,
> > hopefully under less memory pressure. Can you give more concrete
> > details? I.e., describe a failing scenario with actual ballpark
> > memory numbers?
>
> Mina, can you please provide details requested by Tejun?
>

As far as I am aware, the builds/tests service Shakeel mentioned is a
theoretical use case we're considering, but the actual use cases we're
running are the three I listed in the cover letter of my original
proposal:

https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/

Still, the use case Shakeel is talking about is almost identical to use
case #2 in that proposal:

"Our infrastructure has large meta jobs such as kubernetes which spawn
multiple subtasks which share a tmpfs mount. These jobs and its subtasks
use that tmpfs mount for various purposes such as data sharing or
persistent data between the subtask restarts. In kubernetes terminology,
the meta job is similar to pods and subtasks are containers under pods.
We want the shared memory to be deterministically charged to the
kubernetes's pod and independent to the lifetime of containers under the
pod."

To run such a job we do the following (a rough command-line sketch of
this setup is appended at the end of this mail):

- We set up a hierarchy like so:

                  pod_container
                 /      |      \
      container_a  container_b  container_c

- We set up a tmpfs mount with memcg=pod_container. This instructs the
  kernel to charge all of this tmpfs's user data to pod_container,
  instead of to the memcg of the task which faults in the shared memory.

- We set pod_container.max to the maximum amount of memory allowed to
  the _entire_ job.

- We set container_a.max, container_b.max, and container_c.max to the
  limits of sub-tasks a, b, and c respectively, not including the shared
  memory, which is allocated via the tmpfs mount and charged directly to
  pod_container.

For some rough numbers, you can imagine a scenario like:

  tmpfs memcg=pod_container,size=100MB

                  pod_container.max=130MB
                 /          |           \
  container_a.max=10MB  container_b.max=20MB  container_c.max=30MB

Thanks to memcg=pod_container, none of tasks a, b, and c is charged for
the shared memory, so they can stay within their 10MB, 20MB, and 30MB
limits respectively. This gives us fine-grained control: we can
deterministically charge the shared memory while applying limits both to
the memory usage of the individual sub-tasks and to the overall amount
of memory the entire pod may consume.

For transparency's sake, these are Johannes's comments on the API:
https://lore.kernel.org/linux-mm/YZvppKvUPTIytM%2Fc@cmpxchg.org/

As Tejun puts it: "it may make sense to have a way to escape certain
resources to an ancestor for shared resources provided that we can come
up with a sane interface"

The interface Johannes has opted for is to reparent memory to the common
ancestor _when it is accessed by a task in another memcg_. This doesn't
work for us for a few reasons, one of which is that, in the example
above, container_a may get charged for all 100MB of the shared memory if
it is the unlucky task that faults in all of the shared memory.

> >
> > FWIW, at least from generic resource control standpoint, I think it may
> > make sense to have a way to escape certain resources to an ancestor for
> > shared resources provided that we can come up with a sane interface.
> >
> > Thanks.
> >
> > --
> > tejun
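
As referenced above, here is a rough command-line sketch of the setup,
assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup. The paths and
sizes are illustrative, and the memcg= tmpfs mount option is the one
from my proposal linked above (it is not in mainline), so its exact
syntax should be read as a sketch rather than a merged interface:

  # Create the pod-level cgroup and cap the whole job at 130MB.
  mkdir /sys/fs/cgroup/pod_container
  echo 130M > /sys/fs/cgroup/pod_container/memory.max

  # Per-container cgroups with their own limits, which do not need to
  # account for the shared tmpfs data.
  mkdir /sys/fs/cgroup/pod_container/container_a
  mkdir /sys/fs/cgroup/pod_container/container_b
  mkdir /sys/fs/cgroup/pod_container/container_c
  echo 10M > /sys/fs/cgroup/pod_container/container_a/memory.max
  echo 20M > /sys/fs/cgroup/pod_container/container_b/memory.max
  echo 30M > /sys/fs/cgroup/pod_container/container_c/memory.max

  # Shared tmpfs whose pages are charged to pod_container rather than to
  # whichever container happens to fault them in (proposed memcg= option).
  mount -t tmpfs -o size=100M,memcg=/sys/fs/cgroup/pod_container \
        tmpfs /mnt/shared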