From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yosry Ahmed <yosryahmed@google.com>
Date: Mon, 18 Jul 2022 10:55:59 -0700
Subject: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
To: Roman Gushchin
Cc: Michal Hocko, Yafang Shao, Alexei Starovoitov, Shakeel Butt,
 Matthew Wilcox, Christoph Hellwig, "David S. Miller", Daniel Borkmann,
 Andrii Nakryiko, Tejun Heo, Martin KaFai Lau, bpf, Kernel Team, linux-mm,
 Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
 Andrew Morton, Vlastimil Babka
References: <20220708174858.6gl2ag3asmoimpoe@macbook-pro-3.dhcp.thefacebook.com>
 <20220708215536.pqclxdqvtrfll2y4@google.com>
 <20220710073213.bkkdweiqrlnr35sv@google.com>
 <20220712043914.pxmbm7vockuvpmmh@macbook-pro-3.dhcp.thefacebook.com>
Content-Type: text/plain; charset="UTF-8"
On Tue, Jul 12, 2022 at 7:39 PM Roman Gushchin wrote:
>
> On Tue, Jul 12, 2022 at 11:52:11AM +0200, Michal Hocko wrote:
> > On Tue 12-07-22 16:39:48, Yafang Shao wrote:
> > > On Tue, Jul 12, 2022 at 3:40 PM Michal Hocko wrote:
> > [...]
> > > > > Roman already sent reparenting fix:
> > > > > https://patchwork.kernel.org/project/netdevbpf/patch/20220711162827.184743-1-roman.gushchin@linux.dev/
> > > >
> > > > Reparenting is nice but not a silver bullet. Consider a shallow
> > > > hierarchy where the charging happens in the first level under the
> > > > root memcg. Reparenting to the root just pushes everything under
> > > > the system resources category.
> > > >
> > >
> > > Agreed. That's why I don't like reparenting.
> > > Reparenting just reparents the charged pages and redirects new
> > > charges, but it can't reparent the 'limit' of the original memcg.
> > > So it is a risk if the original memcg is still being charged. We
> > > have to forbid the destruction of the original memcg.
>
> I agree, I also don't like reparenting for the !kmem case. For kmem
> (and *maybe* bpf maps as an exception), I don't think there is a
> better choice.
>
> > yes, I was toying with an idea like that. I guess we really want a
> > measure to keep cgroups around if they are bound to a resource which
> > is sticky itself. I am not sure how many other resources like BPF
> > (aka module like) we already charge to a memcg, but considering the
> > potential memory consumption, just reparenting will not help in the
> > general case, I am afraid.
>
> Well, then we have to make these objects first-class citizens in the
> cgroup API, like processes. E.g. introduce cgroup.bpf.maps,
> cgroup.mounts.tmpfs, etc. I can easily see some value here, but it's
> a big API change.
>
> With the current approach, where a bpf map pins the memory cgroup of
> the creator process (which I think is completely transparent for most
> bpf users), I don't think preventing the deletion of such a cgroup is
> possible. It would break too many things.
>
> But honestly I don't see why userspace can't handle it. If there is a
> cgroup which contains shared bpf maps, why would it delete it? It's a
> weird use case, and I don't think we have to optimize for it. Also,
> we do a ton of optimizations for live cgroups (e.g. css refcounting
> being percpu) which do not work for a deleted cgroup. So no one
> really should expect any properties from dying cgroups.

Just a random thought here, and I can easily be wrong (and this can
easily be the wrong thread for this): what if we introduce a more
generic concept for explicitly tying a resource (tmpfs, bpf maps,
etc.) to a cgroup through cgroupfs interfaces, and then prevent the
cgroup from being deleted unless the resource is freed or moved to a
different cgroup? This would be optional, so the current status quo is
maintained, but it also gives admins the flexibility to assign
resources to cgroups and make sure nothing is unaccounted, accounted
to a zombie memcg, or reparented to an unrelated parent.

This might be too fine-grained to be practical, but I thought it might
be useful. We would also need to define an OOM behavior for such
resources: things like bpf maps are unreclaimable, but tmpfs memory
can be swapped out.

I think this also partially addresses Johannes's concerns that the
memcg= mount option uses filesystem mounts to create shareable
resource domains outside of the cgroup hierarchy.

> Thanks!
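To make the explicit-binding idea above concrete, a cgroupfs interaction
could look roughly like the sketch below. Every knob shown here (the
`memory.bind` file, the `bpf_map:<id>` syntax) is invented purely for
illustration; no such interface exists today:

```
# Hypothetical sketch only -- these files and the bind syntax are
# made up to illustrate the proposal, not a real kernel interface.

# Explicitly tie bpf map 42 to the "app" cgroup; its memory is
# charged there from now on, regardless of the creator's cgroup:
echo "bpf_map:42" > /sys/fs/cgroup/app/memory.bind

# While the map is bound, the cgroup cannot be deleted, so its
# charges can never end up on a zombie memcg or be reparented to
# an unrelated parent:
rmdir /sys/fs/cgroup/app        # fails with EBUSY

# Moving the resource to another cgroup (or freeing the map)
# releases the pin:
echo "bpf_map:42" > /sys/fs/cgroup/other/memory.bind
rmdir /sys/fs/cgroup/app        # now succeeds
```

The point of the sketch is that the binding, not the creator's cgroup
membership at map-creation time, would decide where charges land and
which cgroup must stay alive.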