From mboxrd@z Thu Jan 1 00:00:00 1970
From: Yosry Ahmed
Date: Tue, 19 Jul 2022 11:00:53 -0700
Subject: Re: cgroup specific sticky resources (was: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.)
To: Michal Hocko
References: <20220708215536.pqclxdqvtrfll2y4@google.com> <20220710073213.bkkdweiqrlnr35sv@google.com> <20220712043914.pxmbm7vockuvpmmh@macbook-pro-3.dhcp.thefacebook.com>
Cc: Roman Gushchin, Yafang Shao, Alexei Starovoitov, Shakeel Butt, Matthew Wilcox, Christoph Hellwig,
    "David S. Miller", Daniel Borkmann, Andrii Nakryiko, Tejun Heo,
    Martin KaFai Lau, bpf, Kernel Team, linux-mm, Christoph Lameter,
    Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka
Content-Type: text/plain; charset="UTF-8"

On Tue, Jul 19, 2022 at 4:30 AM Michal Hocko wrote:
>
> On Mon 18-07-22 10:55:59, Yosry Ahmed wrote:
> > On Tue, Jul 12, 2022 at 7:39 PM Roman Gushchin wrote:
> > >
> > > On Tue, Jul 12, 2022 at 11:52:11AM +0200, Michal Hocko wrote:
> > > > On Tue 12-07-22 16:39:48, Yafang Shao wrote:
> > > > > On Tue, Jul 12, 2022 at 3:40 PM Michal Hocko wrote:
> > > > [...]
> > > > > > > Roman already sent a reparenting fix:
> > > > > > > https://patchwork.kernel.org/project/netdevbpf/patch/20220711162827.184743-1-roman.gushchin@linux.dev/
> > > > > >
> > > > > > Reparenting is nice but not a silver bullet. Consider a shallow
> > > > > > hierarchy where the charging happens in the first level under the
> > > > > > root memcg. Reparenting to the root just pushes everything under
> > > > > > the system resources category.
> > > > >
> > > > > Agreed. That's why I don't like reparenting.
> > > > > Reparenting only reparents the already charged pages and redirects
> > > > > new charges, but it can't reparent the 'limit' of the original
> > > > > memcg. So it is a risk if the original memcg is still being charged.
> > > > > We have to forbid the destruction of the original memcg.
> > >
> > > I agree, I also don't like reparenting for the !kmem case. For kmem
> > > (and *maybe* bpf maps is an exception), I don't think there is a
> > > better choice.
> > >
> > > > yes, I was toying with an idea like that.
> > > > I guess we really want a measure to keep cgroups around if they are
> > > > bound to a resource which is sticky itself. I am not sure how many
> > > > other BPF-like (aka module-like) resources we already charge to a
> > > > memcg, but considering the potential memory consumption, I am afraid
> > > > just reparenting will not help in the general case.
> > >
> > > Well, then we have to make these objects first-class citizens in the
> > > cgroup API, like processes. E.g. introduce cgroup.bpf.maps,
> > > cgroup.mounts.tmpfs etc. I can easily see some value here, but it's a
> > > big API change.
> > >
> > > With the current approach, where a bpf map pins the memory cgroup of
> > > the creator process (which I think is completely transparent for most
> > > bpf users), I don't think preventing the deletion of such a cgroup is
> > > possible. It will break too many things.
> > >
> > > But honestly I don't see why userspace can't handle it. If there is a
> > > cgroup which contains shared bpf maps, why would it delete it? It's a
> > > weird use case, and I don't think we have to optimize for it. Also, we
> > > do a ton of optimizations for live cgroups (e.g. css refcounting being
> > > percpu) which do not work for a deleted cgroup. So no one should really
> > > expect any properties from dying cgroups.
> >
> > Just a random thought here, and I can easily be wrong (and this can
> > easily be the wrong thread for this), but what if we introduce a more
> > generic concept to explicitly tie a resource to a cgroup (tmpfs, bpf
> > maps, etc.) using cgroupfs interfaces, and then prevent the cgroup from
> > being deleted unless the resource is freed or moved to a different
> > cgroup?
>
> My understanding is that Tejun would prefer a user space defined policy
> with proper layering. And I would tend to agree that this is less prone
> to corner cases.
>
> Anyway, how would you envision such an interface?

I imagine something like cgroup.sticky.[bpf/tmpfs/..] (I suck at naming
things). The file would contain a list of bpf map ids, tmpfs inode ids,
or mount paths. You can write a new id/path to the file to add one, or
read the file to see the current list. Basically very similar semantics
to cgroup.procs, except that instead of processes we have different types
of sticky resources that usually outlive processes. The charging of such
resources would be deterministically attributed to this cgroup (instead
of to the cgroup of the process that created the map or touched the tmpfs
file first).

Since the system admin has explicitly attributed these resources to this
cgroup, it makes sense that the cgroup cannot be deleted as long as these
resources exist and are charged to it, similar to cgroup.procs. This also
addresses some of the zombie cgroup problems.

The obvious question is how to maintain the current behavior: bpf maps
and tmpfs mounts would by default initially be charged the same way as
today (bpf maps to the cgroup of the process that created them, and tmpfs
pages on a first-touch basis). How do we then move their charging from
the current scheme to a cgroup.sticky file after their creation?

For things like bpf maps, it should be relatively simple to move charges
the same way we do when we move processes, because a bpf map is by
default charged to a single memcg. For tmpfs, it would be challenging to
move the charges because pages could already be individually attributed
to different cgroups.
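
To make the bpf side of this strawman a bit more concrete, here is a
rough userspace sketch. Everything in it is hypothetical: there is no
cgroup.sticky.bpf file anywhere today, and the write-an-id semantics are
just the cgroup.procs-like behavior described above.

/*
 * Hypothetical example only: "cgroup.sticky.bpf" is the interface
 * *proposed* in this mail, it does not exist in any kernel. Assumed
 * semantics: writing a bpf map id attaches the map's memory charge to
 * this cgroup; reading the file lists the attached ids.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int attach_map_to_cgroup(const char *cgroup_path, unsigned int map_id)
{
	char path[4096];
	char buf[32];
	int fd, len;

	snprintf(path, sizeof(path), "%s/cgroup.sticky.bpf", cgroup_path);

	fd = open(path, O_WRONLY);
	if (fd < 0) {
		perror("open");
		return -1;
	}

	/* Same write-one-id-at-a-time model as cgroup.procs. */
	len = snprintf(buf, sizeof(buf), "%u\n", map_id);
	if (write(fd, buf, len) != len) {
		perror("write");
		close(fd);
		return -1;
	}

	close(fd);
	return 0;
}

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s <cgroup path> <bpf map id>\n", argv[0]);
		return 1;
	}
	return attach_map_to_cgroup(argv[1],
			(unsigned int)strtoul(argv[2], NULL, 10)) ? 1 : 0;
}

The tmpfs side would look similar, except that the admin would write a
mount path (or inode id) instead of a map id, subject to the restriction
below.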
For the tmpfs case, we can impose a (imo reasonable) restriction that,
for tmpfs and similar resources that are currently not attributed to a
single cgroup, a mount can only be added to cgroup.sticky.tmpfs for the
first time if no pages have been charged yet (no files created). Once the
tmpfs mount lives in any cgroup.sticky.tmpfs file, we know that it is
charged to exactly one cgroup, and we can move the charges more easily.

The system admin would basically mount the tmpfs directory and then
directly write it to a cgroup.sticky.tmpfs file. From that point onwards,
all tmpfs pages will be charged to that cgroup, and they need to be freed
or moved before the cgroup can be removed.

I could be missing something here, but I imagine such an interface (or
set of interfaces) would address both the bpf progs/maps charging
concerns and our previous memcg= mount attempts. Please correct me if I
am wrong.

> > This would be optional, so the current status quo is maintainable,
> > but it also gives admins the flexibility to assign resources to
> > cgroups, to make sure nothing is unaccounted, accounted to a zombie
> > memcg, or reparented to an unrelated parent. This might be too
> > fine-grained to be practical, but I just thought it might be useful.
> > We will also need to define an OOM behavior for such resources.
> > Things like bpf maps will be unreclaimable, but tmpfs memory can be
> > swapped out.
>
> Keep in mind that swap is a shared resource in itself. So tmpfs is
> essentially a sticky resource as well. A tmpfs file is not bound to any
> process lifetime the same way a BPF program is. You might need fewer
> privileges to remove a file, but in principle they are consuming
> resources without any explicit owner.

I fully agree, which is exactly why I am suggesting having a defined way
to handle such "sticky" resources that outlive processes. Currently
cgroups operate in terms of processes, and this can be a limitation for
such resources.

> --
> Michal Hocko
> SUSE Labs