From: Yafang Shao
Date: Wed, 13 Jul 2022 22:24:05 +0800
Subject: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
To: Roman Gushchin
Cc: Michal Hocko, Alexei Starovoitov, Shakeel Butt, Matthew Wilcox, Christoph Hellwig, "David S. Miller", Daniel Borkmann, Andrii Nakryiko, Tejun Heo, Martin KaFai Lau, bpf, Kernel Team, linux-mm, Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim, Andrew Morton, Vlastimil Babka
References: <20220708174858.6gl2ag3asmoimpoe@macbook-pro-3.dhcp.thefacebook.com> <20220708215536.pqclxdqvtrfll2y4@google.com> <20220710073213.bkkdweiqrlnr35sv@google.com> <20220712043914.pxmbm7vockuvpmmh@macbook-pro-3.dhcp.thefacebook.com>

On Wed, Jul 13, 2022 at 10:39 AM Roman Gushchin wrote:
>
> On Tue, Jul 12, 2022 at 11:52:11AM +0200, Michal Hocko wrote:
> > On Tue 12-07-22 16:39:48, Yafang Shao wrote:
> > > On Tue, Jul 12, 2022 at 3:40 PM Michal Hocko wrote:
> > [...]
> > > > > Roman already sent reparenting fix:
> > > > > https://patchwork.kernel.org/project/netdevbpf/patch/20220711162827.184743-1-roman.gushchin@linux.dev/
> > > >
> > > > Reparenting is nice but not a silver bullet. Consider a shallow
> > > > hierarchy where the charging happens in the first level under the root
> > > > memcg. Reparenting to the root is just pushing everything under the
> > > > system resources category.
> > > >
> > >
> > > Agreed. That's why I don't like reparenting.
> > > Reparenting just reparent the charged pages and then redirect the new
> > > charge, but can't reparents the 'limit' of the original memcg.
> > > So it is a risk if the original memcg is still being charged. We have
> > > to forbid the destruction of the original memcg.
>
> I agree, I also don't like reparenting for !kmem case. For kmem (and *maybe*
> bpf maps is an exception), I don't think there is a better choice.
>
> > yes, I was toying with an idea like that. I guess we really want a
> > measure to keep cgroups around if they are bound to a resource which is
> > sticky itself. I am not sure how many other resources like BPF (aka
> > module like) we already do charge for memcg but considering the
> > potential memory consumption just reparenting will not help in general
> > case I am afraid.
>
> Well, then we have to make these objects a first-class citizens in cgroup API,
> like processes. E.g. introduce cgroup.bpf.maps, cgroup.mounts.tmpfs etc.
> I easily can see some value here, but it's a big API change.
>
> With the current approach when a bpf map pins a memory cgroup of the creator
> process (which I think is completely transparent for most bpf users), I don't
> think preventing the deletion of a such cgroup is possible. It will break too
> many things.
>
> But honestly I don't see why userspace can't handle it. If there is a cgroup which
> contains shared bpf maps, why would it delete it? It's a weird use case, I don't
> think we have to optimize for it. Also, we do a ton of optimizations for live
> cgroups (e.g. css refcounting being percpu) which are not working for a deleted
> cgroup. So noone really should expect any properties from dying cgroups.
>

I think we have already discussed why the user can't handle it easily.
Actually, it's NOT a weird use case if you are a k8s user. (Of course it
may seem weird to a systemd user, but unfortunately systemd doesn't
rule the whole world.)
I have told you that it is not reasonable to forbid a containerized
process from pinning bpf programs, but if you are not familiar with k8s,
it is not easy to explain clearly why this is a problem for deployment.
But I can try to explain it from a *systemd user's* perspective.

                 bpf-memcg                  (must be persistent)
                /         \
     bpf-foo-memcg       bpf-bar-memcg      (must be persistent, and limit here)
-------------------------------------------------------
          /                   \
     bpf-foo pod           bpf-bar pod      (being created and destroyed, but not limited)

I assume the above hierarchy is what you expect.
But in the k8s environment everything is pod-based, which means that if
we used the above hierarchy there, k8s's limiting, monitoring and
debugging would all have to change accordingly.
That may mean a full-stack change in k8s, a major refactor.

So the hierarchy below is a reasonable solution:

                               bpf-memcg
                                   |
   bpf-foo pod               bpf-foo-memcg     (limited)
    /       \                    /
(charge)  (not-charged)     (charged)
proc-foo            bpf-foo

And then keep the bpf-memcgs persistent.

--
Regards
Yafang
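
To make the limit argument concrete, here is a minimal toy model in Python. It is only a sketch and not kernel code: the class names Memcg and BpfMap, the byte counts and the reparent() helper are all made up for illustration. It assumes charges propagate to every ancestor and that a limit is checked at each level. The first half shows a map charged to the pod's memcg losing its limit once it is reparented; the second half shows the same map charged to a persistent bpf-foo-memcg, where the limit keeps applying no matter what happens to the pod.

```python
class Memcg:
    """Toy memory cgroup: charges propagate to all ancestors."""

    def __init__(self, name, parent=None, limit=None):
        self.name = name
        self.parent = parent
        self.limit = limit  # None means unlimited
        self.usage = 0

    def charge(self, nbytes):
        # First check the limit of this memcg and of every ancestor.
        node = self
        while node is not None:
            if node.limit is not None and node.usage + nbytes > node.limit:
                return False
            node = node.parent
        # Then commit the charge at every level.
        node = self
        while node is not None:
            node.usage += nbytes
            node = node.parent
        return True


class BpfMap:
    """Toy map that pins the memcg it is charged to (hypothetical)."""

    def __init__(self, memcg):
        self.memcg = memcg

    def alloc(self, nbytes):
        return self.memcg.charge(nbytes)

    def reparent(self):
        # Past charges are already counted in the parent; only new charges
        # are redirected. The limit of the old memcg does not follow.
        self.memcg = self.memcg.parent


root = Memcg("root")

# Case 1: the map is charged to the pod's memcg, then the pod goes away.
pod = Memcg("bpf-foo pod", parent=root, limit=100)
pod_map = BpfMap(pod)
ok_before = pod_map.alloc(80)      # fits under the pod limit
denied = pod_map.alloc(40)         # would exceed the pod limit
pod_map.reparent()                 # pod deleted, map now charges root
ok_after = pod_map.alloc(40)       # accepted: no limit applies any more

# Case 2: the map is charged to a persistent, limited bpf-foo-memcg.
bpf_foo_memcg = Memcg("bpf-foo-memcg", parent=root, limit=100)
persistent_map = BpfMap(bpf_foo_memcg)
kept_ok = persistent_map.alloc(80)
kept_denied = persistent_map.alloc(40)  # limit still enforced

print(ok_before, denied, ok_after, kept_ok, kept_denied)
```

In this model the only thing that changes between the two cases is which memcg the map pins, which is the point of keeping the bpf-memcgs persistent.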