From: Yafang Shao
Date: Wed, 9 Mar 2022 21:28:58 +0800
Subject: Re: [PATCH RFC 0/9] bpf, mm: recharge bpf memory from offline memcg
To: Roman Gushchin
Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko, Martin Lau, Song Liu, Yonghong Song, john fastabend, KP Singh, Andrew Morton, Christoph Lameter, penberg@kernel.org, David Rientjes, iamjoonsoo.kim@lge.com, Vlastimil Babka, Johannes Weiner, Michal Hocko, Vladimir Davydov, Roman Gushchin, Linux MM, netdev, bpf
References: <20220308131056.6732-1-laoar.shao@gmail.com>

On Wed, Mar 9, 2022 at 9:09 AM Roman Gushchin wrote:
>
> On Tue, Mar 08, 2022 at 01:10:47PM +0000, Yafang Shao wrote:
> > When we use memcg to limit the containers which load bpf progs and maps,
> > we find that the lifecycles of the container and of bpf are not always
> > the same, because we may pin the maps and progs while updating the
> > container only. So once a container which has already pinned progs and
> > maps is restarted, the pinned progs and maps are no longer charged to it.
> > In other words, this kind of container can steal memory from the host,
> > which is not what we expect. This patchset is meant to resolve this
> > issue.
> >
> > After the container is restarted, the old memcg which is charged for the
> > pinned progs and maps will be offline but won't be freed until all of the
> > related maps and progs are freed. If we want to charge this bpf memory to
> > the newly started memcg, we should uncharge it from the offline memcg
> > first and then charge it to the new one. As we already know how the bpf
> > memory is allocated and freed, we also know how to charge and uncharge
> > it. This patchset implements various charge and uncharge methods for
> > this memory.
> >
> > Regarding how to do the recharge, we decided to implement new bpf
> > syscalls to do it. With the newly implemented bpf syscall, the agent
> > running in the container can do the recharge. As of now we only
> > implement it for the bpf hash maps. Below is a simple example of how to
> > do the recharge:
> >
> > ====
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <unistd.h>
> > #include <sys/syscall.h>
> > #include <linux/bpf.h>	/* BPF_MAP_RECHARGE is added by this patchset */
> >
> > int main(int argc, char *argv[])
> > {
> > 	union bpf_attr attr = {};
> > 	int map_id;
> > 	int pfd;
> >
> > 	if (argc < 2) {
> > 		printf("Pls. give a map id \n");
> > 		exit(-1);
> > 	}
> >
> > 	/* Recharge the map with the given id to the caller's memcg. */
> > 	map_id = atoi(argv[1]);
> > 	attr.map_id = map_id;
> > 	pfd = syscall(SYS_bpf, BPF_MAP_RECHARGE, &attr, sizeof(attr));
> > 	if (pfd < 0)
> > 		perror("BPF_MAP_RECHARGE");
> >
> > 	return 0;
> > }
> >
> > ====
> >
> > Patches #1 and #2 are for observability, with which we can easily check
> > whether a bpf map is charged to a memcg and whether the memcg is offline.
> > Patches #3, #4 and #5 add the charge and uncharge methods for vmalloc-ed,
> > kmalloc-ed and percpu memory.
> > Patches #6~#9 implement the recharge of the bpf hash map, which is mostly
> > used by our bpf services. The other maps haven't been implemented yet.
> > The bpf progs haven't been implemented either.
> >
> > This patchset is still a POC now, with limited testing. Any feedback is
> > welcome.
>
> Hello Yafang!
>
> It's an interesting topic, which goes well beyond bpf. In general, on cgroup
> offlining we either do nothing or recharge pages to the parent cgroup
> (the latter is preferred), which helps to release the pinned memcg structure.
>

We have thought about recharging pages to the parent cgroup (the root
memcg in our case), but it can't resolve our issue.
Releasing the pinned memcg struct is the benefit of recharging pages
to the parent, but as there won't be too many memcgs pinned by bpf,
it may not be worth it.

> Your approach raises some questions:

Nice questions.

> 1) what if the new cgroup is not large enough to contain the bpf map?

The recharge is supposed to be triggered at container start time.
After the container is started, the agent which loads the bpf
programs will proceed as follows:

1. Check if the bpf program has already been loaded;
   if not, goto 5.
2. Check if the bpf program will pin maps or progs;
   if not, goto 6.
3. Check if the pinned maps and progs are charged to an offline memcg;
   if not, goto 6.
4. Recharge the pinned maps or progs to the current memcg;
   goto 6.
5. Load the new bpf program, and also pin maps and progs if desired.
6. End.

If the recharge fails, it means that the memcg limit is too low, and we
should reconsider the limit of the container.

Regarding other cases where the recharge may be done at runtime, I
think the failure is a common OOM case: the usage in this container is
out of memory, and we should kill something.

> 2) does it mean that some userspace app will monitor the state of the cgroup
> which was the original owner of the bpf map and recharge once it's deleted?

In our use case, we don't need to monitor that behavior. The agent
which loads the bpf programs has the responsibility to do the
recharge. As all the agents are controlled by ourselves, it is easy to
do it like that.

For more generic use cases, the bpf maintenance can be done in a
sidecar container in the containerized environment. The admin can
provide such a sidecar to bpf owners. The admin can also introduce an
agent on the host to check whether there are maps or progs charged to
an offline memcg and then take action. It is not easy to find which
one owns the pinned maps or progs as the pinned path is unique.

> 3) what if several cgroups are sharing the same map? who will be
> the next owner?

I think we can follow the same rule that we currently use for pages
shared across memcgs: whoever loads it first owns the map. Then after
the first one exits, the next owner is whoever does the recharge
first.

> 4) because recharging is fully voluntary, why would any application want to do
> it, if it can just use the memory for free? it doesn't really look like a working
> resource control mechanism.
>

As I explained in 2), all the agents are under our control, so we can
easily handle it like that. For generic use cases, an agent running on
the host and a sidecar (or SDK) provided to bpf users can also handle
it.

> Will reparenting work for your case? If not, can you, please, describe the
> problem you're trying to solve by recharging the memory?
>

Reparenting doesn't work for us.
The problem is memory resource control: the limitation on the bpf
containers will be useless if the lifecycles of the bpf progs and the
containers are not the same.
The containers are always upgraded - IOW restarted - more frequently
than the bpf progs and maps, which is also one of the reasons why we
choose to pin them on the host.

-- 
Thanks
Yafang
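
P.S. The agent start-up steps from the answer to 1) can be sketched
roughly as below. This is only an illustration under stated assumptions:
`struct bpf_agent_state`, `enum agent_action` and `agent_decide()` are
hypothetical names that are not part of the patchset, and the three
checks are assumed to have been done already by whatever means the agent
has available.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of what the agent found at container start. */
struct bpf_agent_state {
	bool prog_loaded;        /* step 1: bpf program already loaded? */
	bool uses_pinned_objs;   /* step 2: does it pin maps or progs? */
	bool charged_to_offline; /* step 3: pinned objects charged to an offline memcg? */
};

/* Hypothetical result of the decision flow. */
enum agent_action {
	AGENT_LOAD_NEW, /* step 5: load the program, pin if desired */
	AGENT_RECHARGE, /* step 4: recharge pinned objects to the current memcg */
	AGENT_NOTHING,  /* step 6 reached with nothing to do */
};

/* Walk steps 1-3 in order and report which action the agent should take. */
enum agent_action agent_decide(const struct bpf_agent_state *st)
{
	assert(st != NULL);

	if (!st->prog_loaded)
		return AGENT_LOAD_NEW;
	if (!st->uses_pinned_objs)
		return AGENT_NOTHING;
	if (!st->charged_to_offline)
		return AGENT_NOTHING;
	return AGENT_RECHARGE;
}
```

Only the AGENT_RECHARGE outcome would lead to the BPF_MAP_RECHARGE call
shown in the cover letter; if that call fails, the container limit
should be reconsidered as described above.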