From: Yafang Shao
Date: Thu, 14 Jul 2022 14:15:30 +0800
Subject: Re: [PATCH bpf-next 0/5] bpf: BPF specific memory allocator.
To: Tejun Heo
Cc: Roman Gushchin, Michal Hocko, Alexei Starovoitov, Shakeel Butt,
 Matthew Wilcox, Christoph Hellwig, "David S. Miller", Daniel Borkmann,
 Andrii Nakryiko, Martin KaFai Lau, bpf, Kernel Team, linux-mm,
 Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
 Andrew Morton, Vlastimil Babka
References: <20220708215536.pqclxdqvtrfll2y4@google.com>
 <20220710073213.bkkdweiqrlnr35sv@google.com>
 <20220712043914.pxmbm7vockuvpmmh@macbook-pro-3.dhcp.thefacebook.com>
Content-Type: text/plain; charset="UTF-8"

On Thu, Jul 14, 2022 at 12:24 AM Tejun Heo wrote:
>
> Hello,
>
> On Wed, Jul 13, 2022 at 10:24:05PM +0800, Yafang Shao wrote:
> > I have told you that it is not reasonable to refuse a containerized
> > process to pin bpf programs, but if you are not familiar with k8s, it
> > is not easy to explain clearly why it is a trouble for deployment.
> > But I can try to explain to you from a *systemd user's* perspective.
>
> The way systemd currently sets up cgroup hierarchy doesn't work for
> persistent per-service resource tracking. It needs to introduce an extra
> layer for that which would be a significant change for systemd too.
>
> > I assume the above hierarchy is what you expect.
> > But you know, in the k8s environment, everything is pod-based, that
> > means if we use the above hierarchy in the k8s environment, the k8s's
> > limiting, monitoring, debugging must be changed consequently. That
> > means it may be a fullstack change in k8s, a great refactor.
> >
> > So below hierarchy is a reasonable solution,
> >                 bpf-memcg
> >                     |
> >   bpf-foo pod    bpf-foo-memcg     (limited)
> >    /    \                /
> > (charge) (not-charged) (charged)
> > proc-foo          bpf-foo
> >
> > And then keep the bpf-memcgs persistent.
>
> It looks like you draw the diagram with variable width font and it's
> difficult to tell what you're trying to say.

Maybe the diagram below is clearer?

                      bpf-memcg
                          |
   bpf-foo pod      bpf-foo-memcg (limited)
    /       \            /
(charge) (not-charged) (charged)
   |           \        /
   |            \      /
proc-foo         bpf-foo

bpf-foo is loaded by proc-foo, but it is not charged to the bpf-foo
pod; instead it is remotely charged to bpf-foo-memcg.

> That said, I don't think the
> argument you're making is a good one in general. The topic at hand is future
> architectural direction in handling shared resources, which was never well
> supported before. ie. We're not talking about breaking existing behaviors.
>
> We don't want to architect kernel features to suit the expectations of one
> particular application. It has to be longer term than that and it can't be
> an one way road. Sometimes the kernel adapts to existing applications
> because the expectations make sense. At other times, kernel takes a
> direction which may require some work from applications to use new
> capabilities because that makes more sense in the long term.
>

Shared resources and remote charging are not a new issue; see
task->active_memcg. The case we are handling now (map->memcg or
map->objcg) is similar to task->active_memcg. If we want to make it
generic, I think we can start with task->active_memcg.

To make it generic, I have some preliminary thoughts on the cgroup
side:
1) Can we extend the cgroup tree to a cgroup graph?
2) Can we extend the cgroup from process-based (cgroup.procs) to
   resource-based (cgroup.resources)?

Regarding question 1).
Originally the charge direction is vertical, which looks like a tree,
as below,

   parent
     ^
     |
   cgroup

But with task->active_memcg there is a new, horizontal charge, as
below,

   parent
     ^
     |
   cgroup ----> friend

The two will have a common ancestor, so in the end it looks like a
graph,

        ancestor
        /      \
      ...      ...
      /          \
   cgroup ---- friend

Regarding question 2).
The lifecycle of a leaf cgroup is the same as that of the processes
inside it. But once remote charging was introduced, the lifecycle of a
leaf cgroup may instead follow that of a process in another cgroup.
So it is no longer sufficient to treat a cgroup as process-based: what
it really cares about is the resources, so maybe we should extend it
to be resource-based.

> Let's keep the discussion more focused on technical merits.
>
> Thanks.
>
> --
> tejun

-- 
Regards
Yafang
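
A minimal sketch of the two charge directions described above, written
as a toy Python model rather than kernel code. Everything in it is an
assumption made for illustration: the Cgroup and Task classes, the
usage counter, the alloc() helper, and the extra "root" node standing
in for the common ancestor. The active_memcg field only mimics the
role of task->active_memcg and does not reflect how the kernel
implements it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cgroup:
    """Hypothetical cgroup node; only tracks charged bytes."""
    name: str
    parent: Optional["Cgroup"] = None
    usage: int = 0

    def charge(self, nbytes: int) -> None:
        # Vertical charge: this cgroup and every ancestor account the bytes.
        node = self
        while node is not None:
            node.usage += nbytes
            node = node.parent


@dataclass
class Task:
    """Hypothetical task; active_memcg is a toy stand-in for task->active_memcg."""
    cgroup: Cgroup
    active_memcg: Optional[Cgroup] = None

    def alloc(self, nbytes: int) -> None:
        # Horizontal (remote) charge when active_memcg is set,
        # otherwise charge the cgroup the task lives in.
        target = self.active_memcg if self.active_memcg is not None else self.cgroup
        target.charge(nbytes)


# Hierarchy from the diagram, plus an assumed common ancestor "root".
root = Cgroup("root")
pod = Cgroup("bpf-foo pod", parent=root)
bpf_memcg = Cgroup("bpf-memcg", parent=root)
bpf_foo_memcg = Cgroup("bpf-foo-memcg", parent=bpf_memcg)

proc_foo = Task(cgroup=pod)
proc_foo.alloc(4096)                   # ordinary allocation: charged to the pod

proc_foo.active_memcg = bpf_foo_memcg  # redirect subsequent charges
proc_foo.alloc(8192)                   # bpf-foo memory: charged remotely
proc_foo.active_memcg = None           # restore local charging
```

In this model the pod ends up with 4096 bytes charged, bpf-foo-memcg
and bpf-memcg with 8192 each, and only the shared root sees both
(12288). That is the sense in which the pod and its "friend" meet only
at a common ancestor.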