From: Mina Almasry
Date: Mon, 22 Aug 2022 14:52:48 -0700
Subject: Re: [RFD RESEND] cgroup: Persistent memory usage tracking
To: Johannes Weiner, Yosry Ahmed
Cc: Tejun Heo, Yafang Shao, Alexei Starovoitov, Daniel Borkmann,
    Andrii Nakryiko, Martin Lau, Song Liu, Yonghong Song, john fastabend,
    KP Singh, Stanislav Fomichev, Hao Luo, jolsa@kernel.org, Michal Hocko,
    Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton, Zefan Li,
    Cgroups, netdev, bpf, Linux MM, Dan Schatzberg, Lennart Poettering
References: <20220818143118.17733-1-laoar.shao@gmail.com>

On Mon, Aug 22, 2022 at 2:19 PM Johannes Weiner wrote:
>
> On Mon, Aug 22, 2022 at 12:02:48PM -0700, Mina Almasry wrote:
> > On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo wrote:
> > > b. Let userspace specify which cgroup to charge for some of constructs like
> > >    tmpfs and bpf maps. The key problems with this approach are
> > >
> > >    1. How to grant/deny what can be charged where. We must ensure that a
> > >       descendant can't move charges up or across the tree without the
> > >       ancestors allowing it.
> > >
> > >    2. How to specify the cgroup to charge. While specifying the target
> > >       cgroup directly might seem like an obvious solution, it has a couple
> > >       rather serious problems. First, if the descendant is inside a cgroup
> > >       namespace, it might not be able to see the target cgroup at all. Second,
> > >       it's an interface which is likely to cause misunderstandings on how it
> > >       can be used. It's too broad an interface.
> > >
> >
> > This is pretty much the solution I sent out for review about a year
> > ago and yes, it suffers from the issues you've brought up:
> > https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/
> >
> >
> > > One solution that I can think of is leveraging the resource domain
> > > concept which is currently only used for threaded cgroups. All memory
> > > usages of threaded cgroups are charged to their resource domain cgroup
> > > which hosts the processes for those threads. The persistent usages have a
> > > similar pattern, so maybe the service level cgroup can declare that it's
> > > the encompassing resource domain and the instance cgroup can say whether
> > > it's gonna charge e.g. the tmpfs instance to its own or the encompassing
> > > resource domain.
> > >
> >
> > I think this sounds excellent and addresses our use cases. Basically
> > the tmpfs/bpf memory would get charged to the encompassing resource
> > domain cgroup rather than the instance cgroup, making the memory usage
> > of the first and second+ instances consistent and predictable.
> >
> > Would love to hear from other memcg folks what they would think of
> > such an approach. I would also love to hear what kind of interface you
> > have in mind. Perhaps a cgroup tunable that says whether it's going to
> > charge the tmpfs/bpf instance to itself or to the encompassing
> > resource domain?
>
> I like this too. It makes shared charging predictable, with a coherent
> resource hierarchy (congruent OOM, CPU, IO domains), and without the
> need for cgroup paths in tmpfs mounts or similar.
>
> As far as who is declaring what goes, though: if the instance groups
> can declare arbitrary files/objects persistent or shared, they'd be
> able to abuse this and sneak private memory past local limits and
> burden the wider persistent/shared domain with it.
>
> I'm thinking it might make more sense for the service level to declare
> which objects are persistent and shared across instances.
>
> If that's the case, we may not need a two-component interface. Just
> the ability for an intermediate cgroup to say: "This object's future
> memory is to be charged to me, not the instantiating cgroup."
>
> Can we require a process in the intermediate cgroup to set up the file
> or object, and use madvise/fadvise to say "charge me", before any
> instances are launched?

I think doing this on a file granularity makes it logistically hard to
use, no? The service needs to create a file in the shared domain and
all its instances need to re-use this exact same file.

Our Kubernetes use case from [1] shares a mount between subtasks
rather than specific files. This allows subtasks to create files at
will in the mount with the memory charged to the shared domain. I
imagine this is more convenient than a shared file.

Our other use case, which I hope to address here as well, is a
service-client relationship from [1] where the service would like to
charge per-client memory back to the client itself. In this case the
service or client can create a mount from the shared domain and pass
it to the service at which point the service is free to create/remove
files in this mount as it sees fit.

Would you be open to a per-mount interface rather than a per-file
fadvise interface?

Yosry, would a proposal like this be extensible to address the bpf
charging issues?

[1] https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/
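
To make the per-mount idea concrete, here is a toy userspace model of the charging rule discussed above. It is only a sketch and not kernel code or an existing interface: the names Cgroup, Mount, charge_target and write are hypothetical, and the rule that a charge is redirected only to an ancestor of the writer is my assumption about how the "can't move charges up or across the tree without the ancestors allowing it" concern might be modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cgroup:
    """Toy stand-in for a memory cgroup (hypothetical, not a kernel struct)."""
    name: str
    parent: Optional["Cgroup"] = None
    charged: int = 0  # bytes charged directly to this cgroup

    def is_ancestor_of(self, other: "Cgroup") -> bool:
        # Walk up from `other` and look for `self` on the way to the root.
        node = other.parent
        while node is not None:
            if node is self:
                return True
            node = node.parent
        return False


@dataclass
class Mount:
    """Toy tmpfs-like mount with an optional charge target (hypothetical)."""
    charge_target: Optional[Cgroup] = None

    def write(self, writer: Cgroup, nbytes: int) -> Cgroup:
        """Charge nbytes and return the cgroup that ended up charged."""
        target = self.charge_target
        # Assumption: redirect only when the target encloses the writer,
        # so an instance cannot push memory across the tree.
        if target is None or not target.is_ancestor_of(writer):
            target = writer
        target.charged += nbytes
        return target


# Service-level cgroup with two instances sharing one mount.
service = Cgroup("service")
inst1 = Cgroup("instance-1", parent=service)
inst2 = Cgroup("instance-2", parent=service)

shared = Mount(charge_target=service)  # set up at the service level
private = Mount()                      # no target: charge the writer

shared.write(inst1, 4096)
shared.write(inst2, 4096)
private.write(inst1, 100)

# A cgroup outside the service subtree is charged itself.
other = Cgroup("other-service")
shared.write(other, 512)
```

In this model both instances' writes to the shared mount land on the service cgroup, so the first and later instances see the same usage, while writes to an ordinary mount stay with the instance.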