Date: Mon, 22 Aug 2022 17:19:50 -0400
From: Johannes Weiner
To: Mina Almasry
Cc: Tejun Heo, Yafang Shao, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin Lau, Song Liu, Yonghong Song,
	john fastabend, KP Singh, Stanislav Fomichev, Hao Luo,
	jolsa@kernel.org, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, Andrew Morton, Zefan Li, Cgroups, netdev, bpf,
	Linux MM, Yosry Ahmed, Dan Schatzberg, Lennart Poettering
Subject: Re: [RFD RESEND] cgroup: Persistent memory usage tracking
References: <20220818143118.17733-1-laoar.shao@gmail.com>

On Mon, Aug 22, 2022 at 12:02:48PM -0700, Mina Almasry wrote:
> On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo wrote:
> > b. Let userspace specify which cgroup to charge for some constructs like
> >    tmpfs and bpf maps. The key problems with this approach are
> >
> >    1. How to grant/deny what can be charged where. We must ensure that a
> >       descendant can't move charges up or across the tree without the
> >       ancestors allowing it.
> >
> >    2. How to specify the cgroup to charge. While specifying the target
> >       cgroup directly might seem like an obvious solution, it has a couple
> >       rather serious problems. First, if the descendant is inside a cgroup
> >       namespace, it might not be able to see the target cgroup at all.
> >       Second, it's an interface which is likely to cause misunderstandings
> >       on how it can be used. It's too broad an interface.
> >
>
> This is pretty much the solution I sent out for review about a year
> ago and yes, it suffers from the issues you've brought up:
> https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/
>
>
> > One solution that I can think of is leveraging the resource domain
> > concept which is currently only used for threaded cgroups. All memory
> > usages of threaded cgroups are charged to their resource domain cgroup
> > which hosts the processes for those threads. The persistent usages have a
> > similar pattern, so maybe the service level cgroup can declare that it's
> > the encompassing resource domain and the instance cgroup can say whether
> > it's gonna charge e.g. the tmpfs instance to its own or the encompassing
> > resource domain.
> >
>
> I think this sounds excellent and addresses our use cases. Basically
> the tmpfs/bpf memory would get charged to the encompassing resource
> domain cgroup rather than the instance cgroup, making the memory usage
> of the first and second+ instances consistent and predictable.
>
> Would love to hear from other memcg folks what they would think of
> such an approach. I would also love to hear what kind of interface you
> have in mind. Perhaps a cgroup tunable that says whether it's going to
> charge the tmpfs/bpf instance to itself or to the encompassing
> resource domain?

I like this too. It makes shared charging predictable, with a coherent
resource hierarchy (congruent OOM, CPU, IO domains), and without the
need for cgroup paths in tmpfs mounts or similar.

As far as who is declaring what goes, though: if the instance groups
can declare arbitrary files/objects persistent or shared, they'd be
able to abuse this and sneak private memory past local limits and
burden the wider persistent/shared domain with it.

I'm thinking it might make more sense for the service level to declare
which objects are persistent and shared across instances.

If that's the case, we may not need a two-component interface. Just
the ability for an intermediate cgroup to say: "This object's future
memory is to be charged to me, not the instantiating cgroup."

Can we require a process in the intermediate cgroup to set up the file
or object, and use madvise/fadvise to say "charge me", before any
instances are launched?
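
To make the one-component idea concrete, here is a toy userspace model
of the charging rule, purely as a sketch. Nothing in it is kernel code
or an existing interface: Cgroup, SharedObject, claim() and charge() are
hypothetical names, and claim() only stands in for whatever
madvise/fadvise-style "charge me" call would end up being used. It also
assumes that a claim only takes effect for instantiating cgroups at or
below the claimer, and that everything else falls back to the
instantiating cgroup; the thread has not settled that part.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cgroup:
    """Toy stand-in for a memory cgroup; not a kernel structure."""
    name: str
    parent: Optional["Cgroup"] = None
    charged: int = 0  # bytes charged directly to this cgroup

    def is_ancestor_of(self, other: "Cgroup") -> bool:
        """Return True if self is a strict ancestor of other."""
        node = other.parent
        while node is not None:
            if node is self:
                return True
            node = node.parent
        return False


@dataclass
class SharedObject:
    """Toy stand-in for a tmpfs file or bpf map."""
    name: str
    charge_to: Optional[Cgroup] = None  # set by the hypothetical claim()


def claim(obj: SharedObject, claimer: Cgroup) -> None:
    """Hypothetical "charge me" declaration made from within claimer."""
    obj.charge_to = claimer


def charge(obj: SharedObject, instantiator: Cgroup, nbytes: int) -> Cgroup:
    """Charge nbytes for obj and return the cgroup that was charged."""
    owner = obj.charge_to
    # Assumption: the claim applies only at or below the claiming cgroup,
    # so charges never move across the tree.
    if owner is not None and (owner is instantiator
                              or owner.is_ancestor_of(instantiator)):
        target = owner
    else:
        target = instantiator
    target.charged += nbytes
    return target


# Service-level cgroup claims the shared cache before instances start.
root = Cgroup("root")
service = Cgroup("service", parent=root)
inst1 = Cgroup("instance-1", parent=service)
inst2 = Cgroup("instance-2", parent=service)
other = Cgroup("other", parent=root)

cache = SharedObject("tmpfs:/cache")
claim(cache, service)
charge(cache, inst1, 4096)    # charged to service
charge(cache, inst2, 4096)    # charged to service

scratch = SharedObject("tmpfs:/scratch")
charge(scratch, inst1, 1024)  # unclaimed: stays with instance-1

charge(cache, other, 512)     # outside the service subtree: stays with other
```

In this model the instances have no way to redirect their own private
memory upward, since only the intermediate cgroup can make the claim,
and the first and second instance see the same accounting for the
shared object.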