Date: Mon, 22 Aug 2022 20:01:41 -0700
From: Roman Gushchin
To: Johannes Weiner
Cc: Mina Almasry, Tejun Heo, Yafang Shao, Alexei Starovoitov,
 Daniel Borkmann, Andrii Nakryiko, Martin Lau, Song Liu, Yonghong Song,
 john fastabend, KP Singh, Stanislav Fomichev, Hao Luo, jolsa@kernel.org,
 Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton, Zefan Li,
 Cgroups, netdev, bpf, Linux MM, Yosry Ahmed, Dan Schatzberg,
 Lennart Poettering
Subject: Re: [RFD RESEND] cgroup: Persistent memory usage tracking
References: <20220818143118.17733-1-laoar.shao@gmail.com>

On Mon, Aug 22, 2022 at 05:19:50PM -0400, Johannes Weiner wrote:
> On Mon, Aug 22, 2022 at 12:02:48PM -0700, Mina Almasry wrote:
> > On Mon, Aug 22, 2022 at 4:29 AM Tejun Heo wrote:
> > > b. Let userspace specify which cgroup to charge for some constructs like
> > > tmpfs and bpf maps. The key problems with this approach are
> > >
> > > 1. How to grant/deny what can be charged where. We must ensure that a
> > > descendant can't move charges up or across the tree without the
> > > ancestors allowing it.
> > >
> > > 2. How to specify the cgroup to charge. While specifying the target
> > > cgroup directly might seem like an obvious solution, it has a couple
> > > rather serious problems. First, if the descendant is inside a cgroup
> > > namespace, it might not be able to see the target cgroup at all. Second,
> > > it's an interface which is likely to cause misunderstandings on how it
> > > can be used. It's too broad an interface.
> > >
> > >
> > This is pretty much the solution I sent out for review about a year
> > ago and yes, it suffers from the issues you've brought up:
> > https://lore.kernel.org/linux-mm/20211120045011.3074840-1-almasrymina@google.com/
> >
> >
> > > One solution that I can think of is leveraging the resource domain
> > > concept which is currently only used for threaded cgroups. All memory
> > > usages of threaded cgroups are charged to their resource domain cgroup
> > > which hosts the processes for those threads.
> > > The persistent usages have a
> > > similar pattern, so maybe the service level cgroup can declare that it's
> > > the encompassing resource domain and the instance cgroup can say whether
> > > it's gonna charge e.g. the tmpfs instance to its own or the encompassing
> > > resource domain.
> > >
> > >
> > I think this sounds excellent and addresses our use cases. Basically
> > the tmpfs/bpf memory would get charged to the encompassing resource
> > domain cgroup rather than the instance cgroup, making the memory usage
> > of the first and second+ instances consistent and predictable.
> >
> > Would love to hear from other memcg folks what they would think of
> > such an approach. I would also love to hear what kind of interface you
> > have in mind. Perhaps a cgroup tunable that says whether it's going to
> > charge the tmpfs/bpf instance to itself or to the encompassing
> > resource domain?
>
> I like this too. It makes shared charging predictable, with a coherent
> resource hierarchy (congruent OOM, CPU, IO domains), and without the
> need for cgroup paths in tmpfs mounts or similar.
>
> As far as who is declaring what goes, though: if the instance groups
> can declare arbitrary files/objects persistent or shared, they'd be
> able to abuse this and sneak private memory past local limits and
> burden the wider persistent/shared domain with it.
>
> I'm thinking it might make more sense for the service level to declare
> which objects are persistent and shared across instances.

I like this idea.

>
> If that's the case, we may not need a two-component interface. Just
> the ability for an intermediate cgroup to say: "This object's future
> memory is to be charged to me, not the instantiating cgroup."
>
> Can we require a process in the intermediate cgroup to set up the file
> or object, and use madvise/fadvise to say "charge me", before any
> instances are launched?

We need to think about how to make this interface convenient to use.
First, these persistent resources are likely created by some agent software,
not the main workload. So the requirement to call madvise() from the actual
cgroup might not be easily achievable.

So _maybe_ something like writing an fd into cgroup.memory.resources.

Second, it would be really useful to export the current configuration
to userspace. E.g. a user should be able to query which cgroup a given
bpf map "belongs" to and which bpf maps belong to a given cgroup. Otherwise
it will create a problem for userspace programs which manage cgroups
(e.g. systemd): they should be able to restore the current configuration
from the kernel state, without "remembering" what has been configured before.

Thanks!