Date: Mon, 22 Aug 2022 17:14:01 -1000
From: Tejun Heo
To: Roman Gushchin
Cc: Johannes Weiner, Mina Almasry, Yafang Shao, Alexei Starovoitov,
    Daniel Borkmann, Andrii Nakryiko, Martin Lau, Song Liu, Yonghong Song,
    john fastabend, KP Singh, Stanislav Fomichev, Hao Luo, jolsa@kernel.org,
    Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton, Zefan Li,
    Cgroups, netdev, bpf, Linux MM, Yosry Ahmed, Dan Schatzberg,
    Lennart Poettering
Subject: Re: [RFD RESEND] cgroup: Persistent memory usage tracking
References: <20220818143118.17733-1-laoar.shao@gmail.com>

Hello,

On Mon, Aug 22, 2022 at 08:01:41PM -0700, Roman Gushchin wrote:
> > > > One solution that I can think of is leveraging the resource domain
> > > > concept which is currently only used for threaded cgroups. All memory
> > > > usages of threaded cgroups are charged to their resource domain cgroup
> > > > which hosts the processes for those threads. The persistent usages have a
> > > > similar pattern, so maybe the service level cgroup can declare that it's
> > > > the encompassing resource domain and the instance cgroup can say whether
> > > > it's gonna charge e.g. the tmpfs instance to its own or the encompassing
> > > > resource domain.
> > > >
> > >
> > > I think this sounds excellent and addresses our use cases. Basically
> > > the tmpfs/bpf memory would get charged to the encompassing resource
> > > domain cgroup rather than the instance cgroup, making the memory usage
> > > of the first and second+ instances consistent and predictable.
> > >
> > > Would love to hear from other memcg folks what they would think of
> > > such an approach. I would also love to hear what kind of interface you
> > > have in mind. Perhaps a cgroup tunable that says whether it's going to
> > > charge the tmpfs/bpf instance to itself or to the encompassing
> > > resource domain?
> >
> > I like this too. It makes shared charging predictable, with a coherent
> > resource hierarchy (congruent OOM, CPU, IO domains), and without the
> > need for cgroup paths in tmpfs mounts or similar.
> >
> > As far as who is declaring what goes, though: if the instance groups
> > can declare arbitrary files/objects persistent or shared, they'd be
> > able to abuse this and sneak private memory past local limits and
> > burden the wider persistent/shared domain with it.

My thought was that the persistent cgroup and instance cgroups should belong
to the same trust domain and system level control should be applied at the
resource domain level.
The application may decide to shift between persistent and per-instance
however it wants to and may even configure resource control at that level
but all that's for its own accounting accuracy and benefit.

> > I'm thinking it might make more sense for the service level to declare
> > which objects are persistent and shared across instances.
>
> I like this idea.
>
> > If that's the case, we may not need a two-component interface. Just
> > the ability for an intermediate cgroup to say: "This object's future
> > memory is to be charged to me, not the instantiating cgroup."
> >
> > Can we require a process in the intermediate cgroup to set up the file
> > or object, and use madvise/fadvise to say "charge me", before any
> > instances are launched?
>
> We need to think how to make this interface convenient to use.
> First, these persistent resources are likely created by some agent software,
> not the main workload. So the requirement to call madvise() from the
> actual cgroup might be not easily achievable.

So one worry that I have for this is that it requires the application itself
to be aware of cgroup topologies and restructure itself so that allocation
of those resources is factored out into something else. Maybe that's not a
huge problem but it may limit its applicability quite a bit.

If we can express all the resource constraints and structures on the cgroup
side and have them configured by the management agent, the application can
simply e.g. madvise whatever memory region or flag bpf maps as "these are
persistent" and the rest can be handled by the system. If the agent has set
up the environment for that, it gets accounted accordingly; otherwise, it'd
behave as if the tagging didn't exist. Asking the application to set up all
its resources in separate steps might require significant restructuring and
knowledge of how the hierarchy is set up in many cases.

> So _maybe_ something like writing a fd into cgroup.memory.resources.
>
> Second, it would be really useful to export the current configuration
> to userspace. E.g. a user should be able to query to which cgroup the given
> bpf map "belongs" and which bpf maps belong to the given cgroups. Otherwise
> it will create a problem for userspace programs which manage cgroups
> (e.g. systemd): they should be able to restore the current configuration
> from the kernel state, without "remembering" what has been configured
> before.

This too can be achieved by separating cgroup setup from the tagging of
specific resources. The agent and the cgroup know what each cgroup is
supposed to do, as they already do now, and each resource is tagged as
persistent or not, so everything is always known without the agent and the
application having to explicitly share the information.

Thanks.

-- 
tejun
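
To make the accounting rule discussed above concrete, here is a toy
userspace model of it. This is only a sketch under assumptions: no such
kernel interface exists in this thread, and the names Cgroup,
is_resource_domain, charge_target and the persistent flag are all
hypothetical. It also ignores the fact that real memcg charges propagate up
the hierarchy. It shows just the one rule: a resource tagged persistent is
charged to the nearest encompassing cgroup that the agent declared a
resource domain, and if no such domain was configured the tag has no effect.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cgroup:
    name: str
    parent: Optional["Cgroup"] = None
    # Hypothetical knob, set by the management agent on the service-level
    # cgroup. The application never touches it.
    is_resource_domain: bool = False
    charged: int = 0  # bytes charged directly to this cgroup (toy counter)


def charge_target(cg: Cgroup, persistent: bool) -> Cgroup:
    """Pick the cgroup that an allocation made from `cg` is charged to."""
    if persistent:
        ancestor = cg.parent
        while ancestor is not None:
            if ancestor.is_resource_domain:
                return ancestor
            ancestor = ancestor.parent
    # Untagged memory, or no encompassing domain configured: charge the
    # allocating cgroup, exactly as if the tag did not exist.
    return cg


def charge(cg: Cgroup, nbytes: int, persistent: bool = False) -> Cgroup:
    """Charge `nbytes` and return the cgroup that ended up paying."""
    target = charge_target(cg, persistent)
    target.charged += nbytes
    return target


if __name__ == "__main__":
    root = Cgroup("root")
    service = Cgroup("service", parent=root, is_resource_domain=True)
    instance = Cgroup("instance-1", parent=service)

    print(charge(instance, 4096, persistent=True).name)  # tagged allocation
    print(charge(instance, 4096).name)                   # ordinary allocation
```

In this model the application only supplies the persistent flag. Which
cgroup pays is decided entirely by how the agent laid out the hierarchy,
which is the separation argued for above.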