From: Mina Almasry
Date: Wed, 24 Aug 2022 12:02:04 -0700
Subject: Re: [RFD RESEND] cgroup: Persistent memory usage tracking
To: Tejun Heo
Cc: Roman Gushchin, Johannes Weiner, Yafang Shao, Alexei Starovoitov,
    Daniel Borkmann, Andrii Nakryiko, Martin Lau, Song Liu, Yonghong Song,
    John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, jolsa@kernel.org,
    Michal Hocko, Shakeel Butt, Muchun Song, Andrew Morton, Zefan Li,
    Cgroups, netdev, bpf, Linux MM, Yosry Ahmed, Dan Schatzberg,
    Lennart Poettering
References: <20220818143118.17733-1-laoar.shao@gmail.com>

On Mon, Aug 22, 2022 at 8:14 PM Tejun Heo wrote:
>
> Hello,
>
> On Mon, Aug 22, 2022 at 08:01:41PM -0700, Roman Gushchin wrote:
> > > > > One solution that I can think of is leveraging the resource domain
> > > > > concept which is currently only used for threaded cgroups. All memory
> > > > > usages of threaded cgroups are charged to their resource domain cgroup
> > > > > which hosts the processes for those threads. The persistent usages have a
> > > > > similar pattern, so maybe the service level cgroup can declare that it's
> > > > > the encompassing resource domain and the instance cgroup can say whether
> > > > > it's gonna charge e.g. the tmpfs instance to its own or the encompassing
> > > > > resource domain.
> > > >
> > > > I think this sounds excellent and addresses our use cases. Basically
> > > > the tmpfs/bpf memory would get charged to the encompassing resource
> > > > domain cgroup rather than the instance cgroup, making the memory usage
> > > > of the first and second+ instances consistent and predictable.
> > > >
> > > > Would love to hear from other memcg folks what they would think of
> > > > such an approach. I would also love to hear what kind of interface you
> > > > have in mind. Perhaps a cgroup tunable that says whether it's going to
> > > > charge the tmpfs/bpf instance to itself or to the encompassing
> > > > resource domain?
> > >
> > > I like this too. It makes shared charging predictable, with a coherent
> > > resource hierarchy (congruent OOM, CPU, IO domains), and without the
> > > need for cgroup paths in tmpfs mounts or similar.
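
[Interjecting to make sure I'm picturing the same shape - both knob names
below are invented here, purely to illustrate the topology being discussed:]

    # the service-level cgroup declares itself the encompassing
    # resource domain for its descendants (hypothetical knob):
    echo 1 > /sys/fs/cgroup/service/memory.resource_domain

    # each instance cgroup says whether e.g. its tmpfs charges go to
    # itself or to the encompassing domain (hypothetical knob):
    echo domain > /sys/fs/cgroup/service/instance-1/memory.tmpfs_charge
    echo self > /sys/fs/cgroup/service/instance-2/memory.tmpfs_charge

That would keep the memory usage of the first and second+ instances
consistent, which is what we're after.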
> > >
> > > As far as who is declaring what goes, though: if the instance groups
> > > can declare arbitrary files/objects persistent or shared, they'd be
> > > able to abuse this and sneak private memory past local limits and
> > > burden the wider persistent/shared domain with it.
>
> My thought was that the persistent cgroup and instance cgroups should belong
> to the same trust domain and system level control should be applied at the
> resource domain level. The application may decide to shift between
> persistent and per-instance however it wants to and may even configure
> resource control at that level but all that's for its own accounting
> accuracy and benefit.
>
> > > I'm thinking it might make more sense for the service level to declare
> > > which objects are persistent and shared across instances.
> >
> > I like this idea.
> >
> > > If that's the case, we may not need a two-component interface. Just
> > > the ability for an intermediate cgroup to say: "This object's future
> > > memory is to be charged to me, not the instantiating cgroup."
> > >
> > > Can we require a process in the intermediate cgroup to set up the file
> > > or object, and use madvise/fadvise to say "charge me", before any
> > > instances are launched?
> >
> > We need to think how to make this interface convenient to use.
> > First, these persistent resources are likely created by some agent software,
> > not the main workload. So the requirement to call madvise() from the
> > actual cgroup might not be easily achievable.
>
> So one worry that I have for this is that it requires the application itself
> to be aware of cgroup topologies and restructure itself so that allocation of
> those resources is factored out into something else. Maybe that's not a
> huge problem but it may limit its applicability quite a bit.

I agree with this point 100%. The interfaces being discussed here require
existing applications to restructure themselves, which I don't imagine will
be very useful for us at least.

> If we can express all the resource constraints and structures on the cgroup
> side, configured by the management agent, the application can simply e.g.
> madvise whatever memory region or flag bpf maps as "these are persistent"
> and the rest can be handled by the system. If the agent set up the
> environment for that, it gets accounted accordingly; otherwise, it'd behave
> as if those tags didn't exist. Asking the application to set up all its
> resources in separate steps might require significant restructuring
> and knowledge of how the hierarchy is set up in many cases.

I don't know if this level of granularity is needed with a madvise() or
such. The kernel knows whether resources are persistent due to the nature
of the resource. For example, a shared tmpfs file is persistent - it is not
cleaned up after the process using it dies - whereas private memory is.
madvise(PERSISTENT) on private memory would not make sense, and I don't
think madvise(NOT_PERSISTENT) on a tmpfs-backed memory region would make
sense either. Also, leveraging this requires adding madvise() hints to
userspace code.

> So _maybe_ something like writing a fd into cgroup.memory.resources.

Sorry, I don't see this being useful - to us at least - if it is an
fd-based interface. It needs to support marking entire tmpfs mounts as
persistent. The reasoning is as Tejun alludes to: our container management
agent generally sets up the container hierarchy for a job and also sets up
the filesystem mounts the job requests. This is generally because the job
doesn't run as root and doesn't bother with mount namespaces. Thus, our
jobs are well-trained in receiving mounts set up for them by our container
management agent. Jobs are _not_ well-trained in receiving an fd from the
container management agent, and restructuring our jobs/services for such an
interface will not be feasible, I think. This applies to us, but I imagine
it is common practice for the container management agent to set up mounts
for the jobs to use, rather than provide the job with an fd or collection
of fds.
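
To sketch the kind of flow that would work for us (the "memcg=" mount
option below is hypothetical, not an existing upstream interface; it just
illustrates mount-level rather than fd-level tagging):

    # the agent sets up the job's cgroup hierarchy:
    mkdir -p /sys/fs/cgroup/job/instance-0

    # the agent mounts the tmpfs the job requested, marking the whole
    # mount as charged to the service-level cgroup (hypothetical option):
    mount -t tmpfs -o memcg=/sys/fs/cgroup/job tmpfs /mnt/job-shared

    # the job just uses /mnt/job-shared; it is handed no fds and needs
    # no awareness of the cgroup topology.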
> > Second, it would be really useful to export the current configuration
> > to userspace. E.g. a user should be able to query to which cgroup a given
> > bpf map "belongs" and which bpf maps belong to a given cgroup. Otherwise
> > it will create a problem for userspace programs which manage cgroups
> > (e.g. systemd): they should be able to restore the current configuration
> > from the kernel state, without "remembering" what has been configured
> > before.
>
> This too can be achieved by separating out cgroup setup and tagging specific
> resources. The agent and cgroup know what each cgroup is supposed to do, as
> they already do now, and each resource is tagged as persistent or not, so
> everything is always known without the agent and the application having to
> explicitly share the information.
>
> Thanks.
>
> --
> tejun