From: Roman Gushchin <guro@fb.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mina Almasry <almasrymina@google.com>,
Jonathan Corbet <corbet@lwn.net>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Andrew Morton <akpm@linux-foundation.org>,
Michal Hocko <mhocko@kernel.org>,
Vladimir Davydov <vdavydov.dev@gmail.com>,
Hugh Dickins <hughd@google.com>, Shuah Khan <shuah@kernel.org>,
Shakeel Butt <shakeelb@google.com>,
Greg Thelen <gthelen@google.com>,
Dave Chinner <david@fromorbit.com>,
Matthew Wilcox <willy@infradead.org>,
Theodore Ts'o <tytso@mit.edu>, <linux-kernel@vger.kernel.org>,
<linux-fsdevel@vger.kernel.org>, <linux-mm@kvack.org>
Subject: Re: [PATCH v4 0/4] Deterministic charging of shared memory
Date: Mon, 22 Nov 2021 15:09:26 -0800
Message-ID: <YZwjJjccnlL1SDSN@carbon.dhcp.thefacebook.com>
In-Reply-To: <YZvppKvUPTIytM/c@cmpxchg.org>

On Mon, Nov 22, 2021 at 02:04:04PM -0500, Johannes Weiner wrote:
> On Fri, Nov 19, 2021 at 08:50:06PM -0800, Mina Almasry wrote:
> > Problem:
> > Currently shared memory is charged to the memcg of the allocating
> > process. This makes memory usage of processes accessing shared memory
> > a bit unpredictable since whichever process accesses the memory first
> > will get charged. We have a number of use cases where our userspace
> > would like deterministic charging of shared memory:
> >
> > 1. System services allocating memory for client jobs:
> > We have services (namely a network access service[1]) that provide
> > functionality for clients running on the machine and allocate memory
> > to carry out these services. The memory usage of these services
> > depends on the number of jobs running on the machine and the nature of
> > the requests made to the service, which makes the memory usage of
> > these services hard to predict and thus hard to limit via memory.max.
> > These system services would like a way to allocate memory and instruct
> > the kernel to charge this memory to the client’s memcg.
> >
> > 2. Shared filesystem between subtasks of a large job
> > Our infrastructure has large meta jobs such as kubernetes which spawn
> > multiple subtasks that share a tmpfs mount. These jobs and their
> > subtasks use the tmpfs mount for various purposes such as data sharing
> > or persisting data across subtask restarts. In kubernetes terminology,
> > the meta job is similar to a pod and the subtasks to containers within
> > the pod. We want the shared memory to be deterministically charged to
> > the kubernetes pod, independent of the lifetime of the containers under
> > the pod.
> >
> > 3. Shared libraries and language runtimes shared between independent jobs.
> > We’d like to optimize memory usage on the machine by sharing libraries
> > and language runtimes of many of the processes running on our machines
> > in separate memcgs. A side effect is that one unlucky job may be the
> > first to access many of these libraries and may get oom killed when all
> > the cached files are charged to it.
> >
> > Design:
> > My rough proposal to solve this problem is to simply add a
> > ‘memcg=/path/to/memcg’ mount option for filesystems, directing all
> > memory used by the filesystem to be ‘remote charged’ to the cgroup
> > provided by that memcg= option.
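
For illustration, usage of the proposed option might look roughly like the
sketch below; the mount point, the cgroup path, and the exact format of the
memcg= value are assumptions made up for the example:

/*
 * Illustrative sketch only: a service mounting a tmpfs whose pages are
 * "remote charged" to a client job's memcg via the proposed memcg= option.
 */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* Charge all pages of this tmpfs to the client job's cgroup. */
	const char *opts = "size=512m,memcg=/sys/fs/cgroup/clients/job-1234";

	if (mount("none", "/run/svc/job-1234", "tmpfs", 0, opts)) {
		perror("mount");
		return 1;
	}
	return 0;
}
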
> >
> > Caveats:
> >
> > 1. One complication to address is the behavior when the target memcg
> > hits its memory.max limit because of remote charging. In this case the
> > oom-killer will be invoked, but the oom-killer may not find anything
> > to kill in the target memcg being charged. There are a number of considerations
> > in this case:
> >
> > 1. It's not great to kill the allocating process since the allocating process
> > is not running in the memcg under oom, and killing it will not free memory
> > in the memcg under oom.
> > 2. Pagefaults may hit the memcg limit, and we need to handle the pagefault
> > somehow; otherwise the process will loop on the pagefault forever with an
> > upstream kernel.
> >
> > In this case, I propose simply failing the remote charge and returning an ENOSPC
> > to the caller. This will cause the process executing the remote charge to get
> > an ENOSPC in non-pagefault paths, and a SIGBUS on the pagefault path. This will
> > be documented behavior of remote charging, and this feature is opt-in. Users can:
> > - Not opt into the feature if they want.
> > - Opt into the feature and accept the risk of receiving ENOSPC or SIGBUS, and
> > abort if they desire.
> > - Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their
> > operation without executing the remote charge if possible.
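
The last option might look roughly like the sketch below. Paths are made up,
and it assumes the semantics described above (ENOSPC on syscall paths, SIGBUS
on the fault path):

/*
 * Sketch: gracefully handling a failed remote charge on a tmpfs mounted
 * with the proposed memcg= option.  The file path is hypothetical.
 */
#include <errno.h>
#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf fault_env;

static void on_sigbus(int sig)
{
	/* Remote charge failed during a page fault: unwind and fall back. */
	siglongjmp(fault_env, 1);
}

int main(void)
{
	struct sigaction sa = { .sa_handler = on_sigbus };
	char *map;
	int fd;

	sigaction(SIGBUS, &sa, NULL);

	fd = open("/run/svc/job-1234/scratch", O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return 1;

	/* Syscall path: a failed remote charge shows up as ENOSPC. */
	if (write(fd, "data", 4) < 0 && errno == ENOSPC)
		fprintf(stderr, "client memcg full, skipping cache\n");

	/* Fault path: a failed remote charge arrives as SIGBUS. */
	if (ftruncate(fd, 4096) == 0) {
		map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (map != MAP_FAILED) {
			if (sigsetjmp(fault_env, 1) == 0)
				map[0] = 1;	/* may SIGBUS instead of charging */
			else
				fprintf(stderr, "client memcg full on fault\n");
			munmap(map, 4096);
		}
	}
	close(fd);
	return 0;
}
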
> >
> > 2. Only processes allowed to enter the cgroup at mount time can mount a
> > tmpfs with memcg=<cgroup>. This is to prevent intentional DoS of random cgroups
> > on the machine. However, once a filesystem is mounted with memcg=<cgroup>, any
> > process with write access to this mount point will be able to charge memory to
> > <cgroup>. This is largely a non-issue because in configurations where there is
> > untrusted code running on the machine, mount point access needs to be
> > restricted to the intended users only, regardless of whether the mount point
> > memory is deterministically charged or not.
>
> I'm not a fan of this. It uses filesystem mounts to create shareable
> resource domains outside of the cgroup hierarchy, which has all the
> downsides you listed, and more:
>
> 1. You need a filesystem interface in the first place, and a new
> ad-hoc channel and permission model to coordinate with the cgroup
> tree, which isn't great. All filesystems you want to share data on
> need to be converted.
>
> 2. It doesn't extend to non-filesystem sources of shared data, such as
> memfds, ipc shm etc.
>
> 3. It requires unintuitive configuration for what should be basic
> shared accounting semantics. Per default you still get the old
> 'first touch' semantics, but to get sharing you need to reconfigure
> the filesystems?
>
> 4. If a task needs to work with a hierarchy of data sharing domains -
> system-wide, group of jobs, job - it must interact with a hierarchy
> of filesystem mounts. This is a pain to set up and may require task
> awareness. Moving data around, working with different mount points.
> Also, no shared and private data accounting within the same file.
>
> 5. It reintroduces cgroup1 semantics of tasks and resources, which are
> entangled, sitting in disjoint domains. OOM killing is one quirk of
> that, but there are others you haven't touched on. Who is charged
> for the CPU cycles of reclaim in the out-of-band domain? Who is
> charged for the paging IO? How is resource pressure accounted and
> attributed? Soon you need cpu= and io= as well.
>
> My take on this is that it might work for your rather specific
> usecase, but it doesn't strike me as a general-purpose feature
> suitable for upstream.
>
>
> If we want sharing semantics for memory, I think we need a more
> generic implementation with a cleaner interface.
>
> Here is one idea:
>
> Have you considered reparenting pages that are accessed by multiple
> cgroups to the first common ancestor of those groups?
>
> Essentially, whenever there is a memory access (minor fault, buffered
> IO) to a page that doesn't belong to the accessing task's cgroup, you
> find the common ancestor between that task and the owning cgroup, and
> move the page there.
>
> With a tree like this:
>
> root - job group - job
>      |          `- job
>      `- job group - job
>                  `- job
>
> all pages accessed inside that tree will propagate to the highest
> level at which they are shared - which is the same level where you'd
> also set shared policies, like a job group memory limit or io weight.
>
> E.g. libc pages would (likely) bubble to the root, persistent tmpfs
> pages would bubble to the respective job group, private data would
> stay within each job.
>
> No further user configuration necessary. Although you still *can* use
> mount namespacing etc. to prohibit undesired sharing between cgroups.
>
> The actual user-visible accounting change would be quite small, and
> arguably much more intuitive. Remember that accounting is recursive,
> meaning that a job page today also shows up in the counters of job
> group and root. This would not change. The only thing that IS weird
> today is that when two jobs share a page, it will arbitrarily show up
> in one job's counter but not in the other's. That would change: it
> would no longer show up as either, since it's not private to either;
> it would just be a job group (and up) page.
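
To make the proposal concrete, the core of the idea might look roughly like
the sketch below. The hook point and the recharge_folio() helper are
hypothetical, made up purely for illustration:

/*
 * Hypothetical sketch of reparenting a shared page to the first common
 * ancestor of the owning and accessing memcgs; not code from this series.
 */
static void account_shared_access(struct folio *folio,
				  struct mem_cgroup *accessor)
{
	struct mem_cgroup *owner = folio_memcg(folio);
	struct mem_cgroup *ancestor;

	if (!owner || owner == accessor)
		return;		/* private access, nothing to do */

	/* Walk up from the accessor to the first memcg containing both. */
	for (ancestor = accessor; ancestor; ancestor = parent_mem_cgroup(ancestor))
		if (mem_cgroup_is_descendant(owner, ancestor))
			break;

	/* Move the charge to where the page is actually shared. */
	if (ancestor && ancestor != owner)
		recharge_folio(folio, owner, ancestor);	/* hypothetical helper */
}
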
In general I like the idea, but I think the user-visible change will be quite
large, almost "cgroup v3"-large. Here are some problems:
1) Anything shared between e.g. system.slice and user.slice now belongs
to the root cgroup and is completely unaccounted/unlimited, e.g. all the
pagecache belonging to shared libraries.

2) It's concerning in security terms. If I understand the idea correctly, a
read-only access will allow moving charges to an upper level, potentially
crossing memory.max limits. It doesn't sound safe.

3) It brings a non-trivial amount of memory to non-leaf cgroups. To some extent
it returns us to the cgroup v1 world and the question of competition between
resources consumed by a cgroup directly and through its children. Not that
the problem doesn't exist now, but it's less pronounced.
If, say, >50% of system.slice's memory belongs to system.slice directly,
then we'll likely need separate non-recursive counters, limits, protections,
etc.

4) Imagine a production server and a system administrator logging in via ssh
(and being put into user.slice) and running a big grep... It screws up all
memory accounting until the next reboot. Not a completely impossible scenario.

That said, I agree with Johannes and I'm also not a big fan of this patchset.
I agree that the problem exists and that the patchset provides a solution, but
it doesn't look nice (or generic enough) and creates a lot of questions and
corner cases.

Btw, wouldn't (optional) disabling of memcg accounting for a tmpfs solve your
problem? It would be less invasive and would not require any oom changes.
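
Something along these lines on the shmem side, purely hypothetical (neither
the mount option nor the superblock flag exists today), would keep the change
small:

/*
 * Hypothetical sketch: a tmpfs mount option that simply skips memcg
 * charging for the mount's pages.  The option name and the sbinfo field
 * are made up for illustration.
 */
static int shmem_maybe_charge(struct folio *folio, struct mm_struct *mm,
			      struct shmem_sb_info *sbinfo, gfp_t gfp)
{
	if (sbinfo->memcg_accounting_disabled)	/* set by a "memcg=none"-style option */
		return 0;			/* leave these pages uncharged */

	return mem_cgroup_charge(folio, mm, gfp);	/* normal accounting */
}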