From: Chris Li <chrisl@kernel.org>
To: "T.J. Mercier" <tjmercier@google.com>
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
cgroups@vger.kernel.org, Yosry Ahmed <yosryahmed@google.com>,
Tejun Heo <tj@kernel.org>, Shakeel Butt <shakeelb@google.com>,
Muchun Song <muchun.song@linux.dev>,
Johannes Weiner <hannes@cmpxchg.org>,
Roman Gushchin <roman.gushchin@linux.dev>,
Alistair Popple <apopple@nvidia.com>,
Jason Gunthorpe <jgg@nvidia.com>,
Kalesh Singh <kaleshsingh@google.com>,
Yu Zhao <yuzhao@google.com>
Subject: Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
Date: Wed, 3 May 2023 15:15:43 -0700
Message-ID: <ZFLdDyHoIdJSXJt+@google.com>
In-Reply-To: <CABdmKX2M6koq4Q0Cmp_-=wbP0Qa190HdEGGaHfxNS05gAkUtPA@mail.gmail.com>
Hi T.J.,
On Tue, Apr 11, 2023 at 04:36:37PM -0700, T.J. Mercier wrote:
> When a memcg is removed by userspace it gets offlined by the kernel.
> Offline memcgs are hidden from user space, but they still live in the
> kernel until their reference count drops to 0. New allocations cannot
> be charged to offline memcgs, but existing allocations charged to
> offline memcgs remain charged, and hold a reference to the memcg.
>
> As such, an offline memcg can remain in the kernel indefinitely,
> becoming a zombie memcg. The accumulation of a large number of zombie
> memcgs lead to increased system overhead (mainly percpu data in struct
> mem_cgroup). It also causes some kernel operations that scale with the
> number of memcgs to become less efficient (e.g. reclaim).
>
> There are currently out-of-tree solutions which attempt to
> periodically clean up zombie memcgs by reclaiming from them. However
> that is not effective for non-reclaimable memory, which it would be
> better to reparent or recharge to an online cgroup. There are also
> proposed changes that would benefit from recharging for shared
> resources like pinned pages, or DMA buffer pages.
I am also interested in this topic. T.J. and I have had some offline
discussions about this, and we have some proposals to solve this
problem.
I will share the write-up here for the upcoming LSF/MM discussion.
Shared Memory Cgroup Controllers
= Introduction
The current memory cgroup controller does not support shared memory objects. For memory that is shared between different processes, it is not obvious which process should get charged. Google has an internal tmpfs “memcg=” mount option that charges tmpfs data to a specific memcg, which is often different from the one where the charging processes run. However, it runs into difficulties when the charged memcg is removed and becomes a zombie memcg.
Other approaches include “reparenting” the memcg charge to the parent memcg, which has its own problem: if the charge is huge, iterating over it to reparent can be costly.
= Proposed Solution
The proposed solution is to add a new type of memory controller for shared memory usage, e.g. tmpfs, hugetlb, file system mmap and dma_buf. This shared memory cgroup controller object (smemcg) will have the same life cycle as the underlying shared memory.
Processes cannot be added to the shared memory cgroup. Instead, the shared memory cgroup can be added to a memcg using a “smemcg” API file, similar to adding a process through the “tasks” API file.
When a smemcg is added to a memcg, the amount of memory shared by the processes in the memcg will be accounted for as part of the memcg “memory.current”. The memory.current of the memcg is made up of two parts: 1) the processes' anonymous memory and 2) the memory shared from the smemcg.
When the memcg “memory.current” rises to the limit, the kernel will actively try to reclaim from the memcg to bring “smemcg memory + process anonymous memory” within the limit. Further memory allocations by processes in that memcg will fail if the limit cannot be met. If many reclaim attempts fail to bring the memcg “memory.current” within the limit, the processes in this memcg will get OOM killed.
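To make the accounting concrete, here is a rough userspace toy model in Python. It is not kernel code. The class and method names (Smemcg, Memcg, attach_smemcg, memory_current) are made up for illustration, and the sketch assumes the simplest reading of the proposal, where the full smemcg charge counts toward every memcg it is attached to.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # compare by identity, like distinct kernel objects
class Smemcg:
    """Hypothetical shared-memory cgroup; pages are charged here exactly once."""
    name: str
    charged_bytes: int = 0

    def charge(self, nbytes: int) -> None:
        self.charged_bytes += nbytes


@dataclass(eq=False)
class Memcg:
    """Toy memcg: usage is anonymous memory plus the attached smemcgs."""
    name: str
    anon_bytes: int = 0
    smemcgs: List[Smemcg] = field(default_factory=list)

    def attach_smemcg(self, smemcg: Smemcg) -> None:
        # Stands in for writing to the proposed "smemcg" API file.
        if smemcg not in self.smemcgs:
            self.smemcgs.append(smemcg)

    def memory_current(self) -> int:
        # Assumption: the whole smemcg charge counts toward this memcg.
        shared = sum(s.charged_bytes for s in self.smemcgs)
        return self.anon_bytes + shared
```

For example, a memcg with 100 bytes of anonymous memory and an attached smemcg holding 300 bytes would report 400 in this model, while the smemcg itself is still charged only once.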
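The limit behavior described above could be sketched roughly as follows. This is a userspace toy model in Python, not the kernel's charge path. The function name try_charge, the reclaim callback, and the retry count are all assumptions made for the sketch; "usage" here stands for smemcg memory plus process anonymous memory.

```python
from typing import Callable, Tuple

MAX_RECLAIM_RETRIES = 5  # assumed value, not taken from the proposal


def try_charge(usage: int, limit: int, nbytes: int,
               reclaim: Callable[[int], int],
               retries: int = MAX_RECLAIM_RETRIES) -> Tuple[str, int]:
    """Try to charge nbytes against limit, reclaiming when over.

    reclaim(want) is a hypothetical callback returning the bytes it freed.
    Returns ("charged", new_usage) or ("oom", usage_after_reclaim).
    """
    for _ in range(retries):
        if usage + nbytes <= limit:
            return "charged", usage + nbytes
        freed = reclaim(usage + nbytes - limit)
        if freed <= 0:
            break  # nothing reclaimable, stop retrying
        usage -= freed
    if usage + nbytes <= limit:
        return "charged", usage + nbytes
    # Repeated reclaim could not get under the limit: the OOM case.
    return "oom", usage
```

In this model the allocation only succeeds once reclaim has made enough room, and the "oom" outcome corresponds to the case where the processes in the memcg would get OOM killed.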
= Benefits
The benefits of this solution include:
* No zombie memcgs. The life cycle of the smemcg matches that of the shared memory file system or dma_buf.
* No reparenting. Shared memory is charged only once, to the smemcg object. A memcg can include a smemcg as part of its memory usage. When the processes exit and the memcg gets deleted, the charge remains with the smemcg object.
* A much cleaner mental model: each shared memory page is charged to exactly one smemcg, only once.
* The same smemcg can be added to more than one memcg, which better describes the shared memory relationship.
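The "no zombie" point can be illustrated with a small reference-counting toy model in Python. It is only a sketch of the idea from the quoted problem statement, where every charged page holds a reference to the object it is charged to; the names Counted, charge_page, uncharge_page and can_free_after_removal are hypothetical.

```python
class Counted:
    """Toy stand-in for a memcg or smemcg with a reference count."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.refs = 0


def charge_page(owner: Counted) -> None:
    # Each charged page pins the object it is charged to.
    owner.refs += 1


def uncharge_page(owner: Counted) -> None:
    owner.refs -= 1


def can_free_after_removal(cgroup: Counted) -> bool:
    # An offline cgroup is only released once nothing references it.
    return cgroup.refs == 0
```

If long-lived tmpfs pages are charged directly to a memcg, can_free_after_removal() stays false after userspace removes it, which is the zombie case. If the same pages are charged to a smemcg instead, the memcg holds no such references and can be freed, while the smemcg lives until the shared memory itself goes away.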
Chris
> Suggested attendees:
> Yosry Ahmed <yosryahmed@google.com>
> Yu Zhao <yuzhao@google.com>
> T.J. Mercier <tjmercier@google.com>
> Tejun Heo <tj@kernel.org>
> Shakeel Butt <shakeelb@google.com>
> Muchun Song <muchun.song@linux.dev>
> Johannes Weiner <hannes@cmpxchg.org>
> Roman Gushchin <roman.gushchin@linux.dev>
> Alistair Popple <apopple@nvidia.com>
> Jason Gunthorpe <jgg@nvidia.com>
> Kalesh Singh <kaleshsingh@google.com>
>