linux-mm.kvack.org archive mirror
From: Waiman Long <longman@redhat.com>
To: Yosry Ahmed <yosryahmed@google.com>
Cc: "T.J. Mercier" <tjmercier@google.com>,
	lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
	cgroups@vger.kernel.org, Tejun Heo <tj@kernel.org>,
	Shakeel Butt <shakeelb@google.com>,
	Muchun Song <muchun.song@linux.dev>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Roman Gushchin <roman.gushchin@linux.dev>,
	Alistair Popple <apopple@nvidia.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	Kalesh Singh <kaleshsingh@google.com>,
	Yu Zhao <yuzhao@google.com>, Matthew Wilcox <willy@infradead.org>,
	David Rientjes <rientjes@google.com>,
	Greg Thelen <gthelen@google.com>
Subject: Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
Date: Wed, 26 Apr 2023 16:15:47 -0400
Message-ID: <8ad74529-890a-8300-c2ad-ddaa679b9c87@redhat.com>
In-Reply-To: <CAJD7tkb1W0bP3AU9KepOYPx-AD-fMKSfUhj_Cmth63RS9umMsg@mail.gmail.com>

On 4/25/23 14:53, Yosry Ahmed wrote:
> On Tue, Apr 25, 2023 at 11:42 AM Waiman Long <longman@redhat.com> wrote:
>> On 4/25/23 07:36, Yosry Ahmed wrote:
>>>    +David Rientjes +Greg Thelen +Matthew Wilcox
>>>
>>> On Tue, Apr 11, 2023 at 4:48 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>>>> On Tue, Apr 11, 2023 at 4:36 PM T.J. Mercier <tjmercier@google.com> wrote:
>>>>> When a memcg is removed by userspace it gets offlined by the kernel.
>>>>> Offline memcgs are hidden from user space, but they still live in the
>>>>> kernel until their reference count drops to 0. New allocations cannot
>>>>> be charged to offline memcgs, but existing allocations charged to
>>>>> offline memcgs remain charged, and hold a reference to the memcg.
>>>>>
>>>>> As such, an offline memcg can remain in the kernel indefinitely,
>>>>> becoming a zombie memcg. The accumulation of a large number of zombie
>>>>> memcgs leads to increased system overhead (mainly percpu data in struct
>>>>> mem_cgroup). It also causes some kernel operations that scale with the
>>>>> number of memcgs to become less efficient (e.g. reclaim).
>>>>>
>>>>> There are currently out-of-tree solutions which attempt to
>>>>> periodically clean up zombie memcgs by reclaiming from them. However
>>>>> that is not effective for non-reclaimable memory, which it would be
>>>>> better to reparent or recharge to an online cgroup. There are also
>>>>> proposed changes that would benefit from recharging for shared
>>>>> resources like pinned pages, or DMA buffer pages.
>>>> I am very interested in attending this discussion, it's something that
>>>> I have been actively looking into -- specifically recharging pages of
>>>> offlined memcgs.
>>>>
>>>>> Suggested attendees:
>>>>> Yosry Ahmed <yosryahmed@google.com>
>>>>> Yu Zhao <yuzhao@google.com>
>>>>> T.J. Mercier <tjmercier@google.com>
>>>>> Tejun Heo <tj@kernel.org>
>>>>> Shakeel Butt <shakeelb@google.com>
>>>>> Muchun Song <muchun.song@linux.dev>
>>>>> Johannes Weiner <hannes@cmpxchg.org>
>>>>> Roman Gushchin <roman.gushchin@linux.dev>
>>>>> Alistair Popple <apopple@nvidia.com>
>>>>> Jason Gunthorpe <jgg@nvidia.com>
>>>>> Kalesh Singh <kaleshsingh@google.com>
>>> I was hoping I would bring a more complete idea to this thread, but
>>> here is what I have so far.
>>>
>>> The idea is to recharge the memory charged to memcgs when they are
>>> offlined. I like to think of the options we have to deal with memory
>>> charged to offline memcgs as a toolkit. This toolkit includes:
>>>
>>> (a) Evict memory.
>>>
>>> This is the simplest option, just evict the memory.
>>>
>>> For file-backed pages, this writes them back to their backing files,
>>> uncharging and freeing the page. The next access will read the page
>>> again and the faulting process’s memcg will be charged.
>>>
>>> For swap-backed pages (anon/shmem), this swaps them out. Swapping out
>>> a page charged to an offline memcg uncharges the page and charges the
>>> swap to its parent. The next access will swap in the page and the
>>> parent will be charged. This is effectively deferred recharging to the
>>> parent.
>>>
>>> Pros:
>>> - Simple.
>>>
>>> Cons:
>>> - Behavior is different for file-backed vs. swap-backed pages: for
>>> swap-backed pages, the memory is recharged to the parent (aka
>>> reparented), not charged to the "rightful" user.
>>> - Next access will incur higher latency, especially if the pages are active.
>>>
>>> (b) Direct recharge to the parent
>>>
>>> This can be done for any page and should be simple as the pages are
>>> already hierarchically charged to the parent.
>>>
>>> Pros:
>>> - Simple.
>>>
>>> Cons:
>>> - If a different memcg is using the memory, it will keep taxing the
>>> parent indefinitely. The same "not the rightful user" argument applies.
>> Muchun had actually posted a patch to do this last year. See
>>
>> https://lore.kernel.org/all/20220621125658.64935-10-songmuchun@bytedance.com/T/#me9dbbce85e2f3c4e5f34b97dbbdb5f79d77ce147
>>
>> I am wondering if he is going to post an updated version of that or not.
>> Anyway, I am looking forward to learning about the result of this
>> discussion even though I am not a conference invitee.
> There are a couple of problems that were brought up back then, mainly
> that memory will be reparented to the root memcg eventually,
> practically escaping accounting. Shared resources may end up being
> eventually unaccounted. Ideally, we can come up with a scheme where
> the memory is charged to the real user, instead of just to the parent.
>
> Consider the case where processes in memcg A and B are both using
> memory that is charged to memcg A. If memcg A goes offline, and we
> reparent the memory, memcg B keeps using the memory for free, taxing
> A's parent, or the entire system if that's root.
>
> Also, if there is a kernel bug and pages are being pinned
> unnecessarily, those pages will never be reclaimed and will stick
> around and eventually be reparented to the root memcg. If being
> reparented to the root memcg is a legitimate action, you can't easily
> tell whether pages are sticking around because they are being used by
> someone or because of a kernel bug.

This is certainly a valid concern. We are currently doing reparenting
for slab objects. However, physical pages have a higher probability of
being shared by different tasks. I do hope that we can come to an
agreement soon on how best to address this issue.
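
To make the memcg A/B scenario quoted above concrete, here is a minimal
toy model in Python. It is only a sketch and not how the kernel implements
charging: the Memcg and Page classes and the charge() and reparent()
helpers are hypothetical names invented for this illustration. It assumes
hierarchical charging (a charge is counted in the memcg and every
ancestor) and models reparenting as moving a page's owner to the parent.

```python
class Memcg:
    """Toy memory cgroup; NOT the kernel's struct mem_cgroup."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.online = True
        self.usage = 0  # hierarchical: own pages plus descendants' pages
        self.refs = 0   # pages that still point directly at this memcg

    def self_and_ancestors(self):
        node = self
        while node is not None:
            yield node
            node = node.parent

    @property
    def is_zombie(self):
        # Offline, but kept alive by pages that are still charged to it.
        return not self.online and self.refs > 0


class Page:
    def __init__(self):
        self.memcg = None  # the memcg this page is charged to


def charge(page, memcg):
    """Charge a page to memcg and, hierarchically, to all its ancestors."""
    if not memcg.online:
        raise ValueError("new allocations cannot be charged to an offline memcg")
    page.memcg = memcg
    memcg.refs += 1
    for node in memcg.self_and_ancestors():
        node.usage += 1


def reparent(page):
    """Move a page's charge to the parent of its current memcg.

    The ancestors already count the page, so only the old owner's
    counters and the page's owner pointer change.
    """
    old = page.memcg
    new = old.parent
    if new is None:
        raise ValueError("the root memcg has no parent to reparent to")
    old.refs -= 1
    old.usage -= 1
    new.refs += 1
    page.memcg = new


if __name__ == "__main__":
    root = Memcg("root")
    parent = Memcg("parent", root)
    a = Memcg("A", parent)
    b = Memcg("B", parent)  # B's tasks also use the pages below

    shared = [Page() for _ in range(4)]
    for p in shared:
        charge(p, a)        # first toucher A pays for the shared pages

    a.online = False        # userspace removes A
    print("A is zombie:", a.is_zombie)

    for p in shared:
        reparent(p)         # option (b): direct recharge to the parent
    print("A is zombie:", a.is_zombie)
    print("parent usage:", parent.usage, "B usage:", b.usage)
```

In this model B never shows up in any counter even though its tasks may be
the only remaining users of the pages, which is the accounting gap
described above. Repeating the offline-and-reparent step for "parent"
would move the pages to the root, where they are effectively unaccounted.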

Thanks,
Longman


