Message-ID: <8ad74529-890a-8300-c2ad-ddaa679b9c87@redhat.com>
Date: Wed, 26 Apr 2023 16:15:47 -0400
Subject: Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
From: Waiman Long <longman@redhat.com>
To: Yosry Ahmed
Cc: "T.J. Mercier", lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
 cgroups@vger.kernel.org, Tejun Heo, Shakeel Butt, Muchun Song,
 Johannes Weiner, Roman Gushchin, Alistair Popple, Jason Gunthorpe,
 Kalesh Singh, Yu Zhao, Matthew Wilcox, David Rientjes, Greg Thelen
References: <27e15be8-d0eb-ed32-a0ec-5ec9b59f1f27@redhat.com>

On 4/25/23 14:53, Yosry Ahmed wrote:
> On Tue, Apr 25, 2023 at 11:42 AM Waiman Long wrote:
>> On 4/25/23 07:36, Yosry Ahmed wrote:
>>> +David Rientjes +Greg Thelen +Matthew Wilcox
>>>
>>> On Tue, Apr 11, 2023 at 4:48 PM Yosry Ahmed wrote:
>>>> On Tue, Apr 11, 2023 at 4:36 PM T.J. Mercier wrote:
>>>>> When a memcg is removed by userspace it gets offlined by the kernel.
>>>>> Offline memcgs are hidden from user space, but they still live in
>>>>> the kernel until their reference count drops to 0. New allocations
>>>>> cannot be charged to offline memcgs, but existing allocations
>>>>> charged to offline memcgs remain charged, and hold a reference to
>>>>> the memcg.
>>>>>
>>>>> As such, an offline memcg can remain in the kernel indefinitely,
>>>>> becoming a zombie memcg. The accumulation of a large number of
>>>>> zombie memcgs leads to increased system overhead (mainly the percpu
>>>>> data in struct mem_cgroup). It also causes some kernel operations
>>>>> that scale with the number of memcgs to become less efficient
>>>>> (e.g. reclaim).
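For anyone who wants to see this first-hand, here is a minimal sketch
of how such a zombie is created. This is my own illustration, not part
of the proposal; it assumes cgroup v2 mounted at /sys/fs/cgroup and
must run as root, and the "zombie-demo" name and the file being read
are arbitrary:

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write a string to a cgroupfs control file, bailing out on error. */
static void write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f || fputs(val, f) == EOF || fclose(f) == EOF) {
		perror(path);
		exit(1);
	}
}

int main(void)
{
	char pid[32];

	snprintf(pid, sizeof(pid), "%d\n", getpid());

	if (mkdir("/sys/fs/cgroup/zombie-demo", 0755) && errno != EEXIST) {
		perror("mkdir");
		return 1;
	}

	/* Enter the new memcg and charge some page cache to it (assuming
	 * the file is not already resident in the page cache). */
	write_str("/sys/fs/cgroup/zombie-demo/cgroup.procs", pid);
	system("cat /usr/bin/ls > /dev/null");

	/* Leave, then remove the cgroup. The rmdir() succeeds and the
	 * memcg is offlined, but the charged pages keep its struct
	 * mem_cgroup alive: a zombie, invisible to user space. */
	write_str("/sys/fs/cgroup/cgroup.procs", pid);
	rmdir("/sys/fs/cgroup/zombie-demo");
	return 0;
}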
>>>>>
>>>>> There are currently out-of-tree solutions which attempt to
>>>>> periodically clean up zombie memcgs by reclaiming from them.
>>>>> However, that is not effective for non-reclaimable memory, which
>>>>> would be better reparented or recharged to an online cgroup. There
>>>>> are also proposed changes that would benefit from recharging of
>>>>> shared resources like pinned pages or DMA buffer pages.
>>>> I am very interested in attending this discussion; it's something
>>>> that I have been actively looking into -- specifically recharging
>>>> pages of offlined memcgs.
>>>>
>>>>> Suggested attendees:
>>>>> Yosry Ahmed
>>>>> Yu Zhao
>>>>> T.J. Mercier
>>>>> Tejun Heo
>>>>> Shakeel Butt
>>>>> Muchun Song
>>>>> Johannes Weiner
>>>>> Roman Gushchin
>>>>> Alistair Popple
>>>>> Jason Gunthorpe
>>>>> Kalesh Singh
>>> I was hoping I would bring a more complete idea to this thread, but
>>> here is what I have so far.
>>>
>>> The idea is to recharge the memory charged to memcgs when they are
>>> offlined. I like to think of the options we have to deal with memory
>>> charged to offline memcgs as a toolkit. This toolkit includes:
>>>
>>> (a) Evict memory.
>>>
>>> This is the simplest option: just evict the memory.
>>>
>>> For file-backed pages, this writes them back to their backing files,
>>> uncharging and freeing the page. The next access will read the page
>>> again, and the faulting process’s memcg will be charged.
>>>
>>> For swap-backed pages (anon/shmem), this swaps them out. Swapping
>>> out a page charged to an offline memcg uncharges the page and
>>> charges the swap to its parent. The next access will swap in the
>>> page and the parent will be charged. This is effectively deferred
>>> recharging to the parent.
>>>
>>> Pros:
>>> - Simple.
>>>
>>> Cons:
>>> - Behavior is different for file-backed vs. swap-backed pages: for
>>>   swap-backed pages, the memory is recharged to the parent (aka
>>>   reparented), not charged to the "rightful" user.
>>> - The next access will incur higher latency, especially if the
>>>   pages are active.
>>>
>>> (b) Direct recharge to the parent.
>>>
>>> This can be done for any page and should be simple, as the pages are
>>> already hierarchically charged to the parent.
>>>
>>> Pros:
>>> - Simple.
>>>
>>> Cons:
>>> - If a different memcg is using the memory, it will keep taxing the
>>>   parent indefinitely. Same "not the rightful user" argument as
>>>   above.
>> Muchun had actually posted a patch to do this last year. See
>>
>> https://lore.kernel.org/all/20220621125658.64935-10-songmuchun@bytedance.com/T/#me9dbbce85e2f3c4e5f34b97dbbdb5f79d77ce147
>>
>> I am wondering if he is going to post an updated version of that or
>> not. Anyway, I am looking forward to learning about the result of
>> this discussion even though I am not a conference invitee.
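To make the mechanics of option (b) concrete, here is a toy userspace
model of reparenting. Every name in it is invented for illustration and
bears no resemblance to the kernel's actual data structures:

#include <stdio.h>

struct toy_memcg {
	const char *name;
	struct toy_memcg *parent;
	long nr_charged;	/* objects charged directly to this memcg */
};

/* On offline, the parent absorbs all outstanding charges, so nothing
 * needs to keep a reference to the dying memcg. The downside discussed
 * below: repeated offlining pushes charges toward the root, where they
 * effectively escape accounting. */
static void toy_offline(struct toy_memcg *memcg)
{
	memcg->parent->nr_charged += memcg->nr_charged;
	memcg->nr_charged = 0;
}

int main(void)
{
	struct toy_memcg root = { "root", NULL, 0 };
	struct toy_memcg a = { "A", &root, 1000 };

	toy_offline(&a);
	printf("%s now holds %ld charges\n", root.name, root.nr_charged);
	return 0;
}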
> There are a couple of problems that were brought up back then, mainly
> that memory will eventually be reparented to the root memcg,
> practically escaping accounting. Shared resources may end up
> unaccounted eventually. Ideally, we can come up with a scheme where
> the memory is charged to the real user, instead of just to the
> parent.
>
> Consider the case where processes in memcgs A and B are both using
> memory that is charged to memcg A. If memcg A goes offline and we
> reparent the memory, memcg B keeps using the memory for free, taxing
> A's parent, or the entire system if that parent is root.
>
> Also, if there is a kernel bug and pages are being pinned
> unnecessarily, those pages will never be reclaimed and will stick
> around, eventually being reparented to the root memcg. If being
> reparented to the root memcg is a legitimate outcome, you cannot
> easily tell whether pages are sticking around because someone is
> still using them or because of a kernel bug.

This is certainly a valid concern. We are currently doing reparenting
for slab objects, but physical pages have a higher probability of
being shared by different tasks. I do hope that we can come to an
agreement soon on how best to address this issue.

Thanks,
Longman
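P.S. For anyone who wants to experiment with the "periodically reclaim
from zombies" approach without out-of-tree patches: writing to a
parent's memory.reclaim file (cgroup v2, available since v5.19)
triggers hierarchical reclaim over that subtree, which should cover
offline descendants as well. A small sketch, where the cgroup path and
the 64M amount are example values:

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/fs/cgroup/parent/memory.reclaim", "w");

	if (!f)
		return 1;
	/* Ask the kernel to proactively reclaim 64M from this subtree,
	 * including memory charged to its zombie children. */
	fputs("64M", f);
	return fclose(f) ? 1 : 0;
}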