From: Yosry Ahmed
Date: Tue, 25 Apr 2023 11:53:21 -0700
Subject: Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
To: Waiman Long
Cc: "T.J. Mercier", lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org, cgroups@vger.kernel.org, Tejun Heo, Shakeel Butt, Muchun Song, Johannes Weiner, Roman Gushchin, Alistair Popple, Jason Gunthorpe, Kalesh Singh, Yu Zhao, Matthew Wilcox, David Rientjes, Greg Thelen
References: <27e15be8-d0eb-ed32-a0ec-5ec9b59f1f27@redhat.com>
In-Reply-To: <27e15be8-d0eb-ed32-a0ec-5ec9b59f1f27@redhat.com>

On Tue, Apr 25, 2023 at 11:42 AM Waiman Long wrote:
>
> On 4/25/23 07:36, Yosry Ahmed wrote:
> > +David Rientjes +Greg Thelen +Matthew Wilcox
> >
> > On Tue, Apr 11, 2023 at 4:48 PM Yosry Ahmed wrote:
> >> On Tue, Apr 11, 2023 at 4:36 PM T.J. Mercier wrote:
> >>> When a memcg is removed by userspace it gets offlined by the kernel.
> >>> Offline memcgs are hidden from user space, but they still live in the
> >>> kernel until their reference count drops to 0. New allocations cannot
> >>> be charged to offline memcgs, but existing allocations charged to
> >>> offline memcgs remain charged, and hold a reference to the memcg.
> >>>
> >>> As such, an offline memcg can remain in the kernel indefinitely,
> >>> becoming a zombie memcg. The accumulation of a large number of zombie
> >>> memcgs leads to increased system overhead (mainly percpu data in struct
> >>> mem_cgroup). It also causes some kernel operations that scale with the
> >>> number of memcgs to become less efficient (e.g. reclaim).
> >>>
> >>> There are currently out-of-tree solutions which attempt to
> >>> periodically clean up zombie memcgs by reclaiming from them. However,
> >>> that is not effective for non-reclaimable memory, which it would be
> >>> better to reparent or recharge to an online cgroup. There are also
> >>> proposed changes that would benefit from recharging for shared
> >>> resources like pinned pages, or DMA buffer pages.
> >> I am very interested in attending this discussion; it's something that
> >> I have been actively looking into -- specifically recharging pages of
> >> offlined memcgs.
> >>
> >>> Suggested attendees:
> >>> Yosry Ahmed
> >>> Yu Zhao
> >>> T.J. Mercier
> >>> Tejun Heo
> >>> Shakeel Butt
> >>> Muchun Song
> >>> Johannes Weiner
> >>> Roman Gushchin
> >>> Alistair Popple
> >>> Jason Gunthorpe
> >>> Kalesh Singh
> > I was hoping I would bring a more complete idea to this thread, but
> > here is what I have so far.
> >
> > The idea is to recharge the memory charged to memcgs when they are
> > offlined. I like to think of the options we have to deal with memory
> > charged to offline memcgs as a toolkit. This toolkit includes:
> >
> > (a) Evict memory.
> >
> > This is the simplest option, just evict the memory.
> >
> > For file-backed pages, this writes them back to their backing files,
> > uncharging and freeing the page. The next access will read the page
> > again and the faulting process’s memcg will be charged.
> >
> > For swap-backed pages (anon/shmem), this swaps them out. Swapping out
> > a page charged to an offline memcg uncharges the page and charges the
> > swap to its parent. The next access will swap in the page and the
> > parent will be charged. This is effectively deferred recharging to the
> > parent.
> >
> > Pros:
> > - Simple.
> >
> > Cons:
> > - Behavior is different for file-backed vs. swap-backed pages. For
> > swap-backed pages, the memory is recharged to the parent (aka
> > reparented), not charged to the "rightful" user.
> > - Next access will incur higher latency, especially if the pages are active.
> >
> > (b) Direct recharge to the parent
> >
> > This can be done for any page and should be simple as the pages are
> > already hierarchically charged to the parent.
> >
> > Pros:
> > - Simple.
> >
> > Cons:
> > - If a different memcg is using the memory, it will keep taxing the
> > parent indefinitely. Same 'not the "rightful" user' argument.
>
> Muchun had actually posted a patch to do this last year. See
>
> https://lore.kernel.org/all/20220621125658.64935-10-songmuchun@bytedance.com/T/#me9dbbce85e2f3c4e5f34b97dbbdb5f79d77ce147
>
> I am wondering if he is going to post an updated version of that or not.
> Anyway, I am looking forward to learning about the result of this
> discussion even though I am not a conference invitee.

There are a couple of problems that were brought up back then, mainly
that memory will eventually be reparented to the root memcg,
practically escaping accounting. Shared resources may eventually end
up unaccounted. Ideally, we can come up with a scheme where the memory
is charged to the real user, instead of just to the parent.

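To make the reparenting concern concrete, here is a toy userspace
model, not kernel code. The class and method names (ToyMemcg, charge,
offline_and_reparent) are made up for illustration, and the model only
tracks pages charged directly to each memcg. It sketches how charges
that are never freed drift up the hierarchy as ancestors go offline
too, until they land in the root.

```python
# Toy userspace model of memcg reparenting. This is NOT kernel code.
# ToyMemcg, charge() and offline_and_reparent() are hypothetical names
# used only to illustrate the accounting concern in this thread.


class ToyMemcg:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.online = True
        self.local_pages = 0  # pages charged directly to this memcg

    def charge(self, pages):
        # New allocations cannot be charged to an offline memcg.
        if not self.online:
            raise RuntimeError("cannot charge an offline memcg")
        self.local_pages += pages

    def offline_and_reparent(self):
        # Mark this memcg offline and hand its charges to the parent.
        if self.parent is None:
            raise RuntimeError("the root memcg cannot be offlined")
        self.online = False
        self.parent.local_pages += self.local_pages
        self.local_pages = 0


root = ToyMemcg("root")
parent = ToyMemcg("parent", root)
a = ToyMemcg("A", parent)

a.charge(100)                   # e.g. shared or pinned pages that stay around
a.offline_and_reparent()        # the charge now sits in "parent"
parent.offline_and_reparent()   # and now in the root memcg

print(root.local_pages)         # prints 100
```

Nothing in this model records who is actually using the 100 pages,
which is the gap a recharge-to-the-real-user scheme would need to
close.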
Consider the case where processes in memcg A and B are both using
memory that is charged to memcg A. If memcg A goes offline, and we
reparent the memory, memcg B keeps using the memory for free, taxing
A's parent, or the entire system if that's root.

Also, if there is a kernel bug and a page is being pinned
unnecessarily, it will never be reclaimed; it will stick around and
eventually be reparented to the root memcg. If being reparented to the
root memcg is a legitimate action, you can't easily tell whether pages
are sticking around because someone is using them or because of a
kernel bug.

>
> Thanks,
> Longman