References: <20230720070825.992023-1-yosryahmed@google.com>
    <20230720153515.GA1003248@cmpxchg.org>
    <20230721204408.GA1033322@cmpxchg.org>
In-Reply-To: <20230721204408.GA1033322@cmpxchg.org>
From: Yosry Ahmed
Date: Fri, 21 Jul 2023 13:59:55 -0700
Subject: Re: [RFC PATCH 0/8] memory recharging for offline memcgs
To: Johannes Weiner
Cc: Tejun Heo, Andrew Morton, Michal Hocko, Roman Gushchin, Shakeel Butt,
    Muchun Song, "Matthew Wilcox (Oracle)", Zefan Li, Yu Zhao,
    Luis Chamberlain, Kees Cook, Iurii Zaikin, "T.J. Mercier", Greg Thelen,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org

On Fri, Jul 21, 2023 at 1:44 PM Johannes Weiner wrote:
>
> On Fri, Jul 21, 2023 at 11:47:49AM -0700, Yosry Ahmed wrote:
> > On Fri, Jul 21, 2023
> > at 11:26 AM Tejun Heo wrote:
> > >
> > > Hello,
> > >
> > > On Fri, Jul 21, 2023 at 11:15:21AM -0700, Yosry Ahmed wrote:
> > > > On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo wrote:
> > > > > memory at least in our case. The sharing across them comes down to things
> > > > > like some common library pages which don't really account for much these
> > > > > days.
> > > >
> > > > Keep in mind that even a single page charged to a memcg and used by
> > > > another memcg is sufficient to result in a zombie memcg.
> > >
> > > I mean, yeah, that's a separate issue or rather a subset which isn't all
> > > that controversial. That can be deterministically solved by reparenting to
> > > the parent like how slab is handled. I think the "deterministic" part is
> > > important here. As you said, even a single page can pin a dying cgroup.
> >
> > There are serious flaws with reparenting that I mentioned above. We do
> > it for kernel memory, but that's because we really have no other
> > choice. Oftentimes the memory is not reclaimable and we cannot find an
> > owner for it. This doesn't mean it's the right answer for user memory.
> >
> > The semantics are new compared to normal charging (as opposed to
> > recharging, as I explain below). There is an extra layer of
> > indirection that we did not (as far as I know) measure the impact of.
> > Parents end up with pages that they never used and we have no
> > observability into where it came from. Most importantly, over time
> > user memory will keep accumulating at the root, reducing the accuracy
> > and usefulness of accounting, effectively an accounting leak and
> > reduction of capacity. Memory that is not attributed to any user, aka
> > system overhead.
>
> Reparenting has been the behavior since the first iteration of cgroups
> in the kernel. The initial implementation would loop over the LRUs and
> reparent pages synchronously during rmdir.
> This had some locking issues, so we switched to the current
> implementation of just leaving the zombie memcg behind but neutralizing
> its controls.

Thanks for the context.

>
> Thanks to Roman's objcg abstraction, we can now go back to the old
> implementation of directly moving pages up to avoid the zombies.
>
> However, these were pure implementation changes. The user-visible
> semantics never varied: when you delete a cgroup, any leftover
> resources are subject to control by the remaining parent cgroups.
> Don't remove control domains if you still need to control resources.
> But none of this is new or would change in any way!

The problem is that you cannot fully monitor or control all the
resources charged to a control domain. The example of common shared
libraries stands: the pages are charged on a first-touch basis. You
can't easily control that, or monitor who is charged for what exactly.
Even if you can find out, is the answer to leave the cgroup alive
forever because it is charged for a shared resource?

> Neutralizing controls of a zombie cgroup results in the same behavior
> and accounting as linking the pages to the parent cgroup's LRU!
>
> The only thing that's new is the zombie cgroups. We can fix that by
> effectively going back to the earlier implementation, but thanks to
> objcg without the locking problems.
>
> I just wanted to address this, because your description/framing of
> reparenting strikes me as quite wrong.

Thanks for the context, and sorry if my framing was inaccurate. I was
focused more on the in-kernel semantics than on the user-visible
semantics. Nonetheless, with today's behavior or with reparenting, once
the memory is at the root level (whether reparented to the root level,
or in a zombie memcg whose parent is root), the memory has effectively
escaped accounting. This is not a new problem that reparenting would
introduce, but it is a problem that recharging is trying to fix and
that reparenting won't.
As I outlined above, the semantics of recharging are not new: they are
equivalent to reclaiming and refaulting the memory in a more
accelerated/efficient manner. The indeterminism in recharging is very
similar to that of reclaiming and refaulting.

What do you think?
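
To make the reparenting vs. recharging comparison above concrete, here
is a toy userspace model in Python. It is only a sketch and not kernel
code: every name in it (Memcg, Page, touch, rmdir, reparent,
recharge_on_access, demo) is made up for illustration, and it assumes
that recharging hands a page owned by an offline memcg to the next
memcg that accesses it.

```python
# Toy model only: none of these names correspond to kernel symbols.

class Memcg:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.online = True
        self.pages = set()      # pages currently charged to this group


class Page:
    def __init__(self, label):
        self.label = label
        self.memcg = None       # owner, set on first touch


def touch(page, memcg):
    """First-touch charging: only the first user of a page is charged."""
    if page.memcg is None:
        page.memcg = memcg
        memcg.pages.add(page)


def rmdir(memcg):
    """Offline a group; it lingers as a zombie while pages stay charged."""
    memcg.online = False


def is_zombie(memcg):
    return not memcg.online and bool(memcg.pages)


def move(page, dst):
    """Transfer the charge for one page to another group."""
    page.memcg.pages.discard(page)
    page.memcg = dst
    dst.pages.add(page)


def reparent(memcg):
    """Deterministic option: move every leftover page to the parent."""
    for page in list(memcg.pages):
        move(page, memcg.parent)


def recharge_on_access(page, accessor):
    """Assumed recharging behavior: if the owner is offline, charge the
    group that uses the page next (like reclaim + refault, minus the I/O)."""
    if page.memcg is not None and not page.memcg.online and accessor.online:
        move(page, accessor)


def demo(policy):
    root = Memcg("root")
    a = Memcg("A", root)
    b = Memcg("B", root)
    lib = Page("shared library page")

    touch(lib, a)               # A faults the page in first and is charged
    touch(lib, b)               # B shares the page but is not charged
    rmdir(a)                    # A is removed while B still uses the page
    assert is_zombie(a)         # a single page is enough to pin A

    if policy == "reparent":
        reparent(a)
    else:
        recharge_on_access(lib, b)
    return lib.memcg.name, is_zombie(a)


if __name__ == "__main__":
    print(demo("reparent"))
    print(demo("recharge"))
```

In this model both policies get rid of the zombie, but
demo("reparent") leaves the page charged to root, where it is no longer
attributed to any user, while demo("recharge") charges it to B, the
group that is still using it.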