From: Yosry Ahmed
Date: Fri, 21 Jul 2023 11:47:49 -0700
Subject: Re: [RFC PATCH 0/8] memory recharging for offline memcgs
To: Tejun Heo
Cc: Johannes Weiner, Andrew Morton, Michal Hocko, Roman Gushchin,
    Shakeel Butt, Muchun Song, "Matthew Wilcox (Oracle)", Zefan Li,
    Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin, "T.J. Mercier",
    Greg Thelen, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    cgroups@vger.kernel.org
References: <20230720070825.992023-1-yosryahmed@google.com>
    <20230720153515.GA1003248@cmpxchg.org>

On Fri, Jul 21, 2023 at 11:26 AM Tejun Heo wrote:
>
> Hello,
>
> On Fri, Jul 21, 2023 at 11:15:21AM -0700, Yosry Ahmed wrote:
> > On Thu, Jul 20, 2023 at 3:31 PM Tejun Heo wrote:
> > > memory at least in our case. The sharing across them comes down to things
> > > like some common library pages which don't really account for much these
> > > days.
> >
> > Keep in mind that even a single page charged to a memcg and used by
> > another memcg is sufficient to result in a zombie memcg.
>
> I mean, yeah, that's a separate issue or rather a subset which isn't all
> that controversial. That can be deterministically solved by reparenting to
> the parent like how slab is handled. I think the "deterministic" part is
> important here. As you said, even a single page can pin a dying cgroup.

There are serious flaws with reparenting that I mentioned above. We do
it for kernel memory, but that's because we really have no other
choice. Oftentimes the memory is not reclaimable and we cannot find an
owner for it. This doesn't mean it's the right answer for user memory.

The semantics are new compared to normal charging (as opposed to
recharging, as I explain below). There is an extra layer of
indirection whose impact we did not (as far as I know) measure.
Parents end up with pages that they never used, and we have no
observability into where those pages came from. Most importantly, over
time user memory will keep accumulating at the root, reducing the
accuracy and usefulness of accounting. That is effectively an
accounting leak and a reduction of capacity: memory that is not
attributed to any user, aka system overhead.

>
> > > > Keep in mind that the environment is dynamic, workloads are constantly
> > > > coming and going. Even if find the perfect nesting to appropriately
> > > > scope resources, some rescheduling may render the hierarchy obsolete
> > > > and require us to start over.
> > >
> > > Can you please go into more details on how much memory is shared for what
> > > across unrelated dynamic workloads? That sounds different from other use
> > > cases.
> >
> > I am trying to collect more information from our fleet, but the
> > application restarting in a different cgroup is not what is happening
> > in our case. It is not easy to find out exactly what is going on on
>
> This is the point that Johannes raised but I don't think the current
> proposal would make things more deterministic. From what I can see, it
> actually pushes it towards even less predictability. Currently, yeah, some
> pages may end up in cgroups which aren't the majority user but it at least
> is clear how that would happen. The proposed change adds layers of
> indeterministic behaviors on top. I don't think that's the direction we want
> to go.

I believe recharging is being mis-framed here :)

Recharging semantics are not new. Recharging is a shortcut for a
process that already happens today, focused on offline memcgs. Let's
take a step back.

It is common practice (at least to my knowledge) to try to reclaim
memory from a cgroup before deleting it (by lowering the limit or
using memory.reclaim). Reclaim heuristics are biased towards
reclaiming memory from offline cgroups. After the memory is reclaimed,
if it is used again by a different process, it will be refaulted and
charged again (aka recharged) to the new memcg.

What recharging is doing is *not* anything new. It is effectively
doing what reclaim + refault would do above, with an efficient
shortcut. It avoids the unnecessary fault, avoids disrupting the
workload that will access the memory after it is reclaimed, and cleans
up zombie memcgs' memory faster than reclaim would. Moreover, it works
for memory that may not be reclaimable (e.g. because of lack of swap).

All the indeterministic behaviors in recharging are exactly the
indeterministic behaviors in reclaim. The two are very similar: we
iterate the LRUs, try to isolate and lock folios, etc. This is what
reclaim does. Recharging is basically lightweight reclaim + charging
again (as opposed to fully reclaiming the memory then refaulting it).
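
To make the comparison concrete, here is a toy userspace model in
Python. It is only a sketch, not kernel code: Memcg, ToyAccounting and
run() are names made up for illustration, charges are counted per
memcg only (the real counters are hierarchical), and "the next user"
is passed in by hand rather than discovered. It contrasts three ways
of emptying a dying memcg: reclaim followed by refault, recharging,
and reparenting.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Memcg:
    """Toy stand-in for a memory cgroup (hypothetical, not struct mem_cgroup)."""
    name: str
    parent: Optional["Memcg"] = None
    online: bool = True
    charged: int = 0  # pages charged locally; the kernel also counts ancestors


@dataclass
class ToyAccounting:
    owner: Dict[int, Memcg] = field(default_factory=dict)  # page id -> memcg
    refaults: int = 0  # faults a workload paid to get a page back

    def charge(self, page: int, memcg: Memcg) -> None:
        self.owner[page] = memcg
        memcg.charged += 1

    def uncharge(self, page: int) -> Memcg:
        memcg = self.owner.pop(page)
        memcg.charged -= 1
        return memcg

    def reclaim_then_refault(self, page: int, user: Memcg) -> None:
        # Reclaim drops the page; the next access faults it back in and
        # the fault charges it to the memcg that touched it.
        self.uncharge(page)
        self.refaults += 1
        self.charge(page, user)

    def recharge(self, page: int, user: Memcg) -> None:
        # Shortcut: move the charge while the page stays resident.
        # Only pages owned by an offline memcg are touched.
        if self.owner[page].online:
            return
        self.uncharge(page)
        self.charge(page, user)

    def reparent(self, page: int) -> None:
        # Contrast: the parent inherits the charge whether or not it
        # ever used the page.
        parent = self.owner[page].parent
        if parent is None:
            return
        self.uncharge(page)
        self.charge(page, parent)


def run(strategy: str) -> Tuple[int, int, int, int]:
    """Return (dying.charged, user.charged, root.charged, refaults)."""
    root = Memcg("root")
    dying = Memcg("dying", parent=root, online=False)
    user = Memcg("user", parent=root)
    acct = ToyAccounting()
    for page in range(3):
        acct.charge(page, dying)
    for page in range(3):
        if strategy == "reclaim":
            acct.reclaim_then_refault(page, user)
        elif strategy == "recharge":
            acct.recharge(page, user)
        else:
            acct.reparent(page)
    return dying.charged, user.charged, root.charged, acct.refaults
```

In this model run("reclaim") and run("recharge") both end with all
three pages charged to the user memcg; the only difference is the
three refaults on the reclaim path. run("reparent") leaves the pages
charged to the root instead, which is the accumulation problem
described earlier.
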
We are not introducing new indeterminism or new charging semantics.
Recharging does exactly what would happen when we reclaim zombie
memory; it is just faster and more efficient.

>
> > machines and where the memory is coming from due to the
> > indeterministic nature of charging. The goal of this proposal is to
> > let the kernel handle leftover memory in zombie memcgs because it is
> > not always obvious to userspace what's going on (like it's not obvious
> > to me now where exactly is the sharing happening :) ).
> >
> > One thing to note is that in some cases, maybe a userspace bug or
> > failed cleanup is a reason for the zombie memcgs. Ideally, this
> > wouldn't happen, but it would be nice to have a fallback mechanism in
> > the kernel if it does.
>
> I'm not disagreeing on that. Our handling of pages owned by dying cgroups
> isn't great but I don't think the proposed change is an acceptable solution.

I hope the above arguments change your mind :)

>
> Thanks.
>
> --
> tejun