From: Yosry Ahmed
Date: Tue, 25 Apr 2023 04:36:53 -0700
Subject: Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
To: "T.J. Mercier"
Cc: lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
 cgroups@vger.kernel.org, Tejun Heo, Shakeel Butt, Muchun Song,
 Johannes Weiner, Roman Gushchin, Alistair Popple, Jason Gunthorpe,
 Kalesh Singh, Yu Zhao, Matthew Wilcox, David Rientjes, Greg Thelen

+David Rientjes +Greg Thelen +Matthew Wilcox

On Tue, Apr 11, 2023 at 4:48 PM Yosry Ahmed wrote:
>
> On Tue, Apr 11, 2023 at 4:36 PM T.J. Mercier wrote:
> >
> > When a memcg is removed by userspace it gets offlined by the kernel.
> > Offline memcgs are hidden from user space, but they still live in the
> > kernel until their reference count drops to 0. New allocations cannot
> > be charged to offline memcgs, but existing allocations charged to
> > offline memcgs remain charged, and hold a reference to the memcg.
> >
> > As such, an offline memcg can remain in the kernel indefinitely,
> > becoming a zombie memcg. The accumulation of a large number of zombie
> > memcgs leads to increased system overhead (mainly percpu data in
> > struct mem_cgroup). It also causes some kernel operations that scale
> > with the number of memcgs to become less efficient (e.g. reclaim).
> >
> > There are currently out-of-tree solutions which attempt to
> > periodically clean up zombie memcgs by reclaiming from them. However,
> > that is not effective for non-reclaimable memory, which would be
> > better reparented or recharged to an online cgroup. There are also
> > proposed changes that would benefit from recharging for shared
> > resources like pinned pages or DMA buffer pages.
>
> I am very interested in attending this discussion; it's something that
> I have been actively looking into -- specifically recharging pages of
> offlined memcgs.
>
> >
> > Suggested attendees:
> > Yosry Ahmed
> > Yu Zhao
> > T.J. Mercier
> > Tejun Heo
> > Shakeel Butt
> > Muchun Song
> > Johannes Weiner
> > Roman Gushchin
> > Alistair Popple
> > Jason Gunthorpe
> > Kalesh Singh

I was hoping I would bring a more complete idea to this thread, but
here is what I have so far.

The idea is to recharge the memory charged to memcgs when they are
offlined. I like to think of the options we have to deal with memory
charged to offline memcgs as a toolkit. This toolkit includes:

(a) Evict memory.

This is the simplest option: just evict the memory.

For file-backed pages, this writes them back to their backing files,
uncharging and freeing the page. The next access will read the page
again, and the faulting process's memcg will be charged.

For swap-backed pages (anon/shmem), this swaps them out. Swapping out
a page charged to an offline memcg uncharges the page and charges the
swap to its parent. The next access will swap in the page and the
parent will be charged. This is effectively deferred recharging to
the parent.

Pros:
- Simple.

Cons:
- Behavior is different for file-backed vs. swap-backed pages: for
swap-backed pages, the memory is recharged to the parent (aka
reparented), not charged to the "rightful" user.
- The next access will incur higher latency, especially if the pages
are active.

(b) Direct recharge to the parent

This can be done for any page and should be simple, as the pages are
already hierarchically charged to the parent.

Pros:
- Simple.

Cons:
- If a different memcg is using the memory, it will keep taxing the
parent indefinitely. Same "not the rightful user" argument as above.

(c) Direct recharge to the mapper

This can be done for any mapped page by walking the rmap and
identifying the memcg of the process(es) mapping the page (a very
rough sketch of what this could look like follows below).

Pros:
- Memory is recharged to the "rightful" user.

Cons:
- More complicated; the "rightful" user's memcg might run into an OOM
situation, which in this case will be unpredictable and hard to
correlate with an allocation.
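To make (c) a little more concrete, here is a very rough sketch of
what recharging to the mapper could look like. The helpers
find_mapping_memcg() and recharge_folio() do not exist; they are
made-up names standing in for the mapper-selection policy and the
actual charge moving, and all locking/refcounting is omitted:

    #include <linux/memcontrol.h>
    #include <linux/mm.h>

    /* Sketch only: move the charge of a folio owned by an offline
     * memcg to the memcg of a process that currently maps it. */
    static int recharge_to_mapper(struct folio *folio)
    {
            struct mem_cgroup *old = folio_memcg(folio); /* offline memcg */
            struct mem_cgroup *new;

            if (!old || !folio_mapped(folio))
                    return -EINVAL;

            /* Walk the rmap and pick the memcg of one of the processes
             * mapping the folio (first mapper? biggest mapper? policy TBD). */
            new = find_mapping_memcg(folio);        /* hypothetical */
            if (!new || new == old)
                    return 0;

            /* The hard part: this can push "new" over its limit and
             * trigger reclaim/OOM that is hard to correlate with any
             * allocation made by that cgroup. */
            return recharge_folio(folio, old, new); /* hypothetical */
    }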
(d) Deferred recharging

This is a mixture of (b) & (c) above. It is a two-step process. We
first recharge the memory to the parent, which should be simple and
reliable. Then, we mark the pages so that the next time they are
accessed or mapped we recharge them to the "rightful" user.

For mapped pages, we can use the numa balancing approach of
protecting the mapping (while the vma is still accessible), and then
recharging the page in the fault path (a very rough sketch of such a
fault-path hook is in the P.S. below). This is better than eviction
because the fault on the next access is minor, and better than direct
recharging to the mapper in the sense that the charge is correlated
with an allocation/mapping. Of course, it is more complicated: we
have to handle different protection interactions (e.g. what if the
page is already protected?). Another disadvantage is that the
recharging happens in the context of a page fault, rather than
asynchronously as in the case of directly recharging to the mapper.
Page faults are more latency-sensitive, although this shouldn't be a
common path.

For unmapped pages, I am struggling to find a way that is simple
enough to recharge the memory on the next access. My first intuition
was to add a hook to folio_mark_accessed(), but I was quickly told
that this is not invoked in all access paths to unmapped pages (e.g.
writes through fds). We can also add a hook to folio_mark_dirty() to
add more coverage, but this path seems fragile, and it would be ideal
if there were a shared, well-defined common path (or paths) for all
accesses to unmapped pages. I would imagine that if such a path
exists or can be forged, it would probably be in the page cache code
somewhere.

For both cases, if a new mapping is created, we can do recharging
there.

Pros:
- Memory is recharged to the "rightful" user, eventually.
- The charge is predictable and correlates to a user's access.
- Less overhead on next access than eviction.

Cons:
- The memory will remain charged to the parent until the next access
happens, if it ever happens.
- Worse overhead on next access than directly recharging to the
mapper.

With this (incompletely defined) toolkit, a recharging algorithm can
look like this (as a rough example):

- If the page is file-backed:
  - Unmapped? Evict (a).
  - Mapped? Recharge to the mapper -- direct (c) or deferred (d).
- If the page is swap-backed:
  - Unmapped? Deferred recharge to the next accessor (d).
  - Mapped? Recharge to the mapper -- direct (c) or deferred (d).

There are, of course, open questions:

1) How do we do deferred recharging for unmapped pages? Is deferred
recharging even a reliable option to begin with? What if the pages
are never accessed again?

2) How do we avoid hiding kernel bugs (e.g. extraneous references)
with recharging? Ideally, all recharged pages eventually end up
somewhere other than root, such that accumulation of recharged pages
at root signals a kernel bug.

3) What do we do about swapped pages charged to offline memcgs? Even
if we recharge all charged pages, preexisting swap entries will pin
the offline memcg. Do we walk all swap_cgroup entries and reparent
the swap entries?

Again, I was hoping to come up with a more concrete proposal, but as
LSF/MM/BPF is approaching, I wanted to share my thoughts on the
mailing list looking for any feedback.

Thanks!
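P.S. To make the mapped-page half of (d) more concrete, here is a
very rough sketch of what a fault-path recharge hook could look like.
The folio "recharge deferred" flag and recharge_folio() are made up
for illustration (they are not existing kernel interfaces), and the
numa-balancing-style protection/unprotection of the mapping is not
shown at all:

    #include <linux/memcontrol.h>
    #include <linux/mm.h>
    #include <linux/sched.h>

    /* Sketch only: called from the fault path for a folio that was
     * reparented when its memcg was offlined and marked for deferred
     * recharging. The charge follows the faulting task, i.e. the
     * actual user of the memory. */
    static void maybe_recharge_on_fault(struct folio *folio)
    {
            struct mem_cgroup *old = folio_memcg(folio);
            struct mem_cgroup *new;

            /* Only folios we marked during offline reparenting. */
            if (!folio_test_recharge_deferred(folio)) /* hypothetical flag */
                    return;

            new = get_mem_cgroup_from_mm(current->mm);
            if (new && new != old)
                    recharge_folio(folio, old, new);  /* hypothetical */

            mem_cgroup_put(new);
            folio_clear_recharge_deferred(folio);     /* hypothetical */
    }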