Date: Thu, 20 Jul 2023 11:35:15 -0400
From: Johannes Weiner
To: Yosry Ahmed
Cc: Andrew Morton, Michal Hocko, Roman Gushchin, Shakeel Butt,
    Muchun Song, "Matthew Wilcox (Oracle)", Tejun Heo, Zefan Li,
    Yu Zhao, Luis Chamberlain, Kees Cook, Iurii Zaikin,
    "T.J. Mercier", Greg Thelen, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, cgroups@vger.kernel.org
Subject: Re: [RFC PATCH 0/8] memory recharging for offline memcgs
Message-ID: <20230720153515.GA1003248@cmpxchg.org>
References: <20230720070825.992023-1-yosryahmed@google.com>
In-Reply-To: <20230720070825.992023-1-yosryahmed@google.com>

On Thu, Jul 20, 2023 at 07:08:17AM +0000, Yosry Ahmed wrote:
> This patch series implements the proposal in LSF/MM/BPF 2023 conference
> for reducing offline/zombie memcgs by memory recharging [1]. The main
> difference is that this series focuses on recharging and does not
> include eviction of any memory charged to offline memcgs.
>
> Two methods of recharging are proposed:
>
> (a) Recharging of mapped folios.
>
> When a memcg is offlined, queue an asynchronous worker that will walk
> the lruvec of the offline memcg and try to recharge any mapped folios to
> the memcg of one of the processes mapping the folio. The main assumption
> is that a process mapping the folio is the "rightful" owner of the
> memory.
>
> Currently, this is only supported for evictable folios, as the
> unevictable lru is imaginary and we cannot iterate the folios on it. A
> separate proposal [2] was made to revive the unevictable lru, which
> would allow recharging of unevictable folios.
>
> (b) Deferred recharging of folios.
>
> For folios that are unmapped, or mapped but we fail to recharge them
> with (a), we rely on deferred recharging. Simply put, any time a folio
> is accessed or dirtied by a userspace process, and that folio is charged
> to an offline memcg, we will try to recharge it to the memcg of the
> process accessing the folio. Again, we assume this process should be the
> "rightful" owner of the memory. This is also done asynchronously to avoid
> slowing down the data access path.

I'm super skeptical of this proposal.

Recharging *might* be the most desirable semantics from a user pov,
but only if it applies consistently to the whole memory footprint.
There is no mention of slab allocations such as inodes, dentries,
network buffers etc. which can be a significant part of a cgroup's
footprint.
These are currently reparented. I don't think doing one thing with
half of the memory, and a totally different thing with the other half
upon cgroup deletion is going to be acceptable semantics.

It appears this also brings back the reliability issue that caused us
to deprecate charge moving. The recharge path has trylocks, LRU
isolation attempts, GFP_ATOMIC allocations. These introduce a variable
error rate into the relocation process, which causes pages that should
belong to the same domain to be scattered around all over the place.
It also means that zombie pinning still exists, but it's now even more
influenced by timing and race conditions, and so less predictable.

There are two issues being conflated here:

a) the problem of zombie cgroups, and

b) who controls resources that outlive the control domain.

For a), reparenting is still the most reasonable proposal. It's
reliable, for one, but it also fixes the problem fully within the
established, user-facing semantics: resources that belong to a cgroup
also hierarchically belong to all ancestral groups; if those resources
outlive the last-level control domain, they continue to belong to the
parents.

This is how it works today, and this is how it continues to work with
reparenting. The only difference is that those resources no longer pin
a dead cgroup, but instead are physically linked to the next online
ancestor. Since dead cgroups have no effective control parameters
anymore, this is semantically equivalent - it's just a more memory
efficient implementation of the same exact thing.

b) is a discussion totally separate from this. We can argue what we
want this behavior to be, but I'd argue strongly that whatever we do
here should apply to all resources managed by the controller equally.

It could also be argued that if you don't want to lose control over a
set of resources, then maybe don't delete their control domain while
they are still alive and in use.
For example, when restarting a workload, and the new instance is
expected to have largely the same workingset, consider reusing the
cgroup instead of making a new one.

For the zombie problem, I think we should merge Muchun's patches
ASAP. They've been proposed several times, they have Roman's reviews
and acks, and they do not change user-facing semantics. There is no
good reason not to merge them.
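
To illustrate the argument for a) above, that reparenting is
semantically equivalent to keeping the dead cgroup around, here is a
minimal toy model of hierarchical charge accounting. It is only a
sketch and not kernel code: the Cgroup class and its charge(),
usage(), offline() and reparent() methods are hypothetical names
invented for this example, and it assumes the root group always stays
online.

```python
# Toy model of hierarchical charge accounting. NOT kernel code:
# every name here is hypothetical and exists only for illustration.

class Cgroup:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.online = True
        self.local_pages = 0  # pages physically linked to this group
        if parent is not None:
            parent.children.append(self)

    def charge(self, pages):
        self.local_pages += pages

    def usage(self):
        # Hierarchical usage: own pages plus everything charged below.
        return self.local_pages + sum(c.usage() for c in self.children)

    def offline(self):
        # The group is deleted by the user, but its pages still pin it.
        self.online = False

    def reparent(self):
        # Link the dead group's pages to the next online ancestor and
        # drop the dead group from the tree (assumes root stays online).
        assert not self.online and not self.children
        assert self.parent is not None
        target = self.parent
        while not target.online and target.parent is not None:
            target = target.parent
        target.local_pages += self.local_pages
        self.local_pages = 0
        self.parent.children.remove(self)
        self.parent = None


root = Cgroup("root")
workload = Cgroup("workload", root)
job = Cgroup("job", workload)

job.charge(100)
workload.charge(20)

job.offline()  # zombie: offline, but still pinned by 100 pages
before = (root.usage(), workload.usage())

job.reparent()  # pages move to "workload"; the zombie is unlinked
after = (root.usage(), workload.usage())

print(before, after)
```

In this model the usage seen by every surviving ancestor is the same
before and after reparenting; the only change is that the dead group
no longer has to be kept in the tree.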