Date: Sat, 6 May 2023 15:49:08 -0700
From: Chris Li <chrisl@kernel.org>
To: Alistair Popple
Cc: "T.J. Mercier", lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
    cgroups@vger.kernel.org, Yosry Ahmed, Tejun Heo, Shakeel Butt,
    Muchun Song, Johannes Weiner, Roman Gushchin, Jason Gunthorpe,
    Kalesh Singh, Yu Zhao
Subject: Re: [LSF/MM/BPF TOPIC] Reducing zombie memcgs
References: <874josz4rd.fsf@nvidia.com> <877ctm518f.fsf@nvidia.com>
In-Reply-To: <877ctm518f.fsf@nvidia.com>

On Fri, May 05, 2023 at 11:53:24PM +1000, Alistair Popple wrote:
>
> >> Unfortunately I won't be attending LSF/MM in person this year but I am
> >
> > Will you be able to join virtually?
>
> I should be able to join afternoon sessions virtually.

Great.

> >> The issue with per-page memcg limits is what to do for shared
> >> mappings. The below suggestion sounds promising because the pins for
> >> shared pages could be charged to the smemcg. However I'm not sure how
> >> it would solve the problem of a process in cgroup A being able to
> >> raise the pin count of cgroup B when pinning a smemcg page, which was
> >> one of the reasons I introduced a new cgroup controller.
> >
> > Now that I think of it, I can see the pin count memcg as a subtype of
> > smemcg.
> >
> > The smemcg can have a limit as well; when it is added to a memcg, any
> > operation that would raise the pin count smemcg charge over the smemcg
> > limit will fail.
>
> I'm not sure that works for the pinned scenario. If a smemcg already has
> pinned pages, adding it to another memcg shouldn't raise the pin count
> of the memcg it's being added to. The pin counts should only be raised
> in memcgs of processes actually requesting the page be pinned. See below
> though, the idea of borrowing seems helpful.
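To make sure we mean the same attribution rule, here is a tiny userspace
model of it (every name in it is made up for illustration; none of this
is existing kernel API). Attaching a smemcg to a second memcg moves no
pin charges; only an explicit pin request raises the counter of the
memcg doing the pinning:

/* Toy model of the pin-attribution rule discussed above. All names
 * (smemcg, smemcg_attach, smemcg_pin, ...) are hypothetical, and a
 * memcg is just an index here; this is not kernel code.
 */
#include <stdio.h>

#define MAX_MEMCG 8

struct smemcg {
	long pin_count[MAX_MEMCG];	/* per <smemcg, memcg> pin counter */
	long pin_limit;			/* optional per-smemcg pin limit */
	long total_pinned;
};

/* Attaching to another memcg is pure bookkeeping: existing pins stay
 * charged to the memcg whose process originally requested them. */
static void smemcg_attach(struct smemcg *s, int memcg)
{
	(void)s;
	(void)memcg;
}

/* Pinning charges only the memcg of the task doing the pin. */
static int smemcg_pin(struct smemcg *s, int pinning_memcg, long npages)
{
	if (s->total_pinned + npages > s->pin_limit)
		return -1;		/* would be -ENOMEM in the kernel */
	s->pin_count[pinning_memcg] += npages;
	s->total_pinned += npages;
	return 0;
}

int main(void)
{
	struct smemcg s = { .pin_limit = 100 };

	smemcg_pin(&s, 0, 10);		/* a task in memcg A pins 10 pages */
	smemcg_attach(&s, 1);		/* smemcg added to memcg B */
	printf("A pinned: %ld, B pinned: %ld\n",
	       s.pin_count[0], s.pin_count[1]);
	return 0;
}

With that rule B's counter stays at zero after the attach, which I
believe is the behavior you are describing.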
I am very interested in letting smemcg support your pin usage case. I
read your patch thread a bit, but I still feel a bit fuzzy about the pin
usage workflow. If you have a more detailed write-up of the usage case,
with a sequence of interactions and the desired outcome, that would help
me understand. Links to the previous threads work too. We can set up
some meetings to discuss it as well.

> So for pinning at least I don't see a per smemcg limit being useful.

That is fine. I see you are not interested in the limit.

> > For the detailed tracking of shared/unshared behavior, the smemcg can
> > model it as a step 2 feature.
> >
> > There are four different kinds of operations that can be performed on
> > a smemcg:
> >
> > 1) allocate/charge memory. The charge will add to the per smemcg
> > charge counter, checked against the per smemcg limit. ENOMEM if it is
> > over the limit.
> >
> > 2) free/uncharge memory. Similar to the above, just subtract from the
> > counter.
> >
> > 3) share/mmap already charged memory. This will not change the smemcg
> > charge count; it will add to a per <smemcg, memcg> borrow counter. It
> > is possible to put a limit on that counter as well, even though I
> > haven't given too much thought to how useful that is. It would limit
> > how much memory can be mapped from the smemcg.
>
> I would like to see the idea of a borrow counter fleshed out some more
> but this sounds like it could work for the pinning scenario.
>
> Pinning could be charged to the per <smemcg, memcg> borrow counter and
> the pin limit would be enforced against that plus the anonymous pins.
>
> Implementation wise we'd need a way to lookup both the smemcg of the
> struct page and the memcg that the pinning task belongs to.

The page->memcg_data points to the pin smemcg. I am hoping the pinning
API or the current memcg can get to the pinning memcg.

> > 4) unshare/unmap already charged memory. That will reduce the per
> > <smemcg, memcg> borrow counter.
>
> Actually this is where things might get a bit tricky for pinning. We'd
> have to reduce the pin charge when a driver calls put_page(). But that
> implies looking up the borrow counter / <smemcg, memcg> pair a driver
> charged the page to.

Is the pinned page shared between different memcgs or just one memcg?
If it is shared, can the put_page() API indicate on behalf of which
memcg it is performed?

> I will have to give this idea some more thought though. Most drivers
> don't store anything other than the struct page pointers, but my series
> added an accounting struct which I think could reference the borrow
> counter.

Ack.

> > Will that work for your pin memory usage?
>
> I think it could help. I will give it some thought.

Ack.
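Here is how I picture those four operations plus the unpin lookup, again
as a userspace sketch with made-up names (-1 stands in for -ENOMEM). The
charge counter tracks allocation against the smemcg limit, each
<smemcg, memcg> pair gets its own borrow counter, and a small
driver-side accounting struct remembers the pair so the put_page() path
can find the right counter to decrement:

/* Userspace sketch of the four smemcg operations. Hypothetical names;
 * not kernel code. A memcg is an index, -1 stands in for -ENOMEM. */
#include <stdio.h>

#define MAX_MEMCG 8

struct smemcg {
	long charge;			/* ops 1) and 2) */
	long limit;
	long borrow[MAX_MEMCG];		/* ops 3) and 4), per <smemcg, memcg> */
};

/* 1) allocate/charge: raise the charge, checked against the limit */
static int smemcg_charge(struct smemcg *s, long n)
{
	if (s->charge + n > s->limit)
		return -1;
	s->charge += n;
	return 0;
}

/* 2) free/uncharge */
static void smemcg_uncharge(struct smemcg *s, long n)
{
	s->charge -= n;
}

/* 3) share/mmap (or pin): charge count untouched, borrow counter up */
static void smemcg_borrow(struct smemcg *s, int memcg, long n)
{
	s->borrow[memcg] += n;
}

/* 4) unshare/unmap (or unpin): borrow counter down */
static void smemcg_unborrow(struct smemcg *s, int memcg, long n)
{
	s->borrow[memcg] -= n;
}

/* What a pinning driver could keep per pin, so the put_page() path can
 * find the <smemcg, memcg> pair the pin was charged to. */
struct pin_account {
	struct smemcg *smemcg;
	int memcg;
	long npages;
};

static void pin_release(struct pin_account *a)
{
	smemcg_unborrow(a->smemcg, a->memcg, a->npages);
}

int main(void)
{
	struct smemcg s = { .limit = 100 };
	struct pin_account a = { .smemcg = &s, .memcg = 0, .npages = 4 };

	smemcg_charge(&s, 4);			/* pages allocated in smemcg */
	smemcg_borrow(&s, a.memcg, a.npages);	/* pinned on behalf of memcg 0 */
	pin_release(&a);			/* unpin finds the pair again */
	smemcg_uncharge(&s, 4);			/* pages freed from the smemcg */
	printf("charge=%ld borrow[0]=%ld\n", s.charge, s.borrow[0]);
	return 0;
}

The pin_account struct is where the accounting struct from your series
could slot in: it is the only place the <smemcg, memcg> pair has to be
stored.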
> >>
> >> > Shared Memory Cgroup Controllers
> >> >
> >> > = Introduction
> >> >
> >> > The current memory cgroup controller does not support shared memory
> >> > objects. For memory that is shared between different processes, it
> >> > is not obvious which process should get charged. Google has an
> >> > internal tmpfs “memcg=” mount option to charge tmpfs data to a
> >> > specific memcg that’s often different from where the charging
> >> > processes run. However, it faces some difficulties when the charged
> >> > memcg exits and becomes a zombie memcg.
> >> > Other approaches include “re-parenting” the memcg charge to the
> >> > parent memcg, which has its own problem: if the charge is huge,
> >> > iteration of the reparenting can be costly.
> >> >
> >> > = Proposed Solution
> >> >
> >> > The proposed solution is to add a new type of memory controller for
> >> > shared memory usage, e.g. tmpfs, hugetlb, file system mmap and
> >> > dma_buf. This shared memory cgroup controller object will have the
> >> > same life cycle as the underlying shared memory.
> >> >
> >> > Processes can not be added to the shared memory cgroup. Instead the
> >> > shared memory cgroup can be added to the memcg using a “smemcg” API
> >> > file, similar to adding a process into the “tasks” API file.
> >> > When a smemcg is added to the memcg, the amount of memory that has
> >> > been shared in the memcg process will be accounted for as part of
> >> > the memcg “memory.current”. The memory.current of the memcg is made
> >> > up of two parts: 1) the processes' anonymous memory and 2) the
> >> > memory shared from the smemcg.
> >> >
> >> > When the memcg “memory.current” is raised to the limit, the kernel
> >> > will actively try to reclaim for the memcg to keep “smemcg memory +
> >> > process anonymous memory” within the limit.
> >>
> >> That means a process in one cgroup could force reclaim of smemcg
> >> memory in use by a process in another cgroup right? I guess that's no
> >> different to the current situation though.
> >>
> >> > Further memory allocation
> >> > within those memcg processes will fail if the limit can not be
> >> > followed. If many reclaim attempts fail to bring the memcg
> >> > “memory.current” within the limit, the process in this memcg will
> >> > get OOM killed.
> >>
> >> How would this work if say a charge for cgroup A to a smemcg in both
> >> cgroup A and B would cause cgroup B to go over its memory limit and
> >> not enough memory could be reclaimed from cgroup B? OOM killing a
> >> process in cgroup B due to a charge from cgroup A doesn't sound like a
> >> good idea.
> >
> > If we separate out the charge counter from the borrow counter, that
> > problem will be solved. When a smemcg is added to memcg A, we can have
> > a policy specifying that the <smemcg, memcg A> borrow counter is added
> > into memcg A's "memory.current".
> >
> > If B did not map that page, that page will not be part of the
> > <smemcg, memcg B> borrow count. B will not be punished.
> >
> > However, if B did map that page, the <smemcg, memcg B> borrow counter
> > needs to increase as well. B will be punished for it.
> >
> > Will that work for your example situation?
>
> I think so, although I have been looking at this more from the point of
> view of pinning. It sounds like we could treat pinning in much the same
> way as mapping though.

Ack.
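As a worked toy example of that separation (made-up names again, not
kernel code): memory.current for a memcg is its anonymous usage plus its
own borrow counters, so B only starts paying once it actually maps the
shared pages:

/* Toy model of the charge/borrow separation above: memcg B's
 * memory.current is unaffected by A's smemcg pages until B maps them.
 * All names hypothetical; units are pages.
 */
#include <stdio.h>

#define MAX_MEMCG 8

struct smemcg {
	long charge;			/* smemcg-wide charge counter */
	long borrow[MAX_MEMCG];		/* per <smemcg, memcg> borrow counter */
};

struct memcg {
	int id;
	long anon;			/* process anonymous memory */
};

/* memory.current = anon + what this memcg borrowed from the smemcg */
static long memory_current(const struct memcg *m, const struct smemcg *s)
{
	return m->anon + s->borrow[m->id];
}

int main(void)
{
	struct smemcg s = { 0 };
	struct memcg a = { .id = 0, .anon = 10 };
	struct memcg b = { .id = 1, .anon = 20 };

	s.charge += 5;			/* A allocates 5 shared pages... */
	s.borrow[a.id] += 5;		/* ...and maps them */
	printf("A=%ld B=%ld\n", memory_current(&a, &s),
	       memory_current(&b, &s));	/* A=15 B=20: B is not punished */

	s.borrow[b.id] += 5;		/* only once B maps them does B pay */
	printf("A=%ld B=%ld\n", memory_current(&a, &s),
	       memory_current(&b, &s));	/* A=15 B=25 */
	return 0;
}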
> >> > = Benefits
> >> >
> >> > The benefits of this solution include:
> >> > * No zombie memcg. The life cycle of the smemcg matches the shared
> >> >   memory file system or dma_buf.
> >>
> >> If we added pinning it could get a bit messier, as it would have to
> >> hang around until the driver unpinned the pages. But I don't think
> >> that's a problem.
> >
> > That is exactly the reason pinned memory can belong to a pin smemcg.
> > You just need to model the driver holding the pin ref count as one of
> > the share/mmap operations.
> >
> > Then the pin smemcg will not go away if there is a pending pin ref
> > count on it.
> >
> > We can have different policy options on a smemcg.
> > For the simple usage that doesn't care about the per memcg borrow
> > counter, it can add the smemcg's charge count to "memory.current".
> >
> > Only the user who cares about the per memcg usage of a smemcg will
> > need to maintain a per <smemcg, memcg> borrow counter, at additional
> > cost.
>
> Right, I think pinning drivers will always have to care about the borrow
> counter so will have to track that.

Ack.

Chris