From: Kirill Tkhai <tkhai@yandex.ru>
To: Dave Chinner, Johannes Weiner
Cc: Yang Shi <shy828301@gmail.com>, guro@fb.com, ktkhai@virtuozzo.com, shakeelb@google.com, mhocko@suse.com, akpm@linux-foundation.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [v2 PATCH 2/9] mm: memcontrol: use shrinker_rwsem to protect shrinker_maps allocation
Date: Wed, 16 Dec 2020 16:17:11 +0300
Message-Id: <982431608123901@mail.yandex.ru>
In-Reply-To: <20201215215938.GQ3913616@dread.disaster.area>
References: <20201214223722.232537-1-shy828301@gmail.com> <20201214223722.232537-3-shy828301@gmail.com> <20201215020957.GK3913616@dread.disaster.area> <20201215135348.GC379720@cmpxchg.org> <20201215215938.GQ3913616@dread.disaster.area>
16.12.2020, 00:59, "Dave Chinner" <david@fromorbit.com>:
> On Tue, Dec 15, 2020 at 02:53:48PM +0100, Johannes Weiner wrote:
>> On Tue, Dec 15, 2020 at 01:09:57PM +1100, Dave Chinner wrote:
>> > On Mon, Dec 14, 2020 at 02:37:15PM -0800, Yang Shi wrote:
>> > > Since memcg_shrinker_map_size can only be changed while holding
>> > > shrinker_rwsem exclusively, the read side can be protected by
>> > > holding a read lock, so it seems superfluous to have a dedicated
>> > > mutex.
>> >
>> > I'm not sure this is a good idea. This couples the shrinker
>> > infrastructure to internal details of how cgroups are initialised
>> > and managed. Sure, certain operations might be done in certain
>> > shrinker lock contexts, but that doesn't mean we should share global
>> > locks across otherwise independent subsystems....
>>
>> They're not independent subsystems. Most of the memory controller is
>> an extension of core VM operations that is fairly difficult to
>> understand outside the context of those operations. Then there are a
>> limited number of entry points from the cgroup interface. We used to
>> have our own locks for core VM structures (e.g. the private page
>> lock) to coordinate VM and cgroup, and that was mostly
>> unintelligible.
>
> Yes, but OTOH you can build with CONFIG_MEMCG=n and the shrinker
> infrastructure and shrinkers all still function correctly. Ergo, the
> shrinker infrastructure is independent of memcgs. Yes, it may have
> functions to iterate and manipulate memcgs, but it is not dependent
> on memcgs existing for correct behaviour and functionality.
>
> Yet.
>
>> We have since established that those two components coordinate with
>> native VM locking and lifetime management. If you need to lock the
>> page, you lock the page - instead of having all VM paths that already
>> hold the page lock acquire a nested lock to exclude one cgroup path.
>>
>> In this case, we have auxiliary shrinker data, subject to shrinker
>> lifetime and exclusion rules. It's much easier to understand that
>> cgroup creation needs a stable shrinker list (shrinker_rwsem) to
>> manage this data, than having an aliased lock that is private to the
>> memcg callbacks and obscures this real interdependency.
>
> Ok, so the way to do this is to move all the stuff that needs to be
> done under a "subsystem global" lock into the one file, not turn a
> static lock into a globally visible lock and spray it around random
> source files. There are already way too many static globals managing
> separate shrinker and memcg state.
>
> I certainly agree that shrinkers and memcg need to be more closely
> integrated. I've only been saying that for ... well, since memcgs
> essentially duplicated the top-level shrinker path so the shrinker
> map could be introduced to avoid calling shrinkers that have no work
> to do for memcgs. The shrinker map should be generic functionality
> for all shrinker invocations, because even a non-memcg machine can
> have thousands of registered shrinkers that are mostly idle all the
> time.
>
> IOWs, I think the shrinker map management is not really memcg
> specific - it's just allocation and assignment of a structure, and
> the only memcg bit is that the map is being stored in a memcg
> structure. Therefore, if we are looking towards tighter integration,
> then we should actually move the map management to the shrinker code,
> not split the shrinker infrastructure management across different
> files.
> There's already a heap of code in vmscan.c under #ifdef
> CONFIG_MEMCG, like the prealloc_shrinker() code path:
>
> prealloc_shrinker()                       vmscan.c
>   if (MEMCG_AWARE)                        vmscan.c
>     prealloc_memcg_shrinker()             vmscan.c
> #ifdef CONFIG_MEMCG                       vmscan.c
>       down_write(shrinker_rwsem)          vmscan.c
>       if (id > shrinker_id_max)           vmscan.c
>         memcg_expand_shrinker_maps()      memcontrol.c
>           for_each_memcg                  memcontrol.c
>             reallocate shrinker map       memcontrol.c
>             replace shrinker map          memcontrol.c
>         shrinker_id_max = id              vmscan.c
>       up_write(shrinker_rwsem)            vmscan.c
> #endif
>
> And, really, there's very little code in memcg_expand_shrinker_maps()
> here - the only memcg part is the memcg iteration loop, and we
> already have such loops in vmscan.c (e.g. shrink_node_memcgs(),
> age_active_anon(), drop_slab_node()), so there's precedent for
> moving all of this memcg iteration for shrinker map management into
> vmscan.c.
>
> Doing so would formalise the shrinker maps as first-class shrinker
> infrastructure rather than being tacked onto the side of the memcg
> infrastructure. At that point it makes total sense to serialise map
> manipulations under the shrinker_rwsem.
>
> IOWs, I'm not disagreeing with the direction this patch takes us in,
> I'm disagreeing with the implementation as published in the patch,
> because it doesn't move us closer to a clean, concise single
> shrinker infrastructure implementation.
>
> That is, for the medium term, I think we should be getting rid of
> the "legacy" non-memcg shrinker path so that everything runs under
> memcgs.
> With this patchset moving all the deferred counts to be
> memcg aware, the only reason for keeping the non-memcg path around
> goes away. If sc->memcg is null, then after this patch set we can
> simply use the root memcg and just use its per-node accounting
> rather than having a separate construct for non-memcg aware per-node
> accounting.

Killing the "sc->memcg == NULL" cases looks like a great idea. This is
equivalent to making "memory_cgrp_subsys.early_init = 1" possible, with
all the requirements that implies, which is a topic for a big separate
patchset.

> Hence if SHRINKER_MEMCG_AWARE is set, it simply means we should run
> the shrinker if sc->memcg is set. There is no difference in setup
> of shrinkers, the duplicate non-memcg/memcg paths go away, and a
> heap of code drops out of the shrinker infrastructure. It becomes
> much simpler overall.
>
> It also means we have a path for further integrating memcg-aware
> shrinkers into the shrinker infrastructure, because we can always
> rely on the shrinker infrastructure being memcg aware. And with that
> in mind, I think we should probably also be moving the shrinker code
> out of vmscan.c into its own file, as it's really completely
> separate infrastructure from the vast majority of the page reclaim
> infrastructure in vmscan.c...
>
> That's the view I'm looking at this patchset from. Not just as a
> standalone bug fix, but also from the perspective of what the
> architectural change implies and the directions for tighter
> integration it opens up for us.
>
> Cheers,
>
> Dave.
>
> --
> Dave Chinner
> david@fromorbit.com