From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 7FB60C77B73 for ; Tue, 30 May 2023 21:46:39 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id F26B1280002; Tue, 30 May 2023 17:46:38 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id EAE91280001; Tue, 30 May 2023 17:46:38 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D2900280002; Tue, 30 May 2023 17:46:38 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id BA482280001 for ; Tue, 30 May 2023 17:46:38 -0400 (EDT) Received: from smtpin06.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 92E90160367 for ; Tue, 30 May 2023 21:46:38 +0000 (UTC) X-FDA: 80848256076.06.DF8410B Received: from mail-qv1-f48.google.com (mail-qv1-f48.google.com [209.85.219.48]) by imf09.hostedemail.com (Postfix) with ESMTP id C9F6914000D for ; Tue, 30 May 2023 21:46:35 +0000 (UTC) Authentication-Results: imf09.hostedemail.com; dkim=pass header.d=gmail.com header.s=20221208 header.b="Sg/HA5Rd"; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf09.hostedemail.com: domain of nphamcs@gmail.com designates 209.85.219.48 as permitted sender) smtp.mailfrom=nphamcs@gmail.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1685483195; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=RU77GmD/d34Gfv6f4V1oiMSEmCl6avPwtTX7fFsFv1w=; b=WOqlZrv3Emw5yL94OVSaFbThBwXh3+1O+y/Jy/3At9A3uFQimU3dMLO9zFzZmAKSAz1Uuw W6HOl111oKOi33wMTJ1Bn7quO1cHlGHd4LP+WKzde2yVr0s//znLw/wSPNCrLpLNfMgGmg GLu0oVkG4iiTezfWexID9JooO0DIY08= ARC-Authentication-Results: i=1; imf09.hostedemail.com; dkim=pass header.d=gmail.com header.s=20221208 header.b="Sg/HA5Rd"; dmarc=pass (policy=none) header.from=gmail.com; spf=pass (imf09.hostedemail.com: domain of nphamcs@gmail.com designates 209.85.219.48 as permitted sender) smtp.mailfrom=nphamcs@gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1685483195; a=rsa-sha256; cv=none; b=mW8YgpFpThP2Gw9NY30DXDEitbCAh0BXE7w7+kU9yyriknvB53azApfZlp+Abjp8CM6WF2 C5laz5biGQ4V6Vz99QU/9PYfMiAYWc9rZ7slTlN7edethJrbikBrQQHkJc1d7MqQsfOzxK xNpw7ILUhlCZyrfL6LsK/q4q1Jtmz80= Received: by mail-qv1-f48.google.com with SMTP id 6a1803df08f44-62613b2c8b7so35592986d6.1 for ; Tue, 30 May 2023 14:46:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20221208; t=1685483195; x=1688075195; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=RU77GmD/d34Gfv6f4V1oiMSEmCl6avPwtTX7fFsFv1w=; b=Sg/HA5RdBQ+gfjFLFA5KHfGQkz2K3vY24TSFIoteoRFGuS4IPeKgN8cuMgTMdhsOgg //6KimtFEuOxRZW/0r80I8Y1u63Fihzczs/sn27rbUt8FUXYA77fbFjspG2SzGVAT18I +ANndvDw+1yrRmpyYeYWGh4HXcY98yYNwUiThrBLK+0SZ6C7qgvxOyNHfjEtvs66vKHu vzRPMSMqL6qOFOkj4YNQAItThbHZCVOxzWuawW20cS11ykDRnjeWomb1A7C1Q59hJwLv kk4NPDlE3lC9fT9nUKUWXYlBLR0qLQHQRHrjtOMIhFlD2C7Zq6jzSYDsHhxeyVObHiah rMvA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1685483195; x=1688075195; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=RU77GmD/d34Gfv6f4V1oiMSEmCl6avPwtTX7fFsFv1w=; b=HRGndtC87ay2SselpQbCpIdtR0mdtoI+WGVj4AWhcJ4qGiJ1EzufY2qqkN7Au7oBnB Yj+NjYzgN6Ixqh7STqK//72dmQnrziObqfFZ0sOHNGcepz/rjIlOSD+1haXLHqa9FHMG +dAE9v3OcUst6uJRyp/mPUjw7PkZfqjhFvWmX53TXi7MvhJCIPb1QjRXwd8W4heTSgFm Jhlq9QwKEHpLNy+s0mDSmdcC2f0hNhLDbVOqmrRIQL+kcOXlsm1yIa+WGYZtCYOUfWAi zJurk+7XpfCxjNLu3JmMSuQYU64ywXMHNfCs8yYQCr2pI8duNjbGsK7dWjkfxvraMbmh armQ== X-Gm-Message-State: AC+VfDxDyMXCri+8pDJzpygiM+7pDkd0LFKHFs4RbXliPs3VPeJpMpfX OO2+cttTdN1+csziXDWWF+yqdxS27r5yfn76MpI= X-Google-Smtp-Source: ACHHUZ7DfNnlmTLcfiLt/y2c6iQNWB97DdnJqmPXOxQPpvyXEeVt94ZQ3RVncoZmSNLkich3wtkMTBXh1/P8H3sdsxo= X-Received: by 2002:a05:6214:2245:b0:623:7707:5650 with SMTP id c5-20020a056214224500b0062377075650mr4973154qvc.15.1685483194741; Tue, 30 May 2023 14:46:34 -0700 (PDT) MIME-Version: 1.0 References: <20230530162153.836565-1-nphamcs@gmail.com> <20230530180038.GC97194@cmpxchg.org> <20230530191336.GB101722@cmpxchg.org> <20230530205940.GA102494@cmpxchg.org> In-Reply-To: From: Nhat Pham Date: Tue, 30 May 2023 14:46:23 -0700 Message-ID: Subject: Re: [PATCH] zswap: do not shrink when memory.zswap.max is 0 To: Yosry Ahmed Cc: Johannes Weiner , akpm@linux-foundation.org, cerasuolodomenico@gmail.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, sjenning@redhat.com, ddstreet@ieee.org, vitaly.wool@konsulko.com, kernel-team@meta.com Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspam-User: X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: C9F6914000D X-Stat-Signature: wg8nigy4bj8huw8xdp7ri9zn8hod9ywr X-HE-Tag: 1685483195-640500 X-HE-Meta: U2FsdGVkX1998/IyGrQZKciKwZCweJ/mGnwVSViYyor1ca7CXEP0IPJxZ8AQSLxH0CGzcw7tXFmZ1LdZnSTt4J5BcL8pQrZYvvI46zGAzSWcQaGkMDB3nAUwoH3cdyhf7Yzy9+n1rkxyqXeQUcRg7yWuVjgeWQjoeLS0STtoFgTP9M+Ikd8qmgnc13bX3VJY1RJ71jINc/+02LC7bc50zMPKWl1hQLyVrOharGAsRS/HScHXcMgWZ4cSC28VJ91BnsTNHRZGsft/Dk9M5b3UyjGnJGUa+R8tRlC4duaPwy4Sk0HMX60C2/zu97421+a6NCH5qUfIhvoq75PugJDF4Vr3Qe+7hJ++9uDQPqy5lyJ+aJ6OwSe3k+xom5Hp7Z/8W0q2jIwqROstZ+ZB1T7fCYnqK/48jpjfX7vO3pJ+SrgBx3ZMC6FGaSiuza+n6fmuLD7FHi+QIfjItHwDnMtA6sRC49J9+y6GMStj/V6ub/HFb+Jd/FV5/MRKYL3LF6EVBaiPO2ER7ubHz025hf170aCiIU4mzoBVTwLn8dJC7iqg/iPNy3BUicHKB1V1qUzrK75X0UFT0ds+9jhW51eZkV5kHFHviFa8SkMmRY1BdSVwrvj3lsiDkVJfsOIRnB2CYMfhBw9nz5E0TTroEhdFNDW439T01Cfny2ZS/XJtgWpil+aiJ4e5T7ssnwUDXp3kZcO2COeEUszgHiWZROqnjzn9EONEBMNUbxgOEdLqWH1cagNS4iktzl+jogp4miOSsCliW4XgaLHLBr/YkQnhWPOgwRmhBMeddWoO+SY83SdNAqgZcpCPn+v7ZmdYU92tKo9Hnx5tiudGgI/uUmVq3nORh2jPob7DshRa8LAYpspk+De08fD3K8EVSShNDavtS8dgatF8l58aj8K1EKIbsj4W4V9snIlHRs7+GhKyB2QrJBdznHk0W2+Duf/LqBA0r860NB0FGimKiis/vgz 2PqcmJcl HMZslYBdhGu8fsOvDNTCKb1DN9LJvm571VP4m7enYXjgbPfGAV9h5yl5Kg28JiiloS6U/yP/UvSd6Avsf0uH15xSrG/0r7xDpgWig4YqNWzWgKAO0Efm8lWajN5/GdofMfSmdUJ7uZ1VoCGI3S6mgi0lrj5l4Iylv7LGE/18F9u/I7CnpO5q1eyoWIgMtIFv6nRetF43Phw2DLGwonYCJP6LsBz7Suf1zJ2BrC+eLv2aOVZWh7uE5S3Pjq9T3kDDASpokHJKiL3dP9QqV1X6hTA7ecWT7EyfKymxmP+tQ1hJL0hYlapfqsQdU90t6rVtQ81Fb1BoiVdUQDRoq5iAfU8ajL7A9JyQdXuGjfMThqDdGs2CJmfNPnF4eEqVXJ7oOrre5fYEy1zt1fWtbeT9+i5oD7WUW7YAL3sJ75cKg7r29ktGn1g2yfNhWhoCnZaHOe43bFz0CWI9pRSk= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Tue, May 30, 2023 at 2:05=E2=80=AFPM Yosry Ahmed = wrote: > > On Tue, May 30, 2023 at 1:59=E2=80=AFPM Johannes Weiner wrote: > > > > On Tue, May 30, 2023 at 01:19:12PM -0700, Yosry Ahmed wrote: > > > On Tue, May 30, 2023 at 12:13=E2=80=AFPM Johannes Weiner wrote: > > > > > > > > On Tue, May 30, 2023 at 11:41:32AM -0700, Yosry Ahmed wrote: > > > > > On Tue, May 30, 2023 at 11:00=E2=80=AFAM Johannes Weiner wrote: > > > > > > > > > > > > On Tue, May 30, 2023 at 09:52:36AM -0700, Yosry Ahmed wrote: > > > > > > > On Tue, May 30, 2023 at 9:22=E2=80=AFAM Nhat Pham wrote: > > > > > > > > > > > > > > > > Before storing a page, zswap first checks if the number of = stored pages > > > > > > > > exceeds the limit specified by memory.zswap.max, for each c= group in the > > > > > > > > hierarchy. If this limit is reached or exceeded, then zswap= shrinking is > > > > > > > > triggered and short-circuits the store attempt. > > > > > > > > > > > > > > > > However, if memory.zswap.max =3D 0 for a cgroup, no amount = of writeback > > > > > > > > will allow future store attempts from processes in this cgr= oup to > > > > > > > > succeed. Furthermore, this create a pathological behavior i= n a system > > > > > > > > where some cgroups have memory.zswap.max =3D 0 and some do = not: the > > > > > > > > processes in the former cgroups, under memory pressure, wil= l evict pages > > > > > > > > stored by the latter continually, until the need for swap c= eases or the > > > > > > > > pool becomes empty. > > > > > > > > > > > > > > > > As a result of this, we observe a disproportionate amount o= f zswap > > > > > > > > writeback and a perpetually small zswap pool in our experim= ents, even > > > > > > > > though the pool limit is never hit. > > > > > > > > > > > > > > > > This patch fixes the issue by rejecting zswap store attempt= without > > > > > > > > shrinking the pool when memory.zswap.max is 0. > > > > > > > > > > > > > > > > Fixes: f4840ccfca25 ("zswap: memcg accounting") > > > > > > > > Signed-off-by: Nhat Pham > > > > > > > > --- > > > > > > > > include/linux/memcontrol.h | 6 +++--- > > > > > > > > mm/memcontrol.c | 8 ++++---- > > > > > > > > mm/zswap.c | 9 +++++++-- > > > > > > > > 3 files changed, 14 insertions(+), 9 deletions(-) > > > > > > > > > > > > > > > > diff --git a/include/linux/memcontrol.h b/include/linux/mem= control.h > > > > > > > > index 222d7370134c..507bed3a28b0 100644 > > > > > > > > --- a/include/linux/memcontrol.h > > > > > > > > +++ b/include/linux/memcontrol.h > > > > > > > > @@ -1899,13 +1899,13 @@ static inline void count_objcg_even= t(struct obj_cgroup *objcg, > > > > > > > > #endif /* CONFIG_MEMCG_KMEM */ > > > > > > > > > > > > > > > > #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) > > > > > > > > -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg); > > > > > > > > +int obj_cgroup_may_zswap(struct obj_cgroup *objcg); > > > > > > > > void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, siz= e_t size); > > > > > > > > void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, s= ize_t size); > > > > > > > > #else > > > > > > > > -static inline bool obj_cgroup_may_zswap(struct obj_cgroup = *objcg) > > > > > > > > +static inline int obj_cgroup_may_zswap(struct obj_cgroup *= objcg) > > > > > > > > { > > > > > > > > - return true; > > > > > > > > + return 0; > > > > > > > > } > > > > > > > > static inline void obj_cgroup_charge_zswap(struct obj_cgro= up *objcg, > > > > > > > > size_t size) > > > > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > > > > > > > > index 4b27e245a055..09aad0e6f2ea 100644 > > > > > > > > --- a/mm/memcontrol.c > > > > > > > > +++ b/mm/memcontrol.c > > > > > > > > @@ -7783,10 +7783,10 @@ static struct cftype memsw_files[] = =3D { > > > > > > > > * spending cycles on compression when there is already no= room left > > > > > > > > * or zswap is disabled altogether somewhere in the hierar= chy. > > > > > > > > */ > > > > > > > > -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg) > > > > > > > > +int obj_cgroup_may_zswap(struct obj_cgroup *objcg) > > > > > > > > { > > > > > > > > struct mem_cgroup *memcg, *original_memcg; > > > > > > > > - bool ret =3D true; > > > > > > > > + int ret =3D 0; > > > > > > > > > > > > > > > > if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) > > > > > > > > return true; > > > > > > > > @@ -7800,7 +7800,7 @@ bool obj_cgroup_may_zswap(struct obj_= cgroup *objcg) > > > > > > > > if (max =3D=3D PAGE_COUNTER_MAX) > > > > > > > > continue; > > > > > > > > if (max =3D=3D 0) { > > > > > > > > - ret =3D false; > > > > > > > > + ret =3D -ENODEV; > > > > > > > > break; > > > > > > > > } > > > > > > > > > > > > > > > > @@ -7808,7 +7808,7 @@ bool obj_cgroup_may_zswap(struct obj_= cgroup *objcg) > > > > > > > > pages =3D memcg_page_state(memcg, MEMCG_ZSW= AP_B) / PAGE_SIZE; > > > > > > > > if (pages < max) > > > > > > > > continue; > > > > > > > > - ret =3D false; > > > > > > > > + ret =3D -ENOMEM; > > > > > > > > break; > > > > > > > > } > > > > > > > > mem_cgroup_put(original_memcg); > > > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c > > > > > > > > index 59da2a415fbb..7b13dc865438 100644 > > > > > > > > --- a/mm/zswap.c > > > > > > > > +++ b/mm/zswap.c > > > > > > > > @@ -1175,8 +1175,13 @@ static int zswap_frontswap_store(uns= igned type, pgoff_t offset, > > > > > > > > } > > > > > > > > > > > > > > > > objcg =3D get_obj_cgroup_from_page(page); > > > > > > > > - if (objcg && !obj_cgroup_may_zswap(objcg)) > > > > > > > > - goto shrink; > > > > > > > > + if (objcg) { > > > > > > > > + ret =3D obj_cgroup_may_zswap(objcg); > > > > > > > > + if (ret =3D=3D -ENODEV) > > > > > > > > + goto reject; > > > > > > > > + if (ret =3D=3D -ENOMEM) > > > > > > > > + goto shrink; > > > > > > > > + } > > > > > > > > > > > > > > I wonder if we should just make this: > > > > > > > > > > > > > > if (objcg && !obj_cgroup_may_zswap(objcg)) > > > > > > > goto reject; > > > > > > > > > > > > > > Even if memory.zswap.max is > 0, if the limit is hit, shrinki= ng the > > > > > > > zswap pool will only help if we happen to writeback a page fr= om the > > > > > > > same memcg that hit its limit. Keep in mind that we will only > > > > > > > writeback one page every time we observe that the limit is hi= t (even > > > > > > > with Domenico's patch, because zswap_can_accept() should be t= rue). > > > > > > > > > > > > > > On a system with a handful of memcgs, > > > > > > > it seems likely that we wrongfully writeback pages from other= memcgs > > > > > > > because of this. Achieving nothing for this memcg, while hurt= ing > > > > > > > others. OTOH, without invoking writeback when the limit is hi= t, the > > > > > > > memcg will just not be able to use zswap until some pages are > > > > > > > faulted back in or invalidated. > > > > > > > > > > > > > > I am not sure which is better, just thinking out loud. > > > > > > > > > > > > You're absolutely right. > > > > > > > > > > > > Currently the choice is writing back either everybody or nobody= , > > > > > > meaning between writeback and cgroup containment. They're both = so poor > > > > > > that I can't say I strongly prefer one over the other. > > > > > > > > > > > > However, I have a lame argument in favor of this patch: > > > > > > > > > > > > The last few fixes from Nhat and Domenico around writeback show= that > > > > > > few people, if anybody, are actually using writeback. So it mig= ht not > > > > > > actually matter that much in practice which way we go with this= patch. > > > > > > Per-memcg LRUs will be necessary for it to work right. > > > > > > > > > > > > However, what Nhat is proposing is how we want the behavior dow= n the > > > > > > line. So between two equally poor choices, I figure we might as= well > > > > > > go with the one that doesn't require another code change later = on. > > > > > > > > > > > > Doesn't that fill you with radiant enthusiasm? > > > > > > > > > > If we have per-memcg LRUs, and memory.zswap.max =3D=3D 0, then we= should > > > > > be in one of two situations: > > > > > > > > > > (a) memory.zswap.max has always been 0, so the LRU for this memcg= is > > > > > empty, so we don't really need the special case for memory.zswap.= max > > > > > =3D=3D 0. > > > > > > > > > > (b) memory.zswap.max was reduced to 0 at some point, and some pag= es > > > > > are already in zswap. In this case, I don't think shrinking the m= emcg > > > > > is such a bad idea, we would be lazily enforcing the limit. > > > > > > > > > > In that sense I am not sure that this change won't require anothe= r > > > > > code change. It feels like special casing memory.zswap.max =3D=3D= 0 is > > > > > only needed now due to the lack of per-memcg LRUs. > > > > > > > > Good point. And I agree down the line we should just always send th= e > > > > shrinker off optimistically on the cgroup's lru list. > > > > > > > > So I take back my lame argument. But that then still leaves us with > > > > the situation that both choices are equal here, right? > > > > > > > > If so, my vote would be to go with the patch as-is. > > > > > > I *think* it's better to punish the memcg that exceeded its limit by > > > not allowing it to use zswap until its usage goes down, rather than > > > punish random memcgs on the machine because one memcg hit its limit. > > > It also seems to me that on a system with a handful of memcgs, it is > > > statistically more likely for zswap shrinking to writeback a page fro= m > > > the wrong memcg. > > > > Right, but in either case a hybrid zswap + swap setup with cgroup > > isolation is broken anyway. Without it being usable, I'm assuming > > there are no users - maybe that's optimistic of me ;) > > > > However, if you think it's better to just be conservative about taking > > action in general, that's fine by me as well. > > Exactly, I just prefer erroring on the conservative side. > > > > > > The code would also be simpler if obj_cgroup_may_zswap() just returns > > > a boolean and we do not shrink at all if it returns false. If it no > > > longer returns a boolean we should at least rename it. > > > > > > Did you try just not shrinking at all if the memcg limit is hit in > > > your experiments? > > > > > > I don't feel strongly, but my preference would be to just not shrink > > > at all if obj_cgroup_may_zswap() returns false. > > > > Sounds reasonable to me. Basically just replace the goto shrink with > > goto reject for now. Maybe a comment that says "XXX: Writeback/reclaim > > does not work with cgroups yet. Needs a cgroup-aware entry LRU first, > > or we'd push out entries system-wide based on local cgroup limits." > > Yeah, exactly -- if Nhat agrees of course. Sounds good to me! I don't have a strong opinion on this either. I was just trying to make minimal behavioral change to fix this (i.e keep the shrinking behavior where possible, but definitely reject where it does not make sense to shrink). But this works too, and is actually a smaller change code-wise. We can revisit this piece of code when the per-memcg LRU comes in. I'll send a new version with the proposed change (and documentation) shortly. Thanks for the review and suggestion, everyone! > > > > > Nhat, does that sound good to you?