From: Yosry Ahmed
Date: Tue, 30 May 2023 14:04:57 -0700
Subject: Re: [PATCH] zswap: do not shrink when memory.zswap.max is 0
To: Johannes Weiner
Cc: Nhat Pham, akpm@linux-foundation.org, cerasuolodomenico@gmail.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, sjenning@redhat.com, ddstreet@ieee.org, vitaly.wool@konsulko.com, kernel-team@meta.com
In-Reply-To: <20230530205940.GA102494@cmpxchg.org>
References: <20230530162153.836565-1-nphamcs@gmail.com> <20230530180038.GC97194@cmpxchg.org> <20230530191336.GB101722@cmpxchg.org> <20230530205940.GA102494@cmpxchg.org>
On Tue, May 30, 2023 at 1:59 PM Johannes Weiner wrote:
>
> On Tue, May 30, 2023 at 01:19:12PM -0700, Yosry Ahmed wrote:
> > On Tue, May 30, 2023 at 12:13 PM Johannes Weiner wrote:
> > >
> > > On Tue, May 30, 2023 at 11:41:32AM -0700, Yosry Ahmed wrote:
> > > > On Tue, May 30, 2023 at 11:00 AM Johannes Weiner wrote:
> > > > >
> > > > > On Tue, May 30, 2023 at 09:52:36AM -0700, Yosry Ahmed wrote:
> > > > > > On Tue, May 30, 2023 at 9:22 AM Nhat Pham wrote:
> > > > > > >
> > > > > > > Before storing a page, zswap first checks if the number of stored pages
> > > > > > > exceeds the limit specified by memory.zswap.max, for each cgroup in the
> > > > > > > hierarchy. If this limit is reached or exceeded, then zswap shrinking is
> > > > > > > triggered and short-circuits the store attempt.
> > > > > > >
> > > > > > > However, if memory.zswap.max = 0 for a cgroup, no amount of writeback
> > > > > > > will allow future store attempts from processes in this cgroup to
> > > > > > > succeed. Furthermore, this creates a pathological behavior in a system
> > > > > > > where some cgroups have memory.zswap.max = 0 and some do not: the
> > > > > > > processes in the former cgroups, under memory pressure, will evict pages
> > > > > > > stored by the latter continually, until the need for swap ceases or the
> > > > > > > pool becomes empty.
> > > > > > >
> > > > > > > As a result of this, we observe a disproportionate amount of zswap
> > > > > > > writeback and a perpetually small zswap pool in our experiments, even
> > > > > > > though the pool limit is never hit.
> > > > > > >
> > > > > > > This patch fixes the issue by rejecting the zswap store attempt without
> > > > > > > shrinking the pool when memory.zswap.max is 0.
> > > > > > >
> > > > > > > Fixes: f4840ccfca25 ("zswap: memcg accounting")
> > > > > > > Signed-off-by: Nhat Pham
> > > > > > > ---
> > > > > > >  include/linux/memcontrol.h | 6 +++---
> > > > > > >  mm/memcontrol.c            | 8 ++++----
> > > > > > >  mm/zswap.c                 | 9 +++++++--
> > > > > > >  3 files changed, 14 insertions(+), 9 deletions(-)
> > > > > > >
> > > > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > > > > index 222d7370134c..507bed3a28b0 100644
> > > > > > > --- a/include/linux/memcontrol.h
> > > > > > > +++ b/include/linux/memcontrol.h
> > > > > > > @@ -1899,13 +1899,13 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
> > > > > > >  #endif /* CONFIG_MEMCG_KMEM */
> > > > > > >
> > > > > > >  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
> > > > > > > -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg);
> > > > > > > +int obj_cgroup_may_zswap(struct obj_cgroup *objcg);
> > > > > > >  void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size);
> > > > > > >  void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size);
> > > > > > >  #else
> > > > > > > -static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > > +static inline int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > >  {
> > > > > > > -       return true;
> > > > > > > +       return 0;
> > > > > > >  }
> > > > > > >  static inline void obj_cgroup_charge_zswap(struct obj_cgroup *objcg,
> > > > > > >                                             size_t size)
> > > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > > > index 4b27e245a055..09aad0e6f2ea 100644
> > > > > > > --- a/mm/memcontrol.c
> > > > > > > +++ b/mm/memcontrol.c
> > > > > > > @@ -7783,10 +7783,10 @@ static struct cftype memsw_files[] = {
> > > > > > >   * spending cycles on compression when there is already no room left
> > > > > > >   * or zswap is disabled altogether somewhere in the hierarchy.
> > > > > > >   */
> > > > > > > -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > > +int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > >  {
> > > > > > >         struct mem_cgroup *memcg, *original_memcg;
> > > > > > > -       bool ret = true;
> > > > > > > +       int ret = 0;
> > > > > > >
> > > > > > >         if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > > > > > >                 return true;
> > > > > > > @@ -7800,7 +7800,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > >                 if (max == PAGE_COUNTER_MAX)
> > > > > > >                         continue;
> > > > > > >                 if (max == 0) {
> > > > > > > -                       ret = false;
> > > > > > > +                       ret = -ENODEV;
> > > > > > >                         break;
> > > > > > >                 }
> > > > > > >
> > > > > > > @@ -7808,7 +7808,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > >                         pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
> > > > > > >                         if (pages < max)
> > > > > > >                                 continue;
> > > > > > > -                       ret = false;
> > > > > > > +                       ret = -ENOMEM;
> > > > > > >                         break;
> > > > > > >                 }
> > > > > > >         mem_cgroup_put(original_memcg);
> > > > > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > > > > index 59da2a415fbb..7b13dc865438 100644
> > > > > > > --- a/mm/zswap.c
> > > > > > > +++ b/mm/zswap.c
> > > > > > > @@ -1175,8 +1175,13 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> > > > > > >         }
> > > > > > >
> > > > > > >         objcg = get_obj_cgroup_from_page(page);
> > > > > > > -       if (objcg && !obj_cgroup_may_zswap(objcg))
> > > > > > > -               goto shrink;
> > > > > > > +       if (objcg) {
> > > > > > > +               ret = obj_cgroup_may_zswap(objcg);
> > > > > > > +               if (ret == -ENODEV)
> > > > > > > +                       goto reject;
> > > > > > > +               if (ret == -ENOMEM)
> > > > > > > +                       goto shrink;
> > > > > > > +       }
> > > > > >
> > > > > > I wonder if we should just make this:
> > > > > >
> > > > > > if (objcg && !obj_cgroup_may_zswap(objcg))
> > > > > >         goto reject;
> > > > > >
> > > > > > Even if memory.zswap.max is > 0, if the limit is hit, shrinking the
> > > > > > zswap pool will only help if we happen to writeback a page from the
> > > > > > same memcg that hit its limit. Keep in mind that we will only
> > > > > > writeback one page every time we observe that the limit is hit (even
> > > > > > with Domenico's patch, because zswap_can_accept() should be true).
> > > > > >
> > > > > > On a system with a handful of memcgs,
> > > > > > it seems likely that we wrongfully writeback pages from other memcgs
> > > > > > because of this. Achieving nothing for this memcg, while hurting
> > > > > > others. OTOH, without invoking writeback when the limit is hit, the
> > > > > > memcg will just not be able to use zswap until some pages are
> > > > > > faulted back in or invalidated.
> > > > > >
> > > > > > I am not sure which is better, just thinking out loud.
> > > > >
> > > > > You're absolutely right.
> > > > >
> > > > > Currently the choice is writing back either everybody or nobody,
> > > > > meaning between writeback and cgroup containment. They're both so poor
> > > > > that I can't say I strongly prefer one over the other.
> > > > >
> > > > > However, I have a lame argument in favor of this patch:
> > > > >
> > > > > The last few fixes from Nhat and Domenico around writeback show that
> > > > > few people, if anybody, are actually using writeback. So it might not
> > > > > actually matter that much in practice which way we go with this patch.
> > > > > Per-memcg LRUs will be necessary for it to work right.
> > > > >
> > > > > However, what Nhat is proposing is how we want the behavior down the
> > > > > line. So between two equally poor choices, I figure we might as well
> > > > > go with the one that doesn't require another code change later on.
> > > > >
> > > > > Doesn't that fill you with radiant enthusiasm?
> > > >
> > > > If we have per-memcg LRUs, and memory.zswap.max == 0, then we should
> > > > be in one of two situations:
> > > >
> > > > (a) memory.zswap.max has always been 0, so the LRU for this memcg is
> > > > empty, so we don't really need the special case for memory.zswap.max
> > > > == 0.
> > > >
> > > > (b) memory.zswap.max was reduced to 0 at some point, and some pages
> > > > are already in zswap. In this case, I don't think shrinking the memcg
> > > > is such a bad idea, we would be lazily enforcing the limit.
> > > >
> > > > In that sense I am not sure that this change won't require another
> > > > code change. It feels like special casing memory.zswap.max == 0 is
> > > > only needed now due to the lack of per-memcg LRUs.
> > >
> > > Good point. And I agree down the line we should just always send the
> > > shrinker off optimistically on the cgroup's lru list.
> > >
> > > So I take back my lame argument. But that then still leaves us with
> > > the situation that both choices are equal here, right?
> > >
> > > If so, my vote would be to go with the patch as-is.
> >
> > I *think* it's better to punish the memcg that exceeded its limit by
> > not allowing it to use zswap until its usage goes down, rather than
> > punish random memcgs on the machine because one memcg hit its limit.
> > It also seems to me that on a system with a handful of memcgs, it is
> > statistically more likely for zswap shrinking to writeback a page from
> > the wrong memcg.
>
> Right, but in either case a hybrid zswap + swap setup with cgroup
> isolation is broken anyway. Without it being usable, I'm assuming
> there are no users - maybe that's optimistic of me ;)
>
> However, if you think it's better to just be conservative about taking
> action in general, that's fine by me as well.

Exactly, I just prefer erring on the conservative side.

> > The code would also be simpler if obj_cgroup_may_zswap() just returns
> > a boolean and we do not shrink at all if it returns false. If it no
> > longer returns a boolean we should at least rename it.
> >
> > Did you try just not shrinking at all if the memcg limit is hit in
> > your experiments?
> >
> > I don't feel strongly, but my preference would be to just not shrink
> > at all if obj_cgroup_may_zswap() returns false.
>
> Sounds reasonable to me. Basically just replace the goto shrink with
> goto reject for now. Maybe a comment that says "XXX: Writeback/reclaim
> does not work with cgroups yet. Needs a cgroup-aware entry LRU first,
> or we'd push out entries system-wide based on local cgroup limits."

Yeah, exactly -- if Nhat agrees of course.

> Nhat, does that sound good to you?