Date: Tue, 30 May 2023 16:59:40 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Yosry Ahmed
Cc: Nhat Pham, akpm@linux-foundation.org, cerasuolodomenico@gmail.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, sjenning@redhat.com,
	ddstreet@ieee.org, vitaly.wool@konsulko.com, kernel-team@meta.com
Subject: Re: [PATCH] zswap: do not shrink when memory.zswap.max is 0
Message-ID: <20230530205940.GA102494@cmpxchg.org>
References: <20230530162153.836565-1-nphamcs@gmail.com>
 <20230530180038.GC97194@cmpxchg.org>
 <20230530191336.GB101722@cmpxchg.org>
On Tue, May 30, 2023 at 01:19:12PM -0700, Yosry Ahmed wrote:
> On Tue, May 30, 2023 at 12:13 PM Johannes Weiner wrote:
> >
> > On Tue, May 30, 2023 at 11:41:32AM -0700, Yosry Ahmed wrote:
> > > On Tue, May 30, 2023 at 11:00 AM Johannes Weiner wrote:
> > > >
> > > > On Tue, May 30, 2023 at 09:52:36AM -0700, Yosry Ahmed wrote:
> > > > > On Tue, May 30, 2023 at 9:22 AM Nhat Pham wrote:
> > > > > >
> > > > > > Before storing a page, zswap first checks if the number of stored pages
> > > > > > exceeds the limit specified by memory.zswap.max, for each cgroup in the
> > > > > > hierarchy. If this limit is reached or exceeded, then zswap shrinking is
> > > > > > triggered and short-circuits the store attempt.
> > > > > >
> > > > > > However, if memory.zswap.max = 0 for a cgroup, no amount of writeback
> > > > > > will allow future store attempts from processes in this cgroup to
> > > > > > succeed. Furthermore, this creates pathological behavior in a system
> > > > > > where some cgroups have memory.zswap.max = 0 and some do not: the
> > > > > > processes in the former cgroups, under memory pressure, will evict pages
> > > > > > stored by the latter continually, until the need for swap ceases or the
> > > > > > pool becomes empty.
> > > > > >
> > > > > > As a result of this, we observe a disproportionate amount of zswap
> > > > > > writeback and a perpetually small zswap pool in our experiments, even
> > > > > > though the pool limit is never hit.
> > > > > >
> > > > > > This patch fixes the issue by rejecting zswap store attempts without
> > > > > > shrinking the pool when memory.zswap.max is 0.
> > > > > >
> > > > > > Fixes: f4840ccfca25 ("zswap: memcg accounting")
> > > > > > Signed-off-by: Nhat Pham
> > > > > > ---
> > > > > >  include/linux/memcontrol.h | 6 +++---
> > > > > >  mm/memcontrol.c            | 8 ++++----
> > > > > >  mm/zswap.c                 | 9 +++++++--
> > > > > >  3 files changed, 14 insertions(+), 9 deletions(-)
> > > > > >
> > > > > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > > > > index 222d7370134c..507bed3a28b0 100644
> > > > > > --- a/include/linux/memcontrol.h
> > > > > > +++ b/include/linux/memcontrol.h
> > > > > > @@ -1899,13 +1899,13 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
> > > > > >  #endif /* CONFIG_MEMCG_KMEM */
> > > > > >
> > > > > >  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
> > > > > > -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg);
> > > > > > +int obj_cgroup_may_zswap(struct obj_cgroup *objcg);
> > > > > >  void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size);
> > > > > >  void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size);
> > > > > >  #else
> > > > > > -static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > +static inline int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > >  {
> > > > > > -	return true;
> > > > > > +	return 0;
> > > > > >  }
> > > > > >  static inline void obj_cgroup_charge_zswap(struct obj_cgroup *objcg,
> > > > > >  					   size_t size)
> > > > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > > > index 4b27e245a055..09aad0e6f2ea 100644
> > > > > > --- a/mm/memcontrol.c
> > > > > > +++ b/mm/memcontrol.c
> > > > > > @@ -7783,10 +7783,10 @@ static struct cftype memsw_files[] = {
> > > > > >   * spending cycles on compression when there is already no room left
> > > > > >   * or zswap is disabled altogether somewhere in the hierarchy.
> > > > > >   */
> > > > > > -bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > > +int obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > >  {
> > > > > >  	struct mem_cgroup *memcg, *original_memcg;
> > > > > > -	bool ret = true;
> > > > > > +	int ret = 0;
> > > > > >
> > > > > >  	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
> > > > > > -		return true;
> > > > > > +		return 0;
> > > > > > @@ -7800,7 +7800,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > >  		if (max == PAGE_COUNTER_MAX)
> > > > > >  			continue;
> > > > > >  		if (max == 0) {
> > > > > > -			ret = false;
> > > > > > +			ret = -ENODEV;
> > > > > >  			break;
> > > > > >  		}
> > > > > >
> > > > > > @@ -7808,7 +7808,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> > > > > >  		pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
> > > > > >  		if (pages < max)
> > > > > >  			continue;
> > > > > > -		ret = false;
> > > > > > +		ret = -ENOMEM;
> > > > > >  		break;
> > > > > >  	}
> > > > > >  	mem_cgroup_put(original_memcg);
> > > > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > > > index 59da2a415fbb..7b13dc865438 100644
> > > > > > --- a/mm/zswap.c
> > > > > > +++ b/mm/zswap.c
> > > > > > @@ -1175,8 +1175,13 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> > > > > >  	}
> > > > > >
> > > > > >  	objcg = get_obj_cgroup_from_page(page);
> > > > > > -	if (objcg && !obj_cgroup_may_zswap(objcg))
> > > > > > -		goto shrink;
> > > > > > +	if (objcg) {
> > > > > > +		ret = obj_cgroup_may_zswap(objcg);
> > > > > > +		if (ret == -ENODEV)
> > > > > > +			goto reject;
> > > > > > +		if (ret == -ENOMEM)
> > > > > > +			goto shrink;
> > > > > > +	}
> > > > >
> > > > > I wonder if we should just make this:
> > > > >
> > > > > if (objcg && !obj_cgroup_may_zswap(objcg))
> > > > >         goto reject;
> > > > >
> > > > > Even if memory.zswap.max is > 0, if the limit is hit, shrinking the
> > > > > zswap pool will only help if we happen to write back a page from the
> > > > > same memcg that hit its limit. Keep in mind that we will only write
> > > > > back one page every time we observe that the limit is hit (even
> > > > > with Domenico's patch, because zswap_can_accept() should be true).
> > > > >
> > > > > On a system with a handful of memcgs, it seems likely that we
> > > > > wrongfully write back pages from other memcgs because of this,
> > > > > achieving nothing for this memcg while hurting others. OTOH, without
> > > > > invoking writeback when the limit is hit, the memcg will just not be
> > > > > able to use zswap until some pages are faulted back in or invalidated.
> > > > >
> > > > > I am not sure which is better, just thinking out loud.
> > > >
> > > > You're absolutely right.
> > > >
> > > > Currently the choice is writing back either everybody or nobody,
> > > > meaning between writeback and cgroup containment. They're both so poor
> > > > that I can't say I strongly prefer one over the other.
> > > >
> > > > However, I have a lame argument in favor of this patch:
> > > >
> > > > The last few fixes from Nhat and Domenico around writeback show that
> > > > few people, if anybody, are actually using writeback. So it might not
> > > > actually matter that much in practice which way we go with this patch.
> > > > Per-memcg LRUs will be necessary for it to work right.
> > > >
> > > > However, what Nhat is proposing is how we want the behavior down the
> > > > line. So between two equally poor choices, I figure we might as well
> > > > go with the one that doesn't require another code change later on.
> > > >
> > > > Doesn't that fill you with radiant enthusiasm?
> > >
> > > If we have per-memcg LRUs, and memory.zswap.max == 0, then we should
> > > be in one of two situations:
> > >
> > > (a) memory.zswap.max has always been 0, so the LRU for this memcg is
> > > empty, so we don't really need the special case for memory.zswap.max
> > > == 0.
> > >
> > > (b) memory.zswap.max was reduced to 0 at some point, and some pages
> > > are already in zswap. In this case, I don't think shrinking the memcg
> > > is such a bad idea, we would be lazily enforcing the limit.
> > >
> > > In that sense I am not sure that this change won't require another
> > > code change. It feels like special-casing memory.zswap.max == 0 is
> > > only needed now due to the lack of per-memcg LRUs.
> >
> > Good point. And I agree down the line we should just always send the
> > shrinker off optimistically on the cgroup's LRU list.
> >
> > So I take back my lame argument. But that then still leaves us with
> > the situation that both choices are equal here, right?
> >
> > If so, my vote would be to go with the patch as-is.
>
> I *think* it's better to punish the memcg that exceeded its limit by
> not allowing it to use zswap until its usage goes down, rather than
> punish random memcgs on the machine because one memcg hit its limit.
> It also seems to me that on a system with a handful of memcgs, it is
> statistically more likely for zswap shrinking to write back a page
> from the wrong memcg.

Right, but in either case a hybrid zswap + swap setup with cgroup
isolation is broken anyway. Without it being usable, I'm assuming
there are no users - maybe that's optimistic of me ;)

However, if you think it's better to just be conservative about taking
action in general, that's fine by me as well.

> The code would also be simpler if obj_cgroup_may_zswap() just returns
> a boolean and we do not shrink at all if it returns false. If it no
> longer returns a boolean, we should at least rename it.
>
> Did you try just not shrinking at all if the memcg limit is hit in
> your experiments?
>
> I don't feel strongly, but my preference would be to just not shrink
> at all if obj_cgroup_may_zswap() returns false.

Sounds reasonable to me.

Basically just replace the goto shrink with goto reject for now. Maybe
add a comment that says: "XXX: Writeback/reclaim does not work with
cgroups yet. Needs a cgroup-aware entry LRU first, or we'd push out
entries system-wide based on local cgroup limits."

Nhat, does that sound good to you?
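
For concreteness, an untested sketch of what the store path would then
look like - same names as in the patch above, just with the shrink path
dropped and obj_cgroup_may_zswap() kept boolean:

	objcg = get_obj_cgroup_from_page(page);
	if (objcg && !obj_cgroup_may_zswap(objcg)) {
		/*
		 * XXX: Writeback/reclaim does not work with cgroups yet.
		 * Needs a cgroup-aware entry LRU first, or we'd push out
		 * entries system-wide based on local cgroup limits.
		 */
		goto reject;
	}

That way the -ENODEV/-ENOMEM distinction in obj_cgroup_may_zswap() goes
away again, and both the over-limit case and the zswap-disabled case
simply refuse the store without pushing out anybody else's entries.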