From: Nhat Pham <nphamcs@gmail.com>
Date: Mon, 25 Sep 2023 16:29:10 -0700
Subject: Re: [PATCH v2 2/2] zswap: shrinks zswap pool based on memory pressure
To: Yosry Ahmed
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, cerasuolodomenico@gmail.com, sjenning@redhat.com, ddstreet@ieee.org, vitaly.wool@konsulko.com, mhocko@kernel.org, roman.gushchin@linux.dev, shakeelb@google.com, muchun.song@linux.dev, linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org
References: <20230919171447.2712746-1-nphamcs@gmail.com> <20230919171447.2712746-3-nphamcs@gmail.com>

On Mon, Sep 25, 2023 at 1:38 PM Yosry Ahmed wrote:
>
> On Tue, Sep 19, 2023 at 10:14 AM Nhat Pham wrote:
> >
> > Currently, we only shrink the zswap pool when the user-defined limit is
> > hit. This means that if we set the limit too high, cold data that are
> > unlikely to be used again will reside in the pool, wasting precious
> > memory. It is hard to predict how much zswap space will be needed ahead
> > of time, as this depends on the workload (specifically, on factors such
> > as memory access patterns and compressibility of the memory pages).
> >
> > This patch implements a memcg- and NUMA-aware shrinker for zswap, that
> > is initiated when there is memory pressure. The shrinker does not
> > have any parameter that must be tuned by the user, and can be opted in
> > or out on a per-memcg basis.
>
> What's the use case for having per-memcg opt-in/out?
>
> If there is memory pressure, reclaiming swap-backed pages will push
> pages out of zswap anyway, regardless of this patch. With this patch,
> any sort of reclaim can push pages out of zswap. Wouldn't that be
> preferable to reclaiming memory that is currently resident in memory
> (so arguably hotter than the pages in zswap)? Why would this decision
> be different per-memcg?

I'm not quite following your argument here. The point of doing this on
a per-memcg basis is that different workloads have different memory
access patterns (and, as a result, different memory coldness
distributions).

In a workload with a lot of cold data, we can really benefit from
reclaiming those pages and repurposing the reclaimed memory (e.g. for
the file cache). On the other hand, in a workload without much cold
data, reclaiming its zswapped pages will at best do nothing (wasting
CPU cycles on compression/decompression), and at worst hurt
performance (due to increased IO when we need those written-back pages
again).

Such different workloads can co-exist on the same system, and having a
per-memcg knob allows us to turn on the shrinker only for the
workloads where it makes sense.

> >
> > Furthermore, to make it more robust for many workloads and prevent
> > overshrinking (i.e. evicting warm pages that might be refaulted into
> > memory), we build in the following heuristics:
> >
> > * Estimate the number of warm pages residing in zswap, and attempt to
> >   protect this region of the zswap LRU.
> > * Scale the number of freeable objects by an estimate of the memory
> >   saving factor. The better zswap compresses the data, the fewer pages
> >   we will evict to swap (as we will otherwise incur IO for relatively
> >   small memory saving).
> > * During reclaim, if the shrinker encounters a page that is also being
> >   brought into memory, the shrinker will cautiously terminate its
> >   shrinking action, as this is a sign that it is touching the warmer
> >   region of the zswap LRU.
>
> I don't have an opinion about the reclaim heuristics here, I will let
> reclaim experts chip in.
>
> >
> > On a benchmark that we have run:
>
> Please add more details (as much as possible) about the benchmarks used here.

Sure! I built the kernel in a memory-limited cgroup a couple of times,
then measured the build time. To simulate conditions where there is
cold, unused data, I also generated a bunch of data in tmpfs (and
never touched it again).
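
Roughly, the setup looked like the sketch below. The cgroup name,
memory limit, tmpfs size, and the way the cold data is generated here
are illustrative placeholders, not the exact values behind the numbers
quoted below; the two shrinker knobs are the interface added by this
patch:

  # Illustrative setup only -- names and sizes are placeholders.
  mkdir -p /sys/fs/cgroup/zswap-bench /mnt/cold
  echo 2G > /sys/fs/cgroup/zswap-bench/memory.max
  mount -t tmpfs -o size=4G none /mnt/cold

  # Enable the shrinker globally and for this cgroup.
  echo Y > /sys/module/zswap/parameters/shrinker_enabled
  echo 1 > /sys/fs/cgroup/zswap-bench/memory.zswap.shrinker.enabled

  # Move this shell into the memory-limited cgroup.
  echo $$ > /sys/fs/cgroup/zswap-bench/cgroup.procs

  # Generate cold data in tmpfs that is never touched again.
  dd if=/dev/urandom of=/mnt/cold/cold.bin bs=1M count=2048

  # Build the kernel under memory pressure and time it.
  cd ~/src/linux
  time make -j"$(nproc)"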
> >
> > (without the shrinker)
> > real -- mean: 153.27s, median: 153.199s
> > sys -- mean: 541.652s, median: 541.903s
> > user -- mean: 4384.9673999999995s, median: 4385.471s
> >
> > (with the shrinker)
> > real -- mean: 151.4956s, median: 151.456s
> > sys -- mean: 461.14639999999997s, median: 465.656s
> > user -- mean: 4384.7118s, median: 4384.675s
> >
> > We observed a 14-15% reduction in kernel CPU time, which translated to
> > over 1% reduction in real time.
> >
> > On another benchmark, where there was a lot more cold memory residing in
> > zswap, we observed even more pronounced gains:
> >
> > (without the shrinker)
> > real -- mean: 157.52519999999998s, median: 157.281s
> > sys -- mean: 769.3082s, median: 780.545s
> > user -- mean: 4378.1622s, median: 4378.286s
> >
> > (with the shrinker)
> > real -- mean: 152.9608s, median: 152.845s
> > sys -- mean: 517.4446s, median: 506.749s
> > user -- mean: 4387.694s, median: 4387.935s
> >
> > Here, we saw around 32-35% reduction in kernel CPU time, which
> > translated to 2.8% reduction in real time. These results confirm our
> > hypothesis that the shrinker is more helpful the more cold memory we
> > have.
> >
> > Suggested-by: Johannes Weiner
> > Signed-off-by: Nhat Pham
> > ---
> >  Documentation/admin-guide/mm/zswap.rst |  12 ++
> >  include/linux/memcontrol.h             |   1 +
> >  include/linux/mmzone.h                 |  14 ++
> >  mm/memcontrol.c                        |  33 +++++
> >  mm/swap_state.c                        |  31 ++++-
> >  mm/zswap.c                             | 180 ++++++++++++++++++++++++-
> >  6 files changed, 263 insertions(+), 8 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
> > index 45b98390e938..ae8597a67804 100644
> > --- a/Documentation/admin-guide/mm/zswap.rst
> > +++ b/Documentation/admin-guide/mm/zswap.rst
> > @@ -153,6 +153,18 @@ attribute, e. g.::
> >
> >  Setting this parameter to 100 will disable the hysteresis.
> >
> > +When there is a sizable amount of cold memory residing in the zswap pool, it
> > +can be advantageous to proactively write these cold pages to swap and reclaim
> > +the memory for other use cases. By default, the zswap shrinker is disabled.
> > +User can enable it by first switching on the global knob:
> > +
> > +  echo Y > /sys/module/zswap/parameters/shrinker_enabled
> > +
> > +When the kernel is compiled with CONFIG_MEMCG_KMEM, user needs to further turn
> > +it on for each cgroup that the shrinker should target:
> > +
> > +  echo 1 > /sys/fs/cgroup//memory.zswap.shrinker.enabled
> > +
> >  A debugfs interface is provided for various statistic about pool size, number
> >  of pages stored, same-value filled pages and various counters for the reasons
> >  pages are rejected.
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 05d34b328d9d..f005ea667863 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -219,6 +219,7 @@ struct mem_cgroup {
> >
> >  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
> >         unsigned long zswap_max;
> > +       atomic_t zswap_shrinker_enabled;
> >  #endif
> >
> >         unsigned long soft_limit;
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 4106fbc5b4b3..81f4c5ea3e16 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -637,6 +637,20 @@ struct lruvec {
> >  #ifdef CONFIG_MEMCG
> >         struct pglist_data *pgdat;
> >  #endif
> > +#ifdef CONFIG_ZSWAP
> > +       /*
> > +        * Number of pages in zswap that should be protected from the shrinker.
> > +        * This number is an estimate of the following counts:
> > +        *
> > +        * a) Recent page faults.
> > +        * b) Recent insertion to the zswap LRU. This includes new zswap stores,
> > +        *    as well as recent zswap LRU rotations.
> > +        *
> > +        * These pages are likely to be warm, and might incur IO if they are written
> > +        * to swap.
> > +        */
> > +       unsigned long nr_zswap_protected;
> > +#endif
>
> Would this be better abstracted in a zswap lruvec struct?

There is just one field, so that sounds like overkill to me. But if we
need to store more data (for smarter heuristics), that would be a good
idea. I'll keep this in mind. Thanks for the suggestion, Yosry!

> >  };
> >
> >  /* Isolate unmapped pages */
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 9f84b3f7b469..1a2c97cf396f 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5352,6 +5352,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> >         WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
> >  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
> >         memcg->zswap_max = PAGE_COUNTER_MAX;
> > +       /* Disable the shrinker by default */
> > +       atomic_set(&memcg->zswap_shrinker_enabled, 0);
> >  #endif
> >         page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
> >         if (parent) {
> > @@ -7877,6 +7879,31 @@ static ssize_t zswap_max_write(struct kernfs_open_file *of,
> >         return nbytes;
> >  }
> >
> > +static int zswap_shrinker_enabled_show(struct seq_file *m, void *v)
> > +{
> > +       struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> > +
> > +       seq_printf(m, "%d\n", atomic_read(&memcg->zswap_shrinker_enabled));
> > +       return 0;
> > +}
> > +
> > +static ssize_t zswap_shrinker_enabled_write(struct kernfs_open_file *of,
> > +                               char *buf, size_t nbytes, loff_t off)
> > +{
> > +       struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > +       int zswap_shrinker_enabled;
> > +       ssize_t parse_ret = kstrtoint(strstrip(buf), 0, &zswap_shrinker_enabled);
> > +
> > +       if (parse_ret)
> > +               return parse_ret;
> > +
> > +       if (zswap_shrinker_enabled < 0 || zswap_shrinker_enabled > 1)
> > +               return -ERANGE;
> > +
> > +       atomic_set(&memcg->zswap_shrinker_enabled, zswap_shrinker_enabled);
> > +       return nbytes;
> > +}
> > +
> >  static struct cftype zswap_files[] = {
> >         {
> >                 .name = "zswap.current",
> > @@ -7889,6 +7916,12 @@ static struct cftype zswap_files[] = {
> >                 .seq_show = zswap_max_show,
> >                 .write = zswap_max_write,
> >         },
> > +       {
> > +               .name = "zswap.shrinker.enabled",
> > +               .flags = CFTYPE_NOT_ON_ROOT,
> > +               .seq_show = zswap_shrinker_enabled_show,
> > +               .write = zswap_shrinker_enabled_write,
> > +       },
> >         { }     /* terminate */
> >  };
> >  #endif /* CONFIG_MEMCG_KMEM && CONFIG_ZSWAP */
> > diff --git a/mm/swap_state.c b/mm/swap_state.c
> > index 1c826737aacb..788e36a06c34 100644
> > --- a/mm/swap_state.c
> > +++ b/mm/swap_state.c
> > @@ -618,6 +618,22 @@ static unsigned long swapin_nr_pages(unsigned long offset)
> >         return pages;
> >  }
> >
> > +#ifdef CONFIG_ZSWAP
> > +/*
> > + * Refault is an indication that warmer pages are not resident in memory.
> > + * Increase the size of zswap's protected area.
> > + */
> > +static void inc_nr_protected(struct page *page)
> > +{
> > +       struct lruvec *lruvec = folio_lruvec(page_folio(page));
> > +       unsigned long flags;
> > +
> > +       spin_lock_irqsave(&lruvec->lru_lock, flags);
> > +       lruvec->nr_zswap_protected++;
> > +       spin_unlock_irqrestore(&lruvec->lru_lock, flags);
> > +}
> > +#endif
> > +
>
> A few questions:
> - Why is this function named in such a generic way?

Perhaps inc_nr_zswap_protected would be better? :)

> - Why is this function here instead of in mm/zswap.c?

No particular reason :) It's not being used anywhere else, so I just
put it as a static function here.

> - Why is this protected by the heavily contested lruvec lock instead
> of being an atomic?

nr_zswap_protected can be decayed (see zswap_lru_add), which I don't
think can be implemented with atomics :( It would be much cleaner
indeed. I'm wary of adding new locks, so I just reused this existing
lock. But if the lruvec lock is heavily contended (I'm not aware
of/familiar with this issue), then perhaps a new, dedicated lock would
help?

> >  /**
> >   * swap_cluster_readahead - swap in pages in hope we need them soon
> >   * @entry: swap entry of this memory
> > @@ -686,7 +702,12 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
> >                 lru_add_drain();        /* Push any new pages onto the LRU now */
> >  skip:
> >         /* The page was likely read above, so no need for plugging here */
> > -       return read_swap_cache_async(entry, gfp_mask, vma, addr, NULL);
> > +       page = read_swap_cache_async(entry, gfp_mask, vma, addr, NULL);
> > +#ifdef CONFIG_ZSWAP
> > +       if (page)
> > +               inc_nr_protected(page);
> > +#endif
> > +       return page;
> >  }
> >
> >  int init_swap_address_space(unsigned int type, unsigned long nr_pages)
> > @@ -853,8 +874,12 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
> >                 lru_add_drain();
> >  skip:
> >         /* The page was likely read above, so no need for plugging here */
> > -       return read_swap_cache_async(fentry, gfp_mask, vma, vmf->address,
> > -                                    NULL);
> > +       page = read_swap_cache_async(fentry, gfp_mask, vma, vmf->address, NULL);
> > +#ifdef CONFIG_ZSWAP
> > +       if (page)
> > +               inc_nr_protected(page);
> > +#endif
> > +       return page;
> >  }
> >
> >  /**
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 1a469e5d5197..79cb18eeb8bf 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -145,6 +145,26 @@ module_param_named(exclusive_loads, zswap_exclusive_loads_enabled, bool, 0644);
> >  /* Number of zpools in zswap_pool (empirically determined for scalability) */
> >  #define ZSWAP_NR_ZPOOLS 32
> >
> > +/*
> > + * Global flag to enable/disable memory pressure-based shrinker for all memcgs.
> > + * If CONFIG_MEMCG_KMEM is on, we can further selectively disable
> > + * the shrinker for each memcg.
> > + */
> > +static bool zswap_shrinker_enabled;
> > +module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
> > +#ifdef CONFIG_MEMCG_KMEM
> > +static bool is_shrinker_enabled(struct mem_cgroup *memcg)
> > +{
> > +       return zswap_shrinker_enabled &&
> > +               atomic_read(&memcg->zswap_shrinker_enabled);
> > +}
> > +#else
> > +static bool is_shrinker_enabled(struct mem_cgroup *memcg)
> > +{
> > +       return zswap_shrinker_enabled;
> > +}
> > +#endif
> > +
> >  /*********************************
> >  * data structures
> >  **********************************/
> > @@ -174,6 +194,8 @@ struct zswap_pool {
> >         char tfm_name[CRYPTO_MAX_ALG_NAME];
> >         struct list_lru list_lru;
> >         struct mem_cgroup *next_shrink;
> > +       struct shrinker *shrinker;
> > +       atomic_t nr_stored;
> >  };
> >
> >  /*
> > @@ -273,17 +295,26 @@ static bool zswap_can_accept(void)
> >                         DIV_ROUND_UP(zswap_pool_total_size, PAGE_SIZE);
> >  }
> >
> > +static u64 get_zswap_pool_size(struct zswap_pool *pool)
> > +{
> > +       u64 pool_size = 0;
> > +       int i;
> > +
> > +       for (i = 0; i < ZSWAP_NR_ZPOOLS; i++)
> > +               pool_size += zpool_get_total_size(pool->zpools[i]);
> > +
> > +       return pool_size;
> > +}
> > +
> >  static void zswap_update_total_size(void)
> >  {
> >         struct zswap_pool *pool;
> >         u64 total = 0;
> > -       int i;
> >
> >         rcu_read_lock();
> >
> >         list_for_each_entry_rcu(pool, &zswap_pools, list)
> > -               for (i = 0; i < ZSWAP_NR_ZPOOLS; i++)
> > -                       total += zpool_get_total_size(pool->zpools[i]);
> > +               total += get_zswap_pool_size(pool);
> >
> >         rcu_read_unlock();
> >
> > @@ -318,8 +349,23 @@ static bool zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
> >  {
> >         struct mem_cgroup *memcg = entry->objcg ?
> >                 get_mem_cgroup_from_objcg(entry->objcg) : NULL;
> > +       struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry->nid));
> >         bool added = __list_lru_add(list_lru, &entry->lru, entry->nid, memcg);
> > +       unsigned long flags, lru_size;
> > +
> > +       if (added) {
> > +               lru_size = list_lru_count_one(list_lru, entry->nid, memcg);
> > +               spin_lock_irqsave(&lruvec->lru_lock, flags);
> > +               lruvec->nr_zswap_protected++;
> >
> > +               /*
> > +                * Decay to avoid overflow and adapt to changing workloads.
> > +                * This is based on LRU reclaim cost decaying heuristics.
> > +                */
> > +               if (lruvec->nr_zswap_protected > lru_size / 4)
> > +                       lruvec->nr_zswap_protected /= 2;
> > +               spin_unlock_irqrestore(&lruvec->lru_lock, flags);
> > +       }
> >         mem_cgroup_put(memcg);
> >         return added;
> >  }
> > @@ -420,6 +466,7 @@ static void zswap_free_entry(struct zswap_entry *entry)
> >         else {
> >                 zswap_lru_del(&entry->pool->list_lru, entry);
> >                 zpool_free(zswap_find_zpool(entry), entry->handle);
> > +               atomic_dec(&entry->pool->nr_stored);
> >                 zswap_pool_put(entry->pool);
> >         }
> >         zswap_entry_cache_free(entry);
> > @@ -461,6 +508,98 @@ static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
> >         return entry;
> >  }
> >
> > +/*********************************
> > +* shrinker functions
> > +**********************************/
> > +static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_one *l,
> > +                                      spinlock_t *lock, void *arg);
> > +
> > +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
> > +                                        struct shrink_control *sc)
> > +{
> > +       struct zswap_pool *pool = shrinker->private_data;
> > +       unsigned long shrink_ret, nr_zswap_protected, flags,
> > +               lru_size = list_lru_shrink_count(&pool->list_lru, sc);
> > +       struct lruvec *lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
> > +       bool encountered_page_in_swapcache = false;
> > +
> > +       spin_lock_irqsave(&lruvec->lru_lock, flags);
> > +       nr_zswap_protected = lruvec->nr_zswap_protected;
> > +       spin_unlock_irqrestore(&lruvec->lru_lock, flags);
> > +
> > +       /*
> > +        * Abort if the shrinker is disabled or if we are shrinking into the
> > +        * protected region.
> > +        */
> > +       if (!is_shrinker_enabled(sc->memcg) ||
> > +                       nr_zswap_protected >= lru_size - sc->nr_to_scan) {
> > +               sc->nr_scanned = 0;
> > +               return SHRINK_STOP;
> > +       }
> > +
> > +       shrink_ret = list_lru_shrink_walk(&pool->list_lru, sc, &shrink_memcg_cb,
> > +               &encountered_page_in_swapcache);
> > +
> > +       if (encountered_page_in_swapcache)
> > +               return SHRINK_STOP;
> > +
> > +       return shrink_ret ? shrink_ret : SHRINK_STOP;
> > +}
> > +
> > +static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
> > +                                         struct shrink_control *sc)
> > +{
> > +       struct zswap_pool *pool = shrinker->private_data;
> > +       struct mem_cgroup *memcg = sc->memcg;
> > +       struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
> > +       unsigned long nr_backing, nr_stored, nr_freeable, flags;
> > +
> > +#ifdef CONFIG_MEMCG_KMEM
> > +       cgroup_rstat_flush(memcg->css.cgroup);
> > +       nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> > +       nr_stored = memcg_page_state(memcg, MEMCG_ZSWAPPED);
> > +#else
> > +       /* use pool stats instead of memcg stats */
> > +       nr_backing = get_zswap_pool_size(pool) >> PAGE_SHIFT;
> > +       nr_stored = atomic_read(&pool->nr_stored);
> > +#endif
> > +
> > +       if (!is_shrinker_enabled(memcg) || !nr_stored)
> > +               return 0;
> > +
> > +       nr_freeable = list_lru_shrink_count(&pool->list_lru, sc);
> > +       /*
> > +        * Subtract the lru size by an estimate of the number of pages
> > +        * that should be protected.
> > +        */
> > +       spin_lock_irqsave(&lruvec->lru_lock, flags);
> > +       nr_freeable = nr_freeable > lruvec->nr_zswap_protected ?
> > +               nr_freeable - lruvec->nr_zswap_protected : 0;
> > +       spin_unlock_irqrestore(&lruvec->lru_lock, flags);
> > +
> > +       /*
> > +        * Scale the number of freeable pages by the memory saving factor.
> > +        * This ensures that the better zswap compresses memory, the fewer
> > +        * pages we will evict to swap (as it will otherwise incur IO for
> > +        * relatively small memory saving).
> > +        */
> > +       return mult_frac(nr_freeable, nr_backing, nr_stored);
> > +}
> > +
> > +static void zswap_alloc_shrinker(struct zswap_pool *pool)
> > +{
> > +       pool->shrinker =
> > +               shrinker_alloc(SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE, "mm-zswap");
> > +       if (!pool->shrinker)
> > +               return;
> > +
> > +       pool->shrinker->private_data = pool;
> > +       pool->shrinker->scan_objects = zswap_shrinker_scan;
> > +       pool->shrinker->count_objects = zswap_shrinker_count;
> > +       pool->shrinker->batch = 0;
> > +       pool->shrinker->seeks = DEFAULT_SEEKS;
> > +}
> > +
> >  /*********************************
> >  * per-cpu code
> >  **********************************/
> > @@ -656,11 +795,14 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> >                                        spinlock_t *lock, void *arg)
> >  {
> >         struct zswap_entry *entry = container_of(item, struct zswap_entry, lru);
> > +       bool *encountered_page_in_swapcache = (bool *)arg;
> >         struct mem_cgroup *memcg;
> >         struct zswap_tree *tree;
> > +       struct lruvec *lruvec;
> >         pgoff_t swpoffset;
> >         enum lru_status ret = LRU_REMOVED_RETRY;
> >         int writeback_result;
> > +       unsigned long flags;
> >
> >         /*
> >          * Once the lru lock is dropped, the entry might get freed. The
> > @@ -696,8 +838,24 @@ static enum lru_status shrink_memcg_cb(struct list_head *item, struct list_lru_o
> >                 /* we cannot use zswap_lru_add here, because it increments node's lru count */
> >                 list_lru_putback(&entry->pool->list_lru, item, entry->nid, memcg);
> >                 spin_unlock(lock);
> > -               mem_cgroup_put(memcg);
> >                 ret = LRU_RETRY;
> > +
> > +               /*
> > +                * Encountering a page already in swap cache is a sign that we are shrinking
> > +                * into the warmer region. We should terminate shrinking (if we're in the dynamic
> > +                * shrinker context).
> > +                */
> > +               if (writeback_result == -EEXIST && encountered_page_in_swapcache) {
> > +                       ret = LRU_SKIP;
> > +                       *encountered_page_in_swapcache = true;
> > +               }
> > +               lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(entry->nid));
> > +               spin_lock_irqsave(&lruvec->lru_lock, flags);
> > +               /* Increment the protection area to account for the LRU rotation. */
> > +               lruvec->nr_zswap_protected++;
> > +               spin_unlock_irqrestore(&lruvec->lru_lock, flags);
> > +
> > +               mem_cgroup_put(memcg);
> >                 goto put_unlock;
> >         }
> >
> > @@ -828,6 +986,11 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
> >                                        &pool->node);
> >         if (ret)
> >                 goto error;
> > +
> > +       zswap_alloc_shrinker(pool);
> > +       if (!pool->shrinker)
> > +               goto error;
> > +
> >         pr_debug("using %s compressor\n", pool->tfm_name);
> >
> >         /* being the current pool takes 1 ref; this func expects the
> > @@ -836,12 +999,17 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
> >         kref_init(&pool->kref);
> >         INIT_LIST_HEAD(&pool->list);
> >         INIT_WORK(&pool->shrink_work, shrink_worker);
> > -       list_lru_init_memcg(&pool->list_lru, NULL);
> > +       if (list_lru_init_memcg(&pool->list_lru, pool->shrinker))
> > +               goto lru_fail;
> > +       shrinker_register(pool->shrinker);
> >
> >         zswap_pool_debug("created", pool);
> >
> >         return pool;
> >
> > +lru_fail:
> > +       list_lru_destroy(&pool->list_lru);
> > +       shrinker_free(pool->shrinker);
> >  error:
> >         if (pool->acomp_ctx)
> >                 free_percpu(pool->acomp_ctx);
> > @@ -899,6 +1067,7 @@ static void zswap_pool_destroy(struct zswap_pool *pool)
> >
> >         zswap_pool_debug("destroying", pool);
> >
> > +       shrinker_free(pool->shrinker);
> >         cpuhp_state_remove_instance(CPUHP_MM_ZSWP_POOL_PREPARE, &pool->node);
> >         free_percpu(pool->acomp_ctx);
> >         list_lru_destroy(&pool->list_lru);
> > @@ -1431,6 +1600,7 @@ bool zswap_store(struct folio *folio)
> >         if (entry->length) {
> >                 INIT_LIST_HEAD(&entry->lru);
> >                 zswap_lru_add(&pool->list_lru, entry);
> > +               atomic_inc(&pool->nr_stored);
> >         }
> >         spin_unlock(&tree->lock);
> >
> > --
> > 2.34.1

Thanks for the comments/suggestion, Yosry!