From: Chris Li
Date: Thu, 7 Dec 2023 16:19:25 -0800
Subject: Re: [PATCH v6] zswap: memcontrol: implement zswap writeback disabling
To: Nhat Pham
Cc: akpm@linux-foundation.org, tj@kernel.org, lizefan.x@bytedance.com,
 hannes@cmpxchg.org, cerasuolodomenico@gmail.com, yosryahmed@google.com,
 sjenning@redhat.com, ddstreet@ieee.org, vitaly.wool@konsulko.com,
 mhocko@kernel.org, roman.gushchin@linux.dev, shakeelb@google.com,
 muchun.song@linux.dev, hughd@google.com, corbet@lwn.net,
 konrad.wilk@oracle.com, senozhatsky@chromium.org, rppt@kernel.org,
 linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org,
 linux-doc@vger.kernel.org, david@ixit.cz
References: <20231207192406.3809579-1-nphamcs@gmail.com>
In-Reply-To: <20231207192406.3809579-1-nphamcs@gmail.com>

Hi Nhat,

On Thu, Dec 7, 2023 at 11:24 AM Nhat Pham wrote:
>
> During our experiment with zswap, we sometimes observe swap IOs due to
> occasional zswap store failures and writebacks-to-swap. These swapping
> IOs prevent many users who cannot tolerate swapping from adopting zswap
> to save memory and improve performance where possible.
>
> This patch adds the option to disable this behavior entirely: do not
> writeback to backing swapping device when a zswap store attempt fail,
> and do not write pages in the zswap pool back to the backing swap
> device (both when the pool is full, and when the new zswap shrinker is
> called).
>
> This new behavior can be opted-in/out on a per-cgroup basis via a new
> cgroup file. By default, writebacks to swap device is enabled, which is
> the previous behavior. Initially, writeback is enabled for the root
> cgroup, and a newly created cgroup will inherit the current setting of
> its parent.
>
> Note that this is subtly different from setting memory.swap.max to 0, as
> it still allows for pages to be stored in the zswap pool (which itself
> consumes swap space in its current form).
>
> This patch should be applied on top of the zswap shrinker series:
>
> https://lore.kernel.org/linux-mm/20231130194023.4102148-1-nphamcs@gmail.com/
>
> as it also disables the zswap shrinker, a major source of zswap
> writebacks.

I am wondering about the status of the "memory.swap.tiers" proof-of-concept
patch. Are we still on board to merge these two patches together somehow,
so that "memory.swap.tiers" == "all" and "memory.swap.tiers" == "zswap"
cover the memory.zswap.writeback == 1 and memory.zswap.writeback == 0
cases?

Thanks

Chris

>
> Suggested-by: Johannes Weiner
> Signed-off-by: Nhat Pham
> Reviewed-by: Yosry Ahmed
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 12 ++++++++
>  Documentation/admin-guide/mm/zswap.rst  |  6 ++++
>  include/linux/memcontrol.h              | 12 ++++++++
>  include/linux/zswap.h                   |  6 ++++
>  mm/memcontrol.c                         | 38 +++++++++++++++++++++++++
>  mm/page_io.c                            |  6 ++++
>  mm/shmem.c                              |  3 +-
>  mm/zswap.c                              | 13 +++++++--
>  8 files changed, 92 insertions(+), 4 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 3f85254f3cef..2b4ac43efdc8 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1679,6 +1679,18 @@ PAGE_SIZE multiple when read back.
>         limit, it will refuse to take any more stores before existing
>         entries fault back in or are written out to disk.
>
> +  memory.zswap.writeback
> +       A read-write single value file. The default value is "1". The
> +       initial value of the root cgroup is 1, and when a new cgroup is
> +       created, it inherits the current value of its parent.
> +
> +       When this is set to 0, all swapping attempts to swapping devices
> +       are disabled. This included both zswap writebacks, and swapping due
> +       to zswap store failure.
> +
> +       Note that this is subtly different from setting memory.swap.max to
> +       0, as it still allows for pages to be written to the zswap pool.
> +
>    memory.pressure
>         A read-only nested-keyed file.
>
> diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
> index 62fc244ec702..cfa653130346 100644
> --- a/Documentation/admin-guide/mm/zswap.rst
> +++ b/Documentation/admin-guide/mm/zswap.rst
> @@ -153,6 +153,12 @@ attribute, e. g.::
>
>  Setting this parameter to 100 will disable the hysteresis.
>
> +Some users cannot tolerate the swapping that comes with zswap store failures
> +and zswap writebacks. Swapping can be disabled entirely (without disabling
> +zswap itself) on a cgroup-basis as follows:
> +
> +       echo 0 > /sys/fs/cgroup//memory.zswap.writeback
> +
>  When there is a sizable amount of cold memory residing in the zswap pool, it
>  can be advantageous to proactively write these cold pages to swap and reclaim
>  the memory for other use cases. By default, the zswap shrinker is disabled.
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 43b77363ab8e..5de775e6cdd9 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -219,6 +219,12 @@ struct mem_cgroup {
>
>  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
>         unsigned long zswap_max;
> +
> +       /*
> +        * Prevent pages from this memcg from being written back from zswap to
> +        * swap, and from being swapped out on zswap store failures.
> +        */
> +       bool zswap_writeback;
>  #endif
>
>         unsigned long soft_limit;
> @@ -1941,6 +1947,7 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
>  bool obj_cgroup_may_zswap(struct obj_cgroup *objcg);
>  void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size);
>  void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size);
> +bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg);
>  #else
>  static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
>  {
> @@ -1954,6 +1961,11 @@ static inline void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg,
>                                              size_t size)
>  {
>  }
> +static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> +{
> +       /* if zswap is disabled, do not block pages going to the swapping device */
> +       return true;
> +}
>  #endif
>
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/include/linux/zswap.h b/include/linux/zswap.h
> index 08c240e16a01..a78ceaf3a65e 100644
> --- a/include/linux/zswap.h
> +++ b/include/linux/zswap.h
> @@ -35,6 +35,7 @@ void zswap_swapoff(int type);
>  void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg);
>  void zswap_lruvec_state_init(struct lruvec *lruvec);
>  void zswap_page_swapin(struct page *page);
> +bool is_zswap_enabled(void);
>  #else
>
>  struct zswap_lruvec_state {};
> @@ -55,6 +56,11 @@ static inline void zswap_swapoff(int type) {}
>  static inline void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg) {}
>  static inline void zswap_lruvec_state_init(struct lruvec *lruvec) {}
>  static inline void zswap_page_swapin(struct page *page) {}
> +
> +static inline bool is_zswap_enabled(void)
> +{
> +       return false;
> +}
>  #endif
>
>  #endif /* _LINUX_ZSWAP_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d7bc47316acb..ae8c62c7aa53 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5538,6 +5538,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>         WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
>  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
>         memcg->zswap_max = PAGE_COUNTER_MAX;
> +       WRITE_ONCE(memcg->zswap_writeback,
> +               !parent || READ_ONCE(parent->zswap_writeback));
>  #endif
>         page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
>         if (parent) {
> @@ -8174,6 +8176,12 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
>         rcu_read_unlock();
>  }
>
> +bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> +{
> +       /* if zswap is disabled, do not block pages going to the swapping device */
> +       return !is_zswap_enabled() || !memcg || READ_ONCE(memcg->zswap_writeback);
> +}
> +
>  static u64 zswap_current_read(struct cgroup_subsys_state *css,
>                               struct cftype *cft)
>  {
> @@ -8206,6 +8214,31 @@ static ssize_t zswap_max_write(struct kernfs_open_file *of,
>         return nbytes;
>  }
>
> +static int zswap_writeback_show(struct seq_file *m, void *v)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> +
> +       seq_printf(m, "%d\n", READ_ONCE(memcg->zswap_writeback));
> +       return 0;
> +}
> +
> +static ssize_t zswap_writeback_write(struct kernfs_open_file *of,
> +                               char *buf, size_t nbytes, loff_t off)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +       int zswap_writeback;
> +       ssize_t parse_ret = kstrtoint(strstrip(buf), 0, &zswap_writeback);
> +
> +       if (parse_ret)
> +               return parse_ret;
> +
> +       if (zswap_writeback != 0 && zswap_writeback != 1)
> +               return -EINVAL;
> +
> +       WRITE_ONCE(memcg->zswap_writeback, zswap_writeback);
> +       return nbytes;
> +}
> +
>  static struct cftype zswap_files[] = {
>         {
>                 .name = "zswap.current",
> @@ -8218,6 +8251,11 @@ static struct cftype zswap_files[] = {
>                 .seq_show = zswap_max_show,
>                 .write = zswap_max_write,
>         },
> +       {
> +               .name = "zswap.writeback",
> +               .seq_show = zswap_writeback_show,
> +               .write = zswap_writeback_write,
> +       },
>         { }     /* terminate */
>  };
>  #endif /* CONFIG_MEMCG_KMEM && CONFIG_ZSWAP */
> diff --git a/mm/page_io.c b/mm/page_io.c
> index cb559ae324c6..5e606f1aa2f6 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -201,6 +201,12 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>                 folio_end_writeback(folio);
>                 return 0;
>         }
> +
> +       if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio))) {
> +               folio_mark_dirty(folio);
> +               return AOP_WRITEPAGE_ACTIVATE;
> +       }
> +
>         __swap_writepage(&folio->page, wbc);
>         return 0;
>  }
> diff --git a/mm/shmem.c b/mm/shmem.c
> index c62f904ba1ca..dd084fbafcf1 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1514,8 +1514,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>
>                 mutex_unlock(&shmem_swaplist_mutex);
>                 BUG_ON(folio_mapped(folio));
> -               swap_writepage(&folio->page, wbc);
> -               return 0;
> +               return swap_writepage(&folio->page, wbc);
>         }
>
>         mutex_unlock(&shmem_swaplist_mutex);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index daaa949837f2..7ee54a3d8281 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -153,6 +153,11 @@ static bool zswap_shrinker_enabled = IS_ENABLED(
>                 CONFIG_ZSWAP_SHRINKER_DEFAULT_ON);
>  module_param_named(shrinker_enabled, zswap_shrinker_enabled, bool, 0644);
>
> +bool is_zswap_enabled(void)
> +{
> +       return zswap_enabled;
> +}
> +
>  /*********************************
>  * data structures
>  **********************************/
> @@ -596,7 +601,8 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
>         struct zswap_pool *pool = shrinker->private_data;
>         bool encountered_page_in_swapcache = false;
>
> -       if (!zswap_shrinker_enabled) {
> +       if (!zswap_shrinker_enabled ||
> +                       !mem_cgroup_zswap_writeback_enabled(sc->memcg)) {
>                 sc->nr_scanned = 0;
>                 return SHRINK_STOP;
>         }
> @@ -637,7 +643,7 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
>         struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
>         unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
>
> -       if (!zswap_shrinker_enabled)
> +       if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg))
>                 return 0;
>
>  #ifdef CONFIG_MEMCG_KMEM
> @@ -956,6 +962,9 @@ static int shrink_memcg(struct mem_cgroup *memcg)
>         struct zswap_pool *pool;
>         int nid, shrunk = 0;
>
> +       if (!mem_cgroup_zswap_writeback_enabled(memcg))
> +               return -EINVAL;
> +
>         /*
>          * Skip zombies because their LRUs are reparented and we would be
>          * reclaiming from the parent instead of the dead memcg.
>
> base-commit: cdcab2d34f129f593c0afbb2493bcaf41f4acd61
> --
> 2.34.1
>
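
For anyone following the semantics of the quoted patch, here is a rough
user-space sketch of the three pieces of behavior it describes: a new
cgroup inheriting its parent's setting, the interface file accepting only
0 or 1, and the "is writeback allowed" check. This is not kernel code.
struct cg, cg_init(), cg_writeback_write() and cg_writeback_enabled() are
hypothetical names made up for illustration, strtol() stands in for the
kernel's kstrtoint(), and all parse errors are folded into -EINVAL for
simplicity.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the zswap_writeback field added to the memcg. */
struct cg {
	bool zswap_writeback;
};

/* Like the css_alloc hunk: the root starts at 1, a child copies its parent. */
static void cg_init(struct cg *cg, const struct cg *parent)
{
	assert(cg);
	cg->zswap_writeback = !parent || parent->zswap_writeback;
}

/*
 * Like zswap_writeback_write(): accept only 0 or 1.
 * Assumption: strtol() replaces kstrtoint(), trailing whitespace (such as
 * the newline written by echo) is tolerated, and every failure is -EINVAL.
 */
static int cg_writeback_write(struct cg *cg, const char *buf)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(buf, &end, 0);
	if (errno || end == buf)
		return -EINVAL;
	while (*end == ' ' || *end == '\t' || *end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;
	if (val != 0 && val != 1)
		return -EINVAL;

	cg->zswap_writeback = (val == 1);
	return 0;
}

/* Like mem_cgroup_zswap_writeback_enabled(): never block if zswap is off. */
static bool cg_writeback_enabled(bool zswap_enabled, const struct cg *cg)
{
	return !zswap_enabled || !cg || cg->zswap_writeback;
}
```

Note that in this model, as in the quoted hunk, inheritance happens once
at creation time: flipping the parent afterwards does not change children
that already exist.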