From: Yosry Ahmed
Date: Thu, 2 Nov 2023 13:27:24 -0700
Subject: Re: [RFC PATCH v2] zswap: memcontrol: implement zswap writeback disabling
To: Nhat Pham
Cc: akpm@linux-foundation.org, tj@kernel.org, lizefan.x@bytedance.com,
    hannes@cmpxchg.org, cerasuolodomenico@gmail.com, sjenning@redhat.com,
    ddstreet@ieee.org, vitaly.wool@konsulko.com, mhocko@kernel.org,
    roman.gushchin@linux.dev, shakeelb@google.com, muchun.song@linux.dev,
    hughd@google.com, corbet@lwn.net, konrad.wilk@oracle.com,
    senozhatsky@chromium.org, rppt@kernel.org, linux-mm@kvack.org,
    kernel-team@meta.com, linux-kernel@vger.kernel.org,
    linux-doc@vger.kernel.org, david@ixit.cz
In-Reply-To: <20231102200202.920461-1-nphamcs@gmail.com>
References: <20231102200202.920461-1-nphamcs@gmail.com>

On Thu, Nov 2, 2023 at 1:02 PM Nhat Pham wrote:
>
> During our experiment with zswap, we sometimes observe swap IOs due to
> occasional zswap store failures and writebacks-to-swap. These swapping
> IOs prevent many users who cannot tolerate swapping from adopting zswap
> to save memory and improve performance where possible.
>
> This patch adds the option to disable this behavior entirely: do not
> writeback to backing swapping device when a zswap store attempt fail,
> and do not write pages in the zswap pool back to the backing swap
> device (both when the pool is full, and when the new zswap shrinker is
> called).
>
> This new behavior can be opted-in/out on a per-cgroup basis via a new
> cgroup file. By default, writebacks to swap device is enabled, which is
> the previous behavior.
>
> Note that this is subtly different from setting memory.swap.max to 0, as
> it still allows for pages to be stored in the zswap pool (which itself
> consumes swap space in its current form).
>
> Suggested-by: Johannes Weiner
> Signed-off-by: Nhat Pham
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 11 +++++++
>  Documentation/admin-guide/mm/zswap.rst  |  6 ++++
>  include/linux/memcontrol.h              | 17 +++++++++++
>  mm/memcontrol.c                         | 38 +++++++++++++++++++++++++
>  mm/page_io.c                            |  6 ++++
>  mm/shmem.c                              |  3 +-
>  mm/zswap.c                              |  9 ++++++
>  7 files changed, 88 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 606b2e0eac4b..18c4171392ea 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1672,6 +1672,17 @@ PAGE_SIZE multiple when read back.
>         limit, it will refuse to take any more stores before existing
>         entries fault back in or are written out to disk.
>
> +  memory.zswap.writeback
> +       A read-write single value file which exists on non-root
> +       cgroups. The default value is "1".
> +
> +       When this is set to 0, all swapping attempts to swapping devices
> +       are disabled. This included both zswap writebacks, and swapping due
> +       to zswap store failure.
> +
> +       Note that this is subtly different from setting memory.swap.max to
> +       0, as it still allows for pages to be written to the zswap pool.
> +
>    memory.pressure
>         A read-only nested-keyed file.
>
> diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
> index 522ae22ccb84..b987e58edb70 100644
> --- a/Documentation/admin-guide/mm/zswap.rst
> +++ b/Documentation/admin-guide/mm/zswap.rst
> @@ -153,6 +153,12 @@ attribute, e. g.::
>
>  Setting this parameter to 100 will disable the hysteresis.
>
> +Some users cannot tolerate the swapping that comes with zswap store failures
> +and zswap writebacks. Swapping can be disabled entirely (without disabling
> +zswap itself) on a cgroup-basis as follows:
> +
> +       echo 0 > /sys/fs/cgroup//memory.zswap.writeback
> +
>  When there is a sizable amount of cold memory residing in the zswap pool, it
>  can be advantageous to proactively write these cold pages to swap and reclaim
>  the memory for other use cases. By default, the zswap shrinker is disabled.
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 95f6c9e60ed1..e3a3a06727dc 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -219,6 +219,12 @@ struct mem_cgroup {
>
>  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
>         unsigned long zswap_max;
> +
> +       /*
> +        * Prevent pages from this memcg from being written back from zswap to
> +        * swap, and from being swapped out on zswap store failures.
> +        */
> +       bool zswap_writeback;
>  #endif
>
>         unsigned long soft_limit;
> @@ -1615,6 +1621,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
>  {
>         return 0;
>  }
> +
> +static inline bool mem_cgroup_swap_disk_enabled(struct mem_cgroup *memcg)
> +{
> +       return false;
> +}
> +

This seems to be a leftover from a prior version.

>  #endif /* CONFIG_MEMCG */
>
>  static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
> @@ -1931,6 +1943,7 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
>  bool obj_cgroup_may_zswap(struct obj_cgroup *objcg);
>  void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size);
>  void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size);
> +bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg);
>  #else
>  static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
>  {
> @@ -1944,6 +1957,10 @@ static inline void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg,
>                                              size_t size)
>  {
>  }
> +static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> +{
> +       return false;
> +}
>  #endif
>
>  #endif /* _LINUX_MEMCONTROL_H */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e43b5aba8efc..b68c613c23a9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5545,6 +5545,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
>         WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
>  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
>         memcg->zswap_max = PAGE_COUNTER_MAX;
> +       WRITE_ONCE(memcg->zswap_writeback, true);
>  #endif
>         page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
>         if (parent) {
> @@ -8177,6 +8178,12 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
>         rcu_read_unlock();
>  }
>
> +bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> +{
> +       return cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg
> +               && READ_ONCE(memcg->zswap_writeback);
> +}
> +
>  static u64 zswap_current_read(struct cgroup_subsys_state *css,
>                               struct cftype *cft)
>  {
> @@ -8209,6 +8216,31 @@ static ssize_t zswap_max_write(struct kernfs_open_file *of,
>         return nbytes;
>  }
>
> +static int zswap_writeback_show(struct seq_file *m, void *v)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> +
> +       seq_printf(m, "%d\n", READ_ONCE(memcg->zswap_writeback));
> +       return 0;
> +}
> +
> +static ssize_t zswap_writeback_write(struct kernfs_open_file *of,
> +                               char *buf, size_t nbytes, loff_t off)
> +{
> +       struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> +       int zswap_writeback;
> +       ssize_t parse_ret = kstrtoint(strstrip(buf), 0, &zswap_writeback);
> +
> +       if (parse_ret)
> +               return parse_ret;
> +
> +       if (zswap_writeback != 0 && zswap_writeback != 1)
> +               return -EINVAL;
> +
> +       WRITE_ONCE(memcg->zswap_writeback, zswap_writeback);
> +       return nbytes;
> +}
> +
>  static struct cftype zswap_files[] = {
>         {
>                 .name = "zswap.current",
> @@ -8221,6 +8253,12 @@ static struct cftype zswap_files[] = {
>                 .seq_show = zswap_max_show,
>                 .write = zswap_max_write,
>         },
> +       {
> +               .name = "zswap.writeback",
> +               .flags = CFTYPE_NOT_ON_ROOT,
> +               .seq_show = zswap_writeback_show,
> +               .write = zswap_writeback_write,
> +       },
>         { } /* terminate */
>  };
>  #endif /* CONFIG_MEMCG_KMEM && CONFIG_ZSWAP */
> diff --git a/mm/page_io.c b/mm/page_io.c
> index cb559ae324c6..5e606f1aa2f6 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -201,6 +201,12 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
>                 folio_end_writeback(folio);
>                 return 0;
>         }
> +
> +       if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio))) {
> +               folio_mark_dirty(folio);
> +               return AOP_WRITEPAGE_ACTIVATE;
> +       }
> +

I am not a fan of this, because it will disable using disk swap if
"zswap_writeback" is disabled, even if zswap is disabled or the page was
never in zswap. The term zswap_writeback makes no sense here tbh.

I am still hoping someone else will suggest better semantics, because
honestly I can't think of anything. Perhaps something like
memory.swap.zswap_only or memory.swap.types which accepts a string
(e.g. "zswap"/"all",..). Don't take my suggestions strongly because I am
not very fond of them.
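
To make the concern above concrete, here is a minimal user-space sketch of
the gating in the quoted swap_writepage() hunk. It is not kernel code: the
model_* names are hypothetical, the cgroup_subsys_on_dfl() test is assumed
to be true, and it assumes the zswap store attempt runs before the new
check, as the hunk context suggests. The point it illustrates is that the
check sees only the knob, so "zswap disabled" and "zswap store failed" are
treated the same way.

```c
/* Hypothetical user-space model of the quoted gating; NOT kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the one field of struct mem_cgroup that matters here. */
struct model_memcg {
	bool zswap_writeback;	/* the proposed memory.zswap.writeback knob */
};

enum model_outcome {
	MODEL_STORED_IN_ZSWAP,	/* zswap accepted the page */
	MODEL_WRITTEN_TO_DISK,	/* page went to the backing swap device */
	MODEL_KEPT_IN_MEMORY,	/* swapout refused, page stays resident */
};

/*
 * Mirrors the quoted helper, minus the cgroup_subsys_on_dfl() test
 * (assumed true in this model). A NULL memcg counts as "disabled".
 */
static bool model_writeback_enabled(const struct model_memcg *memcg)
{
	return memcg != NULL && memcg->zswap_writeback;
}

/*
 * Order of checks assumed from the hunk: try zswap first, then consult
 * the knob, then fall through to disk. zswap_store_ok is false both when
 * zswap is disabled and when a store attempt fails; the knob check
 * cannot tell the two apart.
 */
static enum model_outcome model_swap_writepage(const struct model_memcg *memcg,
					       bool zswap_store_ok)
{
	if (zswap_store_ok)
		return MODEL_STORED_IN_ZSWAP;
	if (!model_writeback_enabled(memcg))
		return MODEL_KEPT_IN_MEMORY;
	return MODEL_WRITTEN_TO_DISK;
}

/* Walks through the cases, including the one the review objects to. */
void model_self_check(void)
{
	struct model_memcg on = { .zswap_writeback = true };
	struct model_memcg off = { .zswap_writeback = false };

	assert(model_swap_writepage(&off, true) == MODEL_STORED_IN_ZSWAP);
	assert(model_swap_writepage(&on, false) == MODEL_WRITTEN_TO_DISK);
	/* zswap was never involved, yet disk swap is still refused. */
	assert(model_swap_writepage(&off, false) == MODEL_KEPT_IN_MEMORY);
}
```

Calling model_self_check() from any small test program exercises the three
cases; the last assertion is the "page was never in zswap" scenario.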

Can anyone else come back with better naming/semantics for "use zswap
but nothing else when swapping"?

>         __swap_writepage(&folio->page, wbc);
>         return 0;
>  }
> diff --git a/mm/shmem.c b/mm/shmem.c
> index cab053831fea..e5044678de8b 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1514,8 +1514,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
>
>                 mutex_unlock(&shmem_swaplist_mutex);
>                 BUG_ON(folio_mapped(folio));
> -               swap_writepage(&folio->page, wbc);
> -               return 0;
> +               return swap_writepage(&folio->page, wbc);
>         }
>
>         mutex_unlock(&shmem_swaplist_mutex);
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 260e01180ee0..42a478d1a21f 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -590,6 +590,9 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
>         struct zswap_pool *pool = shrinker->private_data;
>         bool encountered_page_in_swapcache = false;
>
> +       if (!mem_cgroup_zswap_writeback_enabled(sc->memcg))
> +               return SHRINK_STOP;
> +
>         nr_protected =
>                 atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
>         lru_size = list_lru_shrink_count(&pool->list_lru, sc);
> @@ -620,6 +623,9 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
>         struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
>         unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
>
> +       if (!mem_cgroup_zswap_writeback_enabled(memcg))
> +               return 0;
> +
>  #ifdef CONFIG_MEMCG_KMEM
>         cgroup_rstat_flush(memcg->css.cgroup);
>         nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> @@ -935,6 +941,9 @@ static int shrink_memcg(struct mem_cgroup *memcg)
>         struct zswap_pool *pool;
>         int nid, shrunk = 0;
>
> +       if (!mem_cgroup_zswap_writeback_enabled(memcg))
> +               return -EINVAL;
> +
>         /*
>          * Skip zombies because their LRUs are reparented and we would be
>          * reclaiming from the parent instead of the dead memcg.
> --
> 2.34.1