linux-mm.kvack.org archive mirror
From: Nhat Pham <nphamcs@gmail.com>
To: Yosry Ahmed <yosryahmed@google.com>
Cc: akpm@linux-foundation.org, tj@kernel.org,
	lizefan.x@bytedance.com,  hannes@cmpxchg.org,
	cerasuolodomenico@gmail.com, sjenning@redhat.com,
	 ddstreet@ieee.org, vitaly.wool@konsulko.com, mhocko@kernel.org,
	 roman.gushchin@linux.dev, shakeelb@google.com,
	muchun.song@linux.dev,  hughd@google.com, corbet@lwn.net,
	konrad.wilk@oracle.com,  senozhatsky@chromium.org,
	rppt@kernel.org, linux-mm@kvack.org,  kernel-team@meta.com,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	 david@ixit.cz
Subject: Re: [RFC PATCH v2] zswap: memcontrol: implement zswap writeback disabling
Date: Thu, 2 Nov 2023 13:44:43 -0700
Message-ID: <CAKEwX=MV-Rpq9K9+ddZES_kE_5WM6ri-wk1QRQ4Jv5SFMqhXgw@mail.gmail.com>
In-Reply-To: <CAJD7tkY8iPBo99+1gdsSRMNDu4jkVKz8rb=W+xk9=GE0y=kSuw@mail.gmail.com>

On Thu, Nov 2, 2023 at 1:28 PM Yosry Ahmed <yosryahmed@google.com> wrote:
>
> On Thu, Nov 2, 2023 at 1:02 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > During our experiment with zswap, we sometimes observe swap IOs due to
> > occasional zswap store failures and writebacks-to-swap. These swapping
> > IOs prevent many users who cannot tolerate swapping from adopting zswap
> > to save memory and improve performance where possible.
> >
> > This patch adds the option to disable this behavior entirely: do not
> > write back to the backing swap device when a zswap store attempt fails,
> > and do not write pages in the zswap pool back to the backing swap
> > device (both when the pool is full, and when the new zswap shrinker is
> > called).
> >
> > This new behavior can be opted-in/out on a per-cgroup basis via a new
> > cgroup file. By default, writeback to the swap device is enabled, which is
> > the previous behavior.
> >
> > Note that this is subtly different from setting memory.swap.max to 0, as
> > it still allows for pages to be stored in the zswap pool (which itself
> > consumes swap space in its current form).
> >
> > Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Nhat Pham <nphamcs@gmail.com>
> > ---
> >  Documentation/admin-guide/cgroup-v2.rst | 11 +++++++
> >  Documentation/admin-guide/mm/zswap.rst  |  6 ++++
> >  include/linux/memcontrol.h              | 17 +++++++++++
> >  mm/memcontrol.c                         | 38 +++++++++++++++++++++++++
> >  mm/page_io.c                            |  6 ++++
> >  mm/shmem.c                              |  3 +-
> >  mm/zswap.c                              |  9 ++++++
> >  7 files changed, 88 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> > index 606b2e0eac4b..18c4171392ea 100644
> > --- a/Documentation/admin-guide/cgroup-v2.rst
> > +++ b/Documentation/admin-guide/cgroup-v2.rst
> > @@ -1672,6 +1672,17 @@ PAGE_SIZE multiple when read back.
> >         limit, it will refuse to take any more stores before existing
> >         entries fault back in or are written out to disk.
> >
> > +  memory.zswap.writeback
> > +       A read-write single value file which exists on non-root
> > +       cgroups.  The default value is "1".
> > +
> > +       When this is set to 0, all swapping attempts to swapping devices
> > +       are disabled. This includes both zswap writebacks, and swapping due
> > +       to zswap store failure.
> > +
> > +       Note that this is subtly different from setting memory.swap.max to
> > +       0, as it still allows for pages to be written to the zswap pool.
> > +
> >    memory.pressure
> >         A read-only nested-keyed file.
> >
> > diff --git a/Documentation/admin-guide/mm/zswap.rst b/Documentation/admin-guide/mm/zswap.rst
> > index 522ae22ccb84..b987e58edb70 100644
> > --- a/Documentation/admin-guide/mm/zswap.rst
> > +++ b/Documentation/admin-guide/mm/zswap.rst
> > @@ -153,6 +153,12 @@ attribute, e. g.::
> >
> >  Setting this parameter to 100 will disable the hysteresis.
> >
> > +Some users cannot tolerate the swapping that comes with zswap store failures
> > +and zswap writebacks. Swapping can be disabled entirely (without disabling
> > +zswap itself) on a cgroup-basis as follows:
> > +
> > +       echo 0 > /sys/fs/cgroup/<cgroup-name>/memory.zswap.writeback
> > +
> >  When there is a sizable amount of cold memory residing in the zswap pool, it
> >  can be advantageous to proactively write these cold pages to swap and reclaim
> >  the memory for other use cases. By default, the zswap shrinker is disabled.
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 95f6c9e60ed1..e3a3a06727dc 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -219,6 +219,12 @@ struct mem_cgroup {
> >
> >  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
> >         unsigned long zswap_max;
> > +
> > +       /*
> > +        * Prevent pages from this memcg from being written back from zswap to
> > +        * swap, and from being swapped out on zswap store failures.
> > +        */
> > +       bool zswap_writeback;
> >  #endif
> >
> >         unsigned long soft_limit;
> > @@ -1615,6 +1621,12 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
> >  {
> >         return 0;
> >  }
> > +
> > +static inline bool mem_cgroup_swap_disk_enabled(struct mem_cgroup *memcg)
> > +{
> > +       return false;
> > +}
> > +
>
> This seems to be a leftover from a prior version.

100%! Thanks for picking that up. Not sure why my grep and Ctrl-F
missed it. I'll send a fixlet/new version to remove this later.

This patch nominally depends on the shrinker series (which is currently
being cleaned up - thanks for the suggestions over there too, Yosry), so
there is no rush to merge it. I just wanted to send this out early to hear
people's feedback about the implementation + naming.

>
> >  #endif /* CONFIG_MEMCG */
> >
> >  static inline void __inc_lruvec_kmem_state(void *p, enum node_stat_item idx)
> > @@ -1931,6 +1943,7 @@ static inline void count_objcg_event(struct obj_cgroup *objcg,
> >  bool obj_cgroup_may_zswap(struct obj_cgroup *objcg);
> >  void obj_cgroup_charge_zswap(struct obj_cgroup *objcg, size_t size);
> >  void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size);
> > +bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg);
> >  #else
> >  static inline bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
> >  {
> > @@ -1944,6 +1957,10 @@ static inline void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg,
> >                                              size_t size)
> >  {
> >  }
> > +static inline bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> > +{
> > +       return false;
> > +}
> >  #endif
> >
> >  #endif /* _LINUX_MEMCONTROL_H */
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index e43b5aba8efc..b68c613c23a9 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -5545,6 +5545,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
> >         WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX);
> >  #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
> >         memcg->zswap_max = PAGE_COUNTER_MAX;
> > +       WRITE_ONCE(memcg->zswap_writeback, true);
> >  #endif
> >         page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX);
> >         if (parent) {
> > @@ -8177,6 +8178,12 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
> >         rcu_read_unlock();
> >  }
> >
> > +bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg)
> > +{
> > +       return cgroup_subsys_on_dfl(memory_cgrp_subsys) && memcg
> > +                       && READ_ONCE(memcg->zswap_writeback);
> > +}
> > +
> >  static u64 zswap_current_read(struct cgroup_subsys_state *css,
> >                               struct cftype *cft)
> >  {
> > @@ -8209,6 +8216,31 @@ static ssize_t zswap_max_write(struct kernfs_open_file *of,
> >         return nbytes;
> >  }
> >
> > +static int zswap_writeback_show(struct seq_file *m, void *v)
> > +{
> > +       struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
> > +
> > +       seq_printf(m, "%d\n", READ_ONCE(memcg->zswap_writeback));
> > +       return 0;
> > +}
> > +
> > +static ssize_t zswap_writeback_write(struct kernfs_open_file *of,
> > +                               char *buf, size_t nbytes, loff_t off)
> > +{
> > +       struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> > +       int zswap_writeback;
> > +       ssize_t parse_ret = kstrtoint(strstrip(buf), 0, &zswap_writeback);
> > +
> > +       if (parse_ret)
> > +               return parse_ret;
> > +
> > +       if (zswap_writeback != 0 && zswap_writeback != 1)
> > +               return -EINVAL;
> > +
> > +       WRITE_ONCE(memcg->zswap_writeback, zswap_writeback);
> > +       return nbytes;
> > +}
> > +
> >  static struct cftype zswap_files[] = {
> >         {
> >                 .name = "zswap.current",
> > @@ -8221,6 +8253,12 @@ static struct cftype zswap_files[] = {
> >                 .seq_show = zswap_max_show,
> >                 .write = zswap_max_write,
> >         },
> > +       {
> > +               .name = "zswap.writeback",
> > +               .flags = CFTYPE_NOT_ON_ROOT,
> > +               .seq_show = zswap_writeback_show,
> > +               .write = zswap_writeback_write,
> > +       },
> >         { }     /* terminate */
> >  };
> >  #endif /* CONFIG_MEMCG_KMEM && CONFIG_ZSWAP */
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index cb559ae324c6..5e606f1aa2f6 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -201,6 +201,12 @@ int swap_writepage(struct page *page, struct writeback_control *wbc)
> >                 folio_end_writeback(folio);
> >                 return 0;
> >         }
> > +
> > +       if (!mem_cgroup_zswap_writeback_enabled(folio_memcg(folio))) {
> > +               folio_mark_dirty(folio);
> > +               return AOP_WRITEPAGE_ACTIVATE;
> > +       }
> > +
>
> I am not a fan of this, because it will disable using disk swap if
> "zswap_writeback" is disabled, even if zswap is disabled or the page
> was never in zswap. The term zswap_writeback makes no sense here tbh.
>
> I am still hoping someone else will suggest better semantics, because
> honestly I can't think of anything. Perhaps something like
> memory.swap.zswap_only or memory.swap.types which accepts a string
> (e.g. "zswap"/"all",..).
>
> Don't take my suggestions strongly because I am not very fond of them.
>
> Can anyone else come back with better naming/semantics for "use zswap
> but nothing else when swapping"?

I discussed this a little bit with Johannes, and we decided to go with
memory.zswap.writeback because zswap store failures are quite rare in practice
(and will become even rarer once I have the time to get the storing
compressed objects to happen).

But still, I'm happy to hear any other naming suggestions! This (+ the shrinker
dependency) is why I sent this as an RFC anyway.
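
To make the semantics under discussion concrete, here is a rough userspace
model of the swap_writepage() path as changed by this patch. It is only a
sketch, not kernel code: struct model_cgroup, enum model_outcome and
model_swap_out() are hypothetical names invented for illustration, and the
model ignores the cgroup v1 and NULL-memcg cases that the real helper also
checks. It shows the case Yosry points out: with the knob set to 0, a page
does not reach the swap device even if zswap is disabled.

```c
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_cgroup {
	bool zswap_writeback;	/* models memory.zswap.writeback, default 1 */
};

enum model_outcome {
	MODEL_STORED_IN_ZSWAP,
	MODEL_WRITTEN_TO_DISK,
	MODEL_STAYS_IN_MEMORY,	/* AOP_WRITEPAGE_ACTIVATE in the patch */
};

/*
 * Models the order of checks in swap_writepage() with this patch applied:
 * a successful zswap store ends the path; otherwise the per-cgroup knob
 * decides whether the page may go to the backing swap device.
 */
static enum model_outcome model_swap_out(const struct model_cgroup *cg,
					 bool zswap_enabled,
					 bool zswap_store_ok)
{
	if (zswap_enabled && zswap_store_ok)
		return MODEL_STORED_IN_ZSWAP;

	/* The knob is consulted whether or not zswap was ever involved. */
	if (!cg->zswap_writeback)
		return MODEL_STAYS_IN_MEMORY;

	return MODEL_WRITTEN_TO_DISK;
}
```

In this model, calling model_swap_out() with zswap_writeback == false and
zswap_enabled == false yields MODEL_STAYS_IN_MEMORY, which is the behavior
the naming question is about.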

>
> >         __swap_writepage(&folio->page, wbc);
> >         return 0;
> >  }
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index cab053831fea..e5044678de8b 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -1514,8 +1514,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
> >
> >                 mutex_unlock(&shmem_swaplist_mutex);
> >                 BUG_ON(folio_mapped(folio));
> > -               swap_writepage(&folio->page, wbc);
> > -               return 0;
> > +               return swap_writepage(&folio->page, wbc);
> >         }
> >
> >         mutex_unlock(&shmem_swaplist_mutex);
> > diff --git a/mm/zswap.c b/mm/zswap.c
> > index 260e01180ee0..42a478d1a21f 100644
> > --- a/mm/zswap.c
> > +++ b/mm/zswap.c
> > @@ -590,6 +590,9 @@ static unsigned long zswap_shrinker_scan(struct shrinker *shrinker,
> >         struct zswap_pool *pool = shrinker->private_data;
> >         bool encountered_page_in_swapcache = false;
> >
> > +       if (!mem_cgroup_zswap_writeback_enabled(sc->memcg))
> > +               return SHRINK_STOP;
> > +
> >         nr_protected =
> >                 atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_protected);
> >         lru_size = list_lru_shrink_count(&pool->list_lru, sc);
> > @@ -620,6 +623,9 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
> >         struct lruvec *lruvec = mem_cgroup_lruvec(memcg, NODE_DATA(sc->nid));
> >         unsigned long nr_backing, nr_stored, nr_freeable, nr_protected;
> >
> > +       if (!mem_cgroup_zswap_writeback_enabled(memcg))
> > +               return 0;
> > +
> >  #ifdef CONFIG_MEMCG_KMEM
> >         cgroup_rstat_flush(memcg->css.cgroup);
> >         nr_backing = memcg_page_state(memcg, MEMCG_ZSWAP_B) >> PAGE_SHIFT;
> > @@ -935,6 +941,9 @@ static int shrink_memcg(struct mem_cgroup *memcg)
> >         struct zswap_pool *pool;
> >         int nid, shrunk = 0;
> >
> > +       if (!mem_cgroup_zswap_writeback_enabled(memcg))
> > +               return -EINVAL;
> > +
> >         /*
> >          * Skip zombies because their LRUs are reparented and we would be
> >          * reclaiming from the parent instead of the dead memcg.
> > --
> > 2.34.1


Thread overview: 7+ messages
2023-11-02 20:02 Nhat Pham
2023-11-02 20:27 ` Yosry Ahmed
2023-11-02 20:44   ` Nhat Pham [this message]
2023-11-02 20:50   ` Johannes Weiner
2023-11-02 20:54     ` Yosry Ahmed
2023-11-02 22:40       ` Nhat Pham
2023-11-02 20:58 ` [RFC PATCH v2] zswap: memcontrol: mplement zswap writeback disabling (fix) Nhat Pham
