linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Yosry Ahmed <yosryahmed@google.com>
To: Nhat Pham <nphamcs@gmail.com>
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org,
	cerasuolodomenico@gmail.com,  sjenning@redhat.com,
	ddstreet@ieee.org, vitaly.wool@konsulko.com,  mhocko@kernel.org,
	roman.gushchin@linux.dev, shakeelb@google.com,
	 muchun.song@linux.dev, chrisl@kernel.org, linux-mm@kvack.org,
	 kernel-team@meta.com, linux-kernel@vger.kernel.org,
	cgroups@vger.kernel.org,  linux-doc@vger.kernel.org,
	linux-kselftest@vger.kernel.org, shuah@kernel.org
Subject: Re: [PATCH v4 2/5] zswap: make shrinking memcg-aware
Date: Mon, 30 Oct 2023 11:16:10 -0700	[thread overview]
Message-ID: <CAJD7tkZ7PPQmWu9UVH7WS3KFjmNW3q=JoMASmYtUb-Uy702iJg@mail.gmail.com> (raw)
In-Reply-To: <CAKEwX=Nr7xJYpL2nE_ob0dWg9rnfoz67OMe_wvGsKjxboo1H+A@mail.gmail.com>

> > [..]
> > > +/*********************************
> > > +* lru functions
> > > +**********************************/
> > > +static bool zswap_lru_add(struct list_lru *list_lru, struct zswap_entry *entry)
> > > +{
> > > +       struct mem_cgroup *memcg = get_mem_cgroup_from_entry(entry);
> > > +       int nid = entry_to_nid(entry);
> > > +       bool added = list_lru_add(list_lru, &entry->lru, nid, memcg);
> > > +
> > > +       mem_cgroup_put(memcg);
> >
> > Still not fond of the get/put pattern but okay..
>
> Actually, Johannes and I took another look to see if we can replace
> the memcg reference getting with just rcu_read_lock().
>
> It seems there might be a race between zswap LRU manipulation
> and memcg offlining - not just with the rcu_read_lock() idea, but also
> with our current implementation!
>
> I'll shoot another email with more details later when I'm sure of it
> one way or another...
>

Interesting, well at least something came out of my complaining :)

> > [..]
> > > @@ -652,28 +679,37 @@ static int zswap_reclaim_entry(struct zswap_pool *pool)
> > >          */
> > >         swpoffset = swp_offset(entry->swpentry);
> > >         tree = zswap_trees[swp_type(entry->swpentry)];
> > > -       spin_unlock(&pool->lru_lock);
> > > +       list_lru_isolate(l, item);
> > > +       /*
> > > +        * It's safe to drop the lock here because we return either
> > > +        * LRU_REMOVED_RETRY or LRU_RETRY.
> > > +        */
> > > +       spin_unlock(lock);
> > >
> > >         /* Check for invalidate() race */
> > >         spin_lock(&tree->lock);
> > > -       if (entry != zswap_rb_search(&tree->rbroot, swpoffset)) {
> > > -               ret = -EAGAIN;
> > > +       if (entry != zswap_rb_search(&tree->rbroot, swpoffset))
> > >                 goto unlock;
> > > -       }
> > > +
> > >         /* Hold a reference to prevent a free during writeback */
> > >         zswap_entry_get(entry);
> > >         spin_unlock(&tree->lock);
> > >
> > > -       ret = zswap_writeback_entry(entry, tree);
> > > +       writeback_result = zswap_writeback_entry(entry, tree);
> > >
> > >         spin_lock(&tree->lock);
> > > -       if (ret) {
> > > -               /* Writeback failed, put entry back on LRU */
> > > -               spin_lock(&pool->lru_lock);
> > > -               list_move(&entry->lru, &pool->lru);
> > > -               spin_unlock(&pool->lru_lock);
> > > +       if (writeback_result) {
> > > +               zswap_reject_reclaim_fail++;
> > > +               memcg = get_mem_cgroup_from_entry(entry);
> >
> > Can this return NULL? Seems like we don't check the return in most/all places.
>
> I believe so, but memcg experts should fact check me on this.

If that's the case, there should be NULL checks, no?

> It's roughly the same pattern as zswap charging/uncharging:
>
> obj_cgroup_uncharge_zswap(entry->objcg, entry->length)
> -> getting memcg (under rcu_read_lock())
>
> >
> > > +               spin_lock(lock);
> > > +               /* we cannot use zswap_lru_add here, because it increments node's lru count */
> > > +               list_lru_putback(&entry->pool->list_lru, item, entry_to_nid(entry), memcg);
> >
> > Perhaps we can move this call with the memcg get/put to a helper like
> > add/del? (e.g. zswap_lru_putback)
> >
> > We would need to move get_mem_cgroup_from_entry() into the lock but I
> > think that's okay.
>
> We probably could, but that sounds like extra code for not a lot of gains, no?

I don't feel strongly, just a fan of consistency.

>
> >
> > > +               spin_unlock(lock);
> > > +               mem_cgroup_put(memcg);
> > > +               ret = LRU_RETRY;
> > >                 goto put_unlock;
> > >         }
> > > +       zswap_written_back_pages++;
> > >
> > >         /*
> > >          * Writeback started successfully, the page now belongs to the
[..]
> > > @@ -696,15 +759,17 @@ static void shrink_worker(struct work_struct *w)
> > >                                                 shrink_work);
> > >         int ret, failures = 0;
> > >
> > > +       /* global reclaim will select cgroup in a round-robin fashion. */
> > >         do {
> > > -               ret = zswap_reclaim_entry(pool);
> > > -               if (ret) {
> > > -                       zswap_reject_reclaim_fail++;
> > > -                       if (ret != -EAGAIN)
> > > -                               break;
> > > -                       if (++failures == MAX_RECLAIM_RETRIES)
> > > -                               break;
> > > -               }
> > > +               pool->next_shrink = mem_cgroup_iter(NULL, pool->next_shrink, NULL);
> >
> > I think this can be a problem. We hold a ref to a memcg here until the
> > next time we shrink, which can be a long time IIUC. This can cause the
> > memcg to linger as a zombie. I understand it is one memcg per-zswap
> > pool, but I am still unsure about it.
> >
> > MGLRU maintains a memcg LRU for global reclaim that gets properly
> > cleaned up when a memcg is going away, so that's one option, although
> > complicated.
> >
> > A second option would be to hold a pointer to the objcg instead, which
> > should be less problematic (although we are still holding that objcg
> > hostage indefinitely). The problem here is that if the objcg gets
> > reparented, next time we will start at the parent of the memcg we
> > stopped at last time, which tbh doesn't sound bad at all to me.
> >
> > A third option would be to flag the memcg such that when it is getting
> > offlined we can call into zswap to reset pool->next_shrink (or move it
> > to the parent) and drop the ref. Although synchronization can get
> > hairy when racing with offlining.
> >
> > Not sure what's the right solution, but I prefer we don't hold any
> > memcgs hostages indefinitely. I also think if we end up using
> > mem_cgroup_iter() then there should be a mem_cgroup_iter_break()
> > somewhere if/when breaking the iteration.
> >
>
> I'm not sure if this is that big of a problem in the first place, but
> if it is, doing something similar to MGLRU is probably the cleanest:
> when the memcg is freed, trigger the zswap_exit_memcg() callback,
> which will loop through all the zswap pools and update pool->next_shrink
> where appropriate.
>
> Note that we only have one pool per (compression algorithm x allocator)
> combinations, so there cannot be that many pools, correct?
>
> Johannes suggests this idea to me (my apologies if I butcher it)
> during one of our conversations. That sounds relatively easy IIUC.

Be careful that there will be a race between memcg offlining and
zswap's usage of pool->next_shrink. AFAICT there is no lock to prevent
offlining so there will have to be some sort of dance here to make
sure everything works correctly.


  parent reply	other threads:[~2023-10-30 18:16 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-24 20:32 [PATCH v4 0/5] workload-specific and memory pressure-driven zswap writeback Nhat Pham
2023-10-24 20:32 ` [PATCH v4 1/5] list_lru: allows explicit memcg and NUMA node selection Nhat Pham
2023-10-24 20:32 ` [PATCH v4 2/5] zswap: make shrinking memcg-aware Nhat Pham
2023-10-25  3:16   ` Yosry Ahmed
2023-10-27 21:10     ` Nhat Pham
2023-10-29  1:26       ` Nhat Pham
2023-10-30 18:16       ` Yosry Ahmed [this message]
2023-11-01  1:26   ` Nhat Pham
2023-11-01  1:32     ` Yosry Ahmed
2023-11-01  1:41       ` Nhat Pham
2023-11-01  3:06     ` Muchun Song
2023-11-01 17:44       ` Nhat Pham
2023-11-02  2:17         ` Muchun Song
2023-10-24 20:33 ` [PATCH v4 3/5] mm: memcg: add per-memcg zswap writeback stat Nhat Pham
2023-10-24 20:33 ` [PATCH v4 5/5] zswap: shrinks zswap pool based on memory pressure Nhat Pham

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAJD7tkZ7PPQmWu9UVH7WS3KFjmNW3q=JoMASmYtUb-Uy702iJg@mail.gmail.com' \
    --to=yosryahmed@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=cerasuolodomenico@gmail.com \
    --cc=cgroups@vger.kernel.org \
    --cc=chrisl@kernel.org \
    --cc=ddstreet@ieee.org \
    --cc=hannes@cmpxchg.org \
    --cc=kernel-team@meta.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@kernel.org \
    --cc=muchun.song@linux.dev \
    --cc=nphamcs@gmail.com \
    --cc=roman.gushchin@linux.dev \
    --cc=shakeelb@google.com \
    --cc=shuah@kernel.org \
    --cc=sjenning@redhat.com \
    --cc=vitaly.wool@konsulko.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox