linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Takero Funaki <flintglass@gmail.com>
To: Nhat Pham <nphamcs@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
	Yosry Ahmed <yosryahmed@google.com>,
	 Chengming Zhou <chengming.zhou@linux.dev>,
	Jonathan Corbet <corbet@lwn.net>,
	 Andrew Morton <akpm@linux-foundation.org>,
	 Domenico Cerasuolo <cerasuolodomenico@gmail.com>,
	linux-mm@kvack.org, linux-doc@vger.kernel.org,
	 linux-kernel@vger.kernel.org,
	Shakeel Butt <shakeel.butt@linux.dev>
Subject: Re: [PATCH v1 1/3] mm: zswap: fix global shrinker memcg iteration
Date: Thu, 13 Jun 2024 03:16:42 +0900	[thread overview]
Message-ID: <CAPpoddeigM44jhTA8Ua=+J4MC1MikouBZVoPrCW2LZF+9r5YeA@mail.gmail.com> (raw)
In-Reply-To: <CAKEwX=P1Ojb71AEJ2gzQTrfWidFPcJZmoNxEwji7TceBN-szCg@mail.gmail.com>

2024年6月12日(水) 3:26 Nhat Pham <nphamcs@gmail.com>:

>
> As I have noted in v0, I think this is unnecessary and makes it more confusing.
>

Does spin_lock() ensure that compiler optimizations do not remove
memory access to an external variable? I think we need to use
READ_ONCE/WRITE_ONCE for shared variable access even under a spinlock.
For example,
https://elixir.bootlin.com/linux/latest/source/mm/mmu_notifier.c#L234

isn't this a common use case of READ_ONCE?
```c
bool shared_flag = false;
spinlock_t flag_lock;

void somefunc(void) {
    for (;;) {
        spin_lock(&flag_lock);
        /* check external updates */
        if (READ_ONCE(shared_flag))
            break;
        /* do something */
        spin_unlock(&flag_lock);
    }
    spin_unlock(&flag_lock);
}
```
Without READ_ONCE, the check can be extracted from the loop by optimization.

In shrink_worker, zswap_next_shrink is the shared_flag , which can be
updated by concurrent cleaner threads, so it must be re-read every
time we reacquire the lock. Am I badly misunderstanding something?

> >         do {
> > +iternext:
> >                 spin_lock(&zswap_shrink_lock);
> > -               zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> > -               memcg = zswap_next_shrink;
> > +               next_memcg = READ_ONCE(zswap_next_shrink);
> > +
> > +               if (memcg != next_memcg) {
> > +                       /*
> > +                        * Ours was released by offlining.
> > +                        * Use the saved memcg reference.
> > +                        */
> > +                       memcg = next_memcg;
> > +               } else {
> > +                       /* advance cursor */
> > +                       memcg = mem_cgroup_iter(NULL, memcg, NULL);
> > +                       WRITE_ONCE(zswap_next_shrink, memcg);
> > +               }
>
> I suppose I'm fine with not advancing the memcg when it is already
> advanced by the memcg offlining callback.
>

For where to restart the shrinking, as Yosry pointed, my version
starts from the last memcg (=retrying failed memcg or evicting once
more)
I now realize that skipping the next memcg of offlined memcg is less
likely to happen. I am reverting it to restart from the next memcg of
zswap_next_shrink.
Which one could be better?

> >
> >                 /*
> > -                * We need to retry if we have gone through a full round trip, or if we
> > -                * got an offline memcg (or else we risk undoing the effect of the
> > -                * zswap memcg offlining cleanup callback). This is not catastrophic
> > -                * per se, but it will keep the now offlined memcg hostage for a while.
> > -                *
> >                  * Note that if we got an online memcg, we will keep the extra
> >                  * reference in case the original reference obtained by mem_cgroup_iter
> >                  * is dropped by the zswap memcg offlining callback, ensuring that the
> > @@ -1434,16 +1468,25 @@ static void shrink_worker(struct work_struct *w)
> >                 }
> >
> >                 if (!mem_cgroup_tryget_online(memcg)) {
> > -                       /* drop the reference from mem_cgroup_iter() */
> > -                       mem_cgroup_iter_break(NULL, memcg);
> > -                       zswap_next_shrink = NULL;
> > +                       /*
> > +                        * It is an offline memcg which we cannot shrink
> > +                        * until its pages are reparented.
> > +                        *
> > +                        * Since we cannot determine if the offline cleaner has
> > +                        * been already called or not, the offline memcg must be
> > +                        * put back unconditonally. We cannot abort the loop while
> > +                        * zswap_next_shrink has a reference of this offline memcg.
> > +                        */
> >                         spin_unlock(&zswap_shrink_lock);
> > -
> > -                       if (++failures == MAX_RECLAIM_RETRIES)
> > -                               break;
> > -
> > -                       goto resched;
> > +                       goto iternext;
>
> Hmmm yeah in the past, I set it to NULL to make sure we're not
> replacing zswap_next_shrink with an offlined memcg, after that zswap
> offlining callback for that memcg has been completed..
>
> I suppose we can just call mem_cgroup_iter(...) on that offlined
> cgroup, but I'm not 100% sure what happens when we call this function
> on a cgroup that is currently being offlined, and has gone past the
> zswap offline callback stage. So I was just playing it safe and
> restart from the top of the tree :)
>
> I think this implementation has that behavior right? We see that the
> memcg is offlined, so we drop the lock and go to the beginning of the
> loop. We reacquire the lock, and might see that zswap_next_shrink ==
> memcg, so we call mem_cgroup_iter(...) on it. Is this safe?
>
> Note that zswap_shrink_lock only orders serializes this memcg
> selection loop with memcg offlining after it - there's no guarantee
> what's the behavior is for memcg offlining before it (well other than
> one reference that we manage to acquire thanks to
> mem_cgroup_iter(...), so that memcg has not been freed, but not sure
> what we can guarantee regarding its place in the memcg hierarchy
> tree?).

The locking mechanism in shrink_worker does not rely on what the next
memcg is.sorting stability of mem_cgroup_iter does not matter
here.
The expectation for the iterator is that it will walk through all live
memcgs. I believe mem_cgroup_iter uses parent-to-leaf ordering of
cgroup and it ensures all live cgroups are walked at least once,
regardless of its onlineness.
https://elixir.bootlin.com/linux/v6.10-rc2/source/mm/memcontrol.c#L1368

Regarding reference leak, I overlooked a scenario where a leak might
occur in the existing cleaner. although it should be rare.

When the cleaner is called on a memcg in zswap_next_shrink, the next
memcg from mem_cgroup_iter() can be an offline already-cleaned memcg,
resulting in a reference leak of the next memcg from the cleaner. We
should implement the same online check in the cleaner, like this:


```c
void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
{
        struct mem_cgroup *next;

        /* lock out zswap shrinker walking memcg tree */
        spin_lock(&zswap_shrink_lock);
        if (zswap_next_shrink == memcg) {
                next = zswap_next_shrink;
                do {
                        next = mem_cgroup_iter(NULL, next, NULL);
                        WRITE_ONCE(zswap_next_shrink, next);

                        spin_unlock(&zswap_shrink_lock);
                        /* zswap_next_shrink might be updated here */
                        spin_lock(&zswap_shrink_lock);

                        next = READ_ONCE(zswap_next_shrink);
                        if (!next)
                                break;
                } while (!mem_cgroup_online(next));
                /*
                 * We verified the next memcg is online under lock.
                 * Even if the next memcg is being offlined here, another
                 * cleaner for the next memcg is waiting for our unlock just
                 * behind us.  We can leave the next memcg reference.
                 */
        }
        spin_unlock(&zswap_shrink_lock);
}
```

As same as in shrink_worker, we must check if the next memcg is online
under the lock before leaving the ref in zswap_next_shrink.
Otherwise, zswap_next_shrink might hold the ref of offlined and cleaned memcg.

Or if you are concerning about temporary storing unchecked or offlined
memcg in zswap_next_shrink, it is safe because:

1. If there is no other cleaner running for zswap_next_shrink, the ref
saved in zswap_next_shrink ensures liveness of the memcg when
reacquired.
2. Another cleaner thread may put back and replace zswap_next_shrink
with its next. We will check onlineness of the new zswap_next_shrink
under reacquired lock.
3. Even if the verified-online memcg is being offlined concurrently,
another cleaner thread must wait for our unlock. We can leave the
online memcg and rely on its respective cleaner.


  parent reply	other threads:[~2024-06-12 18:16 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-08 15:53 [PATCH v1 0/3] mm: zswap: global shrinker fix and proactive shrink Takero Funaki
2024-06-08 15:53 ` [PATCH v1 1/3] mm: zswap: fix global shrinker memcg iteration Takero Funaki
2024-06-10 19:16   ` Yosry Ahmed
2024-06-11 14:50     ` Takero Funaki
2024-06-11 18:26   ` Nhat Pham
2024-06-11 23:03     ` Shakeel Butt
2024-06-12 18:16     ` Takero Funaki [this message]
2024-06-12 18:28       ` Yosry Ahmed
2024-06-13  2:13         ` Takero Funaki
2024-06-13  2:18           ` Yosry Ahmed
2024-06-13  2:35             ` Takero Funaki
2024-06-13  2:57               ` Yosry Ahmed
2024-06-13 15:04                 ` Nhat Pham
2024-06-13 16:49                   ` Shakeel Butt
2024-06-14  4:39                     ` Takero Funaki
2024-06-13 16:08   ` Nhat Pham
2024-06-13 16:09     ` Nhat Pham
2024-06-08 15:53 ` [PATCH v1 2/3] mm: zswap: fix global shrinker error handling logic Takero Funaki
2024-06-10 20:27   ` Yosry Ahmed
2024-06-11 15:21     ` Takero Funaki
2024-06-11 15:51       ` Nhat Pham
2024-06-11 18:15     ` Nhat Pham
2024-06-08 15:53 ` [PATCH v1 3/3] mm: zswap: proactive shrinking before pool size limit is hit Takero Funaki
2024-06-13 15:13   ` Nhat Pham
2024-06-11 18:10 ` [PATCH v1 0/3] mm: zswap: global shrinker fix and proactive shrink Nhat Pham
2024-06-13 15:22 ` Nhat Pham
2024-06-14  4:09   ` Takero Funaki
2024-06-14 22:34     ` Nhat Pham
2024-06-14 22:48     ` Nhat Pham
2024-06-15  0:19     ` Yosry Ahmed
2024-06-20  1:03       ` Takero Funaki
2024-06-20 22:45         ` Nhat Pham

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAPpoddeigM44jhTA8Ua=+J4MC1MikouBZVoPrCW2LZF+9r5YeA@mail.gmail.com' \
    --to=flintglass@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=cerasuolodomenico@gmail.com \
    --cc=chengming.zhou@linux.dev \
    --cc=corbet@lwn.net \
    --cc=hannes@cmpxchg.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=nphamcs@gmail.com \
    --cc=shakeel.butt@linux.dev \
    --cc=yosryahmed@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox