From: Takero Funaki <flintglass@gmail.com>
To: Nhat Pham <nphamcs@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>,
Yosry Ahmed <yosryahmed@google.com>,
Chengming Zhou <chengming.zhou@linux.dev>,
Jonathan Corbet <corbet@lwn.net>,
Andrew Morton <akpm@linux-foundation.org>,
Domenico Cerasuolo <cerasuolodomenico@gmail.com>,
linux-mm@kvack.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org,
Shakeel Butt <shakeel.butt@linux.dev>
Subject: Re: [PATCH v1 1/3] mm: zswap: fix global shrinker memcg iteration
Date: Thu, 13 Jun 2024 03:16:42 +0900
Message-ID: <CAPpoddeigM44jhTA8Ua=+J4MC1MikouBZVoPrCW2LZF+9r5YeA@mail.gmail.com>
In-Reply-To: <CAKEwX=P1Ojb71AEJ2gzQTrfWidFPcJZmoNxEwji7TceBN-szCg@mail.gmail.com>
On Wed, Jun 12, 2024 at 3:26 Nhat Pham <nphamcs@gmail.com> wrote:
>
> As I have noted in v0, I think this is unnecessary and makes it more confusing.
>
Does spin_lock() guarantee that compiler optimizations cannot elide the
memory access to an external variable? I think we still need
READ_ONCE/WRITE_ONCE for shared-variable accesses even under a spinlock.
For example,
https://elixir.bootlin.com/linux/latest/source/mm/mmu_notifier.c#L234
isn't this a common use case of READ_ONCE?
```c
static bool shared_flag = false;
static DEFINE_SPINLOCK(flag_lock);

void somefunc(void)
{
	for (;;) {
		spin_lock(&flag_lock);
		/* check external updates */
		if (READ_ONCE(shared_flag))
			break;
		/* do something */
		spin_unlock(&flag_lock);
	}
	/* exited via break: still holding the lock */
	spin_unlock(&flag_lock);
}
```
Without READ_ONCE, the check could be hoisted out of the loop by the
optimizer. In shrink_worker, zswap_next_shrink plays the role of
shared_flag: it can be updated by concurrent cleaner threads, so it must
be re-read every time we reacquire the lock. Am I badly misunderstanding
something?
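To illustrate the transformation I am worried about, here is a
hand-written sketch of a hoisted version of the loop above (not actual
compiler output; whether spin_lock() alone is enough to forbid this is
exactly my question):
```c
/* Hypothetical hoisted rewrite of somefunc() above. */
void somefunc_hoisted(void)
{
	bool cached = shared_flag;	/* read once, before the loop */

	for (;;) {
		spin_lock(&flag_lock);
		/* the stale cached value never reflects later updates */
		if (cached)
			break;
		/* do something */
		spin_unlock(&flag_lock);
	}
	spin_unlock(&flag_lock);
}
```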
> > do {
> > +iternext:
> > spin_lock(&zswap_shrink_lock);
> > - zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> > - memcg = zswap_next_shrink;
> > + next_memcg = READ_ONCE(zswap_next_shrink);
> > +
> > + if (memcg != next_memcg) {
> > + /*
> > + * Ours was released by offlining.
> > + * Use the saved memcg reference.
> > + */
> > + memcg = next_memcg;
> > + } else {
> > + /* advance cursor */
> > + memcg = mem_cgroup_iter(NULL, memcg, NULL);
> > + WRITE_ONCE(zswap_next_shrink, memcg);
> > + }
>
> I suppose I'm fine with not advancing the memcg when it is already
> advanced by the memcg offlining callback.
>
Regarding where to restart the shrinking: as Yosry pointed out, my
version restarts from the last memcg (i.e. it retries the failed memcg,
or evicts from it once more).
I now realize that skipping the memcg after an offlined one is unlikely
to happen in practice, so I am reverting to restarting from the memcg
that follows zswap_next_shrink.
Which behavior would be better? (See the sketch below.)
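To make the two alternatives concrete, here is a minimal hand-written
sketch of what the worker would do when it re-enters the selection loop
(names follow the patch, this is not the exact diff, and only one of the
two snippets would be used):
```c
	spin_lock(&zswap_shrink_lock);

	/* Alternative 1: retry the memcg currently saved in the cursor. */
	memcg = READ_ONCE(zswap_next_shrink);

	/* Alternative 2: advance the cursor and start from its successor. */
	memcg = mem_cgroup_iter(NULL, READ_ONCE(zswap_next_shrink), NULL);
	WRITE_ONCE(zswap_next_shrink, memcg);
```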
> >
> > /*
> > - * We need to retry if we have gone through a full round trip, or if we
> > - * got an offline memcg (or else we risk undoing the effect of the
> > - * zswap memcg offlining cleanup callback). This is not catastrophic
> > - * per se, but it will keep the now offlined memcg hostage for a while.
> > - *
> > * Note that if we got an online memcg, we will keep the extra
> > * reference in case the original reference obtained by mem_cgroup_iter
> > * is dropped by the zswap memcg offlining callback, ensuring that the
> > @@ -1434,16 +1468,25 @@ static void shrink_worker(struct work_struct *w)
> > }
> >
> > if (!mem_cgroup_tryget_online(memcg)) {
> > - /* drop the reference from mem_cgroup_iter() */
> > - mem_cgroup_iter_break(NULL, memcg);
> > - zswap_next_shrink = NULL;
> > + /*
> > + * It is an offline memcg which we cannot shrink
> > + * until its pages are reparented.
> > + *
> > + * Since we cannot determine if the offline cleaner has
> > + * been already called or not, the offline memcg must be
> > + * put back unconditonally. We cannot abort the loop while
> > + * zswap_next_shrink has a reference of this offline memcg.
> > + */
> > spin_unlock(&zswap_shrink_lock);
> > -
> > - if (++failures == MAX_RECLAIM_RETRIES)
> > - break;
> > -
> > - goto resched;
> > + goto iternext;
>
> Hmmm yeah in the past, I set it to NULL to make sure we're not
> replacing zswap_next_shrink with an offlined memcg, after that zswap
> offlining callback for that memcg has been completed..
>
> I suppose we can just call mem_cgroup_iter(...) on that offlined
> cgroup, but I'm not 100% sure what happens when we call this function
> on a cgroup that is currently being offlined, and has gone past the
> zswap offline callback stage. So I was just playing it safe and
> restart from the top of the tree :)
>
> I think this implementation has that behavior right? We see that the
> memcg is offlined, so we drop the lock and go to the beginning of the
> loop. We reacquire the lock, and might see that zswap_next_shrink ==
> memcg, so we call mem_cgroup_iter(...) on it. Is this safe?
>
> Note that zswap_shrink_lock only orders serializes this memcg
> selection loop with memcg offlining after it - there's no guarantee
> what's the behavior is for memcg offlining before it (well other than
> one reference that we manage to acquire thanks to
> mem_cgroup_iter(...), so that memcg has not been freed, but not sure
> what we can guarantee regarding its place in the memcg hierarchy
> tree?).
The locking mechanism in shrink_worker does not rely on which memcg
comes next; the ordering stability of mem_cgroup_iter does not matter
here.
The only expectation for the iterator is that it walks through all live
memcgs. I believe mem_cgroup_iter walks the cgroup tree in
parent-to-leaf order and visits every cgroup at least once, regardless
of whether it is online:
https://elixir.bootlin.com/linux/v6.10-rc2/source/mm/memcontrol.c#L1368
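For reference, the generic full-walk idiom with mem_cgroup_iter() looks
roughly like this (a sketch of the common pattern, not code from this
patch):
```c
	struct mem_cgroup *memcg;

	/* Visit every memcg in the hierarchy (NULL root means root_mem_cgroup). */
	memcg = mem_cgroup_iter(NULL, NULL, NULL);
	do {
		/* ... visit memcg; the iterator returns it with a reference held ... */
	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
```
Breaking out of such a walk early requires mem_cgroup_iter_break() to
drop the reference on the current position.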
Regarding the reference leak, I overlooked a scenario where a leak might
occur in the existing cleaner, although it should be rare.
When the cleaner is called on the memcg stored in zswap_next_shrink, the
next memcg returned by mem_cgroup_iter() can itself be an offline,
already-cleaned memcg, and the cleaner would then leak a reference to
it. We should implement the same online check in the cleaner, like this:
```c
void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
{
	struct mem_cgroup *next;

	/* lock out zswap shrinker walking memcg tree */
	spin_lock(&zswap_shrink_lock);
	if (zswap_next_shrink == memcg) {
		next = zswap_next_shrink;
		do {
			next = mem_cgroup_iter(NULL, next, NULL);
			WRITE_ONCE(zswap_next_shrink, next);
			spin_unlock(&zswap_shrink_lock);
			/* zswap_next_shrink might be updated here */
			spin_lock(&zswap_shrink_lock);
			next = READ_ONCE(zswap_next_shrink);
			if (!next)
				break;
		} while (!mem_cgroup_online(next));
		/*
		 * We verified the next memcg is online under lock.
		 * Even if the next memcg is being offlined here, another
		 * cleaner for the next memcg is waiting for our unlock just
		 * behind us. We can leave the next memcg reference.
		 */
	}
	spin_unlock(&zswap_shrink_lock);
}
```
As in shrink_worker, we must check whether the next memcg is online
while holding the lock before leaving its reference in
zswap_next_shrink. Otherwise, zswap_next_shrink might end up holding a
reference to an offlined, already-cleaned memcg.
If the concern is about temporarily storing an unchecked or offlined
memcg in zswap_next_shrink, that is safe because:
1. If no other cleaner is running for zswap_next_shrink, the reference
saved in zswap_next_shrink keeps the memcg alive until the lock is
reacquired.
2. Another cleaner thread may put the reference back and replace
zswap_next_shrink with its own next memcg. We will check the onlineness
of the new zswap_next_shrink under the reacquired lock.
3. Even if the verified-online memcg is being offlined concurrently, its
cleaner must wait for our unlock. We can leave the online memcg in place
and rely on its respective cleaner.