From: Takero Funaki <flintglass@gmail.com>
Date: Thu, 13 Jun 2024 03:16:42 +0900
Subject: Re: [PATCH v1 1/3] mm: zswap: fix global shrinker memcg iteration
To: Nhat Pham
Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Jonathan Corbet,
 Andrew Morton, Domenico Cerasuolo, linux-mm@kvack.org,
 linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, Shakeel Butt

On Wed, Jun 12, 2024 at 3:26, Nhat Pham wrote:
>
> As I have noted in v0, I think this is unnecessary and makes it more confusing.
>

Does spin_lock() ensure that compiler optimizations do not remove
memory accesses to an external variable? I think we need to use
READ_ONCE/WRITE_ONCE for shared variable access even under a spinlock.
For example,
https://elixir.bootlin.com/linux/latest/source/mm/mmu_notifier.c#L234
isn't this a common use case of READ_ONCE?

```c
bool shared_flag = false;
spinlock_t flag_lock;

void somefunc(void)
{
	for (;;) {
		spin_lock(&flag_lock);
		/* check external updates */
		if (READ_ONCE(shared_flag))
			break;
		/* do something */
		spin_unlock(&flag_lock);
	}
	spin_unlock(&flag_lock);
}
```

Without READ_ONCE, the check can be hoisted out of the loop by
optimization.

In shrink_worker, zswap_next_shrink plays the role of shared_flag: it
can be updated by concurrent cleaner threads, so it must be re-read
every time we reacquire the lock. Am I badly misunderstanding
something?
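To make the worry concrete, here is a sketch of the transformation I
have in mind (my illustration, not actual compiler output; whether the
barriers implied by spin_lock() already forbid this is exactly the
question):

```c
/*
 * Illustrative only: what somefunc() above would degenerate into if
 * the compiler were free to assume shared_flag cannot change behind
 * its back and cached the plain load in a register.
 */
void somefunc_hoisted(void)
{
	bool cached = shared_flag;	/* single load, hoisted out of the loop */

	for (;;) {
		spin_lock(&flag_lock);
		/* stale check: never observes a concurrent writer */
		if (cached)
			break;
		/* do something */
		spin_unlock(&flag_lock);
	}
	spin_unlock(&flag_lock);
}
```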
> >         do {
> > +iternext:
> >                 spin_lock(&zswap_shrink_lock);
> > -               zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> > -               memcg = zswap_next_shrink;
> > +               next_memcg = READ_ONCE(zswap_next_shrink);
> > +
> > +               if (memcg != next_memcg) {
> > +                       /*
> > +                        * Ours was released by offlining.
> > +                        * Use the saved memcg reference.
> > +                        */
> > +                       memcg = next_memcg;
> > +               } else {
> > +                       /* advance cursor */
> > +                       memcg = mem_cgroup_iter(NULL, memcg, NULL);
> > +                       WRITE_ONCE(zswap_next_shrink, memcg);
> > +               }
>
> I suppose I'm fine with not advancing the memcg when it is already
> advanced by the memcg offlining callback.
>

For where to restart the shrinking, as Yosry pointed out, my version
starts from the last memcg (i.e. retrying the failed memcg, or evicting
from it once more).
I now realize that skipping the memcg after an offlined memcg is less
likely to happen. I am reverting it to restart from the memcg after
zswap_next_shrink. Which one would be better?

> >
> >                 /*
> > -                * We need to retry if we have gone through a full round trip, or if we
> > -                * got an offline memcg (or else we risk undoing the effect of the
> > -                * zswap memcg offlining cleanup callback). This is not catastrophic
> > -                * per se, but it will keep the now offlined memcg hostage for a while.
> > -                *
> >                  * Note that if we got an online memcg, we will keep the extra
> >                  * reference in case the original reference obtained by mem_cgroup_iter
> >                  * is dropped by the zswap memcg offlining callback, ensuring that the
> > @@ -1434,16 +1468,25 @@ static void shrink_worker(struct work_struct *w)
> >                 }
> >
> >                 if (!mem_cgroup_tryget_online(memcg)) {
> > -                       /* drop the reference from mem_cgroup_iter() */
> > -                       mem_cgroup_iter_break(NULL, memcg);
> > -                       zswap_next_shrink = NULL;
> > +                       /*
> > +                        * It is an offline memcg which we cannot shrink
> > +                        * until its pages are reparented.
> > +                        *
> > +                        * Since we cannot determine if the offline cleaner has
> > +                        * been already called or not, the offline memcg must be
> > +                        * put back unconditionally. We cannot abort the loop while
> > +                        * zswap_next_shrink has a reference of this offline memcg.
> > +                        */
> >                         spin_unlock(&zswap_shrink_lock);
> > -
> > -                       if (++failures == MAX_RECLAIM_RETRIES)
> > -                               break;
> > -
> > -                       goto resched;
> > +                       goto iternext;
>
> Hmmm yeah in the past, I set it to NULL to make sure we're not
> replacing zswap_next_shrink with an offlined memcg, after the zswap
> offlining callback for that memcg has been completed.
>
> I suppose we can just call mem_cgroup_iter(...) on that offlined
> cgroup, but I'm not 100% sure what happens when we call this function
> on a cgroup that is currently being offlined, and has gone past the
> zswap offline callback stage. So I was just playing it safe and
> restarting from the top of the tree :)
>
> I think this implementation has that behavior, right? We see that the
> memcg is offlined, so we drop the lock and go to the beginning of the
> loop. We reacquire the lock, and might see that zswap_next_shrink ==
> memcg, so we call mem_cgroup_iter(...) on it. Is this safe?
>
> Note that zswap_shrink_lock only serializes this memcg
> selection loop with memcg offlining after it - there's no guarantee
> what the behavior is for memcg offlining before it (well, other than
> the one reference that we manage to acquire thanks to
> mem_cgroup_iter(...), so that memcg has not been freed, but I'm not
> sure what we can guarantee regarding its place in the memcg hierarchy
> tree?).

The locking mechanism in shrink_worker does not rely on what the next
memcg is. The sorting stability of mem_cgroup_iter does not matter
here. The expectation for the iterator is that it walks through all
live memcgs. I believe mem_cgroup_iter uses parent-to-leaf (pre-order)
ordering of the cgroup tree, which ensures that every live cgroup is
visited at least once, regardless of its onlineness.
https://elixir.bootlin.com/linux/v6.10-rc2/source/mm/memcontrol.c#L1368
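For reference, the full-walk pattern I am relying on, as I understand
the mem_cgroup_iter() API (a sketch under that assumption, not code
from the patch; the early-exit condition is hypothetical):

```c
/*
 * Sketch of a full pre-order walk of the memcg hierarchy. Each
 * mem_cgroup_iter() call puts the reference pinning @prev and returns
 * the next memcg with a new reference held; NULL means the walk has
 * completed a full round trip.
 */
static void walk_all_memcgs(void)
{
	struct mem_cgroup *iter = NULL;

	while ((iter = mem_cgroup_iter(NULL, iter, NULL))) {
		/* visit iter: both online and offline memcgs appear here */

		if (0 /* hypothetical early-exit condition */) {
			/* bailing out early: drop the held reference ourselves */
			mem_cgroup_iter_break(NULL, iter);
			break;
		}
	}
}
```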
Regarding the reference leak, I overlooked a scenario where a leak
might occur in the existing cleaner, although it should be rare.
When the cleaner is called on the memcg in zswap_next_shrink, the next
memcg from mem_cgroup_iter() can be an offline, already-cleaned memcg,
resulting in a reference leak of that next memcg from the cleaner.

We should implement the same online check in the cleaner, like this:

```c
void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
{
	struct mem_cgroup *next;

	/* lock out zswap shrinker walking memcg tree */
	spin_lock(&zswap_shrink_lock);

	if (zswap_next_shrink == memcg) {
		next = zswap_next_shrink;
		do {
			next = mem_cgroup_iter(NULL, next, NULL);
			WRITE_ONCE(zswap_next_shrink, next);

			spin_unlock(&zswap_shrink_lock);
			/* zswap_next_shrink might be updated here */
			spin_lock(&zswap_shrink_lock);

			next = READ_ONCE(zswap_next_shrink);
			if (!next)
				break;
		} while (!mem_cgroup_online(next));
		/*
		 * We verified the next memcg is online under lock.
		 * Even if the next memcg is being offlined here, another
		 * cleaner for the next memcg is waiting for our unlock just
		 * behind us. We can leave the next memcg reference.
		 */
	}
	spin_unlock(&zswap_shrink_lock);
}
```

As in shrink_worker, we must check that the next memcg is online under
the lock before leaving the ref in zswap_next_shrink. Otherwise,
zswap_next_shrink might hold the ref of an offlined and already-cleaned
memcg.

Or, if you are concerned about temporarily storing an unchecked or
offlined memcg in zswap_next_shrink, it is safe because:

1. If there is no other cleaner running for zswap_next_shrink, the ref
   saved in zswap_next_shrink ensures the liveness of the memcg when
   the lock is reacquired.
2. Another cleaner thread may put back and replace zswap_next_shrink
   with its next. We will check the onlineness of the new
   zswap_next_shrink under the reacquired lock.
3. Even if the verified-online memcg is being offlined concurrently,
   another cleaner thread must wait for our unlock. We can leave the
   online memcg and rely on its respective cleaner.
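If it helps, points 1-3 boil down to one rule; a condensed restatement
(my paraphrase, not patch code):

```c
/*
 * Condensed restatement of points 1-3 above (not patch code): a
 * reference may be left in zswap_next_shrink only when, while holding
 * zswap_shrink_lock, the cursor is either NULL or a memcg verified to
 * be online. Any other cursor must be advanced or put back before the
 * lock is dropped for good.
 */
static bool zswap_cursor_may_be_left(void)
{
	struct mem_cgroup *cur;

	lockdep_assert_held(&zswap_shrink_lock);
	cur = READ_ONCE(zswap_next_shrink);

	/*
	 * Online-but-racing-with-offlining is still fine (point 3):
	 * that memcg's own cleaner serializes behind this lock and
	 * will advance the cursor right after we unlock.
	 */
	return !cur || mem_cgroup_online(cur);
}
```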