From: Nhat Pham
Date: Tue, 11 Jun 2024 11:26:12 -0700
Subject: Re: [PATCH v1 1/3] mm: zswap: fix global shrinker memcg iteration
To: Takero Funaki
Cc: Johannes Weiner, Yosry Ahmed, Chengming Zhou, Jonathan Corbet,
 Andrew Morton, Domenico Cerasuolo, linux-mm@kvack.org,
 linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, Shakeel Butt
In-Reply-To: <20240608155316.451600-2-flintglass@gmail.com>
References: <20240608155316.451600-1-flintglass@gmail.com>
 <20240608155316.451600-2-flintglass@gmail.com>

On Sat, Jun 8, 2024 at 8:53 AM Takero Funaki wrote:
>
> This patch fixes an issue where the zswap global shrinker stopped
> iterating through the memcg tree.
>
> The problem was that `shrink_worker()` would stop iterating when a memcg
> was being offlined and restart from the tree root. Now, it properly
> handles the offlining memcg and continues shrinking with the next memcg.
>
> This patch also modified handing of the lock for offlined memcg cleaner
> to adapt the change in the iteration, and avoid negligibly rare skipping
> of a memcg from shrink iteration.
>
> Fixes: a65b0e7607cc ("zswap: make shrinking memcg-aware")
> Signed-off-by: Takero Funaki
> ---
>  mm/zswap.c | 87 ++++++++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 68 insertions(+), 19 deletions(-)
>
> diff --git a/mm/zswap.c b/mm/zswap.c
> index 80c634acb8d5..d720a42069b6 100644
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -827,12 +827,27 @@ void zswap_folio_swapin(struct folio *folio)
>  	}
>  }
>
> +/*
> + * This function should be called when a memcg is being offlined.
> + *
> + * Since the global shrinker shrink_worker() may hold a reference
> + * of the memcg, we must check and release the reference in
> + * zswap_next_shrink.
> + *
> + * shrink_worker() must handle the case where this function releases
> + * the reference of memcg being shrunk.
> + */
>  void zswap_memcg_offline_cleanup(struct mem_cgroup *memcg)
>  {
>  	/* lock out zswap shrinker walking memcg tree */
>  	spin_lock(&zswap_shrink_lock);
> -	if (zswap_next_shrink == memcg)
> -		zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> +
> +	if (READ_ONCE(zswap_next_shrink) == memcg) {
> +		/* put back reference and advance the cursor */
> +		memcg = mem_cgroup_iter(NULL, memcg, NULL);
> +		WRITE_ONCE(zswap_next_shrink, memcg);
> +	}
> +
>  	spin_unlock(&zswap_shrink_lock);
>  }

As I have noted in v0, I think this is unnecessary and makes it more
confusing.
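For readers following along, here is a toy userspace sketch (not kernel
code) of the skip that the patch description calls negligibly rare: the
offline callback advances the shared cursor, and a worker that then
advances unconditionally passes over one group. Everything named toy_* is
hypothetical. Groups are plain ints in a flat list, -1 stands in for NULL,
and toy_iter() only mimics the round-robin role of mem_cgroup_iter().
Locking and reference counting are left out entirely.

```c
#include <assert.h>

#define TOY_NR_GROUPS 4

/* Stand-in for zswap_next_shrink; -1 plays the role of NULL. */
int toy_cursor = -1;

/*
 * Stand-in for mem_cgroup_iter(NULL, prev, NULL) over a flat list:
 * returns the next group, or -1 once the round trip is complete.
 */
int toy_iter(int prev)
{
	int next = prev + 1;

	assert(prev >= -1 && prev < TOY_NR_GROUPS);
	return next < TOY_NR_GROUPS ? next : -1;
}

/* Offline callback: move the cursor off the group that is going away. */
void toy_offline_cleanup(int group)
{
	if (toy_cursor == group)
		toy_cursor = toy_iter(group);
}

/* Pre-patch selection: always advance from whatever the cursor holds. */
int toy_select_old(void)
{
	toy_cursor = toy_iter(toy_cursor);
	return toy_cursor;
}

/* Patched selection: 'mine' is the group picked on the previous step. */
int toy_select_new(int mine)
{
	/* The cleanup already advanced the cursor: adopt it as is. */
	if (mine != toy_cursor)
		return toy_cursor;

	/* Nobody moved the cursor: advance it ourselves. */
	toy_cursor = toy_iter(mine);
	return toy_cursor;
}
```

With the cursor on group 0, calling toy_offline_cleanup(0) and then
toy_select_old() lands on group 2 and passes over group 1, while
toy_select_new(0) returns group 1.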
>
> @@ -1401,25 +1416,44 @@ static int shrink_memcg(struct mem_cgroup *memcg)
>
>  static void shrink_worker(struct work_struct *w)
>  {
> -	struct mem_cgroup *memcg;
> +	struct mem_cgroup *memcg = NULL;
> +	struct mem_cgroup *next_memcg;
>  	int ret, failures = 0;
>  	unsigned long thr;
>
>  	/* Reclaim down to the accept threshold */
>  	thr = zswap_accept_thr_pages();
>
> -	/* global reclaim will select cgroup in a round-robin fashion. */
> +	/* global reclaim will select cgroup in a round-robin fashion.
> +	 *
> +	 * We save iteration cursor memcg into zswap_next_shrink,
> +	 * which can be modified by the offline memcg cleaner
> +	 * zswap_memcg_offline_cleanup().
> +	 *
> +	 * Since the offline cleaner is called only once, we cannot abandone
> +	 * offline memcg reference in zswap_next_shrink.
> +	 * We can rely on the cleaner only if we get online memcg under lock.
> +	 * If we get offline memcg, we cannot determine the cleaner will be
> +	 * called later. We must put it before returning from this function.
> +	 */
>  	do {
> +iternext:
>  		spin_lock(&zswap_shrink_lock);
> -		zswap_next_shrink = mem_cgroup_iter(NULL, zswap_next_shrink, NULL);
> -		memcg = zswap_next_shrink;
> +		next_memcg = READ_ONCE(zswap_next_shrink);
> +
> +		if (memcg != next_memcg) {
> +			/*
> +			 * Ours was released by offlining.
> +			 * Use the saved memcg reference.
> +			 */
> +			memcg = next_memcg;
> +		} else {
> +			/* advance cursor */
> +			memcg = mem_cgroup_iter(NULL, memcg, NULL);
> +			WRITE_ONCE(zswap_next_shrink, memcg);
> +		}

I suppose I'm fine with not advancing the memcg when it is already advanced
by the memcg offlining callback.

>
>  		/*
> -		 * We need to retry if we have gone through a full round trip, or if we
> -		 * got an offline memcg (or else we risk undoing the effect of the
> -		 * zswap memcg offlining cleanup callback). This is not catastrophic
> -		 * per se, but it will keep the now offlined memcg hostage for a while.
> -		 *
>  		 * Note that if we got an online memcg, we will keep the extra
>  		 * reference in case the original reference obtained by mem_cgroup_iter
>  		 * is dropped by the zswap memcg offlining callback, ensuring that the
> @@ -1434,16 +1468,25 @@ static void shrink_worker(struct work_struct *w)
>  		}
>
>  		if (!mem_cgroup_tryget_online(memcg)) {
> -			/* drop the reference from mem_cgroup_iter() */
> -			mem_cgroup_iter_break(NULL, memcg);
> -			zswap_next_shrink = NULL;
> +			/*
> +			 * It is an offline memcg which we cannot shrink
> +			 * until its pages are reparented.
> +			 *
> +			 * Since we cannot determine if the offline cleaner has
> +			 * been already called or not, the offline memcg must be
> +			 * put back unconditonally. We cannot abort the loop while
> +			 * zswap_next_shrink has a reference of this offline memcg.
> +			 */
>  			spin_unlock(&zswap_shrink_lock);
> -
> -			if (++failures == MAX_RECLAIM_RETRIES)
> -				break;
> -
> -			goto resched;
> +			goto iternext;

Hmmm yeah, in the past I set it to NULL to make sure we're not replacing
zswap_next_shrink with an offlined memcg after the zswap offlining
callback for that memcg has completed.

I suppose we can just call mem_cgroup_iter(...) on that offlined cgroup,
but I'm not 100% sure what happens when we call this function on a cgroup
that is currently being offlined and has gone past the zswap offline
callback stage. So I was just playing it safe and restarted from the top
of the tree :)

I think this implementation has that behavior, right? We see that the
memcg is offlined, so we drop the lock and go to the beginning of the
loop. We reacquire the lock, and might see that zswap_next_shrink ==
memcg, so we call mem_cgroup_iter(...) on it. Is this safe?
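As a rough illustration of the control-flow difference only, the toy
userspace sketch below compares the pre-patch policy (reset the cursor and
restart from the root when an offline memcg is found) with the patch's
"goto iternext" policy (keep the cursor and move on). It says nothing
about whether calling mem_cgroup_iter() on a memcg that is past its
offline callback is safe in the kernel. All toy_* names are hypothetical,
groups are ints in a flat list, -1 stands in for NULL, and the assumption
that an offline group stays visible to the walk is taken from the patch
description, not verified here.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_GROUPS 4

/* Hypothetical state: group 1 is offline but still visible to the walk. */
const bool toy_online[TOY_NR_GROUPS] = { true, false, true, true };

/* Flat round-robin successor; -1 marks the end of one round trip. */
int toy_next(int prev)
{
	int next = prev + 1;

	return next < TOY_NR_GROUPS ? next : -1;
}

/*
 * Count the distinct online groups reached in 'steps' selection steps.
 * restart_on_offline = true models the pre-patch reset of the cursor to
 * NULL; false models the patch's "goto iternext" with the cursor kept.
 */
int toy_groups_reached(int steps, bool restart_on_offline)
{
	bool reached[TOY_NR_GROUPS] = { false };
	int cursor = -1, count = 0;

	assert(steps >= 0);

	for (int i = 0; i < steps; i++) {
		cursor = toy_next(cursor);

		/* Round trip done: wrap around on the next step. */
		if (cursor < 0)
			continue;

		if (!toy_online[cursor]) {
			/* Pre-patch policy: go back to the tree root. */
			if (restart_on_offline)
				cursor = -1;
			/* Either way, pick again on the next step. */
			continue;
		}

		if (!reached[cursor]) {
			reached[cursor] = true;
			count++;
		}
	}
	return count;
}
```

In this toy, toy_groups_reached(12, true) never gets past the offline
group and reaches one online group, while toy_groups_reached(12, false)
reaches all three.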

Note that zswap_shrink_lock only serializes this memcg selection loop with
memcg offlining that happens after it - there's no guarantee what the
behavior is for memcg offlining before it (well, other than the one
reference that we manage to acquire thanks to mem_cgroup_iter(...), so
that memcg has not been freed, but I'm not sure what we can guarantee
regarding its place in the memcg hierarchy tree?).

Johannes, do you have any idea what we can expect here? Let me also cc
Shakeel.

>  		}
> +		/*
> +		 * We got an extra memcg reference before unlocking.
> +		 * The cleaner cannot free it using zswap_next_shrink.
> +		 *
> +		 * Our memcg can be offlined after we get online memcg here.
> +		 * In this case, the cleaner is waiting the lock just behind us.
> +		 */
>  		spin_unlock(&zswap_shrink_lock);
>
>  		ret = shrink_memcg(memcg);
> @@ -1457,6 +1500,12 @@ static void shrink_worker(struct work_struct *w)
>  resched:
>  		cond_resched();
>  	} while (zswap_total_pages() > thr);
> +
> +	/*
> +	 * We can still hold the original memcg reference.
> +	 * The reference is stored in zswap_next_shrink, and then reused
> +	 * by the next shrink_worker().
> +	 */
>  }
>
>  /*********************************
> --
> 2.43.0
>